TL;DR

Training Process Management Engineer (Backend): Develop and optimize distributed operating system software that orchestrates and supervises large-scale machine learning training workloads across thousands of machines. With an accent on performance, correctness, scalability, and reliability. Focus on designing and debugging high-performance asynchronous systems and managing complex distributed system challenges at frontier AI scale.

Location: London, UK with hybrid work model (3 days in office per week) and relocation assistance available

Company

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity by pushing the boundaries of AI capabilities and safely deploying AI products.

What you will do

  • Work across Python and Rust stacks to build and maintain software for orchestration and monitoring of ML workloads on supercomputers
  • Profile and optimize software stack for computation orchestration at frontier scale
  • Improve reliability, observability, and fault tolerance of long-running jobs
  • Debug complex distributed system issues across large clusters
  • Adapt to evolving ML system requirements to support researchers

Requirements

  • Location: Must be based in or willing to relocate to London, UK
  • Experience developing distributed systems and strong software engineering skills
  • Proficiency in Rust and Python or another systems programming language (e.g., C++)
  • Solid Linux knowledge with systems-level debugging, performance analysis, and memory profiling
  • Experience with asynchronous and concurrent systems development
  • Strong focus on performance, correctness, and reliability

Culture & Benefits

  • Hybrid work model with 3 days in office per week
  • Relocation assistance for new employees
  • Equal opportunity employer with commitment to diversity and inclusion
  • Supportive environment valuing engineering ownership and agency