TL;DR

Staff ML Performance Engineer (AI): Optimizing large-scale ML jobs so that models can scale, with an emphasis on identifying and addressing performance bottlenecks in training and inference workloads. Focus on designing and implementing efficiency improvements, observability tools, and benchmarking for large-scale GPU compute clusters.

Location: Onsite in Sunnyvale, California, USA

Company

Wayve is a product company founded in 2017, developing advanced Embodied AI technology and foundation models for automated driving systems.

What you will do

  • Profile ML workloads to identify bottlenecks, using tools such as NVIDIA Nsight Systems.
  • Design and implement efficiency improvements, such as parallelism, model compilation, and mixed precision, to maximize MFU (Model FLOPs Utilization) and throughput.
  • Design and implement observability tools (e.g., tracking MFU, throughput, and latency) to identify bottlenecks and drive performance improvements.
  • Design and implement benchmarking tools to track efficiency gains or regressions.
  • Collaborate closely with Research teams to integrate training efficiency improvements and create a culture of performance optimization.

Requirements

  • 10+ years of industry experience driving performance engineering across ML systems, GPU compute infrastructure, or distributed platforms.
  • Experience optimizing large-scale jobs on GPU compute clusters.
  • Experience working in platform teams and with research teams.
  • Experience in writing, reporting, and tracking performance benchmarks in an open and accessible way.
  • Ability to write high-quality, well-structured, and tested Python code.
  • BS or MS in Machine Learning, Computer Science, Engineering, or a related technical discipline, or equivalent experience.

Nice to have

  • Experience working with concurrent, parallel, and distributed computing.
  • Experience using NVIDIA Nsight Systems or other system profilers.
  • Experience implementing GPU kernels (CUDA, Triton, etc.).
  • Knowledge of computing fundamentals related to code speed, security, and reliability.

Culture & Benefits

  • Committed to creating a diverse, fair, and respectful culture that is inclusive of everyone.
  • Value diversity, embrace new perspectives, and foster an inclusive work environment.
  • Opportunity to make significant contributions to the future of automated driving systems.
  • Fast-paced environment that embraces uncertainty and complex challenges.
  • Focus on continuous learning and evolution in the pursuit of excellence.