TL;DR

Member of Technical Staff, Software Co-Design AI HPC Systems (AI, HPC): Architect and co-design next-generation AI systems at datacenter scale, with an emphasis on optimizing end-to-end performance, efficiency, and reliability across hardware and software. The role focuses on translating insights from real-world AI workloads into concrete improvements, guiding hardware roadmaps, and driving architectural decisions for large-scale AI platforms.

Location: Hybrid in the United Kingdom. Office attendance is required at least four days a week if you live within 25 miles of a designated Microsoft office.

Company

Microsoft AI's Superintelligence Team is a startup-like team within Microsoft AI, focused on pushing the boundaries of AI towards Humanist Superintelligence by creating controllable, safety-aligned, and human-values-anchored ultra-capable systems.

What you will do

  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory, storage, runtimes, and distributed frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior and provide early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators.
  • Influence and guide AI hardware design at system and silicon levels, including microarchitecture, interconnect topology, and memory hierarchy.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor and performance engineering.

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field, or equivalent practical experience.
  • 10+ years of experience working across systems software, hardware architecture, or AI infrastructure, with demonstrated impact at scale.
  • Strong background in one or more of the following: AI accelerator/GPU architectures, distributed AI training/inference, high-performance computing (HPC), ML systems, runtimes/compilers, performance modeling, or hardware–software co-design for AI workloads.
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
  • Must be able to work in a hybrid format from a designated Microsoft office in the United Kingdom.

Nice to have

  • Experience designing or operating large-scale AI clusters for training or inference.
  • Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications.
  • Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand).
  • Background in performance modeling and capacity planning for future hardware generations.
  • Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews.
  • Publications, patents, or open-source contributions in systems, architecture, or ML systems.

Culture & Benefits

  • Mission to empower every person and organization to achieve more, built on a growth mindset, innovation, and collaboration.
  • Culture of inclusion, respect, integrity, and accountability.
  • Team operates with end-to-end ownership, deep technical rigor, and a strong bias toward real-world impact.
  • Active contribution to the broader research and engineering community through publishing, prototyping, and open-sourcing impactful technologies.
  • Opportunity to partner with product teams to reach billions of users and create immense positive impact.