TL;DR
Member of Technical Staff, Software Co-Design AI HPC Systems (AI, HPC): Architecting and co-designing next-generation AI systems at datacenter scale, with an emphasis on optimizing end-to-end performance, efficiency, and reliability across hardware and software. Focus on translating insights from real-world AI workloads into concrete improvements, guiding hardware roadmaps, and driving architectural decisions for large-scale AI platforms.
Location: Hybrid in the United Kingdom. Office attendance is required at least four days a week if you live within 25 miles of a designated Microsoft office.
Company
Microsoft AI's Superintelligence Team is a startup-like team within Microsoft AI, focused on pushing the boundaries of AI towards Humanist Superintelligence by creating controllable, safety-aligned, and human-values-anchored ultra-capable systems.
What you will do
- Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory, storage, runtimes, and distributed frameworks.
- Drive architectural decisions by analyzing real workloads, identifying bottlenecks, and translating findings into actionable system and hardware requirements.
- Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, and cost efficiency of large-scale AI systems.
- Develop and evaluate what-if performance models to project system behavior and provide early guidance to hardware and platform roadmaps.
- Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators.
- Influence and guide AI hardware design at system and silicon levels, including microarchitecture, interconnect topology, and memory hierarchy.
- Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas across infrastructure, hardware, and product teams.
- Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor and performance engineering.
Requirements
- Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field, or equivalent practical experience.
- 10+ years of experience working across systems software, hardware architecture, or AI infrastructure, with demonstrated impact at scale.
- Strong background in one or more of the following: AI accelerator/GPU architectures, distributed AI training/inference, high-performance computing (HPC), ML systems, runtimes/compilers, performance modeling, or hardware–software co-design for AI workloads.
- Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
- Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
- Must be able to work in a hybrid format from a designated Microsoft office in the United Kingdom.
Nice to have
- Experience designing or operating large-scale AI clusters for training or inference.
- Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications.
- Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand).
- Background in performance modeling and capacity planning for future hardware generations.
- Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews.
- Publications, patents, or open-source contributions in systems, architecture, or ML systems.
Culture & Benefits
- Mission to empower every person and organization to achieve more, grounded in a growth mindset, innovation, and collaboration.
- Culture of inclusion, respect, integrity, and accountability.
- Team operates with end-to-end ownership, deep technical rigor, and a strong bias toward real-world impact.
- Active contribution to the broader research and engineering community through publishing, prototyping, and open-sourcing impactful technologies.
- Opportunity to partner with product teams to reach billions of users and create immense positive impact.
