TL;DR
Director of Engineering, Cluster Networking (AI): Lead the architecture, design, and engineering delivery of global cluster networking infrastructure for a GPU cloud platform, with an emphasis on high-performance networking fabrics and operational excellence. The role focuses on defining technical strategy, scaling networking environments for AI workloads, and building a world-class engineering organization.
Location: Remote (Global)
Salary: $150,000 – $300,000 USD
Company
Nscale is a GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers.
What you will do
- Define and evolve the multi-year technical roadmap for cluster networking, aligning with AI platform requirements and growth.
- Lead end-to-end engineering delivery of cluster networking solutions from design to production deployment and optimization.
- Own the operational performance, availability, and reliability of cluster networking infrastructure globally.
- Collaborate closely with Compute, Platform, SRE, Data Centre Operations, and Procurement teams to keep execution aligned.
- Build, mentor, and scale a high-performing cluster networking engineering team.
- Drive ongoing improvements in network efficiency, performance, and cost optimization.
Requirements
- 12+ years of experience in networking or infrastructure engineering, with at least 5 years in a senior technical leadership role (Head of Engineering, Director, or equivalent).
- Deep hands-on experience designing and operating large-scale data centre or HPC networking environments.
- Proven expertise in high-speed Ethernet and/or InfiniBand fabrics supporting GPU or AI workloads.
- Strong background in data centre networking, routing protocols, congestion management, and high-availability design.
- Experience leading globally distributed engineering teams in high-growth or hyperscale environments.
- Ability to design scalable, resilient, high-performance cluster networking systems with an automation-first mindset.
Nice to have
- Experience designing networking for large-scale GPU clusters or AI training environments.
- Familiarity with HPC networking topologies (Fat Tree, Rail, Dragonfly).
- Experience with SONiC, Cumulus, or other open networking platforms.
- Knowledge of optics, transceivers, and high-speed interconnect standards (400G/800G).
- Background in SRE, distributed systems, or large-scale cloud infrastructure.
Culture & Benefits
- Collaborative, supportive, and innovative environment where your contributions have real impact.
- Highly competitive package (base + equity) with reviews every 12 months.
- Join a fast-growing tech startup pushing boundaries in AI.
- Dynamic progression plan tailored to your ambitions with full support.
- Human-first flexibility: we trust Nscalers to deliver, with the autonomy to shape their own day.
- Remote-first team with seamless virtual collaboration and no geographical barriers.
