TL;DR
Principal Network Engineer (AI Infrastructure): Own the reliability, scalability, and long-term evolution of InfiniBand and RDMA-based network fabrics for a high-performance GPU cloud, with an emphasis on AI interconnect networks. Focus on designing large-scale fabric architectures, resolving complex incidents, and driving cross-team operational improvements.
Location: Remote-first; geography is no barrier to impact
Salary: $150,000 - $250,000 USD
Company
GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI startups and enterprises.
What you will do
- Own technical direction and operational strategy for AI interconnect networks
- Design, review, and evolve large-scale InfiniBand and RoCE fabric architectures
- Act as senior escalation point for complex network incidents and systemic fixes
- Drive cross-team initiatives to improve fabric reliability, performance, and maturity
- Define standards for hardware, congestion control, routing, firmware, and change safety
- Partner with SRE, Compute Platform, and Network Architecture teams on system design
- Mentor engineers and drive improvements in uptime, latency, and efficiency
Requirements
- 10+ years in network engineering with focus on HPC, AI, or hyperscale data center networking
- Expert operational and architectural experience with InfiniBand and/or large-scale RoCE fabrics
- Deep understanding of RDMA internals, congestion management, and fabric failure modes
- Strong expertise in modern data center routing and control planes (BGP, OSPF, ECMP)
- Proven ability to debug cross-layer issues across hardware, firmware, kernel, and applications
- Demonstrated leadership in complex technical initiatives without direct authority
- Systems-level mindset balancing performance, reliability, scalability, and cost
Nice to have
- Extensive experience with NVIDIA/Mellanox networking in production AI or HPC environments
- Deep familiarity with distributed training frameworks and GPU communication patterns
- Experience designing network observability for high-cardinality environments
- Prior experience influencing platform or infrastructure strategy at scale
Culture & Benefits
- Collaborative, supportive, innovative environment with real impact
- Competitive package (base + equity) with annual reviews
- Dynamic progression plan with autonomy and support
- Human-first flexibility on a remote-first team with seamless virtual collaboration
- Competitive benefits including medical, dental, vision, flexible PTO, parental leave, retirement plan
