TL;DR

Principal Network Engineer (AI Infrastructure): Own the reliability, scalability, and long-term evolution of InfiniBand and RDMA-based network fabrics for a high-performance GPU cloud, with an emphasis on AI interconnect networks. The role focuses on designing large-scale fabric architectures, resolving complex incidents, and driving cross-team operational improvements.

Location: Remote-first; geography is no barrier to impact

Salary: $150,000 - $250,000 USD

Company

GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI startups and enterprises.

What you will do

  • Own technical direction and operational strategy for AI interconnect networks
  • Design, review, and evolve large-scale InfiniBand and RoCE fabric architectures
  • Act as senior escalation point for complex network incidents and systemic fixes
  • Drive cross-team initiatives to improve fabric reliability, performance, and maturity
  • Define standards for hardware, congestion control, routing, firmware, and change safety
  • Partner with SRE, Compute Platform, and Network Architecture teams on system design
  • Mentor engineers and drive improvements in uptime, latency, and efficiency

Requirements

  • 10+ years in network engineering with focus on HPC, AI, or hyperscale data center networking
  • Expert operational and architectural experience with InfiniBand and/or large-scale RoCE fabrics
  • Deep understanding of RDMA internals, congestion management, and fabric failure modes
  • Strong expertise in modern data center routing and control planes (BGP, OSPF, ECMP)
  • Proven ability to debug cross-layer issues across hardware, firmware, kernel, and applications
  • Demonstrated leadership in complex technical initiatives without direct authority
  • Systems-level mindset balancing performance, reliability, scalability, and cost

Nice to have

  • Extensive experience with NVIDIA/Mellanox in production AI or HPC environments
  • Deep familiarity with distributed training frameworks and GPU communication patterns
  • Experience designing network observability for high-cardinality environments
  • Prior experience influencing platform or infrastructure strategy at scale

Culture & Benefits

  • Collaborative, supportive, innovative environment with real impact
  • Competitive package (base + equity) with annual reviews
  • Dynamic progression plan with autonomy and support
  • Human-first flexibility on a remote-first team with seamless virtual collaboration
  • Competitive benefits, including medical, dental, vision, flexible PTO, parental leave, and a retirement plan