TL;DR

Principal Network Engineer (AI Infrastructure): Own the reliability, scalability, and long-term evolution of InfiniBand and RDMA-based network fabrics for a high-performance GPU cloud, with an emphasis on AI interconnect networks. The role focuses on designing large-scale fabric architectures, resolving complex incidents, and driving cross-team operational improvements.

Location: Remote-first; geography is no barrier to impact

Salary: $150,000 - $250,000 USD

Company

GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI startups and enterprises.

What you will do

Own technical direction and operational strategy for AI interconnect networks
Design, review, and evolve large-scale InfiniBand and RoCE fabric architectures
Act as senior escalation point for complex network incidents and systemic fixes
Drive cross-team initiatives to improve fabric reliability, performance, and maturity
Define standards for hardware, congestion control, routing, firmware, and change safety
Partner with SRE, Compute Platform, and Network Architecture teams on system design
Mentor engineers and drive improvements in uptime, latency, and efficiency

Requirements

10+ years in network engineering with focus on HPC, AI, or hyperscale data center networking
Expert operational and architectural experience with InfiniBand and/or large-scale RoCE fabrics
Deep understanding of RDMA internals, congestion management, and fabric failure modes
Strong expertise in modern data center routing and control planes (BGP, OSPF, ECMP)
Proven ability to debug cross-layer issues across hardware, firmware, kernel, and applications
Demonstrated leadership in complex technical initiatives without direct authority
Systems-level mindset balancing performance, reliability, scalability, and cost

Nice to have

Extensive experience with NVIDIA/Mellanox in production AI or HPC environments
Deep familiarity with distributed training frameworks and GPU communication patterns
Experience designing network observability for high-cardinality environments
Prior experience influencing platform or infrastructure strategy at scale

Culture & Benefits

Collaborative, supportive, innovative environment with real impact
Competitive package (base + equity) with annual reviews
Dynamic progression plan with autonomy and support
Human-first flexibility: a remote-first team with seamless virtual collaboration
Competitive benefits including medical, dental, vision, flexible PTO, parental leave, retirement plan

Principal Network Engineer - AI Infrastructure Operations