TL;DR
Principal Deployment Engineer (AI): Architect and lead the bringup of large-scale GPU clusters, defining how we deploy, validate, and scale AI superclusters across sites, with an emphasis on rack design, fabric architecture, cluster validation frameworks, and production-readiness standards.
Location: Onsite in Seattle, US
Company
We are building AI infrastructure for frontier-scale workloads.
What you will do
- Define technical standards for node, rack, and full-cluster bringup.
- Lead large-scale GPU cluster deployments.
- Architect high-performance network fabrics optimized for AI workloads.
- Establish cluster-level acceptance criteria and validation frameworks.
- Design repeatable deployment models for multi-site expansion.
- Serve as the escalation point for complex bringup and performance issues.
Requirements
- 10+ years of experience in large-scale infrastructure or HPC environments.
- Proven experience bringing up large GPU clusters (hundreds+ GPUs).
- Deep expertise in high-speed networking (InfiniBand, RoCE, Ethernet fabrics).
- Strong understanding of server architecture (PCIe, NUMA, memory hierarchy).
- Experience debugging performance issues across compute and network layers.
- Strong automation and systems-level thinking.
Nice to have
- Experience scaling AI training clusters for frontier models.
- Experience with liquid cooling or ultra-high-density deployments.
- Knowledge of distributed storage systems (Lustre, Ceph, NVMe-oF).
- Experience defining infrastructure standards in a fast-growing organization.
Culture & Benefits
- Move fast and operate with ownership.
- We expect technical leaders to define standards, not just follow them.
