TL;DR

Senior Deployment Engineer (AI): Lead the hands-on bringup of high-performance GPU clusters in data center environments, with an emphasis on hardware integration, high-speed fabric tuning, and performance validation. The role focuses on executing end-to-end node and rack deployments, troubleshooting complex distributed hardware issues, and building repeatable, scalable infrastructure processes.

Location: Must be based in the United States (Onsite travel required)

Company

A startup building next-generation AI infrastructure and scalable GPU clusters for frontier AI workloads.

What you will do

Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.
Validate BIOS, BMC, firmware configurations, and overall GPU cluster health.
Configure and validate high-speed network fabrics including InfiniBand and RoCE.
Perform cluster-wide burn-in, stress testing, and performance validation using NCCL and RDMA.
Develop automation playbooks to transform ad-hoc deployments into repeatable, scalable systems.
Collaborate with networking and hardware vendors to troubleshoot and resolve deployment issues.

Requirements

Must have 5–8+ years of experience in infrastructure engineering or data center operations.
Hands-on experience deploying GPU servers such as HGX or DGX platforms.
Proficiency with high-speed networking fabrics including InfiniBand, RoCE, and Ethernet.
Strong Linux systems knowledge and troubleshooting skills for distributed performance issues.
Must be comfortable working onsite in data center environments.
Must be authorized to work in the United States.

Nice to have

Experience in AI/ML infrastructure or HPC environments.
Familiarity with CUDA, NCCL, and RDMA protocols.
Automation proficiency using Python, Ansible, Terraform, or Bash.
Experience managing high-density power and cooling in data center environments.

Culture & Benefits

Opportunity to work on foundational AI infrastructure at a fast-growing startup.
High-impact role with significant ownership over infrastructure build-out.
Focus on urgency, bias toward action, and engineering excellence.
Direct collaboration with infrastructure and hardware teams.

Senior Deployment Engineer – GPU Infrastructure Bringup
