TL;DR

Senior Deployment Engineer (AI): Lead the hands-on bringup of high-performance GPU clusters in data center environments, with an emphasis on hardware integration, high-speed fabric tuning, and performance validation. The role focuses on executing end-to-end node and rack deployments, troubleshooting complex distributed hardware issues, and building repeatable, scalable infrastructure processes.

Location: Must be based in the United States (onsite travel required)

Company

A startup building next-generation AI infrastructure and scalable GPU clusters for frontier AI workloads.

What you will do

  • Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.
  • Validate BIOS, BMC, firmware configurations, and overall GPU cluster health.
  • Configure and validate high-speed network fabrics including InfiniBand and RoCE.
  • Perform cluster-wide burn-in, stress testing, and performance validation using NCCL and RDMA.
  • Develop automation playbooks to transform ad-hoc deployments into repeatable, scalable systems.
  • Collaborate with networking and hardware vendors to troubleshoot and resolve deployment issues.

Requirements

  • Must have 5–8+ years of experience in infrastructure engineering or data center operations.
  • Hands-on experience deploying GPU servers such as HGX or DGX platforms.
  • Proficiency with high-speed networking fabrics including InfiniBand, RoCE, and Ethernet.
  • Strong Linux systems knowledge and troubleshooting skills for distributed performance issues.
  • Must be comfortable working onsite in data center environments.
  • Must be authorized to work in the United States.

Nice to have

  • Experience in AI/ML infrastructure or HPC environments.
  • Familiarity with CUDA, NCCL, and RDMA.
  • Automation proficiency using Python, Ansible, Terraform, or Bash.
  • Experience managing power and cooling in high-density data center environments.

Culture & Benefits

  • Opportunity to work on foundational AI infrastructure at a fast-growing startup.
  • High-impact role with significant ownership over infrastructure build-out.
  • Focus on urgency, bias toward action, and engineering excellence.
  • Direct collaboration with infrastructure and hardware teams.