TL;DR

Senior Systems Engineer (AI): Lead the hands-on bringup and deployment of GPU clusters for large-scale AI training, with an emphasis on rack integration, network fabric validation, and performance tuning. Focus on building repeatable deployment systems and optimizing GPU infrastructure for production readiness.

Location: Must be based in Seattle, US

Company

Nscale is a startup building next-generation AI infrastructure, delivering highly performant and scalable GPU clusters purpose-built for large-scale AI training and inference.

What you will do

  • Execute end-to-end bringup of GPU nodes and racks, from physical installation to production readiness.
  • Configure and validate high-speed network fabrics including InfiniBand and RoCE.
  • Perform GPU-to-GPU and node-to-node performance validation using NCCL and RDMA.
  • Troubleshoot hardware, firmware, and fabric-level issues to ensure stability.
  • Contribute to automation efforts for provisioning and cluster validation processes.
  • Collaborate with networking, systems software, and data center teams to support rapid scaling.

Requirements

  • 5–8+ years of experience in infrastructure engineering, hardware deployment, or data center operations.
  • Hands-on experience deploying GPU servers such as HGX or DGX platforms.
  • Proficiency with high-speed networking including InfiniBand, RoCE, and Ethernet fabrics.
  • Strong Linux systems knowledge and experience troubleshooting distributed systems.
  • Must be comfortable working onsite in data center environments as needed.

Nice to have

  • Experience in AI/ML infrastructure or HPC environments.
  • Familiarity with CUDA and performance tuning tools.
  • Proficiency in automation with Python, Ansible, Terraform, or Bash.
  • Experience working in high-density power and cooling environments.

Culture & Benefits

  • Fast-paced startup environment with an emphasis on ownership and bias for action.
  • Opportunity to build foundational infrastructure for frontier AI workloads.
  • Direct impact on scaling AI capabilities through hands-on technical contribution.