TL;DR

Senior Systems Engineer (AI): Lead the hands-on bringup and deployment of GPU clusters for large-scale AI training, with an emphasis on rack integration, network fabric validation, and performance tuning. Focus on building repeatable deployment systems and optimizing GPU infrastructure for production readiness.

Location: Must be based in Seattle, US

Company

Nscale is a startup building next-generation AI infrastructure, delivering highly performant and scalable GPU clusters purpose-built for large-scale AI training and inference.

What you will do

Execute end-to-end bringup of GPU nodes and racks, from physical installation to production readiness.
Configure and validate high-speed network fabrics including InfiniBand and RoCE.
Perform GPU-to-GPU and node-to-node performance validation using NCCL and RDMA.
Troubleshoot hardware, firmware, and fabric-level issues to ensure stability.
Contribute to automation efforts for provisioning and cluster validation processes.
Collaborate with networking, systems software, and data center teams to support rapid scaling.

Requirements

5–8+ years of experience in infrastructure engineering, hardware deployment, or data center operations.
Hands-on experience deploying GPU servers such as HGX or DGX platforms.
Proficiency with high-speed networking including InfiniBand, RoCE, and Ethernet fabrics.
Strong Linux systems knowledge and experience troubleshooting distributed systems.
Must be comfortable working onsite in data center environments as needed.

Nice to have

Experience in AI/ML infrastructure or HPC environments.
Familiarity with CUDA and performance tuning tools.
Automation proficiency using Python, Ansible, Terraform, or Bash.
Experience working in high-density power and cooling environments.

Culture & Benefits

Fast-paced startup environment with an emphasis on ownership and a bias for action.
Opportunity to build foundational infrastructure for frontier AI workloads.
Direct impact on scaling AI capabilities through hands-on technical contribution.

Senior Systems Engineer – GPU Infrastructure

Job description