TL;DR

Network & Server Deployment Engineer (AI): Design, deploy, and operate large-scale HPC and GPU clusters, with an emphasis on physical infrastructure, network topology, and performance tuning. The role focuses on automating provisioning, orchestrating complex hardware/software stacks, and ensuring that the compute, storage, and interconnect layers integrate cleanly.

Location: Singapore

Company

Nscale is a vertically integrated cloud provider building sustainable AI infrastructure, including data centers, software, and applications.

What you will do

  • Design, deploy, and operate high-performance computing clusters and GPU compute environments.
  • Develop hardware architectures, including BOMs, rack layouts, and reference designs.
  • Configure and manage HPC scheduling systems such as Slurm.
  • Optimize network topologies using InfiniBand and high-speed Ethernet technologies.
  • Automate provisioning, configuration, and orchestration of hardware and software stacks.
  • Troubleshoot and tune performance across compute, storage, and network layers.

Requirements

  • Experience in designing and operating large-scale HPC or compute clusters.
  • Proficiency with HPC workload management systems such as Slurm, PBS, or LSF.
  • Deep understanding of InfiniBand networking, including RDMA, QoS, and subnet management.
  • Knowledge of high-speed Ethernet protocols and physical layer design.
  • Strong automation scripting skills in Python and Bash.
  • Ability to create detailed hardware documentation and reference architectures.

Culture & Benefits

  • Focus on relentless innovation and creative problem solving.
  • Strong culture of ownership, accountability, and excellence.
  • Transparent and open communication environment.
  • Commitment to sustainability in technology operations.
  • Fast, efficient, and respectful team collaboration.
  • Equal opportunity employer supporting diversity and inclusion.