TL;DR
Network & Server Deployment Engineer (AI): Design, deploy, and operate large-scale HPC and GPU clusters, with an emphasis on physical infrastructure, network topology, and performance tuning. The role focuses on automating provisioning, orchestrating complex hardware/software stacks, and ensuring seamless integration of the compute, storage, and interconnect layers.
Location: Singapore
Company
Nscale is a vertically integrated cloud provider building sustainable AI infrastructure, including data centers, software, and applications.
What you will do
- Design, deploy, and operate high-performance computing clusters and GPU compute environments.
- Develop hardware architectures, including BOMs, rack layouts, and reference designs.
- Configure and manage HPC scheduling systems such as Slurm.
- Optimize network topologies using InfiniBand and high-speed Ethernet technologies.
- Automate provisioning, configuration, and orchestration of hardware and software stacks.
- Troubleshoot and tune performance across compute, storage, and network layers.
Requirements
- Experience in designing and operating large-scale HPC or compute clusters.
- Proficiency with HPC workload management systems such as Slurm, PBS, or LSF.
- Deep understanding of InfiniBand networking, including RDMA, QoS, and subnet management.
- Knowledge of high-speed Ethernet protocols and physical-layer design.
- Strong automation scripting skills in Python and Bash.
- Ability to create detailed hardware documentation and reference architectures.
Culture & Benefits
- Focus on relentless innovation and creative problem solving.
- Strong culture of ownership, accountability, and excellence.
- Transparent and open communication environment.
- Commitment to sustainability in technology operations.
- Fast-moving, efficient, and respectful team collaboration.
- Equal opportunity employer supporting diversity and inclusion.
