GPU and Data Center Systems Engineer
NAVIGO
Own the end-to-end design, installation, and optimization of GPU and server clusters powering containerized modular AI data centers. This includes hardware selection, rack-level integration, firmware and OS imaging, workload benchmarking, cooling and power performance tuning, and lifecycle maintenance. You'll help define the reference architecture for multi-megawatt GPU clusters, automate their provisioning and orchestration, and optimize for thermal efficiency, compute utilization, and fault tolerance in a hybrid on-location and remote-managed environment.
- Specify, integrate, and benchmark thousands of GPU nodes.
- Build imaging and configuration automation (PXE boot, Terraform, etc.).
- Optimize compute performance, power draw, and thermals.
- Manage firmware updates, BIOS tuning, and redundancy planning.
- Support rack-level maintenance and performance monitoring.
- Experience with GPU clusters, DGX systems, or HPC environments.
- Strong Linux systems experience (Ubuntu, RHEL, Rocky).
- Familiarity with Kubernetes, Slurm, or Run:AI orchestration.
- Experience managing >1 MW compute clusters preferred.
- GPU Cluster Management: 5-8 years
- Linux Systems Administration: 4-7 years
- Infrastructure Automation: 3-6 years
- Problem Solving: Good to Excellent
- Technical Documentation: Good to Excellent
- Kubernetes: 2-5 years
- Power/Thermal Optimization: 3-5 years
- Cross-functional Collaboration: Good to Excellent