GPU and Data Center Systems Engineer
NAVIGO
Own the end-to-end design, installation, and optimization of GPU and server clusters powering containerized modular AI data centers. This includes hardware selection, rack-level integration, firmware and OS imaging, workload benchmarking, cooling and power performance tuning, and lifecycle maintenance. You'll help define the reference architecture for multi-megawatt GPU clusters, automate their provisioning and orchestration, and optimize for thermal efficiency, compute utilization, and fault tolerance in a hybrid on-location and remote-managed environment.
- Specify, integrate, and benchmark thousands of GPU nodes.
- Build imaging and configuration automation (PXE boot, Terraform, etc.).
- Optimize compute performance, power draw, and thermals.
- Manage firmware updates, BIOS tuning, and redundancy planning.
- Support rack-level maintenance and performance monitoring.
- Experience with GPU clusters, DGX systems, or HPC environments.
- Strong Linux systems experience (Ubuntu, RHEL, Rocky).
- Familiarity with Kubernetes, Slurm, or Run:AI orchestration.
- Experience managing >1 MW compute clusters preferred.
- GPU Cluster Management: 5-8 years
- Linux Systems Administration: 4-7 years
- Infrastructure Automation: 3-6 years
- Problem Solving: Good to Excellent
- Technical Documentation: Good to Excellent
- Kubernetes: 2-5 years
- Power/Thermal Optimization: 3-5 years
- Cross-functional Collaboration: Good to Excellent