Senior Site Reliability Engineer - AI Research Clusters

NVIDIA Corporation

2/28/2025

US, CA, Santa Clara

Full-time

Salary: $184,000 - $425,500 per year

Job Description

NVIDIA is seeking a Senior Site Reliability Engineer to design and implement GPU compute clusters and support AI research across the organization.

Requirements

Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
6+ years of experience designing and operating large-scale compute infrastructure
Proven experience in site reliability engineering for high-performance computing environments
Deep understanding of GPU computing and AI infrastructure
Experience with advanced AI/HPC job schedulers and cluster configuration management tools
Solid experience with GPU clusters and container technologies
Experience programming in Python and scripting in Bash

Responsibilities

Design and implement state-of-the-art GPU compute clusters
Optimize cluster operations for reliability, efficiency, and performance
Drive foundational improvements and automation to enhance researcher productivity
Troubleshoot, diagnose, and root-cause system failures
Scale systems sustainably through automation
Practice sustainable incident response and blameless postmortems
Manage upgrades and automated rollbacks across all clusters