Senior Site Reliability Engineer - AI Research Clusters

NVIDIA Corporation

2/28/2025

US, CA, Santa Clara

Full-time

Salary: $184,000 - $425,500 per year


Job Description

NVIDIA is seeking a Senior Site Reliability Engineer to design and implement GPU compute clusters and support AI research across the organization.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • 6+ years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with advanced AI/HPC job schedulers and cluster configuration management tools
  • Solid experience with GPU clusters and container technologies
  • Experience programming in Python and scripting in Bash

Responsibilities

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot and diagnose system failures and identify their root causes
  • Scale systems sustainably through automation
  • Practice sustainable incident response and blameless postmortems
  • Manage upgrades and automated rollbacks across all clusters

Benefits

  • Multiple relocation packages
  • Two weeklong shutdowns (mid-summer and year-end) in the US (in addition to PTO)
  • 8-week parental leave
  • 9 Employee Resource Groups
  • Annual bonus offering
  • Flexible work arrangements
  • Up to 6% 401(k) matching
