Senior Site Reliability Engineer - Datacenter Automation

Overview

  • NVIDIA is hiring experienced SRE engineers to help scale up its AI Infrastructure.
  • We expect you to have significant experience with site reliability principles and techniques including reliability assessments, incident management processes, production system observability, monitoring and alerting, automated deployments and toil elimination.
  • We view SRE as a software engineering discipline and expect significant contributions to our codebase.
  • We welcome out-of-the-box thinkers who can bring new ideas and a strong bias for execution.
  • Expect to be constantly challenged, improving, and evolving for the better.
  • You will help advance NVIDIA's capacity to build and deploy leading infrastructure solutions for a broad range of AI-based applications.
  • If you're creative, passionate about SRE, and love having fun, please apply today!
  • For two decades, we have pioneered visual computing, the art and science of computer graphics.
  • With the invention of the GPU - the engine of modern visual computing - the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research.
  • Today, we stand at the beginning of the next era, the AI computing era, ignited by a new computing model, GPU deep learning.
  • What you will be doing:
  • You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.
  • This includes supporting the operation of custom software for GPU asset provisioning, configuration, and lifecycle management across many cloud providers.
  • Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets.
  • You will be harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.
  • Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
  • Evaluating system failures and improving services based on a well-defined incident management process.
  • What we need to see:
  • Direct experience in a DevOps/SRE role within a highly technical organization, with demonstrable impact from your work.
  • Highly motivated, with strong communication skills; you can work successfully with multi-functional teams, principals, and architects, and coordinate effectively across organizational boundaries and geographies.
  • 5+ years in a similar role, with experience on large-scale production systems.
  • Experience with the aforementioned DevOps/SRE principles, tools and techniques.
  • You possess a BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
  • Technical knowledge, including a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.
  • Ways to stand out from the crowd:
  • Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
  • Advanced hands-on experience with, and deep understanding of, cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
  • Proven operational excellence in maintaining reliable and performant AI infrastructure.

Sourced directly from NVIDIA’s career page

NVIDIA

India, Bengaluru

Job ID
/job/India-Bengaluru/Senior-Site-Reliability-Engineer----Datacenter-Automation_JR2011458
