Senior HPC Site Reliability Engineer

Overview

  • We are now looking for a Senior HPC Site Reliability Engineer to join our mission and continue improving our HPC infrastructure.
  • A meaningful part of NVIDIA’s strength is our unique and advanced development tools and environments that enable our incredible pace of innovation.
  • We are looking for architects to help us evolve the way our private compute cloud is architected and optimized.

What you will be doing:

  • Provide leadership in the design and implementation of our large-scale compute cloud that enables the world's top chip modelers, designers, and deep learning experts to invent groundbreaking technology.
  • Identify architectural changes or completely innovative approaches in our cloud architecture and design.
  • Help with strategic challenges we encounter, including effective resource utilization in a heterogeneous compute environment, evolving our private/public cloud strategy, capacity modeling, and planning for multi-year growth and scaling across our global computing environment!

What we need to see:

  • B.Sc. in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • 8+ years of experience designing and operating large-scale compute infrastructure.
  • Experience with job schedulers such as IBM/Platform LSF, SGE, SLURM, Marathon, Chronos.
  • Solid understanding of cluster configuration management tools – Ansible, Puppet, Chef, Salt.
  • Good experience providing compute services using a public cloud (AWS, Azure, Google Cloud).
  • Strong script-writing skills: Python, Bash, Perl.
  • Knowledge of and/or experience deploying PaaS microservices – Docker, Docker Swarm, Kubernetes.
  • Understanding of fast distributed and network-attached storage solutions and Linux file systems; ability to recommend and implement solutions to improve OS performance and reliability.

Ways to stand out from the crowd:

  • Linux certification from a well-known vendor - Red Hat, Oracle, etc.
  • Prior experience managing a large-scale Kubernetes deployment in production.
  • Strong skills in modern container networking and storage architecture.
  • Well-known Cloud Certification(s).

Sourced directly from NVIDIA’s career page

Your application goes straight to NVIDIA.

NVIDIA

2 Locations

Specialisation
Open roles at NVIDIA
2000 positions
Job ID
/job/Israel-Yokneam/Senior-HPC-Site-Reliability-Engineer_JR2014015
