Opens nvidia.wd5.myworkdayjobs.com in a new tab
Overview
- We are now looking for a Senior HPC Site Reliability Engineer to join our mission and continue improving our HPC infrastructure.
- A meaningful part of NVIDIA’s strength is our unique and advanced development tools and environments that enable our incredible pace of innovation.
- We are looking for architects to help us evolve the way our private compute cloud is architected and optimized.
What you will be doing:
- Provide leadership in the design and implementation of our large-scale compute cloud that enables the world's top chip modelers, designers, and deep learning experts to invent groundbreaking technology.
- Identify architectural changes or completely innovative approaches in our cloud architecture and design.
- Help with strategic challenges we encounter, including: effective resource utilization in a heterogeneous compute environment, evolving our private/public cloud strategy, capacity modeling, and planning for multi-year growth and scaling across our global computing environment!
What we need to see:
- B.Sc. in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- 8+ years of experience designing and operating large-scale compute infrastructure.
- Experience with job schedulers such as IBM/Platform LSF, SGE, SLURM, Marathon, Chronos.
- Solid understanding of cluster configuration management tools – Ansible, Puppet, Chef, Salt.
- Good experience providing compute services using a public cloud (AWS, Azure, Google Cloud).
- Strong script-writing skills: Python, Bash, Perl.
- Knowledge of and/or experience deploying PaaS microservices – Docker, Docker Swarm, Kubernetes.
- Understanding of fast distributed and network-attached storage solutions and Linux file systems; ability to recommend and implement solutions to improve OS performance and reliability.
Ways to stand out from the crowd:
- Linux certification from a well-known vendor – Red Hat, Oracle, etc.
- Prior experience managing large-scale Kubernetes deployment in production.
- Strong skills in modern container networking and storage architecture.
- Well-known Cloud Certification(s).
Sourced directly from NVIDIA’s career page
Your application goes straight to NVIDIA.
Job ID
/job/Israel-Yokneam/Senior-HPC-Site-Reliability-Engineer_JR2014015