Opens nvidia.wd5.myworkdayjobs.com in a new tab
Overview
- We are now looking for an HPC Operations Engineer to join our mission and continue improving our HPC infrastructure.
- A meaningful part of NVIDIA’s strength is our unique and advanced development tools and environments that enable our incredible pace of innovation.
- We are looking for architects to help us evolve the way our private compute cloud is architected and optimized.
What you’ll be doing
- Troubleshoot incoming support requests in a large-scale HPC environment.
- Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and streamline day-to-day operations through automation.
- Ensure compute servers are running the correct operating system and configuration.
- Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
- Collaborate with specialist teams to drive issues to closure.
- Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.
- Directly contribute to the overall quality and improve time to market for our next generation chips.
What we need to see
- BS in Computer Science or a similar degree, or equivalent experience.
- 2+ years of experience.
- Proficiency in administering CentOS/RHEL Linux distributions.
- Understanding of container technologies like Docker.
- Proficiency in Python and UNIX scripting languages such as bash.
- Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
- Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
- Solid understanding of cluster configuration management tools such as Ansible.
Ways to stand out from the crowd
- Understanding of key Linux technologies such as NFS, automounter, LDAP, DNS, and TCP/IP networking in Red Hat Linux distribution flavors.
- Familiarity with job scheduler administration (e.g. IBM Spectrum LSF or SLURM) and experience building/operating large-scale compute infrastructure.
- Knowledge of the FlexLM license management system.
- Proficiency in Perl for maintaining legacy automation scripts.
- Familiarity with High-Speed Networking (InfiniBand, RDMA, RoCE etc.) and fast, distributed storage systems (Lustre, GPFS, etc.).
Sourced directly from NVIDIA’s career page
Job ID
/job/Israel-Yokneam/HPC-Operations-Engineer_JR2014025