HPC Operations Engineer

Overview

  • We are now looking for an HPC Operations Engineer to join our mission and continue improving our HPC infrastructure.
  • A meaningful part of NVIDIA’s strength is our unique and advanced development tools and environments that enable our incredible pace of innovation.
  • We are looking for architects to help us evolve the way our private compute cloud is architected and optimized.

What you’ll be doing:

  • Troubleshoot incoming support requests in a large-scale HPC environment.
  • Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and streamline day-to-day operations through automation.
  • Ensure compute servers are running the correct operating system and configuration.
  • Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
  • Collaborate with specialist teams to drive issues to closure.
  • Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.
  • Directly contribute to the overall quality of our next-generation chips and improve their time to market.

What we need to see:

  • BS in Computer Science or a similar degree, or equivalent experience.
  • 2+ years of experience.
  • Proficient in administering CentOS/RHEL Linux distributions.
  • Understanding of container technologies like Docker.
  • Proficiency in Python and UNIX scripting languages such as bash.
  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
  • Solid understanding of cluster configuration management tools such as Ansible.

Ways to stand out from the crowd:

  • Understanding of key Linux technologies such as NFS, automounter, LDAP, DNS, and TCP/IP networking in Red Hat Linux distribution flavors.
  • Familiarity with job scheduler administration (e.g., IBM Spectrum LSF or SLURM) and experience building/operating large-scale compute infrastructure.
  • Knowledge of the FlexLM license management system.
  • Proficiency in Perl for maintaining legacy automation scripts.
  • Familiarity with high-speed networking (InfiniBand, RDMA, RoCE, etc.) and fast, distributed storage systems (Lustre, GPFS, etc.).

Sourced directly from NVIDIA’s career page

NVIDIA

2 Locations

Open roles at NVIDIA
2000 positions
Job ID
/job/Israel-Yokneam/HPC-Operations-Engineer_JR2014025
