Overview
- NVIDIA is the world leader in computer graphics, artificial intelligence, and accelerated computing.
- For over 25 years, we have been at the forefront of research and engineering around the greatest advances in technology.
- Our history of innovation drives us to solve the world's hardest problems.
- NVIDIA is looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join its NVIDIA Infrastructure Specialist Team.
- Academic and commercial groups around the world are using NVIDIA products to revolutionize deep learning and data analytics, and to power data centers.
- Join the team building many of the largest and fastest AI/HPC systems in the world!
- We are looking for someone who can work on a dynamic, customer-focused team that requires excellent interpersonal skills.
- This role interacts with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects.
- The scope of these efforts includes a combination of networking, system design, and automation, and being the face to the customer!
What you'll be doing:
- Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
- Manage Linux job/workload schedulers and orchestration tools.
- Develop and maintain continuous integration and delivery pipelines.
- Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
- Deploy monitoring solutions for the servers, network, and storage.
- Perform troubleshooting bottom-up from bare metal, operating system, software stack, and application level.
- Being a technical resource, develop, re-define, and document standard methodologies to share with internal teams.
- Support Research & Development activities and engage in POCs/POVs for future improvements.
What we need to see:
- BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
- At least 8 years of professional experience in networking fundamentals, TCP/IP stack, and data center architecture.
- Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
- Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments.
- Experience in managing and installing HPC clusters, including deployment, optimization, and troubleshooting.
- Experience with job scheduling workloads and orchestration technologies such as Slurm, Kubernetes, and Singularity.
- Excellent knowledge of Windows and Linux systems (Red Hat/CentOS and Ubuntu), including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS.
- Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS.
- Familiarity with newer and emerging storage technologies is a plus.
- Proficiency in Python programming and bash scripting.
- Knowledge of CI/CD pipelines for software deployment and automation.
- Comfortable with automation and configuration management tools, including Jenkins, Ansible, Puppet/Chef, etc.
- Ability to communicate technical concepts and collaborate effectively with Japanese-speaking customers.
Ways to stand out from the crowd:
- Knowledge of CPU and/or GPU architecture.
- Knowledge of Kubernetes and container-related microservice technologies.
- Experience with GPU-focused hardware/software (DGX, CUDA).
- Background with RDMA (InfiniBand or RoCE) fabrics.
- NVIDIA is widely considered to be one of the technology world’s most desirable employers.
- We have some of the most forward-thinking and hardworking individuals in the world working for us.
- If you're creative and autonomous, we want to hear from you.
Sourced directly from NVIDIA’s career page
Your application goes straight to NVIDIA.
Job ID
/job/Japan-Remote/Senior-Solutions-Architect--Cloud-Infrastructure-and-DevOps---NVIS_JR1997336