What You'll Do
- Build AI/HPC infrastructure for new and existing customers.
- Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.
- Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.
What We Need to See
- BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
- 5+ years of professional experience in networking fundamentals, in the Ethernet or InfiniBand world.
- Hands-on experience with network switch/router platforms such as Cumulus Linux, SONiC, IOS, Junos OS, and EOS.
- Solid working knowledge of Ethernet/InfiniBand/RDMA core principles.
- Proficiency in end-to-end IB/Eth cluster deployment, adapter configuration, and firmware maintenance, plus the ability to conduct professional performance benchmarking with mainstream RDMA testing tools.
- Ability to independently diagnose and troubleshoot typical IB/Eth network anomalies, including link flapping, connection failures, and bandwidth and latency jitter issues.
- Mastery of practical RDMA network optimization strategies such as QP tuning, MTU configuration, and congestion control optimization.
- Hands-on working experience in RDMA-accelerated business scenarios, including distributed storage and high-performance computing clusters.
- Extensive experience delivering automated network provisioning solutions using tools like Ansible, Salt, and Python.
- Ability to develop CI/CD pipelines for network operations.
- Strong written, verbal, and listening skills in English are essential.
Ways to Stand Out from the Crowd
- Familiarity with cloud networks (AWS, GCP, Azure) is a plus.
- Advanced Linux or Networking Certifications.
- Experience with high-performance computing architectures.
- Understanding of how job schedulers (Slurm, PBS) work.
- Knowledge of cluster management technologies (bonus credit for BCM, Base Command Manager).
- Experience with GPU (Graphics Processing Unit) focused hardware/software.
Sourced directly from NVIDIA’s career page
Job ID
/job/India-Pune/Senior-Solutions-Architect--Infiniband-and-Networking-Ethernet---NVIS_JR2019584