Manager, Software Engineering - NCCL

Overview

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication libraries like NCCL and NVSHMEM for Deep Learning and HPC. DL and HPC applications already have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (e.g., NVLink, PCIe) within a node and with high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales!

We are looking for a dynamic and technical leader for our China NCCL team. This is an outstanding opportunity to push the limits of the state of the art and deliver platforms the world has never seen before. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Lead, mentor, and grow our China engineering team.
  • Own the end-to-end execution spanning planning, prioritization, quality control and performance.
  • Interact with customers and researchers to understand their use cases and requirements.
  • Collaborate with engineering, program and product management, and partners to define the product roadmap.
  • Contribute to feature design and implementation.
  • Continuously review and identify improvement opportunities in established processes, infrastructure, and practices to ensure the teams are accomplishing work in the most efficient and transparent manner.

What we need to see:

  • 10+ years of overall experience in the software industry, including 4+ years of management experience.
  • Bachelor's, Master's, or Ph.D. in CS, CE, EE (or a related technical field), or equivalent experience.
  • Specialization in systems software, communication runtimes, or high performance networking.
  • Proven success in managing several complex initiatives or products through the full product life cycle.
  • Strong understanding of computer systems architecture, networking technologies (RDMA, RoCE, Ethernet, EFA, InfiniBand) and topologies, operating systems principles (aka systems software fundamentals), HW-SW interactions and performance analysis/optimizations.
  • Hands-on C/C++ programming and debugging skills in Linux.
  • Experience balancing multiple projects with competing priorities.
  • Flexibility to work and communicate effectively across different teams and timezones.

Ways to stand out from the crowd:

  • Active user or developer of NCCL!
  • Customer engagement experience in this space.
  • Experience with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, NIXL, OpenSHMEM, UCX, UCC).
  • Programming experience with CUDA, MPI, OpenMP, OpenACC, or pthreads.
  • Knowledge of HPC and ML/DL fundamentals.
  • Experience with Deep Learning frameworks such as PyTorch, TensorFlow, vLLM, SGLang, and TRT-LLM.

NVIDIA

China, Shanghai

Job ID
/job/China-Shanghai/Manager--Software-Engineering---NCCL_JR2016650
