Deep Learning Kernel Software Performance Architect

Overview

  • NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms.
  • We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.

What you'll be doing:

Performance analysis + debugging

  • Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.
  • Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.
  • Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.

Automation + regression infrastructure (Python-heavy)

  • Develop and maintain Python-based automation for performance testing and analysis, using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.
  • Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.
  • Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.

Cross-team collaboration and operating model

  • Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.
  • Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.
  • Contribute to clear ownership/triage/routing rules so regressions close quickly and consistently.
  • Follow general software engineering best practices, including support for regression testing and CI/CD flows.

What we need to see:

  • Master's or PhD degree, or equivalent experience, in Computer Science, Computer Engineering, Applied Math, or a related field.
  • Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging).
  • Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).
  • Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.
  • Comfortable working across teams and driving issues to decision/closure with clear communication.
  • Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design.
  • Experience with performance-oriented parallel programming, even if it’s not on GPUs (e.g., with OpenMP or pthreads).
  • Solid understanding of computer architecture and some experience with assembly programming.
  • Ability to identify bottlenecks, optimize resource utilization, and improve throughput.

Ways to stand out from the crowd:

  • Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts).
  • Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics.
  • GPU programming/perf experience (CUDA or equivalent parallel programming).
  • Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks).
  • Familiarity with simulators/analytical modeling or performance characterization methodology.

Sourced directly from NVIDIA’s career page

Your application goes straight to NVIDIA.

NVIDIA

2 Locations

Specialisation
Open roles at NVIDIA: 2000 positions
Job ID: /job/China-Shanghai/Senior-Performance-Software-Engineer--Deep-Learning-Libraries_JR2004267
