Senior Solutions Architect, Cloud Infrastructure and DevOps

Overview

  • NVIDIA is looking for a Senior Cloud Infrastructure and DevOps Solutions Architect to join its NVIDIA Infrastructure Specialist Team.
  • Academic and commercial organizations around the world are using NVIDIA products to redefine deep learning and data analytics, and to power next-generation data centers.
  • Join the team building and advising on many of the largest and fastest AI/HPC systems in the world! We are looking for someone who combines deep technical expertise with strong consulting and communication skills.
  • This role will engage directly with customers, partners, and multi-functional teams to assess, architect, and guide the implementation of large-scale infrastructure projects.
  • The scope spans system architecture, Kubernetes-based platforms, and automation—serving as both a trusted advisor and a hands-on technical leader.

What You’ll Be Doing

  • Advise on and help maintain large-scale computational and AI infrastructure, including monitoring, logging, and workload orchestration (Kubernetes and Linux job schedulers).
  • Provide consultative guidance and hands-on troubleshooting across the full stack, from bare metal and operating system through the software stack, container platform, networking, and storage.
  • Assess customer environments and recommend optimized, production-ready Kubernetes-based container platforms integrated with enterprise-grade networking and storage solutions.
  • Serve as a key technical resource: develop, refine, and document standard methodologies and operational guidelines to be shared with internal teams and customer partners.
  • Support Research & Development activities and engage in POCs/POVs to validate new features, architectures, and upgrade approaches.
  • Create and deliver high-quality documentation, including runbooks, onboarding materials, and best-practice guides for customers and internal teams.
  • Act as the technical leader for assigned customer accounts, providing strategic guidance on DevOps and platform architecture and influencing long-term infrastructure and operations decisions.

What We Need to See

  • Education & Experience: BS/MS/PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields (or equivalent experience), with 8+ years of professional experience in leading scalable cloud environments and automation engineering roles.
  • Cloud & HPC Expertise: Demonstrated understanding of networking fundamentals and data center architectures, and hands-on experience leading HPC/AI clusters, including deployment, optimization, and troubleshooting.
  • NVIDIA GPU Expertise: Proven hands-on experience deploying, configuring, and optimizing NVIDIA GPU-accelerated infrastructure, including driver management, CUDA toolkit integration, and GPU workload profiling.
  • Kubernetes & AI/ML Workloads: Extensive experience with Kubernetes for container orchestration, resource scheduling, scaling, and integration with GPU-accelerated and HPC environments.
  • Hardware & Software Knowledge: Strong familiarity with HPC and AI technologies (CPUs, GPUs, high-speed interconnects) and supporting software stacks.
  • Linux & Storage Systems: Deep knowledge of Linux (Red Hat, Ubuntu), OS-level security, and protocols. Experience with storage solutions such as Lustre, GPFS, ZFS, XFS, and emerging Kubernetes storage technologies.
  • Automation & Observability: Proficiency in Python and Bash scripting, configuration management, and Infrastructure-as-Code tools (e.g., Ansible, Terraform). Experience with observability stacks (Grafana, Loki, Prometheus) for monitoring, logging, and building fault-tolerant systems.
  • Solution Architecture & Customer Engagement: Strong background in designing scalable solutions and providing consultative support to customers, including leading architectural reviews and presenting to executive partners.

Ways to Stand Out from the Crowd

  • Knowledge of CI/CD pipelines for software deployment and automation.
  • Experience with the NVIDIA GPU Operator and Network Operator for automated lifecycle management of GPU and network resources in Kubernetes environments.
  • Solid hands-on knowledge of Kubernetes and container-based microservices architectures.
  • Experience with NVIDIA Base Command Manager (BCM) for provisioning, managing, and supervising GPU clusters at scale.
  • Background with RDMA-based fabrics (InfiniBand or RoCE) in HPC or AI environments.

Sourced directly from NVIDIA’s career page

NVIDIA

2 Locations

Specialisation
Open roles at NVIDIA: 2000 positions
Job ID: /job/UAE-Dubai/Senior-Solutions-Architect--Cloud-Infrastructure-and-DevOps_JR2016420
