Full-Time

Senior Site Reliability Engineer

AI Research Clusters

NVIDIA

10,001+ employees

Designs GPUs and AI computing solutions

Automotive & Transportation
Enterprise Software
AI & Machine Learning
Gaming

Compensation Overview

$184k - $425.5k Annually

+ Equity

Senior

Company Historically Provides H1B Sponsorship

Austin, TX, USA + 4 more

More locations: Redmond, WA, USA | Santa Clara, CA, USA | Durham, NC, USA | Westford, MA, USA

This is a hybrid position, requiring some in-office presence.

Category
AI Research
AI & Machine Learning
DevOps & Infrastructure
Site Reliability Engineering
Required Skills
Bash
Kubernetes
Python
MySQL
Docker
Terraform
Ansible

Requirements
  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with at least 6 years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments, including operational experience with clusters of at least 5,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Passion for solving complex technical challenges and optimizing system performance
  • Experience with advanced AI/HPC job schedulers, ideally including Slurm
  • Solid experience with GPU clusters and working knowledge of cluster configuration management tools such as BCM or Ansible, and of infrastructure-level applications such as Kubernetes, Terraform, and MySQL
  • In-depth understanding of container technologies such as Docker and Enroot
  • Experience with Python programming and Bash scripting
Responsibilities
  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Tackle strategic challenges in large-scale, high-performance computing environments
  • Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios while working with internal and external partners
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Practice sustainable incident response and blameless postmortems
  • Be part of an on call rotation to support production systems
  • Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world
  • Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log
  • Manage upgrades and automated rollbacks across all clusters
Desired Qualifications
  • Interest in crafting, analyzing and fixing large-scale distributed systems
  • Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
  • Familiarity with InfiniBand, including IPoIB and RDMA
  • Experience with cloud deployment, BCM, and Terraform
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow
  • Multi-cloud experience

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. Unlike many competitors, NVIDIA focuses heavily on research and development to maintain its leadership in technology and innovation. The company's goal is to drive advancements in AI and computing to provide effective solutions for a wide range of clients, from gamers to enterprises.

Company Size

10,001+

Company Stage

IPO

Total Funding

$19.5M

Headquarters

Santa Clara, California

Founded

1993

Simplify's Take

What believers are saying

  • NVIDIA's partnership with Together AI enhances its influence in AI infrastructure.
  • Backing GamerBoom shows NVIDIA's potential in the growing Web3 gaming sector.
  • Serve Robotics' expansion reflects demand for AI-driven urban logistics solutions.

What critics are saying

  • Lambda's funding could increase competition for NVIDIA's cloud services.
  • ClustroAI's edge AI technology may challenge NVIDIA's cloud-based solutions.
  • xAI's growth could create competitive tensions with overlapping AI solutions.

What makes NVIDIA unique

  • NVIDIA leads in AI and HPC with cutting-edge GPU technology.
  • The Omniverse platform expands NVIDIA's reach in digital twin and industrial AI markets.
  • NVIDIA's investment in edge AI aligns with trends in local device processing.

Benefits

Company Equity

401(k) Company Match

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

0%

2 year growth

-1%
Business Wire
Feb 21st, 2025
Lambda Raises $480M to Expand AI Cloud Platform

Lambda, the AI Developer Cloud, today announced it has raised a $480 million Series D, bringing the total equity capital raised to date to $863 millio

PR Newswire
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and Prosperity7, valuing the company at $3.3 billion. The investment will enhance its AI Acceleration Cloud, focusing on open source models and NVIDIA Blackwell GPU deployment. Together AI supports over 450,000 developers and partners with major firms like Salesforce and Zoom. The platform offers enterprise-grade AI solutions with advanced infrastructure and research innovations for improved efficiency and cost-effectiveness.

CoinCentral
Feb 19th, 2025
NVIDIA-Backed Edge AI Startup ClustroAI Raises $12M to Bring AI Processing to Local Devices

San Francisco-based ClustroAI raised $12M in Series A funding to advance its edge AI technology that enables local device AI processing without cloud computing

Alexa Blockchain
Feb 12th, 2025
GamerBoom Raises $9M with NVIDIA Backing

GamerBoom, an AI-powered gaming data analytics protocol on Solana, raised $9M in a funding round, totaling over $11M. Investors include Bing Ventures, SKY Ventures, and NVIDIA, enhancing its AI capabilities. The funding will scale AI-driven gaming data solutions for Web3. GamerBoom is part of Binance’s MVB Accelerator Program and plans to launch a rewards program and NFT sales.

TechCrunch
Jan 15th, 2025
Nvidia backs MetAI, a Taiwanese startup that creates AI-powered digital twins

Nvidia has been doubling down on the opportunity to build robotics and other industrial AI applications, with the launch of its Omniverse platform, and