Full-Time

Senior Site Reliability Engineer

GPU Clusters

NVIDIA

10,001+ employees

Designs GPUs and AI computing solutions

Automotive & Transportation
Enterprise Software
AI & Machine Learning
Gaming

Compensation Overview

$184k - $356.5k Annually

+ Equity

Senior, Expert

Company Historically Provides H1B Sponsorship

Austin, TX, USA + 3 more

More locations: Santa Clara, CA, USA | Durham, NC, USA | Westford, MA, USA

Category
DevOps & Infrastructure
Site Reliability Engineering
Required Skills
TCP/IP
Kubernetes
Microsoft Azure
Python
Ruby
Docker
AWS
Go
Terraform
Ansible
Linux/Unix
Google Cloud Platform

Requirements
  • Minimum BS degree in Computer Science (or equivalent experience)
  • 7+ years of software engineering experience
  • 3+ years managing GPU clusters or similar high-performance computing environments
  • Expertise in designing, deploying, and running production-level cloud services
  • Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar
  • Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby)
  • Strong proficiency with Linux operating systems and TCP/IP fundamentals
  • Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible
  • Diligent, with strong communication and documentation skills
Responsibilities
  • Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads
  • Continuously improve infrastructure provisioning, management, and monitoring through automation
  • Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution
  • Support a globally distributed cloud environment (such as AWS, GCP, Azure, or OCI) as well as on-premises infrastructure
  • Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality
  • Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences
  • Participate in the team's on-call rotation to support critical infrastructure
  • Drive the evaluation and integration of new GPU technologies, such as GB200, and cloud technologies to improve system performance
Desired Qualifications
  • Experience managing large-scale Slurm and/or BCM deployments in production environments
  • Expertise in modern container networking and storage architectures
  • Proven track record of defining and driving operational excellence in highly distributed, high-performance environments

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from its competitors is its strong focus on research and development, allowing it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Company Stage

IPO

Total Funding

$19.5M

Headquarters

Santa Clara, California

Founded

1993

Growth & Insights
Headcount

6 month growth

0%

1 year growth

0%

2 year growth

-1%

Simplify's Take

What believers are saying

  • Acquisition of VinBrain enhances NVIDIA's AI-driven healthcare solutions.
  • Investment in Nebius Group boosts NVIDIA's AI infrastructure capabilities.
  • Partnership with Serve Robotics aligns with NVIDIA's focus on robotics and AI applications.

What critics are saying

  • Increased competition from AI startups like xAI challenges NVIDIA's market position.
  • Serve Robotics' rapid expansion may lead to financial strain if market growth lags.
  • Integration challenges from VinBrain acquisition may affect NVIDIA's operational efficiency.

What makes NVIDIA unique

  • NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
  • The Omniverse platform enhances NVIDIA's capabilities in industrial AI and digital twins.
  • NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Benefits

Company Equity

401(k) Company Match