Full-Time

Senior Site Reliability Engineer

Posted on 3/16/2025

NVIDIA

10,001+ employees

Designs GPUs and AI computing solutions

Compensation Overview

$168k - $322k/yr

+ Equity

Senior, Expert

Company Historically Provides H1B Sponsorship

Santa Clara, CA, USA

Category
  • DevOps & Infrastructure
  • Site Reliability Engineering
Required Skills
  • TCP/IP
  • Kubernetes
  • Microsoft Azure
  • Python
  • Ruby
  • Groovy
  • AWS
  • Go
  • Linux/Unix
  • Google Cloud Platform
Requirements
  • B.S. degree in Computer Science or a related technical field (or equivalent experience), with over 10 years of experience building and supporting critical services.
  • Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).
  • Deep understanding of Linux operating systems and TCP/IP fundamentals.
  • Expertise with at least one major cloud service provider (AWS, GCP, or Azure).
  • Demonstrated proficiency with end-to-end SRE capabilities and observability.
  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.
  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
  • Creative problem solver with excellent debugging skills and great communication and documentation abilities.
Responsibilities
  • Own the solutions you build, collaborating with cross-functional teams to successfully implement them.
  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
  • Continuously improve solution provisioning and management through automation.
  • Identify areas to improve service resiliency using industry-standard practices.
  • Detect performance issues and recommend solutions to maintain world-class service quality.
  • Conduct capacity management and planning to meet ongoing operational needs.
  • Participate in incident reviews, assist in root cause identification, and write RCA reports.
  • Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment (AWS, GCP, and on-premises).
  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
  • Participate in the team's on-call rotation.
Desired Qualifications
  • Linux certification from a well-known vendor such as Red Hat or Oracle.
  • Prior experience managing large-scale Kubernetes deployments in production.
  • Strong skills in modern container networking and storage architecture.
  • Well-known cloud certification(s).
  • Hands-on experience working with Slurm/LSF environments.

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its main products are GPUs that enhance gaming experiences and support professional applications, along with AI and high-performance computing platforms for developers and data scientists. NVIDIA stands out from competitors by offering a combination of hardware and software solutions, including cloud-based services like NVIDIA CloudXR and NGC, which cater to a wide range of applications such as AI and machine learning. The company's goal is to drive innovation in technology and provide advanced solutions that meet the needs of gamers, researchers, and enterprises.

Company Size

10,001+

Company Stage

IPO

Headquarters

Santa Clara, California

Founded

1993

Simplify's Take

What believers are saying

  • Acquisition of Gretel enhances AI training with robust synthetic data capabilities.
  • Integration of Augtera Networks boosts NVIDIA's data center networking efficiency.
  • Backing ClustroAI aligns with the trend of local device AI processing.

What critics are saying

  • Increased competition from AI cloud platforms like Lambda and Together AI.
  • Potential integration challenges with Gretel affecting operational efficiency.
  • Investment in ClustroAI may cannibalize NVIDIA's cloud-based AI solutions.

What makes NVIDIA unique

  • NVIDIA leads in AI and HPC with cutting-edge GPU technology.
  • The company excels in diverse markets: gaming, data centers, and autonomous vehicles.
  • NVIDIA's Omniverse platform enhances industrial AI applications and digital twin creation.

Benefits

Company Equity

401(k) Company Match

Growth & Insights and Company News

Headcount

  • 6 month growth: 0%
  • 1 year growth: 0%
  • 2 year growth: -1%

TechCrunch
Mar 19th, 2025
Nvidia reportedly acquires synthetic data startup Gretel | TechCrunch

Nvidia has reportedly acquired Gretel, a San Diego-based startup that's developed a platform to generate synthetic AI training data. Terms of the

Data Center Dynamics
Mar 4th, 2025
Nvidia quietly acquires AIOps firm Augtera Networks

GPU giant rolls networking monitoring firm into Spectrum-X portfolio

Business Wire
Feb 21st, 2025
Lambda Raises $480M to Expand AI Cloud Platform

Lambda, the AI Developer Cloud, today announced it has raised a $480 million Series D, bringing the total equity capital raised to date to $863 millio

PR Newswire
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and Prosperity7, valuing the company at $3.3 billion. The investment will enhance its AI Acceleration Cloud, focusing on open source models and NVIDIA Blackwell GPU deployment. Together AI supports over 450,000 developers and partners with major firms like Salesforce and Zoom. The platform offers enterprise-grade AI solutions with advanced infrastructure and research innovations for improved efficiency and cost-effectiveness.

CoinCentral
Feb 19th, 2025
NVIDIA-Backed Edge AI Startup ClustroAI Raises $12M to Bring AI Processing to Local Devices - CoinCentral

San Francisco-based ClustroAI raised $12M in Series A funding to advance its edge AI technology that enables local device AI processing without cloud computing
