Full-Time

Senior Software Engineer

AI Resiliency

NVIDIA

10,001+ employees

Designs GPUs and AI computing solutions

Compensation Overview

$184k - $287.5k/yr

+ Equity

Senior

Company Historically Provides H1B Sponsorship

Redmond, WA, USA + 1 more

More locations: Santa Clara, CA, USA

Category
Backend Engineering
FinTech Engineering
Software Engineering
Required Skills
Python
TensorFlow
PyTorch
C/C++
Requirements
  • Bachelor’s, Master’s or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Proficiency in C++ and Python, with experience in writing efficient, high-performance code
  • 6+ years of relevant experience
  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment
Responsibilities
  • Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection
  • Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs
  • Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation
  • Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA
  • Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads
  • Support Production Deployments: Assist in debugging and tuning the performance of large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference
Desired Qualifications
  • Hands-on experience in training models or working with model training teams
  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale
  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads
  • Strong systems programming skills and experience with low-level performance tuning

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its main products are GPUs that enhance gaming experiences and support professional applications, along with AI and high-performance computing platforms for developers and data scientists. NVIDIA differentiates itself from competitors by focusing on advanced technology and continuous innovation so that its products meet the evolving needs of users. The company's goal is to lead in AI and high-performance computing by providing powerful hardware and software solutions that enable immersive experiences and drive advancements in various industries.

Company Size

10,001+

Company Stage

IPO

Headquarters

Santa Clara, California

Founded

1993

Simplify's Take

What believers are saying

  • Acquisition of Lepton AI enhances NVIDIA's cloud service offerings and AI market position.
  • Investment in nEye Systems aligns with improving data transfer efficiency in data centers.
  • Backing SandboxAQ suggests strategic interest in quantum computing for future AI innovations.

What critics are saying

  • Emerging competition from startups like nEye Systems challenges NVIDIA's AI hardware dominance.
  • Integration challenges may arise from acquiring Lepton AI with its distinct focus.
  • Investment in SandboxAQ may not yield immediate returns due to nascent quantum technology.

What makes NVIDIA unique

  • NVIDIA leads in AI and HPC with cutting-edge GPU technology.
  • The company excels in gaming and professional visualization markets with innovative solutions.
  • NVIDIA's cloud services, like CloudXR, offer scalable AI and machine learning applications.

Benefits

Company Equity

401(k) Company Match

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

1%

2 year growth

-1%
SiliconANGLE
Apr 11th, 2025
nEye Systems raises $58M for AI chips

Silicon photonics startup nEye Systems raised $58M in funding led by CapitalG, with participation from Microsoft, Micron, Nvidia, and others. The Emeryville-based company is developing optical networking chips for AI data centers, promising faster, more efficient, and cost-effective data transfers. nEye's technology aims to overcome bandwidth and energy limitations of current electrical interconnects. Prototypes are ready, with production samples expected next year. Total funding exceeds $72M.

Aibase
Apr 8th, 2025
Nvidia Acquires Lepton AI for Millions

Nvidia has completed its acquisition of Lepton AI, a startup founded by former Alibaba VP Yangqing Jia, for reportedly hundreds of millions of dollars. Lepton AI, established in 2023, focuses on AI infrastructure and cloud solutions. Co-founders Yangqing Jia and Junjie Bai have joined Nvidia. Jia, a notable AI expert, previously contributed to TensorFlow at Google and led AI R&D at Alibaba.

Yahoo Finance
Apr 7th, 2025
Rescale Raises $115M in Venture Funding

Rescale, a San Francisco-based startup specializing in engineering software for designing race cars and computer chips, secured $115 million in venture financing.

Reuters
Apr 4th, 2025
AI startup SandboxAQ adds Nvidia, Google as backers, raises additional $150 million

SandboxAQ, a startup drawing on quantum computing techniques to develop quantitative artificial intelligence models for enterprises, said it has raised $150 million from new investors including Google, Nvidia, and BNP Paribas.

TechCrunch
Apr 3rd, 2025
Runway, best known for its video-generating AI models, raises $308M

Runway, a startup best known for its suite of generative AI media tools, has raised $308 million in a new funding round.