Full-Time

Senior Observability Engineer

AI and HPC

Posted on 3/15/2025

NVIDIA

10,001+ employees

Designs GPUs and AI computing solutions

Compensation Overview

$184k - $356.5k/yr

+ Equity

Senior, Expert

Company Historically Provides H1B Sponsorship

Santa Clara, CA, USA

Category
Applied Machine Learning
AI Research
AI & Machine Learning
Required Skills
Python
Grafana
Apache Spark
Prometheus
Data Analysis
Requirements
  • Experience developing large scale, distributed observability systems.
  • Ability to collaborate with data scientists, researchers, and engineering teams to identify high value data for collection and analysis.
  • Experience turning raw data into actionable reports.
  • Experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools.
  • Python programming experience and use of API calls.
  • Passion for improving the productivity of others.
  • Excellent planning and interpersonal skills.
  • Flexibility/adaptability working in a dynamic environment with changing requirements.
  • MS (preferred) or BS in Computer Science, Electrical Engineering, or related field (or equivalent experience).
  • 8+ years of proven experience.
Responsibilities
  • Collaborate with AI, HW, SW engineering and research teams to deliver observability solutions that meet their needs in AI/HPC clusters.
  • Develop, test, and deploy data collectors, pipelines, visualization and retrieval services.
  • Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements.
  • Work in a diverse team to provide operational and strategic data to empower our engineers and researchers to improve performance, productivity, and efficiency.
  • Continuously improve quality, workloads, and processes through better observability.
Desired Qualifications
  • Background in computer science, machine learning, deep learning, open-source software, infrastructure technologies, and GPU technology.
  • Prior experience in infrastructure software, production application development, software development, release, and support methodologies, and DevOps.
  • Experience in the management of datacenters and large-scale distributed computing.
  • Experience working with AI researchers and/or EDA developers.
  • Consistent track record of driving process improvements and measuring efficiency, a passion for sharing knowledge, and experience driving complex projects end-to-end.

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its main products are GPUs that enhance gaming experiences and support professional applications, along with AI and high-performance computing platforms for developers and data scientists. NVIDIA's products work by providing powerful processing capabilities that enable complex graphics rendering and data analysis. Unlike many competitors, NVIDIA focuses on a wide range of applications, from gaming to AI, and offers cloud-based services that enhance the usability of its hardware. The company's goal is to drive innovation in technology and provide advanced solutions that meet the needs of diverse clients, including gamers, researchers, and enterprises.

Company Size

10,001+

Company Stage

IPO

Headquarters

Santa Clara, California

Founded

1993

Simplify's Take

What believers are saying

  • Acquisition of Lepton AI enhances NVIDIA's cloud service offerings and AI market position.
  • Investment in nEye Systems aligns with improving data transfer efficiency in data centers.
  • Backing SandboxAQ suggests strategic interest in quantum computing for future AI innovations.

What critics are saying

  • Emerging competition from startups like nEye Systems challenges NVIDIA's AI hardware dominance.
  • Integration challenges may arise from acquiring Lepton AI with its distinct focus.
  • Investment in SandboxAQ may not yield immediate returns due to nascent quantum technology.

What makes NVIDIA unique

  • NVIDIA leads in AI and HPC with cutting-edge GPU technology.
  • The company excels in gaming and professional visualization markets with innovative solutions.
  • NVIDIA's cloud services, like CloudXR, offer scalable AI and machine learning applications.

Benefits

Company Equity

401(k) Company Match

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

1%

2 year growth

0%
SiliconANGLE
Apr 11th, 2025
nEye Systems raises $58M for AI chips

Silicon photonics startup nEye Systems raised $58M in funding led by CapitalG, with participation from Microsoft, Micron, Nvidia, and others. The Emeryville-based company is developing optical networking chips for AI data centers, promising faster, more efficient, and cost-effective data transfers. nEye's technology aims to overcome bandwidth and energy limitations of current electrical interconnects. Prototypes are ready, with production samples expected next year. Total funding exceeds $72M.

Aibase
Apr 8th, 2025
Nvidia Acquires Lepton AI for Millions

Nvidia has completed its acquisition of Lepton AI, a startup founded by former Alibaba VP Yangqing Jia, for reportedly hundreds of millions of dollars. Lepton AI, established in 2023, focuses on AI infrastructure and cloud solutions. Co-founders Yangqing Jia and Junjie Bai have joined Nvidia. Jia, a notable AI expert, previously contributed to TensorFlow at Google and led AI R&D at Alibaba.

Yahoo Finance
Apr 7th, 2025
Rescale Raises $115M in Venture Funding

Rescale, a San Francisco-based startup specializing in engineering software for designing race cars and computer chips, secured $115 million in venture financing.

Reuters
Apr 4th, 2025
AI startup SandboxAQ adds Nvidia, Google as backers, raises additional $150 million

SandboxAQ, a startup drawing on quantum computing techniques to develop quantitative artificial intelligence models for enterprises, said it has raised $150 million from new investors including Google, Nvidia and BNP Paribas.

TechCrunch
Apr 3rd, 2025
Runway, best known for its video-generating AI models, raises $308M

Runway, a startup best known for its suite of generative AI media tools, has raised $308 million in a new funding round.

INACTIVE