Full-Time

HPC Operations Manager

Hardware Engineering

NVIDIA

10,001+ employees

Designs GPUs and AI computing solutions

Compensation Overview

$272k - $425.5k/yr

+ Equity

Senior, Expert

Company Historically Provides H-1B Sponsorship

Austin, TX, USA + 3 more

More locations: Santa Clara, CA, USA | Durham, NC, USA | Westford, MA, USA

Category
Hardware Engineering
System Hardware Engineering
Required Skills
Development Operations (DevOps)
Linux/Unix
Requirements
  • B.S. or M.S. in Computer Science, Computer Engineering, Information Science (or equivalent experience)
  • 15+ years of overall experience
  • 5+ years managing IT infrastructure teams of 10+ people
  • 10+ years of experience running Linux servers, NFS storage, and Ethernet networks
  • Knowledge of HPC schedulers (IBM LSF preferred)
  • Knowledge of hardware design workflows (EDA tools and methodology)
  • Experience using project management and capacity planning software
  • Experience with datacenter operations (rack and stack, maintenance)
Responsibilities
  • Collaborate with partners to develop programs around storage, networking, and compute in data centers
  • Lead, cultivate, and mentor a multi-national team of sysadmins and DevOps engineers
  • Ensure the highest reliability of HPC clusters
  • Develop critical metrics and program schedules to measure program health, predictability, and achievements
  • Identify failures, lead retrospective analysis, and help to develop improvement action plans
  • Build standard methodologies that cut through complexity and can be used across NVIDIA
  • Evaluate the latest technologies (hardware and cloud computing) and recommend future evolution of the infrastructure
  • Plan deployments and refreshes of hardware (compute, storage, network equipment) and the associated software stack (e.g. OS)
  • Work multi-functionally with hardware engineering leaders to support their future chip design needs
  • Lead all aspects of the HPC scheduler (LSF), set/adjust policy, ensure delivery of forecasted compute demand
  • Track software licensing servers and drive efficient license utilization
  • Develop and manage program schedules, milestones and deliverables
  • Regularly communicate program status and key issues to senior management
Desired Qualifications
  • HPC storage (e.g. NetApp, Pure Storage, Lustre, ZFS, Isilon)
  • InfiniBand (operations, debugging, performance tuning)
  • Software development, especially in a DevOps context
  • Knowledge of relational databases, data lakes, metrics/visualization/analytics platforms
  • Deploying and maintaining FlexLM-based software license servers
  • Established relationships with enterprise-level equipment suppliers

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its main products are GPUs that enhance gaming experiences and support professional applications, along with AI and high-performance computing platforms for developers and data scientists. NVIDIA stands out from competitors by offering a combination of hardware and software solutions, including cloud-based services like NVIDIA CloudXR and NGC, which enable scalable applications in AI and machine learning. The company's goal is to drive innovation in technology and provide advanced solutions for a wide range of clients, from gamers to enterprises.

Company Size

10,001+

Company Stage

IPO

Headquarters

Santa Clara, California

Founded

1993

Simplify Jobs

Simplify's Take

What believers are saying

  • NVIDIA's acquisition of Gretel strengthens its AI training data capabilities.
  • Partnership with Together AI expands NVIDIA's influence in AI cloud infrastructure.
  • Backing ClustroAI aligns NVIDIA with the trend of local device AI processing.

What critics are saying

  • Increased competition from startups like Lambda challenges NVIDIA's AI market share.
  • Integration challenges with Gretel may delay synthetic data utilization.
  • Investment in SandboxAQ could be risky if quantum AI models underperform.

What makes NVIDIA unique

  • NVIDIA leads in AI and HPC with cutting-edge GPU technology.
  • Acquisitions like Lepton AI enhance NVIDIA's cloud and AI capabilities.
  • NVIDIA's strategic investments in quantum AI models set it apart from competitors.

Benefits

Company Equity

401(k) Company Match

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

1%

2 year growth

-1%

Aibase
Apr 8th, 2025
Nvidia Acquires Lepton AI for Millions

Nvidia has completed its acquisition of Lepton AI, a startup founded by former Alibaba VP Yangqing Jia, for reportedly hundreds of millions of dollars. Lepton AI, established in 2023, focuses on AI infrastructure and cloud solutions. Co-founders Yangqing Jia and Junjie Bai have joined Nvidia. Jia, a notable AI expert, previously contributed to TensorFlow at Google and led AI R&D at Alibaba.

Yahoo Finance
Apr 7th, 2025
Rescale Raises $115M in Venture Funding

Rescale, a San Francisco-based startup specializing in engineering software for designing race cars and computer chips, secured $115 million in venture financing.

Reuters
Apr 4th, 2025
AI startup SandboxAQ adds Nvidia, Google as backers, raises additional $150 million

SandboxAQ, a startup drawing on quantum computing techniques to develop quantitative artificial intelligence models for enterprises, said it has raised $150 million from new investors including Google, Nvidia and BNP Paribas.

TechCrunch
Apr 3rd, 2025
Runway, best known for its video-generating AI models, raises $308M

Runway, a startup best known for its suite of generative AI media tools, has raised $308 million in a new funding round.

TechCrunch
Mar 19th, 2025
Nvidia reportedly acquires synthetic data startup Gretel

Nvidia has reportedly acquired Gretel, a San Diego-based startup that's developed a platform to generate synthetic AI training data. Terms of the