Full-Time

HPC Operations Engineer

Posted on 4/18/2024

Lambda

51-200 employees

On-demand NVIDIA GPUs for deep learning

Hardware

Compensation Overview

$120,000 - $160,000 Annually

+ Cash Compensation + Equity Compensation

Mid

Remote in USA

Required Skills
Linux/Unix
Requirements
  • 3+ years of experience in deploying and configuring HPC clusters for AI workloads
  • Good understanding of HPC/AI architecture, operating systems, firmware, software, and networking
  • Familiarity with Bright Cluster Manager or similar cluster management tools
  • Experience in configuring and troubleshooting SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
  • Experience with Linux-based compute nodes, firmware updates, driver installation
  • Experience with SLURM, Kubernetes, or other job scheduling systems
  • Ability to work independently and as part of a team
Responsibilities
  • Remotely deploy and configure large-scale HPC clusters for AI workloads
  • Install and configure operating systems, firmware, software, and networking on HPC clusters
  • Troubleshoot and resolve HPC cluster issues
  • Provide clear and detailed requirements back to the HPC design team
  • Contribute to the creation and maintenance of Standard Operating Procedures
  • Stay up-to-date on the latest HPC/AI technologies and best practices

Lambda offers on-demand access to NVIDIA H100 Tensor Core GPUs in a public cloud, designed for deep learning and generative AI, with the ability to reserve thousands of GPUs with 3200 Gbps InfiniBand for cloud clusters. Additionally, Lambda provides an open-source AI software stack used by over 50,000 ML teams, including PyTorch®, TensorFlow, CUDA, cuDNN, and NVIDIA drivers.

Company Stage

Series C

Total Funding

$932.2M

Headquarters

San Jose, California

Founded

2012

Growth & Insights

Headcount

6 month growth: 8%

1 year growth: 19%

2 year growth: 102%
Status: Inactive