Full-Time

Senior HPC Operations Engineer

Posted on 5/14/2024

Lambda

51-200 employees

On-demand public cloud with NVIDIA GPUs for deep learning

Hardware

Senior

Remote in USA

Required Skills
Linux/Unix
Requirements
  • 10+ years of experience in managing HPC clusters
  • 10+ years of everyday Linux experience
  • Strong understanding of HPC architecture
  • Experience with Bright Cluster Manager or similar tools
  • Expertise in configuring and troubleshooting network fabrics, compute nodes, and job scheduling systems
  • Problem-solving and troubleshooting skills
  • Flexibility to travel to North American data centers
  • Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience
Responsibilities
  • Remotely provision and manage large-scale HPC clusters for AI workloads
  • Install and configure operating systems, firmware, software, and networking on HPC clusters
  • Troubleshoot and resolve HPC cluster issues
  • Provide clear requirements to automation and design teams
  • Contribute to SOP creation and maintenance
  • Provide updates to project leads
  • Mentor team members
  • Stay updated on HPC/AI technologies

Focused on deep learning and generative AI, Lambda offers on-demand access to NVIDIA H100 Tensor Core GPUs in a public cloud built for large-scale AI projects. Its cloud clusters are interconnected with 3200 Gbps InfiniBand for high-bandwidth communication between nodes. Its open source AI software stack, used by over 50,000 machine learning teams, supports industry-standard tools such as PyTorch® and TensorFlow.

Company Stage

Series C

Total Funding

$932.2M

Headquarters

San Jose, California

Founded

2012

Growth & Insights
Headcount

6 month growth

8%

1 year growth

19%

2 year growth

89%