Full-Time

Senior HPC Systems Engineer

Posted on 5/14/2024

Lambda

51-200 employees

On-demand public cloud with NVIDIA GPUs for deep learning

Hardware

Senior

Remote in USA + 1 more

Required Skills
Python
Linux/Unix
Requirements
  • Expertise in architecting, operating, and debugging large-scale HPC network and storage infrastructure
  • Experience with MPI, NCCL, RDMA, InfiniBand, and parallel file systems
  • Proficiency in building complex software using Python
  • Deep understanding of Linux fundamentals, especially its networking stack
  • Experience with large GPU clusters and virtualization
  • Background in Computer Science, Electrical Engineering, Mathematics, or Physics
Responsibilities
  • Design and architect AI supercomputers for the cloud
  • Introduce technology to improve performance of HPC storage and networking infrastructure
  • Benchmark, tune, and optimize hypervisors, network, and storage
  • Set up monitoring and alerting for high availability
  • Provide guidance to HPC customers

Focused on deep learning and generative AI, Lambda offers on-demand access to NVIDIA H100 Tensor Core GPUs in a public cloud built for massive-scale AI projects. Its cloud clusters are connected by 3200 Gbps InfiniBand for fast, efficient multi-node processing. Its open source AI software stack, used by over 50,000 machine learning teams, supports industry-standard tools such as PyTorch® and TensorFlow.

Company Stage

Series C

Total Funding

$932.2M

Headquarters

San Jose, California

Founded

2012

Growth & Insights
Headcount

6 month growth

8%

1 year growth

19%

2 year growth

89%
INACTIVE