Full-Time

Senior Kubernetes Operations Engineer

Posted on 6/3/2024

Lambda

51-200 employees

On-demand public cloud with NVIDIA GPUs for deep learning

Hardware

Senior

Remote in UK

Required Skills
Kubernetes
Linux/Unix
Requirements
  • Experienced operations engineer, SRE, sysadmin or similar
  • Deep knowledge of running Linux clusters and systems
  • Familiarity with running on bare metal
  • Good understanding of containers, virtualization, and Kubernetes
  • Experience in an on-call environment and incident response
  • Ability to learn on the fly and adapt to solve problems
  • Experience with customer interaction during incidents
Responsibilities
  • Remotely install, upgrade, operate, and maintain bare-metal Kubernetes clusters
  • Handle cluster degradation, recovery, and resizing
  • Perform on-call response for critical incidents
  • Improve tooling, automation, and processes for daily operations
  • Assist customers with Kubernetes questions and integration
  • Assist with cluster build-outs and validation
  • Work closely with other Ops teams
  • Mentor and assist team members
  • Contribute to product direction

Lambda focuses on deep learning and generative AI, offering on-demand access to NVIDIA H100 Tensor Core GPUs in a public cloud aimed at massive-scale AI projects. Its cloud clusters use 3200 Gbps InfiniBand for high processing speed and efficiency. Its open-source AI software stack, used by over 50,000 machine learning teams, supports industry-standard tools such as PyTorch® and TensorFlow.

Company Stage

Series C

Total Funding

$932.2M

Headquarters

San Jose, California

Founded

2012

Growth & Insights
Headcount

6-month growth

9%

1-year growth

13%

2-year growth

82%