Full-Time

Senior Site Reliability Engineer

Supply

Posted on 6/2/2025

Foundry

11-50 employees

Self-serve GPU cloud for ML

Compensation Overview

$170k - $230k/yr

San Francisco, CA, USA

In Person

In-office requirement: in-person presence is mandatory at the Palo Alto HQ or the SF office, and fully remote work is not allowed. Candidates are expected to work primarily from their local office, with one HQ visit per week.

Category
DevOps & Infrastructure
Requirements
  • Proven experience deploying, scaling, and maintaining production-grade Kubernetes clusters across both cloud and on-prem environments
  • Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or equivalent professional experience
  • Experience working with Linux systems administration and command-line interfaces
  • Ability to create technical documentation and technical specs
  • Scripting and automation skills (Python, Bash, or similar)
  • Understanding of key infrastructure metrics (CPU, memory, network utilization, error rates)
  • Understanding of data center operations: disaster recovery, maintenance schedules, capacity planning
  • Strong written and verbal communication skills, with ability to translate technical concepts for various audiences
  • Project management experience and ability to handle multiple priorities
  • Demonstrated problem-solving and analytical thinking skills
  • Experience leading or participating in incident response and root cause analysis
Responsibilities
  • Design, deploy, and manage scalable, secure, and highly available Kubernetes clusters in both cloud and on-premises environments
  • Create, refine, and execute Ansible playbooks to perform routine maintenance, load testing, and system burn-in operations across Mithril’s fleet
  • Deploy and oversee monitoring systems, such as Grafana, to proactively detect issues and anomalies in our supplier environment
  • Establish and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and uphold system reliability
  • Lead or participate in incident response and root cause analysis
  • Provide regular updates on machine operability, swiftly notifying internal and external partners of disruptions to maintain system availability and supplier confidence
  • Serve as the primary liaison with suppliers, maintaining a regular meeting cadence to communicate Mithril’s requirements and address supplier inquiries
  • Coordinate cross-functional supply-related initiatives, ensuring all stakeholders are informed, aligned, and prepared for upcoming changes or maintenance events
Desired Qualifications
  • Familiarity with GPU/CPU cluster management and optimization
  • Proficiency with Git or similar version control systems
  • Experience with monitoring and observability tools such as Prometheus or Grafana
  • Experience in technical training or presenting technical content
  • Prior experience as a Site Reliability Engineer (SRE) in the AI/ML domain is highly desirable
  • Familiarity with the challenges of scaling large-scale infrastructure
  • Familiarity with hardware lifecycle management (e.g., RMA processes)
  • Experience in technical customer or vendor-facing roles

Foundry provides a self-serve cloud platform for machine learning research and infrastructure teams to access NVIDIA GPUs on demand. Users can spin up short-term GPU instances for model development and experimentation, designed for burst compute so resources can scale quickly as needed. Pricing is market-based and updated weekly, allowing teams to access GPU power at competitive rates. Compared to traditional fixed-asset or long-term cloud contracts, Foundry emphasizes flexible, on-demand GPU capacity and real-time pricing to support demanding ML infrastructure needs. The goal is to give ML teams fast, scalable, and cost-conscious access to GPU compute for development and experimentation.

Company Size

11-50

Company Stage

Series A

Total Funding

$80M

Headquarters

Palo Alto, California

Founded

2022

Simplify Jobs

Simplify's Take

What believers are saying

  • $80M seed and Series A funding fuels global compute capacity orchestration.
  • Nebius integration expands beyond NVIDIA GPUs for diverse ML workloads.
  • Open-source fine-tuning demand aligns with flexible short-term GPU access.

What critics are saying

  • Nebius direct Mlfoundry integration bypasses Foundry in 3-6 months.
  • CoreWeave captures 40% AI GPU contracts with lower latency in 6-12 months.
  • Lambda Labs spot pricing undercuts Foundry by 25% in 6-9 months.

What makes Foundry unique

  • Foundry enables resellable reserved GPU instances for as short as three hours.
  • Market-based pricing updates weekly for competitive burst compute access.
  • Kubernetes orchestration supports programmatic scaling without manual scheduling.

Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Retirement Plan

401(k) Company Match

Paid Vacation

Growth & Insights and Company News

Headcount

6 month growth

2%

1 year growth

5%

2 year growth

21%
Foundry
Jul 8th, 2025
Access Nebius through Foundry Cloud Platform

In addition to new compute types, Mlfoundry is working on additional technical integrations with Nebius to make it even easier for Foundry customers to provision exactly the compute they need.

Business Wire
Mar 22nd, 2024
Foundry Raises $80 Million in Seed and Series A Funding to Build a New Breed of Public Cloud

Today, Foundry, a new breed of public cloud for AI/ML workloads, announced $80M in seed and Series A funding to orchestrate the world’s compute capaci

INACTIVE