Site Reliability Engineer
London
Anthropic

51-200 employees

AI research firm creating reliable, interpretable systems
Company Overview
Anthropic is an AI research company focused on building reliable, interpretable, and steerable AI systems. Its first product is Claude, an AI assistant designed for tasks of any scale. The company's interdisciplinary team, with expertise in ML, physics, policy, and product, works collaboratively to develop beneficial AI systems. Its research interests span natural language, human feedback, scaling laws, reinforcement learning, code generation, and interpretability.
AI & Machine Learning
Consulting
B2B

Company Stage

N/A

Total Funding

$3B

Founded

2021

Headquarters

San Francisco, California

Growth & Insights
Headcount

6 month growth

-16%

1 year growth

85%

2 year growth

348%
Locations
London, UK
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
AWS
Development Operations (DevOps)
Google Cloud Platform
Linux/Unix
Terraform
Kubernetes
Python
Categories
DevOps & Infrastructure
Requirements
  • Significant experience with Kubernetes and cloud-native infrastructure
  • A DevOps/SRE mindset: you enjoy debugging complex systems and automating solutions
  • Strong communication skills for working with a range of technical and non-technical colleagues
  • An interest in the societal impacts of ML and a commitment to building robust, reliable systems
  • Cloud infrastructure on AWS/GCP
  • Terraform/Infrastructure as Code
  • Monitoring/alerting tools like Prometheus/Grafana
  • Python and Linux SysAdmin skills
  • Significant experience with Kubernetes architecture and administration
  • Strong Linux skills and cloud infrastructure expertise
  • Familiarity with networking, caching, and storage optimizations
  • Track record of building resilient, scalable systems
  • Comfort debugging complex, distributed systems
  • Excellent communication and collaboration skills
Responsibilities
  • Own Kubernetes clusters with thousands of nodes
  • Troubleshoot and resolve issues across the stack, from networking to applications
  • Improve monitoring, alerting, and incident response
  • Automate operations and infrastructure management
  • Partner with ML researchers and engineers to meet their infrastructure needs
  • Tune autoscaling and resource allocation for ML jobs
  • Build fault-tolerance into infrastructure to handle node failures
  • Monitor clusters and set up alerts/on-call playbooks
  • Migrate cloud deployments to Kubernetes using Terraform