Full-Time

Lead Site Reliability Engineer

Replit

51-200 employees

Cloud-based platform for coding collaboration

Enterprise Software
AI & Machine Learning
Education

Senior

San Mateo, CA, USA

Hybrid role with in-office requirement on Monday, Wednesday, and Friday.

Category
DevOps & Infrastructure
Site Reliability Engineering
Required Skills
Kubernetes
Python
Go
Terraform
Ansible
Development Operations (DevOps)
Requirements
  • 5+ years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering)
  • Strong programming skills in languages commonly used for automation (Python, Go, or similar)
  • Deep understanding of distributed systems
  • Experience with container orchestration platforms (Kubernetes) and cloud-native technologies
  • Proven track record of implementing and maintaining monitoring/observability solutions
  • Strong incident management skills with experience leading incident response
  • Experience with infrastructure as code and configuration management tools
  • Solid understanding of networking concepts and security best practices
Responsibilities
  • Design and Implement Observability Solutions: Develop comprehensive monitoring and alerting systems using modern observability tools. Create dashboards and metrics that provide real-time visibility into system health and performance. Implement logging strategies that enable quick problem identification and resolution.
  • Drive Automation and Infrastructure as Code: Architect and implement infrastructure automation solutions using tools like Terraform, Ansible, or Pulumi. Design and maintain CI/CD pipelines that enable reliable and consistent deployments. Create self-healing systems that can automatically respond to common failure scenarios.
  • Establish SLOs and SLIs: Work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to track and report on these metrics, ensuring we maintain high reliability standards without slowing the pace of innovation.
  • Incident Management and Response: Lead incident response efforts, conduct thorough post-mortems, and implement improvements to prevent future occurrences. Develop and maintain runbooks for critical services. Build tools and processes that reduce Mean Time To Recovery (MTTR).
  • Performance Optimization: Identify and resolve performance bottlenecks across our infrastructure. Implement capacity planning strategies and optimize resource utilization. Work on reducing latency and improving system efficiency across global regions.
Desired Qualifications
  • Experience with Google Cloud Platform (GCP) services and tools
  • Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.)
  • Experience with chaos engineering practices and tools
  • Contributions to open-source projects related to SRE, DevOps, or infrastructure tools
  • Experience working with developer platforms or PaaS solutions
  • Background in performance optimization and capacity planning

Replit provides a cloud-based platform for software development and deployment, allowing users to write, run, and share code directly from their web browser. This eliminates the need for complicated local setups, making it easier for a wide range of users, including enterprises, freelancers, and students, to engage in coding. Replit features an online code editor and an integrated development environment (IDE) that supports multiple programming languages. It also includes tools for real-time collaboration, code sharing, and project management. Users can access AI-powered coding assistance to enhance their development experience. Replit operates on a subscription model with various pricing tiers that offer additional features, and it generates revenue through enterprise solutions and educational partnerships. What sets Replit apart from its competitors is its focus on community engagement and accessibility, making it suitable for both beginners and experienced developers.

Company Stage

Late Stage VC

Total Funding

$216M

Headquarters

San Francisco, California

Founded

2016

Growth & Insights
Headcount

6 month growth

0%

1 year growth

4%

2 year growth

4%

Simplify's Take

What believers are saying

  • Replit raised $97.4M to expand cloud services and lead in AI development.
  • The platform benefits from increased demand for remote and collaborative coding tools.
  • Educational institutions are adopting Replit for remote learning, boosting its user base.

What critics are saying

  • Replit faces competition from GitHub Codespaces with similar features.
  • Market saturation in online coding environments may challenge Replit's differentiation.
  • Significant investment in AI development could strain Replit's financial resources.

What makes Replit unique

  • Replit offers a browser-based IDE supporting over 50 programming languages.
  • The platform enables real-time collaboration and code sharing across multiple devices.
  • Replit's AI-powered coding assistance enhances developer productivity and efficiency.

Benefits

Competitive salary & equity

Your choice of new equipment & software

Health, dental, & vision insurance

Autonomy at work

Flexible work hours

Learning & development stipend

Monthly health & wellness stipend

Generous parental leave

Unlimited PTO (2 weeks minimum required)

401k matching

Commuter benefits

Expensed lunch

Yearly off-sites