Staff Site Reliability Engineer
Platform Science

201-500 employees

Telematics & fleet management for trucking
Company Overview
Platform Science's mission is to make transportation smarter.
Hardware

Company Stage

Series C

Total Funding

$297.7M

Founded

2015

Headquarters

San Diego, California

Growth & Insights
Headcount

6 month growth

-17%

1 year growth

-11%

2 year growth

18%
Locations
Remote in USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Datadog
Bash
Kubernetes
Python
Docker
AWS
Jenkins
Categories
DevOps & Infrastructure
DevOps Engineering
Site Reliability Engineering
IT & Security
Cloud Engineering
Requirements
  • 9+ years of hands-on experience in SRE or Platform Engineering roles
  • Expertise in automation technologies like Jenkins, ArgoCD, or similar (4+ years)
  • Experience with Kubernetes, Helm, and Docker within production environments (3+ years)
  • Proficiency with software development lifecycle (SDLC) concepts and best practices
  • Experience with AWS, EKS, IAM, autoscaling, networking, and load balancing/request routing in a production environment
  • Proficiency in Python, Bash, Node.js, and/or Go
  • Proficiency with distributed tracing methodologies and observability tools such as Prometheus, ELK, or Datadog
  • Strong emphasis on documentation and fostering knowledge-sharing practices
  • Track record of successfully training and mentoring engineers
  • Expertise in optimizing performance and managing costs within cloud environments
  • Sound understanding of SLI/SLO concepts and adherence to SRE best practices
Responsibilities
  • Lead the development and enhancement of Continuous Integration/Continuous Deployment (CI/CD) pipelines, along with refining release management processes and associated toolsets
  • Architect and maintain Helm charts to streamline application deployment and management
  • Establish standardized observability solutions to empower development teams in efficiently managing their applications
  • Lead efforts to promote and prioritize reliability, drive achievement of uptime goals, and mentor colleagues in SRE best practices
  • Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Objectives (SLOs) and to ensure high-quality, dependable services
  • Design and develop software that addresses operational challenges and improves system stability and reliability
  • Fulfill on-call duties, providing expert support to development teams for mission-critical applications in production environments
  • Improve the resiliency of applications and systems using chaos engineering