Staff Site Reliability Engineer
Posted on 2/3/2024
Platform Science

201-500 employees

Telematics & fleet management for trucking
Company Overview
Platform Science's mission is to make transportation smarter.
Hardware

Company Stage: Series C
Total Funding: $297.7M
Founded: 2015
Headquarters: San Diego, California

Growth & Insights
Headcount

6 month growth: -15%
1 year growth: -7%
2 year growth: 26%
Locations
San Diego, CA, USA
Desired Skills
Datadog
Bash
Kubernetes
Microsoft Azure
Python
Docker
AWS
Jenkins
Google Cloud Platform
Categories
DevOps & Infrastructure
Software Engineering
Requirements
  • 9+ years of hands-on experience in SRE or Platform Engineering roles
  • 4+ years of expertise with automation technologies like Jenkins, ArgoCD, or similar
  • 3+ years of experience with Kubernetes, Helm, and Docker within production environments
  • Proficiency with current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development
  • Experience with AWS in a production environment, including EKS, IAM, autoscaling, networking, and load balancing/request routing
  • Proficiency in Python, Bash, Node.js, and/or Go
  • Proficiency with distributed tracing methodologies and observability tools such as Prometheus, ELK, or Datadog
  • Strong emphasis on documentation and fostering knowledge-sharing practices within the team and organization
  • Track record of successfully training and mentoring engineers
  • Proven expertise in optimizing performance and managing costs within cloud environments
  • Sound understanding of SLI/SLO concepts and adherence to SRE best practices
Responsibilities
  • Lead the development and enhancement of Continuous Integration/Continuous Deployment (CI/CD) pipelines, along with refining release management processes and associated toolsets
  • Architect and maintain Helm charts to streamline application deployment and management
  • Establish standardized observability solutions to empower development teams in efficiently managing their applications
  • Lead the effort to promote and prioritize reliability, driving achievement of uptime goals and mentoring colleagues in SRE best practices
  • Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Objectives (SLOs) and to ensure high-quality, dependable services
  • Design and develop software that addresses operational challenges and improves system stability and reliability
  • Fulfill on-call duties, providing expert support to development teams for mission-critical applications in production environments
  • Improve the resiliency of applications and systems using chaos engineering
Desired Qualifications
  • Experience with Azure and GCP
  • Experience with mobile apps, hardware, websites, messaging queues, and serverless pipelines
  • Experience with chaos engineering