Site Reliability Engineer
Posted on 9/15/2023
Twilio

5,001-10,000 employees

Customer engagement platform & developer of communications APIs
Company Overview
Twilio's mission is to fuel the future of communications. By making communications a part of every software developer's toolkit, Twilio is enabling innovators across every industry to reinvent how companies engage with their customers.
Locations
Remote in USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
AWS
Java
Kubernetes
Python
Categories
DevOps & Infrastructure
Software Engineering
Requirements
  • 5+ years of experience writing production-grade code in a modern programming language
  • Proven experience in designing, implementing, and maintaining observability solutions, preferably within a cloud-based SaaS environment
  • Strong proficiency in programming languages such as Python, Go, or Java
  • Familiarity with open-source observability tools and standards, including Prometheus, Grafana, OpenTelemetry, and others
  • Knowledge of distributed tracing, log management, and metric aggregation techniques
  • Proficiency in infrastructure as code (IaC), Kubernetes, and AWS concepts, best practices, and tools
  • Willingness to participate in team on-call rotations
  • Solid problem-solving skills, proactive attitude, and ability to work collaboratively in a dynamic team environment
Responsibilities
  • Design, implement, and maintain observability infrastructure and tooling, focusing on logging, tracing, metrics, and continuous profiling
  • Collaborate with software engineers to provide comprehensive instrumentation that captures relevant telemetry data
  • Leverage open-source standards, such as OpenTelemetry, to build scalable and interoperable solutions
  • Develop data pipelines to handle high-cardinality data and enable interactive troubleshooting capabilities for engineers
  • Enable effective telemetry correlation so that engineers can understand the behavior of distributed systems
  • Build cost-effective, engineer-friendly observability tooling that facilitates real-time root-cause analysis and reduces mean time to resolution (MTTR) for incidents
  • Contribute to the development of the Observability platform's features, continuously improving the user experience and ensuring self-service capabilities for other teams
  • Collaborate with the OpenTelemetry community and contribute to open-source initiatives to foster broader adoption of observability solutions
Desired Qualifications
  • Experience with context propagation and telemetry correlation to enable effective troubleshooting and monitoring of distributed systems
  • Experience in building data pipelines
  • Understanding of high-cardinality data challenges and strategies for handling complex telemetry data
  • Proficiency in optimizing cloud infrastructure and compute costs through the implementation of cost observability software and workflows