Software Engineer
Resilience
Monitoring platform for cloud applications and services
Company Overview
Datadog is a leading monitoring platform for cloud applications. It provides observability across data from various sources, which helps DevOps teams prevent downtime and improve user experience. The company's culture emphasizes technical excellence and problem-solving, and encourages continuous learning and growth. Its ability to analyze and explore logs for rapid troubleshooting gives Datadog a competitive edge in the industry.
Data & Analytics
Company Stage
N/A
Total Funding
$150.6M
Founded
2010
Headquarters
New York, New York
Growth & Insights
Headcount
6 month growth
↑ 3%
1 year growth
↑ 16%
2 year growth
↑ 85%
Locations
Cambridge, MA, USA • New York, NY, USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Kubernetes
Python
Categories
Software Engineering
Requirements
- Writing software that solves real user problems, as well as reviewing others' code in an empathetic and collaborative way. We mainly use Go and Python
- Analyzing incidents, identifying broader risk patterns, and sharing your findings in an engaging way that other people can understand and learn from
- Responding to incidents, preferably high-impact ones, as an incident commander or responder, and iteratively improving incident response processes
- Teaching and training other engineers on best practices
- Familiarity with Kubernetes and distributed systems as well as their potential failure scenarios
Responsibilities
- Blamelessness in our processes. Our primary goal in incident reviews is to learn from and adapt our mental models of how our systems run in production. As Nabokov said, complacency is a state of mind that exists only in retrospective
- A people-centered approach: ensuring that automation and systems support engineers doing work, not vice versa
- An understanding that systems are inherently complex and failure is inevitable. What we can control is how resilient our systems and organization are when responding to these inevitable events
- The idea that safety and risk are emergent properties in a socio-technical system and that they arise from a complex interaction of factors that constitute normal work. Resilience is a dynamic process of steering rather than a static quality
- Help run the post-mortem process for the company and partner with teams on writing them, as well as identifying and implementing opportunities to reduce friction and maximize learning value to the organization
- Define how we respond to incidents as a company and write software to streamline that process, partnering with our product teams where necessary. Our goal is to support our incident responders as much as possible to deal with complexity
- Train our on-callers in our incident and post-mortem processes. This involves both introducing newcomers to on-call responsibilities and refreshing the knowledge of existing engineers
- Perform cross-functional engagements with different teams across the organization, embedding in their group for a few weeks in order to either learn about how work is performed or to solve a specific reliability problem
- Facilitate incident reviews in a way that emphasizes learning and blamelessness
- Write reliability bulletins, blog posts, and other forms of documentation that identify systemic risks to the company, provide actionable remediations, and promote best reliability practices