Who we are
About Stripe
Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.
About the team
The Incident Ops team is a global 24/7 team responsible for driving incident response and management from detection to resolution. Stripe is proud of its five 9s reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRMs) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything else that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team combines strong communication and incident handling skills with technical adeptness, as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you’ll do
As an Incident Response Manager (IRM), you’ll play the key role in driving the right level of response from Stripes to incidents: determining impact, rallying Stripes to mitigate, communicating to users, ensuring appropriate remediations and orchestrating the Root Cause Analysis (RCA) process. You’ll work hand-in-hand with IRMs and engineers globally to ensure solid 24/7 coverage of how we monitor, detect, respond to, communicate and mitigate incidents. When not managing incidents, you’ll help scale our ability to respond to incidents, improve our operations, analyze data to provide insights and deepen our technical expertise in products. As a result, you’ll be seen as the protector of our users - minimizing the impact of incidents on their business and ensuring that Stripe is always thinking of our users.
Responsibilities
- Act as an on-call Incident Commander, responsible for driving and managing incident resolution with a high level of urgency, cross-functional collaboration, and accuracy, while partnering with a global and diverse set of teams such as Engineering, Product, Policy, Risks, PR, Legal and Execs
- Lead all user-facing incidents across domains at Stripe - including reliability, technical, security, and data privacy
- Apply a "User First" approach to determining impact, providing accurate situation reports, facilitating comms bridges, and ensuring useful and timely external communications to users
- Proactively update internal stakeholders, make decisions through data, and influence by partnering with Engineering, Sales, Support and other cross-functional teams
- Contribute to the root cause analysis process while conducting post-mortems and identifying remediations, and ensure problem management tasks meet SLA and user expectations
- Drive improvements in the incident handling process and incident management metrics and tooling based on trends and data of Stripe’s incidents in collaboration with engineering, product and operations teams
- Collaborate closely with leadership for building team strategy based on the team vision
- Collaborate and coach other Incident Response Managers on the team
Who you are
We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum requirements
- 5+ years of demonstrable major incident experience for organizations that run mission-critical applications or always-on SaaS environments.
- Demonstrated ability to lead multiple incidents concurrently with authority, and to influence responders, applying agency and reasoning skills to resolve ambiguous problems and drive to root cause.
- Strong full-stack technical skills with development/support experience with cloud-based technologies.
- Demonstrated experience developing code and automation using Python, Ruby, JavaScript or shell scripting.
- Solid understanding of infrastructure, including physical, virtual, and container-based compute platforms
- Strong quantitative and analytical skills in data manipulation using SQL, Splunk or other tools.
- Excellent task management skills and attention to detail, with the ability to remain composed, methodical, and quick-thinking in a high-pressure environment.
- Exceptional written and verbal English communication skills, with the ability to translate complex technical issues for internal and external stakeholders
Preferred qualifications
- Domain expertise in classes of incidents such as technical, privacy, security or crisis with a strong desire to continuously learn about Stripe’s products, technical issues and systems.
- Ability to review complex technical details regarding ongoing issues/events and convey the key details to senior stakeholders to facilitate real-time decision making.
- Experience with broad user-facing communications (e.g. status pages, tweets) and/or targeted communications (e.g. direct emails, support ticket responses).
- Familiarity with operating or managing distributed architectures, with the ability to correlate system behaviors based on known inter-dependencies.
- Demonstrated experience with full-stack development and support