Site Reliability Engineer / Senior Site Reliability Engineer
Reliability
Posted on 12/7/2022
INACTIVE
Locations
Remote • Dewey, OK, USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Docker
Git
Linux/Unix
nginx
Operating Systems
Ruby
Terraform
Kubernetes
Chef
Requirements
  • Are able to reason about large systems: how they work at scale, their edge cases, failure modes, and behaviors
  • Know your way around Linux and the Unix Shell
  • Have experience in collaborating and communicating asynchronously
  • Have an urge to document all the things so you don't need to learn the same thing twice
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
  • Have a strong sense for action and know how to iterate through a problem quickly
  • Share our values, and work in accordance with those values
  • Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies
  • Are able to leverage GitLab as your day-to-day go-to tool
Responsibilities
  • Automating every operational task is a core requirement for this role. Examples include package updates, configuration changes across all environments, and creating tools for automatic provisioning of user-facing services
  • Responding to platform emergencies, alerts, and escalations from Customer Support
  • Ensure systems exist to manage software life-cycles (e.g. Operating Systems) with a minimum of manual effort
  • Develop a fully automated multi-environment observability stack based on the existing SaaS system, and extend it to predict capacity needs based on the usage patterns
  • Plan for new service roll-outs, expansion and capacity management of existing services, and work with users to optimise their resource consumption
  • Be on a PagerDuty rotation to respond to GitLab.com availability incidents and provide support for service engineers with customer incidents
  • Analyze existing GitLab.com Service Level Objectives, and create and maintain new ones
  • Troubleshoot, evaluate and resolve operational challenges contributing to defined SLOs
  • Define, improve, and help resolve architectural application bottlenecks as observed on GitLab.com
  • Work with other engineering stakeholders on resolving larger architectural bottlenecks, contributing the GitLab.com point of view
  • Work in close collaboration with software development teams to shape the future roadmap and establish strong operational readiness across teams
  • Scale systems through automation, improving change velocity and reliability
  • Leverage technical skills to partner with team members and be comfortable diving into a problem as needed
  • Work with counterparts in other teams of the Infrastructure department to improve infrastructure running with Chef, Terraform and Kubernetes
  • Make monitoring and alerting fire on symptoms and not on outages
  • Document every action so your findings turn into repeatable actions, and then into automation
  • Debug production issues across services and levels of the stack
  • GitLab.com Availability
  • GitLab.com Performance
  • Apdex and Error SLO per Service
  • Mean Time to Detection
  • Mean Time to Resolution
  • Mean Time Between Failure
  • Mean Time to Production
  • Disaster Recovery Time to Recovery
Desired Qualifications
  • Strong programming skills as a (former) backend engineer, preferably with Ruby and/or Go
GitLab

1,001-5,000 employees

Repository hosting manager tool
Company Overview
It is GitLab's mission to make it so that everyone can contribute. When everyone can contribute, users become contributors, which greatly increases the rate of innovation.
Benefits
  • Spending Company Money
  • Equity Compensation
  • Life Insurance
  • Financial Wellness
  • Paid Time Off
  • Growth and Development Benefit
  • GitLab Contribute
  • Business Travel Accident Policy
  • Immigration
  • Employee Assistance Program
  • Incentives
  • All-Remote
  • Part-time contracts
  • Meal Train
  • Fertility & Family Planning
  • Parental Leave
Company Core Values
  • Collaboration: To achieve results, team members must work together effectively.
  • Results: We do what we promised to each other, customers, users, and investors.
  • Efficiency: Working efficiently on the right things enables us to make fast progress, which makes our work more fulfilling.
  • Diversity, Inclusion, and Belonging.
  • Iteration: We do the smallest thing possible and get it out as quickly as possible.
  • Transparency: Be open about as many things as possible.