Staff Site Reliability Engineer
Confirmed live in the last 24 hours
Locations
Remote in USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Git
Terraform
Python
Sentry
Ansible
Datadog
Requirements
  • Lead the administration of tools like Datadog, Sentry, and PagerDuty
  • Identify strategies to improve our full-stack telemetry and monitoring capabilities
  • Mentor other SREs who contribute to observability-related work
  • Help drive organizational maturity by evolving and improving reliability and software engineering best practices
  • Combination of experience in both software engineering and operations
  • 7+ years working in a relevant role, including 3+ years of technical leadership experience mentoring junior engineers
  • 3+ years of experience architecting and administering observability stacks, either managed or self-hosted (e.g. Datadog, New Relic, Prometheus, Elastic Stack/ELK)
  • Operation of containerized microservices running on public cloud, asynchronous event processing, and databases
  • Strong command of Linux, Git and CI/CD pipelines
  • Design and build new tools to automate repetitive tasks, prevent incidents or improve TTR using an object-oriented programming language such as Python
  • Infrastructure as Code using tools like Terraform, Terragrunt, Ansible or CloudFormation
  • Work with the SRE manager and other engineering managers to define SLOs to help drive SLA compliance
  • Act as the resident technical expert for our team to share knowledge, experience, and expertise, focusing on the more senior members when possible
  • Understand how application components interact, and contribute to architectural discussions
  • Unwavering commitment to operational security and best practices
  • Ownership: identify problems, propose solutions, and then implement them, whether that means submitting a merge request on another team's repository or scoping out a new reliability project
  • Connection: motivated to help other teams improve their service reliability through reviews, pair programming, hands-on training and continuous improvement of tooling and services
  • Experience with and interest in chaos engineering (Gremlin, Litmus, Chaos Mesh) is a plus but not required
  • On-call support of highly available production systems
Responsibilities
  • Expand and improve our observability and monitoring footprint
  • Collaborate with the engineering manager, product managers, other SREs, and cloud infrastructure engineers to create architectural plans, define project requirements, and establish technical standards
  • Connect with non-engineering business units across the organization to better understand the reliability and incident management needs of Checkr and our customers
  • Pair program with team members, review merge requests, help engineers get unblocked, and provide peer mentoring
  • Address common operational challenges by building tools and automation scripts
  • Automate observability and alerting across an ever-changing landscape of microservices
  • Automate Service Reliability Scorecards and Production Readiness Standards
  • Software engineering project work, proposed and driven by individual SRE team members, to remove operational bottlenecks and increase velocity in ways we've never considered before
  • Serve as the on-call incident commander to help debug and drive resolution of reliability issues, contribute to the postmortem, and work to prevent recurrence
  • Participate in design and production reviews for new features, products, and infrastructure
  • Audit and tune the configuration of systems owned by other engineering teams
  • Assist in planning for the growth of Checkr's infrastructure and infrastructure reliability/resiliency
Checkr

501-1,000 employees

Automating professional background checks
Company Overview
Checkr powers people infrastructure for the future of work. With artificial intelligence and machine learning, Checkr's solutions make background checks faster, building a fairer future by designing technology to create opportunities for all.
Benefits
  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive compensation and opportunity for advancement
  • 100% medical, dental, and vision coverage
  • Up to 25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend, home office stipend
Company Core Values
  • Humility: We are respectful and free from arrogance. We put the success of our employees over our company and are excited to learn from each other.
  • Transparency: We trust each other to communicate the good and the bad as it relates to doing our best work. We aren’t afraid to voice our opinions and are receptive to feedback.
  • Grit: We are passionate and hustle to raise the bar. We persevere through our challenges and grow from our failures.
  • Ownership: We strive for thoughtful impact, take pride in our work, and hold ourselves accountable. We step up and take on new challenges to help further the success of the company.
  • Connection: We genuinely care about each other and understand that our people are our power. We celebrate our lived experiences and enjoy helping and supporting each other.