Lead Site Reliability Engineer
Posted on 10/3/2023
Online platform connecting consumers to healthcare professionals
Hims & Hers stands out as a leading telehealth platform, offering comprehensive and personalized healthcare services across all 50 states, from sexual health to skincare and mental health. Their competitive edge lies in their ability to connect consumers directly to licensed healthcare professionals, providing high-quality medical care in a convenient, online format. The company's commitment to normalizing health and wellness challenges, along with their NYSE listing, further demonstrates their industry leadership and dedication to transforming healthcare accessibility.
San Francisco, California
Growth & Insights
6-month growth: ↑ 19%
1-year growth: ↑ 47%
2-year growth: ↑ 140%
Development Operations (DevOps)
DevOps & Infrastructure
- 10+ years of total experience in a technical environment as an engineer and manager
- Experience with service-oriented architectures and microservices at scale
- Strong proficiency with relational databases (PostgreSQL, MySQL, SQL Server, etc.)
- Strong proficiency in SQL scripting
- Ability to use containers and orchestration frameworks (Kubernetes, Docker, container registries, etc.)
- Proficiency in Git or other VCS
- Proficiency developing in one or more languages such as Java, Kotlin, Python, and/or others
- Experience configuring, customizing, and extending monitoring tools (Datadog, Prometheus, New Relic, etc.)
- Excellent debugging and troubleshooting skills
- Strong technical competency, with a data-driven analytical approach towards solving complex challenges
- Systematic problem-solving approach, coupled with strong, effective communication skills and a sense of drive
- Nice-to-have: Experience with Terraform or other IaC tools such as Chef, Puppet, or Ansible
- Develop and build software to help DevOps, ITOps, and support teams.
- Independently drive SRE projects to completion by working closely with key stakeholders.
- Hands-on coding skills in Java (Spring Boot) or Kotlin
- Evangelize SRE discipline and practices across the organization to improve overall system performance and stability.
- Participate in platform architecture discussions and ensure that the non-functional requirements related to performance, stability, and monitoring are baked into the design
- Ability to influence engineers and product owners in a matrixed organization through technical know-how and thought leadership
- Actively seek and identify opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
- Handle emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed.
- Identify Service Level Indicators (SLIs) that will align the team to meet the availability and performance objectives.
- Run blameless root cause analyses (RCAs) on incidents and outages, aggressively looking for answers that will prevent recurrence.
- Use automation extensively to design, configure, manage, and monitor systems in support of our product development teams
- Design and implement SRE practices that ensure the availability, scalability, and observability of production systems, with a strong focus on excellent customer experience
- Standardize and implement monitoring, logging, alerting, and SLO reporting
- Manage Infrastructure through automation (Infrastructure as Code)
- Manage incidents and emergency response, track outages, ensure data integrity, and engineer releases to promote safe, efficient, and rapid deployments.