Sr. Site Reliability Engineer
Posted on 12/12/2023
INACTIVE
Aya Healthcare

10,001+ employees

Comprehensive healthcare staffing and management software provider
Company Overview
Aya Healthcare, the largest healthcare talent software and staffing company in the U.S., offers a comprehensive suite of labor services and software solutions, providing hospital systems with increased efficiency and superior operating results. The company's unique corporate culture and dedicated employees have earned it recognition as a top workplace by several notable publications. With a focus on simplifying processes for healthcare professionals, Aya Healthcare provides exclusive job opportunities, competitive pay rates, and comprehensive support, making it a preferred choice for clinicians nationwide.
Consulting
Data & Analytics

Company Stage: N/A
Total Funding: N/A
Founded: 2001
Headquarters: San Diego, California

Growth & Insights
Headcount

6 month growth: 8%
1 year growth: 25%
2 year growth: 123%
Locations
Remote in USA
Desired Skills
Datadog
Chef
Kubernetes
Microsoft Azure
Agile
Puppet
Atlassian
Management
Docker
AWS
JIRA
Terraform
Ansible
SCRUM
Development Operations (DevOps)
Linux/Unix
Google Cloud Platform
Categories
DevOps & Infrastructure
Requirements
  • Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or an equivalent combination of education, training, and experience
  • 8+ years of experience in a combination of Site Reliability Engineering, DevOps, or similar roles
  • 2+ years working specifically with Azure architecture, configuration, and management
  • 2+ years using Infrastructure as Code (IaC) tools to automate infrastructure deployment and configuration – preferably Terraform
  • Subject matter expert in cloud platforms (AWS, GCP, Azure) and containerization technologies (Docker, Kubernetes, AKS, EKS)
  • Experience in a technical lead/principal role
  • Experience using Configuration-as-Code solutions such as Chef, Ansible, SaltStack, and Puppet
  • Experience with cloud-based APM and monitoring tools such as Datadog, New Relic, AppDynamics, or Dynatrace
  • Extensive experience with scripting and debugging Linux and Windows environments
  • Experience with automated Change Management and related methodologies
  • Experience with Agile project management practices such as Scrum and sprint planning, and expertise with Atlassian Jira
  • Expert analytical/quantitative, problem-solving, and deductive reasoning skills, with demonstrated experience performing advanced troubleshooting and root cause analysis of complex technical issues
  • Excellent organizational, planning, and time management skills and ability to work either independently or in a team environment to manage competing priorities and meet deadlines
  • Advanced verbal and written communication skills with the ability to present findings, conclusions, alternatives, and information clearly and concisely
Responsibilities
  • Help lead efforts to improve system reliability through code changes, architectural enhancements, observability development, and infrastructure optimizations
  • Diagnose and solve problems with our highly available production systems and build solutions and automation to eliminate toil and prevent issues in the future
  • Drive our "zero error" and "zero downtime" initiatives, underlining our obsessive commitment to flawless service and operational excellence
  • Continually drive down time-to-detect and time-to-resolve through improved outlier detection and real-time root cause analysis
  • Spearhead the creation, development, and maintenance of scalable monitoring, alerting, and logging solutions
  • Promote a culture of learning from outages, continuously improving incident response protocols and tooling
  • Collaborate with cross-functional teams to identify and address reliability opportunities and ensure optimal performance
  • Participate in software releases and deployments
  • Participate in 24/7 on-call rotations and respond to incidents promptly, providing effective resolutions and root cause analysis
Desired Qualifications
  • Experience with other cloud platforms such as AWS and GCP
  • Experience with other containerization technologies such as Docker and Kubernetes
  • Experience with other Configuration-as-Code solutions such as Chef, Ansible, SaltStack, and Puppet
  • Experience with other cloud-based APM and monitoring tools such as Datadog, New Relic, AppDynamics, or Dynatrace