Senior Site Reliability Engineer
Locations
Glendale, CA, USA
Desired Skills
Agile
AWS
Bash
Development Operations (DevOps)
Docker
Linux/Unix
Puppet
Terraform
Kubernetes
Python
Go
Ansible
Chef
Datadog
Requirements
- At least 7 years of experience in SRE, DevOps or equivalent engineering roles
- At least 3 years of production experience with AWS or other cloud providers
- Strong hands-on experience with Infrastructure-as-Code and configuration management tools: Terraform, Helm, AWS CloudFormation, Packer, Chef, Puppet, Ansible, etc.
- Strong hands-on experience with scripting languages for automation (Bash, Python, Go, etc.)
- Strong hands-on experience building and managing CI/CD pipelines
- Strong hands-on experience in Linux architecture, microservices and container orchestration (Docker, Kubernetes, etc.)
- Strong hands-on experience with application monitoring tools (New Relic, DataDog, SumoLogic, Prometheus, Grafana, ELK, etc.)
- Experience working on infrastructure projects in an Agile environment
- Great communication, collaboration and presentation skills
- Ability to team up with people from different disciplines and drive toward win-win outcomes
- Automation-first mindset and orientation
Responsibilities
- Partner with a team of high-performing developers who are focused on delivering best-in-class software products
- Act as a change agent who introduces and evangelizes a DevOps mindset, addressing availability, performance, capacity planning, cost-effectiveness, monitoring and alerting
- Cultivate an automation-first attitude and work to champion code-centric solutions throughout our department to improve velocity and deliverability
- Participate in the design process, representing the operational perspective and planning for operational needs in the product roadmap
- Collaborate and communicate with colleagues in the same job family across teams to drive DevOps-centric cross-org initiatives, tooling and standards
- Participate in incident response in collaboration with application owners and the platform team
- Develop tooling to surface the day-to-day health, uptime and reliability of our cloud infrastructure
- Think long-term and avoid band-aids
- Identify unnecessary complexities and remove them