Principal Site Reliability Engineer
Updated on 11/13/2023
Data management & visualization platform
Splunk's mission is to address the challenges and opportunities of managing massive streams of machine-generated big data. Splunk is the leading software platform for machine data that enables customers to gain real-time Operational Intelligence.
AI & Machine Learning
Data & Analytics
San Francisco, California
Google Cloud Platform
DevOps & Infrastructure
- 7+ years of SRE experience operating large-scale cloud-native microservices platforms.
- 3+ years of strong hands-on experience deploying, operating, and monitoring large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
- Experience with infrastructure automation and scripting using Python and/or Bash.
- Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
- Experience with the deployment, operations, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, Redis, etc.
- Experience leading large-scale technical initiatives across multiple teams.
- Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
- Set technical direction, author design docs, and get consensus from internal and external partners.
- Develop new processes to make the team more efficient and effective.
- Collaborate with other team leaders to orchestrate large system changes.
- Spend a significant amount of time on technical leadership activities in addition to hands-on technical work.
- Design new services, tools, and monitoring to be implemented by the entire team.
- Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
- Mentor new engineers to achieve more than they thought possible. You enjoy making other teams successful and are fulfilled through the success of others.
- Work on reliability projects, including:
  - High availability, business continuity planning, disaster recovery, backup/restore, RTO, RPO
  - Chaos engineering
  - Application uptime and performance
  - Capacity management & planning
  - SLIs, SLOs, error budgets, and monitoring dashboards
  - Deployment and operations of large-scale distributed data stores and streaming services
  - Establishing design patterns for monitoring and benchmarking
  - Establishing and documenting production runbooks and guidelines for developers
  - Tooling, toil reduction, runbooks, and automation for production environments
  - Incident management and improving MTTD/MTTR for services
  - Cloud cost optimization
- AWS Solutions Architect certification preferred.
- Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certifications are preferred.
- Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
- Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
- Experience with one or more security/compliance frameworks such as SOC2, PCI, and/or FedRAMP.
- Proven skills to effectively work across teams and functions to influence the design, operations, and deployment of highly available software.
- Bachelor's/Master's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
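For context on the SLO and error-budget work listed above, here is a minimal sketch of the standard calculation: an availability SLO implies a fixed budget of permissible downtime per period, and incidents "burn" against it. The function name and numbers are illustrative, not part of this role's tooling.

```python
def error_budget(slo: float, period_minutes: float) -> float:
    """Allowed downtime in minutes for a given availability SLO over a period."""
    return (1.0 - slo) * period_minutes


# A 99.9% availability SLO over a 30-day month (43,200 minutes)
# permits roughly 43.2 minutes of downtime.
budget = error_budget(0.999, 30 * 24 * 60)

# After a 20-minute outage, roughly 23 minutes of budget remain;
# a second comparable incident would nearly exhaust it.
remaining = budget - 20
```

Tightening the SLO by one nine (99.99%) shrinks the monthly budget to about 4.3 minutes, which is why stricter SLOs demand heavier investment in automation, MTTD/MTTR reduction, and capacity planning.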