Principal Site Reliability Engineer
Updated on 11/13/2023
Splunk

5,001-10,000 employees

Data management & visualization platform
Company Overview
Splunk's mission is to address the challenges and opportunities of managing massive streams of machine-generated big data. Splunk is the leading software platform for machine data that enables customers to gain real-time Operational Intelligence.
AI & Machine Learning
Data & Analytics
Cybersecurity

Company Stage

IPO

Total Funding

$1.4B

Founded

2003

Headquarters

San Francisco, California

Growth & Insights
Headcount

6 month growth

1%

1 year growth

1%

2 year growth

9%
Locations
Remote
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
AWS
Bash
Apache Kafka
Google Cloud Platform
Jenkins
Git
Management
MongoDB
Redis
Terraform
Kubernetes
Python
Cassandra
Categories
DevOps & Infrastructure
Software Engineering
Requirements
  • 7+ years of SRE experience in handling large-scale cloud-native microservices platforms.
  • 3+ years of strong hands-on experience deploying, handling, and monitoring large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
  • Experience with infrastructure automation and scripting using Python and/or bash scripting.
  • Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
  • Experience with the deployment, operation, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, or Redis.
  • Experience leading large-scale technical initiatives across multiple teams.
  • Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
Responsibilities
  • Set technical direction, author design docs, and get consensus from internal and external partners.
  • Develop new processes to make the team more efficient and effective.
  • Collaborate with other team leaders to orchestrate large system changes.
  • Spend a significant amount of time on technical leadership activities in addition to hands-on technical work.
  • Design new services, tools, and monitoring to be implemented by the entire team.
  • Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
  • Mentor new engineers to achieve more than they thought possible. You enjoy making other teams successful and are fulfilled through the success of others.
  • Work on reliability projects, including:
    - HA, business continuity planning, disaster recovery, backup/restore, RTO, and RPO
    - Chaos engineering
    - Application uptime and performance
    - Capacity management and planning
    - SLIs, SLOs, error budgets, and monitoring dashboards
    - Deployment and operations of large-scale distributed data stores and streaming services
    - Establishing design patterns for monitoring and benchmarking
    - Establishing and documenting production runbooks and guidelines for developers
    - Tooling, toil reduction, runbooks, and automation to handle production environments
    - Incident management and improving MTTD/MTTR for services
    - Cloud cost optimization
Desired Qualifications
  • AWS Solutions Architect certification preferred.
  • Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certifications are preferred.
  • Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
  • Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
  • Experience with one or more security/compliance frameworks such as SOC2, PCI, and/or FedRAMP.
  • Proven skills to effectively work across teams and functions to influence the design, operations, and deployment of highly available software.
  • Bachelor's/Master's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.