Senior Site Reliability Engineer
Updated on 9/22/2022
Toronto, ON, Canada
Development Operations (DevOps)
Google Cloud Platform
- 5+ years of relevant experience in the following areas: SRE, DevOps, Cloud Operations, Systems Engineering, or Software Engineering
- BS/MS/PhD in Computer Science or a related field
- Excellent command of cloud services on AWS/GCP/Azure, Kubernetes and CI/CD pipelines
- Experience with monitoring/alerting (Prometheus, Thanos, VictoriaMetrics, Grafana, vmrules)
- Moderate to advanced experience in Java, C, C++, Python, Go, or similar programming languages
- You are interested in designing, analyzing and troubleshooting large-scale distributed systems
- You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- You have a great ability to debug and optimize code and automate routine tasks
- You have a solid background in software development and architecting resilient and reliable applications
- You are a good communicator and comfortable working with other engineers across the organization
- Evangelize and advocate for reliability practices across our organization
- Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning and production readiness reviews
- Debug and optimize code and automate routine tasks to reduce toil
- Analyze and optimize our core product by developing and implementing reliability and performance practices
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity
- Be on-call for production services
- Practice sustainable incident response and blameless retrospectives
- Experience being on-call for an internet-facing production system
- Expertise in Kubernetes (k8s), Helm, YAML, GitOps, Argo CD, distributed tracing (Lightstep, Honeycomb, OpenTelemetry), and Kubernetes resource management (e.g. Kubecost)
Data lake engine
Dremio is leading the way to reimagine your data architecture: removing barriers, accelerating time to insight, and putting control in the hands of the user.
- Health, Dental, and Vision Insurance
- Stock Options
- Work From Home
- Office Events
- Parental Leave Benefits
- Paid Time Off
- Communicate with clarity.
- Drive accountability.
- Be respectful.
- Confront brutal facts.
- Focus on results.
- Operate with urgency.
- Build a flywheel.