Senior Site Reliability Engineer
Updated on 11/29/2022
Locations
New York, NY, USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Node.js
Agile
Atlassian
AWS
Bash
Apache Kafka
Confluence
Development Operations (DevOps)
Google Cloud Platform
JavaScript
JIRA
Git
Java
Linux/Unix
Management
PHP
Scala
SCRUM
Terraform
Python
Requirements
  • 5+ years' experience in a Site Reliability Engineering (SRE) role
  • Ability to work with, mentor, and lead a team in a high-growth organization
  • Experience in SRE topics like SLOs, Error Budgets, resiliency, auto-scaling, self-healing, performance, and more
  • 2-3 years of hands-on experience with AWS Cloud Infrastructure Operations and Support
  • Experience in one or more of the following: Node.js, Java, Linux, Python, Go, PHP, or Scala
  • Good understanding of monitoring and alerting tools (Datadog, PagerDuty, Alertmanager, etc.)
  • Ability to provide 24x7 support for the production environments
  • Knowledge of Terraform
  • Experience with scrum or agile methodologies
  • Familiarity with managing Kafka, RDS, DynamoDB
  • Understanding of network troubleshooting
  • Enjoys general problem solving and has excellent troubleshooting skills
  • Strong interest in the areas of DevOps, SRE, Incident Response, Resilience Engineering, and Technical Operations
Responsibilities
  • Build monitoring and alerting on symptoms, not just on outages, so that we know about imminent issues
  • Design and maintain monitoring, log centralization, and alerting for all services to facilitate observability and incident management
  • Troubleshoot performance problems and system outages using log analysis
  • Partner with the product development teams to design and enhance software architecture to improve scalability, service reliability, cost, and performance
  • Design and write tests that investigate how our infrastructure handles failure and scaling
  • Build internal tooling to support and enable engineering workflows in our production environment
  • Work with engineers to rearchitect and rebuild the core services their teams rely on so they are more efficient and cost-effective
  • Standardize our approaches to observability so it is easy for developers to do the right thing
  • Coordinate the initial response activities for incidents across the AIQ environment, including creating incident records
  • Manage low and mid-level severity incidents; escalate high severity incidents to the resolution team as appropriate, ensuring each incident is declared and has a JIRA ticket assigned
  • Introduce SLOs, error budgets, and actionable alerts, along with automated responses such as auto-scaling and self-healing
  • Work with your team to monitor and ensure the health of the platform, including taking part in a 24/7 on-call rotation, to ensure a great customer experience
  • Reduce manual labor by creating, evaluating, and fixing automated tasks
  • Monitor system health, latency, and availability to maintain services after they are in production
  • Increase observability to aid in locating problems or bottlenecks
  • Assist with blameless post-mortems and troubleshoot priority incidents
ActionIQ

201-500 employees

Company Overview
ActionIQ is a purpose-built enterprise Customer Data Platform solving complex data problems: flow and scale, analytics, and orchestration.