Full-Time

Site Reliability Engineering Lead

SRE, Observability

Posted on 9/27/2025

HEXAWARE

No salary listed

Fort Mill, SC, USA

Hybrid

2-3 days onsite per week

Category
DevOps & Infrastructure
Required Skills
Datadog
Kubernetes
Python
Docker
CloudFormation
Jenkins
Terraform
DevOps
CircleCI
Linux/Unix
Requirements
  • Proven experience in SRE/DevOps roles with responsibility for production reliability and observability.
  • Prior experience leading or mentoring engineering teams.
  • Strong Python experience, particularly for server-side code, automation, and operational tooling.
  • Hands-on expertise with Datadog: metrics, APM/tracing, logs, synthetics, dashboards, and alerting.
  • Deep understanding of observability concepts and best practices (SLIs/SLOs, tracing, contextual logging).
  • Solid experience with container platforms and orchestration (Docker, Kubernetes).
  • Experience with CI/CD systems and pipelines (e.g., GitHub Actions, Jenkins, CircleCI, GitLab CI).
  • Familiarity with Infrastructure as Code (Terraform, CloudFormation, or equivalent).
  • Comfort with Linux systems administration, networking fundamentals, and security basics.
  • Strong incident management experience and familiarity with runbooks and on-call rotations.
  • Excellent communication skills and the ability to influence cross-functional teams.
Responsibilities
  • Lead the SRE function: set technical direction, define best practices, and coach engineers on reliability and operational excellence.
  • Establish and maintain SLOs/SLIs, alerting policies, and error budgets in partnership with product and engineering teams.
  • Design, implement, and improve observability: metrics, traces, logs, dashboards, and runbooks (Datadog as primary tool).
  • Automate operations to reduce toil: CI/CD pipelines, automated rollouts, self-healing mechanisms, and runbook automation.
  • Own incident management: lead incident response, coordinate cross-team communications, and drive blameless postmortems and remediation.
  • Drive capacity planning, performance tuning, and disaster recovery planning for Python server applications and services.
  • Manage tooling and infrastructure: container orchestration, infrastructure-as-code, secrets management, and monitoring integrations.
  • Partner with DevOps, platform, and application teams to deliver secure, observable, and highly available services.
  • Mentor and grow the SRE team, and promote knowledge sharing and continuous improvement.
Desired Qualifications
  • Experience with Prometheus, Grafana, or other monitoring stacks alongside Datadog.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and managed services.
  • Experience building internal developer/platform tooling.
  • Background in reliability engineering for high-throughput or low-latency systems.
  • Certifications in cloud platforms or SRE-related areas.

INACTIVE