Site Reliability Engineering Lead @ HEXAWARE

What working at Hexaware offers:

Hexaware is a dynamic and innovative IT organization committed to delivering cutting-edge solutions to our clients worldwide. We pride ourselves on fostering a collaborative and inclusive work environment where every team member is valued and empowered to succeed.

Hexaware provides access to a vast array of tools that enhance, revolutionize, and advance your professional profile. We complete the circle with excellent growth opportunities, chances to collaborate with highly visible customers and work alongside bright minds, and the perfect work-life balance.

With an ever-expanding portfolio of capabilities, we dig deep to identify the source of our motivation. Although technology is at the core of our solutions, it is still people and their passion that fuel Hexaware's commitment to creating smiles.

At Hexaware, we encourage you to challenge yourself to achieve your full potential and propel your growth. We trust and empower you to disrupt the status quo and innovate for a better future. We encourage an open and inspiring culture that fosters learning and brings talented, passionate, and caring people together.

We are always interested in, and want to support, the professional and personal you. We offer a wide array of programs to help you expand your skills and supercharge your career. We help you discover your passion, the driving force that makes you smile, innovate, create, and make a difference every day.

What would you do?

Position: Site Reliability Engineering (SRE) Lead

Location: Fort Mill, SC (hybrid, 2-3 days onsite per week)

Role overview

We are seeking an experienced Site Reliability Engineering (SRE) Lead to drive reliability, scalability, and observability across our services. The ideal candidate combines hands-on SRE and DevOps expertise with strong leadership skills, deep knowledge of observability tooling (Datadog), and practical experience building and operating Python-based server applications.

Key responsibilities

Lead the SRE function: set technical direction, define best practices, and coach engineers on reliability and operational excellence.
Establish and maintain SLOs/SLIs, alerting policies, and error budgets in partnership with product and engineering teams.
Design, implement, and improve observability: metrics, traces, logs, dashboards, and runbooks (Datadog as primary tool).
Automate operations to reduce toil: CI/CD pipelines, automated rollouts, self-healing mechanisms, and runbook automation.
Own incident management: lead incident response, coordinate cross-team communications, and drive blameless postmortems and remediation.
Drive capacity planning, performance tuning, and disaster recovery planning for Python server applications and services.
Manage tooling and infrastructure: container orchestration, infrastructure-as-code, secrets management, and monitoring integrations.
Partner with DevOps, platform, and application teams to deliver secure, observable, and highly available services.
Mentor and grow the SRE team, and promote knowledge sharing and continuous improvement.

Required skills & experience

Proven experience in SRE/DevOps roles with responsibility for production reliability and observability.
Prior experience leading or mentoring engineering teams.
Strong Python experience, particularly for server-side code, automation, and operational tooling.
Hands-on expertise with Datadog: metrics, APM/tracing, logs, synthetics, dashboards, and alerting.
Deep understanding of observability concepts and best practices (SLIs/SLOs, tracing, contextual logging).
Solid experience with container platforms and orchestration (Docker, Kubernetes).
Experience with CI/CD systems and pipelines (e.g., GitHub Actions, Jenkins, CircleCI, GitLab CI).
Familiarity with Infrastructure as Code (Terraform, CloudFormation, or equivalent).
Comfort with Linux systems administration, networking fundamentals, and security basics.
Strong incident management experience and familiarity with runbooks and on-call rotations.
Excellent communication skills and the ability to influence cross-functional teams.

Nice-to-have

Experience with Prometheus, Grafana, or other monitoring stacks alongside Datadog.
Familiarity with cloud platforms (AWS, GCP, Azure) and managed services.
Experience building internal developer/platform tooling.
Background in reliability engineering for high-throughput or low-latency systems.
Certifications in cloud platforms or SRE-related areas.

Qualifications

Degree in Computer Science, Engineering, or equivalent practical experience.
Typically 5+ years in SRE/DevOps roles and 2+ years in a lead or senior position (flexible for exceptional candidates).

Equal Opportunities Employer:

Hexaware Technologies is an equal opportunity employer. We are dedicated to providing a work environment free from discrimination and harassment. All employment decisions at Hexaware are based on business needs, job requirements, and individual qualifications. We do not discriminate based on race (including colour, nationality, and ethnic or national origin), religion or belief, sex, age, disability, marital status, sexual orientation, parental status, gender reassignment, or any other status protected by law. We encourage candidates of all backgrounds to apply.
Find out more at Hexaware.com.
