Full-Time

Service Engineer

Customer Reliability Engineering, Azure

Posted on 9/9/2025

Microsoft

Microsoft

10,001+ employees

Develops software, OS, and cloud services

No salary listed

Hyderabad, Telangana, India

In Person

Category
DevOps & Infrastructure (1)
Required Skills
PowerShell
Datadog
Kubernetes
Microsoft Azure
Python
Grafana
Docker
Microservices
AWS
Prometheus
Terraform
Splunk
Linux/Unix
Google Cloud Platform
Requirements
  • 5+ years’ proven expertise in mission-critical cloud operations, high-severity incident response, SRE, or large-scale systems engineering on hyperscale platforms like Azure, AWS, or GCP.
  • Must have Service Engineering experience in a 24 x 7 x 365 enterprise environments
  • Exceptional command-and-control communication skills—able to drive clarity and direction with customers - internal Microsoft stakeholders and third-party vendors during ambiguity and chaos.
  • Deep understanding of cloud architecture patterns, microservices, and containerization.
  • Demonstrated ability to make decisions quickly, under pressure, and with limited data—without compromising long-term reliability.
  • Familiarity with monitoring and observability tools (e.g., Grafana, Prometheus, Datadog, Splunk, New Relic).
  • Contribute to Implement observability frameworks to proactively detect performance bottlenecks.
  • Strong knowledge of CI/CD pipelines, container orchestration (Kubernetes, Docker), and infrastructure as code (Terraform, ARM, Bicep).
  • Familiarity with AI/ML frameworks and cloud AI services.
  • Experience implementing AI-driven monitoring, alerting, and remediation systems
  • Fluency in one or more automation languages (PowerShell, Python, CLI etc.)
  • Understanding ITIL or other incident management frameworks is a must.
  • Understand High Availability, Disaster Recovery, Business Continuity, Performance Tuning
  • Demonstrates strategic thinking, quantitative and analytical skills, team leadership, and collaboration
  • Excellent problem resolution, judgment, negotiating and decision-making skills
  • Desired Strong knowledge of Windows Platform or Linux, developer tools and ability to diagnose and debug user code
  • Effectively manage and prioritize multiple tasks in accordance with high level objectives/projects.
  • Excellent communication skill (written + verbal) in English, especially in high-pressure scenarios.
  • Ability to communicate with a variety of audiences; including high-profile customers, executive management, and engineering teams.
  • Experience with Azure, AWS, or GCP core services and their interdependence.
  • Bachelor’s or master’s degree in computer science, Information Technology or equivalent experience.
Responsibilities
  • Manage high-severity incidents (SEV0/SEV1/SEV2) across Azure services, serving as the single point of accountability to ensure rapid detection, triage, resolution, and customer communication.
  • Act as the central authority during live site incidents, driving real-time decision-making and coordination across Engineering, Support, PM, Communications, and Field teams.
  • Participate in the on-call rotation.
  • Provide calm, decisive leadership in crisis situations, escalating as needed to senior leadership.
  • Promote a customer-first culture by prioritizing availability, reliability, and platform trust in every response.
  • Contribute in analyzing customer-impacting signals from telemetry, support cases, and feedback to identify root causes, drive incident reviews (RCAs/PIRs), and implement preventative service improvements.
  • Contribute to Azure platform improvements by incorporating learnings from live site events and customer feedback, ensuring improved reliability, observability, and supportability.
  • Collaborate closely with Engineering and Product teams to influence and implement service resiliency enhancements, auto-remediation tools, and customer-centric mitigation strategies.
  • Identify and advocate for customer self-service capabilities, improved documentation, and scalable solutions that empower customers to resolve common issues independently.
  • Contribute to the development and adoption of incident response playbooks, mitigation levers, and operational frameworks aligned to real-world support scenarios and strategic customer needs
  • Contribute to the design of next-generation architecture for cloud infrastructure services with a focus on reliability and strategic customer support outcomes.
  • Build and maintain cross-functional partnerships, ensuring alignment across engineering, business, and support organizations.
  • Be data-driven and results-focused, using metrics to evaluate incident response effectiveness and platform health.
  • Apply engineering mindset to operational challenges, balancing agility, scalability, and technical quality in collaboration with peers
  • Demonstrate strong collaboration and results-focused execution under pressure while working closely with other teams.
Desired Qualifications
  • 8+ Years of demonstrated experience as an Incident Commander or Crisis Manager for critical, high-severity incidents in high-availability, distributed environments.
  • Experience with SRE (Site Reliability Engineering) principles and practices.
  • Exposure to chaos engineering, fault injection, or high availability architecture.
  • AI/ML Experience: [Beginner to Intermediate]
  • Familiarity with how AI/ML models are integrated into cloud infrastructure and their potential failure modes.
  • Experience using AI-powered tools for incident analysis, log correlation, or predictive alerting.
  • An understanding of the challenges and risks associated with AI/ML systems in a production environment.
  • Certifications:
  • Relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure Solutions Architect, GCP Professional Cloud Architect).
  • Certifications in ITIL, SRE, or other relevant frameworks.

Microsoft develops software, devices, and cloud services. Windows is an operating system that runs on personal computers, Office provides productivity apps, and Azure offers cloud computing and developer tools. The company differentiates itself with a large, integrated ecosystem of software, devices, and services, plus long-standing partnerships with PC makers and a broad enterprise footprint. Its goal is to put a computer on every desk and in every home, and to extend that reach through cloud services, professional networking (LinkedIn), and gaming.

Company Size

10,001+

Company Stage

IPO

Headquarters

Redmond, Washington

Founded

1975

Simplify Jobs

Simplify's Take

What believers are saying

  • Azure revenue surges 40% in Q3 FY26 from AI infrastructure demand.
  • Microsoft 365 Copilot hits 20 million seats, boosting $37B AI run rate.
  • Publisher Marketplace licenses content for AI search, enhancing ecosystem revenue.

What critics are saying

  • Generative AI erodes $70B Office per-seat licensing model within 12 months.
  • Kenya $1B data center stalls permanently over government payment disputes.
  • Azure gross margins compress structurally to 68% from AI capex in 18 months.

What makes Microsoft unique

  • Microsoft pioneered Altair BASIC in 1975, launching personal computing software.
  • MS-DOS deal with IBM in 1980 established PC operating system dominance.
  • GitHub Copilot Enterprise serves 140,000 organizations, driving developer AI adoption.

Help us improve and share your feedback! Did you find this helpful?

Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Company Match

Professional Development Budget

Conference Attendance Budget

Flexible Work Hours

Remote Work Options

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

-1%

2 year growth

0%
Tech in Asia
Apr 14th, 2026
Microsoft adds 30,000 Nvidia chips to Norway site after $6.2B commitment

Microsoft has secured a deal with neocloud provider Nscale to expand its Norway data centre site with 30,000 Nvidia chips. The agreement adds to Microsoft's earlier $6.2 billion commitment to the location, whilst OpenAI did not finalise a capacity agreement there. The move is part of Microsoft's roughly $60 billion spending wave on specialised neocloud providers that rent AI computing infrastructure. CEO Satya Nadella has identified power availability and data centre construction speed as the company's biggest bottleneck, rather than chip supply. The deal reflects how cheap electricity and clear regulations increasingly shape AI data centre locations. Nscale, a UK-based startup that emerged from crypto-mining firm Arkon Energy in 2024, raised $2 billion at a $14.6 billion valuation in March 2026.

Bloomberg L.P.
Apr 14th, 2026
Microsoft takes over $6.2B Stargate data centre from OpenAI in Norway

Microsoft has agreed to rent data centre capacity at a Norwegian site originally intended for OpenAI as part of its Stargate initiative. The company will rent 30,000 additional Nvidia Vera Rubin chips from neocloud provider Nscale at a campus inside the Arctic Circle in Narvik, Norway. The deal builds on Microsoft's prior $6.2 billion commitment at the same location. Nscale announced the agreement in a statement, marking a shift in the facility's intended purpose from OpenAI to Microsoft operations.

Yahoo Finance
Apr 14th, 2026
Microsoft stock down 23% despite Azure growing 39% and $625B revenue backlog

Microsoft shares have fallen 23.14% year-to-date to $370.87, despite strong Q2 FY2026 results showing non-GAAP EPS of $4.14, a 7.57% beat. Revenue reached $81.27 billion, up 16.72% year-over-year, with Azure growing 39%. The company's commercial remaining performance obligation surged 110% to $625 billion in contracted future revenue, providing multi-year visibility. Microsoft's OpenAI partnership includes a $250 billion incremental Azure services commitment, whilst the company holds a 27% stake valued at approximately $135 billion. Despite the decline, 95% of covering analysts remain bullish, with a consensus price target of $587.31. Analysts cite the year-to-date drop as creating an entry point for investors confident in Azure's AI growth trajectory.

The Register
Apr 14th, 2026
Microsoft hikes UK Surface prices by up to $280 as RAM shortage hits consumers

Microsoft has quietly raised UK prices for its Surface devices by £170 to £220, citing rising memory and component costs. The 13-inch Surface Laptop now starts at £1,099, up from £899 in February, whilst the 15-inch model has increased from £1,349 to £1,519. In the US, some configurations have jumped from $999 to $1,499, according to Windows Central. Microsoft acknowledged the increases are due to rising memory and component costs, as chip manufacturers prioritise high-bandwidth memory production, leaving DRAM and NAND supplies constrained. The changes were not announced; Microsoft simply updated pricing on its website and removed cheaper configurations. The memory shortage is affecting manufacturers across the board, from Chromebooks to Raspberry Pi devices, with geopolitical tensions and freight costs further driving prices higher.

YouTube
Apr 13th, 2026
Microsoft Plots New OpenClaw-Inspired Copilot Features

Microsoft Reporter Aaron Holmes reveals how Satya Nadella is pushing for autonomous "always-on" agents within Copilot to rival OpenClaw. He explains the internal reorganization designed to prioritize these background AI tools for enterprise users. Read more: https://www.theinformation.com/articles/microsoft-plots-new-copilot-features-inspired-openclaw Subscribe: https://www.theinformation.com/subscribe_youtube The Information’s TITV airs weekdays on YouTube, X and LinkedIn at 10AM PT / 1PM ET. Or check us out wherever you get your podcasts. Follow us: X: https://x.com/theinformation IG: https://www.instagram.com/theinformation/ TikTok: https://www.tiktok.com/@titv.theinformation LinkedIn: https://www.linkedin.com/company/theinformation/

INACTIVE