The primary responsibility of the Lead Observability Engineer is leading the technical direction and implementation of the pipelines and infrastructure focused on monitoring and observability within Sands.
This platform is essential, providing the tooling, practices, and visibility that our infrastructure and development engineering teams use to observe and maintain the environments and platforms in our cloud and on-premises systems.
The Lead Observability Engineer will be responsible for a team of Observability Engineers, ensuring that technical infrastructure build and operational teams have effective tools to monitor, observe and operate systems and platforms within the framework of large enterprise compliance and governance needs.
The team will develop, maintain and execute infrastructure-as-code scripts and playbooks to automate deployment and maintenance tasks, ensuring the availability, reliability, and efficient operation of the enterprise systems.
The position demands someone who is highly technically competent, detail oriented, and driven to stay current with evolving technologies.
All duties are to be performed in accordance with departmental and Las Vegas Sands Corp.’s policies, practices, and procedures. All Las Vegas Sands Corp. Team Members are expected to conduct and carry themselves in a professional manner at all times. Team Members are required to observe the Company’s standards, work requirements and rules of conduct.
Lead a team of observability engineers in designing and implementing observability solutions, monitoring system health, and troubleshooting incidents to ensure high availability and performance of software applications and infrastructure.
Work with Central Head of Operations to decide and execute on priorities for the required monitoring, alerting and observability KPIs.
Develop solutions to observability demands.
Deliver broad services that cover the following domains:
Log Collection and Analysis
Operational Metrics
Distributed Tracing
Build, Test, and Deployment Automation
Platform reliability engineering monitoring
Act as an evangelist for the observability domain across the enterprise and influence IT stakeholders to apply observability best practices.
Design, develop, and maintain automation solutions to support observability and operations, focusing on improving system monitoring, alerting, and reporting capabilities.
Provide technology and/or process solutions to high-impact problems/projects through in-depth evaluation of complex business processes, system processes, and industry standards.
Own, develop, and be accountable for observability policies, processes, and architectural decisions.
Ensure operational methods, procedures, facilities, and tools are established, reviewed, and maintained.
Monitor and research emerging observability trends and technologies with the potential to improve efficiency, security, and business capabilities.
Develop and execute proof-of-concept projects to evaluate new solutions for potential adoption.
Develop documentation (e.g., data flow diagrams, logical diagrams, and physical diagrams) and training in compliance with standards.
Apply enterprise design principles and best practices for implementing and supporting observability services.
Operate with a limited level of direct supervision and exercise independence of judgment and autonomy.
Serve as advisor and coach to less senior team members, allocating work as necessary.
Be a strong thought leader in observability and Site Reliability Engineering principles.
Consistently share standard methodologies and improve processes within and across teams.
Perform job duties in a safe manner.
Attend work as scheduled on a consistent and regular basis.
Perform other related duties as assigned.
At least 21 years of age.
Proof of authorization to work in the United States.
Bachelor’s Degree in Computer Science, Engineering or related discipline required.
Advanced degree in technology or engineering is a plus.
Must be able to obtain and maintain any certification or license, as required by law or policy.
5-10+ years of demonstrated experience leading distributed monitoring, observability, IT operations, DevOps, or SRE groups, with expertise in monitoring on-premises IT infrastructure, applications, and private and public cloud.
Experience with ITRS Geneos and Opsview is a plus.
Strong expertise with scripting in Python, Java and RESTful services, with a focus on building high-throughput, high-volume distributed systems.
Strong expertise in Linux/Unix, Container orchestration (e.g., Kubernetes), container runtimes and optimization.
Strong understanding of Site Reliability Engineering and DevOps principles.
Strong technical acumen in Cloud Architecture, Performance Benchmarking, and Capacity planning.
Demonstrated experience leading and growing engineers and teams.
Strong Cloud (AWS, GCP, Azure etc.) platform knowledge.
Proficiency in Project Management and work item management tools such as Azure DevOps and Portfolio.
Strong knowledge of logging systems, experience with ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, or similar platforms.
Experience with tools like Harness, GitLab, Terraform, Ansible, or CloudFormation for managing and monitoring infrastructure.
Demonstrated experience diagnosing performance bottlenecks and other system issues using observability data.
Demonstrated understanding of and respect for IT service management practices (e.g., change, release, incident, problem management).
Able to multi-task and handle various types of requests from different people/areas.
Strong analytical and problem-solving skills.
Effective written and verbal communication skills in English.
Physically access assigned workspace areas with or without reasonable accommodation.
Work indoors and be exposed to various environmental factors such as, but not limited to, CRT, noise, and dust.
Utilize laptop and standard keyboard to perform essential functions of the job.