Full-Time

Sr. Software Engineer

Site Reliability Engineer

Posted on 12/19/2024

Lowe's

10,001+ employees

Industrial & Manufacturing
Consumer Goods

Senior

Charlotte, NC, USA

Requires relocation to the Charlotte region and in-person work two days per week.

Category
DevOps & Infrastructure
Site Reliability Engineering
Software Engineering
Required Skills
Bash
Kubernetes
Python
NoSQL
SQL
Java
Docker
Go
Prometheus
C/C++
Splunk
Requirements
  • 5 years of demonstrated proficiency in one or more scripting languages such as Bash, Python, or Go.
  • 5 years of experience with Kubernetes or equivalent
  • 5 years of software development experience in Java, C, or C++
  • 5 years of experience with containers and container orchestrators such as Docker and Kubernetes
  • 5 years of demonstrated experience debugging and fixing system/infrastructure and application issues.
  • 5 years of experience working with monitoring tools such as Prometheus, Grafana, Splunk, or Google Stackdriver.
  • 5 years of experience creating CUJs (Critical User Journeys) by identifying SLIs/SLOs and working with service/application teams to implement monitoring and alerting tools.
  • 5 years of experience with databases (SQL or NoSQL)
  • 5 years of experience with log analysis and building dashboards.
  • Retail knowledge is a plus.
  • Master's Degree in Computer Science, CIS, or related field
  • 5 years of IT experience developing and implementing business systems within an organization
  • 5 years of experience working with defect or incident-tracking software
  • 5 years of experience writing technical documentation in a software development environment
  • 3 years of experience working with the IT Infrastructure Library (ITIL) framework
  • 3 years of experience leading teams, with or without direct reports
  • 5 years of experience working with source code control systems
  • Experience working with Continuous Integration/Continuous Deployment (CI/CD) tools
  • 5 years of experience in systems analysis, including defining technical requirements and performing high-level design for complex solutions
  • 4 years of experience with reactive programming.
Responsibilities
  • Run the production environment by monitoring availability and taking a holistic view of system health.
  • Build software and systems to manage platform infrastructure and applications.
  • Improve reliability, quality, and time-to-market of our suite of software solutions.
  • Measure and optimize system performance, with an eye toward pushing capabilities forward.
  • Provide primary operational support and engineering for multiple large, distributed software applications.
  • Improve reliability and quality, and reduce MTTR.
  • Participate in system design consulting, platform management, capacity planning, and cost analysis.
  • Measure and optimize system performance to push our capabilities forward and drive continual improvement.
  • Gather and analyze metrics from applications and services to assist in performance tuning and fault finding.
  • Contribute to capacity planning, demand forecasting, software performance analysis, and systems tuning.
  • Develop and implement monitoring, observability, and alerting tools such as dashboards and logging systems to understand the health and availability of our infrastructure and applications.
  • Collect and analyze information from distributed systems into simple views of the technology portfolio to identify trends and spot stability threats.
  • Monitor application availability, latency, and overall system health.
  • Develop self-service solutions to help increase productivity by removing toil and reducing unnecessary roadblocks.
  • Resolve technical issues in production, learn to mitigate them quickly, and find ways to prevent them.
  • Document every action so lessons learned turn into repeatable actions and then into automation.
  • Triage, analyze, and provide solutions to critical & high-priority technical issues occurring in the ecosystem, and optimize incident management processes.
  • Respond, react, and communicate according to the ITSM incident management process. This process involves detecting the incident, communicating with leadership in a timely manner throughout the incident, and restoring service, followed by root cause analysis to prevent the incident from recurring.
  • Drive blameless postmortem culture.
  • Regularly review key site technical metrics such as transaction errors, logging, response times, caching strategies, conversion/bounce rates, capacity & resource utilization.

Company Stage

IPO

Total Funding

$136.1M

Headquarters

Mooresville, North Carolina

Founded

1946

Simplify's Take

What believers are saying

  • Growing trend of smart home technology integration benefits Lowe's product offerings.
  • The popularity of DIY home improvement projects presents expansion opportunities for Lowe's.
  • Sustainability trend allows Lowe's to expand eco-friendly product range.

What critics are saying

  • Increased competition from Home Depot's same-day delivery service.
  • Rising raw material costs, especially lumber, affect Lowe's profit margins.
  • Ongoing labor shortage challenges Lowe's staffing despite hiring efforts.

What makes Lowe's unique

  • Lowe's launched the Aging in Place Program in November 2021.
  • Lowe's offers one-hour delivery with Instacart in select markets.
  • Lowe's introduced the MVPs Pro Rewards and Partnership Program.

Benefits

Health Insurance

Dental Insurance

Vision Insurance

Paid Vacation

Paid Sick Leave

Paid Holidays

Performance Bonus

INACTIVE