Full-Time

Principal Site Reliability Engineer

Posted on 5/1/2024

Lightspeed Commerce

1,001-5,000 employees

Cloud-based POS and payments platform

Data & Analytics
Consumer Software

Senior, Expert

Toronto, ON, Canada

Required Skills
Datadog
Kubernetes
Microsoft Azure
Python
Communications
MySQL
NoSQL
Git
BigQuery
SQL
Java
Postgres
AWS
Terraform
Redis
MongoDB
Cassandra
Google Cloud Platform
Requirements
  • Bachelor’s degree in Computer Science or Engineering, or an equivalent level of real-world experience.
  • 9-10+ years of experience across site reliability engineering, systems administration, and/or software engineering.
  • Strong expertise in container orchestration platforms, specifically Kubernetes.
  • Strong understanding of both relational (e.g., PostgreSQL, MySQL) and NoSQL databases (e.g., MongoDB, Cassandra, Redis).
  • Deep understanding of network protocols and IP networking, as well as experience with network troubleshooting.
  • Proficiency in programming languages such as Java, Python, Go, etc.
  • Proven track record of managing large-scale infrastructure in cloud environments, such as Google Cloud, AWS or Azure.
  • Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog) and logging solutions (e.g., ELK stack).
  • Strong understanding of security best practices.
  • Exceptional problem-solving skills and the ability to work under pressure to troubleshoot and resolve complex issues.
  • Excellent communication skills to effectively collaborate with cross-functional teams.
  • Strong leadership skills, capable of leading projects and influencing engineering decisions across the organization.
Responsibilities
  • Work closely with development teams to empower them with the necessary tools and practices for monitoring software health in production, defining and measuring reliability metrics (SLI, SLO), and managing error budgets.
  • Design, build, and maintain robust infrastructure on GCP, leveraging cloud-native technologies such as GKE, Cloud SQL, and BigQuery.
  • Develop and manage CI/CD pipelines for efficient deployment and release using a number of technologies (GitLab, GitHub, Helm, Terraform, etc.).
  • Drive the incident management process and conduct post-mortem analyses to prevent future outages.
  • Mentor junior SREs and developers, providing guidance on best practices in cloud architecture, data management, and software development.
  • Conduct system performance benchmarks and implement enhancements to improve system reliability and throughput.
  • Collaborate with cross-functional teams to identify, design, and implement internal process improvements in a cost-efficient manner.
  • Design and build robust, scalable, and highly available systems.
  • Build platform solutions and apply software engineering principles to improve the reliability of our software and accelerate software delivery.
  • Manage infrastructure change through infrastructure as code (IaC).
  • Be part of our on-call rotation.
  • Stay current with industry trends and emerging technologies, advocating for the adoption of new technologies and practices that improve product quality and team efficiency.

Lightspeed offers a unified cloud-based point of sale and payments platform for retail, hospitality, and golf businesses. The platform integrates multichannel sales, global payments, and supplier network connections to simplify operations, streamline workflows, and help businesses scale, grow, and provide exceptional customer experiences.

Company Stage

IPO

Total Funding

$1.2B

Headquarters

Montreal, Canada

Founded

2005

Growth & Insights
Headcount

6 month growth

↑ 5%

1 year growth

↑ 12%

2 year growth

↑ 17%