Site Reliability Engineer
Cloud
Global provider of solar energy solutions and storage
Company Overview
Qcells North America, a globally recognized energy solutions provider, offers a comprehensive range of services including solar cell and module production, energy storage, and energy retail. With a strong presence across multiple continents, Qcells stands out for its commitment to quality service and long-term partnerships in the utility, commercial, governmental, and residential markets. The company's headquarters in Seoul, Thalheim, and San Francisco each focus on different aspects of the business, underscoring its industry leadership and competitive advantage.
Energy
Growth & Insights
Headcount
6 month growth
↑ 14%
1 year growth
↑ 26%
2 year growth
↑ 28%
Locations
San Francisco, CA, USA
Desired Skills
AWS
Development Operations (DevOps)
Java
Management
Terraform
Kubernetes
Python
Ansible
Categories
DevOps & Infrastructure
Requirements
- Bachelor's degree in computer science, information technology, or a related field
- At least 5 years as an SRE and at least 2 years managing large-scale production systems
- Deep understanding of distributed systems, cloud computing, and containerization technologies
- Strong programming and scripting skills (e.g., Python, Shell, Java) for automation and tools development
- AWS cloud platform experience preferred
- Familiarity with security best practices and knowledge of network and application security
- Experience with APM (Application Performance Monitoring) tools
- Experience with configuration management, infrastructure as code, and orchestration tools (we use Ansible, Terraform, and Kubernetes)
- Proficiency in monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack)
- Knowledge of incident management and root cause analysis
- Strong problem-solving skills and the ability to work under pressure
- Excellent communication skills
- Energy domain experience preferred
Responsibilities
- System Reliability: Ensure the reliability of our systems and services, minimizing downtime and outages through proactive monitoring, alerting, and troubleshooting.
- Incident Management: Respond to and manage incidents, conduct root-cause analysis, and implement preventive measures to reduce the impact of future incidents.
- Capacity Planning: Collaborate with teams to plan for capacity and scalability, making data-driven decisions about resource allocation and performance optimization.
- Automation: Develop and maintain automation tools for production environments to enhance system reliability and efficiency.
- Change Management: Oversee changes and updates to production systems, prioritizing risk mitigation and minimizing service disruptions during deployments.
- Performance Optimization: Work on performance tuning, profiling, and optimization of systems, making them faster and more efficient.
- On-Call Duty: Participate in an on-call rotation to respond to incidents and issues outside of regular working hours.
- Collaboration: Collaborate closely with Software Engineering, Product, DevOps, and other teams to ensure that reliability is built into the design and development process.
- Documentation: Contribute to documentation of production processes, systems, and incident response playbooks.
Desired Qualifications
- Energy domain experience