Site Reliability Engineer
Global provider of solar energy solutions and storage
Qcells North America, a globally recognized energy solutions provider, offers a comprehensive range of services including solar cell and module production, energy storage, and energy retail. With a strong presence across multiple continents, Qcells stands out for its commitment to quality service and long-term partnerships in the utility, commercial, governmental, and residential markets. The company's headquarters in Seoul, Thalheim, and San Francisco each focus on a different aspect of the business, underscoring its industry leadership and competitive advantage.
Growth & Insights
6-month growth: ↑ 14%
1-year growth: ↑ 26%
2-year growth: ↑ 28%
San Francisco, CA, USA
Development Operations (DevOps)
DevOps & Infrastructure
Requirements
- Bachelor's degree in computer science, information technology, or a related field
- At least 5 years as an SRE and at least 2 years managing large-scale production systems
- Deep understanding of distributed systems, cloud computing, and containerization technologies
- Strong programming and scripting skills (e.g., Python, Shell, Java) for automation and tools development
- AWS cloud platform experience preferred
- Knowledge of security best practices, including network and application security
- Experience with APM (Application Performance Monitoring) tools
- Experience with configuration management, infrastructure as code, and orchestration tools (we use Ansible, Terraform, and Kubernetes)
- Proficiency in monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack)
- Knowledge of incident management and root cause analysis
- Strong problem-solving skills and the ability to work under pressure
- Excellent communication skills
- Energy domain experience preferred
Responsibilities
- System Reliability: Ensure the reliability of our systems and services, minimizing downtime and outages through proactive monitoring, alerting, and troubleshooting.
- Incident Management: Respond to and manage incidents, conduct root-cause analysis, and implement preventive measures to reduce the impact of future incidents.
- Capacity Planning: Collaborate with teams to plan for capacity and scalability, making data-driven decisions about resource allocation and performance optimization.
- Automation: Develop and maintain automation tools for production environments to enhance system reliability and efficiency.
- Change Management: Oversee changes and updates to production systems, prioritizing risk mitigation and minimizing service disruptions during deployments.
- Performance Optimization: Work on performance tuning, profiling, and optimization of systems, making them faster and more efficient.
- On-Call Duty: Participate in an on-call rotation to respond to incidents and issues outside of regular working hours.
- Collaboration: Collaborate closely with Software Engineering, Product, DevOps, and other teams to ensure that reliability is built into the design and development process.
- Documentation: Contribute to documentation of production processes, systems, and incident response playbooks.