Site Reliability Engineer III
Observability
Wikimedia Foundation

51-200 employees

Nonprofit charitable organization
Company Overview
The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally.
Social Impact

Company Stage: Grant
Total Funding: $144.9M
Founded: 2003
Headquarters: San Francisco, California

Growth & Insights
Headcount

6 month growth: -89%
1 year growth: -88%
2 year growth: -87%
Locations
Remote
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Development Operations (DevOps)
Management
Operating Systems
Puppet
Ruby
Terraform
Python
Ansible
Categories
DevOps & Infrastructure
Software Engineering
Requirements
  • 2+ years of experience in an SRE/Operations/DevOps role
  • Experience with operating highly available infrastructure
  • Comfortable with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.)
  • Experience with package management for operating systems (Debian, etc.)
  • Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
  • Past exposure to automation and streamlining of tasks
  • Ability to communicate clearly in technical English
  • A history of contributing to Open Source projects
  • Prior participation in the Wikimedia movement
  • Hands-on experience with observability-related tools and practices (tracing, logging, metrics, performance monitoring, SLOs, and alerting)
Responsibilities
  • Implementation and maintenance of Internet-facing infrastructure and services
  • Use of configuration management and deployment tools
  • Monitoring of systems and services, and optimization of performance and resource utilization
  • Typical operating system-level tasks such as logging and backup / restore
  • Cookbook/runbook implementation for everyday maintenance actions
  • Incident response, diagnosis, and follow-up on system outages or alerts
  • Collaborating with a global and asynchronously communicating team (don't worry if you have never worked remotely; we'll help you get used to it)