Senior Site Reliability Engineer
Incident Response
Confirmed live in the last 24 hours
London, UK
Experience Level
Desired Skills
Apache Kafka
Google Cloud Platform
  • You have 5+ years of large-scale production/platform operations experience in large SaaS provider environments, preferably as a Technical Duty Officer/Major Incident Manager, SRE team leader, or Infrastructure (IaaS) or Platform (PaaS) Architecture SME in a Managed Service Provider environment
  • Experience in bare metal, OpenStack, and Kubernetes (K8s) architectures supporting a large number of SOA/API-based services
  • Exposure to Open Source Service Meshes, Proxies, Caching, Message Buses (Kafka, MQS), NoSQL (HBase, Hadoop), MySQL clusters, and Search environments (Solr, ES)
  • You should be competent in debugging global, distributed Web/API sites based on Linux systems (Ubuntu, RHEL, CentOS), BGP, iBGP, and IP Anycast networking in multi-vendor virtualized, Edge, and hybrid public cloud architectures
  • You are not expected to be an expert in all areas, but you should be familiar with common terminologies, processes, and architectures in Linux Open Source environments, as well as a thorough understanding of Virtualization, Containers, and Kubernetes
  • You are confident and comfortable communicating and interacting with everyone from individual contributors to C-level executives, across multiple countries, ethnicities, and backgrounds
  • You have a rock solid command presence and are calm and collected in highly stressful situations, such as a major service outage
  • You're driven to continuously learn new skills and technologies
  • Bachelor's degree in Computer Science or Information Systems or equivalent technical field, or similar work experience in a large-scale 24/7 production environment supporting critical, real-time applications
  • Flexibility to work different shifts and provide weekend coverage depending on need
  • Solid understanding of ITILv4 Service Lifecycle Management, Service Delivery KPIs, SLIs, SLOs, and Incident, Change, and Problem Management framework, terminology, tools (ServiceNow, Remedy, Jira Service Desk), and processes
  • Solid knowledge and understanding of security standards and best practices, such as: OWASP, W3C, ISO 27001, SOC1-2, PCI, and SOX
  • Ability to troubleshoot secured protocols such as SSH, SSO, TLS, FTPS, WebDAV, and HTTPS
  • Solid understanding and debugging skills in TCP/IP, BGP, IP Anycast, and distributed internal and external DNS
  • Two years of working experience with multi-regional public cloud providers
  • Experience with observability tools and distributed tracing in large-scale environments (Splunk, Datadog, Wavefront, Catchpoint, ThousandEyes, Sensu, SignalFx RUM, OpenTelemetry, SNMP)
  • Good understanding and experience with configuration management tools and CI/CD pipelines - Puppet, Ansible, Terraform, Artifactory
  • Excellent interpersonal and communication skills
  • Own and direct live-site Major Incident Management from detection through identification, escalation, mitigation, and recovery
  • Triage, refine, and verify the Problem Statement, notify and coordinate the efforts of all appropriate SME resources, and lead cross-functional Incident Bridges to quickly identify and mitigate the problem and restore service. You'll be evaluated on how well you reduce mean times from detection (MTTD) through recovery (MTTR)
  • Ensure accurate, valid and timely communication to key stakeholders and business entities
  • Lead daily Incident and Change ticket reviews, coordinate and monitor change windows, and coordinate with Problem Management on TopOps Issues and action items
  • Operate across organizational boundaries (Business, Dev, Ops, CS) to protect our customers, their data, and the availability of all Box services, from internal and external security threats, unanticipated volume surges, and significant performance issues
  • Troubleshoot and identify critical problems in a SOA/API-based, global hybrid cloud, distributed edge architecture across multiple enterprise and public cloud regions
  • Provide day to day technical expertise and experience to the organization to address issues in globally diverse, high velocity 24x7 environments - from policy and procedural decisions to key architectural and tooling insights to improve Box's Incident, Change, and Problem Management engineering capabilities
  • Lead daily reviews of planned changes (CAB) in Jira; accountable for reviewing and minimizing change risk, ensuring adequate and appropriate change timing and duration, and complete rollout, validation, and rollback plans that are optimized to prevent site or service impact
  • Ensure all customer-impacting Incident tickets are completely and correctly documented and augmented with appropriate metrics, timelines, actions taken, and actions still pending
  • Contribute to and review Incident postmortems to ensure adequate documentation and appropriate prioritization of action items related to reducing MTTI, MTTM, and MTTR
  • Participate in Problem Management scrums and Postmortems to identify leading organizational and company-wide technical issues, threats, and trends that block the ability of the organization or teams to perform their roles and provide services optimally and reliably
  • Lead projects to improve tools and processes related to overall site and service manageability, observability, and resiliency
  • Coordinate regularly with Infosec, Customer Success, Platform, and Dev leaders to continuously assess new security and customer onboarding threats and known issues
  • Continuously mentor and train Global NOC and system engineers
Desired Qualifications
  • Understanding of Agile methods and tools (Jira)
  • Experience with WAF, Bot Managers, and Content Delivery Networks (Cloudflare, Akamai)
  • Experience working in and transitioning into multi-regional hybrid cloud architectures (GCP preferred, AWS)
  • Understanding of Apache ZooKeeper and Hadoop
  • Experience with large production Scala, Java, Node, PHP environments helpful
  • Experience working with various message bus technologies (Kafka, RabbitMQ, MQS)
  • Experience working with relational and non-relational databases and search engines (MySQL, PostgreSQL, HBase, Elasticsearch, Solr)
  • Experience with caching apps (Squid, Redis, Memcached)
  • Experience with service mesh technologies in a hybrid-cloud environment (ZooKeeper, SmartStack)

1,001-5,000 employees

Cloud content management and file sharing service
Company Overview
Box is on a mission to make businesses more productive, competitive, and powerful by connecting people and their most important information. The company operates one of the world's largest cloud storage platforms.
  • Health and Wellness
  • Family Support
  • Generous Time Off
  • Financial Benefits
  • Community
  • Evolving Workplace
Company Core Values
  • Blow our customers' minds
  • Take risks. Fail fast. GSD
  • 10x it!
  • Be an owner. It's your company
  • Bring your (___) self to work every day
  • Be candid and assume good intent
  • Make mom proud