Incident Response Manager
Posted on 7/19/2023
INACTIVE
Crunchyroll

501-1,000 employees

Anime streaming service
Company Overview
Crunchyroll stands out as a global leader in the anime and manga industry, offering one of the largest collections of licensed content, translated into multiple languages, and accessible across 200+ countries. The company's competitive edge lies in its unique offerings such as immediate access to top series after Japanese broadcast and a wide range of anime-related products and experiences. With its app available on over 15 platforms, including all gaming consoles, Crunchyroll demonstrates its commitment to technological adaptability and user accessibility.

Company Stage

N/A

Total Funding

$26.8M

Founded

2006

Headquarters

Culver City, California

Growth & Insights
Headcount

6 month growth

0%

1 year growth

4%

2 year growth

25%
Locations
Dallas, TX, USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Bash
PHP
Python
JavaScript
Management
Java
AWS
Development Operations (DevOps)
Linux/Unix
Google Cloud Platform
Categories
DevOps & Infrastructure
Software Engineering
Requirements
  • Five (5) years of managerial experience leading teams of ten or more technical people
  • Minimum of eight (8) years of IT experience, including three (3) years of Incident Response experience
  • Previous experience with Java, Bash, and/or Python, or other Linux scripting/programming languages
  • Experience with troubleshooting web, mobile and living room applications
  • Experience running an Incident Response team
  • Experience ensuring that underlying infrastructure is running smoothly and that systems and tools are working as expected
  • Monitor critical applications and services to minimize downtime and ensure their availability
  • Automate monitoring, incident response, and alerting to reduce the time spent on necessary but time-consuming tasks
  • Extensive technical knowledge of microservices
  • In-depth understanding of IT infrastructure, deployment and release pipelines
  • Skilled with observability tools: New Relic, Datadog, CloudWatch
  • Strong analytical and problem-solving skills
  • Skilled with ITSM ticketing products
  • Three (3) years of coding and automation testing experience with any language or tools
  • Designs, codes, tests, deploys, and evaluates highly reliable web pages in PHP, Go, and JavaScript
Responsibilities
  • Functional responsibilities for all aspects of team staffing and operations to include hiring, work scheduling, performance management, rewards and recognition, training, and career development
  • Ensure that team personnel implement effective Incident Management processes to detect and resolve service outages and degradations as quickly as possible, and that services are returned to normal service levels
  • On-call rotation for incident response and proactive incident measures
  • After incidents, document the actions taken so that they can be turned into automated solutions for future incident response
  • Build monitoring alerts and incident response processes
  • Bring about cultural shifts to provide a foundation for process changes
  • Provide technical services expertise & strong troubleshooting skills to the team and ensure that personnel possess and grow their technical skills to perform their roles
  • Work closely with DevOps, SRE, delivery groups, and Product Development to ensure the integrity and reliability of our services
  • Select and mentor Shift Leads and participate in career development and coaching
  • Ensure timely response to customer requests for assistance; coordinate potentially service-impacting long-haul vendor maintenance or repairs with customer and program personnel; and maintain peer-level relationships with our customers
  • Support the staffing, onboarding, training, development and performance management of the team staff
  • Ensure all Trouble Tickets are appropriately entered and managed in accordance with program policies and processes
  • Utilize metrics and trend analysis to reduce Mean Time To Repair, improve performance, and track Preventive Maintenance (PM) completions, aging tickets and service-affecting issues
  • Experience working in cloud environments (AWS and GCP)
  • Ensure all employees remain proficient; identify training for each skill set and provide progress reports on operations team training and certifications
  • Analyze technical functions, recommend upgrades/changes, and assess current and future team needs. Drive and support Continual Service Improvement