Senior Site Reliability Engineer
Posted on 11/16/2022
INACTIVE
Locations
Remote • United States
Desired Skills
Apache Spark
AWS
Development Operations (DevOps)
Google Cloud Platform
JavaScript
C/C++/C#
Git
Java
Microsoft Azure
MongoDB
MySQL
Operating Systems
Postgres
Redis
Ruby
TensorFlow
Terraform
Kubernetes
Python
Nginx
Requirements
  • Experience administering Kubernetes-based microservices, ingress controllers, web servers (nginx), and databases (Postgres, MySQL, MongoDB; desirable: Redis, ClickHouse)
  • Ability to program (structured and OO) with one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript
  • Strong experience with AWS technologies such as EKS, ELB, RDS, S3/EBS/Glacier and VPC
  • Experience architecting highly scalable, fault-tolerant, secure, and available systems within the AWS ecosystem
  • Strong troubleshooting experience in the realm of networking fundamentals, web applications, and DNS
  • Hands-on experience developing automation to streamline development processes
  • Experience working with modern CI/CD tools such as ArgoCD, GitHub Actions, or similar solutions
  • Experience with Infrastructure as Code tools (e.g. Terraform, CloudFormation)
  • BS or MS from a top-notch CS program (or equivalent experience)
  • 5+ years of professional experience in hands-on engineering roles (SRE/DevOps)
  • 3+ years operating high-traffic production environments in public clouds: AWS, GCP, or Azure
  • Python programming experience in production environments
  • Experience with modern cloud environments: containerization, infrastructure-as-code, DevOps, CI/CD pipelines and general automation
  • Hands-on experience with network security, database systems, and related tools
  • Operating Kubernetes clusters in a compliance-regulated environment
  • Experience stress-testing and load-testing applications and performing failure analysis
  • Experience with cloud and infrastructure security regulations and compliance programs: SOC 2, ISO 27001, HIPAA, GDPR, CCPA
  • Experience with MLOps: Spark, TensorFlow, GPUs
Responsibilities
  • Run the production environment to provide the highest levels of uptime, performance, and reliability
  • Identify toil in the day-to-day operations and automate whatever can be automated
  • Work with development teams to make sure the applications are production ready, scalable, reliable, and observable from day zero
  • Identify and drive opportunities to improve automation for code deployment, management, and visibility of application services
  • Establish end-to-end monitoring and alerting on all critical components within the platform
  • Participate in the on-call rotation, supporting the platform and production applications
  • Manage end-to-end availability and performance of critical services and build automation
  • Perform root cause analysis on issues, and participate in blameless post-mortems so we can learn from incidents and prevent their recurrence through automation
  • Independently troubleshoot complex systems and environments including applications, microservices, DNS, and networking components
  • Create load test scenarios and streamline their execution so performance regressions can be caught in pre-production
  • Enable developers and product teams to move rapidly with features without sacrificing the reliability, availability, and overall performance of our systems
  • Participate in architecture reviews and work cross-functionally with Engineering teams on operational readiness and tactical day-to-day scenarios
  • Work with engineering teams to better address their needs and enable more effective and efficient developer throughput
  • Identify performance bottlenecks and triage with Engineering teams to design and implement a secure and performant solution
  • Guide development teams towards security, reliability, and availability best practices during the SDLC
Daily and Monthly Responsibilities
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Balance feature development speed and reliability with well-defined service-level objectives (SLOs) and service-level indicators (SLIs) to honor SLAs
Workato

501-1,000 employees

Cloud integration software company
Company Overview
Workato's mission is to enable companies to tap into the growth mindset and transform their organization with Workato. Workato is moved by innovation: a passion to create the best possible way and the drive to keep making it better.
Benefits
  • Flexible working arrangements
  • EAP
  • Health insurance
  • Stock options
  • Professional development
  • PTO
  • Company events & recreation time
Company Core Values
  • Prioritize customers
  • Win together
  • Act now
  • Think ahead
  • Better each other
  • Go offbeat
  • Have fun