Senior Site Reliability Engineer
Engineering Platforms
Posted on 9/18/2023
INACTIVE
CoreWeave

201-500 employees

Specialized cloud provider offering high-performance GPU compute resources
Company Overview
CoreWeave is a specialized cloud provider that offers a broad range of high-performance GPU compute resources, making it a leader in the industry for compute-intensive tasks such as VFX and rendering, machine learning, and AI. As an NVIDIA Elite Cloud Solutions Provider, the company provides reliable, on-demand access to GPU resources, which has resulted in significant cost savings and performance improvements for its clients. CoreWeave's commitment to delivering world-class results and its ability to scale resources quickly and easily make it an ideal workplace for those seeking to work at the forefront of cloud computing technology.
AI & Machine Learning
Data & Analytics
Hardware
B2B

Company Stage

N/A

Total Funding

$2.8B

Founded

2017

Headquarters

New York, New York

Growth & Insights
Headcount

6 month growth

80%

1 year growth

296%

2 year growth

737%
Locations
Remote in USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Linux/Unix
Kubernetes
Categories
DevOps & Infrastructure
Requirements
  • Design and implement services and tools to reduce friction and toil in the lives of our engineering and operations teams
  • Improve the performance, security, reliability, and scalability of our observability, CI/CD, and related services, and participate in the Engineering Platforms on-call rotation
  • Create and maintain Kubernetes operators, custom controllers, and other tools to intelligently scale our operational capability
  • Develop dashboards, alerts, and insights into the customer experience using Grafana-ecosystem tools such as Mimir and Loki
  • Enable and evangelize the practice of reliability engineering across CoreWeave's engineering teams
  • Grow, change, invest in your teammates, be invested-in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself
  • You have four or more years of experience in software or infrastructure engineering
  • You enjoy helping your colleagues achieve more with less effort
  • You have experience operating services in production and at scale and are versed in reliability engineering concepts such as the different types of testing, progressive deployments, error budgets, the role of observability, and fault-tolerant design
  • You're familiar with Kubernetes and have interest in or experience with using it for event-driven and/or stateful orchestration
  • You're comfortable with the idea of using Go as your primary programming language
  • You know your way around a Linux distro, shell scripting, and/or the Linux storage and networking stacks
  • You can transform problems into elastic solutions, decompose them into achievable tasks, and socialize both to your teammates
  • Be Curious at your Core
  • Act like an Owner
  • Empower Employees
  • Deliver Best-in-Class Client Experience
  • Achieve More Together