Full-Time

Engineer – Fleet Monitoring & Analysis

Updated on 12/2/2024

CoreWeave

501-1,000 employees

Cloud service for GPU-accelerated workloads

Enterprise Software
AI & Machine Learning

Compensation Overview

$160k - $185k Annually

Junior, Mid

Livingston, NJ, USA + 3 more

More locations: New York, NY, USA | Bellevue, WA, USA | Sunnyvale, CA, USA

Hybrid workplace; in-office presence required.

Category
DevOps & Infrastructure
Site Reliability Engineering
Cloud Engineering
Requirements
  • 2 or more years of experience in the software or infrastructure engineering industry.
  • Experience with automation and orchestration workflows, plus knowledge of server hardware, components, and related technologies, and of strategies for managing physical infrastructure at scale.
  • Experience implementing metrics collection and alerting on standard platforms.
  • A belief in the value of automation and a willingness to champion practices that drive reliability and prioritize the CoreWeave customer experience.
Responsibilities
  • Design and implement solutions for large-scale server observability to continually improve the stability of CoreWeave’s global hardware fleet.
  • Adapt, extend, and implement open-source solutions to augment the depth and breadth of our visibility into our operating environment.
  • Generate and maintain custom reports, alarms, and visualizations to help teams understand and respond to our growth and changes.
  • Create test plans, deployment automation, dashboards, alerts, and insights into our fleet operations, as well as participate in the Fleet Engineering Developers’ on-call rotation.

CoreWeave provides cloud computing services that focus on GPU-accelerated workloads, which are essential for tasks requiring high computational power like Generative AI, Machine Learning, and VFX rendering. Their services allow clients to access powerful computing resources without needing to invest in expensive hardware, operating on a pay-as-you-go model. This flexibility is particularly beneficial for tech companies, film studios, and enterprises that need scalable solutions. CoreWeave's infrastructure utilizes a bare metal serverless Kubernetes platform, which enhances performance while minimizing operational complexity for clients. Unlike many competitors, CoreWeave offers a wide range of NVIDIA GPUs, enabling clients to tailor their computing power to specific needs. The company's goal is to provide efficient and scalable cloud resources that meet the growing demands of industries reliant on high-performance computing.

Company Stage

N/A

Total Funding

$1.6B

Headquarters

New York City, New York

Founded

2017

Growth & Insights
Headcount

6 month growth

53%

1 year growth

174%

2 year growth

828%

Simplify's Take

What believers are saying

  • Securing $1.1 billion in funding positions CoreWeave for aggressive growth and innovation in the AI and HPC sectors.
  • The appointment of former AWS executive Chetan Kapoor as Chief Product Officer brings valuable expertise and leadership to drive product strategy during a hypergrowth phase.
  • CoreWeave's $2.2 billion investment in European data centers demonstrates their commitment to expanding global reach and meeting surging demand for AI infrastructure.

What critics are saying

  • The competitive landscape with giants like AWS launching high-core instances could pressure CoreWeave to continuously innovate to maintain its edge.
  • Rapid expansion, including significant investments in new data centers, could strain resources and operational capabilities.

What makes CoreWeave unique

  • CoreWeave specializes in GPU-accelerated workloads, setting it apart from general cloud service providers like AWS and Azure.
  • Their fully managed, bare metal serverless Kubernetes platform offers high performance with reduced operational burden, a unique selling point in the cloud computing market.
  • CoreWeave's strategic partnerships, such as with Bloom Energy for on-site power generation, enhance their infrastructure's reliability and sustainability.
