Full-Time

Infrastructure Reliability Engineer

Bare Metal

Posted on 4/22/2025

CoreWeave

1,001-5,000 employees

GPU-accelerated cloud computing platform

Compensation Overview

$122k - $163k/yr

+ Discretionary Bonus + Equity Awards

Bellevue, WA, USA + 1 more

More locations: Sunnyvale, CA, USA

Hybrid

Remote work may be considered for candidates located more than 30 miles from an office.

Category
DevOps & Infrastructure
Required Skills
Bash
Python
Grafana
JIRA
Elasticsearch
Prometheus
Ansible
Kibana
Linux/Unix
Requirements
  • Bachelor's degree in Computer Science, Electrical Engineering, or related technical discipline
  • 5+ years of experience in hands-on management and support of complex bare metal infrastructure environments and data center operations
  • Comprehensive understanding of modern server hardware architectures, including specialized compute accelerators (GPUs) and high-speed interconnect technologies from leading high-performance computing vendors such as NVIDIA, Dell, or HPE
  • Demonstrated expertise in Linux system administration, encompassing deep familiarity with command-line operations and system configuration
  • Proficiency in at least one high-level scripting language (e.g., Python) and practical experience with infrastructure and/or network automation tools, methodologies, and frameworks (e.g., Ansible)
  • Extensive experience with modern infrastructure monitoring and logging tools such as Prometheus, Grafana, and the ELK stack (Elasticsearch, Logstash, Kibana)
  • Working knowledge of enterprise ticketing systems (e.g., Jira) and an understanding of IT Service Management (ITSM) frameworks and best practices
  • Strong analytical and problem-solving skills, with the ability to systematically diagnose and resolve complex technical issues
  • Excellent communication and collaboration abilities, with experience working effectively across multidisciplinary technical teams
  • Self-motivated and proactive, with a demonstrated sense of ownership and a commitment to ensuring infrastructure reliability and performance
  • Proven ability to manage multiple tasks and priorities effectively in a fast-paced and dynamic environment
Responsibilities
  • Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated software issues, encompassing server malfunctions, network outages, and performance degradations
  • Manage the lifecycle of our bare metal infrastructure, including overseeing deployment methodologies, executing maintenance procedures, coordinating upgrades, and managing hardware retirement processes
  • Architect and implement automation solutions through scripting and tooling to streamline repetitive operational tasks, enhance overall efficiency, and minimize manual intervention across the infrastructure
  • Lead the development and refinement of critical operational processes and comprehensive technical documentation (SOPs, TSGs, runbooks), and establish engineering best practices to bolster team effectiveness and infrastructure resilience
  • Engage in close collaboration with Software, Network, and Data Center Operations Engineering teams to facilitate effective issue resolution, contribute to strategic project planning, and ensure the cohesive operation of the entire infrastructure ecosystem
  • Serve as a key technical point of contact for hardware and software vendors, managing technical support engagements, overseeing the RMA process, and driving the resolution of complex hardware-centric challenges
  • Design, deploy, and maintain sophisticated monitoring and alerting frameworks to proactively identify and mitigate potential infrastructure anomalies and performance deviations
  • Participate actively in incident response protocols, conduct thorough root cause analyses (RCAs) for infrastructure events, and contribute to problem management strategies aimed at preventing future occurrences
  • Contribute technical expertise to, and potentially lead, infrastructure-focused projects, including new hardware deployments, critical system upgrades, and the integration of new operational tooling
  • Mentor and guide junior engineering team members, fostering technical growth and contributing to the development of internal knowledge resources and training programs
  • Maintain the integrity of hardware asset tracking and related data within our infrastructure inventory systems (e.g., Snipe-IT)
  • Adhere to and promote stringent security protocols and best practices related to infrastructure access and maintenance activities
Desired Qualifications
  • Curiosity about Kubernetes, Docker, and containerized infrastructure
  • Strong problem-solving skills with a proactive and analytical mindset
  • Excellent communication skills and a demonstrated ability to work collaboratively in a fast-paced environment

CoreWeave provides cloud computing resources tailored for GPU-accelerated workloads. It offers high-performance, pay-as-you-go access to NVIDIA GPU hardware hosted on bare-metal servers managed by Kubernetes, enabling tasks such as Generative AI, machine learning, LLM inference, VFX rendering, and pixel streaming. Users run GPU-intensive workloads on a fully managed, serverless Kubernetes platform without needing to own or manage the underlying hardware. The company differentiates itself by specializing in GPU workloads, offering a wide range of NVIDIA GPUs, and reducing operational burden through its bare-metal, Kubernetes-based infrastructure. CoreWeave’s goal is to deliver scalable, cost-efficient, high-performance infrastructure for AI, HPC, and digital content creation workloads.

Company Size

1,001-5,000

Company Stage

IPO

Headquarters

Livingston, New Jersey

Founded

2017

Simplify's Take

What believers are saying

  • Q1 2026 revenue hit $2.08B, beating $1.97B estimates amid AI demand surge.
  • NVIDIA's $2B investment funds 5GW AI infrastructure expansion by 2030.
  • $66.8B contracted backlog includes $21B Meta deal through 2032.

What critics are saying

  • NVIDIA GPU shortages delay data centers, eroding $66.8B backlog conversion.
  • Lambda Labs' 50% cheaper spot instances force CoreWeave margin cuts.
  • OpenAI partnership renegotiation wipes 20% revenue after missed targets.

What makes CoreWeave unique

  • CoreWeave operates 40+ data centers with 250,000+ GPUs on Kubernetes-native architecture.
  • First cloud provider to offer NVIDIA GB200 NVL72 rack-scale systems, in February 2025.
  • Mission Control software enables hardware performance control and verification.

Benefits

Health Insurance

Dental Insurance

Vision Insurance

Life Insurance

Disability Insurance

Health Savings Account/Flexible Spending Account

Tuition Reimbursement

Mental Health Support

Family Planning Benefits

Paid Parental Leave

Hybrid Work Options

401(k) Company Match

Unlimited Paid Time Off

Catered lunch each day in our office and data center locations

A casual work environment

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

2%

2 year growth

3%

Dealroom.co
Apr 16th, 2026
CoreWeave company information, funding & investors

CoreWeave is a specialized cloud provider delivering a massive range of GPU compute resources on demand and at scale. Here you'll find information about its funding, investors, and team.

Bloomberg L.P.
Apr 15th, 2026
Jane Street Invests $1 Billion in CoreWeave, Boosts Spending Plans

Jane Street Group, a trading firm, has taken an additional $1 billion stake in AI cloud services provider CoreWeave Inc. and plans to spend about $6 billion on the company’s technology offerings.

Yahoo Finance
Apr 14th, 2026
Nebius surges 70% YTD versus CoreWeave's 40% in AI infrastructure race

Two AI infrastructure providers, Nebius Group and CoreWeave, are competing for dominance in the GPU compute leasing market. Nebius has outperformed year-to-date, rising 70% compared to CoreWeave's 40%, though both have surged since their IPOs last March. Nebius reported fourth-quarter revenue of $227.7 million, up 547% year-over-year, and guided 2026 revenue to $3 billion to $3.4 billion. The company secured a $27 billion deal with Meta Platforms and received a $2 billion investment from Nvidia for joint infrastructure development. Nebius targets over 3 gigawatts of contracted power by year-end 2026. CoreWeave posted fiscal 2025 revenue of $5.13 billion with a revenue backlog of $66.8 billion. Analysts project 2026 revenue around $12.5 billion, roughly four times Nebius's estimate, positioning CoreWeave as the larger-scale player.

Yahoo Finance
Apr 14th, 2026
Meta signs $21B AI cloud deal with CoreWeave through 2032

CoreWeave has secured a $21 billion long-term agreement with Meta Platforms to provide AI cloud capacity through December 2032, utilising Nvidia's Vera Rubin platform. This follows an existing $14.2 billion deal with Meta through 2031. Despite recent major contracts, including a $6.5 billion agreement with OpenAI in September 2025, CRWV stock remains 40% below its June 2025 highs. The company posted Q4 2025 revenue of $1.6 billion and full-year revenue of $5.1 billion, but reported a $452 million quarterly net loss. CoreWeave faces financial challenges with $29.82 billion in total debt against just $3.16 billion in cash, resulting in interest costs representing 23.5% of revenue. Whilst the stock has gained 54% year-to-date, its heavy debt reliance raises concerns about sustainability.

YouTube
Apr 11th, 2026
Meeting the Data Center Demand

CoreWeave CTO and Co-founder Peter Salanki talks with TITV Host Akash Pasricha about the current "bottleneck of the day" in AI infrastructure and why reports of data center delays are misunderstood. We also get into the complexities of deploying Nvidia's Blackwell chips and why specialized labor, like master electricians, is becoming the industry's newest constraint.

INACTIVE