Full-Time

Network Observability Senior Engineer

CoreWeave

1,001-5,000 employees

GPU-accelerated cloud computing platform

No salary listed

Livingston, NJ, USA + 3 more

More locations: New York, NY, USA | Bellevue, WA, USA | Sunnyvale, CA, USA

In Person

Category
DevOps & Infrastructure (1)
Required Skills
Bash
Kubernetes
Python
Grafana
Jinja2
Go
Prometheus
Ansible
Linux/Unix
Requirements
  • Bachelor’s degree in Computer Engineering, Electrical Engineering, Computer Science, or a related field.
  • Deep familiarity with Prometheus, Grafana, Alertmanager, gNMI, and SNMP. Experience writing or extending custom metric collectors/exporters is a plus.
  • Experience as a Network Engineer, SRE, Software Developer, or Systems Administrator in large-scale environments. A track record of building and operating robust telemetry and monitoring solutions is a plus.
  • Passion for automating tasks and processes. You find satisfaction in creating workflows that handle repetitive tasks and reduce human error to near zero.
  • Comfortable containerizing solutions and designing, building, and deploying container-based workloads efficiently on Kubernetes.
  • Proficient with Python, Go, and Bash, plus familiarity with configuration management and templating tools (e.g., Ansible, Jinja2).
  • Strong knowledge of Linux systems and IP networking concepts, with hands-on experience in routing, switching, and network troubleshooting.
  • Practical knowledge of a variety of network platforms, including Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux.
  • Collaborative, humble, and always ready to help others while staying open to learning from more senior colleagues.
Responsibilities
  • Develop, optimize, and maintain network observability platforms. Use your skills in Python and Golang to create and automate collectors, exporters, and dashboards that provide deep visibility into network health and performance.
  • Collaborate with Network Engineering and Platform teams to ingest and unify logs, metrics, and events from a variety of platforms (Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, SR Linux, etc.) into a single observability pipeline.
  • Design and implement scalable telemetry solutions using protocols like gNMI, SNMP, and streaming analytics. Ensure advanced alerting and anomaly detection with tools such as Prometheus, Grafana, and Alertmanager.
  • Work closely with network developers, site reliability engineers, and security teams to integrate observability solutions across the broader infrastructure. Participate in design discussions, RFCs, and architectural decisions.
  • Join a rotating on-call schedule to troubleshoot and resolve observability-related issues. Provide timely support to operations teams, quickly isolating and fixing problems when they arise.
  • Guide junior team members, share best practices, and foster a culture of continuous learning and improvement within the observability domain.
Desired Qualifications
  • Machine Learning for Anomaly Detection: Hands-on experience applying ML techniques or tools (e.g., TensorFlow, scikit-learn) to proactively detect performance or security anomalies in network traffic.
  • Network Certifications: Certifications like CCNA, CCNP, or similar.
  • Advanced Metrics & Analytics: Hands-on experience with data pipelines, event correlation, or anomaly detection in large-scale environments.
  • Distributed Tracing: Familiarity with OpenTelemetry, Jaeger, or Zipkin for end-to-end tracing across microservices and network components.

CoreWeave provides cloud computing resources tailored for GPU-accelerated workloads. It offers high-performance, pay-as-you-go access to NVIDIA GPU hardware hosted on bare-metal servers managed by Kubernetes, enabling tasks such as Generative AI, machine learning, LLM inference, VFX rendering, and pixel streaming. Users run GPU-intensive workloads on a fully managed, serverless Kubernetes platform without needing to own or manage the underlying hardware. The company differentiates itself by specializing in GPU workloads, offering a wide range of NVIDIA GPUs, and reducing operational burden through its bare-metal, Kubernetes-based infrastructure. CoreWeave’s goal is to deliver scalable, cost-efficient, high-performance infrastructure for AI, HPC, and digital content creation workloads.

Company Size

1,001-5,000

Company Stage

IPO

Headquarters

Livingston, New Jersey

Founded

2017

Simplify's Take

What believers are saying

  • Strong demand for frontier AI infrastructure supports long-duration contracts and backlog growth.[7]
  • European expansion adds localized capacity for regulated customers and lower-latency AI workloads.
  • Mission Control and managed operations create upsell potential beyond raw GPU rental.[1][7]

What critics are saying

  • CoreWeave's dependence on NVIDIA hardware exposes it to supply, pricing, and roadmap shifts.[1][2]
  • General cloud and GPU-cloud rivals compress pricing and weaken its speed-to-capacity advantage.[3][8]
  • Heavy debt-funded expansion creates refinancing and liquidity risk if capacity utilization slows.

What makes CoreWeave unique

  • CoreWeave is a specialized cloud for NVIDIA GPU workloads, not general-purpose compute.[2][4]
  • Its bare-metal, Kubernetes-native stack targets AI training, inference, VFX, and HPC.[1][2]
  • CoreWeave emphasizes rapid access to the latest NVIDIA hardware and full-stack AI infrastructure.[2][4]

Benefits

Health Insurance

Dental Insurance

Vision Insurance

Life Insurance

Disability Insurance

Health Savings Account/Flexible Spending Account

Tuition Reimbursement

Mental Health Support

Family Planning Benefits

Paid Parental Leave

Hybrid Work Options

401(k) Company Match

Unlimited Paid Time Off

Catered lunch each day in our office and data center locations

A casual work environment

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

3%

2 year growth

4%
MarketSpeaker
May 19th, 2026
Google and Blackstone launch AI infrastructure venture to challenge Nvidia.

Google and Blackstone launched a new U.S.-based AI infrastructure company built around Google's TPU chips, aiming to compete with Nvidia and cloud computing firms like CoreWeave.

Google and Blackstone announced the launch of a new artificial intelligence infrastructure venture designed to compete directly with Nvidia in the rapidly expanding AI computing market. The U.S.-based company will provide cloud infrastructure powered by Google's proprietary TPU chips, which are specifically designed for training and running advanced neural networks and AI models. Analysts describe the initiative as one of the most significant attempts yet to challenge Nvidia's dominance in AI accelerators and high-performance computing infrastructure. The project is also expected to compete directly with AI cloud providers such as CoreWeave, which have benefited heavily from surging demand for AI computing capacity.

Google expands TPU ecosystem

Google has spent years developing its Tensor Processing Units internally to support products including search, Gemini, cloud services, and AI model training. The partnership with Blackstone signals a broader effort to commercialize Google's AI hardware ecosystem at much larger scale. Investor interest in AI infrastructure has accelerated dramatically as demand for compute power continues outpacing available supply across cloud and data center markets. Analysts note that TPU-based systems could provide an alternative for enterprises seeking to reduce dependence on Nvidia GPUs, which currently dominate the AI hardware landscape. The venture may also help Google strengthen the position of its cloud business by integrating proprietary AI hardware directly into large-scale enterprise computing services.

Competition for AI infrastructure intensifies

The announcement highlights how competition in artificial intelligence is increasingly shifting from software applications toward underlying infrastructure and compute capacity. Major technology companies and investment firms are now racing to secure access to chips, electricity, networking systems, and data center resources needed to support AI growth. Blackstone's involvement underscores how private capital is flowing aggressively into AI infrastructure as investors view computing capacity as one of the world's most valuable strategic assets. At the same time, Nvidia remains the dominant player in the AI accelerator market, with its GPUs continuing to power much of the global AI ecosystem. Still, growing demand and supply constraints are creating opportunities for alternative hardware platforms and cloud providers to expand market share. The broader takeaway is that the AI race is evolving into a battle over infrastructure ownership, where chips, data centers, and compute resources are becoming as strategically important as the AI models themselves.

PT Bumi Santosa Cemerlang
May 17th, 2026
Nvidia takes $3.66B stake in CoreWeave to expand AI infrastructure beyond GPUs

Nvidia has increased its stake in AI cloud infrastructure company CoreWeave to 11%, valued at approximately $3.66 billion, as it expands its strategy beyond GPU manufacturing. The investment ties Nvidia's future to AI cloud infrastructure growth. CoreWeave has secured major contracts with Meta, Jane Street, Anthropic and Perplexity AI, demonstrating strong market demand despite current losses and stock price challenges. The company specialises in AI infrastructure services. Nvidia's investment represents a strategic shift towards shaping and financing the broader AI ecosystem. The move signals the chipmaker's ambition to drive long-term growth through infrastructure investments alongside its core chip sales business.

Dealroom.co
Apr 16th, 2026
CoreWeave company information, funding & investors

CoreWeave is a specialized cloud provider delivering a massive range of GPU compute resources on demand and at scale. Here you'll find information about its funding, investors, and team.

Bloomberg L.P.
Apr 15th, 2026
Jane Street Invests $1 Billion in CoreWeave, Boosts Spending Plans

Jane Street Group, a trading firm, has taken an additional $1 billion stake in AI cloud services provider CoreWeave Inc. and plans to spend about $6 billion on the company’s technology offerings.

Yahoo Finance
Apr 14th, 2026
Nebius surges 70% YTD versus CoreWeave's 40% in AI infrastructure race

Two AI infrastructure providers, Nebius Group and CoreWeave, are competing for dominance in the GPU compute leasing market. Nebius has outperformed year-to-date, rising 70% compared to CoreWeave's 40%, though both have surged since their IPOs last March. Nebius reported fourth-quarter revenue of $227.7 million, up 547% year-over-year, and guided 2026 revenue to $3 billion to $3.4 billion. The company secured a $27 billion deal with Meta Platforms and received a $2 billion investment from Nvidia for joint infrastructure development. Nebius targets over 3 gigawatts of contracted power by year-end 2026. CoreWeave posted fiscal 2025 revenue of $5.13 billion with a revenue backlog of $66.8 billion. Analysts project 2026 revenue around $12.5 billion, roughly four times Nebius's estimate, positioning CoreWeave as the larger-scale player.