Full-Time

Senior / Staff Site Reliability Engineer

Observability

Posted on 10/31/2025

FluidStack

51-200 employees

High-performance GPU cloud for AI workloads

Compensation Overview

$175k - $320k/yr

+ Equity (Stock Options)

Locations: Seattle, WA, USA | San Francisco, CA, USA | Austin, TX, USA | New York, NY, USA

In Person

Category
DevOps & Infrastructure (2)
Required Skills
TCP/IP
Bash
Kubernetes
Python
Grafana
Go
Prometheus
Terraform
Linux/Unix
Helm
Requirements
  • 7+ years total experience
  • 3+ years as a Site Reliability Engineer focused on observability at high scale (≥ 100 million metric series, 10+ TB/day of logs)
  • Expertise operating the Grafana stack in production: Prometheus/Mimir, Loki, Tempo, Grafana, Alertmanager
  • Hands-on Kubernetes proficiency (Helm/Kustomize, custom CRDs, multi-cluster federation)
  • Infrastructure-as-Code fluency with Terraform (or Pulumi) for bare-metal plus cloud provisioning
  • Strong coding ability in Go (preferred) plus Python/Bash for automation, exporters, and custom controllers
  • Design and governance of SLOs / SLIs and alerting strategies that minimize false positives and engineer toil
  • Proven track record tuning observability pipelines for high availability, cardinality control, and cost efficiency
  • Deep Linux systems/debug skills (cgroups, namespaces, networking, filesystems) plus TCP/IP and TLS fundamentals
  • On-call ownership mindset and experience leading incident response and post-mortems for production outages
  • Clear, empathetic communication with both customers and internal engineering teams; comfortable in fast-moving, ambiguous environments
Responsibilities
  • Design, deploy, and operate the telemetry stack that monitors the global AI cloud and supports ML workloads
  • Optimize cost and performance of the telemetry stack while ensuring reliability and debuggability
  • Enable teams and customers to quickly detect, debug, and resolve production issues
  • Work closely with platform and infrastructure teams to ensure telemetry coverage for Kubernetes, Slurm, and distributed training jobs
Desired Qualifications
  • Experience instrumenting GPU-dense / HPC clusters (NVIDIA A-/H-series, NVSwitch, DGX, RoCE, RDMA)
  • Familiarity with Slurm, Ray, or Kubernetes-native batch schedulers for distributed ML training
  • Hands-on with eBPF, Cilium, or Hubble for low-overhead networking observability
  • OpenTelemetry adoption/migration projects across metrics, logs, and traces
  • Operating service meshes (Istio, Linkerd) and Envoy-based telemetry
  • Observability for edge or globally distributed footprints (EU/US/APAC PoPs, WAN optimization)
  • FinOps / cost-allocation tooling (Kubecost, Cloudability) integrated into dashboards and alerts
  • Security monitoring overlap (Falco, AWS GuardDuty, auditd pipelines)
  • Contributions to CNCF or Grafana Labs OSS projects; public talks or blog posts on observability at scale
  • Knowledge of high-performance storage and data planes (Ceph, NVMe-oF, Lustre) and their metrics
  • Familiarity with Kafka / ClickHouse / VictoriaMetrics as part of custom telemetry back-ends

FluidStack provides GPU-based cloud infrastructure for artificial intelligence workloads, delivering large-scale NVIDIA GPU clusters through a neocloud model. The platform offers automated provisioning and a centralized orchestration layer that hides hardware complexity, with native support for Kubernetes and Slurm and proprietary monitoring to track power usage and hardware health. It targets AI labs, research institutions, and enterprise tech teams that need scalable, pay-as-you-go access to high-performance compute without owning data centers. The company's goal is to make it easy for organizations to train, develop, and deploy complex machine learning models by providing reliable, scalable GPU resources on demand.

Company Size

51-200

Company Stage

Late Stage VC

Total Funding

$11B

Headquarters

New York City, New York

Founded

2017

Simplify Jobs

Simplify's Take

What believers are saying

  • Anthropic's $50 billion deal builds custom data centers in New York and Texas.
  • Coatue's Next Frontier JV funds 430MW Indiana campus online by December 2026.
  • $750 million raise at $7 billion valuation accelerates US expansion, creating 1,000 jobs.

What critics are saying

  • CoreWeave undercuts Fluidstack's pricing, capturing ex-OpenAI researchers within 6-12 months.
  • $1 billion round at $18 billion valuation fails by July 2026, causing liquidity crunch.
  • Google terminates Indiana lease if Fluidstack defaults on $5.7 billion bonds by 2028.

What makes FluidStack unique

  • Fluidstack delivers zero-setup multi-thousand GPU clusters for AI researchers from OpenAI and DeepMind.
  • Lighthouse platform enables proactive monitoring and automated remediation without customer intervention.
  • HIPAA, GDPR, ISO 27001, and SOC 2 Type 2 compliance secures regulated AI labs and enterprises.

Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Retirement Plan

Company Equity

Unlimited Paid Time Off

Growth & Insights and Company News

Headcount

6 month growth

0%

1 year growth

-9%

2 year growth

-4%

Bloomberg L.P.
Apr 14th, 2026
Fluidstack Seeks $1 Billion in New Funding at $18 Billion Valuation

The cloud-computing startup Fluidstack Ltd. is holding funding talks with investors to bring in about $1 billion at a target valuation of $18 billion, according to people briefed on the matter.

Yahoo Finance
Apr 6th, 2026
UK data centre startup Fluidstack raises $750M, hits $7B valuation for US AI expansion

Fluidstack, a London-founded data centre startup, has been valued at $7 billion after raising over $750 million in funding. The company, established in 2017 by Gary Wu, Cesar Maklary and James Cox, is building AI infrastructure across America. The startup relocated its headquarters from London to New York in December to focus on US customers, creating over 1,000 jobs. New investors include Situational Awareness, an AI hedge fund founded by former OpenAI employee Leopold Aschenbrenner. Fluidstack is backed by Google, which has provided a $1.8 billion backstop to the company's data centre lease obligations and is reportedly in talks for an equity stake. The company is also working with Anthropic to build up to $50 billion of AI data facilities across New York and Texas.

Telegraph Media Group
Apr 5th, 2026
UK data centre giant raises $750m for US expansion

City sources say Fluidstack could still secure additional funding as start-up hits $7bn valuation

Yahoo Finance
Mar 20th, 2026
Fluidstack scraps $11.5B French data center for US expansion backed by $50B Anthropic deal

Fluidstack has abandoned an $11.5 billion data centre project in northern France to focus on US expansion, according to Bloomberg. The operator is relocating its global headquarters from the UK to New York and exited a secondary facility near Paris used by Mistral. The move could prove beneficial for Bitcoin miners partnering with Fluidstack. Hut 8, TeraWulf and Cipher Mining have signed deals with the firm over the past six months. Hut 8's 15-year agreement to build a 245-megawatt Louisiana site with Fluidstack and Anthropic generates $7 billion in revenue, potentially rising to $17.7 billion with expansion clauses. Fluidstack's US expansion includes a $50 billion master agreement with Anthropic to operate compute clusters across New York, Texas and other states.

Yahoo Finance
Mar 15th, 2026
Google-backed Fluidstack signs $7B, 15-year AI lease with Hut 8 as miner pivots to data centres

Hut 8 Corp has signed a 15-year, $7 billion IT capacity lease with Google-backed Fluidstack at its River Bend campus, marking a strategic shift from pure Bitcoin mining towards AI and data centre infrastructure. The company also sold a 310MW natural gas power plant portfolio to refocus capital. The deal is part of Hut 8's broader push to build 245MW to 2,295MW of AI data centre capacity with blue-chip clients. The company is carving out legacy mining operations into American Bitcoin whilst developing an 8,500MW infrastructure pipeline. Hut 8's narrative projects $767.3 million revenue and $140.6 million earnings by 2028, requiring 76.9% yearly revenue growth. Some analysts expect the company to reach $1.1 billion in revenue by 2028, though execution risks and potential dilution from capital-intensive expansion remain key concerns.

INACTIVE