Full-Time

Staff Production Engineer

Operational Excellence

Crusoe

501-1,000 employees

Harnesses flare gas for on-site HPC

Compensation Overview

$209k - $253k/yr

+ Bonus + RSUs (Equity)

Locations: San Francisco, CA, USA; Sunnyvale, CA, USA

In Person

Category
DevOps & Infrastructure (2)
Required Skills
Kubernetes
Python
Grafana
Computer Networking
OpenTelemetry
AWS
Go
Prometheus
Terraform
Observability
Ansible
C/C++
Linux/Unix
Google Cloud Platform
Requirements
  • Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 8+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
  • Demonstrated experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
  • Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
  • Deep knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
  • Strong understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
  • Proven track record with incident management practices and reliability frameworks (SRE, ITIL, or similar)
  • Hands-on experience with monitoring and observability tools such as Prometheus and Grafana
  • Experience with infrastructure-as-code and configuration management tools such as Terraform or Ansible
  • Proficiency in scripting or programming with languages such as Go, Python, C, or C++
  • Exceptional communication skills and the ability to influence and collaborate across engineering teams
  • Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
  • A growth mindset and strong commitment to reliability engineering, automation, and operational excellence
Responsibilities
  • Lead cross-functional efforts to define and evolve availability metrics for Crusoe's cloud platform, including establishing, measuring, and improving SLIs and SLOs
  • Drive production incident response, diagnosing and resolving service disruptions while leading post-incident reviews and root cause analysis
  • Architect, operate, and improve observability across Crusoe's infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
  • Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
  • Design and develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
  • Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
  • Define and champion operational processes, knowledge sharing, and reliability best practices across the engineering organization
  • Mentor and grow junior and mid-level engineers, helping build technical depth across the team
Desired Qualifications
  • Experience leading Kubernetes or container orchestration platforms at scale
  • Exposure to change management processes, operational readiness reviews, or structured root cause analysis
  • Experience designing self-healing systems, automated remediation, or event-driven operational tooling
  • Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
  • Passion for mentorship, growing teams, and developing deep expertise in Production Engineering

Crusoe Energy Systems captures wasted flare gas from oil and gas sites and uses it to generate electricity on-site, powering modular data centers for high-performance computing tasks. This setup enables clients to run compute-heavy workloads such as artificial intelligence, machine learning, and cryptocurrency mining using fuel captured from the same site. The company’s model has two parts: helping energy producers reduce flaring by turning stranded gas into usable power, and selling cloud computing services to customers around the world. In short, Crusoe pairs energy capture with on-site computing to provide affordable, low-carbon computing power. Unlike traditional data-center operators, Crusoe owns and controls the energy-to-compute chain—from gas capture to data processing—creating a direct link between the energy and technology sectors and offering a clear environmental benefit alongside access to compute resources.

Company Size

501-1,000

Company Stage

Series D

Total Funding

$1.1B

Headquarters

Denver, Colorado

Founded

2018

Simplify Jobs

Simplify's Take

What believers are saying

  • Google joined Crusoe in November 2025 as part of a $40B Texas investment.
  • Crusoe Cloud's MemoryAlloy delivers 9.9x faster AI inference.
  • DRO unlocks revenue from underutilized renewables for partners.

What critics are saying

  • Texas could revoke the Goodnight air permits, halting construction within 3 months.
  • CoreWeave could undercut pricing by 30% and seize 70% of the AI market within 6 months.
  • Google could terminate the partnership within 6 months after an emissions backlash.

What makes Crusoe unique

  • Crusoe captures flare gas to power modular AI data centers on-site.
  • Digital Flare Mitigation (DFM) achieves 99.9% combustion efficiency, versus 91.1% for flares.
  • Vertically integrates energy capture with HPC and cloud computing.


Benefits

Industry competitive pay

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Paid life insurance, short-term and long-term disability

Parental leave

Stock options in a fast-growing, well-funded technology company

Pet-friendly offices

Teladoc

401(k) with a 4% match

Unlimited time off

Cell phone reimbursement

Tuition reimbursement

Company-paid commuter benefit of $100 per month

Calm

Company News

Condé Nast
Apr 2nd, 2026
Google-backed Texas data center to emit 4.5M tons of CO₂ yearly via private gas plant

A Google-backed data center in Texas will be partly powered by natural gas turbines emitting 4.5 million tonnes of greenhouse gases annually, equivalent to adding over 970,000 petrol cars to the road. The Goodnight campus in Armstrong County is being built by AI infrastructure company Crusoe, which Google joined in November as part of a $40 billion Texas investment. According to state air permit applications, the facility's fifth and sixth buildings will use private, off-grid gas power, though Google says it has no contract in place for this energy. The project reflects a broader trend: data centers are driving a US natural gas boom, with nearly 100 gigawatts of gas-fired power currently in development solely for such facilities. Several planned projects would emit even more, with OpenAI and Oracle's New Mexico facility potentially generating 14 million tonnes yearly.

Crusoe
Dec 24th, 2024
Crusoe Closes $600M in Series D Round at $2.8 Billion Valuation to Power AI

Crusoe is on a mission to align the future of computing with the future of the climate.

Benzinga
Oct 29th, 2024
Crusoe Energy Secures $500M Investment

Crusoe Energy, a data center startup, secured a $500 million equity investment led by Peter Thiel's Founders Fund, valuing the company at approximately $3 billion. The investment supports Crusoe's expansion in AI infrastructure and coincides with a $3.4 billion deal with Blue Owl Capital for a new data center in Texas. Crusoe uses waste natural gas to power its centers, reducing emissions. Thiel's Founders Fund continues to invest in the AI and cryptocurrency sectors.

VentureBeat
Jul 4th, 2024
From Code to Impact: Crusoe's Hackathon Reveals AI's Power to Drive Change in Energy and Beyond

In a 24-hour hackathon hosted by Crusoe Energy and Lowercarbon Capital, developers demonstrated the remarkable speed at which AI can tackle longstanding challenges in the clean energy sector. The event, held in San Francisco on June 28-29, 2024, showcased how AI tools can compress months or years of traditional work into mere hours, potentially revolutionizing clean energy deployment. The winning team, Verdigris, exemplified this swift transformation by developing an AI system that addresses key barriers in home electrification. Their tool analyzes mortgage data to identify qualified homeowners for zero-cost upgrades and generates personalized marketing materials, including AI-created images of homes with proposed improvements. This level of personalization and automation could significantly accelerate the adoption of home energy upgrades. Verdigris's system integrates with bank databases to access mortgage information, income data, and property details.

VentureBeat
Jun 24th, 2024
How Gradient Created an Open LLM with a Million-Token Context Window

In a recent collaboration, AI startup Gradient and cloud compute platform Crusoe extended the "context window" of Llama-3 models to 1 million tokens. The context window determines the number of input and output tokens a large language model (LLM) can process. Big tech companies and frontier AI labs are locked in a race to extend the context windows of their LLMs. In less than a year, models have gone from supporting a few thousand tokens to more than a million.