Full-Time

AI Researcher

Traversal

51-200 employees

AI SRE platform for autonomous remediation

Compensation Overview

$160k - $300k/yr

+ Equity

New York, NY, USA

In Person

In-office, 5 days/week; based in New York.

Category
AI & Machine Learning
Required Skills
LLM
Neural Networks
Machine Learning
Observability
Reinforcement Learning
Requirements
  • PhD in Computer Science, Electrical Engineering, Statistics, or a related technical field; demonstrated depth in LLMs, agents, or applied machine learning
  • Deep applied AI expertise, including strong working knowledge of LLMs, transformers, reinforcement learning, or neural networks in agentic systems
  • Strong judgment in model evaluation and experimental iteration to improve product accuracy and behavior
  • Strong software engineering depth, with the ability to work effectively in a complex production codebase and ship production-quality code
  • Some experience shipping AI or ML systems to production
  • Ability to run rigorous experiments, interpret results, and quickly translate learnings into product improvements
  • Startup or early-team experience, with comfort operating in ambiguous environments and building without mature infrastructure
Responsibilities
  • LLM & Agent Research: Prototype and evaluate prompting strategies, reasoning workflows, and tool-use policies for agents operating on large-scale observability data and complex troubleshooting workflows. Ship improvements to production.
  • Evaluation Design: Build and maintain eval harnesses that measure real accuracy improvements on actual customer incident types — not just benchmark scores. Own the loop from hypothesis to production measurement.
  • Cross-Team Collaboration: Work closely with AI engineers, infrastructure teams, and product leads to bring research into production and close the loop between experimentation and impact.
  • Stay on the Frontier: Track developments in LLMs, agent architectures, and AI alignment, translating insights into actionable improvements for Traversal’s domain.
  • Training & Alignment: Apply fine-tuning, reinforcement learning, and reward modeling techniques to align AI behavior with real-world SRE workflows.
  • Synthetic Data & Experimentation: Design pipelines to generate synthetic incidents and observability signals, enabling scalable training and testing in data-scarce environments.
Desired Qualifications
  • Experience in SRE, observability, or backend systems, especially when paired with strong AI/ML depth
  • Experience with RLHF, synthetic data pipelines, or LLM evaluation tooling
  • Contributions to open-source agent frameworks such as LangGraph, DSPy, or similar
  • Research experience in LLMs, agents, or reinforcement learning; publications at top-tier venues such as NeurIPS, ICML, or ICLR are a plus

Traversal provides an AI-powered platform for site reliability engineering and observability. Its AI SRE agent autonomously detects, troubleshoots, and resolves production incidents by analyzing telemetry and performing root-cause analysis to identify underlying causes. It combines large language models with causal machine learning to orchestrate real-time remediation and offers proactive health checks. It can be deployed as a standalone product or as an intelligence layer on existing observability stacks, including on-premise hosting, to serve enterprises like cloud providers and large SaaS firms, with a goal of reducing downtime and moving systems toward self-healing.

Company Size

51-200

Company Stage

Seed

Total Funding

$48M

Headquarters

New York City, New York

Founded

2023

Simplify Jobs

Simplify's Take

What believers are saying

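  • The observability market is projected to grow to $12.6B by 2028 at a 15% CAGR.
  • Amex Ventures invested $5M, and American Express will deploy Traversal's platform across its global technology infrastructure.
  • Cloudways Copilot reports over 95% accuracy and is expected to scale to as many as 4,000 investigations per day.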

What critics are saying

  • Datadog's Bits AI could consolidate the market and erode Traversal's value within 12 months.
  • Amex could build an in-house AI SRE using product knowledge within 18 months.
  • A SmartFix false positive could cause data loss and trigger SOX fines at Amex.

What makes Traversal unique

  • Traversal combines causal machine learning with LLMs for root cause analysis.
  • Production World Model unifies fragmented telemetry data across observability stacks.
  • AI SRE agent automates incident remediation in minutes for enterprises.


Benefits

Health Insurance

Flexible Work Hours

Company Equity

Company News

Business Wire
Mar 11th, 2026
Traversal hires 6 senior leaders across GTM and engineering as headcount grows 110% to 90+

Traversal, an AI lab building agents for enterprise site reliability engineering, has announced six senior leadership hires across go-to-market and engineering within a single month. The company's headcount has grown to over 90, representing a 110 per cent increase in six months. New appointments include Jim Cavanaugh as SVP of Worldwide Sales, Ryan Powers as SVP of Marketing, Patrick Wade as VP of Worldwide Field Engineering, and Maxime Petazzoni as Head of Engineering. The hires bring experience from companies including Cribl, Redis, SignalFx and Splunk. The expansion follows Traversal's recent investment from Amex Ventures and deployment across American Express. A Fortune 100 financial services case study showed 32 per cent reduction in potential mean time to resolution and 82 per cent root cause analysis accuracy.

Business Wire
Mar 5th, 2026
Traversal Announces Strategic Investment from Amex Ventures

Traversal, the frontier lab building AI agents for enterprise-grade site reliability engineering (SRE), today announced a strategic investment from Amex Vent...

SiliconANGLE Media
Mar 4th, 2026
American Express invests $5M in AI site reliability startup Traversal

American Express has partnered with and invested $5 million through Amex Ventures in Traversal, an AI-driven site reliability engineering startup founded by researchers from MIT, Columbia and Cornell. The credit card company will deploy Traversal's platform across its global technology infrastructure. Traversal uses large language models, AI agents and causal machine learning to analyse operational telemetry data across multiple monitoring systems, helping diagnose and resolve technology outages more quickly. The platform aims to automate work traditionally requiring dozens of engineers collaborating in "war rooms" during incidents. The startup has raised approximately $53 million to date. Its technology addresses fragmentation in the observability market by inferring cause-and-effect relationships across different monitoring platforms, moving beyond simple pattern detection to root cause analysis.

Traversal
Oct 14th, 2025
Cloudways Launches Self-Healing Site Reliability Solution, Powered by Traversal

At a glance
Cloudways, a leading managed cloud hosting platform, partnered with Traversal to transform its customer support and site reliability experience. Powered by Traversal's AI SRE platform, Cloudways Copilot is an end-to-end self-healing solution that enables users to identify issues and remediate them instantly with a single click. This is the first instance of self-serve site reliability as a service. Following strong adoption and positive feedback, Cloudways Copilot entered general availability in August 2025, rolling out its issue diagnostics and self-healing solution to all 845k+ customer applications.

The challenge
Cloudways - recently ranked by CNET as the number one web hosting software for developers - serves as the cloud infrastructure management platform for website hosting for digital agencies, developers, and small businesses across the globe. Like any platform that is mission critical for a diverse customer base with a broad range of technical needs, Cloudways requires a strong, responsive support workflow to ensure reliability at scale.

Prior to partnering with Traversal, Cloudways customers facing issues like slow site performance, failing services, or DDoS attacks would report their problem via chat or a helpdesk ticket and receive diagnostic commands from a support engineer. Customers would attempt to run those commands themselves and, if unsuccessful, request remote assistance. The process often involved multiple back-and-forths and long delays in resolution due to customers' varying levels of technical expertise.

To improve this experience, Cloudways partnered with Traversal to build an AI SRE with the ambitious goal of being not just a copilot for troubleshooting incidents, but an end-to-end autonomous troubleshooting and self-healing tool for the more than 845k applications hosted on the platform.

Its deployment
Traversal began as a pilot with 500 Cloudways WordPress customers. For data privacy, troubleshooting for Cloudways customers required Traversal to access machine-level logs and metrics directly, rather than reading from a centralized observability stack. Traversal AI connected with custom Cloudways endpoints - for example, Sensu for alerts and Ansible for workflows - all via a custom proxy to meet enterprise-grade guardrails, reliability, and security standards.

The resulting solution was launched in private preview as Cloudways Copilot, powered by DigitalOcean's proprietary Gradient AI platform. Its capabilities included ingesting customer context, identifying the root cause of issues, and returning recommended next steps for remediation - often within minutes. As confidence in Copilot's root cause identification grew, customers began asking for a way to apply fixes automatically. In response, Traversal Inc. launched a "SmartFix" feature, enabling users to automatically execute recommended remediations directly from the support flow with the click of a button.

Cloudways Copilot is now in General Availability and is being rolled out to all Cloudways customer applications. It is currently performing over 1,000 investigations per day, with volume expected to grow to as many as 4,000 investigations per day as rollout completes.

Traversal's impact at Cloudways
Cloudways Copilot constantly monitors the web stack, disk, inodes, and host health, detecting issues within seconds - from high-traffic anomalies like bot crawling and DDoS to system-level issues such as disk space exhaustion, full inodes, and service failures. It quickly analyzes the root cause and delivers clear, actionable recommendations, with the option to remediate automatically. This near-instant diagnosis helps recover optimal server performance with minimal effort, saving customers hours of manual troubleshooting.

"We partnered with Traversal to build an end-to-end self-healing system - from alert to remediation. With over 95% accuracy, we can for the first time enable self-service reliability for our thousands of customers, instead of hours of frustrating back-and-forth with support - potentially saving millions in downtime and SRE costs." - Suhaib Zaheer, SVP & GM of Managed Hosting, Cloudways

"With Copilot monitoring our servers and 47 applications, we identify problems before clients even experience issues - like getting automated insights that pinpoint exactly which applications are causing problems."

"Cloudways Copilot & AI is a game-changer for reducing the amount of time spent taking care of your web server. It is the first good implementation of AI I've seen in a web host that actually makes my life as an agency owner easier."

"Cloudways Copilot has transformed how we manage 180+ sites, saving our team 15 hours in just the last month. Instead of spending hours debugging, we now get detailed breakdowns that help us quickly resolve problems."

Inside a real incident
At 2:07 PM, a WordPress site hosted by a web development company managing hundreds of sites on Cloudways began to slow down. Pages were timing out, CPU usage spiked, and some users saw 502 and 524 errors, but the root cause wasn't immediately clear. Normally, Cloudways Support would step in on behalf of the customer - spending 60-90 minutes collecting logs, isolating the issue, and coordinating with engineers. This time, the alert was handled by Traversal's AI SRE, streamlining the response without any manual triage:
  • 2:08 PM - Traversal began investigating on behalf of the customer.
  • 2:10 PM - It identified a set of abusive IPs overwhelming the site and outlined the root cause.
  • 2:12 PM - It proposed a self-healing action: block the malicious IPs and restart affected services, with UI-guided steps and a full remediation summary.
  • 2:13 PM - With a single click, the issue was resolved - end to end, in under 5 minutes.

What would've taken hours was handled autonomously by Traversal, enabling Cloudways to respond to customer issues faster and more reliably - without manual triage or escalation.
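
The 2:10 PM step in the timeline describes identifying a set of abusive IPs overwhelming a site. The source does not say how Traversal does this. As a rough illustration only, the Python sketch below shows one simple heuristic: count requests per client IP and flag any IP that is both high-volume and a large share of total traffic. The function name `flag_abusive_ips`, the thresholds, the sample log lines, and the assumption that each access-log line starts with the client IP are all hypothetical and not taken from Traversal or Cloudways.

```python
from collections import Counter


def flag_abusive_ips(log_lines, min_requests=100, min_share=0.2):
    """Return a sorted list of IPs that exceed both thresholds.

    Assumes (hypothetically) that each log line starts with the client IP
    followed by a space, as in common web server access-log formats.
    """
    # Count requests per leading IP token, skipping blank lines.
    counts = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    total = sum(counts.values())
    if total == 0:
        return []
    # Flag IPs that are high-volume AND a large fraction of all traffic.
    return sorted(
        ip
        for ip, n in counts.items()
        if n >= min_requests and n / total >= min_share
    )


# Made-up sample traffic: one noisy client and one ordinary client.
sample = ['203.0.113.7 - - "GET /wp-login.php"'] * 300
sample += ['198.51.100.4 - - "GET /"'] * 20

print(flag_abusive_ips(sample))  # ['203.0.113.7']
```

Requiring both an absolute count and a traffic share is a deliberate choice in this sketch: it avoids flagging a lone visitor on a quiet site as well as a busy but proportionate client on a popular one. A production system would likely draw on many more signals than this.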

Traversal
Oct 7th, 2025
Eventbrite Turns to Traversal's AI SRE to Overcome Complexity of Legacy Systems

To address these issues, Eventbrite partnered with Traversal to cut through this complexity and provide clearer visibility into its infrastructure, with the goal of automating its incident response.