Full-Time

Staff Full-Stack Product Engineer

Traversal

51-200 employees

AI SRE platform for autonomous remediation

Compensation Overview

$150k - $300k/yr

+ Equity

New York, NY, USA

In Person

Category
Software Engineering
Required Skills
Kubernetes
FastAPI
Python
React.js
Postgres
TypeScript
AWS
Redis
REST APIs
Requirements
  • Significant experience building and operating complex full-stack systems in production environments
  • Strong proficiency in Python and modern backend frameworks such as FastAPI or similar
  • Deep experience building modern frontend applications using React and TypeScript
  • Experience designing and operating scalable backend services and APIs
  • Experience deploying and operating applications in cloud environments (AWS), including containerized workloads using ECS or Kubernetes
  • Experience working with data-intensive systems, including relational databases such as PostgreSQL and distributed storage systems
  • Strong product instincts and the ability to translate technical capabilities into intuitive user-facing functionality
  • Ability to operate effectively in early-stage startup environments with high ownership and evolving priorities
Responsibilities
  • Product Ownership: Drive end-to-end development of core product capabilities, translating AI-driven insights into intuitive workflows that empower engineers and reduce cognitive load
  • Technical Architecture: Design scalable system architectures spanning frontend, backend services, and data infrastructure to support real-time observability and automated operations
  • API & Platform Development: Design and implement high-performance APIs and service layers that power the product and enable seamless integration between AI systems, backend services, and the user interface
  • User Experience: Build highly interactive interfaces for exploring operational data, rapidly iterating with users to refine product workflows and usability
  • Data Systems: Design efficient strategies for storing, processing, and retrieving large-scale operational data using technologies like PostgreSQL, Redis, and distributed systems
  • AI Integration: Collaborate with AI engineers and researchers to operationalize model outputs in production systems, building reliable pipelines, authentication systems, and platform abstractions that enable AI-powered features
  • Technical Leadership: Help shape engineering best practices, influence technical direction, and raise the bar for system design, performance, and maintainability across the codebase
Desired Qualifications
  • Experience building data-heavy interfaces, including time-series visualization or operational dashboards
  • Experience working with observability platforms, infrastructure tooling, or developer platforms
  • Familiarity with AI/LLM-powered products or agentic systems
  • Background working on large-scale, distributed, or data-driven applications

Traversal provides an AI-powered platform for site reliability engineering and observability. Its AI SRE agent autonomously detects, troubleshoots, and resolves production incidents by analyzing telemetry and performing root-cause analysis. It combines large language models with causal machine learning to orchestrate real-time remediation and offers proactive health checks. It can be deployed as a standalone product or as an intelligence layer on existing observability stacks, including on-premises hosting, to serve enterprises such as cloud providers and large SaaS firms, with the goal of reducing downtime and moving systems toward self-healing.

Company Size

51-200

Company Stage

Seed

Total Funding

$48M

Headquarters

New York City, New York

Founded

2023

Simplify Jobs

Simplify's Take

What believers are saying

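  • The observability market is projected to grow to $12.6B by 2028 at a 15% CAGR.
  • Amex Ventures invested $5M, and American Express will deploy the platform across its global technology infrastructure.
  • Cloudways Copilot reports over 95% accuracy and is expected to scale to as many as 4,000 investigations per day.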

What critics are saying

  • Datadog's Bits AI could consolidate the market and erode Traversal's value within 12 months.
  • Amex could build an in-house AI SRE using its product knowledge within 18 months.
  • A SmartFix false positive could cause data loss and trigger SOX fines at Amex.

What makes Traversal unique

  • Traversal combines causal machine learning with LLMs for root cause analysis.
  • Production World Model unifies fragmented telemetry data across observability stacks.
  • AI SRE agent automates incident remediation in minutes for enterprises.

Benefits

Health Insurance

Flexible Work Hours

Company Equity

Company News

Business Wire
Mar 11th, 2026
Traversal hires 6 senior leaders across GTM and engineering as headcount grows 110% to 90+

Traversal, an AI lab building agents for enterprise site reliability engineering, has announced six senior leadership hires across go-to-market and engineering within a single month. The company's headcount has grown to over 90, representing a 110 per cent increase in six months. New appointments include Jim Cavanaugh as SVP of Worldwide Sales, Ryan Powers as SVP of Marketing, Patrick Wade as VP of Worldwide Field Engineering, and Maxime Petazzoni as Head of Engineering. The hires bring experience from companies including Cribl, Redis, SignalFx and Splunk. The expansion follows Traversal's recent investment from Amex Ventures and deployment across American Express. A Fortune 100 financial services case study showed 32 per cent reduction in potential mean time to resolution and 82 per cent root cause analysis accuracy.

Business Wire
Mar 5th, 2026
Traversal Announces Strategic Investment from Amex Ventures

Traversal, the frontier lab building AI agents for enterprise-grade site reliability engineering (SRE), today announced a strategic investment from Amex Vent...

SiliconANGLE Media
Mar 4th, 2026
American Express invests $5M in AI site reliability startup Traversal

American Express has partnered with and invested $5 million through Amex Ventures in Traversal, an AI-driven site reliability engineering startup founded by researchers from MIT, Columbia and Cornell. The credit card company will deploy Traversal's platform across its global technology infrastructure. Traversal uses large language models, AI agents and causal machine learning to analyse operational telemetry data across multiple monitoring systems, helping diagnose and resolve technology outages more quickly. The platform aims to automate work traditionally requiring dozens of engineers collaborating in "war rooms" during incidents. The startup has raised approximately $53 million to date. Its technology addresses fragmentation in the observability market by inferring cause-and-effect relationships across different monitoring platforms, moving beyond simple pattern detection to root cause analysis.

Traversal
Oct 14th, 2025
Cloudways Launches Self-Healing Site Reliability Solution, Powered by Traversal

At a glance

Cloudways, a leading managed cloud hosting platform, partnered with Traversal to transform its customer support and site reliability experience. Powered by Traversal's AI SRE platform, Cloudways Copilot is an end-to-end self-healing solution that enables users to identify issues and remediate them instantly with a single click. This is the first instance of self-serve site reliability as a service. Following strong adoption and positive feedback, Cloudways Copilot entered into general availability in August 2025, rolling out its issue diagnostics and self-healing solution to all 845k+ customer applications.

The challenge

Cloudways - recently ranked by CNET as the number one web hosting software for developers - serves as the cloud infrastructure management platform for website hosting for digital agencies, developers, and small businesses across the globe. Like any platform that is mission critical for a diverse customer base with a broad range of technical needs, Cloudways requires a strong, responsive support workflow to ensure reliability at scale.

Prior to partnering with Traversal, Cloudways customers facing issues like slow site performance, failing services, or DDoS attacks would report their problem via chat or a helpdesk ticket and receive diagnostic commands from a support engineer. Customers would attempt to run those commands themselves and, if unsuccessful, request remote assistance. The process often involved multiple back-and-forths and long delays in resolution due to customers' varying levels of technical expertise.

To improve this experience, Cloudways partnered with Traversal to build an AI SRE with the ambitious goal of being not just a copilot for troubleshooting incidents, but an end-to-end autonomous troubleshooting and self-healing tool for the more than 845k applications hosted on the platform.

Its deployment

Traversal began as a pilot with 500 Cloudways WordPress customers. For data privacy, troubleshooting for Cloudways customers required Traversal to access machine-level logs and metrics directly, rather than reading from a centralized observability stack. Traversal AI connected with custom Cloudways endpoints - for example, Sensu for alerts and Ansible for workflows - all via a custom proxy to meet enterprise-grade guardrails, reliability, and security standards.

The resulting solution was launched in private preview as Cloudways Copilot, powered by DigitalOcean's proprietary Gradient AI platform. Its capabilities included ingesting customer context, identifying the root cause of issues, and returning recommended next steps for remediation - often within minutes.

As confidence in Copilot's root cause identification grew, customers began asking for a way to apply fixes automatically. In response, Traversal Inc. launched a "SmartFix" feature, enabling users to automatically execute recommended remediations directly from the support flow with the click of a button.

Cloudways Copilot is now in General Availability and is being rolled out to all Cloudways customer applications. It is currently performing over 1,000 investigations per day, with volume expected to grow to as many as 4,000 investigations per day as rollout completes.

Traversal's impact at Cloudways

Cloudways Copilot constantly monitors the web stack, disk, inodes, and host health, detecting issues within seconds - from high-traffic anomalies like bot crawling and DDoS to system-level issues such as disk space exhaustion, full inodes, and service failures. It quickly analyzes the root cause and delivers clear, actionable recommendations, with the option to remediate automatically. This near-instant diagnosis helps recover optimal server performance with minimal effort, saving customers hours of manual troubleshooting.

"We partnered with Traversal to build an end-to-end self-healing system - from alert to remediation. With over 95% accuracy, we can for the first time enable self-service reliability for our thousands of customers, instead of hours of frustrating back-and-forth with support - potentially saving millions in downtime and SRE costs." - Suhaib Zaheer, SVP & GM of Managed Hosting, Cloudways

"With Copilot monitoring our servers and 47 applications, we identify problems before clients even experience issues - like getting automated insights that pinpoint exactly which applications are causing problems."

"Cloudways Copilot & AI is a game-changer for reducing the amount of time spent taking care of your web server. It is the first good implementation of AI I've seen in a web host that actually makes my life as an agency owner easier."

"Cloudways Copilot has transformed how we manage 180+ sites, saving our team 15 hours in just the last month. Instead of spending hours debugging, we now get detailed breakdowns that help us quickly resolve problems."

Inside a real incident

At 2:07 PM, a WordPress site hosted by a web development company managing hundreds of sites on Cloudways began to slow down. Pages were timing out, CPU usage spiked, and some users saw 502 and 524 errors, but the root cause wasn't immediately clear. Normally, Cloudways Support would step in on behalf of the customer - spending 60-90 minutes collecting logs, isolating the issue, and coordinating with engineers. This time, the alert was handled by Traversal's AI SRE, streamlining the response without any manual triage:

  • 2:08 PM - Traversal began investigating on behalf of the customer.
  • 2:10 PM - It identified a set of abusive IPs overwhelming the site and outlined the root cause.
  • 2:12 PM - It proposed a self-healing action: block the malicious IPs and restart affected services, with UI-guided steps and a full remediation summary.
  • 2:13 PM - With a single click, the issue was resolved - end to end, in under 5 minutes.

What would've taken hours was handled autonomously by Traversal, enabling Cloudways to respond to customer issues faster and more reliably - without manual triage or escalation.

Traversal
Oct 7th, 2025
Eventbrite Turns to Traversal's AI SRE to Overcome Complexity of Legacy Systems

To address these issues, Eventbrite partnered with Traversal to cut through this complexity and provide clearer visibility into their complex infrastructure, towards the goal of automating their incident response.