Full-Time

Forward Deployed Engineer

Posted on 9/17/2025

Resolve

51-200 employees

AI-powered platform autonomously resolves production incidents

No salary listed

H1B Sponsorship Available

San Francisco, CA, USA

In Person

Category
Software Engineering (1)
Required Skills
LLM
Kubernetes
AWS
Requirements
  • Are a strong software engineer with a passion for real-world systems problems
  • Have experience with AWS, other major cloud providers, and Kubernetes
  • Are comfortable navigating production-scale environments
  • Thrive in fast-moving, high-ambiguity settings and take ownership from problem to production
  • Enjoy working with customers and seeing your code create immediate impact
  • Want to shape how AI transforms the engineering workflow, from debugging to decision making
Responsibilities
  • Go deep. Investigate complex customer environments, debug distributed systems, trace performance bottlenecks, and design fixes that scale.
  • Build for impact. Turn customer pain into durable code, automation, and reliability features that strengthen the platform.
  • Be the customer’s voice. Surface clear insights from the field that inform the roadmap and product direction.
  • Lead with AI and ship code. Use AI-powered reasoning and diagnostics to find root causes quickly, then make targeted code changes to fix problems and scale solutions. This requires an in-depth understanding of the product architecture and our AI agents.
  • Adopt an eval-driven approach. Continuously evaluate and improve how our AI agents diagnose, resolve, and automate customer issues, ensuring reliable and explainable outcomes.
  • Drive adoption. Identify patterns, measure outcomes, and create repeatable frameworks that increase adoption across large, complex environments.
  • Partner across teams. Work closely with product, platform, and reliability engineering to deliver solutions that make Resolve indispensable.

Resolve provides an AI-powered platform that functions as an automated production engineer to troubleshoot and fix software issues after they are deployed. The system works by integrating with tools like AWS and GitHub to analyze telemetry data and source code, using multiple AI agents to identify root causes and suggest fixes through natural language. Unlike traditional monitoring tools that only alert humans to problems, Resolve is designed to autonomously manage and resolve 80% of production alerts without manual intervention. The company’s goal is to reduce the time spent on manual operations by creating a system that can independently maintain software reliability.

Company Size

51-200

Company Stage

Series A

Total Funding

$160M

Headquarters

San Francisco, California

Founded

2024

Simplify Jobs

Simplify's Take

What believers are saying

  • Enterprise compliance demand for human-in-the-loop controls creates TAM in financial services and healthcare.
  • Datadog Marketplace integrations pressure observability vendors, opening partnership and integration opportunities.
  • Dhruv Mahajan's Meta Llama expertise accelerates domain-specific model development beyond competitor capabilities.

What critics are saying

  • Observability vendors integrate auto-remediation directly, commoditizing Resolve's core value within 12–18 months.
  • False-positive auto-remediations cause cascading failures, triggering liability and customer churn among risk-averse enterprises.
  • Split-tier Series A pricing signals investor skepticism; 250x ARR multiple unsustainable if pilot conversion rates fall.

What makes Resolve unique

  • Founded by OpenTelemetry co-creators with 20-year collaboration history and prior exits.
  • Resolve AI Labs builds domain-specific models for production reasoning, not generic foundation models.
  • Multi-agent system reasons across code, infrastructure, and telemetry simultaneously for root cause analysis.

Benefits

Company Social Events

Growth & Insights and Company News

Headcount

6 month growth

-3%

1 year growth

6%

2 year growth

15%
VentureBeat
Feb 27th, 2026
Enterprise MCP adoption is outpacing security controls

AI agents now carry more access and more connections to enterprise systems than any other software in the environment. That makes them a bigger attack surface than anything security teams have had to govern before, and the industry doesn't yet have a framework for it.
"If that attack vector gets utilized, it can result in a data breach, or even worse," said Spiros Xanthos, founder and CEO of Resolve AI, speaking at a recent VentureBeat AI Impact Series event.
Traditional security frameworks are built around human interactions. There's not yet an agreed-upon construct for AI agents that have personas and can work autonomously, noted Jon Aniano, SVP of product and CRM applications at Zendesk, at the same event. Agentic AI is moving faster than enterprises can build guardrails - and Model Context Protocol (MCP), while decreasing integration complexity, is making the problem worse.
"Right now it's an unsolved problem because it's the wild, wild West," Aniano said. "We don't even have a defined technical agent-to-agent protocol that all companies agree on. How do you balance user expectations versus what keeps your platform safe?"
MCP still "extremely permissive"
Enterprises are increasingly hooking into MCP servers because they simplify integration between agents, tools and data. However, MCP servers tend to be "extremely permissive," he said. They are "actually probably worse than an API," he contended, because APIs at least have more controls in place to impose upon agents.
Today's agents are acting on behalf of humans based on explicit permissions, thus establishing human accountability. "But you might have tens, hundreds of agents in the future with their own identity, their own access," said Xanthos. "It becomes a very complex matrix."
Even as his startup is developing autonomous AI agents for site reliability engineering (SRE) and system management, he acknowledged that the industry "completely lacks the framework" for autonomous agents. "It's completely on us and to anybody who builds agents to figure out what restrictions to give them," he said. And customers must be able to trust those decisions.
Some existing security tools do offer fine-grained access - Splunk, for instance, developed a method to provide access to certain indexes in underlying data stores, he noted - but most are broader and human-oriented. "We're trying to figure this out with existing tools," he said. "But I don't think they're sufficient for the era of agents."
Who's accountable when an AI mis-authenticates a user?
At Zendesk and other customer relationship management (CRM) platform providers, AI is involved in a number of user interactions, Aniano noted - in fact, now it's at a "volume and a scale that we haven't contemplated as businesses and as a society." It can get tricky when AI is helping out human agents; the audit trail can become a labyrinth.
"So now you've got a human talking to a human that's talking to an AI," Aniano noted. "The human tells the AI to take action. Who's at fault if it's the wrong action?" This becomes even more complicated when there are "multiple pieces of AI and multiple humans" in the mix.
To prevent agents from going off the rails, Zendesk tends to be "very strict" about access and scope; however, customers can define their own guardrails based on their needs. In most cases, AI can access knowledge sources, but they're not writing code or running commands on servers, Aniano said. If an AI does call an API, it is "declaratively designed" and sanctioned, and actions are specifically called out. However, customer demand is flooding these scenarios and "we're kind of holding the gates right now," he said. The industry must develop concrete standards for agent interactions.
"We're entering a world where, with things like MCP that can auto-discover tools, we're going to have to create new methods of safety for deciding what tools these bots can interact with," said Aniano.
When it comes to security, enterprises are rightly concerned when AI takes over authentication tasks, such as sending out and processing one-time passwords (OTP), SMS codes, or other two-step verification methods, he said. What happens if an AI mis-authenticates or misidentifies someone? This can lead to sensitive data leakage or open the door for attackers.
"There's a spectrum now, and the end of that spectrum today is a human," Aniano said. However, "the end of that spectrum tomorrow might be a specialized agent designed to do the same kind of gut feeling or human-level interaction."
Customers themselves are on a spectrum of adoption and comfort. In certain companies - particularly financial services or other highly-regulated environments - humans still must be involved in authentication, Aniano noted. In other cases, legacy companies or old guards only trust humans to authenticate other humans. He noted that Zendesk is experimenting with new AI agents that are "a little more connected to systems," and working with a select group of customers around guardrailing.
Standing authorization is coming
In some future, agents may actually be more trusted than humans to do some tasks, and granted permissions "way beyond" what humans have today, Xanthos said. But we're a long way from that, and, for the most part, the fear of something going wrong is what's holding enterprises back. "Which is a good fear, right? I'm not saying that it is a bad thing," he said.
Many enterprises simply aren't yet comfortable with an agent doing all steps of a workflow or fully closing the loop by itself. They still want human review.
Resolve AI is on the cusp of giving agents standing authorization in a few cases that are "generally safe," such as in coding; from there they'll move to more open-ended scenarios that are not all that risky, Xanthos explained. But he acknowledged that there will always be very risky situations where AI mistakes could "mutate the state of the production system," as he put it.
Ultimately, though: "There's no going back, obviously; this is moving faster than maybe even mobile did. So the question is what do we do about it?"
What security teams can do now
Both speakers pointed to interim measures available within existing tooling. Xanthos noted that some tools - Splunk among them - already offer fine-grained index-level access controls that can be applied to agents. Aniano described Zendesk's approach as a practical starting point: declaratively designed API calls with explicitly sanctioned actions, strict access and scope limits, and human review before expanding agent permissions. The underlying principle, as Aniano put it: "We're always checking those gates and seeing how we can widen the aperture" - meaning don't grant standing authorization until you've validated each expansion.
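The interim measures described above (declared actions, scope limits, and human review before an expansion is trusted) can be sketched roughly as follows. This is an illustrative sketch only, not how Zendesk, Resolve AI, or MCP implement access control; the registry, action names, scope strings, and the `approve` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class SanctionedAction:
    """One explicitly declared action an agent may request (hypothetical schema)."""
    name: str
    required_scopes: FrozenSet[str]
    needs_human_review: bool  # stays True until this expansion has been validated


# Hypothetical registry: anything not declared here is denied by default.
REGISTRY: Dict[str, SanctionedAction] = {
    "read_knowledge_base": SanctionedAction(
        "read_knowledge_base", frozenset({"kb:read"}), needs_human_review=False
    ),
    "restart_service": SanctionedAction(
        "restart_service", frozenset({"prod:write"}), needs_human_review=True
    ),
}


def authorize(action: str, agent_scopes: FrozenSet[str],
              approve: Callable[[str], bool]) -> bool:
    """Allow an action only if it is declared, in scope, and approved when required."""
    spec = REGISTRY.get(action)
    if spec is None:
        return False  # undeclared (for example, auto-discovered) tools are rejected
    if not spec.required_scopes <= agent_scopes:
        return False  # the agent lacks at least one required scope
    if spec.needs_human_review:
        return approve(action)  # defer to a human reviewer callback
    return True
```

In this sketch, "widening the aperture" would correspond to flipping `needs_human_review` to `False` for one action at a time, after that action has been validated.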

Pixegias
Feb 5th, 2026
AI SRE Resolve AI confirms $125M raise, unicorn valuation

Resolve AI, a startup automating the work of site reliability engineering (SRE), aka troubleshooting system failures, has announced a $125 million Series A at a $1 billion valuation. The round was led by Lightspeed Venture Partners, with participation from existing investors including Greylock Partners, Unusual Ventures, Artisanal Ventures, and A*. The announcement confirms TechCrunch's December report that the startup was raising at a billion-dollar valuation led by Lightspeed. Sources told TechCrunch at the time that the round may have consisted of multiple tranches, at different prices, which could have put the company's actual blended valuation below $1 billion. A spokesperson for Resolve denied that there were multiple tranches in the round, saying that 100% of the equity was purchased at a valuation of $1 billion. As Pixegias, Inc. previously reported, this kind of structure allows certain investors, often the lead, to purchase a significant portion of equity at a lower price. Resolve was co-founded in early 2024 by two former Splunk executives, Spiros Xanthos and Mayank Agarwal. Their previous startup, Omnition, was acquired by Splunk in 2019. Another startup applying AI to identify and resolve system outages is the Sequoia-backed Traversal. The emerging category is known as AI SRE.

TechCrunch
Feb 4th, 2026
AI SRE Resolve AI confirms $125M raise, unicorn valuation | TechCrunch

The two-year-old startup confirms that it closed a Series A led by Lightspeed at a $1 billion valuation.

FinSMEs
Feb 4th, 2026
Resolve AI Raises $125M in Series A Funding at $1B Valuation

Resolve AI, a San Francisco, CA-based SRE & Engineering startup, raised $125M in Series A funding at $1B valuation. The round was led by Lightspeed Venture Partners, with existing investors Greylock Partners, Unusual Ventures, Artisanal Ventures, and A*. The company intends to use the funds to accelerate product development, expand the engineering and go-to-market teams, and support growing enterprise adoption. Founded by observability pioneers Spiros Xanthos and Mayank Agarwal (co-creators of OpenTelemetry; prior exits to Splunk and VMware), Resolve AI focuses on autonomous Site Reliability Engineering (SRE), acting as an "AI Production Engineer" to help software teams manage complex cloud environments. It autonomously investigates production incidents, identifies root causes, and suggests (or executes) remediations. Resolve AI builds a dynamic "knowledge graph" of a company's infrastructure (AWS, Kubernetes, etc.) and uses agentic AI to troubleshoot issues in minutes instead of hours. Customers include Coinbase, DoorDash, Salesforce, Zscaler, MongoDB, and MSCI.

Tech in Asia
Dec 20th, 2025
Lightspeed said to lead series A of US AI startup at $1b valuation

Resolve AI, a startup developing an autonomous site reliability engineer (SRE) tool, saw some equity in its series A round led by Lightspeed Venture Partners sold at a US$1 billion valuation, according to three people familiar with the deal. The rest of the round was acquired at a lower price. Founded less than two years ago by former Splunk executive Spiros Xanthos and former Splunk chief architect Mayank Agarwal, the company automates the process of identifying and resolving software system issues. Resolve AI's annual recurring revenue is about US$4 million, according to two people with knowledge of the matter. The startup previously raised US$35 million in seed funding in October 2024 from Greylock and others.
Food for thought
$1B headline hides split-tier pricing and cautious bets on AI site reliability engineering (SRE).
  • Resolve AI's series A used split pricing. Some shares cleared at the $1B headline, others sold cheaper. That signals investors playing it safe on AI SRE adoption (software that automates reliability and incident response tasks) despite the hype.
  • With about $4M in Annual Recurring Revenue (ARR), the 250x multiple looks extreme even for AI infrastructure. The $1B figure likely priced only a slice of the round, not all new capital.
  • The two-tier setup echoes late-stage private deals. Headline numbers help recruit and win customers. The real dilution and cash follow tighter terms, which gives institutions downside protection.
Observability vendors feel integration pressure as enterprises pilot auto-remediation in 2025.
  • Resolve AI and Traversal raised over $80M for AI SRE. Many enterprises will trial auto-remediation this year, software that diagnoses and fixes production incidents. That creates demand for guardrails and integration layers that curb cascading failures.
  • Third-party developers can build incident response automations in the Datadog Marketplace (an app store for Datadog, a cloud monitoring platform). It already lists workflow integrations like Blink and InsightFinder that tie observability data to remediation.
  • IT consultancies can focus on financial services and healthcare, where compliance requires human-in-the-loop controls (required human approvals for automated changes). Pitch integration work that links AI SRE tools with current incident management frameworks before auto-remediation becomes standard.

INACTIVE