Arthur AI

Deploys and monitors ML models and LLMs

Overview

Arthur AI provides a platform to deploy and manage machine learning models and large language models. It is model- and platform-agnostic, so it works with any type of model and can be deployed on any cloud or on-premises. The main product is a monitoring platform that lets businesses view all their models in one place, track real-time metrics, optimize performance, and receive alerts when a metric crosses a threshold. It also supports collaboration, with customizable permissions to streamline stakeholder engagement. Arthur AI differentiates itself by supporting diverse models and deployments, emphasizing standards and responsible practices, maintaining strong ties to the research community, and focusing on MLOps for enterprise needs. The company's goal is to help organizations run ML models and LLMs at scale in a reliable, compliant, and collaborative way.

About Arthur AI

Simplify's Rating

Why Arthur AI is rated

B-

Rated B on Competitive Edge

Rated B on Growth Potential

Rated C on Differentiation

Industries

Data & Analytics

Enterprise Software

Cybersecurity

AI & Machine Learning

Company Size

11-50

Company Stage

Series B

Total Funding

$60.3M

Headquarters

New York City, New York

Founded

2018

Simplify's Take

What believers are saying

Continuous evaluations fit enterprises shipping customer-facing AI assistants.
Marketplace distribution can shorten procurement with cloud-native buyers.
Regulated industries value local execution, governance, and auditability.

What critics are saying

Hyperscalers can bundle native evaluation and governance into their AI stacks.
Open-source Engine commoditizes Arthur's core evaluation layer.
Feature sprawl weakens focus and invites narrower competitors.

What makes Arthur AI unique

Unified monitoring spans traditional ML, generative AI, and agentic AI.
Data-plane architecture keeps evaluations inside customer VPCs.
Open-source Engine and Bench drive developer adoption and benchmark transparency.

Funding

Total Funding

$60.3M

Above

Industry Average

Funded Over

3 Rounds

Notable Investors:

Index Ventures

Series B funding is typically for startups that have proven their business model and need more funding to expand rapidly—often by entering new markets or adding more products. Investors are usually venture capital firms that specialize in later-stage investments.

Series B Funding Comparison

Above Average

[Chart: Series B funding comparison against industry standards; industry average $35M]

Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Retirement Plan

401(k) Company Match

Professional Development Budget

Wellness Program

Unlimited Paid Time Off

Hybrid Work Options

Growth & Insights and Company News

Headcount

6 month growth

↑ 0%

1 year growth

↑ 2%

2 year growth

↓ 2%

Arthur AI

Nov 13th, 2025

From Idea to Impact: How Upsolve Built Trusted Agentic AI with Arthur

Upsolve partners with Arthur to deliver robust and reliable agentic AI capabilities.

Upsolve is an agentic platform that provides AI-driven business intelligence and analytics to its customers. Upsolve selected Arthur to help develop its Analysis AI Agent, which lets users ask questions about their data in natural language, generates data insights, and explains the reasoning behind each answer. Users can (and often do) add their own training data to help the agent better understand their schema and business context.

As Upsolve rolled these agents out across its customer base, it needed to guarantee that any performance issues were immediately diagnosed and fixed in a way that made the entire system better. Additionally, providing measurable proof that the agent could answer questions correctly made a huge difference in customers' willingness to adopt it. As a result, Upsolve engaged Arthur's Forward Deployed Engineering (FDE) program and incorporated Arthur's technology to ensure performance and build trust features directly into its product.

The Arthur FDE integrated Arthur's observability and experimentation solution, which allowed Upsolve's customers to quickly observe and improve the behavior of Upsolve's Analysis AI Agent. The Arthur platform provided a level of visibility that let Upsolve build trust with its customers that the agent was working correctly and reliably.

Even after development, Arthur's value continued to be felt. When OpenAI released GPT-5, Upsolve adopted the new model. Arthur quickly flagged a serious performance issue in the agent introduced by GPT-5, an issue that would otherwise have gone unnoticed until customers were affected.

The problem statement

As Upsolve was building this feature and taking it to market, two questions kept coming up in conversations:

* How can we ensure that the agent is working reliably as we expose it to our customers?
* How can customers trust that the agent is doing what it's supposed to and answering questions correctly?

The Upsolve agent is a complex, multi-step planning agent that performs a sequence of actions for each user query.

The solution

The Arthur Platform gave Upsolve the ability to build trust with its users and effectively measure performance, using Continuous Evals to drive the Agent Development Flywheel (see the ADLC, or Agent Development Lifecycle, diagram). By instrumenting effective, domain-specific measures of performance (that is, evals), Upsolve was able to immediately measure and improve the performance of its agent internally. Once the evals were developed, Upsolve exposed them in a customer-facing application, along with tools that let customers make tweaks and provide training examples so that they could optimize Upsolve's agent for their use case.

Setting up the baseline

At the beginning of the ADLC, the team instrumented the agent with tracing so they could observe how the agent was thinking and acting internally. This instrumentation captured key raw data such as token counts, latencies, and vector database retrieval performance, as well as the inputs and outputs of the language models, vector databases, and tools. This data was then fed into Arthur's analytics and evals to give an instant, programmatic understanding of agent performance. Using this, developers were able to move beyond low-efficacy vibe checks and quickly improve the performance of the Analysis AI Agent.

Gathering the data

Once we started measuring the baseline performance of the agent, we started capturing "gold standard" datasets. For example, if the question was "Which Olympic athlete won the most medals?", we would expect to get a SQL query that properly queried the database, a bar chart showing all of the athletes and their individual medal counts, and a final answer of "Michael Phelps". At first, this dataset was manually curated and built by the Upsolve engineers as they iterated with the agent. To make the process more scalable and methodical, they added functionality to their user interface that let them save an interaction into a gold standard dataset with a single click, using Arthur's APIs.

Creating the domain-specific evaluations

Now that the team had baseline telemetry and a gold standard dataset of expected behavior, they were in a position to use the Arthur Experiments framework via Test Runs. A Test Run is an automatic replay of all the records in a gold standard dataset against a potential deployment of an agent. By running all of the gold standard questions through a test version of the agent, the team could directly see the before-and-after impact that a change would have on the expected answers. If there were unexpected changes in the responses, the proposed agent change could be considered a regression; conversely, if there were no unexpected changes, the proposed change could be considered to have no negative impact on behavior.

Tracking experiments

Finally, now that the team had a way to compare the behavior of their agent before and after a change, we quantified the difference in performance using Domain Specific Evals. Because the agent returned three outputs (a SQL query, a chart, and a final answer), we could score the performance of a change across these three dimensions, with each score being either 0 (incorrect) or 1 (correct). This was accomplished in the Arthur platform via an LLM-as-a-Judge technique, where the prompt for the evaluation documented the evaluation criteria, for example: the SQL query should have the same semantics as the reference SQL query.

Looking forward

Arthur gives Upsolve a methodical way to deliver a robust, powerful agentic AI solution that provides immense value for its users and creates massive strategic differentiation in the marketplace. If you're a venture-backed startup building AI agents and interested in working with Arthur to ship reliable AI agents to production, apply for Arthur's startup partner program. This program offers early access to Arthur's agentic tools, dedicated support, and resources to launch your AI safely in production.
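The Test Run and scoring steps described in the case study above can be illustrated with a minimal Python sketch. It is not Arthur's API: every name here (`GoldRecord`, `AgentOutput`, `run_test`, `pass_rates`, `regressions`, `exact_match_judge`) is hypothetical, and the judge is a plain normalized string comparison standing in for the LLM-as-a-Judge prompt the article mentions. The sketch replays a gold standard dataset against an agent, scores each of the three outputs as 0 or 1, and reports which dimensions regressed relative to a baseline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The three outputs the case study says are scored for every question.
DIMENSIONS = ("sql", "chart", "answer")


@dataclass(frozen=True)
class AgentOutput:
    sql: str
    chart: str
    answer: str


@dataclass(frozen=True)
class GoldRecord:
    question: str
    expected: AgentOutput


# A judge receives (dimension, reference, candidate) and returns 1 or 0.
# In the case study this role is played by an LLM-as-a-Judge prompt.
Judge = Callable[[str, str, str], int]
Agent = Callable[[str], AgentOutput]


def exact_match_judge(dimension: str, reference: str, candidate: str) -> int:
    """Stand-in judge (assumption): case- and whitespace-insensitive equality."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return int(normalize(reference) == normalize(candidate))


def run_test(gold: List[GoldRecord], agent: Agent, judge: Judge) -> List[Dict[str, int]]:
    """Replay every gold record against the agent and score each dimension."""
    rows = []
    for record in gold:
        actual = agent(record.question)
        rows.append({
            dim: judge(dim, getattr(record.expected, dim), getattr(actual, dim))
            for dim in DIMENSIONS
        })
    return rows


def pass_rates(rows: List[Dict[str, int]]) -> Dict[str, float]:
    """Fraction of records scored 1, per dimension (0.0 for an empty run)."""
    if not rows:
        return {dim: 0.0 for dim in DIMENSIONS}
    return {dim: sum(row[dim] for row in rows) / len(rows) for dim in DIMENSIONS}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Dimensions where the candidate's pass rate is below the baseline's."""
    return [dim for dim in DIMENSIONS if candidate[dim] < baseline[dim]]
```

To use it, one would compute `pass_rates(run_test(...))` once for the current agent and once for the proposed change, then pass both results to `regressions`; an empty list corresponds to the "no negative impact" case in the article. Swapping `exact_match_judge` for a function that calls a model with written evaluation criteria would bring the sketch closer to the LLM-as-a-Judge setup described.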

PR Newswire

Apr 2nd, 2025

Arthur Open-Sources First Real-Time AI Evaluation Engine

Build. Experiment. Scale. Now With Open-Source AI Evaluation. NEW YORK, March 31, 2025 /PRNewswire/ -- AI is evolving fast, but making it work at scale remains a challenge. Today, Arthur is launching the Arthur Engine, the first open-source, real-time AI evaluation engine designed to help teams monitor, debug, and improve Generative AI and traditional ML models. No black-box monitoring

PR Newswire

Jan 23rd, 2024

Arthur Debuts Recommender System Support To Bolster The Performance Of AI-Driven Recommendation Engines For Online Businesses

Generally available today, Recommender System Support vastly improves AI-driven recommender systems, resulting in elevated customer satisfaction levels and increased revenue growth for online businesses. NEW YORK, Jan. 23, 2024 /PRNewswire/ -- Arthur, an AI performance platform trusted by some of the largest organizations in the world to ensure that their AI systems are well-managed and safely deployed, today introduced a powerful addition to its suite of AI monitoring tools: Recommender System Support. This new technology is set to revolutionize the way online businesses utilize recommender systems in the digital economy, enabling them to drive customer satisfaction levels and increase revenue growth. A vast portion of the modern internet economy is driven by AI-based recommender systems. For example, recommender systems are the engine behind the songs that play on Spotify, the movies that are suggested on Netflix, and what products are recommended on the Amazon homepage. Every advertising email delivered to an inbox, every social media post in a feed, and even which news articles are featured on a homepage are impacted by a recommender system. These systems, which analyze extensive data to predict and offer tailored product recommendations, can significantly boost customer satisfaction and revenue growth for e-commerce platforms, as well as engagement for streaming services and content providers. A major issue that exists for companies that rely on recommender systems without a good monitoring solution in place is that these systems are prone to performance problems as well as an incredible amount of data drift

The Bridge

Aug 23rd, 2023

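Arthur open-sources AI model evaluation tool "Bench," making it possible to compare which LLM to adopt

New York-based artificial intelligence (AI) startup Arthur has released Arthur Bench, an open-source tool for evaluating and comparing the performance of large language models (LLMs) such as OpenAI's GPT-3.5 Turbo and Meta's LLaMA 2. Adam Wenchel, Arthur's CEO and co-founder, said in a statement: "With Bench, we have created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes."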

HeapTalk

Aug 19th, 2023

AI performance startup Arthur introduces an open-source tool for evaluating LLMs

Founded in 2019, Arthur has secured over $60M in funding from several firms, including Acrew, Greycroft, Index Ventures, BAM Elevate, Work-Bench, and Plexo Capital.

Recently Posted Jobs

There are no jobs for Arthur AI right now.
