
Arthur AI provides a platform to deploy and manage machine learning models and large language models. It is model- and platform-agnostic, so it can work with any type of model and be deployed on any cloud or on-premises. The main product is a monitoring platform that lets businesses view all their models in one place, track real-time metrics, optimize performance, and receive alerts when a metric crosses a threshold. It also supports collaboration with customizable permissions to streamline stakeholder engagement. Arthur AI differentiates itself by supporting diverse models and deployments, emphasizing standards and responsible practices, maintaining strong ties to the research community, and focusing on MLOps for enterprise needs. The company's goal is to help organizations run ML models and LLMs at scale in a reliable, compliant, and collaborative way.
Industries
Data & Analytics
Enterprise Software
Cybersecurity
AI & Machine Learning
Company Size
11-50
Company Stage
Series B
Total Funding
$60.3M
Headquarters
New York City, New York
Founded
2018
Total Funding
$60.3M (above industry average), funded over 3 rounds
Industry standards
Health Insurance
Dental Insurance
Vision Insurance
401(k) Retirement Plan
401(k) Company Match
Professional Development Budget
Wellness Program
Unlimited Paid Time Off
Hybrid Work Options
From idea to impact: how Upsolve built trusted Agentic AI with Arthur
Upsolve partners with Arthur to deliver robust and reliable agentic AI capabilities
Upsolve is an agentic platform that provides AI-driven business intelligence and analytics to its customers. Upsolve selected Arthur to help develop its Analysis AI Agent, which lets users ask questions about their data in natural language, generates data insights, and explains the reasoning behind each answer. Users can (and often do) add their own training data to help the agent better understand their schema and business context.
Upsolve needed to guarantee that, as it rolled these Agents out across its customer base, any performance issues were immediately diagnosed and fixed in a way that made the entire system better. Additionally, providing measurable proof that the Agent could answer customers' questions correctly made a huge difference in their willingness to adopt it.
As a result, Upsolve engaged Arthur's Forward Deployed Engineering (FDE) program and incorporated Arthur's technology to ensure performance and build trust features directly into its product. The Arthur FDE integrated Arthur's observability and experimentation solution, which allowed Upsolve's customers to quickly observe and improve the behavior of Upsolve's Analysis AI Agent. The visibility the Arthur platform provided allowed Upsolve to show its customers that the Agent was working correctly and reliably.
Arthur continued to deliver value after development. When OpenAI released GPT-5, Upsolve adopted the new model. Arthur quickly flagged a serious performance issue in the Agent introduced by GPT-5, an issue that would otherwise have gone unnoticed until customers were affected.
The problem statement
As Upsolve was building this feature and taking it to market, two questions kept coming up in conversations:
* How can we ensure that the Agent is working reliably as we expose it to our customers?
* How can customers trust that the Agent is doing what it's supposed to and answering questions correctly?
The Upsolve Agent is a complex, multi-step planning Agent which performs the following actions for each user query:
The solution
The Arthur Platform gave Upsolve the ability to build trust with their users and effectively measure performance using Continuous Evals to drive the critical Agent Development Flywheel (see ADLC diagram). By instrumenting effective, domain-specific measures of performance (i.e., evals), Upsolve was able to immediately measure and improve the performance of their Agent internally. Once the evals were developed, Upsolve exposed them in a customer-facing application, along with tools that allowed customers to make tweaks and provide training examples so that they could optimize Upsolve's Agent for their use case.
Setting up the baseline
At the beginning of the ADLC, the team instrumented the agent with tracing so they could observe how the agent was thinking and acting internally. This instrumentation captured key raw data such as token counts, latencies, and vector database retrieval performance, as well as the inputs and outputs of the language models, vector databases, and tools. This data was then fed into Arthur's analytics and evals to give an instant, programmatic understanding of Agent performance. Using this, developers were able to move beyond low-efficacy vibe checks and quickly improve the performance of the Analysis AI Agent.
Gathering the data
Once we started measuring the baseline performance of the agent, we started capturing "gold standard" datasets.
So if the question was "Which Olympic athlete won the most medals?", we would expect to get a SQL query that properly queried the database, a bar chart showing all of the athletes and their individual medal counts, and the final answer to be "Michael Phelps". At first, this dataset was manually curated and built by the Upsolve engineers as they iterated with the agent. To make it more scalable and methodical, they added functionality to their user interface that allowed them to save interactions into a gold standard dataset with a single click using Arthur's APIs.
Creating the domain-specific evaluations
Now that the team had baseline telemetry and a gold standard dataset of expected behavior, they were in a position to use the Arthur Experiments framework via Test Runs. A Test Run is an automatic replay of all the records in a gold standard dataset against a potential deployment of an agent. By running all of the gold standard questions through a test version of the agent, the team could directly compare the before-and-after impact of a change on the expected answers. If the responses changed unexpectedly, the proposed agent change could be considered a regression; conversely, if they did not, the proposed change could be considered to have no negative impact on the agent's behavior.
Tracking experiments
Finally, now that the team had a way to compare the behavior of their agent before and after a change, we quantified the difference in performance using Domain Specific Evals. Because the agent returned three outputs (a SQL query, a chart, and a final answer), we could score each change across these three dimensions, with a value of 0 for incorrect and 1 for correct.
This was accomplished in the Arthur platform via an LLM-as-a-Judge technique, where the prompt for the evaluation documented the evaluation criteria, for example: the SQL query should have the same semantics as the reference SQL query.
Looking Forward
Arthur gives Upsolve a methodical way to deliver a robust, powerful Agentic AI solution that provides immense value for their users and creates massive strategic differentiation in the marketplace for them.
If you're a venture-backed startup building AI Agents and interested in working with Arthur to ship reliable AI agents to production, apply for Arthur's startup partner program. This program offers early access to Arthur's agentic tools, dedicated support, and resources to launch your AI safely in production. Learn more about the Agent Development Lifecycle and how you can implement it in your organization here!
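The Test Run and 0/1 scoring steps described in the case study can be illustrated with a minimal sketch. Everything here is hypothetical: the names `GoldRecord`, `replay_dataset`, `stand_in_judge`, and `regressions`, the example SQL, and the two toy agents are assumptions made for illustration and are not Arthur's or Upsolve's API. In particular, the judge below is a simple normalized string comparison standing in for the LLM-as-a-Judge step; it cannot decide whether two differently written SQL queries have the same semantics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The three outputs the case study says are scored for each question.
DIMENSIONS: Tuple[str, ...] = ("sql", "chart", "answer")


@dataclass
class GoldRecord:
    """One saved interaction: a question plus the expected outputs."""
    question: str
    expected: Dict[str, str]  # keyed by DIMENSIONS


Agent = Callable[[str], Dict[str, str]]
Judge = Callable[[str, str, str], int]  # (dimension, expected, actual) -> 0 or 1


def stand_in_judge(dimension: str, expected: str, actual: str) -> int:
    """Placeholder for an LLM judge: case- and whitespace-insensitive match."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return int(normalize(expected) == normalize(actual))


def replay_dataset(agent: Agent, dataset: List[GoldRecord], judge: Judge) -> Dict[str, float]:
    """Replay every gold record against an agent; return mean score per dimension."""
    if not dataset:
        raise ValueError("gold standard dataset is empty")
    totals = {dim: 0 for dim in DIMENSIONS}
    for record in dataset:
        output = agent(record.question)
        for dim in DIMENSIONS:
            totals[dim] += judge(dim, record.expected[dim], output.get(dim, ""))
    return {dim: totals[dim] / len(dataset) for dim in DIMENSIONS}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Dimensions where the candidate agent scores lower than the baseline."""
    return [dim for dim in DIMENSIONS if candidate[dim] < baseline[dim]]


# Illustrative gold record; the SQL text is invented for this example.
GOLD = [
    GoldRecord(
        question="Which Olympic athlete won the most medals?",
        expected={
            "sql": "SELECT athlete, COUNT(*) AS medals FROM medals "
                   "GROUP BY athlete ORDER BY medals DESC",
            "chart": "bar",
            "answer": "Michael Phelps",
        },
    )
]


def baseline_agent(question: str) -> Dict[str, str]:
    # Toy agent that reproduces the expected outputs (lower-cased SQL).
    expected = GOLD[0].expected
    return {"sql": expected["sql"].lower(), "chart": "bar", "answer": "Michael Phelps"}


def candidate_agent(question: str) -> Dict[str, str]:
    # Toy agent after a proposed change: it now picks the wrong chart type.
    output = baseline_agent(question)
    output["chart"] = "pie"
    return output


baseline_scores = replay_dataset(baseline_agent, GOLD, stand_in_judge)
candidate_scores = replay_dataset(candidate_agent, GOLD, stand_in_judge)
print(regressions(baseline_scores, candidate_scores))
```

In this sketch the candidate agent would be flagged as a regression on the chart dimension only. A real setup would swap `stand_in_judge` for a model-backed judge whose prompt carries the evaluation criteria.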
Build. Experiment. Scale. Now With Open-Source AI Evaluation.
NEW YORK, March 31, 2025 /PRNewswire/ -- AI is evolving fast, but making it work at scale remains a challenge. Today, Arthur is launching the Arthur Engine, the first open-source, real-time AI evaluation engine designed to help teams monitor, debug, and improve Generative AI and traditional ML models. No black-box monitoring
Generally available today, Recommender System Support vastly improves AI-driven recommender systems, resulting in elevated customer satisfaction levels and increased revenue growth for online businesses
NEW YORK, Jan. 23, 2024 /PRNewswire/ -- Arthur, an AI performance platform trusted by some of the largest organizations in the world to ensure that their AI systems are well-managed and safely deployed, today introduced a powerful addition to its suite of AI monitoring tools: Recommender System Support. This new technology is set to revolutionize the way online businesses utilize recommender systems in the digital economy, enabling them to drive customer satisfaction levels and increase revenue growth.
A vast portion of the modern internet economy is driven by AI-based recommender systems. For example, recommender systems are the engine behind the songs that play on Spotify, the movies that are suggested on Netflix, and what products are recommended on the Amazon homepage. Every advertising email delivered to an inbox, every social media post in a feed, and even which news articles are featured on a homepage are impacted by a recommender system. These systems, which analyze extensive data to predict and offer tailored product recommendations, can significantly boost customer satisfaction and revenue growth for e-commerce platforms, as well as engagement for streaming services and content providers.
A major issue that exists for companies that rely on recommender systems without a good monitoring solution in place is that these systems are prone to performance problems as well as an incredible amount of data drift
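Arthur, a New York-based artificial intelligence (AI) startup, has released Arthur Bench, an open-source tool for evaluating and comparing the performance of large language models (LLMs) such as OpenAI's GPT-3.5 Turbo and Meta's LLaMA 2. Adam Wenchel, Arthur's CEO and co-founder, said in a statement: "With Bench, we have created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes."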
Founded in 2019, Arthur has secured over $60M in funding from several firms, including Acrew, Greycroft, Index Ventures, BAM Elevate, Work-Bench, and Plexo Capital.