Industries: Enterprise Software, AI & Machine Learning
Company Size: 11-50
Company Stage: Series A
Total Funding: $20M
Headquarters: New York City, New York
Founded: 2023
Patronus AI provides tools that help businesses and organizations use artificial intelligence (AI) safely and confidently. Their products focus on AI safety, helping clients understand and manage the risks associated with AI technologies. The tools are designed to be adaptable, allowing clients to adjust their usage based on their specific needs. Patronus AI operates on a subscription-based model, where clients pay for access to these safety tools, ensuring a steady income for the company. What sets Patronus AI apart from competitors is its strong emphasis on customer relationships and a commitment to understanding client needs, which fosters trust and long-term partnerships. The company's goal is to create a trustworthy AI landscape, enabling clients to integrate AI into their operations effectively while prioritizing safety.
Total Funding: $20M (meets industry average; raised over 2 rounds)
Industry-standard benefits:
Health Insurance
Dental Insurance
Vision Insurance
401(k) Retirement Plan
Unlimited Paid Time Off
Patronus AI launched a new monitoring platform today that automatically identifies failures in AI agent systems, targeting enterprise concerns about reliability as these applications grow more complex.

The San Francisco-based AI safety startup's new product, Percival, positions itself as the first solution capable of automatically identifying various failure patterns in AI agent systems and suggesting optimizations to address them.

"Percival is the industry's first solution that automatically detects a variety of failure patterns in agentic systems and then systematically suggests fixes and optimizations to address them," said Anand Kannappan, CEO and co-founder of Patronus AI, in an exclusive interview with VentureBeat.

AI agent reliability crisis: Why companies are losing control of autonomous systems

Enterprise adoption of AI agents, software that can independently plan and execute complex multi-step tasks, has accelerated in recent months, creating new management challenges as companies try to ensure these systems operate reliably at scale.

Unlike conventional machine learning models, these agent-based systems often involve lengthy sequences of operations where errors in early stages can have significant downstream consequences.

"A few weeks ago, we published a model that quantifies how likely agents can fail, and what kind of impact that might have on the brand, on customer churn and things like that," Kannappan said. "There's a constant compounding error probability with agents that we're seeing."

This issue becomes particularly acute in multi-agent environments where different AI systems interact with one another, making traditional testing approaches increasingly inadequate.

Episodic memory innovation: How Percival's AI agent architecture revolutionizes error detection

Percival differentiates itself from other evaluation tools through its agent-based architecture and what the company calls "episodic memory," the ability to learn from previous errors and adapt to specific workflows.

The software can detect more than 20 different failure modes across four categories: reasoning errors, system execution errors, planning and coordination errors, and domain-specific errors.

"Unlike an LLM as a judge, Percival itself is an agent and so it can keep track of all the events that have happened throughout the trajectory," explained Darshan Deshpande, a researcher at Patronus AI. "It can correlate them and find these errors across contexts."
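Kannappan's compounding-error point can be made concrete with a little arithmetic: if an agent completes each step of a trajectory with some independent per-step success probability p, the whole n-step run succeeds with probability p^n, so even very reliable steps erode quickly over long trajectories. A minimal sketch (the 99% per-step reliability and step counts below are illustrative assumptions, not Patronus figures):

```python
# Illustrative only: per-step reliability compounds multiplicatively over a trajectory.
def trajectory_success_rate(per_step_success: float, num_steps: int) -> float:
    """Probability an n-step agent run completes with no step failing,
    assuming independent, identically reliable steps."""
    return per_step_success ** num_steps

for steps in (5, 20, 50, 100):
    print(f"{steps:>3} steps at 99% per-step reliability -> "
          f"{trajectory_success_rate(0.99, steps):.1%} end-to-end")
# 5 steps -> 95.1%, 20 -> 81.8%, 50 -> 60.5%, 100 -> 36.6%
```

Under these assumptions a 100-step trajectory fails almost two-thirds of the time, which is why long, multi-agent workflows are hard to test with conventional spot checks.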
For enterprises, the most immediate benefit appears to be reduced debugging time. According to Patronus, early customers have reduced the time spent analyzing agent workflows from about one hour to between one and 1.5 minutes.

TRAIL benchmark reveals critical gaps in AI oversight capabilities

Alongside the product launch, Patronus is releasing a benchmark called TRAIL (Trace Reasoning and Agentic Issue Localization) to evaluate how well systems can detect issues in AI agent workflows.

Research using this benchmark revealed that even sophisticated AI models struggle with effective trace analysis, with the best-performing system scoring only 11% on the benchmark.

The findings underscore the challenging nature of monitoring complex AI systems and may help explain why large enterprises are investing in specialized tools for AI oversight.

Enterprise AI leaders embrace Percival for mission-critical agent applications

Early adopters include Emergence AI, which has raised approximately $100 million in funding and is developing systems where AI agents can create and manage other agents.

"Emergence's recent breakthrough—agents creating agents—marks a pivotal moment not only in the evolution of adaptive, self-generating systems, but also in how such systems are governed and scaled responsibly," said Satya Nitta, co-founder and CEO of Emergence AI, in a statement sent to VentureBeat.

Nova, another early customer, is using the technology for a platform that helps large enterprises migrate legacy code through AI-powered SAP integrations. These customers typify the challenge Percival aims to solve.
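The coverage above does not spell out TRAIL's scoring protocol, so the following is only a rough sketch of what "issue localization" scoring could look like: a system gets credit only when it both names the right failure category and points at the right step of the trace. The schema and exact-match rule are assumptions for illustration, not the benchmark's actual metric:

```python
# Hypothetical scoring sketch: an issue counts as "localized" only if the
# predicted failure category matches ground truth AND points at the same step.
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    category: str   # e.g. "reasoning", "system_execution", "planning", "domain"
    step: int       # index of the offending event in the agent trace

def localization_score(predicted: set[Issue], gold: set[Issue]) -> float:
    """Fraction of ground-truth issues exactly matched by a prediction."""
    if not gold:
        return 1.0
    return len(predicted & gold) / len(gold)

gold = {Issue("reasoning", 3), Issue("system_execution", 7), Issue("planning", 12)}
pred = {Issue("reasoning", 3), Issue("system_execution", 9)}   # one hit, one near miss
print(f"localization score: {localization_score(pred, gold):.0%}")  # 33%
```

A metric this strict makes an 11% top score more plausible: a judge must find the error, classify it, and pin it to the right point in a long trace.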
Patronus AI announced today the launch of what it calls the industry's first multimodal large language model-as-a-judge (MLLM-as-a-Judge), a tool designed to evaluate AI systems that interpret images and produce text.

The new evaluation technology aims to help developers detect and mitigate hallucinations and reliability issues in multimodal AI applications. E-commerce giant Etsy has already implemented the technology to verify caption accuracy for product images across its marketplace of handmade and vintage goods.

"Super excited to announce that Etsy is one of our ship customers," said Anand Kannappan, cofounder of Patronus AI, in an exclusive interview with VentureBeat. "They have hundreds of millions of items in their online marketplace for handmade and vintage products that people are creating around the world. One of the things that their AI team wanted to be able to leverage generative AI for was the ability to auto-generate image captions and to make sure that as they scale across their entire global user base, that the captions that are generated are ultimately correct."

Why Google's Gemini powers the new AI judge rather than OpenAI

Patronus built its first MLLM-as-a-Judge, called Judge-Image, on Google's Gemini model after extensive research comparing it with alternatives like OpenAI's GPT-4V.

"We tended to see that there was a slighter preference toward egocentricity with GPT-4V, whereas we saw that Gemini was less biased in those ways and had more of an equitable approach to being able to judge different kinds of input-output pairs," Kannappan explained.
E-commerce giant Etsy already leveraging technology to reduce AI hallucinations in product image captions

SAN FRANCISCO, March 13, 2025 /PRNewswire/ -- Patronus AI today announced the launch of the industry's first Multimodal LLM-as-a-Judge (MLLM-as-a-Judge), a groundbreaking evaluation capability that enables developers to score and optimize multimodal AI systems for image-to-text applications.

The new Judge-Image tool, powered by Google Gemini, allows AI engineers to iteratively measure and improve the quality of their multimodal AI applications by scanning for text presence, grid structure, spatial orientation, and object identification.

"Our mission has always been to advance scalable oversight of AI," said Anand Kannappan, CEO and Co-founder of Patronus AI. "With the release of GPT-4o, Claude Opus, and Google's Gemini over the last year, organizations have invested heavily in image generation to drive customer value. However, as these AI experiences scale, so does the unpredictability of LLM systems. Our MLLM-as-a-Judge addresses this critical challenge by providing transparent, reliable evaluation of multimodal systems."

The Judge-Image tool offers several out-of-the-box evaluation criteria, including:

- Caption hallucination detection (standard and strict)
- Primary and non-primary object description verification
- Object location accuracy

Beyond validating image caption correctness, Judge-Image can test OCR extraction accuracy for tabular data, AI-generated brand asset accuracy, and scene description validity.

Prior research suggests that Google Gemini can serve as a more reliable MLLM judge than alternatives like OpenAI's GPT-4V, exhibiting less egocentricity and a more equitable approach to judgment. Patronus AI's internal evaluation datasets confirmed that the Gemini backbone performed better than other multimodal LLMs.

Patronus AI plans to expand its multimodal evaluation capabilities to include audio and vision features in future releases.

Customer Use Case

Etsy, the leading technology marketplace for independent sellers, has already implemented Patronus AI's MLLM-as-a-Judge to detect and mitigate caption hallucination from its product images. The Etsy AI team leverages this and the broader Patronus platform to optimize its multimodal AI system.

For more information, visit the Patronus AI documentation at https://docs.patronus.ai/docs/multimodal_evals/base.

About Patronus AI

Patronus AI develops AI evaluation and optimization tools to help companies build top-tier AI products confidently.
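The press release does not show the calling convention, so the snippet below is only a shape sketch of what an image-to-text evaluation request might carry. The endpoint, field names, and evaluator id are hypothetical, not the actual Patronus AI API; the documentation linked above describes the real interface:

```python
import requests

# Hypothetical request shape -- NOT the actual Patronus AI API.
# The real interface is documented at https://docs.patronus.ai/docs/multimodal_evals/base
payload = {
    "evaluator": "judge-image",                      # hypothetical evaluator id
    "criteria": "caption-hallucination-strict",      # hypothetical criterion name
    "evaluated_model_input": "https://example.com/product-photo.jpg",
    "evaluated_model_output": "A hand-thrown blue ceramic mug on a wooden table",
}
resp = requests.post(
    "https://api.example.com/v1/evaluate",           # placeholder endpoint
    json=payload,
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expect something like a pass/fail verdict plus an explanation
```

The key idea is the pairing: the judge receives the model's input (the image) alongside its output (the caption) and scores the output against a named criterion such as strict hallucination detection.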
Patronus AI launches industry-first Multimodal LLM-as-a-Judge for image evaluation.
Patronus AI has released GLIDER, a 3.8 billion parameter model designed for evaluating language models.
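At 3.8 billion parameters, an evaluator like GLIDER is small enough to run locally. Below is a minimal sketch using Hugging Face transformers; the checkpoint id "PatronusAI/glider" is an assumption to verify on the model hub, and the prompt is illustrative only, since the model card defines the exact rubric template the model expects:

```python
# Sketch: running a small judge model locally with Hugging Face transformers.
# The checkpoint id "PatronusAI/glider" is an assumption -- verify on the model hub.
from transformers import pipeline

judge = pipeline("text-generation", model="PatronusAI/glider", device_map="auto")

prompt = (
    "Pass criteria: The response answers the question without unsupported claims.\n"
    "Question: What is the capital of France?\n"
    "Response: Paris.\n"
    "Score the response against the pass criteria and explain your reasoning."
)
result = judge(prompt, max_new_tokens=256, return_full_text=False)
print(result[0]["generated_text"])
```

The appeal of a small dedicated judge over an API-hosted frontier model is cost and data locality: evaluation traffic can stay on your own hardware.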