
Industries
Enterprise Software
Gaming
Company Size
51-200
Company Stage
Series A
Total Funding
$66M
Headquarters
Somerville, Massachusetts
Founded
2017
Modulate provides real-time voice skin technology for online gaming and virtual communication. It licenses its voice-skin systems to game developers and online platforms (B2B), so players can modify their voices to sound like characters or create new ones, integrating with existing voice chat. The product works by capturing the user's voice and applying real-time voice modulation skins, allowing seamless swapping of voice identities during gameplay. Modulate differentiates itself by targeting game developers and communities with an emphasis on inclusivity and engagement, reducing intimidation for players who are uncomfortable with their natural voices and encouraging participation. The company’s goal is to expand immersive, personalized interactions in online gaming by offering scalable licensing options and premium voice-skin variants through subscriptions or one-time fees.
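The real-time flow described above - capture a voice frame, apply a skin transform, return it to the chat pipeline - can be sketched as a per-buffer callback. This is a minimal illustration only: the naive resampling pitch shift stands in for Modulate's proprietary voice-skin models, and `apply_voice_skin` and `voice_chat_callback` are hypothetical names, not Modulate's API.

```python
import numpy as np

def apply_voice_skin(frame: np.ndarray, pitch_ratio: float = 1.2) -> np.ndarray:
    """Hypothetical per-frame transform: naive resampling to shift pitch.

    Real voice skins use learned models; this only illustrates where a
    transform plugs into the audio path.
    """
    n = len(frame)
    src = np.clip(np.arange(n) * pitch_ratio, 0, n - 1)  # read positions
    return np.interp(src, np.arange(n), frame).astype(frame.dtype)

def voice_chat_callback(in_frame: np.ndarray) -> np.ndarray:
    # Called once per audio buffer by the voice-chat stack; swapping
    # skins mid-game is just swapping the transform applied here.
    return apply_voice_skin(in_frame)

# 20 ms of a 200 Hz tone at 16 kHz, as a stand-in for captured speech
t = np.arange(320) / 16000.0
frame = np.sin(2 * np.pi * 200 * t).astype(np.float32)
out = voice_chat_callback(frame)
```

Because the transform runs per buffer, a new skin takes effect on the very next frame, which is what makes seamless identity swapping during gameplay possible.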
Total Funding
$66M, raised over 4 rounds (above industry average)
Hybrid Work Options
Modulate launches Velma Deepfake Detect: A paradigm shift in the economics of fraud prevention

March 31, 2026

New Velma API from Modulate delivers best-in-class precision at over 100x lower cost, enabling full-call deepfake detection at production scale. Ranked #1 in accuracy by Hugging Face.

BOSTON, MA / ACCESS Newswire / March 31, 2026 / Modulate, the frontier conversational voice intelligence company, today announced the launch of Velma Deepfake Detect, a synthetic voice detection API that makes continuous, full-call monitoring economically viable at scale for the first time. Ranked #1 on the Hugging Face Speech Deepfake Arena leaderboard, Velma Deepfake Detect combines state-of-the-art accuracy with 578x lower cost than the next-best model, enabling detection of AI-generated audio across entire conversations in both batch and real-time streaming environments.

"Voice is one of the most vulnerable attack surfaces for modern enterprises," said Mike Pappas, CEO and Cofounder of Modulate. "The problem isn't just that synthetic audio is getting better; it's that it's incredibly cheap to create, while detection has historically been too expensive to deploy at scale. That's left real gaps in how companies defend themselves. Velma Deepfake Detect changes that by creating true cost parity with scammers creating fraudulent voice deepfakes. It's a paradigm shift that gives enterprises and developers a fraud prevention solution at the low cost required to catch the huge proliferation of deepfake fraud."

The volume of synthetic content is growing at an unprecedented pace. AI-generated voice fraud increased by over 1,200% in 2025, costing organizations an average of $14 million annually (source), and those numbers keep going up. The financial impact is significant, with incidents averaging over $500,000 in 2024.
The result is a growing gap between how easily these attacks can be executed and how effectively they can be stopped.

Built using Modulate's Ensemble Listening Model (ELM) architecture, Velma Deepfake Detect combines insights from short vocal tones and more complex rhythm or pronunciation patterns to deliver the precision and cost efficiency required for end-to-end, real-time detection of deepfake fraud across retail, banking, and IT helpdesks - any call center, content-sharing platform, or other audio-rich environment.

Reduce the Operational Costs of Deepfake Detection by up to 99%

With pricing starting at $0.25 per hour of audio, Velma Deepfake Detect is over 100x less expensive than competing solutions, making large-scale deployments across entire voice pipelines economically viable for the first time.

Pappas elaborates, "Historically, cost has shaped how deepfake detection is used in practice. When detection is expensive, organizations are forced to sample only a small portion of each interaction. But as fraud tactics evolve, those partial approaches leave exploitable blind spots. Velma changes the economics, making it possible to monitor entire conversations and voice pipelines, closing those gaps in real time."

Beyond risk mitigation, continuous fraud detection with Modulate Velma Deepfake Detect improves the overall efficiency and cost-effectiveness of voice operations. By identifying fraudulent or suspicious interactions earlier, organizations can route calls more effectively, reduce time spent on bad actors, and allow agents to focus on legitimate customer needs - reducing unnecessary strain and potential churn on frontline teams.
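The cost claim above reduces to simple arithmetic. A back-of-envelope sketch, with two labeled assumptions: the ~$25/hr competitor figure is inferred from the "over 100x" claim (it is not a quoted price), and the yearly call volume is hypothetical.

```python
# Back-of-envelope cost comparison using the figures quoted above.
VELMA_PER_HR = 0.25        # quoted starting price per hour of audio
COMPETITOR_PER_HR = 25.00  # ASSUMED from the "over 100x" claim, not quoted

hours_per_year = 1_000_000  # hypothetical volume for a large contact center
velma_cost = hours_per_year * VELMA_PER_HR
competitor_cost = hours_per_year * COMPETITOR_PER_HR
savings_pct = 100 * (1 - velma_cost / competitor_cost)

print(f"Velma: ${velma_cost:,.0f}/yr vs ${competitor_cost:,.0f}/yr "
      f"({savings_pct:.0f}% lower)")
```

At these assumed rates, full-call monitoring of a million hours costs $250,000 instead of $25 million, which is where the "up to 99%" reduction comes from.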
#1 Accuracy, Independently Validated by Hugging Face

Velma Deepfake Detect is ranked the top-performing model on the independently validated Hugging Face Speech Deepfake Arena leaderboard, achieving an equal error rate (EER) of 1.1% - catching 60% of the deepfakes the #2 provider missed while generating less than half the number of false positives - and significantly outperforming competing models across a broad range of evaluation datasets. This benchmark reflects the model's ability to reliably distinguish genuine human speech from AI-generated audio under diverse conditions, including noisy environments and compressed audio formats.

Built for Real-World Voice Systems

Velma Deepfake Detect is already being applied in high-risk enterprise workflows, including preventing account takeover during customer support calls, flagging synthetic voices during high-value transaction verification, and identifying scam callers in real time in contact center environments. These use cases enable organizations to stop fraud as it happens, rather than after losses occur.

Now available as an API for developers building production systems that rely on voice input, Velma Deepfake Detect enables:

* Batch and real-time streaming detection endpoints
* Probability-based scoring for flexible decision thresholds
* Segment-level analysis for identifying partial manipulation
* Accurate results with as little as 2-3 seconds of audio, compared to the 5-30 seconds required by other approaches
* Robust performance across noisy, multi-speaker, and compressed audio

The Velma Deepfake Detect API enables enterprises and developers to incorporate detection into fraud prevention, contact centers, voice agents, and identity verification workflows. Because alerts and scores can be routed into existing systems, organizations can use Velma Deepfake Detect to support real-time decisions such as escalation, rerouting, secondary verification, or post-call review.
Modulate: The Comprehensive Voice Intelligence Platform

As part of the broader Velma platform, detection can be combined with additional capabilities, including transcription, emotion detection, PII redaction, and conversational analytics - allowing organizations to move from simply identifying synthetic audio to fully understanding voice interactions.

Pricing and Availability

Velma Deepfake Detect is available today via API. Modulate pricing is usage-based and optimized for high-volume workloads: https://www.modulate.ai/pricing. Download the Modulate Deepfake Detect press kit here.

About Modulate

Modulate is a voice intelligence company building AI models and APIs designed to understand real-world conversational audio at scale. Its technology combines speech recognition, acoustic analysis, and conversational context to deliver reliable, explainable, and cost-effective voice intelligence for developers and enterprises.
Modulate has launched Velma Deepfake Detect, a synthetic voice detection API that enables full-call deepfake monitoring at production scale. The API is ranked first for accuracy on Hugging Face's Speech Deepfake Arena leaderboard while being 578 times cheaper than the next-best model. Priced from $0.25 per hour of audio, Velma Deepfake Detect reduces operational costs by up to 99% compared to competing solutions, making continuous monitoring economically viable. The system achieves a 1.1% equal error rate and can analyze audio in as little as two to three seconds. The technology addresses growing AI voice fraud, which increased by over 1,200% in 2025 and costs organizations an average of $14 million annually. The API is now available for integration into fraud prevention, contact centers, and identity verification workflows.
Modulate has launched Velma Transcribe, a speech-to-text API offering high-accuracy transcription at 90% lower cost than leading providers. The service costs approximately $0.03 per hour of audio, compared to $0.21-$0.40 from competitors like AssemblyAI, Deepgram, and ElevenLabs. Built using Modulate's Ensemble Listening Model architecture, Velma Transcribe orchestrates specialized transcription models to improve accuracy, latency, and cost efficiency. The technology achieves industry-leading results on datasets including Earnings-22 and the AMI Meeting Corpus. The service includes enterprise features such as emotion detection for over 20 emotions, accent detection for over 20 accents, support for over 70 languages, PII redaction, and diarization. Modulate aims to make transcription accessible at scale for call centers, voice agents, and social applications.
Modulate launches Velma Transcribe: high-performance transcription for real-world conversations at 90% lower cost

March 18, 2026

Boston, MA - March 18, 2026 - Modulate, the frontier conversational voice intelligence company, today announced Velma Transcribe, a speech-to-text API delivering high-accuracy, low-latency transcription at 90% lower cost per hour than other leading transcription providers.

This significantly lower price point represents a fundamental shift in the economics of transcription. For a fraction of the cost, Modulate unlocks affordable speech-to-text transcription for every audio conversation in the world, empowering real-time voice agents, call center platforms, social apps, and more with industry-leading transcription tools at a global scale.

Built using Modulate's industry-leading Ensemble Listening Model (ELM) research, Velma Transcribe orchestrates an ensemble of specialized transcription models to improve accuracy, latency, and cost efficiency compared to any single model. In addition to the outstanding unit economics, Velma Transcribe achieves industry-leading results on widely used datasets, including Earnings-22 and the AMI Meeting Corpus. The result is a new standard for conversational audio transcription, combining strong accuracy on complex multi-speaker audio with dramatically improved unit economics for processing voice data at scale.

"Modulate is the world leader in using voice understanding AI, and our goal is to make the tools to understand audio available to anyone, at any scale," said Carter Huffman, CTO and Cofounder of Modulate. "Our full ensemble for conversation understanding, Velma, already outperforms LLMs in recognizing key behaviors, and now Velma Transcribe makes one of our core underlying capabilities available directly to developers who simply need accurate transcripts, not behavioral insights."
In addition, Velma Transcribe offers features built for enterprise use cases:

* Emotion detection (20+ emotions)
* Accent detection (20+ accents)
* Multilingual (70+ languages)
* PII redaction, diarization, streaming support, and more

Lower Transcription Costs by up to 10X

Velma Transcribe reduces transcription costs to approximately $0.03 per hour of audio, more than 90% lower than leading providers. These economics make it far more cost-effective for enterprise organizations to analyze and monetize their voice data.

* $0.03 - Modulate Velma Transcribe
* $0.40 - ElevenLabs Scribe v2
* $0.31 - Deepgram Nova-3
* $0.26 - Deepgram Nova-2
* $0.21 - AssemblyAI Universal-3 Pro

*Based on publicly listed pricing as of March 18, 2026

Top Marks for Conversational Audio Accuracy at Scale

Velma Transcribe is engineered for real-world conversations that challenge traditional systems, including overlapping speakers, interruptions, accents, and background noise. On the AMI Meeting Corpus, a widely used benchmark for complex multi-speaker conversational audio, Velma avoids over 40% of the errors made by ElevenLabs and over 70% of the errors made by OpenAI GPT-4o-transcribe.

Huffman explains the top marks: "We've tuned Velma for conversational audio, including emotion and accent detection, leading to materially lower error rates on meeting and call data while delivering dramatic cost savings versus incumbent providers. That combination makes high-quality transcription practical at scale."
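The per-hour prices listed above translate directly into annual spend. A quick sketch of that arithmetic, using the listed prices as of March 18, 2026; the yearly audio volume is a hypothetical figure for illustration.

```python
# Annualized spend from the per-hour prices listed above
# (publicly listed pricing as of March 18, 2026).
prices_per_hr = {
    "Modulate Velma Transcribe": 0.03,
    "ElevenLabs Scribe v2": 0.40,
    "Deepgram Nova-3": 0.31,
    "Deepgram Nova-2": 0.26,
    "AssemblyAI Universal-3 Pro": 0.21,
}
hours = 500_000  # hypothetical yearly audio volume

velma = prices_per_hr["Modulate Velma Transcribe"]
for name, price in sorted(prices_per_hr.items(), key=lambda kv: kv[1]):
    saving = 100 * (1 - velma / price) if price > velma else 0.0
    print(f"{name:28s} ${price:0.2f}/hr  ${hours * price:>9,.0f}/yr  "
          f"({saving:.0f}% saved with Velma)")
```

Note the savings range from roughly 86% (vs. AssemblyAI) to 92% (vs. ElevenLabs) against this particular list, which is how the "90% lower" headline figure should be read.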
Built for Secure Enterprise Voice Production

Velma Transcribe includes all the capabilities developers expect and enterprise operations need, including:

* Batch and streaming transcription endpoints with structured output and segment timestamps
* Zero data stored, ensuring privacy-safe workflows
* Sub-second streaming latency with partial transcripts for live applications and agent pipelines
* Robust formatting optimized for conversational speech and long recordings
* Broad language coverage in 70 of the world's most commonly spoken languages
* Personally Identifiable Information (PII) detection and redaction
* Advanced transcription enrichments, including speaker diarization, emotion detection, and accent identification

Backed by Modulate's security practices and ISO 27001 certification, these capabilities allow developers to build secure, voice-enabled applications and help organizations extract insights from large volumes of conversational data.

Models that Listen and Understand

Velma Transcribe is part of Modulate's growing family of Velma 2.0 voice analytics models built to deliver a new, context-rich listening layer for AI systems. It represents the first step in Modulate's expanding developer API strategy, with additional capabilities planned across synthetic voice detection, emotion analysis, and deeper conversational intelligence. Together, these capabilities allow developers and enterprises to move beyond transcription to understand how conversations unfold, enabling applications such as fraud detection, customer sentiment analysis, compliance monitoring, and real-time decision support.

"The industry has spent years teaching AI how to generate and respond. The next frontier is teaching it how to listen," said Mike Pappas, CEO and Cofounder of Modulate. "Most systems today rely on transcription, reducing rich conversations to flat text and losing the signals humans naturally understand. Velma is the listening layer for AI, giving developers and enterprises the 'ears' needed to build voice-native applications that can capture the nuance and intent within spoken dialogue."

Availability and Pricing

Velma Transcribe is available today with batch and sub-second streaming transcription. Modulate pricing is usage-based and optimized for high-volume workloads: https://www.modulate.ai/pricing

Media Contact

Megan Fasy
Grithaus Agency
(e) [email protected]
(m) +1 (617) 480-3674
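The press release above describes structured transcription output with segment timestamps, speaker labels, emotion tags, and PII redaction. The sketch below shows how a downstream consumer might handle output shaped that way. The field names and the client-side regex redaction are illustrative assumptions only, not Modulate's actual schema (the service performs redaction itself).

```python
import re

# Hypothetical structured output shaped like the features described:
# segment timestamps, speaker labels, emotion tags. Field names are
# illustrative, not Modulate's real schema.
segments = [
    {"start": 0.0, "end": 2.1, "speaker": "agent",
     "emotion": "neutral", "text": "Can I get your phone number?"},
    {"start": 2.1, "end": 4.8, "speaker": "caller",
     "emotion": "frustrated", "text": "Sure, it's 617-555-0199."},
]

PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def redact_pii(text: str) -> str:
    # Minimal stand-in for server-side PII redaction: mask US-style
    # phone numbers before the transcript is logged or displayed.
    return PHONE.sub("[REDACTED]", text)

for seg in segments:
    print(f'[{seg["start"]:.1f}-{seg["end"]:.1f}] '
          f'{seg["speaker"]} ({seg["emotion"]}): {redact_pii(seg["text"])}')
```

Carrying timestamps and emotion labels alongside each text span is what lets analytics tools ask questions like "when in the call did the caller become frustrated?" rather than searching flat text.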
Modulate adds nuance to voice analysis

Transcription is key to AI, but a call's full meaning can get lost in plain text. Modulate's approach aims to deliver subtlety - and sarcasm detection.

March 5, 2026

In January 2026, Modulate announced the Ensemble Listening Model (ELM) for analyzing live voice interactions via a system of interconnected, hierarchical models that don't just transcribe calls between a customer and a human agent, but combine that transcript with the timbre of the interactions. The company's Velma 2.0 combines multiple specialized models into a single framework that can detect fraud, provide insights to a human contact center agent, or assess the efficacy of a voice AI-based virtual agent. Modulate integrates into CCaaS platforms like Five9 or Genesys and can connect to systems like Zendesk. Additionally, Modulate sells its platform directly to enterprises.

The following Q&A with Modulate CEO and co-founder Mike Pappas, edited for length, provides a deeper look into how Velma works and where it might be applied in a contact center.

NJ: Where does Modulate fit with all the various conversational AI providers, speech-to-text and speech-to-speech models, etc.?

Pappas: We split these AI systems into the ears, the brain, and the mouth. Most of the foundation model companies are building the brain. The ears and the mouth are kind of afterthoughts. Other folks are building the mouth, making it easier to speak intuitively. Very few people have invested any time or attention towards the ears, because there's this preconception that transcription is all you need. We don't think transcription is sufficient to understand what people are saying. We're not trying to present a whole conversational ecosystem that does ears, brain, and mouth.
We're just saying that whether you're building a transcription system, analytics, or a voice AI agent, the way you are transforming from the audio to what is written down in structured data is just raw transcription that loses emotion and cadence and other [qualitative information]. Basically, we provide an augmented transcription that includes all the extra emotional content like voice timbre, incisive pauses, and [other elements], and then we transform that into structured content that allows an LLM brain to respond more artfully to what you actually intended. Or it can assist a real human agent in better understanding what is being communicated.

NJ: Can you give me an example of what that might look like?

Pappas: With many tools you can transcribe the phrase "nice job." There are other sentiment tools that can recognize that was said sarcastically. But if I feed that into an AI summarizer or [analytics] tool, actually having those systems connect the dots and [recognize] it was said sarcastically is such a simple step for us humans, but that has been missing this whole time from AI.

NJ: The documentation for ELM shows many different models working together at different hierarchical levels. How does the product work?

Pappas: We have different layers of processing. In some of these layers we're just looking at the voice to see if it's synthetic. Some are analyzing over a short burst of time, which is what we call a clip, to see if it includes a threat, for example, or whether it indicates that an issue has been resolved. Then we have larger features [that work] over the course of the whole conversation [to assess the context of the conversation]: Is this a mentoring conversation? Is it an abusive relationship? Is it an attempt at fraud? The difference between these models is what's known as their context window and how much of the conversation they are analyzing at any given point in time.
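The layered design Pappas describes - many listeners, each with its own context window, fused by an orchestrator - can be sketched as a toy ensemble. Everything here is a simplification for illustration: the class names, the string-matching "models", and the orchestrator that merely collects the latest output from each layer are assumptions, not Modulate's implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Listener:
    """One model in the ensemble, analyzing a sliding window of clips."""
    name: str
    window: int                           # context size, in clips
    analyze: Callable[[list], str]        # toy stand-in for a real model
    buf: deque = field(default_factory=deque)

    def feed(self, clip: str) -> str:
        self.buf.append(clip)
        while len(self.buf) > self.window:
            self.buf.popleft()            # keep only this model's context
        return self.analyze(list(self.buf))

def orchestrate(listeners: list, clips: list) -> dict:
    # The orchestrator fuses per-layer outputs into one view of the
    # call; here it simply keeps each listener's latest result.
    state = {}
    for clip in clips:
        for listener in listeners:
            state[listener.name] = listener.feed(clip)
    return state

listeners = [
    # Short-window layer: flags a threat pattern in the latest clip only.
    Listener("clip_threat", 1,
             lambda w: "threat" if "refund now" in w[-1] else "ok"),
    # Long-window layer: classifies the topic from the whole conversation.
    Listener("topic", 8,
             lambda w: "healthcare" if any("doctor" in c for c in w)
             else "general"),
]
state = orchestrate(listeners, ["hi", "my doctor said", "refund now"])
print(state)
```

The point of the sketch is the window sizes: the threat detector reacts to the most recent clip, while the topic model accumulates context, matching the "every few milliseconds" versus "take their time" distinction in the interview.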
Some models are producing new results every few milliseconds - every line or two of conversation, for example - while others are much more mellow, so to speak, and take their time over the course of the conversation. One example is topic analysis: Is this a conversation about health care or something else?

If you are using a standard LLM, it's very hard to dynamically swap in different transcription models [during] a conversation because transcription requires context. Normally, if you swap out to a new transcription model halfway through the conversation, then you've just thrown away all that context and need to regenerate it. With our approach - the dynamic ensemble blocks - even though there are multiple versions of transcription models, they all use the same shared context. We can start with a general-purpose transcription model and then, as we realize the conversation is about healthcare, automatically swap to a healthcare-optimized transcription model and get all the improved performance benefits [from it] without any of the downtime, lost context, or other issues that would normally crop up.

And the idea is not that these models run independently, but that they're passing their information back into each other in different ways, so they're all effectively working together to come to this final analysis. The job of the orchestrator is functionally somewhat similar to an LLM in that it's taking these 100 or so different outputs at all these different levels and then embodying them into one understanding of the conversation.

NJ: How might Modulate's system be used in a contact center?

Pappas: Companies are using us as a guardrail system to notice if an agent is doing something that is off book and then flag it in real time. This is especially important for larger call centers that are testing AI agents. They want to deploy them at scale to see how they do, but they don't have good oversight mechanisms to understand how they are performing.
A system like Modulate's can sit on top of all those AI agents, grade them objectively, provide evidence of what we're seeing, and help them determine if they can really trust the AI agent at large scale.

We're also seeing [human] agent well-being as a use case. We got started in the gaming space doing content moderation, so safety is in our blood. We've been really interested in hearing about how difficult it is to be a contact center agent. Many organizations have implemented policies saying that if an agent just had an awful call where someone was harassing them, they have the right to take a break and get their head straight before jumping back in. But no one takes those breaks because they're afraid: does [that last interaction] quite count as bad enough harassment to justify a break, or will I get docked? If we can say, objectively, that was an 8.3 out of 10, and the policy says if it's above an 8 you get a break, then go, take your break. That is empowering to agents in a way they haven't had before.

Matt, Senior Editor for No Jitter, covers AI (predictive, generative, and agentic AI) as it pertains to the enterprise communications space - unified communications, contact center, and digital workplace.
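The break policy in that last answer is, at its core, an objective threshold check, which is what removes the agent's guesswork. A minimal sketch; the function name, the 0-10 severity scale, and the threshold value are taken from the example in the interview, not from any published Modulate API.

```python
def break_eligible(severity: float, threshold: float = 8.0) -> bool:
    """Objective break policy from the interview example: a harassment
    severity score above the threshold entitles the agent to a break."""
    return severity > threshold

# The 8.3/10 call from the example qualifies; a 7.5 does not.
print(break_eligible(8.3), break_eligible(7.5))
```

The value of codifying the rule is that the decision no longer depends on the agent's own (anxious) judgment call.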