
Modulate provides real-time voice skin technology for online gaming and virtual communication. It licenses its voice-skin systems to game developers and online platforms (B2B), so players can modify their voices to sound like characters or create new ones, integrating with existing voice chat. The product works by capturing the user's voice and applying real-time voice modulation skins, allowing seamless swapping of voice identities during gameplay. Modulate differentiates itself by targeting game developers and communities with an emphasis on inclusivity and engagement, reducing intimidation for players who are uncomfortable with their natural voices and encouraging participation. The company’s goal is to expand immersive, personalized interactions in online gaming by offering scalable licensing options and premium voice-skin variants through subscriptions or one-time fees.
Industries
Enterprise Software
Gaming
Company Size
51-200
Company Stage
Series A
Total Funding
$66M
Headquarters
Somerville, Massachusetts
Founded
2017
Total Funding
$66M
Above
Industry Average
Funded Over
4 Rounds
Industry standards
Hybrid Work Options
Modulate expands Velma with voice-native real-time conversation intelligence
Modulate, a conversational voice intelligence company, today released its flagship Velma model through its developer API. Previously available only to enterprises, the voice-native conversation intelligence model can now be accessed and deployed by any developer. It natively understands audio and provides live insights into emotion, intent, behavioral risk, and conversational context.
Velma Enterprise API helps organizations move from post-call analysis to continuous, real-time conversational understanding and intervention, and beyond transcription and point solutions toward a broader enterprise intelligence layer for live voice conversations. Powered by Modulate's Ensemble Listening Model (ELM) architecture, the API provides a real-time listening layer that identifies and interprets the signals that determine what is actually happening in a conversation, beyond just the words spoken.
"As enterprises deploy more AI across customer interactions, they're realizing that transcription alone is an incomplete foundation for understanding conversations," said Mike Pappas, CEO and co-founder of Modulate, in a statement. "The excitement we're seeing from operators, compliance teams, and customer experience leaders comes from finally having infrastructure that can interpret conversational and emotional context in real time, beyond the transcript."
Velma uses an ensemble of specialized models that work together to analyze conversational audio across multiple dimensions. Velma can detect emotional signals, conversational dynamics, behavioral patterns, and non-verbal cues that traditional speech-to-text pipelines often miss.
Velma Enterprise API can support the following use cases:
* Fraud and risk detection, identifying signs of synthetic audio, urgency, manipulation, policy avoidance, or other risk signals during live interactions.
* Customer experience and contact center intelligence, helping teams understand caller emotion, frustration, confusion, escalation risk, and service needs in real time.
* AI agent oversight, detecting when AI agents might be making inaccurate claims, violating policies, or failing to respond appropriately to customer needs.
* Trust and safety, recognizing harmful, abusive, or policy-violating behavior in live voice environments.
* Operational intelligence, turning conversational audio into structured, explainable signals that can inform review, escalation, training, and decision-making workflows.
* Compliance and vulnerable customer protection, helping organizations identify signs of distress, confusion, disclosure failures, or regulatory risk during live interactions.
"Fraud, customer dissatisfaction, policy violations, and AI failures don't politely happen only in the first 30 seconds of a call. Enterprises need systems that can listen continuously, explain what they are hearing, and help humans act quickly," Pappas added.
Modulate launches Velma Deepfake Detect: A paradigm shift in the economics of fraud prevention
March 31, 2026
New Velma API from Modulate delivers best-in-class precision at over 100x lower cost, enabling full-call deepfake detection at production scale. Ranked #1 in accuracy by Hugging Face.
BOSTON, MA / ACCESS Newswire / March 31, 2026 / Modulate, the frontier conversational voice intelligence company, today announced the launch of Velma Deepfake Detect, a synthetic voice detection API that makes continuous, full-call monitoring economically viable at scale for the first time. Ranked #1 on the Hugging Face Deepfake Speech leaderboard, Velma Deepfake Detect combines state-of-the-art accuracy with 578x lower cost vs. the next-best model, enabling detection of AI-generated audio across entire conversations in both batch and real-time streaming environments.
"Voice is one of the most vulnerable attack surfaces for modern enterprises," said Mike Pappas, CEO and Cofounder of Modulate. "The problem isn't just that synthetic audio is getting better; it's that it's incredibly cheap to create, while detection has historically been too expensive to deploy at scale. That's left real gaps in how companies defend themselves. Velma Deepfake Detect changes that by creating true cost parity with scammers creating fraudulent voice deepfakes. It's a paradigm shift that gives enterprises and developers a fraud prevention solution at a low cost required to catch the huge proliferation in deepfake fraud."
The volume of synthetic content is growing at an unprecedented pace. AI-generated voice fraud increased by over 1,200% in 2025, costing organizations an average of $14 million annually, and that number keeps going up. The financial impact is significant, with incidents averaging over $500,000 in 2024.
The result is a growing gap between how easily these attacks can be executed and how effectively they can be stopped.
Built using Modulate's Ensemble Listening Model (ELM) architecture, Velma Deepfake Detect combines insights from short vocal tones and more complex rhythm or pronunciation patterns to deliver the precision and cost efficiency required for end-to-end, real-time detection of deepfake fraud across retail, banking, and IT helpdesks - any call center, content-sharing platform, or other audio-rich environment.
Reduce the Operational Costs of Deepfake Detection by up to 99%
With pricing starting at $0.25 per hour of audio, Velma Deepfake Detect is over 100x less expensive than competing solutions, making large-scale deployments across entire voice pipelines economically viable for the first time.
Pappas elaborates, "Historically, cost has shaped how deepfake detection is used in practice. When detection is expensive, organizations are forced to sample only a small portion of each interaction. But as fraud tactics evolve, those partial approaches leave exploitable blind spots. Velma changes the economics, making it possible to monitor entire conversations and voice pipelines, closing those gaps in real time."
Beyond risk mitigation, continuous fraud detection with Modulate Velma Deepfake Detect improves the overall efficiency and cost-effectiveness of voice operations. By identifying fraudulent or suspicious interactions earlier, organizations can route calls more effectively, reduce time spent on bad actors, and allow agents to focus on legitimate customer needs - reducing unnecessary strain and potential churn on frontline teams.
#1 Accuracy, Independently Validated by Hugging Face
Velma Deepfake Detect is ranked the top-performing model on the independently validated Hugging Face Speech Deepfake Arena leaderboard, achieving an equal error rate (EER) of 1.1% - catching 60% of the deepfakes the #2 provider missed, while generating less than half the number of false positives - and significantly outperforming competing models across a broad range of evaluation datasets. This benchmark reflects the model's ability to reliably distinguish genuine human speech from AI-generated audio under diverse conditions, including noisy environments and compressed audio formats.
Built for Real-World Voice Systems
Velma Deepfake Detect is already being applied in high-risk enterprise workflows, including preventing account takeover during customer support calls, flagging synthetic voices during high-value transaction verification, and identifying scam callers in real time in contact center environments. These use cases enable organizations to stop fraud as it happens, rather than after losses occur.
Now available as an API for developers building production systems that rely on voice input, Velma Deepfake Detect enables:
* Batch and real-time streaming detection endpoints
* Probability-based scoring for flexible decision thresholds
* Segment-level analysis for identifying partial manipulation
* Accurate results with as little as 2-3 seconds of audio, compared to 5-30 seconds
* Robust performance across noisy, multi-speaker, and compressed audio
The Velma Deepfake Detect API enables enterprises and developers to incorporate detection into fraud prevention, contact centers, voice agents, and identity verification workflows. Because alerts and scores can be routed into existing systems, organizations can use Velma Deepfake Detect to support real-time decisions such as escalation, rerouting, secondary verification, or post-call review.
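As a rough illustration of how probability-based, segment-level scores could feed decisions such as escalation or secondary verification, here is a minimal Python sketch. It does not call the Velma API and does not reflect its actual response format: the `SegmentScore` class, its field names, the function names, and both threshold values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentScore:
    """Hypothetical per-segment result; not the real API response shape."""
    start_s: float          # segment start time in seconds
    end_s: float            # segment end time in seconds
    synthetic_prob: float   # assumed probability that the segment is AI-generated


def flagged_segments(segments: List[SegmentScore], threshold: float = 0.5) -> List[SegmentScore]:
    """Return the segments whose score meets or exceeds the threshold."""
    return [s for s in segments if s.synthetic_prob >= threshold]


def route_call(
    segments: List[SegmentScore],
    review_threshold: float = 0.5,     # assumed value, tune per deployment
    escalate_threshold: float = 0.9,   # assumed value, tune per deployment
) -> str:
    """Map the highest segment score to one of three illustrative actions."""
    if not segments:
        return "allow"
    peak = max(s.synthetic_prob for s in segments)
    if peak >= escalate_threshold:
        return "escalate"
    if peak >= review_threshold:
        return "secondary_verification"
    return "allow"


if __name__ == "__main__":
    call = [
        SegmentScore(0, 3, 0.05),
        SegmentScore(3, 6, 0.72),   # only part of the call looks synthetic
        SegmentScore(6, 9, 0.10),
    ]
    print(route_call(call))
    print([(s.start_s, s.end_s) for s in flagged_segments(call)])
```

Using the peak segment score rather than the call-wide average is one possible design choice: it keeps a short manipulated stretch from being diluted by the genuine audio around it.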
Modulate: The Comprehensive Voice Intelligence Platform
As part of the broader Velma platform, detection can be combined with additional capabilities, including transcription, emotion detection, PII redaction, and conversational analytics - allowing organizations to move from simply identifying synthetic audio to fully understanding voice interactions.
Pricing and Availability
Velma Deepfake Detect is available today via API. Modulate pricing is usage-based and optimized for high-volume workloads: https://www.modulate.ai/pricing.
About Modulate
Modulate is a voice intelligence company building AI models and APIs designed to understand real-world conversational audio at scale. Its technology combines speech recognition, acoustic analysis, and conversational context to deliver reliable, explainable, and cost-effective voice intelligence for developers and enterprises.
Modulate has launched Velma Deepfake Detect, a synthetic voice detection API that enables full-call deepfake monitoring at production scale. The API is ranked first for accuracy on Hugging Face's Deepfake Speech leaderboard whilst being 578 times cheaper than the next-best model. Priced from $0.25 per hour of audio, Velma Deepfake Detect reduces operational costs by up to 99% compared to competing solutions, making continuous monitoring economically viable. The system achieves a 1.1% equal error rate and can analyse audio in as little as two to three seconds. The technology addresses growing AI voice fraud, which increased by over 1,200% in 2025 and costs organisations an average of $14 million annually. The API is now available for integration into fraud prevention, contact centres and identity verification workflows.
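For readers unfamiliar with the metric, the equal error rate is the operating point at which the false positive rate (genuine speech flagged as synthetic) equals the false negative rate (synthetic speech missed). The Python sketch below shows the general definition on toy scores; it is not the leaderboard's or Modulate's evaluation code, and the function name and sample data are assumptions made for illustration.

```python
from typing import Sequence


def equal_error_rate(genuine_scores: Sequence[float], spoof_scores: Sequence[float]) -> float:
    """Approximate the EER by sweeping every observed score as a threshold.

    Scores are assumed to mean "higher = more likely synthetic".
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = 0.0
    for t in thresholds:
        # Genuine audio wrongly flagged as synthetic at this threshold.
        fpr = sum(s >= t for s in genuine_scores) / len(genuine_scores)
        # Synthetic audio that slips under this threshold.
        fnr = sum(s < t for s in spoof_scores) / len(spoof_scores)
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap = gap
            eer = (fpr + fnr) / 2
    return eer


if __name__ == "__main__":
    genuine = [0.1, 0.2, 0.3, 0.6]   # toy scores for real human speech
    spoof = [0.4, 0.7, 0.8, 0.9]     # toy scores for AI-generated speech
    print(equal_error_rate(genuine, spoof))  # prints 0.25
```

In this toy data one genuine clip scores above one synthetic clip, so at the balancing threshold a quarter of each class is misclassified. A lower EER means both kinds of error are rarer at the balance point.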
Modulate has launched Velma Transcribe, a speech-to-text API offering high-accuracy transcription at 90% lower cost than leading providers. The service costs approximately $0.03 per hour of audio, compared to $0.21–$0.40 from competitors like AssemblyAI, Deepgram and ElevenLabs. Built using Modulate's Ensemble Listening Model architecture, Velma Transcribe orchestrates specialised transcription models to improve accuracy, latency and cost efficiency. The technology achieves industry-leading results on datasets including Earnings-22 and the AMI Meeting Corpus. The service includes enterprise features such as emotion detection for over 20 emotions, accent detection for over 20 accents, support for over 70 languages, PII redaction and diarization. Modulate aims to make transcription accessible at scale for call centres, voice agents and social applications.
Modulate launches Velma Transcribe: high-performance transcription for real-world conversations at 90% lower cost
March 18, 2026
Boston, MA - March 18, 2026 - Modulate, the frontier conversational voice intelligence company, today announced Velma Transcribe, a speech-to-text API delivering high-accuracy, low-latency transcription at 90% lower cost per hour than other leading transcription providers.
This significantly lower price point represents a fundamental shift in the economics of transcription. For a fraction of the cost, Modulate unlocks affordable speech-to-text transcription for every audio conversation in the world, empowering real-time voice agents, call center platforms, social apps, and more with industry-leading transcription tools at a global scale.
Built using Modulate's industry-leading Ensemble Listening Model (ELM) research, Velma Transcribe orchestrates an ensemble of specialized transcription models to improve accuracy, latency, and cost efficiency compared to any single model. In addition to the outstanding unit economics, Velma Transcribe achieves industry-leading results on widely used datasets, including Earnings-22 and the AMI Meeting Corpus. The result is a new standard for conversational audio transcription, combining strong accuracy on complex multi-speaker audio with dramatically improved unit economics for processing voice data at scale.
"Modulate is the world leader in using voice understanding AI, and our goal is to make the tools to understand audio available to anyone, at any scale," said Carter Huffman, CTO and Cofounder of Modulate. "Our full ensemble for conversation understanding, Velma, already outperforms LLMs in recognizing key behaviors, and now Velma Transcribe makes one of our core underlying capabilities available directly to developers who simply need accurate transcripts, not behavioral insights."
In addition, Velma Transcribe offers features built for Enterprise use cases:
* Emotion detection (20+ emotions)
* Accent detection (20+ accents)
* Multilingual (70+ languages)
* PII redaction, diarization, streaming support, and more
Lower Transcription Costs By up to 10X
Velma Transcribe reduces transcription costs to approximately $0.03 per hour of audio, more than 90% lower than leading providers. These economics make it far more cost-effective for enterprise organizations to analyze and monetize their voice data.
* $0.03 - Modulate Velma Transcribe
* $0.40 - ElevenLabs Scribe v2
* $0.31 - Deepgram Nova-3
* $0.26 - Deepgram Nova-2
* $0.21 - AssemblyAI Universal-3 Pro
Based on publicly listed pricing as of March 18, 2026.
Top Marks for Conversational Audio Accuracy at Scale
Velma Transcribe is engineered for real-world conversations that challenge traditional systems, including overlapping speakers, interruptions, accents, and background noise. On the AMI Meeting Corpus dataset, a widely used benchmark for complex multi-speaker conversational audio, Velma avoids over 40% of the errors made by ElevenLabs and over 70% of the errors made by OpenAI GPT-4o-transcribe.
Huffman explains the top marks, "We've tuned Velma for conversational audio, including emotion and accent detection, leading to materially lower error rates on meeting and call data while delivering dramatic cost savings versus incumbent providers. That combination makes high-quality transcription practical at scale."
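The per-hour prices listed above can be turned into percentage savings and monthly spend with a few lines of arithmetic. The Python sketch below uses only the figures quoted in the release; the function names and the 10,000-hour workload are assumptions made for illustration, and actual pricing may differ by plan or volume.

```python
# Per-hour prices as quoted in the release (publicly listed pricing, March 18, 2026).
PRICES_PER_HOUR = {
    "Modulate Velma Transcribe": 0.03,
    "ElevenLabs Scribe v2": 0.40,
    "Deepgram Nova-3": 0.31,
    "Deepgram Nova-2": 0.26,
    "AssemblyAI Universal-3 Pro": 0.21,
}


def savings_vs(baseline: float, candidate: float = 0.03) -> float:
    """Percentage by which `candidate` is cheaper than `baseline`, to one decimal."""
    return round((1 - candidate / baseline) * 100, 1)


def monthly_cost(hours: float, price_per_hour: float) -> float:
    """Cost in dollars for a given number of audio hours, rounded to cents."""
    return round(hours * price_per_hour, 2)


if __name__ == "__main__":
    hours = 10_000  # hypothetical monthly audio volume
    velma = PRICES_PER_HOUR["Modulate Velma Transcribe"]
    for name, price in PRICES_PER_HOUR.items():
        if price == velma:
            continue
        print(f"{name}: {savings_vs(price, velma)}% lower, "
              f"${monthly_cost(hours, price)} vs ${monthly_cost(hours, velma)}")
```

With the listed figures, the reduction works out to roughly 85.7% against the cheapest listed competitor and 92.5% against the most expensive one.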
Built for Secure Enterprise Voice Production
Velma Transcribe includes all the capabilities developers expect and enterprise operations need, including:
* Batch and streaming transcription endpoints with structured output and segment timestamps
* Zero data stored, ensuring privacy-safe workflows
* Sub-second streaming latency with partial transcripts for live applications and agent pipelines
* Robust formatting optimized for conversational speech and long recordings
* Broad language coverage in 70 of the world's most commonly spoken languages
* Personally Identifiable Information (PII) detection and redaction
* Advanced transcription enrichments, including speaker diarization, emotion detection, and accent identification
Backed by Modulate's security practices and ISO 27001 certification, these capabilities allow developers to build secure, voice-enabled applications and help organizations extract insights from large volumes of conversational data.
Models that Listen and Understand
Velma Transcribe is part of Modulate's growing family of Velma 2.0 voice analytics models built to deliver a new, context-rich listening layer for AI systems. It represents the first step in Modulate's expanding developer API strategy, with additional capabilities planned across synthetic voice detection, emotion analysis, and deeper conversational intelligence. Together, these capabilities allow developers and enterprises to move beyond transcription to understand how conversations unfold, enabling applications such as fraud detection, customer sentiment analysis, compliance monitoring, and real-time decision support.
"The industry has spent years teaching AI how to generate and respond. The next frontier is teaching it how to listen," said Mike Pappas, CEO and Cofounder of Modulate. "Most systems today rely on transcription, reducing rich conversations to flat text and losing the signals humans naturally understand. Velma is the listening layer for AI, giving developers and enterprises the 'ears' needed to build voice-native applications that can capture the nuance and intent within spoken dialogue."
Availability and Pricing
Velma Transcribe is available today with batch and sub-second streaming transcription. Modulate pricing is usage-based and optimized for high-volume workloads: https://www.modulate.ai/pricing
About Modulate
Modulate is a voice intelligence company building AI models and APIs designed to understand real-world conversational audio at scale. Its technology combines speech recognition, acoustic analysis, and conversational context to deliver reliable, explainable, and cost-effective voice intelligence for developers and enterprises.
Media Contact
Megan Fasy, Grithaus Agency
(e) [email protected]
(m) +1 (617) 480-3674