Full-Time

Researcher

Multimodal

Posted on 7/25/2025

Cartesia

51-200 employees

Develops foundation models with subquadratic architectures

Compensation Overview

$180k - $350k/yr

San Francisco, CA, USA

In Person

Relocation and immigration support.

Category
AI & Machine Learning
Required Skills
TensorFlow
Neural Networks
PyTorch
Machine Learning
Requirements
  • Expertise in machine learning, multimodal learning, and generative modeling, with a strong research track record in top-tier conferences (e.g., CVPR, ICML, NeurIPS, ICCV).
  • Proficiency in deep learning frameworks such as PyTorch or TensorFlow, with experience in handling diverse data modalities (e.g., audio, video, text).
  • Strong understanding of state-of-the-art techniques for multimodal modeling, such as autoregressive and diffusion modeling, and deep understanding of architectural tradeoffs.
  • Passion for exploring the interplay between modalities to solve complex problems and create groundbreaking applications.
  • Excellent problem-solving skills, with the ability to independently tackle research challenges and collaborate effectively with multidisciplinary teams.
Responsibilities
  • Conduct cutting-edge research at the intersection of machine learning, multimodal data, and generative modeling to advance the state of AI across audio, text, vision, and other modalities.
  • Develop novel algorithms for multimodal understanding and generation, leveraging new architectures, training algorithms, datasets, and inference techniques.
  • Design and build models that enable seamless integration of modalities for multimodal reasoning on streaming data.
  • Lead the creation of robust evaluation frameworks to benchmark model performance on multimodal datasets and tasks.
  • Collaborate closely with cross-functional teams to translate research breakthroughs into impactful products and applications.
Desired Qualifications
  • Experience working with multimodal datasets, such as audio-visual datasets, video-captioning datasets, or large-scale cross-modal corpora.
  • Background in designing or deploying real-time multimodal systems in resource-constrained environments.
  • Early-stage startup experience or experience working in fast-paced R&D environments.

Cartesia.ai develops advanced AI foundation models using new subquadratic architectures and state space models. Its models are designed as underlying systems that can be adapted for many applications and are released in part under open-source licenses (e.g., Apache 2.0), with licensing, partnerships, and consulting as revenue streams. The company’s products work by providing large, adaptable AI models that clients can license or customize for specific needs, enabling faster processing and efficient inference compared to traditional designs. Cartesia differentiates itself from competitors by using subquadratic, state space approaches instead of relying mainly on standard Transformer models, and by combining open-source releases with enterprise partnerships. The goal is to help businesses and research institutions access powerful, adaptable AI capabilities while building a community around its technology and sustaining revenue through licensing and services.

Company Size

51-200

Company Stage

Late Stage VC

Total Funding

$191M

Headquarters

San Francisco, California

Founded

2023

Simplify Jobs

Simplify's Take

What believers are saying

  • Forethought partnership validates Sonic-3 for enterprise agentic AI CX platforms with high-quality conversations.
  • $100M Series B funding from Kleiner Perkins and NVIDIA accelerates engineering and global expansion.
  • 10,000+ customers including ServiceNow and Cresta deploy Sonic for millions of monthly interactions.

What critics are saying

  • OpenAI's ChatGPT advanced voice mode delivers low-latency expressiveness to millions, eroding differentiation.
  • ElevenLabs undercuts pricing with superior voice cloning and 70+ language support versus Sonic-3's 42.
  • EU AI Act Class IV enforcement bans deepfake voice cloning, eliminating Cartesia's voice cloning features.

What makes Cartesia unique

  • Sonic-3 achieves 90ms model latency using State Space Models, outpacing Transformer-based competitors.
  • Cartesia's SSM architecture captures emotional nuance including laughter and tone variation authentically.
  • Sonic-3 supports 42 languages enabling truly global enterprise voice deployments at scale.


Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Retirement Plan

401(k) Company Match

Relocation Assistance

Growth & Insights and Company News

Headcount

6 month growth

-5%

1 year growth

-5%

2 year growth

17%
IANS
Mar 27th, 2026
Smallest.ai launches Lightning V3, a new text-to-speech model that beats OpenAI, Cartesia, and ElevenLabs on key voice quality benchmarks.

Designed for real-time use, Lightning V3 combines multilingual speech, voice cloning from seconds of audio, and conversational-level prosody in a single system. Smallest.ai, a research-first Voice AI company building proprietary speech models and production-grade voice agents, announced the launch of Lightning V3, its most advanced text-to-speech (TTS) model for real-time, conversational AI. In conversational evaluations, Lightning V3 achieves a 3.89 MOS, outperforming leading models from OpenAI, Cartesia, and ElevenLabs, while also leading on intonation (3.33) and prosody (3.07), two of the most critical factors for natural, human-like speech. The model combines this performance with multilingual support, instant voice cloning, and streaming generation designed for real-world interactions.

Most TTS models today are still evaluated on complete sentences generated in isolation. That setup is easier to optimize for, but it doesn't reflect how voice systems actually behave in production, where audio is generated in chunks, context is incomplete, and responses have to adapt as conversations unfold. Lightning V3 is built for this setting: it generates speech in chunks, without full context, and adapts as conversations evolve. It maintains consistency across turns and adjusts tone and pacing mid-sentence, which is where most systems break down. The same design lets the model work across use cases without retraining, including voice agents, contact centers, podcasts, audiobooks, dubbing, and interactive applications. It supports 15 languages with automatic detection and mid-sentence switching, and can clone a voice from 5-15 seconds of audio. Cloned voices tend to sound more natural than preset ones, since they retain the variations of real speech. The model outputs audio at 44.1 kHz and can be downsampled to 8-24 kHz for telephony.

"Conversation is where most voice systems fall apart," said Sudarshan Kamath, Founder and CEO, Smallest.ai. "It's not just about sounding clear: the voice has to track context, timing, and emotion at the same time. If it works there, it works everywhere."

The launch also challenges how voice quality is measured. Most benchmarks rely on static outputs, a setup that rarely reflects real usage. Lightning V3 is evaluated in use-case-specific settings, measuring how well the voice maintains coherence, responsiveness, and believability throughout an interaction, in the context of the conversation rather than within a single utterance. Voices should be designed and judged in context: for whether they fit the persona they are meant to inhabit, carry the right social signal, and feel believable in the moment they were built for.

On pricing, Lightning V3.1 is available on a pay-as-you-go model, with no upfront commitments, seat licenses, or minimum usage requirements. Teams can scale from early prototypes to high-volume deployments across both voice agents and content generation, with usage-based pricing and non-expiring credits.

Smallest.ai develops state-of-the-art speech-to-text, text-to-speech, and real-time voice systems for regulated enterprises, enabling end-to-end automation of high-volume conversations across support, collections, onboarding, and servicing without relying on stitched third-party APIs. Designed for financial services and other regulated industries, Smallest.ai is SOC 2, GDPR, HIPAA, and PCI compliant, supports on-prem and private cloud deployments, and operates reliably in multilingual environments. Its platform is used in production by enterprises across banking, insurance, BPO, and telecommunications in the US and India.

Medium
Nov 18th, 2025
Cartesia's $100M Bet on SSMs

Cartesia has raised $100 million to develop its AI model, Sonic-3, which uses State Space Models (SSMs) instead of the widely used transformer architecture. This marks a significant shift in AI conversation models, as transformers, known for their attention mechanism, are commonly used in text, audio, and image recognition models.

AIM Media House
Oct 29th, 2025
Cartesia Raises $100 Million to Transform Real-Time Voice AI with Sonic-3

Silicon Valley startup Cartesia has secured a $100 million funding round from Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA. Co-founded by Stanford AI Lab alumni Karan Goel and Albert Gu, Cartesia is launching Sonic-3, a real-time conversational AI model.

Sonic-3 combines expressiveness, speed, and multilingual support. It captures the full emotional range of human speech, including laughter, tone variation, and subtle emotional shifts, making conversations feel authentic and engaging. It is also fast, with a model latency of just 90 milliseconds and a total end-to-end response time of 190 milliseconds, placing it among the fastest real-time voice AI systems available. It supports 42 languages, enabling enterprises to deploy truly global, natural voice applications that meet diverse market needs.

Unlike most voice AI solutions, which rely on Transformer architectures, Sonic-3 is built on State Space Models (SSMs). Transformer-based models process conversations by re-reviewing all preceding dialogue to predict each next word, similar to replaying the entire conversation repeatedly, which introduces latency and inefficiency. SSMs, pioneered by Cartesia's founders at Stanford with innovations like S4 and Mamba, function more like human memory: they retain an ongoing understanding of the topic and conversational tone without replaying everything from scratch for each response. This enables Sonic-3 to generate speech that is both natural and fast.

"If you're qualified and we can't make your voice AI better than what you're using now, I'll donate $5K to your chosen charity," said Karan Goel. Thousands of companies, including ServiceNow, Cresta, and Decagon, trust Sonic to power millions of voice interactions monthly. Cartesia's platform enables enterprises to build voice agents capable of complex tasks such as customer support, scheduling, and even lighthearted pranks, all with human-like expressiveness.

To encourage adoption, Cartesia offers free trials and demos, as well as an 11-page guide on cloning voices and creating AI agents in under 10 minutes. New users also receive $100 in free credits to experiment with voice AI applications. The $100 million raise highlights growing investor confidence in Cartesia's technology and business potential. With capital from Kleiner Perkins and NVIDIA, Cartesia plans to expand its engineering team, scale product development, and extend its global reach.
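The contrast described above, attention re-reading the whole history versus an SSM folding each input into a fixed-size running state, can be illustrated with a toy linear state-space recurrence. This is a minimal sketch for intuition only, not Cartesia's actual architecture; the matrices and state size here are arbitrary assumptions, and real models like S4 and Mamba use structured, learned parameterizations.

```python
import numpy as np

def ssm_step(state, x_t, A, B, C):
    """One recurrent update of a toy linear state space model.

    The hidden state summarizes the entire input history, so each
    new step costs O(state_size) work regardless of how long the
    sequence has grown, unlike attention, which revisits every
    previous token at every step.
    """
    state = A @ state + B * x_t   # fold the new input into the running state
    y_t = C @ state               # read out the current output
    return state, y_t

rng = np.random.default_rng(0)
n = 8                             # toy state size (arbitrary)
A = np.eye(n) * 0.9               # decaying dynamics keep the state stable
B = rng.standard_normal(n)
C = rng.standard_normal(n)

state = np.zeros(n)
for x_t in rng.standard_normal(100):   # a stream of scalar inputs
    state, y_t = ssm_step(state, x_t, A, B, C)
# state is still shape (n,): per-step memory and compute stayed
# constant even as the sequence grew, which is the subquadratic win
```

The point of the sketch is the loop body: nothing inside it depends on sequence length, which is why SSM-based models can stream audio with flat per-token latency.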

VentureBeat
May 13th, 2025
AI Power Rankings Upended: OpenAI, Google Rise as Anthropic Falls, Poe Report Finds

Poe's latest usage report shows OpenAI and Google strengthening their positions in key AI categories while Anthropic loses ground and specialized reasoning capabilities emerge as a crucial competitive battleground. According to data released by Poe, a platform offering access to more than 100 AI models, significant market share shifts occurred across all major AI categories between January and May 2025. The data, drawn from Poe subscribers, provides rare visibility into actual user preferences beyond industry benchmarks. "As a universal gateway to 100+ AI models, Poe has a unique view of usage trends across the ecosystem," said Nick Huber, Poe's AI Ecosystem Lead, in an exclusive interview with VentureBeat. "The most surprising things happening right now are rapid innovation (3x the number of releases Jan-May 2025 vs. the same period in 2024), an increasingly diverse competitive landscape, and reasoning models are the clear success story of early 2025."

[Chart from Poe showing AI model rankings across different categories as of May 2025]

MarTech360
Mar 27th, 2025
Forethought Unveils Voice AI, Pioneering the First Complete Agentic AI CX Platform

Forethought has joined forces with Cartesia, a leader in real-time voice AI, to enhance its voice AI agents and deliver high-quality conversational experiences.

INACTIVE