Building the Next Generation of Real-Time AI Models: Our Investment in Cartesia

Cartesia co-founding team: Brandon Yang, Karan Goel, Albert Gu, and Arjun Desai

QUICK TAKE

  • Cartesia has created a new type of AI architecture that can match transformer performance with fewer resources.
  • The company’s state-space models (SSMs) are more efficient, with better long-term memory and lower latency than the dominant transformer-based models.
  • Founded by a group of researchers from the Stanford AI Lab, Cartesia has announced $22 million in new funding led by Index Ventures, bringing total capital raised to $27 million.
  • Cartesia’s first product, Sonic, is already the world’s fastest text-to-speech model. It can operate locally on any device without an internet connection and outperforms the best existing models across voice quality, stability, and accuracy.
  • “Transformers have provided a step-change in model performance, but given their limitations there is opportunity for a fundamentally new and different architecture to unlock the next wave of AI innovation. We believe Cartesia’s SSMs can be that new architecture,” says Mike Volpi, Partner at Index Ventures.

INDEX PERSPECTIVE

[Embedded video: Cartesia]

THE DETAILS

Transformer architecture has become the default for almost every large language model developed since the landmark “Attention Is All You Need” paper was published by Google scientists in 2017. But while transformers have revolutionized AI and support many of the applications we see today, they have a limit: quadratic scaling (transformers compare every word in the input to every other word) means that compute and memory grow with the square of the input length, so processing slows down and costs climb as inputs get longer.
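
To make the scaling issue concrete, here is a minimal NumPy sketch of single-head self-attention; the dimensions (a 1,024-token input, 64-dimensional embeddings) and random weights are toy values chosen purely for illustration. The (n × n) score matrix it builds is the source of the quadratic cost: doubling the input length quadruples both the work and the memory.

```python
# Minimal sketch (not any production implementation): single-head
# self-attention over a length-n sequence, with toy dimensions.
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (n, d) token embeddings; returns (n, d) outputs."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # The (n, n) score matrix is where the quadratic cost comes from:
    # every token is compared against every other token.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)  # builds a 1024 x 1024 score matrix
print(out.shape)                     # (1024, 64)
```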

Founded by a group of researchers from the Stanford AI Lab, Cartesia is addressing this challenge through state-space models (SSMs), a new type of architecture invented by the team. Highly efficient, with better long-term memory, lower latency, and the ability to run locally, Cartesia’s SSMs are set to shape the next wave of innovation in generative AI. To accelerate its mission of building real-time, multimodal intelligence available on any device, the company has announced $22 million in new funding led by Index Ventures, bringing its total funding to $27 million.

Early impact

The Mamba architecture built by Cartesia’s founding team already shows that SSMs can match transformer performance with fewer resources, making them a more efficient and cost-effective alternative for developers building real-time AI applications. While transformers attend to every past token, SSMs scale linearly: as tokens stream in, the model updates a fixed-size state and discards the tokens themselves, which makes SSMs an ideal architecture for real-time inference.
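
For contrast, below is a minimal sketch of the recurrent view of a state-space layer. The state size, inputs, and parameters are illustrative assumptions, not Cartesia’s or Mamba’s actual design; the point is that each incoming token updates a fixed-size state and can then be discarded, so compute grows linearly with sequence length and memory stays constant.

```python
# Minimal sketch of the recurrent view of a diagonal, single-channel
# state-space layer; all values here are illustrative, not Cartesia's
# or Mamba's actual parameterization.
import numpy as np

def ssm_stream(xs, A, B, C, h0=None):
    """Process inputs one at a time; memory is O(state size), not O(sequence)."""
    h = np.zeros_like(B) if h0 is None else h0
    ys = []
    for x_t in xs:              # one pass over the stream
        h = A * h + B * x_t     # update the fixed-size state
        ys.append(float(C @ h)) # emit an output; x_t can now be discarded
    return np.array(ys), h      # final state summarizes the whole history

state_dim = 16
rng = np.random.default_rng(0)
A = np.exp(-rng.uniform(0.01, 0.5, state_dim))  # stable decay per state channel
B = rng.standard_normal(state_dim)
C = rng.standard_normal(state_dim)
ys, h = ssm_stream(rng.standard_normal(10_000), A, B, C)
print(ys.shape, h.shape)  # (10000,) (16,) -- cost grows linearly with length
```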

“It’s well-known that today’s foundation models fall far short of the standard set by human intelligence,” explains Karan Goel, Cartesia’s co-founder and CEO. “Not only do these models lack the depth of understanding that humans possess, they’re slow and computationally expensive in a way that restricts their development and use to only the largest companies. At Cartesia, we believe the next generation of AI requires a phase shift in how we think about model architectures and machine learning. That includes SSMs that bring intelligence directly to the device, where it can operate efficiently, in real-time, without reliance on data centers.”

Never-before-seen features

Demonstrating this, Cartesia’s first product, Sonic, is the world’s fastest text-to-speech model. A low-latency voice model that generates expressive, lifelike speech, Sonic can stream the first audio byte in just 90ms (about twice as fast as the blink of an eye). It also outperforms the best existing models on voice quality, stability, and accuracy in head-to-head blind human preference tests run by third-party evaluators such as Labelbox.
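
As a rough way to sanity-check a time-to-first-audio-byte figure like this against any streaming text-to-speech endpoint, the sketch below times the arrival of the first audio chunk over HTTP. The URL, headers, and payload are placeholders, not Cartesia’s actual API.

```python
# Rough sketch: measure time-to-first-audio-byte for a generic streaming
# text-to-speech HTTP endpoint. The endpoint, headers, and payload below
# are placeholders, not Cartesia's actual API.
import time
import requests

def time_to_first_audio_byte_ms(url, headers, payload):
    start = time.perf_counter()
    with requests.post(url, headers=headers, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1024):
            if chunk:  # first non-empty chunk of streamed audio
                return (time.perf_counter() - start) * 1000.0
    return None  # stream ended without any audio

# Example usage (placeholder values):
# latency = time_to_first_audio_byte_ms(
#     "https://api.example.com/tts/stream",
#     {"Authorization": "Bearer <API_KEY>"},
#     {"text": "Hello, world.", "voice": "<voice-id>"},
# )
# print(f"time to first audio byte: {latency:.0f} ms")
```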

The underlying SSM architecture has enabled Sonic to offer never-before-seen features, such as an on-device product that runs locally without an internet connection and advanced controllability over emotion, speed, and prompting. Built in just a few months, the Sonic API already supports a variety of real-time use cases — customer service, debt collection, interview screening, voiceovers, and interactive character voices — with hundreds of customers ranging from new startups to public companies.

Cartesia plans to build on Sonic’s success with a long-term roadmap that includes multimodal AI models capable of ingesting and processing different inputs such as text, audio, video, images, and time-series data. Ultimately, the goal is to create real-time intelligence that can handle massive amounts of information across a wide range of applications and tasks. By building the next wave of foundation models with long-term memory and low latency, Cartesia aims to transform industries ranging from healthcare to robotics to gaming, paving the way for ubiquitous, interactive, and real-time AI available to anyone, on any device.

The team behind the tech

Cartesia is led by a group of Stanford researchers that includes CEO Karan Goel, his labmates Albert Gu, Arjun Desai, and Brandon Yang, along with their former professor Chris Ré. Ré’s Stanford lab has been a hotbed of research and the launching point for multiple billion-dollar startups in recent years, including SambaNova, Snorkel AI, and Together AI. They’re joined by a diverse and well-rounded product team that brings experience from companies like DoorDash, Salesforce, Meta, Scale AI, Microsoft, Google Brain, Apple, and Zoom.

The Index-led round was supported by A* Capital, Conviction, General Catalyst, Lightspeed, and SV Angel, along with 90 prominent angel investors, including the founders of Airtable, Captions, Cohere, Datadog, Hugging Face, HubSpot, Mistral, Okta, Pinterest, RunwayML, Sonos, Weaviate, and Zapier.


Published — Dec. 12, 2024