Deepgram has made a name for itself as one of the go-to startups for speech recognition. Today, the well-funded company announced the launch of Aura, its new real-time text-to-speech API. Aura combines highly realistic voice models with a low-latency API to let developers build real-time conversational AI agents. Powered by large language models (LLMs), these agents can then stand in for customer service agents in call centers and other customer-facing situations.
As Deepgram co-founder and CEO Scott Stephenson told me, it has long been possible to access excellent speech models, but they were expensive and slow to compute. Low-latency models, meanwhile, tend to sound robotic. Deepgram's Aura combines human-like voice models that render extremely quickly (typically in less than half a second) and, as Stephenson repeatedly noted, does so at a low price.
“Now everyone is saying, 'Hey, we need real-time voice AI bots that can perceive what's being said and that can understand and generate a response, and then they can speak back,'” he said. In his view, it takes a combination of accuracy (which he described as table stakes for a service like this), low latency, and acceptable cost to make a product like this worthwhile for businesses, especially when combined with the relatively high cost of accessing LLMs.
Deepgram maintains that Aura's price, currently $0.015 per 1,000 characters, beats virtually all of its competitors. That's not far from what Google charges for its WaveNet voices ($0.016 per 1,000 characters) and Amazon Polly's Neural voices (also $0.016 per 1,000 characters), but it does undercut them. Amazon's highest tier, however, is significantly more expensive.
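The per-character pricing cited above is easy to turn into a cost estimate. A minimal sketch, using only the prices quoted in this article (the volume figure is an arbitrary example):

```python
# Published TTS prices in USD per 1,000 characters, as cited in the article.
PRICES_PER_1K_CHARS = {
    "Deepgram Aura": 0.015,
    "Google WaveNet": 0.016,
    "Amazon Polly Neural": 0.016,
}


def synthesis_cost(chars: int, price_per_1k: float) -> float:
    """Cost in USD for synthesizing `chars` characters at a given rate."""
    return chars / 1000 * price_per_1k


# Example: one million characters of generated speech per month.
for name, price in PRICES_PER_1K_CHARS.items():
    print(f"{name}: ${synthesis_cost(1_000_000, price):.2f}")
```

At a million characters, the one-tenth-of-a-cent difference per 1,000 characters amounts to only a dollar, so the cost argument matters most at high volume.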
“You have to achieve a really good price in all [segments], but you also have to have amazing latencies and speeds and, on top of that, amazing accuracy. So it's a really difficult thing to do,” Stephenson said of Deepgram's overall approach to developing its product. “But this is what we focused on from the beginning, and this is why we built for four years before launching anything, because we were building the underlying infrastructure to make it happen.”
Aura offers around a dozen voice models at this point, all of which were trained on a dataset Deepgram created alongside voice actors. The Aura model, like all of the company's other models, was trained in-house. Here's what it sounds like:
You can try a demo of Aura here. I've been testing it for a while, and while you'll occasionally encounter some odd pronunciations, the speed is really what stands out, alongside Deepgram's existing high-quality speech-to-text model. To highlight how quickly it generates responses, the demo measures how long it takes the model to start speaking (typically less than 0.3 seconds) and how long it takes the LLM to finish generating its response (typically just under a second).
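The two latency figures the demo reports (time to first audio, time for the LLM to finish) can be sketched as a simple timing harness around a streamed response body. This is a hypothetical illustration, not Deepgram's implementation: `fake_stream` stands in for the chunked audio body an HTTP client would yield.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, that first chunk).

    In a real client, `chunks` would be the streaming body of a TTS
    HTTP response; here it can be any iterable of byte chunks.
    """
    start = time.monotonic()
    iterator: Iterator[bytes] = iter(chunks)
    first = next(iterator)  # blocks until the first audio bytes arrive
    return time.monotonic() - start, first


# Usage with a stand-in generator that simulates network/model delay:
def fake_stream() -> Iterator[bytes]:
    time.sleep(0.05)  # pretend 50 ms pass before the first audio chunk
    yield b"\x00" * 4096


latency, chunk = time_to_first_chunk(fake_stream())
print(f"time to first audio: {latency:.3f}s ({len(chunk)} bytes)")
```

Measuring to the first chunk rather than the full response is what makes a sub-0.3-second figure meaningful: playback can begin while the rest of the audio is still being generated.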