There have been many attempts to create AI-powered open source voice assistants (see Rhasspy, Mycroft, and Jasper, to name a few), all with the goal of delivering privacy-preserving, offline experiences without compromising on functionality. But development has proven extraordinarily slow. That's because, in addition to all the usual challenges that come with open source projects, programming an assistant is hard. Technologies like Google Assistant, Siri and Alexa have years, if not decades, of R&D behind them, and an enormous infrastructure to boot.
But that doesn't deter the folks at the Large-scale Artificial Intelligence Open Network (LAION), the German nonprofit responsible for maintaining some of the world's most popular AI training data sets. This month, LAION announced a new initiative, BUD-E, that seeks to build a “fully open” voice assistant capable of running on consumer hardware.
Why launch an entirely new voice assistant project when there are countless others in various states of abandonment? Wieland Brendel, a member of the Ellis Institute and a BUD-E contributor, believes there is no open assistant with a sufficiently extensible architecture to take full advantage of emerging GenAI technologies, particularly large language models (LLMs) along the lines of OpenAI's ChatGPT.
“Most interactions with [assistants] rely on chat interfaces that are quite cumbersome to interact with, [and] dialogues with those systems feel forced and unnatural,” Brendel told TechCrunch in an email interview. “Those systems are fine for relaying commands to control music or turn on the light, but they're not a basis for long, engaging conversations. The goal of BUD-E is to provide the basis for a voice assistant that feels much more natural to humans, one that mimics the natural speech patterns of human dialogue and remembers past conversations.”
Brendel added that LAION also wants to ensure that each BUD-E component can eventually be integrated with applications and services license-free, even commercially, which is not necessarily the case with other open assistant efforts.
A collaboration with the Ellis Institute in Tübingen, the technology consultancy Collabora and the AI Center in Tübingen, BUD-E (an abbreviation of “Buddy for Understanding and Digital Empathy”) has an ambitious roadmap. In a blog post, the LAION team lays out what it hopes to achieve in the coming months, chiefly incorporating “emotional intelligence” into BUD-E and ensuring it can handle conversations involving multiple speakers at once.
“There is a huge need for a natural voice assistant that works well,” Brendel said. “LAION has proven in the past to be excellent at building communities, and the ELLIS Institute Tübingen and the AI Center Tübingen are committed to providing the resources to develop the assistant.”
BUD-E is up and running: you can download and install it today from GitHub on an Ubuntu or Windows PC (macOS support is on the way), but it's very clearly in the early stages.
LAION brought together several open models to assemble an MVP, including Microsoft's Phi-2 LLM, Columbia's StyleTTS2 for text-to-speech, and Nvidia's FastConformer for speech-to-text. As such, the experience is a bit rough around the edges. Getting BUD-E to respond to commands in about 500 milliseconds (in the range of commercial voice assistants like Google Assistant and Alexa) requires a beefy GPU like Nvidia's RTX 4090.
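The assembly described above follows the standard speech-to-speech loop: a speech-to-text model feeds an LLM, whose reply is voiced by a text-to-speech model, with end-to-end latency measured per turn. Here is a minimal sketch of that pipeline shape; the function names and stub bodies are hypothetical placeholders, not BUD-E's actual API, and in practice each stub would wrap the real model (FastConformer, Phi-2, StyleTTS2).

```python
import time

# Hypothetical stand-ins for the real models. In BUD-E's MVP these roles are
# filled by NVIDIA FastConformer (speech-to-text), Microsoft Phi-2 (LLM),
# and StyleTTS2 (text-to-speech).
def speech_to_text(audio: bytes) -> str:
    return "turn on the light"          # placeholder transcription

def generate_reply(prompt: str) -> str:
    return f"Okay, handling: {prompt}"  # placeholder LLM response

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")         # placeholder waveform

def respond(audio: bytes) -> tuple[bytes, float]:
    """Run one turn of the assistant loop and report latency in ms."""
    start = time.perf_counter()
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    waveform = text_to_speech(reply)
    latency_ms = (time.perf_counter() - start) * 1000
    return waveform, latency_ms

waveform, latency_ms = respond(b"\x00\x01")
print(waveform.decode("utf-8"), f"({latency_ms:.1f} ms)")
```

The 500-millisecond figure quoted above is the budget for this entire loop, which is why the stages are run back to back and timed as one unit.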
Collabora is working pro bono to adapt its open source speech recognition and text-to-speech models, WhisperLive and WhisperSpeech, for BUD-E.
“Building text-to-speech and speech recognition solutions ourselves means we can customize them to an extent that isn't possible with closed models exposed via APIs,” Jakub Piotr Cłapa, an AI researcher at Collabora and a member of the BUD-E team, said in an email. “Collabora initially started working [on open assistants] in part because we were struggling to find a good text-to-speech solution for an LLM-based voice agent for one of our clients. We decided to join forces with the broader open source community to make our models more accessible and useful.”
In the short term, LAION says it will work to make BUD-E's hardware requirements less onerous and to reduce the assistant's latency. Longer-term tasks include creating a dialogue data set to fine-tune BUD-E, a memory mechanism that lets BUD-E store information from previous conversations, and a speech processing pipeline that can keep track of several people talking at once.
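The memory mechanism on that roadmap is commonly implemented as a rolling transcript of recent exchanges that gets prepended to each new LLM prompt. A minimal sketch of that idea, with the caveat that the class and its parameters are illustrative assumptions, not LAION's actual design:

```python
from collections import deque

class ConversationMemory:
    """Keep the most recent exchanges and prepend them to each new prompt."""

    def __init__(self, max_turns: int = 8):
        # Bounded history: once full, the oldest turn falls off the front.
        self.turns = deque(maxlen=max_turns)

    def add(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_prompt(self, new_user_input: str) -> str:
        history = "\n".join(
            f"User: {u}\nAssistant: {a}" for u, a in self.turns
        )
        return f"{history}\nUser: {new_user_input}\nAssistant:"

memory = ConversationMemory(max_turns=2)
memory.add("Hi", "Hello!")
memory.add("Play jazz", "Playing jazz.")
print(memory.as_prompt("Turn it up"))
```

A fixed-size window like this trades recall for a bounded prompt length; a production assistant would likely combine it with some form of long-term retrieval.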
I asked the team whether accessibility was a priority, considering that voice recognition systems have historically not worked well with languages other than English and with non-transatlantic accents. A Stanford study found that voice recognition systems from Amazon, IBM, Google, Microsoft and Apple were almost twice as likely to mishear Black speakers as white speakers of the same age and gender.
Brendel said that LAION isn't ignoring accessibility, but that it's not an “immediate focus” for BUD-E.
“The first goal is to really redefine the experience of how we interact with voice assistants before generalizing that experience to more diverse accents and languages,” Brendel said.
To that end, LAION has some pretty wacky ideas for BUD-E, ranging from an animated avatar to personify the assistant to support for analyzing users' faces via webcam to take their emotional state into account.
The ethics of this last part (the facial analysis) are a bit uncertain, to say the least. But Robert Kaczmarczyk, co-founder of LAION, stressed that LAION will remain committed to safety.
“[We] strictly comply with the ethical and safety guidelines formulated by the EU AI Act,” he told TechCrunch via email, referring to the legal framework governing the sale and use of AI in the EU. The EU AI Act allows European Union member countries to adopt more restrictive rules and safeguards for “high-risk” AI, including emotion classifiers.
“This commitment to transparency not only makes it easier to identify and correct potential biases early, but also serves the cause of scientific integrity,” Kaczmarczyk added. “By making our data sets accessible, we enable the broader scientific community to engage in research that maintains the highest standards of reproducibility.”
LAION's previous work hasn't been pristine in the ethical sense, and the organization is currently pursuing a separate, somewhat controversial project on emotion detection. But maybe BUD-E will be different; we'll have to wait and see.