Audio generation is a key aspect of generative AI. In recent years, the popularity of generative AI has created increasingly diverse demands in audio production: technologies such as text-to-speech (TTS), voice conversion (VC), singing voice synthesis (SVS), text-to-sound, and text-to-music are all expected to produce audio from human requests. Most previous efforts in audio generation use task-specific designs that rely heavily on domain expertise and can only be used in fixed configurations. This study instead aims at universal audio generation: handling numerous audio generation tasks with a single unified model rather than treating each task individually.
It is anticipated that a universal audio generation model would accumulate sufficient background knowledge about audio and related modalities, offering simple and efficient solutions to the growing need to generate diverse audio. The exceptional performance of large language model (LLM) technology in text generation has inspired several LLM-based audio generation models. Among these studies, the application of LLMs to tasks such as TTS and music generation has received substantial attention and performs competitively. However, most LLM-based work still focuses on single tasks, so the potential of LLMs to handle numerous tasks remains underexplored in audio generation research.
The authors argue that the LLM paradigm is promising for achieving universality and versatility in audio generation but has not yet been thoroughly investigated. In this study, researchers from the Chinese University of Hong Kong, Carnegie Mellon University, Microsoft Research Asia, and Zhejiang University present UniAudio, which uses LLM techniques to generate multiple types of audio (speech, sounds, music, and singing) conditioned on various input modalities, including phoneme sequences, textual descriptions, and audio itself. The key features of UniAudio are as follows. All audio types and input modalities are first tokenized as discrete streams: a universal neural codec model is built to tokenize audio regardless of its type, while separate tokenizers handle the other input modalities.
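To make the tokenization idea concrete, here is a minimal Python sketch of this first step. Everything in it (the random stand-in codec, the tiny phoneme vocabulary, the function names) is a hypothetical illustration of how heterogeneous inputs end up as discrete streams, not the paper's actual implementation:

```python
import numpy as np

# Toy stand-in for the universal neural codec: each audio frame maps to n_q
# residual vector-quantization indices. A real codec learns its codebooks;
# random indices here only illustrate the shape of the discrete stream.
def encode_audio(frames: np.ndarray, n_q: int = 3, codebook_size: int = 1024) -> np.ndarray:
    rng = np.random.default_rng(0)
    return rng.integers(0, codebook_size, size=(len(frames), n_q))

# Non-audio modalities get their own tokenizers; a phoneme vocabulary is one example.
PHONEME_VOCAB = {"AH": 0, "B": 1, "K": 2, "T": 3}

def tokenize_phonemes(phonemes: list[str]) -> list[int]:
    return [PHONEME_VOCAB[p] for p in phonemes]

codes = encode_audio(np.zeros((4, 160)))            # 4 frames -> a 4 x 3 grid of codec tokens
print(codes.shape, tokenize_phonemes(["B", "AH", "T"]))
```

Once both the conditions and the audio target are discrete tokens, a task can be expressed as one token sequence, which is where the next step picks up.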
UniAudio then concatenates each source-target pair into a single sequence and, finally, performs next-token prediction with an LLM. The tokenization relies on neural-codec-based residual vector quantization, which produces overly long token sequences (one frame corresponds to several tokens) that an LLM cannot process efficiently. To reduce computational complexity, a multi-scale Transformer architecture models inter-frame and intra-frame correlation separately: a global Transformer module captures the correlation between frames (e.g., at the semantic level), while a local Transformer module captures the correlation within a frame (e.g., at the acoustic level).
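Below is a minimal PyTorch sketch of that global/local split. The layer counts, dimensions, mean-pooling of intra-frame tokens, and absence of causal masks are all simplifying assumptions for illustration; the paper's actual architecture differs in detail:

```python
import torch
import torch.nn as nn

class MultiScaleLM(nn.Module):
    """Global Transformer over frames, local Transformer over the few
    codec tokens inside each frame (illustrative hyperparameters)."""
    def __init__(self, vocab: int = 2048, d: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.global_tf = nn.TransformerEncoder(layer, num_layers=4)  # inter-frame (semantic)
        self.local_tf = nn.TransformerEncoder(layer, num_layers=2)   # intra-frame (acoustic)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, n_q) -- the long codec stream, regrouped by frame
        B, T, Q = tokens.shape
        x = self.embed(tokens)                 # (B, T, Q, d)
        frame_repr = x.mean(dim=2)             # pool intra-frame tokens: one vector per frame
        ctx = self.global_tf(frame_repr)       # attend across frames only
        local_in = (x + ctx.unsqueeze(2)).reshape(B * T, Q, -1)
        local_out = self.local_tf(local_in)    # attend within each frame only
        return self.head(local_out).reshape(B, T, Q, -1)

logits = MultiScaleLM()(torch.randint(0, 2048, (2, 8, 3)))
print(logits.shape)  # torch.Size([2, 8, 3, 2048])
```

The payoff is in attention cost: with T frames of Q tokens each, a flat LLM attends over sequences of length T*Q, while this split pays roughly for T positions globally plus Q positions per frame locally.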
Building UniAudio involves two stages, which also demonstrate its scalability to new tasks. First, UniAudio is trained on multiple audio generation tasks simultaneously, giving the model sufficient prior knowledge of both the inherent properties of audio and the relationships between audio and other input modalities. Second, with a few fine-tuning steps, the trained model can accommodate unseen audio generation tasks. Because it can continuously adapt to emerging demands in audio generation, UniAudio has the potential to become a foundation model for universal audio generation. Experimentally, UniAudio supports 11 audio generation tasks: the training stage covers seven tasks, and the fine-tuning stage adds four more. The model is scaled to 165,000 hours of audio and around 1 billion parameters.
UniAudio consistently achieves competitive performance on all 11 tasks, as judged by objective and subjective metrics, and reaches state-of-the-art results on most of them. Further analysis indicates that training on multiple tasks simultaneously benefits every included task. Furthermore, UniAudio outperforms task-specific models by a non-trivial margin and can quickly adapt to new audio generation tasks. In conclusion, this work shows that building universal audio generation models is important, promising, and beneficial.
Below is a summary of the key contributions of this work:
(1) Toward universal audio generation, UniAudio is presented as a single solution for 11 audio generation tasks, more than any previous effort in the field.
(2) Technically, UniAudio contributes new ideas for (i) sequential representation of audio and other input modalities, (ii) a uniform formulation for LLM-based audio generation tasks, and (iii) an efficient model architecture designed specifically for audio generation.
(3) Extensive experimental results verify the overall performance of UniAudio and demonstrate the benefits of building a versatile audio generation paradigm.
(4) The demo and source code of UniAudio are released publicly, in the hope that it can serve as a foundation model for emerging audio generation tasks in future studies.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our 31k+ ML SubReddit, 40k+ Facebook community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.