Now that we've talked about the relevance of mono-to-stereo technology, you may be wondering how it works under the hood. It turns out that there are several approaches to tackling this problem with AI. Next, I want to show four different methods, ranging from traditional signal processing to generative AI. This is not meant as a complete list of methods, but rather as an inspiration for how this task has been approached over the past 20 years.
Traditional signal processing: sound source formation
Before machine learning became as popular as it is today, the field of Music Information Retrieval (MIR) was dominated by clever, hand-crafted algorithms. Not surprisingly, such methods also exist for mono-to-stereo mixing.
The fundamental idea behind a 2007 paper by Lagrange, Martins, and Tzanetakis (1) is simple:
If we can find the different sound sources in a recording and extract them from the signal, we can remix them for a realistic stereo experience.
This sounds simple, but how can we know what the sound sources in the signal are? How do we define them clearly enough that an algorithm can extract them from the signal? These questions are difficult to solve, and the paper uses a variety of advanced methods to achieve this. In essence, this is the algorithm they came up with:
- Divide the recording into short fragments and identify peak frequencies (dominant notes) in each fragment
- Identify which peaks go together (a sound source) using a clustering algorithm
- Decide where each sound source must be placed in the stereo mix (manual step)
- For each sound source, extract its assigned frequencies from the signal
- Mix all extracted sources together to form the final stereo mix.
Although quite complex in its details, the intuition is clear: find the sources, extract them, and remix them.
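A heavily simplified sketch of this find-extract-remix idea might look as follows, assuming an STFT front end and k-means clustering as a stand-in for the paper's far more careful peak tracking and clustering. This is an illustration of the intuition, not the authors' implementation.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.cluster import KMeans

def toy_find_extract_remix(mono, sr, n_sources=3):
    """Toy version: cluster frames by dominant frequency and pan each
    cluster to a fixed stereo position. Not the paper's algorithm."""
    f, t, Z = stft(mono, fs=sr, nperseg=2048)

    # Step 1: dominant peak frequency in each short fragment (frame)
    peak_freqs = f[np.abs(Z).argmax(axis=0)]

    # Step 2: group frames whose peaks lie close together into "sources"
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(
        peak_freqs.reshape(-1, 1))

    # Step 3: pick a pan position per source (a manual step in the paper)
    pans = np.linspace(0.25, 0.75, n_sources)   # 0 = left, 1 = right

    # Steps 4-5: extract each source's frames, pan, and sum into L/R
    L = np.zeros_like(Z)
    R = np.zeros_like(Z)
    for k in range(n_sources):
        mask = labels == k
        L[:, mask] = Z[:, mask] * np.sqrt(1 - pans[k])   # constant-power pan
        R[:, mask] = Z[:, mask] * np.sqrt(pans[k])

    _, left = istft(L, fs=sr, nperseg=2048)
    _, right = istft(R, fs=sr, nperseg=2048)
    return np.stack([left, right])
```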
A quick solution: source separation / stem splitting
A lot has happened since Lagrange's 2007 paper. Since Deezer released its stem splitting tool Spleeter in 2019, AI-based source separation systems have become remarkably useful. Notable players such as Lalal.ai (https://www.lalal.ai/) or Audioshake (https://www.audioshake.ai/instrument-stem-separation) make a quick solution possible:
- Separate a mono recording into its individual instrument stems using a free or commercial stem splitter.
- Load the stems into a digital audio workstation (DAW) and mix them to your liking (a scripted version of this step is sketched below).
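Here is a minimal scripted version of that workflow, assuming Spleeter's Python API, a 44.1 kHz mono input file, and pan positions I chose arbitrarily for illustration. The filenames are placeholders.

```python
import numpy as np
import soundfile as sf
from spleeter.separator import Separator

separator = Separator('spleeter:4stems')      # vocals, drums, bass, other
mono, sr = sf.read('song_mono.wav')           # placeholder input file
waveform = np.stack([mono, mono], axis=1)     # Spleeter expects 2 channels
stems = separator.separate(waveform)          # dict: stem name -> audio

# Arbitrary pan positions per stem (0 = hard left, 1 = hard right)
pans = {'vocals': 0.5, 'drums': 0.5, 'bass': 0.35, 'other': 0.65}

mix = np.zeros_like(waveform)
for name, audio in stems.items():
    stem_mono = audio.mean(axis=1)            # collapse stem to mono
    p = pans[name]
    mix[:, 0] += stem_mono * np.sqrt(1 - p)   # constant-power panning
    mix[:, 1] += stem_mono * np.sqrt(p)

sf.write('song_stereo.wav', mix, sr)
```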
This technique was already used in a 2011 research paper (see (2)), but it has become much more viable since then due to recent improvements in stem separation tools.
The disadvantage of source separation approaches is that they produce noticeable sound artifacts, because source separation itself is still not without its flaws. Furthermore, these approaches still require manual mixing by a human, making them only semi-automatic.
To fully automate mono-to-stereo mixing, machine learning is required. By learning from real stereo mixes, an ML system can imitate the mixing style of real human producers.
Machine learning with parametric stereo
At ISMIR 2023, Serrà and colleagues presented a very creative and efficient way to use machine learning for mono-to-stereo mixing (3). Their work builds on an audio compression technique called parametric stereo. Stereo mixes consist of two audio channels, which makes them hard to deliver in low-bandwidth settings such as music streaming, radio broadcasts, or telephone connections.
Parametric stereo is a technique for creating stereo sound from a single mono signal by focusing on the key spatial cues our brain uses to determine where sounds come from. These cues are:
- how loud a sound is in the left ear versus the right ear (Interchannel Intensity Difference, IID)
- how synchronized the left and right signals are in terms of time or phase (Interchannel Time or Phase Difference)
- how similar or different the signals arriving at each ear are (Interchannel Correlation, IC)
Using these parameters, a stereo experience can be created from nothing more than a mono signal.
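To make these cues concrete, here is a toy decoder in the spirit of parametric stereo: it takes a mono signal plus per-frequency-bin IID and IC values and synthesizes two channels. The decorrelator, the parameter shapes, and all constants are simplifying assumptions for illustration; real codecs (and the paper's model) are considerably more careful.

```python
import numpy as np
from scipy.signal import stft, istft

def toy_ps_decode(mono, sr, iid_db, ic, nperseg=2048):
    """iid_db, ic: one value per STFT frequency bin (nperseg // 2 + 1)."""
    _, _, M = stft(mono, fs=sr, nperseg=nperseg)

    # Crude decorrelator: give each bin a fixed random phase offset
    # (a stand-in for the carefully designed all-pass filters in real codecs)
    rng = np.random.default_rng(0)
    D = M * np.exp(1j * rng.uniform(0, 2 * np.pi, size=(M.shape[0], 1)))

    # IID: distribute energy between channels while preserving total power
    g = 10 ** (iid_db[:, None] / 20)          # linear left/right gain ratio
    gl = np.sqrt(2 * g**2 / (1 + g**2))
    gr = np.sqrt(2 / (1 + g**2))

    # IC: blend mono with a decorrelated copy to hit the target correlation
    a = np.sqrt((1 + ic[:, None]) / 2)
    b = np.sqrt((1 - ic[:, None]) / 2)
    _, left = istft(gl * (a * M + b * D), fs=sr, nperseg=nperseg)
    _, right = istft(gr * (a * M - b * D), fs=sr, nperseg=nperseg)
    return np.stack([left, right])
```

For example, calling this with `iid_db = np.linspace(-6, 6, 1025)` and `ic = np.full(1025, 0.7)` spreads the spectrum from left to right with a moderately wide stereo image.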
This is the approach the researchers took to develop their mono-to-stereo mixing model (a minimal training sketch follows the list):
- Collect a large dataset of stereo music tracks
- Convert stereo tracks to parametric stereo (mono + spatial parameters)
- Train a neural network to predict the spatial parameters given a mono recording
- To convert a new mono signal to stereo, use the trained model to infer spatial parameters from the mono signal and combine the two into a parametric stereo experience
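As a rough illustration of this pipeline, here is a minimal training sketch that uses only the IID cue: it derives per-band level differences from stereo spectrograms as targets and trains a small network to predict them from the mono downmix. The dummy data, the IID-only simplification, and the architecture are my assumptions, not the authors' setup.

```python
import torch
import torch.nn as nn

def iid_targets(stereo_mag, eps=1e-8):
    """Per-(band, frame) level difference in dB. stereo_mag: (B, 2, F, T)."""
    left, right = stereo_mag[:, 0], stereo_mag[:, 1]
    return 20 * torch.log10((left + eps) / (right + eps))

# Toy predictor: mono magnitude spectrogram -> IID per (band, frame)
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batches of |STFT| magnitudes; a real dataset goes here
stereo_batches = [torch.rand(4, 2, 257, 100) for _ in range(3)]

for stereo_mag in stereo_batches:
    mono_mag = stereo_mag.mean(dim=1, keepdim=True)   # downmix to mono
    target = iid_targets(stereo_mag).unsqueeze(1)     # spatial parameters
    loss = nn.functional.mse_loss(model(mono_mag), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```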
Currently, there does not appear to be any code or listening demos available for this paper. The authors themselves concede that "there is still a gap between professional stereo mixes and the proposed approaches" (p. 6). Still, the paper describes a creative and efficient way to achieve fully automated mono-to-stereo mixing using machine learning.
Generative AI: transformer-based synthesis
Now we get to the seemingly simplest way to generate stereo from mono: training a generative model to take a mono input and synthesize both stereo output channels directly. Although conceptually simple, this is by far the most technically challenging approach. One second of high-resolution audio consists of 44,100 data points per channel. Generating a three-minute song in stereo therefore means generating more than 15 million data points (3 × 60 s × 44,100 samples/s × 2 channels ≈ 15.9 million).
With today's technologies, such as convolutional neural networks, transformers, and neural audio codecs, the complexity of the task is starting to become manageable. Some papers have chosen to generate stereo signals through direct neural synthesis (see (4), (5), (6)). However, only (5) trains a model that solves mono-to-stereo generation out of the box. My intuition is that there is room for a paper dedicated to the "simple" mono-to-stereo generation task that focuses 100% on solving this goal. Anyone here looking for a PhD topic?
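To make the setup concrete, here is a deliberately tiny sketch of direct neural synthesis: a model that maps a raw mono waveform to two output channels. Everything here is an assumption for illustration; the cited papers use transformer architectures operating on neural codec tokens, not a few raw-waveform convolutions.

```python
import torch
import torch.nn as nn

class ToyMonoToStereo(nn.Module):
    """Toy direct-synthesis model: mono waveform in, two channels out."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=15, padding=7),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=15, padding=7),
            nn.GELU(),
            nn.Conv1d(hidden, 2, kernel_size=15, padding=7),  # L and R
        )

    def forward(self, mono):                 # mono: (batch, 1, n_samples)
        return self.net(mono)                # (batch, 2, n_samples)

model = ToyMonoToStereo()
mono = torch.randn(1, 1, 44100)              # one second at 44.1 kHz
stereo = model(mono)
print(stereo.shape)                          # torch.Size([1, 2, 44100])
```

Even this toy makes the scaling problem visible: the output has twice as many samples as the input, and a real model has to keep the two channels both plausible on their own and spatially consistent with each other.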