The big question is how we can improve automatic transcription models.
To develop a more effective system for separating musical notes into voices and staves, particularly for complex piano music, we need to rethink the problem from a different perspective. Our goal is to improve the readability of music transcribed from quantized MIDI, which matters both for producing good scores and for the musicians who have to perform from them.
For good readability of the score, two elements are probably the most important:
- staff separation, which organizes the notes between the upper and lower staff;
- and voice separation, highlighted in this image with lines of different colors.
In piano scores, as mentioned before, the voices are not strictly monophonic but homophonic, meaning that a single voice can contain one or several notes played at the same time; from now on, we'll call these chords. You can see some examples of chords highlighted in purple on the bottom staff of the image above.
From a machine learning perspective, we have two tasks to solve:
- The first is staff separation, which is simple for piano scores: we just need to predict a binary label for each note, indicating the upper or lower staff.
- The voice separation task may seem similar: after all, if we could predict the voice number for each note with a multi-class classifier, the problem would be solved!
However, directly predicting voice labels is problematic. We would have to fix the maximum number of voices the system can handle, and this creates a trade-off between the flexibility of our system and the class imbalance in the data.
For example, if we set the maximum number of voices to 8, to represent 4 on each staff, as is commonly done in music notation software, we can expect very few occurrences of the labels 4 and 8 in our dataset.
Looking specifically at the score excerpt here, voices 3, 4, and 8 are missing entirely. Highly imbalanced data will degrade the performance of a multi-class classifier, and if we set a smaller maximum number of voices, we would lose flexibility.
The way around these problems is to let the system transfer what it learns on some voices to other voices. To do this, we abandon the idea of a multi-class classifier and frame voice prediction as a link prediction problem: we want to link two notes if they are consecutive in the same voice. This has the advantage of breaking a complex problem into a set of very simple ones, where for each pair of notes we again predict a binary label indicating whether the two notes are linked or not. This approach also extends naturally to chords, as you can see in the lower voice in this image.
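To make the link formulation concrete, here is a minimal sketch of how ground-truth links could be derived from annotated voice numbers on a toy excerpt; the `Note` record, the `voice_links` helper, and the excerpt itself are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int    # MIDI pitch
    onset: float  # quantized onset time, in beats
    voice: int    # annotated voice number (used only to build training targets)

def voice_links(notes):
    """Ground-truth links: note i -> note j if j starts at the next onset of i's voice."""
    links = set()
    for i, a in enumerate(notes):
        later = [n.onset for n in notes if n.voice == a.voice and n.onset > a.onset]
        if not later:
            continue  # a is at the last onset of its voice
        nxt = min(later)
        for j, b in enumerate(notes):
            if b.voice == a.voice and b.onset == nxt:
                links.add((i, j))
    return links

# Toy excerpt: an upper voice and a lower voice that starts with a two-note chord.
excerpt = [
    Note(72, 0.0, voice=1), Note(74, 1.0, voice=1),   # upper voice
    Note(48, 0.0, voice=2), Note(52, 0.0, voice=2),   # chord in the lower voice
    Note(50, 1.0, voice=2),
]
print(sorted(voice_links(excerpt)))
# [(0, 1), (2, 4), (3, 4)] -- both chord notes link to the next note of their voice
```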
This process creates a graph that we call the output graph. To find the voices, we can simply compute the connected components of the output graph!
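Recovering the voices from the predicted links is then a standard connected-components computation. A minimal sketch, assuming the predicted links arrive as a list of note-index pairs (the `voices_from_links` helper is a hypothetical name, not part of any library):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def voices_from_links(num_notes, predicted_links):
    """Each connected component of the output graph becomes one voice."""
    rows, cols = zip(*predicted_links) if predicted_links else ([], [])
    adjacency = coo_matrix(
        (np.ones(len(rows)), (rows, cols)), shape=(num_notes, num_notes)
    )
    # Treat links as undirected: a voice is a chain of linked notes (widened by chords).
    _, labels = connected_components(adjacency, directed=False)
    return labels  # labels[i] is the voice id assigned to note i

# Continuing the toy excerpt above: 5 notes and the three predicted links.
print(voices_from_links(5, [(0, 1), (2, 4), (3, 4)]))
# [0 0 1 1 1] -- notes 0-1 form one voice, notes 2-4 the other
```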
To reiterate, we formulate the voice and staff separation problem as two binary prediction tasks:
- For staff separation, we predict the staff number for each note;
- for voice separation, we predict links between pairs of notes.
While not strictly necessary, we found it useful for our system's performance to add an extra task:
- Chord prediction: similar to voices, we link each pair of notes that belong to the same chord.
Let's recap what our system looks like so far: we have three binary classifiers, one that takes individual notes as input and two that take pairs of notes. What we need now are good input features, so that our classifiers can use contextual information in their predictions. In deep learning vocabulary, we need a good note encoder!
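Before getting to the encoder, here is a minimal PyTorch-style sketch of those three heads, assuming we already have one embedding vector per note; the `SeparationHeads` module, its layer sizes, and the way pairs are scored are illustrative choices, not the exact architecture.

```python
import torch
import torch.nn as nn

class SeparationHeads(nn.Module):
    """Three binary predictors on top of per-note embeddings (illustrative sketch)."""
    def __init__(self, dim=64):
        super().__init__()
        self.staff = nn.Linear(dim, 1)  # upper vs. lower staff, one logit per note
        self.voice = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.chord = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h, pairs):
        # h: (num_notes, dim) note embeddings; pairs: (num_pairs, 2) candidate note pairs
        pair_h = torch.cat([h[pairs[:, 0]], h[pairs[:, 1]]], dim=-1)
        return {
            "staff": self.staff(h).squeeze(-1),            # per-note logits
            "voice_link": self.voice(pair_h).squeeze(-1),  # per-pair logits
            "chord_link": self.chord(pair_h).squeeze(-1),
        }

# Usage: 5 notes, 3 candidate pairs; all three heads train with binary cross-entropy.
heads = SeparationHeads(dim=64)
out = heads(torch.randn(5, 64), torch.tensor([[0, 1], [2, 4], [3, 4]]))
print(out["staff"].shape, out["voice_link"].shape)  # torch.Size([5]) torch.Size([3])
```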
We chose a Graph Neural Network (GNN) as the note encoder, since GNNs often excel at processing symbolic music. This means we need to create a graph from the musical input.
For this, we deterministically construct a new graph from the quantized MIDI, which we call the input graph.
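As a simplified illustration of what such a deterministic construction might look like (the `QNote` record and the `build_input_graph` function are made up here, and a real input graph would typically contain more edge types, e.g. for overlapping notes):

```python
from dataclasses import dataclass

@dataclass
class QNote:
    pitch: int
    onset: float     # quantized onset, in beats
    duration: float  # quantized duration, in beats

def build_input_graph(notes):
    """Deterministic edges from quantized MIDI: 'onset' edges between simultaneous
    notes, 'consecutive' edges from a note to the notes starting at the next onset."""
    edges = {"onset": [], "consecutive": []}
    onsets = sorted({n.onset for n in notes})
    next_onset = dict(zip(onsets, onsets[1:]))
    for i, a in enumerate(notes):
        for j, b in enumerate(notes):
            if i == j:
                continue
            if a.onset == b.onset:
                edges["onset"].append((i, j))
            elif next_onset.get(a.onset) == b.onset:
                edges["consecutive"].append((i, j))
    return edges

score = [QNote(72, 0.0, 1.0), QNote(48, 0.0, 2.0), QNote(74, 1.0, 1.0)]
print(build_input_graph(score))
# {'onset': [(0, 1), (1, 0)], 'consecutive': [(0, 2), (1, 2)]}
```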
Creating these input graphs can be easily done with tools like GraphMuse.
Now, putting everything together, our model looks like this: