I love music; see here if you doubt me, or just take a look at the >200 songs in my Spotify playlist. Music, though, requires artists, and I’m… not good at instruments. Okay, this whole thing is just another excuse to generate music with AI.
The results aren’t perfect. If you want good AI-generated music, go check out Magenta and their actually good tech. Also, I don’t have a supercomputer at home, so the model could only be trained for a few hours, nothing more.
First, I knew I had to use some kind of recurrent unit to get temporal consistency. Since, again, I don’t have a supercomputer, I mostly went with GRUs, which I feel are the least expensive recurrent units suited for the job here. I also used LSTMs where I felt the “Long” memory could be useful.
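For a sense of the cost gap: a GRU carries three gates’ worth of weights where an LSTM carries four, so at equal width the GRU is roughly a quarter cheaper. A quick check in Keras (assuming TF2’s default cell implementations):

```python
import tensorflow as tf

# Same input, same width: only the recurrent cell changes
seq = tf.keras.Input(shape=(None, 16))
gru = tf.keras.Model(seq, tf.keras.layers.GRU(32)(seq))
lstm = tf.keras.Model(seq, tf.keras.layers.LSTM(32)(seq))
print(gru.count_params())   # 4800
print(lstm.count_params())  # 6272, ~30% more than the GRU
```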
Then, I had to use Gates, which means I had to feed in context. I couldn’t just use the whole input as context (which works for tiny inputs), since I’m feeding the model a whole sequence of notes at every point in time. I ended up choosing the last few notes played, though I feel like there are better options.
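Here’s a minimal sketch of that windowing, assuming each song has already been turned into an array of per-note feature vectors; SEQ_LEN and CTX_LEN are placeholder values, not the ones actually used:

```python
import numpy as np

SEQ_LEN, CTX_LEN = 64, 8  # placeholder lengths, not the model's real ones

def make_examples(notes: np.ndarray):
    """notes: (n_notes, n_features) array for a whole song."""
    for i in range(len(notes) - SEQ_LEN):
        window = notes[i : i + SEQ_LEN]   # full sequence fed to the circuits
        context = window[-CTX_LEN:]       # only the last few notes, as context
        target = notes[i + SEQ_LEN]       # the next note, to predict
        yield window, context, target
```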
Finally, since I wanted to give the model full control, I had it choose not only the pitch, but also the duration of each note and its volume, along with the time between notes (which lets the model take longer pauses, play two notes at the same time, etc.). I then opted for a multi-circuit approach, with one circuit per output.
At the time, TensorFlow was the only framework I knew well enough, and the only one I had Gates implemented in, so I went with the most natural option.
I picked the data from the MAESTRO dataset, since it has a lot of data and virtually no noise. As far as preprocessing goes, I just split each song into notes, from which I extracted the characteristics.
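Since MAESTRO ships MIDI files, that extraction boils down to walking the note list. Something along these lines with pretty_midi (a simplified sketch of the idea, not the exact preprocessing code):

```python
import pretty_midi

def extract_notes(path):
    """Return (pitch, duration, offset, volume) tuples for one MIDI file."""
    midi = pretty_midi.PrettyMIDI(path)
    notes = sorted(midi.instruments[0].notes, key=lambda n: n.start)
    out, prev_start = [], 0.0
    for n in notes:
        out.append((
            n.pitch,               # key / pitch, 0-127
            n.end - n.start,       # duration, in seconds
            n.start - prev_start,  # offset: gap since the previous note started
            n.velocity,            # volume, 0-127
        ))
        prev_start = n.start
    return out
```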
It’s here that it becomes a brain-tangler. Multi-circuit models are hard to represent, so much so that mine broke the Weights & Biases model visualizer. It consisted of 5 circuits:

- Z (the largest), which handles the context
  - Z1: a low-dimensional representation of the context
  - Z2: a high-dimensional representation of the context
- K (second largest), which handles the key / pitch
- D (medium sized), which handles the duration
- O, which handles the offset / space between notes
- V (the tiniest one), which handles the volume of each note
These are all connected to Z, and the inputs are split in 5: the input to Z is a combination of all the characteristics of the last few notes, while the input to each of the 4 others is the corresponding characteristic over a longer sequence.
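To make the wiring less abstract, here’s a structural sketch of how the five circuits fit into one Keras model. Each circuit body is stubbed with a single GRU, and the lengths and output sizes are placeholders; this is a sketch of the shape of the thing, not the real model:

```python
import tensorflow.keras as K  # same alias the snippets below use

CTX_LEN, SEQ_LEN = 8, 64                        # placeholder lengths
N_KEYS, N_DUR, N_OFF, N_VOL = 128, 32, 32, 32   # placeholder output sizes

# Z's input: all 4 characteristics of the last few notes
context_input = K.Input(shape=(CTX_LEN, 4), name="context")
z = K.layers.GRU(16)(context_input)             # stand-in for the Z circuit

inputs, outputs = [context_input], []
for name, size in [("Key", N_KEYS), ("Duration", N_DUR),
                   ("Offset", N_OFF), ("Volume", N_VOL)]:
    # each of the 4 circuits sees one characteristic over a longer sequence
    inp = K.Input(shape=(SEQ_LEN, 1), name=name.lower())
    h = K.layers.GRU(32)(inp)                   # stand-in for the K/D/O/V circuit
    h = K.layers.Concatenate()([h, z])          # every circuit is connected to Z
    outputs.append(K.layers.Dense(size, activation="softmax", name=name)(h))
    inputs.append(inp)

model = K.Model(inputs=inputs, outputs=outputs)
```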
Here are a few snippets of the actual code:

The Context Circuit (Z)
```python
# Analyse the whole sequence
z = K.layers.Bidirectional(K.layers.LSTM(32, return_sequences=True))(context_input)
z = K.layers.Activation("selu")(z)
# Get rid of the extra dimension
z = K.layers.GRU(16, return_sequences=False)(z)
z = K.layers.Activation("relu")(z)
# Low-dimensional representation of the context
z1 = K.layers.Dense(32, activation="tanh")(z)
# Higher-dimensional representation of the context
z2 = K.layers.Dense(64, activation="tanh")(z)
```
One Circuit (Duration)
```python
# Analyse the whole sequence
d = K.layers.GRU(48, return_sequences=True)(duration_input)
# Use the large context and get rid of the extra dimension
d = Gate([K.layers.GRU(32, return_sequences=False) for _ in range(4)], z2)(d, z2)
d = K.layers.Activation("selu")(d)
# Use a Gate, based on the low-dim context, for the output layer
d = Gate([K.layers.Dense(duration_length) for _ in range(2)], z1)(d, z1)
d = K.layers.Activation("softmax", name="Duration")(d)
```
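The Gate layer itself isn’t shown in these snippets. One way to read its signature, a list of candidate sub-layers plus a context tensor, is as a soft mixture-of-experts where the context picks the blend. A simplified sketch along those lines (not the exact layer):

```python
import tensorflow as tf
import tensorflow.keras as K

class Gate(K.layers.Layer):
    """Simplified sketch: blend candidate branches with weights
    predicted from the context vector."""

    def __init__(self, branches, context, **kwargs):
        super().__init__(**kwargs)
        self.branches = branches
        # the context passed at construction time is unused in this sketch
        self.selector = K.layers.Dense(len(branches), activation="softmax")

    def call(self, x, context):
        weights = self.selector(context)                      # (batch, n_branches)
        outputs = tf.stack([b(x) for b in self.branches], 1)  # (batch, n_branches, out_dim)
        return tf.reduce_sum(weights[:, :, None] * outputs, axis=1)
```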
As you may know, I’m not a fan of “if it doesn’t work, go bigger”. As such, I often limit myself in terms of how many trainable parameters my models have and how costly they are to run. In the end, I had 735k parameters, which met my goal of staying under 1 million. I actually feel like that’s not bad for input and output sizes this huge. Also, the weights are distributed quite homogeneously.
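Keras makes checking that budget trivial, by the way:

```python
model.summary()              # per-layer parameter breakdown
print(model.count_params())  # the real five-circuit model lands around 735k
```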