Music-Generating AI w/ Gates

Quick Overview

I love music; see here if you doubt me, or just take a look at the >200 songs in my Spotify playlist.
Music requires artists, though, and I’m… not good at instruments.
Okay, this whole thing is just another excuse to generate music with AI.

Goal

Here, my goal was to:

Disclaimer

The results aren’t perfect. If you want some good AI-generated music, go check out Magenta and their actually good tech.
Also, I don’t have a supercomputer at home, so the model could only be trained for a few hours, nothing more.

The Approach

First, I knew I had to use some kind of recurrent unit to get some temporal consistency.
Since, again, I don’t have a supercomputer, I chose GRUs, which I feel are the least expensive recurrent units suited for the job here.
I also used LSTMs where I felt the “Long” memory could be useful.

Then, I had to use Gates, which means I had to feed in context. I couldn’t just use the whole input as context (which works for tiny inputs), since I’m feeding the model a whole sequence of notes at every point in time. I ended up choosing the last few notes played, though I feel like there are better options.
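As a concrete sketch of that choice, here is roughly how (context, target) pairs over the last few notes could be built. This is plain illustrative Python, not the post’s actual code; `CONTEXT_SIZE = 5` is an assumption based on the `(None, 5, 800)` context input in the model summary further down.

```python
# Sketch: pair each note with the last few notes that preceded it.
CONTEXT_SIZE = 5  # assumed, matching the (None, 5, 800) context input

def make_context_windows(notes, context_size=CONTEXT_SIZE):
    """For each position, pair the last `context_size` notes with the next note."""
    pairs = []
    for i in range(context_size, len(notes)):
        context = notes[i - context_size:i]  # the last few notes played
        target = notes[i]                    # the note the model should predict
        pairs.append((context, target))
    return pairs

# Toy usage: notes as (pitch, duration, offset, volume) tuples
notes = [(60 + j, 0.25, 0.25, 80) for j in range(8)]
pairs = make_context_windows(notes)
```

Each training step then sees a short, fixed-size context instead of the whole piece.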

Finally, since I wanted to give full control to the model, I had it choose not only the pitch, but also its length and volume, along with the time between notes (this allowed the model to take longer pauses, play two notes at the same time, etc.). I then opted for a multi-circuit approach, with one circuit per output.
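To see why predicting the time between notes gives the model that freedom, here is a hedged sketch of decoding a list of (pitch, duration, offset, volume) events back into absolute start times. The exact representation isn’t spelled out in the post; this is my reading of it: an offset of 0 stacks two notes into a chord, and a large offset is a pause.

```python
def decode_events(events):
    """Turn (pitch, duration, offset, volume) events into (start, end, pitch, volume).

    `offset` is the time since the previous note's start, so offset == 0
    plays two notes simultaneously and a large offset produces a pause.
    """
    t = 0.0
    placed = []
    for pitch, duration, offset, volume in events:
        t += offset
        placed.append((t, t + duration, pitch, volume))
    return placed

# A two-note chord (offsets of 0), then a pause of 2.0 before the next note
events = [(60, 1.0, 0.0, 90), (64, 1.0, 0.0, 90), (67, 0.5, 2.0, 70)]
timeline = decode_events(events)
```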

The Framework

At the time I only knew TensorFlow well enough, and only had Gates implemented in it, so I went with the most natural option.

The Data

I picked the data from the MAESTRO dataset, since it has a lot of data and very little noise.
As far as preprocessing goes, I just split each song into notes, from which I extracted the characteristics.
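As a sketch of that preprocessing step, assuming each raw note arrives as a (start, end, pitch, velocity) event (the fields MIDI readers like pretty_midi expose; the post doesn’t say which tool was used), the per-note characteristics could be extracted like this:

```python
def notes_to_features(raw_notes):
    """Extract (pitch, duration, offset, volume) per note from (start, end, pitch, velocity).

    Notes are sorted by start time; `offset` is the gap between consecutive starts.
    """
    raw_notes = sorted(raw_notes, key=lambda n: n[0])
    features = []
    prev_start = raw_notes[0][0] if raw_notes else 0.0
    for start, end, pitch, velocity in raw_notes:
        features.append({
            "pitch": pitch,
            "duration": end - start,
            "offset": start - prev_start,
            "volume": velocity,
        })
        prev_start = start
    return features

raw = [(0.0, 0.5, 60, 80), (0.5, 1.5, 64, 90)]
feats = notes_to_features(raw)
```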

The Model

It’s here that it becomes a brain-tangler. Multi-circuit models are hard to represent, so much so that this one broke the Weights & Biases model visualizer.
It consisted of 5 circuits:

  • Z (largest)

    Which handles the context
    • Z1: Low-dimensional representation of the context
    • Z2: High-dimensional representation of the context
  • K (second largest)

    Which handles the key / pitch
  • D (medium sized)

    Which handles the duration
  • O (little)

    Which handles the offset / space between notes
  • V (tiniest one)

    Which handles the volume of each note

These are all connected to Z, and the inputs are split in 5.
The input to Z is a combination of all the characteristics of the last few notes.
The input to each of the 4 others is the corresponding characteristic over a longer sequence.
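Judging from the model summary further down, the per-characteristic one-hot sizes are 545 (key), 118 (duration), 10 (volume) and 127 (offset), and the Z input’s 800 features are exactly their concatenation (545 + 118 + 10 + 127 = 800). A minimal sketch of assembling one context timestep that way (the sizes are read off the summary, not stated in the prose):

```python
# One-hot sizes per characteristic, as read from the model summary
SIZES = {"key": 545, "duration": 118, "volume": 10, "offset": 127}

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def context_vector(note):
    """Concatenate the four one-hot characteristics of one note into an 800-dim vector."""
    return (one_hot(note["key"], SIZES["key"])
            + one_hot(note["duration"], SIZES["duration"])
            + one_hot(note["volume"], SIZES["volume"])
            + one_hot(note["offset"], SIZES["offset"]))

vec = context_vector({"key": 60, "duration": 4, "volume": 7, "offset": 0})
```

Stacking 5 such vectors gives the `(5, 800)` context input; the per-circuit inputs are just the individual one-hots over a longer sequence of 25 notes.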

Here’s a few snippets of the code:

Context Manipulation

# Assuming: from tensorflow import keras as K
# and, per the model summary below: context_input = K.Input(shape=(5, 800))
# Analyse the whole sequence
z = K.layers.Bidirectional(K.layers.LSTM(32, return_sequences=True))(context_input)
z = K.layers.Activation("selu")(z)
# Collapse the time dimension
z = K.layers.GRU(16, return_sequences=False)(z)
z = K.layers.Activation("relu")(z)
# Low-dimensional representation of the context
z1 = K.layers.Dense(32, activation="tanh")(z)
# Higher-dimensional representation of the context
z2 = K.layers.Dense(64, activation="tanh")(z)

One Circuit (Duration)

# Per the model summary below (assumed): duration_input = K.Input(shape=(25, 118))
# and duration_length = 118
# Analyse the whole sequence
d = K.layers.GRU(48, return_sequences=True)(duration_input)
# Use the large context and get rid of the extra dimension
d = Gate([K.layers.GRU(32, return_sequences=False) for _ in range(4)], z2)(d, z2)
d = K.layers.Activation("selu")(d)
# Use a Gate, based on the low-dim context, for the output layer
d = Gate([K.layers.Dense(duration_length) for _ in range(2)], z1)(d, z1)
d = K.layers.Activation("softmax", name="Duration")(d)
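The `Gate` layer itself is custom and its definition isn’t shown, so here is one plausible reading of what it does, sketched in plain Python rather than Keras for brevity: the context is turned into softmax weights over several candidate sub-layers, and the output is the weighted mix of their outputs. Every name and formula here is an assumption, not the author’s implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gate(sublayers, x, context_scores):
    """Assumed Gate behaviour: mix the sublayers' outputs using softmax
    weights derived from the context (here, raw scores stand in for a
    learned projection of the context vector)."""
    weights = softmax(context_scores)       # one weight per sublayer
    outputs = [layer(x) for layer in sublayers]
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]

# Toy sublayers: each scales its input by a different factor
sublayers = [lambda x, k=k: [k * v for v in x] for k in (1.0, 2.0)]
mixed = gate(sublayers, [1.0, 1.0], [0.0, 0.0])  # equal scores -> average of 1x and 2x
```

Under this reading, the context decides, per sample, how much each candidate sub-layer contributes.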

Total Weights

As you may know, I’m not a fan of “if it doesn’t work, go bigger”. As such, I often limit myself in terms of how many trainable parameters I use and how costly my model is to run. In the end, I had 735k parameters, which is good enough for my goal, as I aimed for fewer than 1 million.
I actually feel like that’s not bad for input-output sizes this large. Also, the weights are distributed quite homogeneously.

Model: "Gated Musicgen"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 5, 800)] 0
__________________________________________________________________________________________________
bidirectional (Bidirectional) (None, 5, 64) 213248 input_5[0][0]
__________________________________________________________________________________________________
activation (Activation) (None, 5, 64) 0 bidirectional[0][0]
__________________________________________________________________________________________________
gru (GRU) (None, 16) 3936 activation[0][0]
__________________________________________________________________________________________________
input_1 (InputLayer) [(None, 25, 545)] 0
__________________________________________________________________________________________________
activation_1 (Activation) (None, 16) 0 gru[0][0]
__________________________________________________________________________________________________
input_2 (InputLayer) [(None, 25, 118)] 0
__________________________________________________________________________________________________
input_3 (InputLayer) [(None, 25, 10)] 0
__________________________________________________________________________________________________
input_4 (InputLayer) [(None, 25, 127)] 0
__________________________________________________________________________________________________
gru_1 (GRU) (None, 25, 72) 133704 input_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 64) 1088 activation_1[0][0]
__________________________________________________________________________________________________
gru_8 (GRU) (None, 25, 48) 24192 input_2[0][0]
__________________________________________________________________________________________________
gru_13 (GRU) (None, 25, 24) 2592 input_3[0][0]
__________________________________________________________________________________________________
gru_18 (GRU) (None, 25, 48) 25488 input_4[0][0]
__________________________________________________________________________________________________
gate (Gate) (None, 48) 105798 gru_1[0][0]
__________________________________________________________________________________________________
gate_2 (Gate) (None, 32) 31748 gru_8[0][0]
__________________________________________________________________________________________________
gate_4 (Gate) (None, 16) 8324 gru_13[0][0]
__________________________________________________________________________________________________
gate_6 (Gate) (None, 48) 56708 gru_18[0][0]
__________________________________________________________________________________________________
activation_2 (Activation) (None, 48) 0 gate[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 32) 544 activation_1[0][0]
__________________________________________________________________________________________________
activation_3 (Activation) (None, 32) 0 gate_2[0][0]
__________________________________________________________________________________________________
activation_4 (Activation) (None, 16) 0 gate_4[0][0]
__________________________________________________________________________________________________
activation_5 (Activation) (None, 48) 0 gate_6[0][0]
__________________________________________________________________________________________________
gate_1 (Gate) (None, 545) 106952 activation_2[0][0]
__________________________________________________________________________________________________
gate_3 (Gate) (None, 118) 7854 activation_3[0][0]
__________________________________________________________________________________________________
gate_5 (Gate) (None, 10) 406 activation_4[0][0]
__________________________________________________________________________________________________
gate_7 (Gate) (None, 127) 12512 activation_5[0][0]
__________________________________________________________________________________________________
Key (Activation) (None, 545) 0 gate_1[0][0]
__________________________________________________________________________________________________
Duration (Activation) (None, 118) 0 gate_3[0][0]
__________________________________________________________________________________________________
Volume (Activation) (None, 10) 0 gate_5[0][0]
__________________________________________________________________________________________________
Offset (Activation) (None, 127) 0 gate_7[0][0]
==================================================================================================
Total params: 735,094
Trainable params: 735,094
Non-trainable params: 0
__________________________________________________________________________________________________
Output Sizes : [(None, 545), (None, 118), (None, 10), (None, 127)]

The Results

Sadly, these samples are apparently from a training session that had less freedom over the length of the notes. That’s why they’re all equally spaced.

In the beginning, as expected, I got random noise:

Then, it learnt chords:

Sometimes, it got stuck, only to give a great melody in the end:

Sometimes, it got a little repetitive:

At some point, it managed to fit multiple patterns in a single file. It still had a few accidents here and there:

And some of the results were actually quite good! (about 4h of training)

Possible Improvements

At the time of writing this article, I’ve learnt more about ML (thankfully), and I think I could improve on these points:

  • Activation Functions

    Even though they were fine-tuned and experimented with, I think gelu would have been a better fit for the temporal data (it’s less costly as well), and tanh feels a bit arbitrary for the Z circuit.
  • Gate Use

    I had just discovered Gates, so I wanted to use them everywhere, but, in my experience, they aren’t the most suitable when there are fewer than two in a row, or in an output layer.
  • Input-Output Sizes

    I one-hot encoded both the input and output data, which makes the model quite bulky.
  • Naming Layers

    The summary is quite a mess; I should have named the layers.