Most music production today depends heavily on technology, from the start of a song's creation to the final adjustments during mixing and mastering. Still, there are usually many human aspects involved, such as singing, playing instruments and operating music software. The skill of making music is not something we are born with, but something we learn. This learning process can be mimicked in a computer using machine learning, which can then be used to generate music. Magenta is a project started by the Google Brain team to demonstrate machine learning as a tool in creative processes, one of which is making music. Drum RNN (Simon et al. 2018) is one of its models; it applies language modeling to drum track generation using an LSTM. In this report I explore the Drum RNN model by training it on MIDI drum files from the “Groove MIDI dataset” (Gillick et al, 2019). Once trained, the Drum RNN model can generate drum MIDI files. I look at how training is affected by the number of neural layers, the number of neurons in each layer and the number of training cycles. I also briefly discuss the quality of the generated drum MIDI files from a drummer's perspective.


The Drum RNN model is based on a recurrent neural network (RNN), hence its name. More specifically, it uses an LSTM (long short-term memory) architecture. An LSTM (Hochreiter and Schmidhuber, 1997) has feedback connections, which make it able to learn how to map sequences to sequences. For the Drum RNN model to learn, it needs sequential data to learn from, so we train it by giving it sequences of drum patterns. Once the model has learned from these drum patterns, it can take the beginning of a drum pattern and produce the continuation. Take, for example, a simple pattern of a single kick drum and a single snare drum hit alternately in a loop. If this were the training sequence, the system would learn that the snare comes after the kick and the kick after the snare. Given a kick drum as a starting point, the trained model would then produce a snare drum, a kick drum, and so on. The ability of RNNs to make use of sequential information in the data is what makes them suitable for learning and producing music, since music is a sequence of sounds (for example MIDI notes). To take advantage of what has been played previously and use it to learn what should come next, the neural network needs some type of memory. This is where the LSTM architecture comes in. It is built from memory cells and gate units. The memory cells carry useful information about the current state, and the gates decide when to keep or overwrite the contents of the memory cells. This is one of the reasons this type of architecture is suitable for learning musical patterns: it can use information from earlier in the sequence to predict the next note.
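To make the role of the memory cell and the gates concrete, here is a minimal sketch of a single LSTM step with scalar state, written in plain Python. It uses the commonly taught variant with input, forget and output gates, which is an assumption on my part: it is not taken from the Drum RNN source code, and the weight names and values are hypothetical.

```python
import math


def sigmoid(x):
    """Squash a value into (0, 1) so it can act as a soft on/off gate."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    x       current input (for example an encoded drum event)
    h_prev  previous output of the cell
    c_prev  previous content of the memory cell
    w       dict of hypothetical (input weight, recurrent weight, bias) triples
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    forget_gate = sigmoid(pre("forget"))     # how much old memory to keep
    input_gate = sigmoid(pre("input"))       # how much new information to write
    output_gate = sigmoid(pre("output"))     # how much of the memory to expose
    candidate = math.tanh(pre("candidate"))  # proposed new memory content

    c = forget_gate * c_prev + input_gate * candidate
    h = output_gate * math.tanh(c)
    return h, c


# Hypothetical weights: the large forget bias and the very negative input bias
# make the cell keep its memory almost unchanged, whatever the input is.
keep_weights = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, w=keep_weights)
print(round(c, 3))  # the memory cell stays close to its previous value of 0.7
```

Swapping the two biases (forget gate closed, input gate open) would instead overwrite the memory with the candidate value. During training, the network learns weights that open and close these gates at useful moments in the drum pattern.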

Data and feature extraction

“Unlike melodies, drum tracks are polyphonic in the sense that multiple drums can be struck simultaneously. Despite this, we model a drum track as a single sequence of events by a) mapping all of the different MIDI drums onto a smaller number of drum classes, and b) representing each event as a single value representing the set of drum classes that are struck.” (Simon et al. 2018) The model offers two configurations. The first maps all drums to a single drum class and uses a basic binary encoding of drum tracks, where the value 0 means silence and 1 means that at least one drum is struck at that step. The second maps all drums to a 9-piece drum kit consisting of bass drum, snare drum, closed and open hi-hat, three toms, and crash and ride cymbals. I chose the 9-piece kit configuration because I wanted a full kit in the output. “Our first step will be to convert a collection of MIDI files into NoteSequences. NoteSequences are protocol buffers, which is a fast and efficient data format, and easier to work with than MIDI files.” (Simon et al. 2018) Magenta provides a script, “convert_dir_to_note_sequences”, that converts MIDI files (input) into NoteSequences (output). “SequenceExamples are fed into the model during training and evaluation. Each SequenceExample will contain a sequence of inputs and a sequence of labels that represent a drum track” (Simon et al. 2018). By running the “drums_rnn_create_dataset” script, we extract drum tracks from our NoteSequences and save them as SequenceExamples. These examples are what we use to train and evaluate the model.
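The two encodings described above can be sketched as follows. This is an illustration under stated assumptions, not Magenta's actual code: each of the nine drum classes is represented here by a single General MIDI percussion pitch, whereas a real mapping would fold several pitches into each class, and the one-bit-per-class integer is one natural way of turning a set of struck drums into a single value. The names DRUM_CLASSES, encode_step, decode_step and encode_binary are hypothetical.

```python
# One representative General MIDI percussion pitch per drum class (assumption:
# a real mapping would assign several MIDI pitches to each class).
DRUM_CLASSES = [
    ("bass drum", 36),
    ("snare drum", 38),
    ("closed hi-hat", 42),
    ("open hi-hat", 46),
    ("low tom", 45),
    ("mid tom", 48),
    ("high tom", 50),
    ("crash cymbal", 49),
    ("ride cymbal", 51),
]


def encode_step(pitches):
    """Encode the set of drums struck at one time step as a single integer.

    Each drum class owns one bit, so any combination of the nine classes
    maps to a distinct value between 0 (silence) and 511 (everything).
    """
    value = 0
    for bit, (_, pitch) in enumerate(DRUM_CLASSES):
        if pitch in pitches:
            value |= 1 << bit
    return value


def decode_step(value):
    """Return the names of the drum classes contained in an encoded value."""
    return [name for bit, (name, _) in enumerate(DRUM_CLASSES) if value & (1 << bit)]


def encode_binary(pitches):
    """Single-class configuration: 1 if any drum is struck, otherwise 0."""
    return 1 if pitches else 0


print(encode_step({36, 42}))   # bass drum + closed hi-hat -> 1 + 4 = 5
print(decode_step(5))          # ['bass drum', 'closed hi-hat']
print(encode_binary({38}))     # 1
```

With nine classes there are 2**9 = 512 possible events per step, so under this encoding the model is effectively choosing one of 512 "words" at every step, which is what lets a language-modeling approach handle polyphonic drum tracks.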

Data set used to train and test the system

As mentioned above, the data used to train and test the system are MIDI drum files from the “Groove MIDI dataset” (Gillick et al, 2019). These files were played on a Roland TD-11 by a total of 10 drummers, most of them hired professionals. The dataset consists of 1150 MIDI files covering a range of genres, which are labeled in the file names. I chose to use the beat files of the rock genre (excluding the fill files), a total of 239 MIDI files. Rock is a genre I am familiar with as a drummer, and I did not want to combine genres, since more genres give the model more patterns to learn. The model would then most likely require more training cycles to reach the same perplexity, if it reached it at all. The reason is simply that different genres often have different drum structures and patterns, so I assume that including several genres would have made the data set more complex. The same assumption applies to adding the drum fills.
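Because the genre and the beat/fill distinction are labeled in the file names, a selection like this can be made from the names alone. The sketch below assumes names of the form "id_genre_tempo_beat-or-fill_time-signature.mid", with an optional sub-style after a hyphen in the genre field. That layout, the example paths and the function name select_files are assumptions made for illustration and should be checked against the actual dataset.

```python
from pathlib import PurePosixPath


def select_files(paths, genre="rock", kind="beat"):
    """Keep only the MIDI files whose name marks them as `genre` and `kind`.

    Assumed name layout: <id>_<genre>_<tempo>_<beat|fill>_<time signature>.mid
    """
    selected = []
    for path in paths:
        fields = PurePosixPath(path).stem.split("_")
        if len(fields) < 4:
            continue  # name does not follow the assumed layout
        main_genre = fields[1].split("-")[0]  # drop an optional sub-style
        if main_genre == genre and fields[3] == kind:
            selected.append(path)
    return selected


# Hypothetical example paths, not real entries from the dataset.
example_paths = [
    "drummer1/session1/1_rock_120_beat_4-4.mid",
    "drummer1/session1/2_rock_120_fill_4-4.mid",
    "drummer1/session1/3_funk_80_beat_4-4.mid",
]
print(select_files(example_paths))  # ['drummer1/session1/1_rock_120_beat_4-4.mid']
```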

Training and evaluation of the model

To see whether the system is actually learning from the data and improving with each cycle, we can look at the perplexity of the model during training. The perplexity is printed in the terminal during training, which lets me check how well training is going at every 10th cycle. By changing the number of RNN layers and the number of cells in each layer, I can search by brute force for the best configuration for my data set, taking some inspiration from Hewahi et al. (2019) on using fewer layers with more neurons in each. The end result nevertheless seems to be quite similar across configurations once a certain number of cycles is reached. For example, one training run with two layers of 64 neurons each gave the following perplexities: 510 at the first cycle, 182 at the 11th cycle, 59 at the 21st cycle, 28 at the 31st cycle and 13 at the 91st cycle. Other configurations might reach a perplexity close to 13 in fewer cycles, but each cycle might then take longer to run, since I would be increasing either the number of layers or the number of neurons (or both). The artistic quality is harder to evaluate. The MIDI drum files generated by the trained Drum RNN model sound like they belong to the rock genre, but the patterns seem somewhat random. I expected the generated files to follow a stricter pattern and settle into a loop, playing the same thing over and over. From what I have tested so far, the model seems to do the opposite: it both changes the pattern of the beat and randomly adds toms and crashes, like a talented child who has learned to play drums but is not yet mature enough to stick to any particular pattern or beat.
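For reference, perplexity is the exponential of the average negative log-probability that a model assigns to the events that actually occur, so lower values mean better predictions. The sketch below computes it from a list of such probabilities. The function name and the example probabilities are made up for illustration and are not output from Drum RNN.

```python
import math


def perplexity(true_event_probs):
    """Perplexity from the probabilities a model gave to the events that occurred."""
    if not true_event_probs:
        raise ValueError("need at least one probability")
    mean_nll = -sum(math.log(p) for p in true_event_probs) / len(true_event_probs)
    return math.exp(mean_nll)


# A model that guesses uniformly among 512 possible events (2**9 combinations
# of nine drum classes) has a perplexity of 512.
print(round(perplexity([1 / 512] * 16)))  # 512

# A model that puts more probability on the events that occur scores lower.
print(round(perplexity([0.5, 0.25, 0.125, 0.125]), 2))
```

If the 9-piece configuration is encoded as one of 2**9 = 512 possible events per step, then the perplexity of 510 at the first cycle is close to uniform guessing, and a perplexity of 13 means the model is roughly as uncertain as if it were choosing among 13 equally likely events at each step.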

Personal reflection

My original idea for this project was to record my own drum MIDI files and use them to train the Music VAE model, and then compare the MIDI drum files generated by the trained model with my own recordings. But after many unsuccessful attempts at recording and at training the Music VAE model with my recorded MIDI files, I had to give up. At first I had problems getting the Music VAE model to work at all. While trying to fix this I found Drum RNN, and decided to switch to it since it was a better fit for my project. However, I still could not use my recorded MIDI drum files as a dataset for training the model. After spending the first week on troubleshooting, I decided to change the project to use a MIDI dataset recorded for GrooVAE (Gillick et al, 2019). I also trained the model on some metal MIDI drum files, but had no time to compare the results with the “rock model”. Given more time, it would have been exciting to train the model on more data and compare the results. None of the machine learning tools I have used were made or contributed to by me. My contribution to this project was to explore the Drum RNN model by training it on MIDI drum files (rock genre) and changing the training configuration.

Generated drum beats


Simon, I., Roberts, A., Schatz, J. and Hawthorne, C. (2018). Drum RNN. [Viewed 11 September 2019] https://www.github.com/tensorflow/magenta/blob/master/magenta/scripts/README.md

Hochreiter, S. and Schmidhuber, J. (1997). “Long short-term memory.” Neural Computation 9 (8), pp. 1735–1780.

Gillick, J., Roberts, A., Engel, J., Eck, D. and Bamman, D. (2019). “Learning to Groove with Inverse Sequence Transformations.” International Conference on Machine Learning (ICML). [Viewed 12 September 2019]

Hewahi, N., AlSaigal, S. and AlJanahi, S. (2019). “Generation of music pieces using machine learning: long short-term memory neural networks approach.” Arab Journal of Basic and Applied Sciences 26 (1), pp. 397–413. DOI: 10.1080/25765299.2019.1649972