The application

The application I decided to work on is a machine learning model that can differentiate between a saxophone sound with an effect applied (wet) and without one (dry). After pre-processing the dataset, it was possible to extract reliable features for classification, to train and test the model, and to categorize the audio correctly by class (wet/dry). A machine learning model that can classify different kinds of sound timbre relates to the interdisciplinary science of music information retrieval (MIR), and more specifically to instrument recognition. This type of application can also contribute to track separation, music transcription and sound effect retrieval. The machine learning technique used in this model is supervised learning. “The algorithm builds a model of the relationship between two types of data: input data (the list of features for each example) and output data (labels or classes). The training dataset for a supervised learning problem contains examples of input-output pairs. Once the model has been trained, it can compute new outputs in response to new inputs” (Fiebrink et al., 2016). Using the Multi-layer Perceptron algorithm (MLPClassifier) available in the scikit-learn machine learning package, I was able to train an artificial neural network (ANN) that successfully classifies the two sounds (with 90-100% accuracy).

The dataset

I collected 102 samples of saxophone notes provided by Philharmonia. The samples were downloaded as mp3 files and converted to wav files for handling with the Librosa library. I made sure that all samples were around the same volume level and length (approx. 1 second). After processing the sounds by cutting out the silent parts from both the start and end of each sound file, I copied the same 102 files into a new folder labeled wet and applied two effects to each sample. The effects I used for creating the wet samples were a simple chorus and an octaver that doubles the note an octave below the original pitch. A small training dataset typically results in a poor approximation: “An over-constrained model will underfit the small training dataset, whereas an under-constrained model will likely overfit the small training data, both resulting in poor performance” (Brownlee, 2019). I started the project with a much smaller dataset of only 30 samples per class, but as the project progressed, I decided to increase the dataset to achieve a higher accuracy rate.
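As an illustration of this preparation step, the sketch below trims the silence with Librosa and approximates the octaver by mixing in a copy pitch-shifted an octave down. The folder names, the top_db threshold and the mixing gain are assumptions rather than the exact values used in the project, and the chorus effect applied to the real wet set is not reproduced here.

```python
# Minimal sketch of the sample preparation, assuming the dry wav files sit
# in a "dry" folder and the wet versions are written to a "wet" folder.
import os
import librosa
import soundfile as sf

SR = 22050  # assumed sample rate

for fname in os.listdir("dry"):
    if not fname.endswith(".wav"):
        continue
    y, _ = librosa.load(os.path.join("dry", fname), sr=SR)

    # Cut out the silent parts from the start and end of each sample
    y_trimmed, _ = librosa.effects.trim(y, top_db=30)
    sf.write(os.path.join("dry", fname), y_trimmed, SR)

    # Rough octaver: mix the note with a copy shifted an octave lower.
    # (The chorus effect used for the actual wet samples is omitted here.)
    octave_down = librosa.effects.pitch_shift(y_trimmed, sr=SR, n_steps=-12)
    y_wet = 0.5 * (y_trimmed + octave_down)
    sf.write(os.path.join("wet", fname), y_wet, SR)
```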

Feature extraction

To differentiate between the two classes of sound at hand, I first had to decide on several features that could represent and show clear differences between dry and wet signals. After considerable trial and error during feature selection, in which I tried different combinations of features to extract and plotted visual examples (figures 1-6), I settled on three features that fit my classification problem and produced the best results over time (a sketch of the extraction code follows below).

  • Mel spectrogram - a spectrogram representation of the signal with frequency mapped onto the Mel scale (y-axis) over time (x-axis).
  • Spectral centroid - indicates where the center of mass of the spectrum is located.
  • Spectral flatness - quantifies how noise-like a sound is, as opposed to tone-like.

Figures 1-6: Feature extraction plots
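The sketch below shows how these three features can be extracted for a single sample with Librosa. Collapsing each feature to its mean over all frames is my assumption about how the two-dimensional outputs were reduced to one value per sample, and the file name is hypothetical.

```python
# Extract the three features for one audio sample with Librosa.
import numpy as np
import librosa

def extract_features(y, sr, n_mels=128):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    flatness = librosa.feature.spectral_flatness(y=y)
    # Collapse each 2-D feature to a single scalar (mean over all frames)
    return [float(np.mean(mel)), float(np.mean(centroid)), float(np.mean(flatness))]

# Hypothetical file name for illustration
y, sr = librosa.load("dry/saxophone_A3.wav", sr=22050)
print(extract_features(y, sr))
```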

Preparing the dataset

The extracted features are converted from lists into NumPy arrays, and labels are attached to each class (0 = dry, 1 = wet) using a NumPy function. The next step in preparing the dataset is to apply the scale function from the scikit-learn library to standardize all features, so that no single variable dominates the others simply because it spans a different numerical range. Once the values of all features are rescaled, I convert the dataset into a two-dimensional data structure using a Pandas DataFrame and label the columns with the three features and the class. Calling the train_test_split function from scikit-learn’s model_selection module, I was able to determine the size of my testing and training data. This was also the point where I tested different split ratios to optimize the model’s results; the model’s accuracy was at its highest when the test size parameter was set between 25% and 35%. At this point, the dataset is ready for training and testing with the algorithm.
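A minimal sketch of this preparation, assuming the extraction step produced one [mel, centroid, flatness] row per sample; the placeholder rows below are illustrative only, not real measurements.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

# Placeholder feature rows; in the notebook these come from the extraction step
dry_features = [[0.52, 1810.0, 0.012], [0.61, 1920.0, 0.018]]
wet_features = [[1.24, 1410.0, 0.051], [1.13, 1330.0, 0.047]]

X = np.array(dry_features + wet_features)
y = np.concatenate([np.zeros(len(dry_features)), np.ones(len(wet_features))])  # 0 = dry, 1 = wet

# Standardize each feature so none dominates because of its numerical range
X = scale(X)

feature_names = ["melspectrogram", "spectral_centroid", "spectral_flatness"]
df = pd.DataFrame(X, columns=feature_names)
df["class"] = y

# A test size of roughly 25-35% gave the highest accuracy in this project
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_names], df["class"], test_size=0.3)
```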

The Algorithm

The algorithm used in this ML model is provided by the scikit-learn library and is called a Multi-Layer Perceptron (MLPClassifier). The perceptron was a machine that “could be taught to perform certain tasks using examples. This surprising invention was almost immediately followed by an equally surprising theoretical result, the perceptron convergence theorem, which states that a machine executing the perceptron algorithm can effectively produce a decision rule that is compatible with its training examples” (Minsky and Papert, 2017). The idea behind this algorithm is that a single perceptron can only solve linearly separable problems, but by linking multiple perceptrons together, more complex problems that are not linearly separable can be solved (Haykin, 1994). A single perceptron is a simplified model of a biological neuron: it receives input, contains an activation function and produces an output. The multi-layer perceptron, a simplified model of a neural network, contains several layers of perceptrons (neurons), in which each perceptron in a layer is connected to every perceptron in the adjacent layers.

The input layer of the model presented here consists of the three extracted features: the Mel spectrogram, spectral centroid and spectral flatness. The output layer consists of the two classes (dry/wet). Any layers between the input and output are considered hidden, and the whole design can be viewed as an artificial neural network. The MLPClassifier function provides the option to specify the number of hidden layers and the number of perceptrons in each of those layers. The connections between neurons are adjusted by weights, which determine the strength of the connection between two nodes. A bias value is also associated with each node to help control the threshold at which the perceptron’s activation function triggers; in practice, the bias is treated in the same way as the weights. Using the backpropagation algorithm already implemented within the MLPClassifier function, the weights and bias values are adjusted in proportion to how much they contribute to the overall error on each pass over the training data. Updating the weights and biases is possible thanks to a gradient descent optimization algorithm, which calculates the partial derivatives of the cost function with respect to each parameter (weight and bias) and stores the results in a gradient. This lets the neural network not only feed information forward but also propagate the error backward through the network, adjust the weights and biases, and evaluate the model again based on previous results.

Fine-tuning the parameters of the MLPClassifier function took some time and effort. I set the maximum number of training iterations to 2000, but the model usually completes training within 500 to 1000 iterations. The MLPClassifier is set to stop training if the training loss does not improve significantly over ten consecutive epochs. I mostly trained the system with the hyperbolic tangent as the activation function and found better results, with a lower iteration count, when using the hyperbolic tangent or the linear (‘identity’) function. The number of hidden layers and the size of each layer also played a significant role in the results, but considering the size of the dataset and all the parameters that can be tuned (n_mels, train_test_split, hidden_layer_sizes, activation function), I cannot determine the ‘best combination’ of parameters for 100% accuracy. I settled on three hidden layers containing five, four and three neurons.
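The configuration described above can be sketched as follows, reusing the train/test split from the preparation step; the values shown match the text where stated, and everything else is left at scikit-learn’s defaults.

```python
from sklearn.neural_network import MLPClassifier

# Three hidden layers of five, four and three neurons, tanh activation,
# a 2000-iteration ceiling, and early stopping once the training loss
# has not improved significantly for ten consecutive epochs.
clf = MLPClassifier(hidden_layer_sizes=(5, 4, 3),
                    activation="tanh",
                    max_iter=2000,
                    n_iter_no_change=10)

clf.fit(X_train, y_train)          # backpropagation / gradient descent happen here
predictions = clf.predict(X_test)  # apply the trained model to the test data
```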

Implementation, replication, and evaluation

To run the code and conduct a training and testing run, one should download the data folder and the Jupyter notebook containing the code and place both in the same folder. The first cell in the Jupyter notebook imports all required packages and libraries for running the model and sets up the environment. The second cell loads the dataset, sets the sample rate and prints the number of files. In the third cell, the features are extracted from the samples and converted to NumPy arrays and a Pandas data structure, labels are attached to the classes and features, and the dataset is partitioned into training and testing data. In the fourth cell, one can adjust the MLPClassifier parameters, train the model and apply the trained model to the testing data. In the fifth cell, it is possible to plot and print the results and get a complete report of the trained model. The machine learning model is not a complicated one and can be run to produce results within a minute or two at most. For future work, it might be interesting to compare three or more classes of different effects applied to a dry saxophone sound, or to compare other instruments using the same effects setup.
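For reference, the kind of summary the fifth cell could print can be produced with scikit-learn’s metrics, reusing the predictions from the classifier sketch above; the exact reporting and plotting code in the notebook may differ.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions, labels=[0, 1]))
print(classification_report(y_test, predictions, labels=[0, 1], target_names=["dry", "wet"]))
```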