Digit Recognition from Sound
A simple convolutional neural network (CNN) to classify spoken digits (0-9).
Dataset: free-spoken-digit-dataset (FSDD)
Step 1 - Data Preprocessing
The data is provided as 50 audio samples (WAV files) of each digit per person, and 3 people have contributed to the official project.
Total data = 1500 audio samples in .wav format.
We split the data 90% train / 10% test (a sketch of this split follows below).
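As a rough sketch of this step (assuming the FSDD recordings sit in a local `recordings/` directory and follow the dataset's `{digit}_{speaker}_{index}.wav` naming convention), the split could look like this:

```python
import os
from sklearn.model_selection import train_test_split

DATA_DIR = "recordings"  # assumed local copy of the FSDD recordings

wav_paths = sorted(
    os.path.join(DATA_DIR, f)
    for f in os.listdir(DATA_DIR)
    if f.endswith(".wav")
)
# FSDD files are named "{digit}_{speaker}_{index}.wav", so the label
# is the first character of the filename.
labels = [int(os.path.basename(p)[0]) for p in wav_paths]

# Stratify so every digit keeps the same 90/10 proportion.
train_paths, test_paths, y_train, y_test = train_test_split(
    wav_paths, labels, test_size=0.10, stratify=labels, random_state=42
)
print(len(train_paths), len(test_paths))  # 1350 and 150 for 1500 files
```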
Possible approaches to this problem:

1. Simple Neural Network
- Load the WAV file as a NumPy array, and feed this array to a simple multi-layer perceptron (MLP).
- When we convert a WAV file to a NumPy array, the data is stored as a 1-D array. The length of this array is not fixed, and it is highly dependent on the duration of the recording.
- Due to this, the first layer of the MLP would need roughly 10,000 neurons, and the layers that follow would add extreme complexity.
2. Spectrogram
- Convert the WAV data into a spectrogram (an image) of size 64×64.
- Feed the image to a simple neural network with 4096 neurons in the first layer.
- This is a workable approach, but the number of neurons is large, and it does not seem logical to flatten out an image and feed it to a simple NN. We can do better.
- Building on the previous point, we can instead feed this 64×64 image to a simple Convolutional Neural Network.
- Every audio clip will be converted into a simple 2-D image, and this image will be fed to a CNN. This speeds up training, and since CNNs perform very well on simple image-recognition tasks, we can expect good results.
(Figure: example spectrogram of a spoken digit)
3. Mel-Frequency Cepstral Coefficients (MFCC)
Here’s what Wikipedia has to say about MFCC:

> In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear “spectrum-of-a-spectrum”). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system’s response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example, in audio compression.

- MFCC is a better representation of sound, and it can be treated like an image for all practical purposes during training.
- Thus, I think using the MFCC representation of the WAV files is the best approach.
We move forward with the MFCC approach.
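As a minimal sketch of the MFCC preprocessing, assuming librosa is used for feature extraction (the `n_mfcc` and `max_len` values here are illustrative choices, not values taken from the original project):

```python
import numpy as np
import librosa

def wav_to_mfcc(path, n_mfcc=20, max_len=32):
    """Load a WAV file and return a fixed-size (n_mfcc, max_len) MFCC matrix."""
    y, sr = librosa.load(path)                              # decode the audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, t)
    # Pad or truncate the time axis so every clip yields the same shape.
    if mfcc.shape[1] < max_len:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_len - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_len]
    return mfcc

X_train = np.array([wav_to_mfcc(p) for p in train_paths])
X_test = np.array([wav_to_mfcc(p) for p in test_paths])
# Add a trailing channel axis so the arrays look like grayscale images to the CNN.
X_train = X_train[..., np.newaxis]
X_test = X_test[..., np.newaxis]
```

Padding to a fixed length is one simple way to handle the variable clip durations that made the raw-waveform MLP approach awkward.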
Step 2 - Model Building
We use Keras to build the model.
- Model Hyperparameters
- Optimizer - Adadelta
- Activation - ReLU
- Number of epochs - 50
- Batch Size - 64
- Learning rate - Adadelta default
- Loss - Categorical Crossentropy
- Model Structure
- 3 convolutional layers
- 1 Max Pooling Layer
- 3 dense layers (MLP)
- Softmax Activation for output
- BatchNormalization layer after every convolutional and dense layer
- Dropout after every dense layer of the MLP (a Keras sketch of this architecture follows below)
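Here is a minimal Keras sketch matching the structure above. The filter counts, kernel sizes, dense-layer widths, dropout rate, and the `(20, 32, 1)` input shape are assumptions for illustration, not values taken from the original project:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed input shape: the (20, 32, 1) MFCC "images" from the preprocessing sketch.
model = keras.Sequential([
    layers.Input(shape=(20, 32, 1)),
    # 3 convolutional layers, each followed by BatchNormalization
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    # 1 max pooling layer
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    # 3 dense layers (counting the softmax output), with BatchNormalization
    # and Dropout on the hidden ones; placement around the output layer is
    # an interpretation of the list above.
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.Dense(10, activation="softmax"),  # one output per digit
])

model.compile(
    optimizer=keras.optimizers.Adadelta(),   # default learning rate, as listed
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```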
(Figure: TensorBoard visualisation of the model)
Step 3 - Training
- Accuracy: (Figure: model accuracy over training epochs)
- Loss: (Figure: model loss over training epochs)
We get 98% validation accuracy!
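A minimal training sketch using the hyperparameters listed above; the `validation_split` fraction and the log directory are assumptions, not values from the original project:

```python
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.utils import to_categorical

# One-hot encode the integer labels for the categorical-crossentropy loss.
y_train_cat = to_categorical(y_train, num_classes=10)

history = model.fit(
    X_train, y_train_cat,
    epochs=50,
    batch_size=64,
    validation_split=0.1,                     # fraction held out for the validation curves
    callbacks=[TensorBoard(log_dir="logs")],  # produces the TensorBoard plots shown above
)
```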
Step 4 - Test
We use the test data to check the model's performance on unseen data. We get 97% accuracy! The per-class classification report:
| Digit | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 | 1.00 | 0.84 | 0.91 | 19 |
| 1 | 0.87 | 0.87 | 0.87 | 15 |
| 2 | 1.00 | 1.00 | 1.00 | 23 |
| 3 | 0.91 | 1.00 | 0.95 | 10 |
| 4 | 1.00 | 1.00 | 1.00 | 10 |
| 5 | 1.00 | 1.00 | 1.00 | 23 |
| 6 | 1.00 | 1.00 | 1.00 | 13 |
| 7 | 0.93 | 1.00 | 0.96 | 13 |
| 8 | 1.00 | 1.00 | 1.00 | 14 |
| 9 | 0.91 | 1.00 | 0.95 | 10 |
| avg / total | 0.97 | 0.97 | 0.97 | 150 |
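A minimal sketch of how such a report can be produced with scikit-learn's `classification_report`, assuming `X_test` and `y_test` were prepared the same way as the training data:

```python
import numpy as np
from sklearn.metrics import classification_report

y_prob = model.predict(X_test)      # per-class probabilities, shape (150, 10)
y_pred = np.argmax(y_prob, axis=1)  # most probable digit for each clip
print(classification_report(y_test, y_pred))
```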
We have thus trained a convolutional neural network to correctly classify spoken digits.