
Digit Recognition from Sound

A simple convolutional neural network (CNN) to classify spoken digits (0-9).


Dataset: free-spoken-digit-dataset (FSDD)

Step 1 - Data Preprocessing

The dataset provides 50 recordings (WAV files) of each digit per speaker, and 3 speakers have contributed to the official project.

Total data: 1,500 audio samples in .wav format.

We split the data 90% train / 10% test.
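The split above can be sketched as follows. This is an illustrative sketch, not the project's actual code; it assumes the FSDD convention of naming files {digit}_{speaker}_{index}.wav inside a recordings directory, so the label is the character before the first underscore.

```python
import os
import random

def train_test_split_files(wav_dir, test_frac=0.1, seed=42):
    """Collect FSDD wav paths and split them 90/10.

    Assumes FSDD's naming convention {digit}_{speaker}_{index}.wav,
    so the label is read from the filename before the first underscore.
    """
    files = sorted(f for f in os.listdir(wav_dir) if f.endswith(".wav"))
    pairs = [(f, int(f.split("_")[0])) for f in files]
    random.Random(seed).shuffle(pairs)  # fixed seed for a reproducible split
    n_test = int(len(pairs) * test_frac)
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test
```

With 1,500 recordings this yields 1,350 training and 150 test samples, matching the support column of the test report below.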

Possible approaches to this problem include feeding the network spectrograms of the audio, or Mel-frequency cepstral coefficients (MFCCs).

Sample spectrogram (figure)

We move forward with the MFCC approach.
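A minimal sketch of how MFCCs are computed, using only NumPy and SciPy (in practice a library such as librosa does this in one call). The parameters below (8 kHz sample rate, 256-sample frames, 26 mel bands, 13 coefficients) are common defaults for FSDD-style audio, not the project's confirmed settings.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=8000, n_fft=256, hop=128, n_mels=26, n_mfcc=13):
    """Minimal MFCC sketch (FSDD audio is sampled at 8 kHz).

    Frame the signal, take the power spectrum, pool it through a
    mel-spaced triangular filterbank, then DCT the log filter energies.
    """
    # frame the signal with a Hann window
    frames = np.array([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # mel-spaced triangular filterbank
    def hz_to_mel(f):
        return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m):
        return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)

    # log mel energies, then keep the first n_mfcc DCT coefficients
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```

Each recording becomes a small 2-D array (frames x coefficients) that the CNN can treat like an image.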

Step 2 - Model Building

We use Keras for the model building.

Model: TensorBoard visualisation of the architecture (figure)
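A sketch of what such a Keras model might look like. The exact architecture and input shape (13 MFCC coefficients x 32 frames here, padded to a fixed length) are assumptions for illustration, not the project's actual configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(13, 32, 1), n_classes=10):
    """Small CNN over MFCC "images" — an illustrative sketch.

    input_shape assumes 13 MFCCs x 32 time frames, single channel.
    """
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                      # regularisation
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `model.summary()` (or logging to TensorBoard as above) prints the layer-by-layer structure.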

Step 3 - Training

Model accuracy (figure)

Model loss (figure)

We get 98% validation accuracy!
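The training step can be sketched as below. The random arrays stand in for the real MFCC features and labels, and the tiny model, epoch count, and batch size are placeholders, not the project's actual hyperparameters.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative stand-in data with an assumed MFCC shape (13 x 32, 1 channel).
x_train = np.random.rand(64, 13, 32, 1).astype("float32")
y_train = np.random.randint(0, 10, size=64)

model = keras.Sequential([
    keras.Input(shape=(13, 32, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hold out part of the training data for validation; Keras records
# per-epoch accuracy and loss in history.history for the plots above.
history = model.fit(x_train, y_train, epochs=2, batch_size=16,
                    validation_split=0.1, verbose=0)
```

The curves plotted above come from exactly these per-epoch history values (`accuracy`, `val_accuracy`, `loss`, `val_loss`).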

Step 4 - Test

We use the test data to check the model's performance on unseen data. It achieves 97% accuracy:

         precision    recall  f1-score   support

      0       1.00      0.84      0.91        19
      1       0.87      0.87      0.87        15
      2       1.00      1.00      1.00        23
      3       0.91      1.00      0.95        10
      4       1.00      1.00      1.00        10
      5       1.00      1.00      1.00        23
      6       1.00      1.00      1.00        13
      7       0.93      1.00      0.96        13
      8       1.00      1.00      1.00        14
      9       0.91      1.00      0.95        10

avg / total       0.97      0.97      0.97       150
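A per-class report like the one above can be produced with scikit-learn. The arrays below are synthetic stand-ins; in the real pipeline `y_true` comes from the held-out test set and `y_pred` from `model.predict(...).argmax(axis=1)`.

```python
import numpy as np
from sklearn.metrics import classification_report

# Synthetic stand-in labels and predictions (NOT the project's results):
# 150 test samples with a handful of deliberate misclassifications.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=150)
y_pred = y_true.copy()
flip = rng.choice(150, size=5, replace=False)
y_pred[flip] = (y_pred[flip] + 1) % 10

# Prints precision, recall, f1-score, and support per digit class.
print(classification_report(y_true, y_pred, digits=2))
```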

We have thus trained a neural network to correctly classify spoken digits.