Introduction
- We introduce a new AmSDD containing the digits 0 (zaero) to 9 (zet’enyi), recorded from 120 volunteer speakers of different age groups, genders, and dialects, with 10 repetitions of each digit per speaker. The dataset is available for download (see footnote 1).
- We propose an AmSDR system that uses this AmSDD with several classical SML models to evaluate prediction performance and to understand the nature of the dataset.
- To further improve the accuracy of the AmSDR, we propose a DL model based on a CNN architecture with batch normalization and compare it against the classical SML baselines.
- We conduct extensive experimental evaluations of the proposed work using MFCC and Mel-spectrogram feature extraction techniques.
Related work
Amharic spoken digits recognition system
Speech collection
Digits | Amharic digits script | Pronunciation |
---|---|---|
0 | | zaero |
1 | | ānidi |
2 | | huleti |
3 | | sositi |
4 | | ārati |
5 | | āmisiti |
6 | | sedisiti |
7 | | sebati |
8 | | siminiti |
9 | | zet’enyi |
Speech preprocessing
Attributes | Values |
---|---|
Sampling rate | 16 kHz |
Quantization (bit depth) | 16 bit |
Number of channels | 1 (mono) |
Audio file format | .wav |
Speaker genders | Male and female |
Age groups | Children, young, and middle-aged |
Recording environments | Everyday settings, closed rooms, and noisy conditions |
Duration | Less than 1 s |
Dialects | Addis Ababa, Gojjam, Gondar, Wollo, and North Shewa |
Number of speakers | 120 |
Number of tokens per speaker | 100 |
Number of digits | 10 |
Number of repetitions per digit | 10 |
Total number of utterances | 12,000 |
Feature extraction
Mel-spectrogram
Mel-frequency cepstral coefficients
Visualization
Supervised machine learning
Linear discriminant analysis
K-nearest neighbors
Support vector machine
Random forest
Convolutional neural network
Experimental results and discussions
Experimental setups and configuration of parameters
Parameters | Values |
---|---|
Sampling rate | 16 kHz, 16 bit |
FFT size (points) | 512 |
Hop length (samples) | 256 |
Window function | Hamming |
Number of Mel Filter Banks | 23 |
Cepstral coefficients | 13 |
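The framing parameters above determine how many feature frames each clip produces. The following is a minimal sketch, assuming every clip is padded or trimmed to exactly 1 s and that frames are centered on the signal (the default convention of common STFT implementations). Neither assumption is stated in the source, and the helper name `num_frames` is hypothetical.

```python
def num_frames(n_samples: int, n_fft: int = 512, hop_length: int = 256,
               center: bool = True) -> int:
    """Number of STFT frames for a signal of n_samples samples."""
    if center:
        # The signal is padded by n_fft // 2 on each side, so a new frame
        # starts at every multiple of hop_length up to the signal length.
        return 1 + n_samples // hop_length
    # Without padding, only frames that fit entirely in the signal are kept.
    return 1 + (n_samples - n_fft) // hop_length


sample_rate = 16_000        # Hz, from the table above
clip_samples = sample_rate  # assumption: clips are fixed to 1 s

print("frame length (ms):", 1000 * 512 / sample_rate)  # 32.0
print("hop length (ms):", 1000 * 256 / sample_rate)    # 16.0
print("frames per clip:", num_frames(clip_samples))    # 63
```

Under these assumptions, 13 cepstral coefficients over 63 frames give a 13 × 63 feature matrix per utterance.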
Layer type | Output dimension | Remarks |
---|---|---|
Input | (1, 13, 63) | MFCCs |
Conv2d | (32, 12, 62) | Kernel 2 × 2, stride = 1, ReLU activation |
Maxpool2d | (32, 6, 31) | Max pool 2 × 2 |
BatchNorm2d | (32, 6, 31) | N/A |
Conv2d | (64, 5, 30) | Kernel 2 × 2, stride = 1, ReLU activation |
BatchNorm2d | (64, 5, 30) | N/A |
Conv2d | (128, 4, 29) | Kernel 2 × 2, stride = 1, ReLU activation |
Maxpool2d | (128, 2, 14) | Max pool 2 × 2 |
BatchNorm2d | (128, 2, 14) | N/A |
Dropout | (128, 2, 14) | Dropout rate = 0.4 |
Flatten | 3584 | N/A |
Linear | 256 | ReLU activation |
Dropout | 256 | Dropout rate = 0.4 |
Linear | 128 | ReLU activation |
Dropout | 128 | Dropout rate = 0.4 |
Linear | 10 | Softmax activation |
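The output dimensions in the layer table follow from standard shape arithmetic for unpadded convolutions and non-overlapping pooling. The sketch below traces them in plain Python. It assumes no padding in the convolutions and floor division in the pooling layers; the source does not state these settings, but they are consistent with the listed dimensions. The function names are illustrative only, and dropout layers are omitted because they do not change the shape.

```python
def conv_out(size: int, kernel: int = 2, stride: int = 1) -> int:
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2) -> int:
    """Output length of non-overlapping max pooling (floor division)."""
    return size // kernel


def trace_shapes(n_mfcc: int = 13, n_frames: int = 63):
    """Return (layer, output shape) pairs for the layer stack in the table."""
    c, h, w = 1, n_mfcc, n_frames
    trace = [("Input", (c, h, w))]
    # (output channels, whether max pooling follows the convolution)
    for out_channels, pooled in [(32, True), (64, False), (128, True)]:
        c, h, w = out_channels, conv_out(h), conv_out(w)
        trace.append(("Conv2d", (c, h, w)))
        if pooled:
            h, w = pool_out(h), pool_out(w)
            trace.append(("Maxpool2d", (c, h, w)))
        trace.append(("BatchNorm2d", (c, h, w)))  # shape is unchanged
    trace.append(("Flatten", (c * h * w,)))
    for units in (256, 128, 10):
        trace.append(("Linear", (units,)))
    return trace


for layer, shape in trace_shapes():
    print(f"{layer:<12} {shape}")
```

For the default 13 × 63 input, the flattened size is 128 × 2 × 14 = 3584, which matches the table.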
Performance evaluation metrics
Experimental results
Classes | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
---|---|---|---|---|
0 | 98.00 | 100.00 | 98.98 | 98.00 |
1 | 98.80 | 97.00 | 97.51 | 98.80 |
2 | 99.00 | 99.00 | 99.00 | 99.00 |
3 | 100.00 | 100.00 | 100.00 | 100.00 |
4 | 99.00 | 98.00 | 98.50 | 99.00 |
5 | 99.00 | 100.00 | 99.49 | 99.00 |
6 | 100.00 | 100.00 | 100.00 | 100.00 |
7 | 99.00 | 98.00 | 98.50 | 99.00 |
8 | 99.00 | 98.00 | 98.50 | 99.00 |
9 | 99.00 | 100.00 | 99.49 | 99.00 |
Mean | 99.00 | 99.01 | 99.00 | 99.00 |
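As a reminder of how the per-class metrics relate to each other, the F1-score is the harmonic mean of precision and recall. The sketch below shows the standard definitions; the function names and the counts in the usage lines are illustrative and are not taken from the source.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall (in %) from true/false positive and false negative counts."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 99 % and recall 98 % give an F1-score of about 98.5 %.
print(round(f1_score(99.0, 98.0), 2))  # 98.5

# Hypothetical counts for one class: 98 correct, 1 false alarm, 2 misses.
p, r = precision_recall(tp=98, fp=1, fn=2)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))
```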
Languages | No. of utterances | Training size (%) | Test size (%) | Feature extraction | Accuracy (%) |
---|---|---|---|---|---|
English [32] | 3000 | 80 | 20 | MFCCs | 98.33 |
English [32] | 3000 | 80 | 20 | Mel-spectrogram | 97.50 |
Gujarati [46] | 1940 | 80 | 20 | MFCCs | 96.80 |
Gujarati [46] | 1940 | 80 | 20 | Mel-spectrogram | 88.00 |
Languages | Models | Feature extraction | Accuracy (%) |
---|---|---|---|
English [28] | DFNN | MFCCs | 99.50 |
Arabic [18] | CNN | MFCCs | 99.00 |
Urdu [19] | CNN | Mel-spectrogram | 97.00 |
Bengali [17] | CNN | MFCCs | 98.37 |
Hindi [62] | Pattern network | MFCCs | 96.80 |
Gujarati [63] | CNN | MFCCs | 98.70 |
Portuguese [64] | SVM | Line spectral frequencies (LSF) | 99.33 |
Pashto [65] | SVM | Prosodic | 91.50 |
Amharic (ours) | CNN | MFCCs | 99.00 |
Impact of various factors on model performance
Training type | Training size (%) | Test type | Test size (%) | Feature extraction | Accuracy (%) |
---|---|---|---|---|---|
Females | 42.5 | Males | 57.5 | MFCCs | 81.50 |
Females | 42.5 | Males | 57.5 | Mel-spectrogram | 73.20 |
Males | 57.5 | Females | 42.5 | MFCCs | 92.50 |
Males | 57.5 | Females | 42.5 | Mel-spectrogram | 82.50 |
Both | 42.5 | Both | 57.5 | MFCCs | 97.62 |
Both | 42.5 | Both | 57.5 | Mel-spectrogram | 96.54 |
Both | 57.5 | Both | 42.5 | MFCCs | 98.50 |
Both | 57.5 | Both | 42.5 | Mel-spectrogram | 97.64 |
Training type | Training size (%) | Test type | Test size (%) | Feature extraction | Accuracy (%) |
---|---|---|---|---|---|
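Addis Ababa + Gondar + Gojjam + North Shewa | 84.99 | Wollo | 15.01 | MFCCs | 91.00 |
Addis Ababa + Gondar + Gojjam + North Shewa | 84.99 | Wollo | 15.01 | Mel-spectrogram | 87.28 |
Addis Ababa + Gondar + Gojjam + Wollo | 81.67 | North Shewa | 18.33 | MFCCs | 94.50 |
Addis Ababa + Gondar + Gojjam + Wollo | 81.67 | North Shewa | 18.33 | Mel-spectrogram | 92.40 |
Addis Ababa + Gondar + North Shewa + Wollo | 80.00 | Gojjam | 20.00 | MFCCs | 93.00 |
Addis Ababa + Gondar + North Shewa + Wollo | 80.00 | Gojjam | 20.00 | Mel-spectrogram | 87.80 |
Addis Ababa + Gojjam + North Shewa + Wollo | 74.18 | Gondar | 25.82 | MFCCs | 91.50 |
Addis Ababa + Gojjam + North Shewa + Wollo | 74.18 | Gondar | 25.82 | Mel-spectrogram | 85.00 |
Gondar + Gojjam + North Shewa + Wollo | 75.83 | Addis Ababa | 24.17 | MFCCs | 92.50 |
Gondar + Gojjam + North Shewa + Wollo | 75.83 | Addis Ababa | 24.17 | Mel-spectrogram | 89.72 |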
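The dialect experiments above hold out one group for testing and train on the rest; the gender experiments follow the same pattern. A minimal sketch of such a split is given below. The record layout (dictionaries with `speaker`, `dialect`, and `digit` keys), the helper name `hold_out_split`, and the toy data are assumptions made for illustration and do not come from the source.

```python
def hold_out_split(utterances, key, held_out):
    """Split records so that every record whose `key` equals `held_out`
    goes to the test set and all remaining records go to the training set."""
    train = [u for u in utterances if u[key] != held_out]
    test = [u for u in utterances if u[key] == held_out]
    return train, test


# Toy metadata: 2 hypothetical speakers per dialect, one recording per digit.
dialects = ["Addis Ababa", "Gojjam", "Gondar", "Wollo", "North Shewa"]
utterances = [
    {"speaker": f"{dialect}-{s}", "dialect": dialect, "digit": digit}
    for dialect in dialects
    for s in range(2)
    for digit in range(10)
]

train, test = hold_out_split(utterances, key="dialect", held_out="Wollo")
print(len(train), len(test))
print(f"test share: {100 * len(test) / len(utterances):.2f} %")
```

Splitting on a speaker-level attribute in this way keeps all recordings of a held-out speaker out of the training set, so the test accuracy reflects unseen speakers as well as an unseen group.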
Learning rate | Batch size | Execution time (s) | Loss | Accuracy (%) |
---|---|---|---|---|
1 | 4 | 499.59 | 2.363 | 9.79 |
0.1 | 8 | 259.34 | 2.361 | 10.59 |
0.01 | 16 | 494.06 | 2.360 | 10.083 |
0.001 | 32 | 125.48 | 1.496 | 96.45 |
0.0001 | 64 | 96.68 | 1.471 | 99.083 |
0.00001 | 128 | 103.5 | 1.472 | 98.83 |
Sample rate | Trainable params | Execution time (s) | Loss | Accuracy (%) |
---|---|---|---|---|
8 kHz | 469,418 | 49.50 | 1.473 | 98.958 |
16 kHz | 993,706 | 96.68 | 1.471 | 99.083 |
22.05 kHz | 1,386,922 | 106.18 | 1.472 | 98.875 |
24 kHz | 1,517,994 | 113.15 | 1.471 | 99.083 |
No. of MFCCs | Trainable params | Execution time (s) | Loss | Accuracy (%) |
---|---|---|---|---|
13 | 993,706 | 96.68 | 1.471 | 99.083 |
15 | 993,706 | 89.99 | 1.472 | 98.95 |
20 | 1,452,458 | 116.71 | 1.473 | 98.75 |
25 | 2,369,962 | 162.57 | 1.473 | 98.83 |
30 | 2,828,714 | 187.64 | 1.473 | 98.85 |
35 | 3,287,466 | 229.63 | 1.472 | 98.92 |
40 | 3,746,218 | 251.54 | 1.473 | 98.75 |
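The trainable-parameter counts in the two tables above can be recomputed from the layer stack, because both the sample rate and the number of MFCCs change only the input size and therefore the width of the first fully connected layer. The sketch below does this in plain Python under several assumptions that the source does not state: clips are fixed to 1 s, frames are centered so that the frame count is `1 + samples // hop`, convolutions are unpadded with bias terms, and each batch normalization layer learns one scale and one shift per channel. The function name `trainable_params` is hypothetical.

```python
def trainable_params(n_mfcc: int = 13, sample_rate: int = 16_000,
                     hop_length: int = 256, clip_seconds: float = 1.0) -> int:
    """Count trainable parameters of the CNN for a given input size."""
    # Assumed framing: centered frames over a fixed-length clip.
    n_frames = 1 + int(sample_rate * clip_seconds) // hop_length
    h, w = n_mfcc, n_frames
    in_ch, total = 1, 0
    # (output channels, whether 2 x 2 max pooling follows the convolution)
    for out_ch, pooled in [(32, True), (64, False), (128, True)]:
        total += out_ch * in_ch * 2 * 2 + out_ch  # 2 x 2 kernels plus biases
        total += 2 * out_ch                       # batch norm scale and shift
        h, w = h - 1, w - 1                       # unpadded 2 x 2 conv, stride 1
        if pooled:
            h, w = h // 2, w // 2
        in_ch = out_ch
    in_features = in_ch * h * w                   # flattened feature size
    for units in (256, 128, 10):
        total += in_features * units + units      # weights plus biases
        in_features = units
    return total


for rate in (8_000, 16_000, 22_050, 24_000):
    print(f"{rate} Hz: {trainable_params(sample_rate=rate):,}")

for n in (13, 15, 20, 25, 30, 35, 40):
    print(f"{n} MFCCs: {trainable_params(n_mfcc=n):,}")
```

Under these assumptions the function returns 993,706 for the default 13 MFCCs at 16 kHz. Using 15 MFCCs gives the same count as 13 because both inputs reduce to a height of 2 after the second pooling layer.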