Introduction
Related work
Theoretical background
Feature extraction
Acoustic modeling and parameter optimization approaches
Maximum likelihood estimation (MLE)
Maximum mutual information estimation (MMIE)
Minimum phone error/minimum word error
State-level minimum Bayes risk (sMBR)
Boosted maximum mutual information estimation (bMMIE)
Experimental overview
Dataset details
Characteristics | Adult dataset | Child dataset |
---|---|---|
No. of speakers | 21 | 39 |
Speech data type | Isolated words and phonetically rich sentences | Continuous speech sentences |
Recording environment | Closed room using dictaphone and microphone | Open and closed environment using microphone |
No. of utterances | 3953 | 2159 |
Age | 17–26 years | 7–12 years |
Duration | 10 h 12 min | 4 h 10 min |
No. of unique words | 6567 | 4863 |
Gender | 9 male/12 female | 20 male/19 female |

Type of ASR | Training | Testing |
---|---|---|
Adult ASR-S1 system | Adult dataset | Adult dataset |
Children ASR-S2 system | Children dataset | Children dataset |
Mismatched ASR-S3 system | Adult dataset | Children dataset |
Semi-mismatched ASR-S4 system | Adult and children mixed dataset | Children dataset |
Noise augmentation
Spectral augmentation
System overview
Experimental results
Performance analysis of adult, children and mismatched ASR systems under clean environmental conditions
Training set | Testing set | System type | DNN WER (%) |
---|---|---|---|
Adult | Adult | S1 | 6.52 |
Child | Child | S2 | 15.43 |
Adult | Child | S3 | 41.28 |
Adult–child | Child | S4 | 14.27 |
Performance analysis for matched and mismatched ASR systems under varying noisy test conditions
Performance evaluations on random noise-based training data augmentation
Training set | MFCC, clean test set | MFCC, noisy test set | RASTA-PLP, clean test set | RASTA-PLP, noisy test set | GFCC, clean test set | GFCC, noisy test set | PNCC, clean test set | PNCC, noisy test set |
---|---|---|---|---|---|---|---|---|
S1 + random noise | 7.32 | 9.42 | 7.01 | 8.25 | 6.50 | 7.12 | 5.99 | 6.04 |
S2 + random noise | 15.61 | 18.55 | 15.07 | 17.42 | 14.61 | 15.66 | 13.24 | 13.31 |
S3 + random noise | 42.21 | 49.62 | 41.93 | 47.26 | 40.15 | 44.13 | 37.21 | 39.23 |
S4 + random noise | 14.18 | 17.96 | 13.86 | 16.51 | 13.16 | 14.53 | 12.67 | 12.69 |
Performance analysis of discriminative training under noisy and clean conditions with adult and adult–child training sets
No. of iterations (MMI) | WER (%): S1, clean test set | WER (%): S4, clean test set | WER (%): S1, noisy test set | WER (%): S4, noisy test set |
---|---|---|---|---|
1 | 6.97 | 14.25 | 6.89 | 13.89 |
2 | 6.25 | 13.12 | 5.97 | 12.27 |
3 | 5.63 | 12.65 | 5.5 | 12.14 |
4 | 5.68 | 12.13 | 5.61 | 12.07 |
5 | 5.59 | 12.19 | 5.48 | 12.12 |
6 | 5.58 | 12.17 | 5.51 | 12.09 |
7 | 5.59 | 12.18 | 5.51 | 12.11 |
8 | 5.59 | 12.17 | 5.5 | 12.11 |

LM | WER (%): S1, clean test set | WER (%): S4, clean test set | WER (%): S1, noisy test set | WER (%): S4, noisy test set |
---|---|---|---|---|
1-g | 7.56 | 14.21 | 7.52 | 14.04 |
2-g | 6.61 | 12.27 | 6.47 | 12.02 |
3-g | 5.57 | 11.74 | 5.39 | 11.64 |
4-g | 5.59 | 11.81 | 5.4 | 11.66 |

Boost factor | WER (%): S1, clean test set | WER (%): S4, clean test set | WER (%): S1, noisy test set | WER (%): S4, noisy test set |
---|---|---|---|---|
0 (mmi) | 5.63 | 12.13 | 5.5 | 12.07 |
0.05 | 5.6 | 12.04 | 5.47 | 12.01 |
0.1 | 5.52 | 11.93 | 5.43 | 11.87 |
0.15 | 5.49 | 11.89 | 5.39 | 11.73 |
0.2 | 5.51 | 11.74 | 5.41 | 11.64 |
0.25 | 5.53 | 11.76 | 5.44 | 11.66 |

System type | WER (%): S1, clean test set | WER (%): S4, clean test set | WER (%): S1, noisy test set | WER (%): S4, noisy test set |
---|---|---|---|---|
DNN-MMI | 5.63 | 12.13 | 5.5 | 12.07 |
DNN-MPE | 5.57 | 11.92 | 5.46 | 11.76 |
DNN-bMMI | 5.49 | 11.74 | 5.39 | 11.64 |
DNN-sMBR | 4.97 | 10.17 | 4.82 | 9.97 |
Performance analysis of gender-based selection under mismatched system on clean and noisy test dataset
System type | WER (%): female adult + child, clean test set | WER (%): male adult + child, clean test set | WER (%): female adult + child, noisy test set | WER (%): male adult + child, noisy test set |
---|---|---|---|---|
DNN-MMI | 11.81 | 12.34 | 11.69 | 12.26 |
DNN-MPE | 11.85 | 12.32 | 11.65 | 11.82 |
DNN-bMMI | 11.57 | 11.80 | 11.44 | 11.85 |
DNN-sMBR | 10.05 | 11.01 | 9.85 | 10.34 |
Performance analysis under augmentation with adult and adult–child training sets
Training set | Classifier type | PNCC, clean test set | PNCC, noisy test set | PNCC + VTLN, clean test set | PNCC + VTLN, noisy test set |
---|---|---|---|---|---|
S1 + noise + 3-way | DNN | 4.64 | 4.68 | 4.37 | 4.48 |
S4 + noise + 3-way | DNN | 9.38 | 9.24 | 8.82 | 8.64 |
Female adult + noise + 3-way | DNN | 9.31 | 9.18 | 8.71 | 8.62 |
S1 + noise + 3-way | TDNN | 4.18 | 4.27 | 3.90 | 4.02 |
S4 + noise + 3-way | TDNN | 8.89 | 8.65 | 8.26 | 8.10 |
Female adult + noise + 3-way | TDNN | 8.85 | 8.59 | 8.20 | 8.08 |
Performance analysis based on gender-based spectral augmentation under mismatched conditions
Training set: noise-augmented dataset | Training set: warping factor | Perturbation type | Classifier type | PNCC + VTLN, clean test set | PNCC + VTLN, noisy test set |
---|---|---|---|---|---|
Female adult + noise | – | Three-way | TDNN | 8.20 | 8.08 |
Female adult + noise | − 0.1 + 0.05 | Three-way | TDNN | 7.75 | 7.06 |
Female adult + noise | − 0.1 ± 0.05 | Three-way | TDNN | 7.78 | 7.14 |
Female adult + noise | 0.05 ± 0.05 | Three-way | TDNN | 7.86 | 7.34 |
Comparative performance analysis of proposed system architecture with earlier implemented approaches
Author details | Dataset details | Methodologies | Summary |
---|---|---|---|
Kadyan et al. [13] | Punjabi adult corpora comprising continuous and phonetically rich sentences | MFCC; GFCC-based hybrid DNN–HMM and GMM–HMM modeling | The authors address dimensionality reduction, feature-vector decorrelation and speaker variability using LDA, transition probability, speaker-adaptive triphones and maximum likelihood linear regression adaptation models. The accuracy on a connected and continuous Punjabi speech corpus is studied with two hybrid classifiers, GMM–HMM and DNN–HMM, for which the experimental configuration yields significant RIs of 4–5% and 1–3%, respectively
Shivakumar et al. [5] | English-language children dataset, used with transfer learning | MFCC-based GMM–HMM and DNN–HMM modeling | The paper presents a systematic and extensive analysis of the proposed transfer learning technique, considering the key factors affecting children’s speech recognition identified in prior literature. The evaluation compares earlier GMM–HMM models with newer DNN models and examines in detail the effectiveness of standard adaptation techniques versus transfer learning
Kumar et al. [42] | Adult data comprising 13,218 Punjabi words with over 200 min of recorded speech | MFCC feature extraction technique | The authors experimented with an auto-denoising method that applies the novel Corpus Optimization Algorithm to a Punjabi language corpus. For 13,218 Punjabi words, the WER was lowered to 5.8%. Other factors, such as the total probability per frame and the convergence ratio across iterations for the available Gaussian mixtures, were also evaluated, indicating improved system performance
Gretter et al. [43] | TLT-school corpora containing English speech recorded from Italian children | Metrics for collecting adequate children data based on good versus bad pronunciation | The authors collected corpora from students between 9 and 16 years of age, attending elementary and secondary schools, recorded in 2017 and 2018. The utterances were assessed by human experts with regard to predefined ability measures
Kadyan et al. [44] | Punjabi children speech corpora | MFCC; MFCC + Pitch; MFCC + Pitch + VTLN-based DNN–HMM modeling | The authors report substantially lower error rates from adding out-of-domain data generated through prosody modifications. They also analyzed the impact of changing the number of senones, the number of hidden nodes and layers, and the early stagnation, which resulted in a relative improvement (RI) of 32.1% over the baseline system
Dua et al. [45] | Hindi speech corpora | MPE-based discriminative training with a varying number of Gaussian mixtures | The authors trained a speech recognizer using language model interpolation and discriminative approaches. They report a relative improvement of 85.45 under clean and 82.95 under noisy conditions
Kadyan et al. [46] | Punjabi adult corpora comprising isolated and phonetically rich sentences | MFCC coupled with bottleneck features, based on Tandem-NN acoustic modeling | The authors processed context-independent input speech information using bottleneck features. Noisy data were also handled, and the experiments showed that under clean and noisy settings the Tandem-NN system achieved an RI of 13.53% over the baseline system
Dua et al. [47] | Hindi continuous-sentence speech corpora and noise-augmented dataset | Noise-resistant integrated features and an improved HMM model for developing a discriminatively trained speech recognition system | The study shows that MF-PLP and MF-GFCC feature vectors, used alone or in integrated form, yield a large performance improvement
Kumar and Aggarwal [48] | Two low-resource Indo-Aryan languages, Hindi and Marathi | Integrated feature vectors with an RNN applied to a Hindi ASR system using MLLR and constrained MLLR | The authors experimented with 256 Gaussian mixtures per HMM state using the discriminative training methods MMI and MPE. The experiments show that discriminative training improves on the baseline system by 3%
Bawa et al. [1] | Gender-based selection under mismatched conditions | MFCC; GFCC-based DNN–HMM modeling | The study builds a Punjabi children ASR system under mismatched conditions using noise-robust features such as MFCC and GFCC. Acoustic and phonetic differences between adults and children are handled by gender-based selection of adult data, and the remaining acoustic variability across speakers in training and test conditions is normalized with VTLN, giving an RI of 30.94% over the baseline system
Proposed approach | Punjabi adult and children under mismatched conditions | PNCC; PNCC + VTLN-based DNN-sMBR and TDNN-sMBR modeling; gender-based selection; spectral augmentation | (i) The results show that ASR systems built on PNCC + VTLN features are successful only when tested on sMBR-optimized acoustic models. Overall RIs of 40.18%, 47.51%, and 47.64% are achieved for the S1, S4, and female adult-selected ASR systems, respectively. (ii) Gender-based spectral augmentation further improves performance, with an RI of 49.87% over the baseline system
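
The relative improvement (RI) figures quoted for the proposed approach can be checked against the WER tables with a short calculation. The sketch below is illustrative only: the formula is the usual relative WER reduction, and the pairing of baseline and final WERs is an assumption made here, not something the tables state. It assumes the baselines are the clean-condition DNN results (6.52% for S1, 15.43% for the child-trained S2 system) and the final values are the lowest TDNN PNCC + VTLN WERs in each row (3.90%, 8.10%, 8.08%). Under that assumption the computed values agree with the reported 40.18%, 47.51%, and 47.64% to within about 0.01 percentage points.

```python
def relative_improvement(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - improved) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# ASSUMPTION: these (baseline WER, final WER) pairs were picked from the tables
# above for illustration; the source does not say which entries were paired.
PAIRS = {
    "S1": (6.52, 3.90),
    "S4": (15.43, 8.10),
    "Female adult-selected": (15.43, 8.08),
}

if __name__ == "__main__":
    for name, (baseline, final) in PAIRS.items():
        ri = relative_improvement(baseline, final)
        print(f"{name}: RI = {ri:.2f}%")
```

The same function applies to any pair of rows in the tables, for example to compare DNN-sMBR with DNN-MMI on a given test set.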