
24.06.2023

Noise robust automatic speech recognition: review and analysis

Authors: Mohit Dua, Akanksha, Shelza Dua

Published in: International Journal of Speech Technology | Issue 2/2023

Abstract

Automatic Speech Recognition (ASR) is an emerging technology used in various fields such as robotics, traffic control, and healthcare. The leading cause of ASR performance degradation is the mismatch between the training and testing environments, and the main reason for this mismatch is the presence of noise during the testing phase of an ASR system. Researchers have applied various techniques in the front-end and back-end phases of ASR to detect and handle this noise. However, very few review papers have considered noise as a criterion when comparing existing research works. Hence, the objective of this survey is to analyze and review the effective methods proposed by different scientists and researchers to improve the noise robustness of ASR systems. Initially, the paper discusses the basic architecture of an ASR system, the factors affecting its performance, and the formulation of the noise problem. Secondly, the work analyzes existing state-of-the-art noise-robust ASR methods in terms of front-end feature extraction techniques and back-end classification models. Then, a detailed review of the various speech databases used by these methods is given. Finally, an analysis of all these noise-resistant ASR techniques in terms of performance metrics is presented. While presenting this analysis, the paper discusses the feature extraction techniques, back-end classification methods, speech databases, and performance metrics in detail. The paper also discusses existing challenges and describes future research directions in building noise-resistant ASR systems.
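To make the noise problem formulation referred to above concrete, the speech observed at test time is commonly modeled as clean speech passed through a channel and corrupted by additive background noise (a standard textbook formulation, not necessarily the exact notation used in this paper):

\[
y[n] = x[n] \ast h[n] + d[n], \qquad |Y(\omega)|^{2} \approx |X(\omega)|^{2}\,|H(\omega)|^{2} + |D(\omega)|^{2}
\]

where x[n] is the clean speech, h[n] the channel impulse response, and d[n] the additive noise. Because most front ends take a logarithm of the (Mel-filtered) power spectrum, the additive term becomes a nonlinear, signal-dependent distortion in feature space, which is exactly the training/testing mismatch described above.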
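Noisy test conditions are typically simulated by mixing a noise recording into clean speech at a chosen signal-to-noise ratio (SNR). A minimal sketch of this common setup, assuming NumPy arrays of raw audio samples (illustrative only, not code from the surveyed works):

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add background noise to clean speech at a target SNR in dB."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12             # guard against silent noise
    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

Lowering snr_db makes the mixture noisier; evaluating the same model over a range of SNRs is the usual way the robustness of front-end and back-end methods is compared.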
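The performance metric most commonly reported for ASR is the word error rate, WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, and N is the number of reference words. A minimal sketch using the standard Levenshtein edit distance over words (illustrative, not taken from the paper):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution in a four-word reference gives WER = 0.25
print(word_error_rate("turn the lights on", "turn the light on"))

Note that WER can exceed 1.0 when the hypothesis contains many insertions, so it is an error rate rather than a bounded percentage.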

Zurück zum Zitat Tang, Z., Chen, L., Wu, B., Yu, D., & Manocha, D. (2020, May). Improving reverberant speech training using diffuse acoustic simulation. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6969–6973). IEEE. Tang, Z., Chen, L., Wu, B., Yu, D., & Manocha, D. (2020, May). Improving reverberant speech training using diffuse acoustic simulation. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6969–6973). IEEE.
Zurück zum Zitat Thimmaraja, Y. G., Nagaraja, B. G., & Jayanna, H. S. (2021). Speech enhancement and encoding by combining SS-VAD and LPC. International Journal of Speech Technology, 24(1), 165–172. Thimmaraja, Y. G., Nagaraja, B. G., & Jayanna, H. S. (2021). Speech enhancement and encoding by combining SS-VAD and LPC. International Journal of Speech Technology, 24(1), 165–172.
Zurück zum Zitat Thomas, T., Spoorthy, V., Sobhana, N. V., & Koolagudi, S. G. (2020, December). Speaker recognition in emotional environment using excitation features. In 2020 third international conference on advances in electronics, computers and communications (ICAECC) (pp. 1–6). IEEE. Thomas, T., Spoorthy, V., Sobhana, N. V., & Koolagudi, S. G. (2020, December). Speaker recognition in emotional environment using excitation features. In 2020 third international conference on advances in electronics, computers and communications (ICAECC) (pp. 1–6). IEEE.
Zurück zum Zitat Vanderreydt, G., & Demuynck, K. (n.d.). A Novel Channel estimate for noise robust speech recognition. Available at SSRN 4330824. Vanderreydt, G., & Demuynck, K. (n.d.). A Novel Channel estimate for noise robust speech recognition. Available at SSRN 4330824.
Zurück zum Zitat Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II— NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251. Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II— NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251.
Zurück zum Zitat Wang, Z. Q., & Wang, D. (2020, May). Multi-microphone complex spectral mapping for speech de-reverberation. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 486–490). IEEE. Wang, Z. Q., & Wang, D. (2020, May). Multi-microphone complex spectral mapping for speech de-reverberation. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 486–490). IEEE.
Zurück zum Zitat Wang, Z. Q., Wang, P., & Wang, D. (2020). Complex spectral mapping for single-and multi-channel speech enhancement and robust ASR. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1778–1787. Wang, Z. Q., Wang, P., & Wang, D. (2020). Complex spectral mapping for single-and multi-channel speech enhancement and robust ASR. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1778–1787.
Zurück zum Zitat Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D., Snyder, D., Subramanian, A.S., Trmal, J., Yair, B.B., Boeddeker, C., Ni, Z., Fujita, Y., Horiguchi, S., Kanda, N., et al. (2020). CHiME-6 challenge: Tackling multi-speaker speech recognition for unsegmented recordings. arXiv preprint arXiv:2004.09249. Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D., Snyder, D., Subramanian, A.S., Trmal, J., Yair, B.B., Boeddeker, C., Ni, Z., Fujita, Y., Horiguchi, S., Kanda, N., et al. (2020). CHiME-6 challenge: Tackling multi-speaker speech recognition for unsegmented recordings. arXiv preprint arXiv:​2004.​09249.
Zurück zum Zitat Wessel, F., Schluter, R., Macherey, K., & Ney, H. (2001). Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3), 288–298. Wessel, F., Schluter, R., Macherey, K., & Ney, H. (2001). Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3), 288–298.
Zurück zum Zitat Wu, B., Li, K., Ge, F., Huang, Z., Yang, M., Siniscalchi, S. M., & Lee, C. H. (2017). An end-to-end deep learning approach to simultaneous speech de-reverberation and acoustic modeling for robust speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1289–1300. Wu, B., Li, K., Ge, F., Huang, Z., Yang, M., Siniscalchi, S. M., & Lee, C. H. (2017). An end-to-end deep learning approach to simultaneous speech de-reverberation and acoustic modeling for robust speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1289–1300.
Zurück zum Zitat Xu, Y., Weng, C., Hui, L., Liu, J., Yu, M., Su, D., & Yu, D. (2019, May). Joint training of complex ratio mask based beam former and acoustic model for noise robust ASR. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6745–6749). IEEE. Xu, Y., Weng, C., Hui, L., Liu, J., Yu, M., Su, D., & Yu, D. (2019, May). Joint training of complex ratio mask based beam former and acoustic model for noise robust ASR. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6745–6749). IEEE.
Zurück zum Zitat Yadav, I. C., & Pradhan, G. (2021). Pitch and noise normalized acoustic feature for children’s ASR. Digital Signal Processing, 109, 102922. Yadav, I. C., & Pradhan, G. (2021). Pitch and noise normalized acoustic feature for children’s ASR. Digital Signal Processing, 109, 102922.
Zurück zum Zitat Yalamanchili, B., Dungala, K., Mandapati, K., Pillodi, M., & Vanga, S. R. (2021). Survey on multimodal emotion recognition (MER) Systems. In Machine learning technologies and applications: Proceedings of ICACECS 2020 (pp. 319–326). Springer. Yalamanchili, B., Dungala, K., Mandapati, K., Pillodi, M., & Vanga, S. R. (2021). Survey on multimodal emotion recognition (MER) Systems. In Machine learning technologies and applications: Proceedings of ICACECS 2020 (pp. 319–326). Springer.
Zurück zum Zitat Yang, S., Lee, M., & Kim, H. (2021, January). Deep learning-based syllable recognition framework for Korean children. In 2021 international conference on information networking (ICOIN) (pp. 723–726). IEEE. Yang, S., Lee, M., & Kim, H. (2021, January). Deep learning-based syllable recognition framework for Korean children. In 2021 international conference on information networking (ICOIN) (pp. 723–726). IEEE.
Zurück zum Zitat Yoshioka, T., & Gales, M. J. (2015). Environmentally robust ASR front-end for deep neural network acoustic models. Computer Speech & Language, 31(1), 65–86. Yoshioka, T., & Gales, M. J. (2015). Environmentally robust ASR front-end for deep neural network acoustic models. Computer Speech & Language, 31(1), 65–86.
Zurück zum Zitat Zealouk, O., Satori, H., Laaidi, N., Hamidi, M., & Satori, K. (2020). Noise effect on Amazigh digits in speech recognition system. International Journal of Speech Technology, 23(4), 885–892. Zealouk, O., Satori, H., Laaidi, N., Hamidi, M., & Satori, K. (2020). Noise effect on Amazigh digits in speech recognition system. International Journal of Speech Technology, 23(4), 885–892.
Zurück zum Zitat Zhang, S., Do, C. T., Doddipatla, R., Loweimi, E., Bell, P., & Renals, S. (2021, June). Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers. In ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2750–2754). IEEE. Zhang, S., Do, C. T., Doddipatla, R., Loweimi, E., Bell, P., & Renals, S. (2021, June). Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers. In ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2750–2754). IEEE.
Zurück zum Zitat Zhang, X., Zou, X., Sun, M., Zheng, T. F., Jia, C., & Wang, Y. (2019). Noise robust speaker recognition based on adaptive frame weighting in GMM for i-vector extraction. IEEE Access, 7, 27874–27882. Zhang, X., Zou, X., Sun, M., Zheng, T. F., Jia, C., & Wang, Y. (2019). Noise robust speaker recognition based on adaptive frame weighting in GMM for i-vector extraction. IEEE Access, 7, 27874–27882.
Zurück zum Zitat Zhang, Z., Geiger, J., Pohjalainen, J., Mousa, A. E. D., Jin, W., & Schuller, B. (2018). Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Transactions on Intelligent Systems and Technology (TIST), 9(5), 1–28. Zhang, Z., Geiger, J., Pohjalainen, J., Mousa, A. E. D., Jin, W., & Schuller, B. (2018). Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Transactions on Intelligent Systems and Technology (TIST), 9(5), 1–28.
Zurück zum Zitat Zheng, N., Shi, Y., Kang, Y., & Meng, Q. (2021, June). A noise-robust signal processing strategy for cochlear implants using neural networks. In ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 8343–8347). IEEE. Zheng, N., Shi, Y., Kang, Y., & Meng, Q. (2021, June). A noise-robust signal processing strategy for cochlear implants using neural networks. In ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 8343–8347). IEEE.
Zurück zum Zitat Zhou, P., Yang, W., Chen, W., Wang, Y., & Jia, J. (2019, May). Modality attention for end-to-end audio-visual speech recognition. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6565–6569). IEEE Zhou, P., Yang, W., Chen, W., Wang, Y., & Jia, J. (2019, May). Modality attention for end-to-end audio-visual speech recognition. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6565–6569). IEEE
Zurück zum Zitat Zhu, Q. S., Zhou, L., Zhang, J., Liu, S. J., Hu, Y. C., & Dai, L. R. (2022). Robust Data2vec: Noise-robust speech representation learning for ASR by combining regression and improved contrastive learning. arXiv preprint arXiv:2210.15324. Zhu, Q. S., Zhou, L., Zhang, J., Liu, S. J., Hu, Y. C., & Dai, L. R. (2022). Robust Data2vec: Noise-robust speech representation learning for ASR by combining regression and improved contrastive learning. arXiv preprint arXiv:​2210.​15324.
Zurück zum Zitat Zylich, B., & Whitehill, J. (2020, May). Noise-robust key-phrase detectors for automated classroom feedback. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 9215–9219). IEEE. Zylich, B., & Whitehill, J. (2020, May). Noise-robust key-phrase detectors for automated classroom feedback. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 9215–9219). IEEE.
Metadata
Title: Noise robust automatic speech recognition: review and analysis
Authors: Mohit Dua, Akanksha, Shelza Dua
Publication date: 24.06.2023
Publisher: Springer US
Published in: International Journal of Speech Technology, Issue 2/2023
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-023-10033-0
