Survey of Deep Learning Paradigms for Speech Processing

Authors: Kishor Barasu Bhangale, Mohanaprasad Kothandaraman

Published in: Wireless Personal Communications | Issue 2/2022


Abstract

Over the past decades, research on speech processing applications has focused largely on machine learning techniques. In recent years, however, attention has shifted to deep learning, which has become a highly attractive area of study and delivers markedly better performance than earlier methods across a wide range of speech processing tasks. This paper presents a brief survey of the application of deep learning to speech processing tasks such as speech separation, speech enhancement, speech recognition, speaker recognition, emotion recognition, language recognition, music recognition, and speech data retrieval. The survey covers the use of the Auto-Encoder, Generative Adversarial Network, Restricted Boltzmann Machine, Deep Belief Network, Deep Neural Network, Convolutional Neural Network, Recurrent Neural Network, and Deep Reinforcement Learning for speech processing. It also reviews the speech databases and evaluation metrics commonly used to assess the performance of deep learning algorithms.
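
To make the Auto-Encoder paradigm mentioned in the abstract concrete, the sketch below shows a minimal denoising autoencoder for speech enhancement: a network trained to map noisy spectrum frames back to their clean counterparts. This is a minimal sketch assuming PyTorch; the layer widths, the additive Gaussian corruption standing in for real parallel noisy/clean recordings, and all hyperparameters are illustrative assumptions, not settings taken from any work covered by the survey.

```python
# Minimal denoising-autoencoder sketch (illustrative assumptions only).
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Maps a noisy spectrum frame to an estimate of the clean frame."""
    def __init__(self, n_bins: int = 257, hidden: int = 512):
        super().__init__()
        # 257 bins corresponds to the magnitude spectrum of a 512-point FFT.
        self.encoder = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, 128), nn.ReLU(),   # bottleneck representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(128, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),           # reconstructed clean frame
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(noisy))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One toy training step: corrupt synthetic "clean" frames with Gaussian
# noise and train the network to reconstruct the clean input.
clean = torch.rand(32, 257)                   # batch of clean spectrum frames
noisy = clean + 0.1 * torch.randn_like(clean)
optimizer.zero_grad()
loss = loss_fn(model(noisy), clean)           # reconstruct clean from noisy
loss.backward()
optimizer.step()
```

In practice, the enhancement systems discussed in the survey operate on features such as log-magnitude STFT frames drawn from parallel noisy/clean speech corpora, often with layer-wise pretraining; the synthetic corruption above merely stands in for such data.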


Metadata

Title: Survey of Deep Learning Paradigms for Speech Processing
Authors: Kishor Barasu Bhangale; Mohanaprasad Kothandaraman
Publication date: 04-03-2022
Publisher: Springer US
Published in: Wireless Personal Communications, Issue 2/2022
Print ISSN: 0929-6212
Electronic ISSN: 1572-834X
DOI: https://doi.org/10.1007/s11277-022-09640-y
