Published in: International Journal of Speech Technology 4/2018

28-08-2018

Three-stage speaker verification architecture in emotional talking environments

Authors: Ismail Shahin, Ali Bou Nassif

Abstract

Speaker verification performance is usually high in a neutral talking environment but degrades sharply in emotional talking environments. This degradation stems from the mismatch between training in a neutral environment and testing in emotional environments. In this work, a three-stage speaker verification architecture is proposed to enhance speaker verification performance in emotional environments. The architecture comprises three cascaded stages: a gender identification stage, followed by an emotion identification stage, followed by a speaker verification stage. The proposed framework has been evaluated on two distinct and independent emotional speech datasets: an in-house dataset and the "Emotional Prosody Speech and Transcripts" dataset. The results show that speaker verification based on both gender and emotion information is superior to verification based on gender information only, on emotion information only, or on neither. The average speaker verification performance attained with the proposed framework is very similar to that attained in a subjective assessment by human listeners.
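The cascade described in the abstract can be sketched as follows. This is a minimal structural illustration, not the authors' implementation: the stage models here are hypothetical stand-ins (simple callables keyed by gender and emotion), whereas the paper's system uses trained acoustic models for each stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Features = List[float]  # placeholder for an utterance's acoustic features

@dataclass
class ThreeStageVerifier:
    """Sketch of the cascade: gender ID -> emotion ID -> speaker verification."""
    identify_gender: Callable[[Features], str]            # stage 1
    emotion_models: Dict[str, Callable[[Features], str]]  # stage 2, one per gender
    # stage 3: verification scorers keyed by (gender, emotion, claimed speaker)
    speaker_models: Dict[Tuple[str, str, str], Callable[[Features], float]] = field(default_factory=dict)
    threshold: float = 0.5

    def verify(self, features: Features, claimed_id: str) -> bool:
        gender = self.identify_gender(features)          # stage 1 narrows the search
        emotion = self.emotion_models[gender](features)  # stage 2 narrows it further
        # stage 3 scores the claim against the matching gender/emotion model
        score = self.speaker_models[(gender, emotion, claimed_id)](features)
        return score >= self.threshold
```

Restricting each later stage to models matching the earlier stages' decisions is what lets the cascade exploit gender and emotion information jointly, which the abstract reports as superior to using either cue alone.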

Metadata
Title
Three-stage speaker verification architecture in emotional talking environments
Authors
Ismail Shahin
Ali Bou Nassif
Publication date
28-08-2018
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-018-9543-4
