Published in: Neural Computing and Applications 7/2020

04.10.2018 | Original Article

Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments

Authors: Ismail Shahin, Ali Bou Nassif, Shibani Hamsa



Abstract

This paper presents an approach to enhancing text-independent speaker identification performance in emotional talking environments, based on a novel classifier: a cascaded Gaussian mixture model-deep neural network (GMM-DNN). The work proposes, implements, and evaluates this cascaded architecture for speaker identification in emotional talking conditions. The results show that the cascaded GMM-DNN classifier improves speaker identification performance across emotions on two distinct speech databases: the Emirati speech database (an Arabic United Arab Emirates dataset) and the English "Speech Under Simulated and Actual Stress" (SUSAS) dataset. On each dataset, the proposed classifier outperforms classical classifiers such as the multilayer perceptron (MLP) and the support vector machine (SVM). The speaker identification performance attained with the cascaded GMM-DNN is comparable to that obtained from subjective assessment by human listeners.
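The abstract does not spell out how the two stages are wired together, but a common way to cascade a GMM front-end with a DNN back-end is to fit one GMM per speaker and feed the vector of per-speaker log-likelihood scores into a small neural network that makes the final identification decision. The sketch below illustrates that generic idea on synthetic feature frames; it is an assumption-laden toy, not the paper's implementation, and the synthetic data stands in for real acoustic features such as MFCCs.

```python
# Hypothetical sketch of a cascaded GMM-DNN speaker identifier:
# stage 1 fits one GMM per speaker; stage 2 trains a small neural
# network on the per-speaker log-likelihood scores. This is a toy
# illustration, not the architecture from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic "feature frames" for 3 speakers (real systems would use
# MFCCs or similar acoustic features extracted from speech).
n_speakers, n_frames, n_dims = 3, 200, 12
X = np.vstack([rng.normal(loc=s, size=(n_frames, n_dims))
               for s in range(n_speakers)])
y = np.repeat(np.arange(n_speakers), n_frames)

# Stage 1: one GMM per speaker, trained only on that speaker's frames.
gmms = [GaussianMixture(n_components=2, random_state=0).fit(X[y == s])
        for s in range(n_speakers)]

# Stage 2: the per-frame log-likelihoods under every speaker's GMM
# become the input features of the DNN back-end.
scores = np.column_stack([g.score_samples(X) for g in gmms])
dnn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(scores, y)

# Identify unseen frames drawn near speaker 1's distribution.
new_frames = rng.normal(loc=1, size=(5, n_dims))
test_scores = np.column_stack([g.score_samples(new_frames) for g in gmms])
print(dnn.predict(test_scores))
```

The design intuition for such a cascade is that the GMMs compress each variable-length utterance into a fixed, low-dimensional score vector, which is a much easier input space for a discriminative neural network than raw frames.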


Metadata
Title
Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments
Authors
Ismail Shahin
Ali Bou Nassif
Shibani Hamsa
Publication date
04.10.2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 7/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-018-3760-2
