Published in: Neural Computing and Applications 20/2022

31.05.2022 | Original Article

Bidirectional parallel echo state network for speech emotion recognition

Authors: Hemin Ibrahim, Chu Kiong Loo, Fady Alnajjar


Abstract

Speech is an effective way for humans to communicate and exchange complex information, and speech signals have attracted great attention in human-computer interaction. Emotion recognition from speech has therefore become a hot research topic in the field of interaction between machines and humans. In this paper, we propose a novel speech emotion recognition (SER) system that adopts a multivariate time series of handcrafted features extracted from speech signals. A bidirectional echo state network with two parallel reservoir layers is applied to capture additional independent information: the parallel reservoirs produce multiple representations for each direction of the bidirectional data, combined in two stages of concatenation. Sparse random projection is adopted to reduce the high-dimensional sparse reservoir output for each direction separately. Random over-sampling and random under-sampling are used to overcome the imbalanced nature of the speech emotion datasets. The performance of the proposed parallel ESN model is evaluated in speaker-independent experiments on the EMO-DB, SAVEE, RAVDESS, and FAU Aibo datasets. The results show that the proposed SER model is superior to a single-reservoir baseline and to state-of-the-art studies.
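The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the reservoir sizes, the 13-channel input (e.g. MFCC features), the last-state-plus-mean readout, and the 64-dimensional projection are all chosen here for the example. It shows the two-stage concatenation: each direction of the input drives both parallel reservoirs, the reservoir representations are concatenated and projected per direction, and the two directional vectors are then concatenated into the final feature.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

def make_reservoir(n_in, n_res, spectral_radius=0.9, seed=0):
    """Randomly initialized, untrained ESN reservoir (input and recurrent weights)."""
    r = np.random.default_rng(seed)
    W_in = r.uniform(-0.1, 0.1, size=(n_res, n_in))
    W = r.uniform(-1.0, 1.0, size=(n_res, n_res))
    # Rescale the recurrent weights so the spectral radius is below 1
    # (a standard sufficient heuristic for the echo state property).
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(x, W_in, W):
    """Drive a reservoir with a (T, n_in) multivariate series; return a fixed-size representation."""
    h = np.zeros(W.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_in @ x_t + W @ h)
        states.append(h)
    states = np.asarray(states)
    # Simple readout representation: final state concatenated with the mean state (assumed).
    return np.concatenate([states[-1], states.mean(axis=0)])

def represent(x, reservoirs, proj):
    """Bidirectional, two-reservoir representation with per-direction sparse random projection."""
    parts = []
    for seq in (x, x[::-1]):                              # forward and time-reversed input
        reps = [run_reservoir(seq, W_in, W) for W_in, W in reservoirs]
        concat = np.concatenate(reps)                     # first-stage concatenation (parallel reservoirs)
        parts.append(proj.transform(concat[None, :])[0])  # reduce each direction separately
    return np.concatenate(parts)                          # second-stage concatenation (directions)

n_in, n_res = 13, 100                                     # e.g. 13 MFCC channels (assumed)
reservoirs = [make_reservoir(n_in, n_res, seed=s) for s in (1, 2)]
proj = SparseRandomProjection(n_components=64, random_state=0)
proj.fit(np.zeros((1, 2 * 2 * n_res)))                    # fix the projection for the 4*n_res input
feature = represent(np.random.default_rng(0).normal(size=(50, n_in)), reservoirs, proj)
```

The resulting `feature` vector (here 128-dimensional) would then be fed to a classifier trained on resampled emotion labels; in the paper, random over- and under-sampling are applied beforehand to balance the classes.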


Metadata
Title
Bidirectional parallel echo state network for speech emotion recognition
Authors
Hemin Ibrahim
Chu Kiong Loo
Fady Alnajjar
Publication date
31.05.2022
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 20/2022
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-022-07410-2
