Published in: Neural Computing and Applications 20/2022

31.05.2022 | Original Article

Bidirectional parallel echo state network for speech emotion recognition

Authors: Hemin Ibrahim, Chu Kiong Loo, Fady Alnajjar


Abstract

Speech is an effective way for humans to communicate and exchange complex information, and speech signals have attracted great attention in human-computer interaction. Emotion recognition from speech has therefore become a hot research topic in the field of interaction between machines and humans. In this paper, we propose a novel speech emotion recognition (SER) system that adopts a multivariate time series of handcrafted features extracted from speech signals. A bidirectional echo state network with two parallel reservoir layers is applied to capture additional independent information: the parallel reservoirs produce multiple representations for each direction of the bidirectional data, combined in two stages of concatenation. Sparse random projection is adopted to reduce the high-dimensional sparse reservoir output for each direction separately. Random over-sampling and random under-sampling are used to overcome the imbalanced nature of the speech emotion datasets. The performance of the proposed parallel ESN model is evaluated in speaker-independent experiments on the EMO-DB, SAVEE, RAVDESS, and FAU Aibo datasets. The results show that the proposed SER model is superior to a single-reservoir baseline and to state-of-the-art studies.
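The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the reservoir sizes, the 13-channel input (e.g. MFCC features), the last-state-plus-mean readout, and the 64-dimensional projection are all chosen here for the example. It shows the two-stage concatenation: each direction of the input drives both parallel reservoirs, the reservoir representations are concatenated and projected per direction, and the two directional vectors are then concatenated into the final feature.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

def make_reservoir(n_in, n_res, spectral_radius=0.9, seed=0):
    """Randomly initialized, untrained ESN reservoir (input and recurrent weights)."""
    r = np.random.default_rng(seed)
    W_in = r.uniform(-0.1, 0.1, size=(n_res, n_in))
    W = r.uniform(-1.0, 1.0, size=(n_res, n_res))
    # Rescale the recurrent weights so the spectral radius is below 1
    # (a standard sufficient heuristic for the echo state property).
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(x, W_in, W):
    """Drive a reservoir with a (T, n_in) multivariate series; return a fixed-size representation."""
    h = np.zeros(W.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_in @ x_t + W @ h)
        states.append(h)
    states = np.asarray(states)
    # Simple readout representation: final state concatenated with the mean state (assumed).
    return np.concatenate([states[-1], states.mean(axis=0)])

def represent(x, reservoirs, proj):
    """Bidirectional, two-reservoir representation with per-direction sparse random projection."""
    parts = []
    for seq in (x, x[::-1]):                              # forward and time-reversed input
        reps = [run_reservoir(seq, W_in, W) for W_in, W in reservoirs]
        concat = np.concatenate(reps)                     # first-stage concatenation (parallel reservoirs)
        parts.append(proj.transform(concat[None, :])[0])  # reduce each direction separately
    return np.concatenate(parts)                          # second-stage concatenation (directions)

n_in, n_res = 13, 100                                     # e.g. 13 MFCC channels (assumed)
reservoirs = [make_reservoir(n_in, n_res, seed=s) for s in (1, 2)]
proj = SparseRandomProjection(n_components=64, random_state=0)
proj.fit(np.zeros((1, 2 * 2 * n_res)))                    # fix the projection for the 4*n_res input
feature = represent(np.random.default_rng(0).normal(size=(50, n_in)), reservoirs, proj)
```

The resulting `feature` vector (here 128-dimensional) would then be fed to a classifier trained on resampled emotion labels; in the paper, random over- and under-sampling are applied beforehand to balance the classes.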


Metadata
Title
Bidirectional parallel echo state network for speech emotion recognition
Authors
Hemin Ibrahim
Chu Kiong Loo
Fady Alnajjar
Publication date
31.05.2022
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 20/2022
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-022-07410-2
