ABSTRACT
Speech emotion recognition (SER) remains challenging due to factors such as the emotional corpus, the acoustic features, and the SER model itself. Most deep-learning-based SER methods use either a spectrogram or handcrafted features alone as input and therefore cannot capture sufficient emotional information. This paper proposes a feature fusion method based on Bidirectional Long Short-Term Memory (BLSTM) and Convolutional Neural Networks (CNN) that learns richer emotional features by combining contextual and spatial features. Statistical features are fed to a BLSTM network, which extracts the contextual features of the speech signal, while a log-mel spectrogram is fed to a CNN, which extracts the spatial features; the two branches jointly learn emotional features with good recognition performance. Experimental results show that the proposed method achieves a weighted accuracy of 74.14% and an unweighted accuracy of 65.62% on the IEMOCAP dataset. Comparisons with existing SER methods further verify its effectiveness.
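The two-branch architecture described above can be sketched as follows. This is a minimal illustrative model, not the paper's exact configuration: the layer sizes, the number of convolutional blocks, and the use of the last BLSTM time step and global average pooling as branch summaries are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class BLSTMCNNFusion(nn.Module):
    """Sketch of the fusion scheme: per-frame statistical features -> BLSTM
    (contextual features), log-mel spectrogram -> CNN (spatial features),
    concatenated and classified. All dimensions are illustrative."""

    def __init__(self, stat_dim=32, n_mels=64, hidden=128, n_classes=4):
        super().__init__()
        # Context branch: bidirectional LSTM over frame-level statistical features.
        self.blstm = nn.LSTM(stat_dim, hidden, batch_first=True, bidirectional=True)
        # Spatial branch: small CNN over the log-mel spectrogram (1 x mels x frames).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> 32-dim spatial feature
        )
        self.classifier = nn.Linear(2 * hidden + 32, n_classes)

    def forward(self, stats, logmel):
        # stats:  (batch, frames, stat_dim)
        # logmel: (batch, 1, n_mels, frames)
        out, _ = self.blstm(stats)
        context = out[:, -1, :]                # last time step, both directions
        spatial = self.cnn(logmel).flatten(1)  # (batch, 32)
        return self.classifier(torch.cat([context, spatial], dim=1))

model = BLSTMCNNFusion()
logits = model(torch.randn(2, 100, 32), torch.randn(2, 1, 64, 100))
print(logits.shape)  # torch.Size([2, 4]): one score per emotion class
```

The key design point is that fusion happens at the feature level: each branch produces a fixed-length summary of the utterance, and the classifier sees their concatenation rather than either representation alone.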
Index Terms
- Speech Emotion Recognition Based on BLSTM and CNN Feature Fusion