ABSTRACT
Speech emotion recognition (SER) remains challenging due to factors such as the emotional corpus, the acoustic features, and the SER model itself. Most deep-learning-based SER methods use either a spectrogram or handcrafted features alone as input and therefore cannot capture sufficient emotional information. This paper proposes a feature fusion method based on Bidirectional Long Short-Term Memory (BLSTM) and Convolutional Neural Networks (CNN) that learns richer emotional features by combining contextual and spatial features. Statistical features are fed to a BLSTM network, which extracts the contextual features of the speech signal, while a log-mel spectrogram is fed to a CNN, which extracts the spatial features; the two branches jointly learn emotional features with good recognition performance. Experimental results show that the proposed method achieves a weighted accuracy of 74.14% and an unweighted accuracy of 65.62% on the IEMOCAP dataset. Comparisons with existing SER methods further verify its effectiveness.
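The two-branch architecture described above can be sketched as follows. This is a minimal illustrative model, not the paper's exact configuration: the layer sizes, the number of convolutional blocks, and the use of the last BLSTM time step and global average pooling as branch summaries are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class BLSTMCNNFusion(nn.Module):
    """Sketch of the fusion scheme: per-frame statistical features -> BLSTM
    (contextual features), log-mel spectrogram -> CNN (spatial features),
    concatenated and classified. All dimensions are illustrative."""

    def __init__(self, stat_dim=32, n_mels=64, hidden=128, n_classes=4):
        super().__init__()
        # Context branch: bidirectional LSTM over frame-level statistical features.
        self.blstm = nn.LSTM(stat_dim, hidden, batch_first=True, bidirectional=True)
        # Spatial branch: small CNN over the log-mel spectrogram (1 x mels x frames).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> 32-dim spatial feature
        )
        self.classifier = nn.Linear(2 * hidden + 32, n_classes)

    def forward(self, stats, logmel):
        # stats:  (batch, frames, stat_dim)
        # logmel: (batch, 1, n_mels, frames)
        out, _ = self.blstm(stats)
        context = out[:, -1, :]                # last time step, both directions
        spatial = self.cnn(logmel).flatten(1)  # (batch, 32)
        return self.classifier(torch.cat([context, spatial], dim=1))

model = BLSTMCNNFusion()
logits = model(torch.randn(2, 100, 32), torch.randn(2, 1, 64, 100))
print(logits.shape)  # torch.Size([2, 4]): one score per emotion class
```

The key design point is that fusion happens at the feature level: each branch produces a fixed-length summary of the utterance, and the classifier sees their concatenation rather than either representation alone.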
Index Terms
- Speech Emotion Recognition Based on BLSTM and CNN Feature Fusion