We investigate a number of Deep Neural Network (DNN) architectures for emotion identification on the IEMOCAP database. First we compare feature-extraction front-ends: high-dimensional MFCC input (equivalent to filterbanks) versus frequency-domain and time-domain approaches that learn the filters as part of the network. We obtain the best results with the time-domain filter-learning approach. Next we investigate different ways to aggregate information over the duration of an utterance: approaches with a single label per utterance and time aggregation inside the network, and approaches where the label is repeated for each frame. Having a separate label per frame works best, and the best architecture we tried interleaves TDNN-LSTM layers with time-restricted self-attention, achieving a weighted accuracy of 70.6%, versus 61.8% for the best previously published system, which used 257-dimensional Fourier log-energies as input.
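To illustrate the time-restricted self-attention component mentioned above, here is a minimal NumPy sketch of single-head dot-product attention restricted to a local window of frames. This is not the paper's implementation (which interleaves such layers with TDNN-LSTM layers in Kaldi); the window sizes and the unparameterized query/key/value (all taken as the raw frame vectors) are simplifying assumptions for illustration only.

```python
import numpy as np

def time_restricted_self_attention(x, left=3, right=3):
    """Each output frame attends only to frames in the local window
    [t-left, t+right], rather than the whole utterance.
    x: (T, d) array of frame-level features. Returns (T, d)."""
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - left), min(T, t + right + 1)
        ctx = x[lo:hi]                      # (W, d) restricted context
        scores = ctx @ x[t] / np.sqrt(d)    # scaled dot-product scores
        w = np.exp(scores - scores.max())   # numerically stable softmax
        w /= w.sum()
        out[t] = w @ ctx                    # weighted sum over the window
    return out
```

Restricting attention to a window keeps the per-frame cost independent of utterance length, which matters when a separate label (and hence a separate output) is produced for every frame.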
Cite as: Sarma, M., Ghahremani, P., Povey, D., Goel, N.K., Sarma, K.K., Dehak, N. (2018) Emotion Identification from Raw Speech Signals Using DNNs. Proc. Interspeech 2018, 3097-3101, doi: 10.21437/Interspeech.2018-1353
@inproceedings{sarma18_interspeech,
  author={Mousmita Sarma and Pegah Ghahremani and Daniel Povey and Nagendra Kumar Goel and Kandarpa Kumar Sarma and Najim Dehak},
  title={{Emotion Identification from Raw Speech Signals Using DNNs}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3097--3101},
  doi={10.21437/Interspeech.2018-1353}
}