We investigate a number of Deep Neural Network (DNN) architectures for emotion identification on the IEMOCAP database. First we compare feature-extraction front-ends: high-dimensional MFCC input (equivalent to filterbanks) versus frequency-domain and time-domain approaches that learn the filters as part of the network. We obtain the best results with the time-domain filter-learning approach. Next we investigate different ways to aggregate information over the duration of an utterance: approaches with a single label per utterance and time aggregation inside the network, and approaches where the label is repeated for each frame. Having a separate label per frame works best, and the best architecture we tried interleaves TDNN-LSTM layers with time-restricted self-attention, achieving a weighted accuracy of 70.6%, versus 61.8% for the best previously published system, which used 257-dimensional Fourier log-energies as input.
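To illustrate the time-restricted self-attention component mentioned above, here is a minimal NumPy sketch of single-head dot-product attention restricted to a local window of frames. This is not the paper's implementation (which interleaves such layers with TDNN-LSTM layers in Kaldi); the window sizes and the unparameterized query/key/value (all taken as the raw frame vectors) are simplifying assumptions for illustration only.

```python
import numpy as np

def time_restricted_self_attention(x, left=3, right=3):
    """Each output frame attends only to frames in the local window
    [t-left, t+right], rather than the whole utterance.
    x: (T, d) array of frame-level features. Returns (T, d)."""
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - left), min(T, t + right + 1)
        ctx = x[lo:hi]                      # (W, d) restricted context
        scores = ctx @ x[t] / np.sqrt(d)    # scaled dot-product scores
        w = np.exp(scores - scores.max())   # numerically stable softmax
        w /= w.sum()
        out[t] = w @ ctx                    # weighted sum over the window
    return out
```

Restricting attention to a window keeps the per-frame cost independent of utterance length, which matters when a separate label (and hence a separate output) is produced for every frame.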
Cite as: Sarma, M., Ghahremani, P., Povey, D., Goel, N.K., Sarma, K.K., Dehak, N. (2018) Emotion Identification from Raw Speech Signals Using DNNs. Proc. Interspeech 2018, 3097-3101, doi: 10.21437/Interspeech.2018-1353
@inproceedings{sarma18_interspeech,
  author={Mousmita Sarma and Pegah Ghahremani and Daniel Povey and Nagendra Kumar Goel and Kandarpa Kumar Sarma and Najim Dehak},
  title={{Emotion Identification from Raw Speech Signals Using DNNs}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3097--3101},
  doi={10.21437/Interspeech.2018-1353}
}