Published in: International Journal of Speech Technology 3/2019

22-11-2018

Enhanced speech emotion detection using deep neural networks

Authors: S. Lalitha, Shikha Tripathi, Deepa Gupta

Abstract

This paper investigates the performance of perceptual speech features for emotion detection. The perceptual features considered are Mel frequency cepstral coefficients (MFCCs), perceptual linear predictive cepstrum (PLPC), Mel frequency perceptual linear prediction cepstrum (MFPLPC), bark frequency cepstral coefficients (BFCC), revised perceptual linear prediction coefficients (RPLP) and inverted Mel frequency cepstral coefficients (IMFCC). An algorithm using these auditory cues is evaluated with deep neural networks (DNN). The novelty of the work lies in analysing the perceptual features to identify the predominant ones that carry significant emotional information about the speaker. The validity of the algorithm is assessed on the publicly available Berlin database, both for seven emotions in a one-dimensional (categorical) space and in a two-dimensional continuous space spanning the valence and arousal dimensions. Comparative analysis reveals that a considerable improvement in emotion recognition performance is obtained using a DNN with the identified combination of perceptual features.
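The abstract's central technique is cepstral feature extraction of the MFCC family: warp the power spectrum onto a perceptual frequency scale, take logs, and decorrelate with a DCT. The sketch below is an illustrative NumPy-only MFCC pipeline, not the authors' implementation; all parameter values (frame size, hop, filter counts) are assumptions chosen as common defaults.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """Return an (n_frames, n_ceps) matrix of MFCCs for a mono signal."""
    # 1. Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)

    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 3. Triangular filterbank spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # 4. Log Mel energies, then an orthonormal DCT-II to decorrelate
    #    them into cepstral coefficients.
    feats = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels)) * np.sqrt(2.0 / n_mels)
    dct[0] /= np.sqrt(2.0)
    return feats @ dct.T

# Example: 1 s of a 440 Hz tone at 16 kHz yields 61 frames of 13 coefficients.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
coeffs = mfcc(sig)
print(coeffs.shape)  # (61, 13)
```

Variants such as IMFCC or BFCC follow the same template with the filterbank placed on an inverted Mel or Bark scale; the frame-level coefficient matrix is what would then be fed (e.g. after statistics pooling) to a DNN classifier.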


Metadata
Title
Enhanced speech emotion detection using deep neural networks
Authors
S. Lalitha
Shikha Tripathi
Deepa Gupta
Publication date
22-11-2018
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2019
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-018-09572-8
