2015 | OriginalPaper | Chapter

19. Be at Odds? Deep and Hierarchical Neural Networks for Classification and Regression of Conflict in Speech

Authors: Raymond Brueckner, Björn Schuller

Published in: Conflict and Multimodal Communication

Publisher: Springer International Publishing

Abstract

Conflict is a fundamental phenomenon that inevitably arises in inter-human communication, yet it has only recently become a subject of study in the emerging field of computational paralinguistics. As speech is a predominant carrier of information about the valence and level of conflict, we investigate and demonstrate how deep and hierarchical neural networks, which have become the new mainstream paradigm in automatic speech recognition over the last few years, can be leveraged to automatically classify and predict levels of conflict based purely on audio recordings. For this purpose we adopt a neural network architecture that we have previously applied successfully to another paralinguistics task. On the Conflict Sub-Challenge data set of the Interspeech 2013 Computational Paralinguistics Challenge (ComParE), we obtained the best results reported so far in the literature on both the classification and the regression task. These results demonstrate that deep neural networks are also appropriate for predicting conflict levels, in both classification and regression settings.
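
The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea: a feed-forward network with shared hidden layers, one output head for classification and one for regression. The layer sizes, the ReLU activation, the random initialisation, the two-class setup and all function names are assumptions made for illustration. They are not taken from the chapter, and no training is shown.

```python
import math
import random


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, hidden_layers, class_head, score_head):
    """Shared hidden stack followed by a classification and a regression head."""
    activation = features
    for weights, biases in hidden_layers:
        activation = relu(dense(activation, weights, biases))
    class_probs = softmax(dense(activation, *class_head))  # class posteriors
    conflict_score = dense(activation, *score_head)[0]     # continuous level
    return class_probs, conflict_score


rng = random.Random(0)
n_features, n_hidden = 8, 16                 # hypothetical sizes
hidden_layers = [init_layer(n_features, n_hidden, rng),
                 init_layer(n_hidden, n_hidden, rng)]
class_head = init_layer(n_hidden, 2, rng)    # two conflict classes (assumed)
score_head = init_layer(n_hidden, 1, rng)    # one continuous conflict level

# Stand-in for one utterance-level acoustic feature vector (random values).
features = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
class_probs, conflict_score = forward(features, hidden_layers,
                                      class_head, score_head)
```

Here `class_probs` is a probability distribution over the assumed conflict classes and `conflict_score` is a single real-valued prediction. In practice the two tasks could equally be handled by separate networks; the shared stack is just one plausible design.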

Metadata
Title: Be at Odds? Deep and Hierarchical Neural Networks for Classification and Regression of Conflict in Speech
Authors: Raymond Brueckner, Björn Schuller
Copyright Year: 2015
DOI: https://doi.org/10.1007/978-3-319-14081-0_19
