
2016 | OriginalPaper | Chapter

4. Real-time Incremental Processing


Abstract

The features and the modelling methods used in this thesis have been selected with the goal of on-line processing in mind; however, most of them are general methods that are suitable for both on-line and off-line processing. This chapter deals specifically with the issues encountered in on-line (i.e., incremental) processing, such as segmentation, constraints on feature extraction, and complexity and run-time constraints.
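A minimal sketch of what incremental processing implies for feature extraction, assuming nothing about the openSMILE API: audio arrives in small blocks, and overlapping frames are emitted as soon as enough samples have been buffered, rather than only after the whole recording has been read. All names here (`IncrementalFrameExtractor`, `push`) are hypothetical illustrations.

```python
class IncrementalFrameExtractor:
    """Emit overlapping frames incrementally as audio blocks arrive.

    Hypothetical illustration, not the openSMILE API: frames of
    `frame_size` samples are produced every `hop_size` samples as soon
    as enough data has been buffered, so downstream features can be
    computed on-line with bounded latency.
    """

    def __init__(self, frame_size=400, hop_size=160):  # 25 ms / 10 ms at 16 kHz
        self.frame_size = frame_size
        self.hop_size = hop_size
        self._buf = []  # samples received but not yet fully consumed

    def push(self, samples):
        """Feed one block of samples; return all frames now complete."""
        self._buf.extend(samples)
        frames = []
        while len(self._buf) >= self.frame_size:
            frames.append(self._buf[:self.frame_size])
            del self._buf[:self.hop_size]  # advance by the frame shift
        return frames


# A 30 ms block at 16 kHz already completes the first 25 ms frame:
ex = IncrementalFrameExtractor(frame_size=400, hop_size=160)
print(len(ex.push([0.0] * 480)))  # 1 frame ready, 320 samples still buffered
```

The key property for on-line operation is that `push` returns immediately with whatever frames are complete, instead of waiting for the end of the input.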


Footnotes
1
In openSMILE this behaviour is implemented in the cTurnDetector component.
 
2
Incremental segmentation of music into beats and bars is not part of this thesis. An on-line segmentation approach for music has not been investigated; however, an off-line segmentation method based on the beat tracker presented by Schuller et al. (2007b) and Eyben et al. (2007) has been used.
 
4
According to Google scholar citations.
 
7
Also referred to as tick-loop in the code.
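The tick-loop named here can be sketched roughly as follows; this is an illustrative Python sketch of the general pattern, not the actual openSMILE C++ implementation, and `CountingSource` is a hypothetical stand-in component: each component exposes a `tick()` method that returns whether it did any work, and the loop keeps making passes over all components until a full pass yields no progress.

```python
def run_tick_loop(components):
    """Rough sketch of a tick loop (illustrative, not the openSMILE code).

    Each component's tick() consumes or produces whatever data is
    currently available and returns True if it made progress.
    Processing ends when one full pass over all components makes no
    progress at all, i.e. the input is exhausted and buffers drained.
    """
    passes = 0
    while True:
        passes += 1
        progress = False
        for comp in components:
            if comp.tick():
                progress = True
        if not progress:
            return passes


class CountingSource:
    """Hypothetical component that delivers `n` blocks, then runs dry."""

    def __init__(self, n):
        self.remaining = n

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


print(run_tick_loop([CountingSource(3)]))  # 4 passes: 3 productive + 1 idle
```

A scheme like this lets components with different frame rates coexist in one loop: each simply does as much work as its input buffers allow on every pass.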
 
8
In practice limited by the range of the ‘long’ data-type (32-bit or 64-bit).
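To put this limit in concrete terms: a signed 32-bit sample counter overflows after roughly half a day of audio at a common rate, whereas a 64-bit counter is unlimited for any practical purpose. The 48 kHz rate below is an assumed example, not a value from the text.

```python
# How long a monotonically increasing sample index lasts before it
# overflows, assuming (for illustration) a 48 kHz sample rate.
SAMPLE_RATE = 48000

max_32 = 2**31 - 1                      # largest signed 32-bit value
hours_32 = max_32 / SAMPLE_RATE / 3600  # ~12.4 hours of audio

max_64 = 2**63 - 1                      # largest signed 64-bit value
years_64 = max_64 / SAMPLE_RATE / 3600 / 24 / 365.25  # millions of years

print(round(hours_32, 1))  # 12.4
```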
 
Literature
A. Batliner, D. Seppi, S. Steidl, B. Schuller, Segmenting into adequate units for automatic recognition of emotion-related episodes: a speech-based approach. Adv. Human Comput. Interact., Special Issue on Emotion-Aware Natural Interaction 2010, 1–15 (2010). Article ID 782802 (on-line)
M. Ben-Ari, Principles of Concurrent and Distributed Programming (Prentice Hall, Englewood Cliffs, 1990). ISBN 0-13-711821-X
C. Busso, S. Lee, S. Narayanan, Using neutral speech models for emotional speech analysis, in Proceedings of the INTERSPEECH 2007, Antwerp, Belgium, August 2007. ISCA, pp. 2225–2228
G. Caridakis, L. Malatesta, L. Kessous, N. Amir, A. Raouzaiou, K. Karpouzis, Modeling naturalistic affective states via facial and vocal expressions recognition, in Proceedings of the 8th International Conference on Multimodal Interfaces (ICMI) 2006, Banff, Canada, 2006. ACM, pp. 146–154
J. Cohen, P. Cohen, S.G. West, L.S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 2nd edn. (Lawrence Erlbaum Associates, Hillsdale, 2003)
J. Deng, B. Schuller, Confidence measures in speech emotion recognition based on semi-supervised learning, in Proceedings of INTERSPEECH 2012, Portland, September 2012. ISCA
J. Deng, W. Han, B. Schuller, Confidence measures for speech emotion recognition: a start, in Proceedings of the 10th ITG Symposium on Speech Communication, ed. by T. Fingscheidt, W. Kellermann (Braunschweig, Germany, September 2012). IEEE, pp. 1–4
E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, K. Karpouzis, The HUMAINE Database, vol. 4738, Lecture Notes in Computer Science (Springer, Berlin, 2007), pp. 488–500
P. Ekman, W.V. Friesen, Unmasking the Face: A Guide to Recognizing Emotions from Facial Expressions (Prentice Hall, Englewood Cliffs, 1975)
F. Eyben, B. Schuller, S. Reiter, G. Rigoll, Wearable assistance for the ballroom-dance hobbyist – holistic rhythm analysis and dance-style classification, in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 2007, Beijing, China, July 2007. IEEE, pp. 92–95
F. Eyben, M. Wöllmer, B. Schuller, openSMILE – the Munich versatile and fast open-source audio feature extractor, in Proceedings of ACM Multimedia 2010, Florence, Italy, 2010a. ACM, pp. 1459–1462
F. Eyben, M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, R. Cowie, On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. J. Multimodal User Interfaces (JMUI) 3(1–2), 7–19 (2010d). doi:10.1007/s12193-009-0032-6
F. Eyben, M. Wöllmer, M. Valstar, H. Gunes, B. Schuller, M. Pantic, String-based audiovisual fusion of behavioural events for the assessment of dimensional affect, in Proceedings of the International Workshop on Emotion Synthesis, Representation, and Analysis in Continuous Space (EmoSPACE) 2011, held in conjunction with FG 2011, Santa Barbara, March 2011. IEEE, pp. 322–329
F. Eyben, M. Wöllmer, B. Schuller, A multi-task approach to continuous five-dimensional affect sensing in natural speech. ACM Trans. Interact. Intell. Syst., Special Issue on Affective Interaction in Natural Environments 2(1), 29 (2012). Article No. 6
F. Eyben, F. Weninger, F. Gross, B. Schuller, Recent developments in openSMILE, the Munich open-source multimedia feature extractor, in Proceedings of ACM Multimedia 2013, Barcelona, Spain, 2013a. ACM, pp. 835–838
S. Fernandez, A. Graves, J. Schmidhuber, Phoneme recognition in TIMIT with BLSTM-CTC, Technical report, IDSIA, Switzerland, 2008
J.R.J. Fontaine, K.R. Scherer, E.B. Roesch, P.C. Ellsworth, The world of emotions is not two-dimensional. Psychol. Sci. 18(2), 1050–1057 (2007)
N. Fragopanagos, J.G. Taylor, Emotion recognition in human-computer interaction. Neural Netw., 2005 Special Issue on Emotion and Brain 18(4), 389–405 (2005)
D. Glowinski, A. Camurri, G. Volpe, N. Dael, K. Scherer, Technique for automatic emotion recognition by body gesture analysis, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops 2008 (CVPRW’08), Anchorage, June 2008. IEEE, pp. 1–6
A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
M. Grimm, K. Kroschel, S. Narayanan, Support vector regression for automatic recognition of spontaneous emotions in speech, in Proceedings of the ICASSP 2007, vol. 4, Honolulu, April 2007a. IEEE, pp. 1085–1088
M. Grimm, E. Mower, K. Kroschel, S. Narayanan, Primitives based estimation and evaluation of emotions in speech. Speech Commun. 49, 787–800 (2007b)
H. Gunes, M. Pantic, Dimensional emotion prediction from spontaneous head gestures for interaction with sensitive artificial listeners, in Proceedings of the International Conference on Intelligent Virtual Agents (IVA) (Springer, Berlin, 2010a), pp. 371–377. ISBN 978-3-642-15891-9
H. Gunes, M. Pantic, Automatic, dimensional and continuous emotion recognition. Int. J. Synth. Emot. (IJSE) 1(1), 68–99 (2010b)
M.A. Hall, Correlation-based Feature Subset Selection for Machine Learning, Doctoral thesis, University of Waikato, Hamilton, New Zealand, 1998
W. Han, H. Li, H. Ruan, L. Ma, J. Sun, B. Schuller, Active learning for dimensional speech emotion recognition, in Proceedings of INTERSPEECH 2013, Lyon, France, August 2013. ISCA, pp. 2856–2859
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
S. Ioannou, A. Raouzaiou, V. Tzouvaras, T. Mailis, K. Karpouzis, S. Kollias, Emotion recognition through facial expression analysis based on a neurofuzzy method. Neural Netw., 2005 Special Issue on Emotion and Brain 18(4), 423–435 (2005)
C.-C. Lee, C. Busso, S. Lee, S.S. Narayanan, Modeling mutual influence of interlocutor emotion states in dyadic spoken interactions, in Proceedings of INTERSPEECH 2009, Brighton, UK, September 2009. ISCA, pp. 1983–1986
G. McKeown, M. Valstar, R. Cowie, M. Pantic, M. Schroder, The SEMAINE database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012). doi:10.1109/T-AFFC.2011.20. ISSN 1949-3045
E. Mower, S.S. Narayanan, A hierarchical static-dynamic framework for emotion classification, in Proceedings of the ICASSP 2011, Prague, Czech Republic, May 2011. IEEE, pp. 2372–2375
M. Nicolaou, H. Gunes, M. Pantic, Audio-visual classification and fusion of spontaneous affective data in likelihood space, in Proceedings of the 20th International Conference on Pattern Recognition (ICPR) 2010, Istanbul, Turkey, August 2010. IEEE, pp. 3695–3699
V. Parsa, D. Jamieson, Acoustic discrimination of pathological voice: sustained vowels versus continuous speech. J. Speech Lang. Hear. Res. 44, 327–339 (2001)
C. Peters, C. O’Sullivan, Synthetic vision and memory for autonomous virtual humans. Comput. Graph. Forum 21(4), 743–753 (2002)
M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, in Proceedings of the IEEE International Conference on Neural Networks, vol. 1 (San Francisco, 1993). IEEE, pp. 586–591. doi:10.1109/icnn.1993.298623
E.M. Schmidt, Y.E. Kim, Prediction of time-varying musical mood distributions from audio, in Proceedings of ISMIR 2010, Utrecht, The Netherlands, 2010. ISMIR
M. Schröder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, M. Wöllmer, Building autonomous sensitive artificial listeners. IEEE Trans. Affect. Comput. 3(2), 165–183 (2012)
B. Schuller, G. Rigoll, Timing levels in segment-based speech emotion recognition, in Proceedings of the INTERSPEECH-ICSLP 2006, Pittsburgh, September 2006. ISCA, pp. 1818–1821
B. Schuller, B. Vlasenko, R. Minguez, G. Rigoll, A. Wendemuth, Comparing one and two-stage acoustic modeling in the recognition of emotion in speech, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2007, Kyoto, Japan, 2007a. IEEE, pp. 596–600
B. Schuller, F. Eyben, G. Rigoll, Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles, in Proceedings of the ICASSP 2007, vol. I, Honolulu, April 2007b. IEEE, pp. 217–220
B. Schuller, D. Seppi, A. Batliner, A. Maier, S. Steidl, Towards more reality in the recognition of emotional speech, in Proceedings of the ICASSP 2007, vol. IV, Honolulu, 2007c. IEEE, pp. 941–944
B. Schuller, F. Eyben, G. Rigoll, Beat-synchronous data-driven automatic chord labeling, in Proceedings of the 34. Jahrestagung für Akustik (DAGA) 2008, Dresden, Germany, March 2008. DEGA, pp. 555–556
B. Schuller, R. Müller, F. Eyben, J. Gast, B. Hörnler, M. Wöllmer, G. Rigoll, A. Höthker, H. Konosu, Being bored? Recognising natural interest by extensive audiovisual integration for real-life application. Image Vis. Comput., Special Issue on Visual and Multimodal Analysis of Human Spontaneous Behavior 27(12), 1760–1774 (2009a)
B. Schuller, S. Steidl, A. Batliner, F. Jurcicek, The INTERSPEECH 2009 emotion challenge, in Proceedings of INTERSPEECH 2009, Brighton, UK, September 2009b. ISCA, pp. 312–315
B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, Acoustic emotion recognition: a benchmark comparison of performances, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2009, Merano, Italy, December 2009c. IEEE, pp. 552–557
B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The INTERSPEECH 2010 paralinguistic challenge, in Proceedings of INTERSPEECH 2010, Makuhari, Japan, September 2010. ISCA, pp. 2794–2797
B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, M. Pantic, AVEC 2011 – the first international audio/visual emotion challenge, in Proceedings of the First International Audio/Visual Emotion Challenge and Workshop, AVEC 2011, held in conjunction with the International HUMAINE Association Conference on Affective Computing and Intelligent Interaction (ACII) 2011, vol. II, ed. by B. Schuller, M. Valstar, R. Cowie, M. Pantic (Springer, Memphis, 2011a), pp. 415–424
B. Schuller, A. Batliner, S. Steidl, F. Schiel, J. Krajewski, The INTERSPEECH 2011 speaker state challenge, in Proceedings of INTERSPEECH 2011, Florence, Italy, August 2011b. ISCA, pp. 3201–3204
B. Schuller, A. Batliner, S. Steidl, D. Seppi, Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun., Special Issue on Sensing Emotion and Affect – Facing Realism in Speech Processing 53(9/10), 1062–1087 (2011c)
B. Schuller, M. Valstar, R. Cowie, M. Pantic, AVEC 2012: the continuous audio/visual emotion challenge – an introduction, in Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI) 2012, ed. by L.-P. Morency, D. Bohus, H.K. Aghajan, J. Cassell, A. Nijholt, J. Epps (ACM, Santa Monica, 2012a), pp. 361–362
B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, B. Weiss, The INTERSPEECH 2012 speaker trait challenge, in Proceedings of INTERSPEECH 2012, Portland, OR, USA, September 2012b. ISCA
B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, et al., The INTERSPEECH 2013 computational paralinguistics challenge: Social Signals, Conflict, Emotion, Autism, in Proceedings of INTERSPEECH 2013, Lyon, France, 2013. ISCA, pp. 148–152
S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech (Logos Verlag, Berlin, 2009)
S. Steidl, B. Schuller, A. Batliner, D. Seppi, The hinterland of emotions: facing the open-microphone challenge, in Proceedings of the 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction (ACII), vol. I, Amsterdam, The Netherlands, 2009. IEEE, pp. 690–697
P. Werbos, Backpropagation through time: what it does and how to do it. Proc. IEEE 78, 1550–1560 (1990)
M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, R. Cowie, Abandoning emotion classes – towards continuous emotion recognition with modelling of long-range dependencies, in Proceedings of the INTERSPEECH 2008, Brisbane, Australia, September 2008. ISCA, pp. 597–600
M. Wöllmer, F. Eyben, B. Schuller, E. Douglas-Cowie, R. Cowie, Data-driven clustering in emotional space for affect recognition using discriminatively trained LSTM networks, in Proceedings of INTERSPEECH 2009, Brighton, UK, September 2009. ISCA, pp. 1595–1598
M. Wöllmer, B. Schuller, F. Eyben, G. Rigoll, Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE J. Sel. Top. Signal Process., Special Issue on “Speech Processing for Natural Interaction with Intelligent Environments” 4(5), 867–881 (2010)
D. Wu, T. Parsons, E. Mower, S.S. Narayanan, Speech emotion estimation in 3D space, in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 2010, Singapore, July 2010a. IEEE, pp. 737–742
D. Wu, T. Parsons, S.S. Narayanan, Acoustic feature analysis in speech emotion primitives estimation, in Proceedings of the INTERSPEECH 2010, Makuhari, Japan, September 2010b. ISCA, pp. 785–788
P.V. Yee, S. Haykin, Regularized Radial Basis Function Networks: Theory and Applications (Wiley, New York, 2001), 208 p. ISBN 0-471-35349-3
Z. Zeng, M. Pantic, G.I. Roisman, T.S. Huang, A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31(1), 39–58 (2009)
Z. Zhang, B. Schuller, Semi-supervised learning helps in sound event classification, in Proceedings of ICASSP 2012, Kyoto, March 2012. IEEE, pp. 333–336
Metadata
Title
Real-time Incremental Processing
Author
Florian Eyben
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-27299-3_4