Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification

Authors: Na Yang, Jianbo Yuan, Yun Zhou, Ilker Demirkol, Zhiyao Duan, Wendi Heinzelman, Melissa Sturge-Apple

Published in: International Journal of Speech Technology, Issue 1/2017
Publication date: 28-10-2016

Abstract

As an essential approach to understanding human interactions, emotion classification is a vital component of behavioral studies and an important element in the design of context-aware systems. Recent studies have shown that speech carries rich information about emotion, and numerous speech-based emotion classification methods have been proposed. However, classification performance still falls short of what is needed for these algorithms to be used in real systems. We present an emotion classification system that uses several one-against-all support vector machines with a thresholding fusion mechanism to combine their individual outputs; the fusion mechanism can effectively increase emotion classification accuracy at the expense of rejecting some samples as unclassified. Results show that the proposed system outperforms three state-of-the-art methods and that the thresholding fusion mechanism effectively improves classification accuracy, which is important for applications that require very high accuracy but do not require that every sample be classified. We evaluate the system in several challenging scenarios, including speaker-independent tests, tests on noisy speech signals, and tests on non-professional acted recordings, to demonstrate its performance and the effectiveness of the thresholding fusion mechanism in real scenarios.
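
To make the classification scheme concrete, the sketch below shows one way to realize one-against-all SVMs with thresholding fusion, the core mechanism described in the abstract. It is a minimal illustration in Python with scikit-learn (an assumed toolchain, not necessarily the authors'); the RBF kernel, the use of posterior probabilities as confidence scores, and the threshold value of 0.7 are placeholder assumptions rather than the paper's configuration.

  from sklearn.svm import SVC

  def train_oaa_svms(X, y, emotions):
      # Train one binary SVM per emotion class (one-against-all):
      # the target emotion is the positive class, all other emotions are negative.
      models = {}
      for emo in emotions:
          clf = SVC(kernel='rbf', probability=True)  # kernel choice is an assumption
          clf.fit(X, (y == emo).astype(int))
          models[emo] = clf
      return models

  def classify_with_threshold(models, x, threshold=0.7):
      # Thresholding fusion: take the most confident per-class score and
      # reject the sample as unclassified if that score falls below the threshold.
      scores = {emo: clf.predict_proba(x.reshape(1, -1))[0, 1]
                for emo, clf in models.items()}
      best = max(scores, key=scores.get)
      return best if scores[best] >= threshold else None

Raising the threshold trades coverage for accuracy: more samples are rejected as unclassified, but those that are classified receive labels with higher confidence, which is the trade-off the thresholding fusion mechanism exposes.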

Literature
go back to reference Al Machot, F., Mosa, A. H., Dabbour, K., Fasih, A., Schwarzlmuller, C., Ali, M., & Kyamakya, K. (2011). A novel real-time emotion detection system from audio streams based on Bayesian quadratic discriminate classifier for ADAS. In Nonlinear Dynamics and Synchronization 16th Int’l Symposium on Theoretical Electrical Engineering, Joint 3rd Int’l Workshop on. Al Machot, F., Mosa, A. H., Dabbour, K., Fasih, A., Schwarzlmuller, C., Ali, M., & Kyamakya, K. (2011). A novel real-time emotion detection system from audio streams based on Bayesian quadratic discriminate classifier for ADAS. In Nonlinear Dynamics and Synchronization 16th Int’l Symposium on Theoretical Electrical Engineering, Joint 3rd Int’l Workshop on.
go back to reference Ang, J., Dhillon, R., Krupski, A., Shriberg, E., & Stolcke, A. (2002). Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In Proceeings of International Conference on Spoken Language Processing (pp. 2037–2040). Ang, J., Dhillon, R., Krupski, A., Shriberg, E., & Stolcke, A. (2002). Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In Proceeings of International Conference on Spoken Language Processing (pp. 2037–2040).
go back to reference Bakeman, R. (1997). Behavioral observation and coding. Handbook of research methods in social psychology. Cambridge: Cambridge University Press. Bakeman, R. (1997). Behavioral observation and coding. Handbook of research methods in social psychology. Cambridge: Cambridge University Press.
go back to reference Bänziger, T., Patel, S., & Scherer, K. R. (2014). The role of perceived voice and speech characteristics in vocal emotion communication. Journal of nonverbal behavior, 38(1), 31–52.CrossRef Bänziger, T., Patel, S., & Scherer, K. R. (2014). The role of perceived voice and speech characteristics in vocal emotion communication. Journal of nonverbal behavior, 38(1), 31–52.CrossRef
go back to reference Bao, H., Xu, M. X., & Zheng, T. F. (2007). Emotion attribute projection for speaker recognition on emotional speech. In Procceedings of Interspeech (pp. 758–761). Bao, H., Xu, M. X., & Zheng, T. F. (2007). Emotion attribute projection for speaker recognition on emotional speech. In Procceedings of Interspeech (pp. 758–761).
go back to reference Barra-Chicote, R., Yamagishi, J., King, S., Montero, J. M., & Macias-Guarasa, J. (2010). Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech. Speech Communication, 52(5), 394–404.CrossRef Barra-Chicote, R., Yamagishi, J., King, S., Montero, J. M., & Macias-Guarasa, J. (2010). Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech. Speech Communication, 52(5), 394–404.CrossRef
go back to reference Batliner, A., Steidl, S., Schuller, B., Seppi, D., Laskowski, K., Vogt, T., Devillers, L., Vidrascu, L., Amir, N., Kessous, L., & Aharonson, V. (2006). Combining efforts for improving automatic classification of emotional user states. In Proceedings of the Fifth Slovenian and First International Language Technologies Conference. Batliner, A., Steidl, S., Schuller, B., Seppi, D., Laskowski, K., Vogt, T., Devillers, L., Vidrascu, L., Amir, N., Kessous, L., & Aharonson, V. (2006). Combining efforts for improving automatic classification of emotional user states. In Proceedings of the Fifth Slovenian and First International Language Technologies Conference.
go back to reference Bellegarda, J. R. (2013). Data-driven analysis of emotion in text using latent affective folding and embedding. Computational Intelligence, 29(3), 506–526.MathSciNetCrossRef Bellegarda, J. R. (2013). Data-driven analysis of emotion in text using latent affective folding and embedding. Computational Intelligence, 29(3), 506–526.MathSciNetCrossRef
go back to reference Bitouk, D., Ragini, V., & Ani, N. (2010). Class-level spectral features for emotion recognition. Journal of Speech Communication, 52(7–8), 613–625.CrossRef Bitouk, D., Ragini, V., & Ani, N. (2010). Class-level spectral features for emotion recognition. Journal of Speech Communication, 52(7–8), 613–625.CrossRef
go back to reference Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C. C., Lammert, A. C., & Christensen, A. (2013). Toward automating a human behavioral coding system for married couples’ interactions using speech acoustic features. Speech communication, 55(1), 1–21.CrossRef Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C. C., Lammert, A. C., & Christensen, A. (2013). Toward automating a human behavioral coding system for married couples’ interactions using speech acoustic features. Speech communication, 55(1), 1–21.CrossRef
go back to reference Chang, K., Fisher, D., & Canny, J. (2011). AMMON: a speech analysis library for analyzing affect, stress, and mental health on mobile phones. In 2nd International Workshop on Sensing Applications on Mobile Phones. Chang, K., Fisher, D., & Canny, J. (2011). AMMON: a speech analysis library for analyzing affect, stress, and mental health on mobile phones. In 2nd International Workshop on Sensing Applications on Mobile Phones.
go back to reference Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357.MATH Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357.MATH
go back to reference Cowie, R., Douglas-Cowie, E., Savvidou, S., McMahon, E., Sawey, M., & Schröder, M. (2000). ‘FEELTRACE’: an instrument for recording perceived emotion in real time. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion. Cowie, R., Douglas-Cowie, E., Savvidou, S., McMahon, E., Sawey, M., & Schröder, M. (2000). ‘FEELTRACE’: an instrument for recording perceived emotion in real time. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion.
go back to reference Eskimez, S. E., Imade, K., Yang, N., Sturge-Appley, M., Duan, Z., & Heinzelman, W. (2016). Emotion classification: How does an automated system compare to naive human coders? In Acoustics, Speech and Signal Processing, Proceedings of the IEEE International Conference on. Eskimez, S. E., Imade, K., Yang, N., Sturge-Appley, M., Duan, Z., & Heinzelman, W. (2016). Emotion classification: How does an automated system compare to naive human coders? In Acoustics, Speech and Signal Processing, Proceedings of the IEEE International Conference on.
go back to reference Farrús, M., Ejarque, P., Temko, A., & Hernando, J. (2007). Histogram equalization in SVM multimodal person verification. In Proceedings of IAPR/IEEE International Conference on Biometrics. Farrús, M., Ejarque, P., Temko, A., & Hernando, J. (2007). Histogram equalization in SVM multimodal person verification. In Proceedings of IAPR/IEEE International Conference on Biometrics.
go back to reference Goudbeek, M., Goldman, J. P., & Scherer, K. R. (2009). Emotion dimensions and formant position. In INTERSPEECH (pp. 1575–1578). Goudbeek, M., Goldman, J. P., & Scherer, K. R. (2009). Emotion dimensions and formant position. In INTERSPEECH (pp. 1575–1578).
go back to reference Goyal, A., Riloff, E., Daumé III, H., & Gilbert, N. (2010). Toward plot units: automatic affect state analysis. In Proceedings of HLT/NAACL Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (CAET). Goyal, A., Riloff, E., Daumé III, H., & Gilbert, N. (2010). Toward plot units: automatic affect state analysis. In Proceedings of HLT/NAACL Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (CAET).
go back to reference Gupta P., & Rajput N. (2007). Two-stream emotion recognition for call center monitoring. In INTERSPEECH (pp. 2241–2244). Gupta P., & Rajput N. (2007). Two-stream emotion recognition for call center monitoring. In INTERSPEECH (pp. 2241–2244).
go back to reference Hoque, M., Yeasin, M., & Louwerse, M. (2006). Robust recognition of emotion from speech. Intelligent virtual agents (pp. 42–53). Berlin: Springer. Lecture notes in computer science.CrossRef Hoque, M., Yeasin, M., & Louwerse, M. (2006). Robust recognition of emotion from speech. Intelligent virtual agents (pp. 42–53). Berlin: Springer. Lecture notes in computer science.CrossRef
go back to reference Hsu, C.W., Chang ,C.C., & Lin, C.J. (2003). A practical guide to support vector classification. Hsu, C.W., Chang ,C.C., & Lin, C.J. (2003). A practical guide to support vector classification.
go back to reference Huang, X., Acero, A., & Hon, H. W. (2001). Spoken language processing. New Jersey: Prentice Hall PTR. Huang, X., Acero, A., & Hon, H. W. (2001). Spoken language processing. New Jersey: Prentice Hall PTR.
go back to reference Huisman, G., Van Hout, M., van Dijk, E., van der Geest, T., & Heylen, D. (2013). Lemtool—measuring emotions in visual interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Huisman, G., Van Hout, M., van Dijk, E., van der Geest, T., & Heylen, D. (2013). Lemtool—measuring emotions in visual interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
go back to reference Jong, N. H. D., & Wempe, T. (2007). Automatic measurement of speech rate in spoken Dutch. ACLC Working Papers, 2(2), 49–58. Jong, N. H. D., & Wempe, T. (2007). Automatic measurement of speech rate in spoken Dutch. ACLC Working Papers, 2(2), 49–58.
go back to reference Kawanami, H., Iwami, Y., Toda, T., Saruwatari, H., & Shikano, K. (2003). GMM-based voice conversion applied to emotional speech synthesis. In Proceedings of Eurospeech. Kawanami, H., Iwami, Y., Toda, T., Saruwatari, H., & Shikano, K. (2003). GMM-based voice conversion applied to emotional speech synthesis. In Proceedings of Eurospeech.
go back to reference Kerig, P., & Baucom, D. (2004). Couple observational coding systems. Abington: Routledge. Kerig, P., & Baucom, D. (2004). Couple observational coding systems. Abington: Routledge.
go back to reference Kwon, O.W., Chan, K., Hao, J., & Lee, T. W. (2003). Emotion recognition by speech signals. In EUROSPEECH. (pp. 125–128). Kwon, O.W., Chan, K., Hao, J., & Lee, T. W. (2003). Emotion recognition by speech signals. In EUROSPEECH. (pp. 125–128).
go back to reference Lee, C., & Lee, G. G. (2007). Emotion recognition for affective user interfaces using natural language dialogs. In Procceedings of IEEE International Symposium on Robot and Human interactive Communication. (pp. 798–801). Lee, C., & Lee, G. G. (2007). Emotion recognition for affective user interfaces using natural language dialogs. In Procceedings of IEEE International Symposium on Robot and Human interactive Communication. (pp. 798–801).
go back to reference Lee, C. M., & Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2), 293–303.CrossRef Lee, C. M., & Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2), 293–303.CrossRef
go back to reference Lee, C. M., Narayanan, S. S., & Pieraccini, R. (2002). Combining acoustic and language information for emotion recognition. In Proceeding of 7th International Conference on Spoken Language Processing. Lee, C. M., Narayanan, S. S., & Pieraccini, R. (2002). Combining acoustic and language information for emotion recognition. In Proceeding of 7th International Conference on Spoken Language Processing.
go back to reference Lee, L., & Rose, R. C. (1996). Speaker normalization using efficient frequency warping procedures. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 1 (pp. 353–356). Lee, L., & Rose, R. C. (1996). Speaker normalization using efficient frequency warping procedures. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 1 (pp. 353–356).
go back to reference Liberman, M., Davis, K., Grossman, M., Martey, N., & Bell, J. (2002). Emotional prosody speech and transcripts. Philadelphia: Linguistic Data Consortium (LDC). Liberman, M., Davis, K., Grossman, M., Martey, N., & Bell, J. (2002). Emotional prosody speech and transcripts. Philadelphia: Linguistic Data Consortium (LDC).
go back to reference Ling, C., Dong, M., Li, H., Yu, Z. L., & Chan, P. (2010). Machine learning methods in the application of speech emotion recognition. In Application of Machine Learning (pp. 1–19). Ling, C., Dong, M., Li, H., Yu, Z. L., & Chan, P. (2010). Machine learning methods in the application of speech emotion recognition. In Application of Machine Learning (pp. 1–19).
go back to reference Özkul, S., Bozkurt, E., Asta, S., Yemez, Y., & Erzin, E. (2012). Multimodal analysis of upper-body gestures, facial expressions and speech. In Procceedings of the 4th International Workshop on Corpora for Research on Emotion Sentiment and Social Signals. Özkul, S., Bozkurt, E., Asta, S., Yemez, Y., & Erzin, E. (2012). Multimodal analysis of upper-body gestures, facial expressions and speech. In Procceedings of the 4th International Workshop on Corpora for Research on Emotion Sentiment and Social Signals.
go back to reference Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Mchine Intelligence, 27(8), 1226–1238.CrossRef Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Mchine Intelligence, 27(8), 1226–1238.CrossRef
go back to reference Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods (pp. 185–208). Cambridge: MIT Press. Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods (pp. 185–208). Cambridge: MIT Press.
go back to reference Qin, L., Ling, Z. H., Wu, Y. J., Zhang, B. F., & Wang, R. H. (2006). Hmm-based emotional speech synthesis using average emotion model. In Procceedings of Chinese Spoken Language Processing (pp. 233–240). Qin, L., Ling, Z. H., Wu, Y. J., Zhang, B. F., & Wang, R. H. (2006). Hmm-based emotional speech synthesis using average emotion model. In Procceedings of Chinese Spoken Language Processing (pp. 233–240).
go back to reference Rachuri, K. K., Musolesi, M., Mascolo, C., Rentfrow, P. J., Longworth, C., & Aucinas, A. (2010). EmotionSense: a mobile phones based adaptive platform for experimental social psychology research. In Proceedings of the 12th ACM International Conference on Ubiquitous Computing (pp. 281–290). Rachuri, K. K., Musolesi, M., Mascolo, C., Rentfrow, P. J., Longworth, C., & Aucinas, A. (2010). EmotionSense: a mobile phones based adaptive platform for experimental social psychology research. In Proceedings of the 12th ACM International Conference on Ubiquitous Computing (pp. 281–290).
go back to reference Roberto, B. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.CrossRef Roberto, B. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.CrossRef
go back to reference Rong, J., Li, G., & Chen, Y. P. P. (2009). Acoustic feature selection for automatic emotion recognition from speech. Information Processing and Management, 45(3), 315–328.CrossRef Rong, J., Li, G., & Chen, Y. P. P. (2009). Acoustic feature selection for automatic emotion recognition from speech. Information Processing and Management, 45(3), 315–328.CrossRef
go back to reference Sauter, D. A., Eisner, F., Calder, A. J., & Scott, S. K. (2010). Perceptual cues in nonverbal vocal expressions of emotion, 63(11), 2251–2272. Sauter, D. A., Eisner, F., Calder, A. J., & Scott, S. K. (2010). Perceptual cues in nonverbal vocal expressions of emotion, 63(11), 2251–2272.
go back to reference Scherer, K. R. (2003). Vocal communication of emotion: a review of research paradigms. Speech Communication, 40(1–2), 227–256.CrossRefMATH Scherer, K. R. (2003). Vocal communication of emotion: a review of research paradigms. Speech Communication, 40(1–2), 227–256.CrossRefMATH
go back to reference Scherer, K. R. (2005). What are emotions? And how can they be measured? Social Science Information, 44(4), 695–729.MathSciNetCrossRef Scherer, K. R. (2005). What are emotions? And how can they be measured? Social Science Information, 44(4), 695–729.MathSciNetCrossRef
go back to reference Schuller, B., Rigoll, G., & Lang, M. (2003). Hidden markov model-based speech emotion recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 2 (p. 1). Schuller, B., Rigoll, G., & Lang, M. (2003). Hidden markov model-based speech emotion recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 2 (p. 1).
go back to reference Schuller, B., Rigoll, G., & Lang, M. (2004). Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 1 (p. 577). Schuller, B., Rigoll, G., & Lang, M. (2004). Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 1 (p. 577).
go back to reference Schuller, B., Vlasenko, B., Minguez, R., Rigoll, G., & Wendemuth, A. (2007). Comparing one and two-stage acoustic modeling in the recognition of emotion in speech. In: IEEE Workshop on Automatic Speech Recognition Understanding (pp. 596–600). Schuller, B., Vlasenko, B., Minguez, R., Rigoll, G., & Wendemuth, A. (2007). Comparing one and two-stage acoustic modeling in the recognition of emotion in speech. In: IEEE Workshop on Automatic Speech Recognition Understanding (pp. 596–600).
go back to reference Sethu, V., Ambikairajah, E., & Epps, J. (2008). Empirical mode decomposition based weighted frequency feature for speech-based emotion classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5017–5020). Sethu, V., Ambikairajah, E., & Epps, J. (2008). Empirical mode decomposition based weighted frequency feature for speech-based emotion classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5017–5020).
go back to reference Shafran, I. (2005). A comparison of classifiers for detecting emotion from speech. In IEEE International Conference on Acoustics, Speech and Signal Processing. Shafran, I. (2005). A comparison of classifiers for detecting emotion from speech. In IEEE International Conference on Acoustics, Speech and Signal Processing.
go back to reference Shrawankar, U., & Thakare, V.M. (2013). Adverse conditions and ASR techniques for robust speech user interface. arXiv preprint arXiv:13035515. Shrawankar, U., & Thakare, V.M. (2013). Adverse conditions and ASR techniques for robust speech user interface. arXiv preprint arXiv:​13035515.
go back to reference Steidl, S., Polzehl, T., Bunnell, H. T., Dou, Y., Muthukumar, P. K., Perry, D., Prahallad, K., Vaughn, C., Black, A. W., & Metze, F. (2012). Emotion identification for evaluation of synthesized emotional speech. In Procceedings of Speech Prosody. Steidl, S., Polzehl, T., Bunnell, H. T., Dou, Y., Muthukumar, P. K., Perry, D., Prahallad, K., Vaughn, C., Black, A. W., & Metze, F. (2012). Emotion identification for evaluation of synthesized emotional speech. In Procceedings of Speech Prosody.
go back to reference Tacconi, D., Mayora, O., Lukowicz, P., Arnrich, B., Setz, C., Troster, G., & Haring, C. (2008). Activity and emotion recognition to support early diagnosis of psychiatric diseases. In Pervasive Computing Technologies for Healthcare (PervasiveHealth), Second International Conference on (pp. 100–102). Tacconi, D., Mayora, O., Lukowicz, P., Arnrich, B., Setz, C., Troster, G., & Haring, C. (2008). Activity and emotion recognition to support early diagnosis of psychiatric diseases. In Pervasive Computing Technologies for Healthcare (PervasiveHealth), Second International Conference on (pp. 100–102).
go back to reference Tang, H., Chu, S. M., Hasegawa-Johnson, M., & Huang, T. S. (2009). Emotion recognition from speech via boosted Gaussian mixture models. In IEEE International Conference on Multimedia and Expo (ICME) (pp. 294–297). Tang, H., Chu, S. M., Hasegawa-Johnson, M., & Huang, T. S. (2009). Emotion recognition from speech via boosted Gaussian mixture models. In IEEE International Conference on Multimedia and Expo (ICME) (pp. 294–297).
go back to reference Vapnik, V. N. (1998). Statistical learning theory. New Jersey: Wiley.MATH Vapnik, V. N. (1998). Statistical learning theory. New Jersey: Wiley.MATH
go back to reference Vlasenko, B., Schuller, B., Wendemuth, A., & Rigoll, G. (2007). Combining frame and turn-level information for robust recognition of emotions within speech. In INTERSPEECH. (pp. 2249–2252). Vlasenko, B., Schuller, B., Wendemuth, A., & Rigoll, G. (2007). Combining frame and turn-level information for robust recognition of emotions within speech. In INTERSPEECH. (pp. 2249–2252).
go back to reference Wu, C. H., Kung, C., Lin, J. C., & Wei, W. L. (2013). Two-level hierarchical alignment for semi-coupled hmm-based audiovisual emotion recognition with temporal course. IEEE Transactions on Multimedia, 15(8), 1880–1895.CrossRef Wu, C. H., Kung, C., Lin, J. C., & Wei, W. L. (2013). Two-level hierarchical alignment for semi-coupled hmm-based audiovisual emotion recognition with temporal course. IEEE Transactions on Multimedia, 15(8), 1880–1895.CrossRef
go back to reference Wu, G., & Chang, E. Y. (2003). Class-boundary alignment for imbalanced dataset learning. In Workshop on Learning from Imbalanced Datasets II, ICML (pp. 49–56). Wu, G., & Chang, E. Y. (2003). Class-boundary alignment for imbalanced dataset learning. In Workshop on Learning from Imbalanced Datasets II, ICML (pp. 49–56).
go back to reference Wu, S., Falk, T. H., & Chan, W. Y. (2009). Automatic recognition of speech emotion using long-term spectro-temporal features. In Procceedings of the 16th International Conference on Digital Signal Processing. Wu, S., Falk, T. H., & Chan, W. Y. (2009). Automatic recognition of speech emotion using long-term spectro-temporal features. In Procceedings of the 16th International Conference on Digital Signal Processing.
go back to reference Xia, R., & Liu, Y. (2012). Using i-vector space model for emotion recognition. In Procceedings of Interspeech. Xia, R., & Liu, Y. (2012). Using i-vector space model for emotion recognition. In Procceedings of Interspeech.
go back to reference Yang, N., Ba, H., Cai, W., Demirkol, I., & Heinzelman, W. (2014). BaNa: a noise resilient fundamental frequency detection algorithm for speech and music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 1833–1848.CrossRef Yang, N., Ba, H., Cai, W., Demirkol, I., & Heinzelman, W. (2014). BaNa: a noise resilient fundamental frequency detection algorithm for speech and music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 1833–1848.CrossRef
go back to reference Yang, N., Muraleedharan, R., Kohl, J., Demirkol, I., Heinzelman, W., & Sturge-Apple, M. (2012). Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion. In Spoken Language Technology Workshop (SLT), 2012 IEEE (pp. 455–460). Yang, N., Muraleedharan, R., Kohl, J., Demirkol, I., Heinzelman, W., & Sturge-Apple, M. (2012). Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion. In Spoken Language Technology Workshop (SLT), 2012 IEEE (pp. 455–460).
go back to reference Yang, Y., Fairbairn, C., & Cohn, J. F. (2013). Detecting depression severity from vocal prosody. IEEE Transactions on Affective Computing, 4(2), 142–150.CrossRef Yang, Y., Fairbairn, C., & Cohn, J. F. (2013). Detecting depression severity from vocal prosody. IEEE Transactions on Affective Computing, 4(2), 142–150.CrossRef
go back to reference Yun, S., & Yoo, C. D. (2012). Loss-scaled large-margin gaussian mixture models for speech emotion classification. IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 585–598.CrossRef Yun, S., & Yoo, C. D. (2012). Loss-scaled large-margin gaussian mixture models for speech emotion classification. IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 585–598.CrossRef
go back to reference Zhang S., Zhao X., & Lei B. (2013). Speech emotion recognition using an enhanced kernel isomap for human-robot interaction. International Journal of Advanced Robotic Systems. doi:10.5772/55403. Zhang S., Zhao X., & Lei B. (2013). Speech emotion recognition using an enhanced kernel isomap for human-robot interaction. International Journal of Advanced Robotic Systems. doi:10.​5772/​55403.
Metadata

Title: Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification
Authors: Na Yang, Jianbo Yuan, Yun Zhou, Ilker Demirkol, Zhiyao Duan, Wendi Heinzelman, Melissa Sturge-Apple
Publication date: 28-10-2016
Publisher: Springer US
Published in: International Journal of Speech Technology, Issue 1/2017
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-016-9364-2
