Acoustic feature selection for automatic emotion recognition from speech

https://doi.org/10.1016/j.ipm.2008.09.003

Abstract

Emotional expression and understanding are normal instincts of human beings, but automatic emotion recognition from speech, without reference to any language or linguistic information, remains an open problem. The limited size of existing emotional speech data sets, combined with their relatively high dimensionality, has outstripped many dimensionality reduction and feature selection algorithms. This paper focuses on data preprocessing techniques that aim to extract the most effective acoustic features for improving the performance of emotion recognition. A novel algorithm is presented which can be applied to a small-sized data set with a high number of features. The algorithm integrates the advantages of a decision tree method and the random forest ensemble. Experimental results on a series of Chinese emotional speech data sets indicate that the presented algorithm achieves improved emotion recognition performance, outperforming the commonly used Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS) methods, as well as the more recently developed ISOMap dimensionality reduction method.

Introduction

Emotion recognition is a common instinct of human beings, and it has been studied by researchers from different disciplines for more than 70 years (Fairbanks and Pronovost, 1939, Fairbanks and Pronovost, 1941). Fairbanks and Pronovost's pioneering work on emotional speech (Fairbanks and Pronovost, 1939, Fairbanks and Pronovost, 1941) revealed the importance of vocal cues in the expression of emotion, and the powerful effects of vocal emotion expression on interpersonal interaction. Understanding the emotional state of the speaker during communication helps listeners catch more information than is conveyed by the content of the dialogue alone, and in particular to detect the 'real' meaning of the speech hidden between the words. The practical value of emotion recognition from speech is suggested by the rapidly growing number of areas to which it is being applied, such as humanoid robots, the car industry, call centers, etc. (Cowie and Cornelius, 2003, Lee and Narayanan, 2004, Lee et al., 2004, Pantic and Rothkrantz, 2003, Schuller et al., 2005).

Although machine learning and data mining techniques have found flourishing applications (Mitchell, 1997), only a few works have utilized these powerful tools to improve the performance of emotion recognition from speech. A serious encumbrance here is the scarcity of available emotional speech data: only a few public benchmark databases are available for research purposes.

A sufficient number of training examples is a premise for most machine learning and data mining algorithms to work well. When there are only a few training examples, overfitting is very likely: a model can be trained to perfect performance on the training set, yet hardly generalize to new examples. In practice, how many training examples are adequate is task-dependent; for the task of learning the XOR function, for example, four distinct training examples are sufficient, while for more complex tasks such as emotion recognition, thousands of training examples might still be insufficient. In general, if a data set cannot fully cover the whole variable space, it is referred to as a small data set. In this sense, the data sets collected for emotion recognition are small, because the typical data set size is less than 1000 examples while the number of features is close to 100. Such data scarcity usually outstrips many machine learning and data mining algorithms (Vapnik, 1995).
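As a minimal sketch of this failure mode (the synthetic random data and the fully grown decision tree below are assumptions purely for illustration, not from the paper), a flexible model fit to a few hundred examples with 100 features scores perfectly on its training set yet only at chance on held-out data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# ~1000 examples, ~100 features, mirroring the sizes quoted above;
# the labels are random, so there is no true structure to learn.
X = rng.normal(size=(1000, 100))
y = rng.integers(0, 2, size=1000)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_train, y_train))  # 1.0 -- the tree memorizes the sample
print(model.score(X_test, y_test))    # ~0.5 -- chance level on unseen data
```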

There are two obvious ways to overcome the problem of data scarcity: one is to collect more data; the other is to design techniques that can cope with small data sets. Considering that further data collection is manual, cost-intensive, and hard to achieve, the second way is more feasible and desirable. Based on this observation, this paper presents a novel feature selection algorithm, ERFTrees, to extract effective features from small data sets. Using this algorithm for emotion recognition brings two benefits: first, irrelevant features can be removed and the dimensionality of the training data reduced; second, with the reduced data set, most existing machine learning algorithms, which do not work well on small data sets, can produce better recognition accuracy. The empirical results on Chinese (Mandarin) emotional data sets indicate that the presented algorithm outperforms other linear and non-linear dimensionality reduction methods, including Principal Component Analysis (PCA), Multi-Dimensional Scaling (MDS), and ISOMap.

The rest of the paper is organized as follows: we introduce the background and the related work in Section 2. The algorithm, ERFTrees, is presented in Section 3. The experimental design and empirical results are presented in Section 4, and finally, in Section 5, we conclude the paper with an outlook on possible future work.

Section snippets

Theory of human emotions

Constructing an automatic emotion recognizer depends on a sense of what emotion is. Most people have an informal understanding, but there is a formal research tradition which has probed the nature of emotion systematically. It has been shaped by major figures in several disciplines – philosophy, biology, and psychology – but conventionally it is called the ‘psychological tradition’.

In psychology, the theories of emotion are grouped into four main traditions, each making different basic

The feature selection algorithm: ERFTrees

From the last section, we know that a data preparation step is important for the performance of emotion recognition algorithms. However, the small size of emotion data samples, usually with tens of dimensions, has outstripped the capability of many existing feature selection algorithms, which require adequate samples. To address this challenge, a novel method, called Ensemble Random Forest to Trees (ERFTrees), is introduced to perform feature selection by integrating the random forest
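The full ERFTrees procedure is only hinted at in this snippet. As a rough illustration of its random-forest ingredient (not the authors' algorithm itself), the sketch below ranks features by ensemble importance and keeps the strongest ones; the helper name select_features and the top_k threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, top_k=20, n_trees=500, seed=0):
    """Keep the top_k features ranked by random-forest importance.

    Illustrative sketch only; ERFTrees combines the forest with a
    decision tree step that is described in the paper.
    """
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # feature_importances_ averages each feature's impurity decrease
    # over all trees in the ensemble.
    order = np.argsort(forest.feature_importances_)[::-1]
    keep = order[:top_k]
    return X[:, keep], keep
```

Downstream classifiers would then be trained on the reduced matrix returned by such a helper, which is where the second benefit described in the introduction comes in.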

Experiment design and result analysis

In this section, we evaluate the performance of the presented feature selection algorithm against other common methods. Reflecting the common sources of emotional speech data, the data used in this work cover two kinds of speech corpora: (1) acted speech corpora and (2) natural speech corpora. The language spoken in all corpora is Chinese (Mandarin).
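As a hedged sketch of the kind of baseline comparison reported here (the target dimensionality, the kNN classifier, and the cross-validation protocol are assumptions for illustration, not the paper's exact setup), the PCA, MDS, and ISOMap reductions can be scored against a common classifier as follows:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def compare_reductions(X, y, n_dims=10):
    """Score one classifier on three low-dimensional embeddings of X."""
    reducers = {
        "PCA": PCA(n_components=n_dims),
        "MDS": MDS(n_components=n_dims),
        "ISOMap": Isomap(n_components=n_dims),
    }
    for name, reducer in reducers.items():
        # MDS has no out-of-sample transform, so all three methods are
        # applied with fit_transform on the full feature matrix.
        X_low = reducer.fit_transform(X)
        score = cross_val_score(KNeighborsClassifier(), X_low, y, cv=5).mean()
        print(f"{name}: mean CV accuracy = {score:.3f}")
```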

Conclusion

There is no doubt that introducing advanced machine learning techniques into emotion recognition is beneficial. However, since collecting a large set of emotional speech samples is time-consuming and labor-intensive, machine learning algorithms that can work with only a small number of training examples are desirable. In this paper, the Ensemble Random Forest to Trees (ERFTrees) algorithm is presented, which can extract effective features from small data sets.

The small size of the training data

Acknowledgements

This work was partially supported by Deakin CRGS grant 2008. The authors would like to thank Sam Schmidt for proofreading the English of the manuscript.

References (67)

  • R. Cowie et al. Describing the emotional states that are expressed in speech.
  • R. Kohavi et al. Wrappers for feature subset selection. Artificial Intelligence (1997).
  • Amir, N. (2001). Classifying emotions in speech: A comparison of methods. In Proceedings of European conference on...
  • R. Banse et al. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology (1996).
  • R.V. Bezooijen. Characteristics and recognizability of vocal expressions of emotions (1984).
  • Bhatti, M. W., Wang, Y., & Guan, L. (2004). A neural network approach for human emotion recognition in speech. In...
  • L. Breiman. Random forests. Machine Learning (2001).
  • Cai, L., Jiang, C., Wang, Z., Zhao, L., & Zou, C. (2003). A method combining the global and time series structure...
  • Chuang, Z.-J., & Wu, C.-H. (2004). Emotion recognition using acoustic features and textual content. In Proceedings of...
  • R. Coleman et al. Identification of emotional states using perceptual and acoustic analyses.
  • R.R. Cornelius. The science of emotion: Research and tradition in the psychology of emotion (1996).
  • J.R. Davitz. Personality, perceptual, and cognitive correlates of emotional sensitivity. The Communication of Emotional Meaning (1964).
  • Dellaert, F., Polzin, T., & Waibel, A. (1996). Recognizing emotion in speech. In Proceedings of fourth international...
  • G. Fairbanks et al. An experimental study of the pitch characteristics of the voice during the expression of emotion. Speech Monographs (1939).
  • G. Fairbanks et al. An experimental study of the durational characteristics of the voice during the expression of emotion. Speech Monographs (1941).
  • I. Fónagy. A new method of investigating the perception of prosodic features. Language and Speech (1978).
  • I. Fónagy. Emotions, voice and music. Language and Speech (1978).
  • I. Fónagy et al. Emotional patterns in intonation and music. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung (1963).
  • J. Han et al. Data mining: Concepts and techniques (2000).
  • W. Hargreaves et al. Voice quality in depression. Journal of Abnormal Psychology (1965).
  • Z. Havrdova et al. Changes of the voice expression during suggestively influenced states of experiencing. Activitas Nervosa Superior (1979).
  • Hoch, S., Althoff, F., McGlaun, G., & Rigoll, G. (2005). Bimodal fusion of emotional data in an automotive environment....
  • W.L. Höffe. On the relation between speech melody and intensity. Phonetica (1960).
  • G.L. Huttar. Relations between prosodic variables and emotions in normal American English utterances. Journal of the Acoustical Society of America (1967).
  • Hyun, K. H., Kim, E. H., & Kwak, Y. K. (2005). Improvement of emotion recognition by Bayesian classifier using...
  • A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys (1999).
  • Inanoglu, Z., & Caneel, R. (2005). Emotive alert: HMM-based emotion detection in voicemail messages. In Proceedings of...
  • W. Johnson et al. Recognition of emotion from vocal cues. Archives of General Psychiatry (1986).
  • L. Kaiser. Communication of affects by single vowels. Synthese (1962).
  • Klasmeyer, G., & Sendlmeier, W. F. (1995). Objective voice parameters to characterize the emotional content in speech....
  • G. Kotlyar et al. Acoustic correlates of the emotional content of vocalized speech. Journal of Acoustical Academy of Sciences of the USSR (1976).
  • Kwon, O.-W., Chan, K., Hao, J., & Lee, T.-W. (2003). Emotion recognition by speech signal. In Proceedings of the eighth...
  • J. Lattin et al. Analyzing multivariate data (2003).