Acoustic feature selection and classification of emotions in speech using a 3D continuous emotion model

https://doi.org/10.1016/j.bspc.2011.02.008

Abstract

In this paper we report the results of experiments with a database of emotional speech in English, aimed at finding the most important acoustic features for estimating the Emotion Primitives that determine the emotional content of speech. We are interested in exploiting the potential benefits of continuous emotion models, so we demonstrate the feasibility of applying this approach to the annotation of emotional speech and explore ways to take advantage of this kind of annotation to improve the automatic classification of basic emotions.

Introduction

Emotions are very important in our everyday life; they are present in everything we do. There is a continuous interaction between emotions, behavior and thoughts, in such a way that they constantly influence each other. Emotions are a great source of information in communication and interaction among people, and they are assimilated intuitively.

The applications of emotion recognition encompass many fields, for example, as a supporting technology in medical areas such as psychology, neurology and the care of elderly and impaired people. Automatic emotion recognition based on biomedical signals and facial and vocal expressions has been applied to the diagnosis and follow-up of progressive neurological disorders, specifically Huntington's and Parkinson's diseases [1]. These pathologies are characterized by a deficit in the emotional processing of fear and disgust; thus, such a system could be used to determine a subject's reaction, or absence of reaction, to specific emotions, helping health professionals gain a better understanding of these disorders. Furthermore, the system could serve as a reference to evaluate patients' response to certain medicines. Another important medical application is remote medical support. These kinds of environments enable communication between medical professionals and patients for regular monitoring as well as emergency situations. In this scenario, if the system recognizes the patient's emotional state and transmits data indicating that the patient is experiencing depression or sadness, the health-care providers monitoring them will be better prepared to respond. Such a system has the potential to improve patient satisfaction and health [2], [3], [4]. It can also be an asset for disabled people who have difficulties with communication. Hearing-impaired people who are not profoundly deaf can use residual hearing to communicate with other people and learn to speak with emotion, making communication more complete and understandable. In these cases, emotion recognition engines can be used as an important element of a computer-assisted emotional speech training system [5], providing an easier way to learn how to speak with emotion more naturally, or helping speech therapists guide them to express emotions in speech correctly.

Emotion recognition also arouses great interest in interface design, given that recognizing and understanding emotions automatically is one of the key steps towards emotional intelligence in Human–Computer Interaction (HCI). The need for automatic emotion recognition has emerged from the tendency towards a more natural interaction between humans and computers. Affective computing is a topic within HCI that encompasses this research trend, aiming to endow computers with the ability to detect, recognize, model and take into account the user's emotional state, which plays a role of paramount importance in the way humans make decisions [6]. Emotions are essential to the human thought processes that shape interactions between people and intelligent systems.

In the area of automatic emotion recognition, two main annotation schemes have been used to capture and describe the emotional content of speech: the discrete and the continuous approach. The discrete approach is based on the concept of basic emotions, such as anger, joy and sadness, which are the most intense forms of emotion and from which all other emotions are generated by variation or combination. It assumes the existence of universal emotions that can be clearly distinguished from one another by most people. The continuous approach, on the other hand, represents emotional states using a continuous multidimensional space. Emotions are represented by regions in an n-dimensional space where each dimension corresponds to an Emotion Primitive. Emotion Primitives are subjective properties shared by all emotions. The most widely accepted primitives are Arousal and Valence. Valence describes how negative or positive a specific emotion is. Arousal, also called Activation, describes the internal excitement of an individual and ranges from very calm to very active. Three-dimensional models have also been proposed. The most common three-dimensional model includes Valence and Activation as well as Dominance [7]. This additional primitive describes the degree of control that the individual intends to exert over the situation, or in other words, how strong or weak the individual seems to be.

Both approaches, discrete and continuous, provide complementary information about the emotional manifestations observed in individuals [8]. Discrete categorization allows a more particularized representation of emotions in applications where a predefined set of emotions must be recognized. However, this approach ignores most of the spectrum of human emotional expression. In contrast, continuous models allow the representation of any emotional state, including subtle emotions, but it has proven difficult to estimate the Emotion Primitives with high precision from acoustic information alone. Some authors have begun to investigate how to take advantage of this theory [9], [10], [11] to estimate the emotional content of speech more adequately. Both approaches are closely related: by assessing the emotional content of speech using one of these two schemes we can infer its counterpart in the other. For instance, if an utterance is evaluated as anger, we may infer that it would have a low value for Valence and high values for Activation and Dominance. Conversely, if an utterance is evaluated with low Valence and high Activation and Dominance, we could infer that it expresses anger. Several authors [12], [13], [14] have analyzed the most important acoustic features from the point of view of discrete categorization; however, the importance of acoustic attributes has not yet been studied with the same depth from the point of view of continuous models. We believe that the continuous approach has great potential to model the occurrence of emotions in the real world, and this three-dimensional continuous model is adopted in this paper. As a first step towards exploiting the continuous approach, we analyze the most important acoustic features for automatically estimating Emotion Primitives in speech. This estimation can then be used to locate the individual's emotional state in the multidimensional space and, if necessary, to map it to a basic emotion. In this work we apply these ideas to improve automatic emotion recognition from acoustic information. We perform experiments to find the most important acoustic features for estimating Emotion Primitives in speech, and we then propose a method that uses these estimations to determine the individual's emotional state by mapping it to a basic emotion.

The remainder of this paper is organized as follows. Section 2 describes the database. Section 3 describes the acoustic features and how they are extracted from the speech signals. The filters applied to select the best instances are explained in Section 4. In Section 5 we propose and describe two ways of finding the best acoustic features for Emotion Primitive estimation. Experimental results and a discussion of feature selection are provided in Section 6. In Section 7 we propose a way of applying the automatic Emotion Primitive estimation to classify basic emotions. In Section 8 we present and discuss the results obtained from basic emotion classification. Finally, conclusions and future work are discussed in Section 9.

Section snippets

Database

For the purposes of this work it is necessary to have a database labeled with Emotion Primitives, namely Valence, Activation and Dominance. In addition, we need basic emotion annotations for each sample in order to validate the accuracy of the Emotion Primitive estimation by evaluating the mapping from the continuous to the discrete approach. There are many databases labeled with emotional categories, such as FAU Aibo [15], the Berlin Database of Emotional Speech [16] and Spanish Emotional Speech [17], and a few that are

Features extraction

We extracted acoustic features from the speech signal using two programs for acoustic analysis: Praat [19] and OpenEar [20]. We evaluated two sets of features. One of them was obtained through a selective approach, i.e., based on a study taking into account the features that could be useful, the features that have been successful in related work, and features used for other similar tasks. The second feature set was obtained by applying a brute-force approach, i.e., generating a large amount of
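As a rough illustration of this kind of feature extraction, the sketch below computes a few utterance-level statistics of common low-level descriptors (pitch, energy, MFCCs). It uses the Python library librosa purely as a stand-in; the paper itself relies on Praat and OpenEar, and the descriptors, pitch range and functionals shown here are illustrative assumptions, not the paper's actual feature set.

    # Illustrative sketch only: librosa is used as a stand-in for Praat/OpenEar.
    import numpy as np
    import librosa

    def extract_features(wav_path):
        y, sr = librosa.load(wav_path, sr=None)

        # Frame-level low-level descriptors.
        f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # pitch contour
        energy = librosa.feature.rms(y=y)[0]                            # short-term energy
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # spectral envelope

        # Utterance-level functionals (mean / standard deviation), as commonly
        # used in selective feature sets for emotion recognition.
        feats = {}
        f0_voiced = f0[~np.isnan(f0)]
        feats["f0_mean"] = float(np.mean(f0_voiced)) if f0_voiced.size else 0.0
        feats["f0_std"] = float(np.std(f0_voiced)) if f0_voiced.size else 0.0
        feats["energy_mean"] = float(np.mean(energy))
        feats["energy_std"] = float(np.std(energy))
        for i, row in enumerate(mfcc):
            feats[f"mfcc{i+1}_mean"] = float(np.mean(row))
            feats[f"mfcc{i+1}_std"] = float(np.std(row))
        return feats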

Instance selection

Inspecting the database instances, we realized that some of them were problematic; we expected that our machine learning algorithms would perform better if we selected the most appropriate and congruent instances representing the properties of the continuous and discrete approaches. Our initial data set consists of 1820 instances. We worked only with the most represented classes, so we first selected the instances from the four classes with the most examples. We chose the
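A minimal sketch of this class-frequency filter, assuming the data are held as (features, emotion_label) pairs (a hypothetical representation, not the paper's actual data structures):

    from collections import Counter

    def keep_most_frequent_classes(instances, n_classes=4):
        """Keep only the instances belonging to the n most represented emotion classes."""
        counts = Counter(label for _, label in instances)
        top = {label for label, _ in counts.most_common(n_classes)}
        return [(x, label) for x, label in instances if label in top]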

Feature selection

The need to find the best feature subsets for building our learning models arises from the low correlation obtained when estimating the primitives with a model trained on the full set of 6920 acoustic features. The correlation coefficient measures the quality of the estimated variable by determining the strength and direction of the linear relationship between the estimated and the actual values of the variable. The closer the coefficient is to either −1 or 1, the stronger the correlation
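For reference, the coefficient in question is the Pearson correlation between the estimated values and the annotated values of a primitive (stated here as the standard definition, an assumption about the exact formula used):

    r = \frac{\sum_i (\hat{y}_i - \bar{\hat{y}})\,(y_i - \bar{y})}
             {\sqrt{\sum_i (\hat{y}_i - \bar{\hat{y}})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}}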

Feature selection results

All the results of the learning experiments in this paper were obtained using Support Vector Machines (SVM) and validated by 10-Fold Cross-Validation.

The metrics used to measure the importance of feature groups are the correlation coefficient, share and portion. The correlation coefficient is the most common parameter for measuring the performance of machine learning algorithms on regression tasks, as in our case. We also use share and portion, measures proposed in [14], to assess the impact of
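A minimal sketch of this evaluation loop, assuming scikit-learn's SVR as a stand-in for the SVM regressor used in the paper (the kernel and preprocessing are assumptions); X is the utterance-by-feature matrix and y the annotated values of one Emotion Primitive, and the cross-validated predictions are scored with the correlation coefficient:

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_predict

    def primitive_correlation(X, y):
        """Return the 10-fold cross-validated correlation coefficient for one primitive."""
        model = make_pipeline(StandardScaler(), SVR(kernel="linear"))
        y_pred = cross_val_predict(model, X, y, cv=10)
        r, _ = pearsonr(y, y_pred)
        return r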

Basic emotion classification based on Emotion Primitives

Having identified the most relevant acoustic features for estimating the Emotion Primitives, the next step is to devise a way to use these estimations to discriminate emotional states in people. In this section we aim to strengthen the findings of the feature study by demonstrating that the continuous approach can actually help to improve automatic emotion classification from acoustic features based on a continuous emotion model. Section 2 describes the database we are working on.
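One plausible way to perform such a mapping, sketched below under the assumption of a nearest-centroid rule in (Valence, Activation, Dominance) space, is to compute one centroid per basic emotion from the annotated data and assign new utterances to the closest centroid; the paper's actual mapping procedure may differ:

    import numpy as np

    def emotion_centroids(primitives, labels):
        """Compute one (V, A, D) centroid per basic emotion from annotated data."""
        labels = np.asarray(labels)
        return {e: primitives[labels == e].mean(axis=0) for e in set(labels)}

    def classify_emotion(vad_estimate, centroids):
        """Assign the basic emotion whose centroid is closest to the estimated primitives."""
        return min(centroids, key=lambda e: np.linalg.norm(vad_estimate - centroids[e]))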

Basic emotions classification results

Table 10 shows the results when classifying Emotion Primitives into high and low classes, as described in step 3 of the classification process in Section 7.
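As an illustration of what such a high/low split might look like, the sketch below binarizes primitive estimates at the training-set median; the actual threshold used in the paper may differ:

    import numpy as np

    def binarize_primitive(estimates, threshold=None):
        """Map continuous primitive estimates to 'high'/'low' classes."""
        estimates = np.asarray(estimates, dtype=float)
        if threshold is None:
            threshold = np.median(estimates)  # assumed cut-off, for illustration only
        return np.where(estimates >= threshold, "high", "low")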

Table 11 shows a comparison of the results obtained by applying the process described in Fig. 4, using a discrete approach for the classification of emotions, with the baseline results reported in [37] for the corpus we are working with. We can see that it achieves a better recall (number of correctly classified

Conclusions

We carried out a study of the importance of different acoustic feature types from the point of view of a continuous three-dimensional emotion model. We analyzed each Emotion Primitive separately. By identifying the best features, the automatic estimation of Emotion Primitives becomes more accurate and thus the recognition and classification of people's emotional states improves. To our knowledge, the importance of acoustic features has not been studied from this perspective before. We have

Acknowledgements

The authors wish to express their gratitude to the National Council of Science and Technology of Mexico for the support given to carry out this research through postgraduate scholarship 49296 and project 106013.

References (40)

  • P. Pudil et al., Floating search methods in feature selection, Pattern Recognition Letters (1994)
  • C. Vera-Muñoz et al., A Wearable EMG Monitoring System for Emotions Assessment (2008)
  • G. González, Bilingual computer-assisted psychological assessment: an innovative approach for screening depression in...
  • F. Nasoz et al., Emotion recognition from physiological signals using wireless sensors for presence technologies, Cognition, Technology & Work (2004)
  • L. Vidrascu et al., Real-life emotion representation and detection in call centers data
  • M.S. Hussain et al., A framework for multimodal affect recognition
  • M. Murugappan et al.
  • H. Schlosberg, Three dimensions of emotion, Psychological Review (1954)
  • C. Busso et al., IEMOCAP: Interactive Emotional Dyadic Motion Capture Database, Journal of Language Resources and Evaluation (2008)
  • M. Lugger et al., Cascaded emotion classification via psychological emotion dimensions using a large set of voice quality parameters
  • M. Wollmer et al., Data-driven clustering in emotional space for affect recognition using discriminatively trained LSTM networks
  • F. Eyben et al., On-line emotion recognition in a 3-D activation–valence–time continuum using acoustic and linguistic cues, Journal on Multimodal User Interfaces (2010)
  • B. Xie et al., Statistical feature selection for Mandarin speech emotion recognition, Lecture Notes in Computer Science (2005)
  • M. Lugger et al., An incremental analysis of different feature groups in speaker independent emotion recognition
  • A. Batliner et al., Whodunnit – searching for the most important feature types signalling emotion-related user states in speech, Computer Speech and Language (2010)
  • S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech (2009)
  • F. Burkhardt et al., A database of German emotional speech
  • J. Montero, Estrategias para la mejora de la naturalidad y la incorporación de variedad emocional a la conversión texto...
  • M. Grimm et al., The Vera am Mittag German audio-visual emotional speech database
  • P. Boersma, Praat, a system for doing phonetics by computer, Glot International (2001)