Acoustic feature selection for automatic emotion recognition from speech

https://doi.org/10.1016/j.ipm.2008.09.003

Abstract

Emotional expression and understanding are normal instincts of human beings, but automatic emotion recognition from speech, without reference to any language or linguistic information, remains an open problem. The limited size of existing emotional speech data sets, combined with their relatively high dimensionality, has outstripped many dimensionality reduction and feature selection algorithms. This paper focuses on data preprocessing techniques that aim to extract the most effective acoustic features for improving the performance of emotion recognition. A novel algorithm is presented which can be applied to a small-sized data set with a high number of features. The algorithm integrates the advantages of a decision tree method and the random forest ensemble. Experimental results on a series of Chinese emotional speech data sets indicate that the presented algorithm achieves improved emotion recognition performance, outperforming the commonly used Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS) methods, as well as the more recently developed ISOMap dimensionality reduction method.

Introduction

Emotion recognition is a common instinct of human beings, and it has been studied by researchers from different disciplines for more than 70 years (Fairbanks and Pronovost, 1939, Fairbanks and Pronovost, 1941). Fairbanks and Pronovost's pioneering work on emotional speech (Fairbanks and Pronovost, 1939, Fairbanks and Pronovost, 1941) revealed the importance of vocal cues in the expression of emotion, and the powerful effects of vocal emotion expression on interpersonal interaction. Understanding the emotional state of the speaker during communication helps listeners catch more information than is conveyed by the content of the dialogue alone, and in particular to detect the 'real' meaning of the speech hidden between the words. The practical value of emotion recognition from speech is suggested by the rapidly growing number of areas to which it is being applied, such as humanoid robots, the car industry, call centers, etc. (Cowie and Cornelius, 2003, Lee and Narayanan, 2004, Lee et al., 2004, Pantic and Rothkrantz, 2003, Schuller et al., 2005).

Although machine learning and data mining techniques have found flourishing applications (Mitchell, 1997), only a few works have utilized these powerful tools to improve the performance of emotion recognition from speech. A serious encumbrance here is the scarcity of available emotional speech data: only a few public benchmark databases are available for research purposes.

A sufficient number of training examples is a premise for most machine learning and data mining algorithms to work well. When there are only a few training examples, overfitting is very likely: a model can be trained to perfect performance on the training set, yet hardly generalize to new examples. In practice, how many training examples are adequate is task-dependent; for the task of learning the XOR function, for example, four distinct training examples are sufficient, while for more complex tasks such as emotion recognition, thousands of training examples might still be insufficient. In general, if a data set cannot fully cover the whole variable space, it is referred to as a small data set. In this sense, the data sets collected for emotion recognition are small, because the typical data set size is less than 1000 examples while the number of features is close to 100. Such data scarcity usually outstrips many machine learning and data mining algorithms (Vapnik, 1995).
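As a minimal sketch of this failure mode (the synthetic random data and the fully grown decision tree below are assumptions purely for illustration, not from the paper), a flexible model fit to a few hundred examples with 100 features scores perfectly on its training set yet only at chance on held-out data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# ~1000 examples, ~100 features, mirroring the sizes quoted above;
# the labels are random, so there is no true structure to learn.
X = rng.normal(size=(1000, 100))
y = rng.integers(0, 2, size=1000)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_train, y_train))  # 1.0 -- the tree memorizes the sample
print(model.score(X_test, y_test))    # ~0.5 -- chance level on unseen data
```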

There are two obvious ways to overcome the problem of data scarcity: one is to collect more data; the other is to design techniques that can cope with small data sets. Considering that further data collection is manual, cost-intensive, and hard to achieve, the second way is more feasible and desirable. Based on this observation, this paper presents a novel feature selection algorithm, ERFTrees, to extract effective features from small data sets. Using this algorithm for emotion recognition brings two benefits: first, irrelevant features can be removed and the dimensionality of the training data reduced; second, with the reduced data set, most existing machine learning algorithms, which do not work well on small data sets, can produce better recognition accuracy. The empirical results on Chinese (Mandarin) emotional data sets indicate that the presented algorithm outperforms other linear and non-linear dimensionality reduction methods, including Principal Component Analysis (PCA), Multi-Dimensional Scaling (MDS), and ISOMap.

The rest of the paper is organized as follows: we introduce the background and the related work in Section 2. The algorithm, ERFTrees, is presented in Section 3. The experimental design and empirical results are presented in Section 4, and finally, in Section 5, we conclude the paper with an outlook on possible future work.

Section snippets

Theory of human emotions

Constructing an automatic emotion recognizer depends on a sense of what emotion is. Most people have an informal understanding, but there is a formal research tradition which has probed the nature of emotion systematically. It has been shaped by major figures in several disciplines – philosophy, biology, and psychology – but conventionally it is called the ‘psychological tradition’.

In psychology, the theories of emotion are grouped into four main traditions, each making different basic

The feature selection algorithm: ERFTrees

From the last section, we know that a data preparation step is important for the performance of emotion recognition algorithms. However, the small size of emotion data samples, usually with tens of dimensions, has outstripped the capability of many existing feature selection algorithms, which require adequate samples. To address this challenge, a novel method, called Ensemble Random Forest to Trees (ERFTrees), is introduced to perform feature selection by integrating the random forest
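The full ERFTrees procedure is only hinted at in this snippet. As a rough illustration of its random-forest ingredient (not the authors' algorithm itself), the sketch below ranks features by ensemble importance and keeps the strongest ones; the helper name select_features and the top_k threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, top_k=20, n_trees=500, seed=0):
    """Keep the top_k features ranked by random-forest importance.

    Illustrative sketch only; ERFTrees combines the forest with a
    decision tree step that is described in the paper.
    """
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # feature_importances_ averages each feature's impurity decrease
    # over all trees in the ensemble.
    order = np.argsort(forest.feature_importances_)[::-1]
    keep = order[:top_k]
    return X[:, keep], keep
```

Downstream classifiers would then be trained on the reduced matrix returned by such a helper, which is where the second benefit described in the introduction comes in.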

Experiment design and result analysis

In this section, we evaluate the performance of the presented feature selection algorithm against other common methods. Reflecting the common sources of emotional speech data, the data used in this work cover two kinds of speech corpora: (1) acted speech corpora and (2) natural speech corpora. The language spoken in all corpora is Chinese (Mandarin).
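As a hedged sketch of the kind of baseline comparison reported here (the target dimensionality, the kNN classifier, and the cross-validation protocol are assumptions for illustration, not the paper's exact setup), the PCA, MDS, and ISOMap reductions can be scored against a common classifier as follows:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def compare_reductions(X, y, n_dims=10):
    """Score one classifier on three low-dimensional embeddings of X."""
    reducers = {
        "PCA": PCA(n_components=n_dims),
        "MDS": MDS(n_components=n_dims),
        "ISOMap": Isomap(n_components=n_dims),
    }
    for name, reducer in reducers.items():
        # MDS has no out-of-sample transform, so all three methods are
        # applied with fit_transform on the full feature matrix.
        X_low = reducer.fit_transform(X)
        score = cross_val_score(KNeighborsClassifier(), X_low, y, cv=5).mean()
        print(f"{name}: mean CV accuracy = {score:.3f}")
```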

Conclusion

There is no doubt that introducing advanced machine learning techniques into emotion recognition is beneficial. However, since collecting a large set of emotional speech samples is time-consuming and labor-intensive, machine learning algorithms that can work with only a small number of training examples are desirable. In this paper, the Ensemble Random Forest to Trees (ERFTrees) algorithm is presented, which can extract effective features from small data sets.

The small size of the training data

Acknowledgements

This work was partially supported by Deakin CRGS grant 2008. The authors would like to thank Sam Schmidt for proofreading the English of the manuscript.

References (67)

  • R. Cowie et al. Describing the emotional states that are expressed in speech.
  • R. Kohavi et al. Wrappers for feature subset selection. Artificial Intelligence (1997).
  • Amir, N. (2001). Classifying emotions in speech: A comparison of methods. In Proceedings of European conference on...
  • R. Banse et al. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology (1996).
  • R.V. Bezooijen. Characteristics and recognizability of vocal expressions of emotions (1984).
  • Bhatti, M. W., Wang, Y., & Guan, L. (2004). A neural network approach for human emotion recognition in speech. In...
  • L. Breiman. Random forests. Machine Learning (2001).
  • Cai, L., Jiang, C., Wang, Z., Zhao, L., & Zou, C. (2003). A method combining the global and time series structure...
  • Chuang, Z.-J., & Wu, C.-H. (2004). Emotion recognition using acoustic features and textual content. In Proceedings of...
  • R. Coleman et al. Identification of emotional states using perceptual and acoustic analyses.
  • R.R. Cornelius. The science of emotion: Research and tradition in the psychology of emotion (1996).
  • J.R. Davitz. Personality, perceptual, and cognitive correlates of emotional sensitivity. The Communication of Emotional Meaning (1964).
  • Dellaert, F., Polzin, T., & Waibel, A. (1996). Recognizing emotion in speech. In Proceedings of fourth international...
  • G. Fairbanks et al. An experimental study of the pitch characteristics of the voice during the expression of emotion. Speech Monographs (1939).
  • G. Fairbanks et al. An experimental study of the durational characteristics of the voice during the expression of emotion. Speech Monographs (1941).
  • I. Fónagy. A new method of investigating the perception of prosodic features. Language and Speech (1978).
  • I. Fónagy. Emotions, voice and music. Language and Speech (1978).
  • I. Fónagy et al. Emotional patterns in intonation and music. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung (1963).
  • J. Han et al. Data mining: Concepts and techniques (2000).
  • W. Hargreaves et al. Voice quality in depression. Journal of Abnormal Psychology (1965).
  • Z. Havrdova et al. Changes of the voice expression during suggestively influenced states of experiencing. Activitas Nervosa Superior (1979).
  • Hoch, S., Althoff, F., McGlaun, G., & Rigoll, G. (2005). Bimodal fusion of emotional data in an automotive environment....
  • W.L. Höffe. On the relation between speech melody and intensity. Phonetica (1960).
  • G.L. Huttar. Relations between prosodic variables and emotions in normal American English utterances. Journal of the Acoustical Society of America (1967).
  • Hyun, K. H., Kim, E. H., & Kwak, Y. K. (2005). Improvement of emotion recognition by Bayesian classifier using...
  • A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys (1999).
  • Inanoglu, Z., & Caneel, R. (2005). Emotive alert: HMM-based emotion detection in voicemail messages. In Proceedings of...
  • W. Johnson et al. Recognition of emotion from vocal cues. Archives of General Psychiatry (1986).
  • L. Kaiser. Communication of affects by single vowels. Synthese (1962).
  • Klasmeyer, G., & Sendlmeier, W. F. (1995). Objective voice parameters to characterize the emotional content in speech....
  • G. Kotlyar et al. Acoustic correlates of the emotional content of vocalized speech. Journal of Acoustical Academy of Sciences of the USSR (1976).
  • Kwon, O.-W., Chan, K., Hao, J., & Lee, T.-W. (2003). Emotion recognition by speech signal. In Proceedings of the eighth...
  • J. Lattin et al. Analyzing multivariate data (2003).