Study of empirical mode decomposition and spectral analysis for stress and emotion classification in natural speech

https://doi.org/10.1016/j.bspc.2010.11.001

Abstract

Two new approaches to the feature extraction process for automatic stress and emotion classification in speech are proposed and examined. The first method uses the empirical mode decomposition (EMD) of speech into intrinsic mode functions (IMFs) and calculates the average Renyi entropy of the IMF channels. The second method calculates the average spectral energy in the sub-bands of speech spectrograms and can be enhanced by anisotropic diffusion filtering of the spectrograms. In the second method, three types of sub-bands were examined: critical, Bark and ERB. The performance of the new features was compared with that of the conventional mel frequency cepstral coefficients (MFCC). The modeling and classification process applied the classical GMM and KNN algorithms. The experiments used two databases containing natural speech: SUSAS (annotated with three different levels of stress) and the Oregon Research Institute (ORI) data (annotated with five different emotions: neutral, angry, anxious, dysphoric, and happy). For the SUSAS data, the best average recognition rate of 77% was obtained when using spectrogram features calculated within ERB bands and combined with anisotropic filtering. For the ORI data, the best result of 53% was obtained with the same method but without anisotropic filtering. The GMM and KNN classifiers showed similar performance. These results indicate that spectrogram patterns are promising for stress recognition; however, further improvements are needed for emotion recognition.

Introduction

The speech signal plays an essential role in human communication. It is used to convey linguistic information between speakers, as well as paralinguistic information about speakers’ emotional states. Emotion recognition from speech is a rapidly growing research area. Example applications include various human–machine communication systems, medical diagnosis, emotional robotics and virtual reality environments [1], [12].

Speech emotion analysis or classification refers to the use of various methods to analyze and classify vocal behavior as an indicator of affect (emotions, moods, or stress), taking into account only the non-verbal aspects of speech. It is assumed that there is a set of objectively measurable voice parameters that reflect the affective state a person is currently experiencing. The quality of affect recognition methods depends on the choice of the characteristic features.

Due to the lack of theoretical investigation of the underlying mechanisms of emotional speech, the majority of current approaches examine many possible acoustic parameters and determine their correlation with different emotions. Following this approach, pitch, spectral and intensity features have been investigated [12]. Other studies propose the use of linear predictive coefficients (LPC) and mel frequency cepstral coefficients (MFCC) [9], [25]. The majority of features listed in the literature are derived from linear models of speech, such as the source-filter model with a single excitation oscillating at the fundamental frequency, F0. A number of recent studies [8], [10] postulated that in emotional states the laminar air flow of speech production is accompanied by additional non-linear components in the form of turbulent vortices, generated during phonation, which become sources of sound when they hit the solid boundaries of the vocal tract. Different emotions or stress levels can therefore be characterized by different numbers and energies of vortex components; this line of investigation has been studied in [21], [23], and a number of feature extraction methods have been proposed, providing very promising results for stress and emotion classification in speech.

In this study, a similar assumption regarding the multi-component and non-linear nature of speech production is adopted; however, the proposed feature extraction methods use different approaches to signal analysis. The first approach divides the speech signal into separate components using the empirical mode decomposition (EMD) method proposed by Huang [2], and derives the speech features as average Renyi entropies of the intrinsic mode functions resulting from the EMD. The aim of this algorithm is to investigate the efficiency of the newly proposed features based on the non-linear speech production mechanism.
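
To make the first approach concrete, a minimal Python sketch is given below. It assumes the open-source PyEMD package for the decomposition and computes the Renyi entropy over the normalized squared amplitudes of each IMF; that choice of distribution, as well as the maximum number of IMFs and the entropy order, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np
from PyEMD import EMD  # assumed: open-source package for empirical mode decomposition


def renyi_entropy(p, q=2.0):
    """Renyi entropy of order q for a normalized discrete distribution p."""
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)


def emd_renyi_features(frame, q=2.0, max_imfs=10):
    """Decompose a speech frame into IMFs and return one Renyi entropy per IMF.

    The distribution used here (normalized squared amplitudes of each IMF) is an
    illustrative assumption, not necessarily the formulation used in the paper.
    """
    imfs = EMD().emd(frame, max_imf=max_imfs)        # shape: (n_imfs, n_samples)
    entropies = []
    for imf in imfs:
        energy = imf ** 2
        p = energy / (energy.sum() + 1e-12)          # pseudo-probability distribution
        entropies.append(renyi_entropy(p, q))
    return np.array(entropies)
```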

The second approach proposed in this paper generates characteristic features based on speech spectrograms. A spectrogram is a two-dimensional graphical display of the time-varying spectral density. It is a compact and highly efficient representation carrying information about the glottal pulse, formants, energy distribution, timing and harmonics. Previous studies used various spectrogram-based features; for example, Kleinschmidt [13] defined the mel-spectrogram and applied it to speech recognition, and Ezzat [14] described a spectro-temporal Gabor filter bank used to analyze localized patches of spectrograms, which showed advantages over one-dimensional features in word recognition.
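
As an illustration of the sub-band energy features described in the abstract, the sketch below computes a short-time Fourier spectrogram and averages the spectral energy within ERB-spaced bands, using the standard Glasberg–Moore ERB-rate formula; the frame length, overlap and number of bands are assumed values, not those reported in the paper.

```python
import numpy as np
from scipy.signal import spectrogram


def erb_band_edges(f_min, f_max, n_bands):
    """Equally spaced band edges on the ERB-rate scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    erb_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return erb_inv(np.linspace(erb(f_min), erb(f_max), n_bands + 1))


def subband_energy_features(signal, fs, n_bands=20, f_min=50.0):
    """Average spectral energy within ERB-spaced sub-bands of the spectrogram."""
    freqs, _, sxx = spectrogram(signal, fs=fs, nperseg=512, noverlap=256)
    edges = erb_band_edges(f_min, fs / 2.0, n_bands)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = sxx[(freqs >= lo) & (freqs < hi), :]
        feats.append(band.mean() if band.size else 0.0)
    return np.array(feats)
```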

The new spectral feature extraction method applied anisotropic filtering to the spectrograms prior to the calculation of features in order to enhance their directional characteristics. Anisotropic diffusion filtering of images, also called Perona–Malik diffusion [15], is a technique aimed at reducing image noise without removing significant parts of the image content. It has previously been used successfully in biomedical image processing to reduce noise and enhance contrast in specific regions of images: Gerig et al. [16] applied anisotropic filtering to 2-D and 3-D spin echo and gradient echo magnetic resonance (MR) data, and Ding et al. [17] tested anisotropic smoothing on in vivo diffusion tensor data for noise reduction.
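
A compact NumPy sketch of the Perona–Malik diffusion scheme, applied to a spectrogram treated as a 2-D image, is shown below; the number of iterations, edge-stopping constant and step size are illustrative values, not the settings used in this study.

```python
import numpy as np


def anisotropic_diffusion(image, n_iter=15, kappa=30.0, gamma=0.15):
    """Perona-Malik anisotropic diffusion of a 2-D array (e.g. a log-spectrogram).

    kappa (edge-stopping constant) and gamma (step size) are illustrative values.
    """
    img = image.astype(float).copy()
    for _ in range(n_iter):
        # differences toward the four neighbours, with zero flux at the borders
        north = np.roll(img, 1, axis=0) - img
        south = np.roll(img, -1, axis=0) - img
        east = np.roll(img, -1, axis=1) - img
        west = np.roll(img, 1, axis=1) - img
        north[0, :] = south[-1, :] = east[:, -1] = west[:, 0] = 0.0
        # exponential conduction coefficients favour smoothing within homogeneous regions
        update = sum(np.exp(-(g / kappa) ** 2) * g for g in (north, south, east, west))
        img += gamma * update
    return img
```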

The performance of the proposed feature extraction methods was compared with that of the conventional mel frequency cepstral coefficients (MFCCs), which have been used successfully in a number of previous studies of stress and emotion recognition in speech. For example, in the pairwise emotion classification experiments described in [8], based on the Simulated Domain of the SUSAS data, the MFCCs provided an average accuracy of 67%.
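
For comparison, an MFCC baseline can be reproduced with standard library calls; the sketch below assumes the librosa package and 13 coefficients averaged over frames, which may differ from the configuration used in this study or in [8].

```python
import librosa
import numpy as np


def mfcc_features(wav_path, n_mfcc=13):
    """Utterance-level MFCC feature vector (mean over frames); settings are assumed."""
    y, sr = librosa.load(wav_path, sr=None)                 # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)
```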

The remainder of this paper is organized as follows. Section 2 describes the speech data, and Section 3 describes the feature extraction and modeling methods. The experiments and results are presented in Section 4, and Section 5 provides the discussion and conclusions.

Section snippets

Speech data

The classification accuracy of stress and emotion depends largely on the type of speech samples used in the process of statistical modeling of different classes of stress or emotion. Current studies of stress and emotion recognition use three types of data. The first type uses emotions simulated by professional actors in a recording laboratory; it allows experimental control but the ecological validity of speech samples generated in this way is relatively low. The second type of data represents

Method

The automatic stress or emotion recognition process followed a typical 2-stage pattern recognition procedure illustrated in Fig. 1 and included training and testing stages. During the training stage, characteristic features extracted from the annotated speech signals were used to develop statistical models of different stress levels or different emotions. In the testing stage, characteristic features calculated from speech samples with unknown stress or emotion were passed to the classifier and
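
A minimal scikit-learn sketch of the training and testing stages is given below, assuming one diagonal-covariance GMM per stress or emotion class, scored by total log-likelihood at test time; the number of mixture components and of nearest neighbours are illustrative choices, not the values used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier


def train_gmms(features_by_class, n_components=8):
    """Fit one diagonal-covariance GMM per class on its training feature frames."""
    return {label: GaussianMixture(n_components=n_components,
                                   covariance_type="diag",
                                   random_state=0).fit(frames)
            for label, frames in features_by_class.items()}


def classify_gmm(gmms, frames):
    """Pick the class whose GMM assigns the highest total log-likelihood to the frames."""
    scores = {label: gmm.score_samples(frames).sum() for label, gmm in gmms.items()}
    return max(scores, key=scores.get)


# KNN alternative operating on utterance-level feature vectors (k is an assumed value)
knn = KNeighborsClassifier(n_neighbors=5)
# knn.fit(train_vectors, train_labels); knn.predict(test_vectors)
```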

Stress classification with the SUSAS database

The experiments compared the performance of the EMD-based and spectrogram-based feature extraction methods with the classical MFCC features using both the SUSAS and ORI databases. The average classification accuracy for the SUSAS data is shown in Table 3.

The EMD features were generated by calculating the Renyi entropy of order q = 2 for each of the intrinsic mode functions (IMFs).
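
For reference, the standard definition of the Renyi entropy of order q for a discrete distribution p, and its form at q = 2, are given below; how the distribution p is obtained from each IMF follows the paper's method section and is not reproduced here.

$$
H_q(p) = \frac{1}{1-q}\,\log \sum_i p_i^{\,q},
\qquad
H_2(p) = -\log \sum_i p_i^{2}.
$$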

The spectral features were obtained by calculating the average spectral energy for frequency

Discussion and conclusions

The best average classification accuracy for stress was 77%, provided by the spectrogram features calculated within ERB bands combined with anisotropic filtering. For emotion recognition, the best average accuracy was 53%, obtained with the same method but without anisotropic filtering. The average accuracy achieved for stress (77%) is higher than the average accuracy of 67% provided by the MFCC features in [8] when applied to the simulated stress

Acknowledgments

This work was supported by the Australian Research Council Linkage Grant LP0776235. The authors would like to thank the Oregon Research Institute, USA, for providing the database and Dr. Lisa Sheeber and Mr. Lu-Shih Alex Low for their invaluable help and support.

References (25)

  • D.L. Donoho, Denoising by soft thresholding, IEEE Transactions on Information Theory (1995).

  • G. Zhou et al., Nonlinear feature based classification of speech under stress, IEEE Transactions on Speech and Audio Processing (2001).