Study of empirical mode decomposition and spectral analysis for stress and emotion classification in natural speech
Introduction
The speech signal plays an essential role in human communication. It conveys linguistic information between speakers, as well as paralinguistic information about speakers' emotional states. Emotion recognition from speech is a rapidly growing research area. Example applications include human–machine communication systems, medical diagnosis, emotional robotics and virtual reality environments [1], [12].
Speech emotion analysis or classification refers to the use of various methods to analyze and classify vocal behavior as an indicator of affect (emotions, moods, or stress), taking into account only the non-verbal aspects of speech. It is assumed that there is a set of objectively measurable voice parameters that reflect the affective state a person is currently experiencing. The quality of affect recognition methods depends on the choice of the characteristic features.
Due to the lack of theoretical investigation of the underlying mechanisms of emotional speech, most current approaches examine many candidate acoustic parameters and determine their correlation with different emotions. Following this approach, pitch, spectral and intensity features have been investigated [12]. Other studies propose the use of linear predictive coefficients (LPC) and mel frequency cepstral coefficients (MFCC) [9], [25]. The majority of features described in the literature are derived from linear models of speech, such as the source-filter model with a single excitation oscillating at the fundamental frequency, F0. A number of recent studies [8], [10] postulated that in emotional states the laminar air flow of speech is accompanied by additional non-linear components in the form of turbulent vortices generated during phonation. These vortices become sources of sound when they hit the solid boundaries of the vocal tract, providing additional speech components. Different emotions or stress levels can therefore be characterized by different numbers and energies of vortex components; this line of investigation has been pursued in [21], [23], where several feature extraction methods were proposed with very promising results for stress and emotion classification in speech.
In this study, a similar assumption regarding the multi-component and non-linear nature of speech production is adopted; however, the proposed feature extraction methods take different approaches to signal analysis. The first approach divides the speech signal into separate components using the empirical mode decomposition (EMD) method proposed by Huang [2], and derives the speech features as average Renyi entropies of the intrinsic mode functions (IMFs) resulting from the EMD. This algorithm is used to investigate the efficiency of the newly proposed features based on the non-linear speech production mechanism.
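To illustrate the sifting idea behind EMD, a deliberately simplified numpy-only sketch is given below. It uses linear envelopes through the local extrema and fixed iteration counts, rather than the cubic-spline envelopes and stopping criteria of Huang's original algorithm [2], so it is an illustration of the principle, not the implementation used in this study.

```python
import numpy as np

def sift_once(x):
    """One sifting step: subtract the mean of the upper and lower envelopes.
    Envelopes are linear interpolations through local extrema (a
    simplification of the cubic splines in Huang's original method)."""
    t = np.arange(len(x))
    maxima = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    minima = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1
    if len(maxima) < 2 or len(minima) < 2:  # too smooth to sift further
        return x
    upper = np.interp(t, maxima, x[maxima])
    lower = np.interp(t, minima, x[minima])
    return x - (upper + lower) / 2.0

def emd(x, n_imfs=3, n_sift=8):
    """Crude EMD: peel off oscillatory modes, leaving a residual trend."""
    imfs, residual = [], x.astype(float)
    for _ in range(n_imfs):
        h = residual.copy()
        for _ in range(n_sift):
            h = sift_once(h)
        imfs.append(h)
        residual = residual - h
    return imfs, residual
```

By construction, the extracted modes and the residual sum back exactly to the original signal, which is the defining property of the decomposition.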
The second approach proposed in this paper generates characteristic features based on speech spectrograms. A spectrogram is a two-dimensional graphical display of the time-varying spectral density. It is a compact and highly efficient representation carrying information about the glottal pulse, formants, energy distribution, timing and harmonics. Previous studies used various spectrogram-based features; for example, Kleinschmidt [13] defined the mel-spectrogram and applied it to speech recognition, and Ezzat [14] described a spectro-temporal Gabor filter bank and used it to analyze localized patches of spectrograms, which showed advantages over one-dimensional features in word recognition.
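As an illustration, a magnitude spectrogram can be computed with a short-time Fourier transform. The 25 ms Hamming window and 10 ms hop below are common default choices, not necessarily the settings used in this study.

```python
import numpy as np

def spectrogram(signal, fs, win_len=0.025, hop=0.010):
    """Magnitude spectrogram via a Hamming-windowed short-time Fourier
    transform: one column per frame, one row per frequency bin."""
    n = int(win_len * fs)       # samples per analysis window
    step = int(hop * fs)        # samples between successive frames
    window = np.hamming(n)
    frames = [signal[i:i + n] * window
              for i in range(0, len(signal) - n + 1, step)]
    # keep the non-negative frequencies: n//2 + 1 bins of width fs/n
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

For a pure tone, the result has a single dominant row at the bin nearest the tone frequency, which is a convenient sanity check on the frequency axis.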
The new spectral feature extraction method applied anisotropic filtering to spectrograms prior to the calculation of features, in order to enhance the directional characteristics of the spectrograms. Anisotropic diffusion filtering of images, also called Perona–Malik diffusion [15], is a technique aimed at reducing image noise without removing significant parts of the image content. It has previously been used successfully in biomedical image processing to reduce noise and enhance contrast in specific regions of images: Gerig et al. [16] applied anisotropic filtering to 2-D and 3-D spin echo and gradient echo magnetic resonance (MR) data, and Ding et al. [17] tested anisotropic smoothing for noise reduction on in vivo diffusion tensor data.
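The Perona–Malik scheme [15] can be sketched as follows. The exponential conduction function and the parameter values are illustrative choices, and the sketch wraps at image borders via `np.roll`, a simplification of the usual reflecting boundary conditions.

```python
import numpy as np

def perona_malik(img, n_iter=10, kappa=0.1, gamma=0.2):
    """Perona-Malik diffusion: smooth homogeneous regions while
    preserving strong edges (where |gradient| >> kappa)."""
    u = img.astype(float).copy()
    for _ in range(n_iter):
        # finite differences toward the four neighbours (wrapping borders)
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # conduction coefficient g(d) = exp(-(d/kappa)^2): near 1 for small
        # gradients (noise is smoothed), near 0 at edges (edges are kept)
        u += gamma * (np.exp(-(dn / kappa) ** 2) * dn +
                      np.exp(-(ds / kappa) ** 2) * ds +
                      np.exp(-(de / kappa) ** 2) * de +
                      np.exp(-(dw / kappa) ** 2) * dw)
    return u
```

On a noisy step image, the variance inside each flat region drops while the step itself stays sharp, which is the behaviour that distinguishes this filter from isotropic Gaussian smoothing.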
The performance of the proposed feature extraction methods was compared with that of the conventional mel frequency cepstral coefficients (MFCCs), which have been used successfully in a number of previous studies of stress and emotion recognition in speech. For example, in the pairwise emotion classification experiments described in [8], based on the Simulated Domain of the SUSAS data, the MFCC features provided an average accuracy of 67%.
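For reference, a textbook-style single-frame MFCC computation is sketched below (power spectrum, triangular mel filterbank, logarithm, then a type-II DCT). The filter and coefficient counts are conventional defaults, not necessarily the exact settings of the cited studies.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_ceps=13):
    """MFCCs of one windowed frame: power spectrum -> mel filterbank
    -> log energies -> DCT (decorrelation)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    n_fft = len(frame)
    # triangular filters spaced uniformly on the mel scale
    edges = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_e = np.log(fbank @ spectrum + 1e-10)
    # type-II DCT matrix applied to the log filterbank energies
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```

In practice such per-frame vectors are computed over a sliding window and, as in the baseline here, passed to a statistical classifier.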
The remainder of this paper is organized as follows: Section 2 describes the speech data; Section 3 describes the feature extraction and modeling methods; Section 4 presents the experiments and results; and Section 5 provides the discussion and conclusions.
Speech data
The classification accuracy of stress and emotion depends largely on the type of speech samples used in the process of statistical modeling of different classes of stress or emotion. Current studies of stress and emotion recognition use three types of data. The first type uses emotions simulated by professional actors in a recording laboratory; it allows experimental control but the ecological validity of speech samples generated in this way is relatively low. The second type of data represents
Method
The automatic stress or emotion recognition process followed a typical 2-stage pattern recognition procedure illustrated in Fig. 1 and included training and testing stages. During the training stage, characteristic features extracted from the annotated speech signals were used to develop statistical models of different stress levels or different emotions. In the testing stage, characteristic features calculated from speech samples with unknown stress or emotion were passed to the classifier and
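The two-stage pattern (train statistical models on labelled features, then score unseen features against them) can be sketched with a simple per-class Gaussian likelihood model. The hypothetical `GaussianClassifier` below stands in for the statistical models mentioned above; the classifier actually used in this study may differ.

```python
import numpy as np

class GaussianClassifier:
    """Illustrative maximum-likelihood classifier: one Gaussian per class."""

    def fit(self, features, labels):
        """Training stage: estimate a mean and covariance per class."""
        self.classes = sorted(set(labels))
        self.params = {}
        for c in self.classes:
            x = features[labels == c]
            # small diagonal term keeps the covariance invertible
            self.params[c] = (x.mean(axis=0),
                              np.cov(x.T) + 1e-6 * np.eye(x.shape[1]))
        return self

    def predict(self, x):
        """Testing stage: pick the class with the highest log-likelihood."""
        def log_lik(c):
            mu, cov = self.params[c]
            d = x - mu
            return -0.5 * (d @ np.linalg.solve(cov, d)
                           + np.log(np.linalg.det(cov)))
        return max(self.classes, key=log_lik)
```

A Gaussian mixture model, as used in several of the cited studies, generalizes this by modelling each class with a weighted sum of Gaussians instead of a single one.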
Stress classification with the SUSAS database
The experiments compared the performance of the EMD based feature extraction method and the spectrogram based feature extraction with the classical MFCC features using both the SUSAS and ORI databases. The average classification accuracy for the SUSAS data is shown in Table 3.
The EMD features were generated by calculating Renyi entropy of order q = 2 for each of the intrinsic mode functions (IMF).
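For order q = 2 the Renyi entropy H_q = log(sum_i p_i^q) / (1 - q) reduces to H_2 = -log(sum_i p_i^2). A histogram-based estimate of this quantity per IMF could look as follows; the bin count is an assumed parameter, not one specified in the paper.

```python
import numpy as np

def renyi_entropy(x, q=2.0, n_bins=64):
    """Renyi entropy H_q = log(sum p_i^q) / (1 - q), with p_i estimated
    from the normalized amplitude histogram of x."""
    hist, _ = np.histogram(x, bins=n_bins)
    p = hist[hist > 0] / hist.sum()
    return float(np.log(np.sum(p ** q)) / (1.0 - q))

def emd_renyi_features(imfs, q=2.0):
    """One entropy value per intrinsic mode function (IMF)."""
    return [renyi_entropy(imf, q=q) for imf in imfs]
```

For a uniform distribution over n occupied bins this gives H_2 = log n, its maximum, so the entropy measures how evenly an IMF's amplitudes are spread.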
The spectral features were obtained by calculating the average spectral energy for frequency
Discussion and conclusions
The best average classification accuracy for stress was 77%, provided by the spectrogram features calculated within ERB bands combined with anisotropic filtering. For emotion recognition, the best average accuracy was 53%, obtained with the same method but without anisotropic filtering. The average accuracy achieved for stress (77%) is higher than the average accuracy of 67% provided by the MFCC features in [8] when applied to the simulated stress
Acknowledgments
This work was supported by the Australian Research Council Linkage Grant LP0776235. The authors would like to thank the Oregon Research Institute, USA, for providing the database and Dr. Lisa Sheeber and Mr. Lu-Shih Alex Low for their invaluable help and support.
References
- et al., Statistical mechanics based on Renyi entropy, Physica A: Statistical Mechanics and Its Applications (2000)
- et al., Emotional speech recognition: resources, features, and methods, Speech Communication (2006)
- et al., Sub-band SNR estimation using auditory feature processing, Speech Communication (2003)
- et al., An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech, Speech Communication (2007)
- Vocal communication of emotion: a review of research paradigms, Speech Communication (2003)
- et al., Emotion recognition in human–computer interaction, IEEE Signal Processing Magazine (2001)
- et al., The empirical mode decomposition and the Hilbert spectrum for non-linear and non-stationary time series analysis, Proceedings of the Royal Society of London Series A: Mathematical Physical and Engineering Sciences (1998)
- et al., Living in Family Environments (LIFE) Coding: A Reference Manual for Coders (2006)
- Some observations on oral air flow during phonation, IEEE Transactions on Acoustics, Speech and Signal Processing (1980)
- et al., Speaker identification performance enhancement using Gaussian mixture model with GMM classification post-processor, IEEE International Conference on Signal Processing and Communication (2007)