Neurocomputing

Volume 367, 20 November 2019, Pages 188-197

Semi-supervised facial expression recognition using reduced spatial features and Deep Belief Networks

https://doi.org/10.1016/j.neucom.2019.08.029

Abstract

A semi-supervised emotion recognition algorithm using reduced features and a novel feature selection approach is proposed. The algorithm has a cascaded structure: feature extraction is first applied to the facial images, followed by feature reduction. A Deep Belief Network (DBN) is then trained in a semi-supervised manner with all the available labeled and unlabeled data. Feature selection, based on a reconstruction error ranking, eliminates those features that do not provide information. Results show that HOG features of the mouth provide the best performance. The semi-supervised DBN approach was compared against supervised strategies such as Support Vector Machine (SVM) and Convolutional Neural Network (CNN); the results show that the semi-supervised approach improves efficiency by using the information contained in both labeled and unlabeled data. Several databases were used to validate the experiments, and applying Linear Discriminant Analysis (LDA) to the HOG features of the mouth gave the highest recognition rate.

Introduction

Amongst the various modes of emotion recognition (ER), facial expression is one of the main forms used to convey emotions. ER can be applied in fields such as medicine, marketing, and entertainment. For example, a medical robot can be designed to continuously monitor a patient's emotional state [1], [2], or a diagnostic suggestion system can be built for therapists [3]. In Human-Computer Interaction, a system endowed with emotional intelligence can be used to create effective communication with users [4]. In emergency situations, as part of the corresponding situational awareness, real-time decisions can be made from the behavioral patterns of the subjects.

The development of a facial ER system is challenging since images of the same person with the same facial expression can vary with lighting conditions, background, and occlusions [5], which precludes homogeneity. Certain emotions have only subtle distinctions, which makes them harder to analyze and describe. State-of-the-art approaches to facial ER use feature-based methods [6], [7] and template-based methods [8]. The former focus on appearance- and geometry-based feature extraction. Template-based methods are less reliable because they are limited to frontal faces, and their accuracy changes with variations in pose, scale, and shape. Feature extraction is mostly based on the Histogram of Oriented Gradients (HOG) [9] and Local Binary Patterns (LBP) [10]. HOG descriptors have been used to encode facial components since they capture the distribution of gradient orientations in an image. Other works use the Discrete Wavelet Transform (DWT) for feature extraction and neural networks for classification [11]. Dimensionality reduction (DR) techniques in ER include Principal Component Analysis (PCA) [12] and Linear Discriminant Analysis (LDA) [13]. Recently, PCA-based facial feature projection has also been used for age progression [14]. These linear methods cannot capture the nonlinear structure of the data. To overcome this limitation, various nonlinear DR algorithms such as kernel PCA [15], Locally Linear Embedding (LLE) [16], Isometric Feature Mapping (Isomap) [17] and t-distributed Stochastic Neighbor Embedding (t-SNE) [18] have been proposed. Sparse representation-based classification (SRC) methods have also been widely used since 2009 [19]. SRC is most effective when there is high separability between the subspaces [20], [21], [22], but its main disadvantage over classical subspace learning algorithms is that its classification criterion fails, leading to misclassification, when the samples are highly correlated.
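The contrast between the linear and nonlinear DR families above can be illustrated with a minimal scikit-learn sketch; the data, dimensions, and parameters here are illustrative placeholders, not the paper's actual settings:

```python
# Linear (PCA, LDA) vs. nonlinear (kernel PCA, Isomap) dimensionality
# reduction on synthetic feature vectors. LDA is supervised and is
# limited to at most (number of classes - 1) output dimensions.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # 200 feature vectors of dimension 64
y = np.arange(200) % 7           # 7 expression classes (synthetic labels)

# Linear methods.
X_pca = PCA(n_components=10).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=6).fit_transform(X, y)

# Nonlinear methods, able to follow curved structure in the data.
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)
X_iso = Isomap(n_components=10, n_neighbors=10).fit_transform(X)

print(X_pca.shape, X_lda.shape, X_kpca.shape, X_iso.shape)
```

Note how LDA's output dimension is capped at 6 for 7 classes, while the unsupervised methods can be set to any dimension up to the input size.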
Deep neural networks [23] have gained popularity in recent years as a choice for supervised learning. The major drawback of supervised learning comes from the fact that most of the available data is unlabeled, something particularly evident in the case of human face images. Hinton et al. [24] proposed the Restricted Boltzmann Machine (RBM) and its generalization to Deep Belief Networks (DBN), where unsupervised techniques can model the probabilistic distribution of the data and cluster it [25].

In this paper, a semi-supervised DBN is used to include unlabeled and labeled data to improve the accuracy of the classifier. Semi-supervised learning is comparable to human learning, which involves a small amount of labeled data along with greater amounts of unlabeled observations [26]. To make use of the unlabeled data, DBNs are applied to learn the model [27], and the obtained discriminative model is then fitted to a labeled dataset by performing backpropagation (BP).

This paper makes two major contributions. First, we propose to use semi-supervised learning during feature selection to determine the features that best explain the human emotions present in the available data. The proposed DBN has an input layer taking in the dimensionality-reduced feature vectors corresponding to the HOG of the mouth, HOG of the eye, wavelet transform of the mouth, and wavelet transform of the eye. Reconstruction error and validation accuracy were used to find the most significant feature vector. Second, a semi-supervised learning process is proposed that uses unlabeled data to train a DBN; after convergence, the structure is fine-tuned with BP and the available labeled data. The data is previously processed by a dimensionality reduction method: the most efficient linear method was LDA, and amongst the nonlinear approaches the best one was Isomap. The proposed semi-supervised framework was evaluated on the CK+, MMI and RaFD databases. The results show that the presented approach performs similarly to or better than state-of-the-art methods (SVM and CNN), with the additional benefits of using significantly less labeled data and dramatically reduced training and test computational burdens.
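The two-stage scheme of unsupervised pretraining on all data followed by supervised fine-tuning on the labeled subset can be sketched as follows. This is a hedged stand-in, not the paper's implementation: a single scikit-learn `BernoulliRBM` replaces the stacked DBN, logistic regression replaces the BP fine-tuning, and all data is synthetic.

```python
# Semi-supervised sketch: contrastive-divergence pretraining uses every
# sample (labeled + unlabeled); only the small labeled subset is used
# for the supervised stage.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_all = rng.random((300, 32))                         # all data, mostly unlabeled
labeled = rng.choice(300, size=60, replace=False)     # indices with labels
y_lab = rng.integers(0, 7, size=60)                   # 7 synthetic classes

# Stage 1: unsupervised pretraining on the full dataset.
rbm = BernoulliRBM(n_components=16, n_iter=20, random_state=0)
rbm.fit(X_all)

# Stage 2: supervised fitting on the labeled subset, in the learned
# hidden representation.
clf = LogisticRegression(max_iter=500)
clf.fit(rbm.transform(X_all[labeled]), y_lab)

print(clf.predict(rbm.transform(X_all[:5])))
```

The key point the sketch preserves is that the unlabeled samples shape the representation in stage 1, so the supervised stage needs far fewer labels.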

Section snippets

Proposed approach

The introduced semi-supervised Deep Belief Network for facial ER is shown in Fig. 1. The proposed method incorporates different feature extraction methods and dimensionality reduction techniques prior to passing the data into the DBN. Based on the characteristics of the different facial expressions, the mouth and eye patches are extracted from the facial data. Then two feature extraction methods, namely HOG and the 2D Discrete Wavelet Transform (2D-DWT), were used to compute the significant spatial components from the mouth and eye
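The HOG step applied to a facial patch can be sketched with a simplified, numpy-only version; the patch size, cell size, and bin count below are illustrative assumptions, since the paper's exact HOG parameters are not given in this snippet (a full implementation would also add block normalization):

```python
# Simplified HOG: gradient orientations within each cell are binned into
# a magnitude-weighted histogram, and the cell histograms are concatenated.
import numpy as np

def simple_hog(patch, cell=8, bins=9):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180    # unsigned orientation
    h, w = patch.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    return np.concatenate(feats)

mouth_patch = np.random.default_rng(0).random((32, 64))  # synthetic patch
print(simple_hog(mouth_patch).shape)  # (32//8) * (64//8) * 9 = 288 features
```

The resulting fixed-length vector is what the dimensionality reduction stage would then compress before it reaches the DBN input layer.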

Databases

The Extended Cohn-Kanade database (CK+) [49], the Radboud Faces Database (RaFD) [50] and the MMI database [51] were used to test the proposed method for facial ER. The CK+ and MMI databases were captured in a lab-based environment, whereas the RaFD database contains facial images with varying poses and gaze directions. Firstly, in the case of the CK+ database, there are 327 image sequences with 7 expression labels, namely anger, neutral, disgust, fear, happy, sad, and surprise. The last frame of each

Conclusion

A semi-supervised approach for facial ER utilizing reduced facial features, with most of the data being unlabeled, is introduced with a four-layered neural network. The network is convenient to use due to its easy training: since Contrastive Divergence (CD) and BP are used, training can be done sequentially. Semi-supervised learning was achieved by combining CD and BP, as CD is unsupervised and BP is supervised. The facial features used were mouth and eye HOG, 2D-DWT of the eyes and 2D-DWT of the mouth. Further, the analysis was done

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work has been supported by National Science Foundation S&CC EAGER grant 1637092. The authors would like to thank the UNM Center for Advanced Research Computing, supported in part by the NSF, for providing high performance computing, large-scale storage and visualization resources.

Aswathy Rajendra Kurup received the bachelor’s degree in Electronics and Communication Engineering from Amrita school of Engineering in 2015 and the master’s degree in Electrical Engineering from The University of New Mexico in 2017. She is currently working towards her Ph.D. degree in Electrical Engineering from The University of New Mexico. Her research interests are Image Processing, Signal Processing and Machine Learning.

References (58)

  • C. Soladi et al.

    A new invariant representation of facial expressions: definition and application to blended expression recognition

    Proceedings of the 2012 19th IEEE International Conference on Image Processing

    (2012)
  • M. Pantic et al.

    Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences

    IEEE Trans. Syst. Man Cybern. Part B (Cybern.)

    (2006)
  • Y.-l. Tian et al.

    Recognizing action units for facial expression analysis

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2001)
  • M. Pantic et al.

    Automatic analysis of facial expressions: the state of the art

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • Y. Hu et al.

    Multi-view facial expression recognition

    Proceedings of the 2008 8th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2008

    (2008)
  • S.B. Kazmi et al.

    Wavelets based facial expression recognition using a bank of neural networks

    Proceedings of the 2010 5th International Conference on Future Information Technology

    (2010)
  • M. Turk et al.

    Eigenfaces for recognition

    J. Cognit. Neurosci.

    (1991)
  • P.N. Belhumeur et al.

    Eigenfaces vs. fisherfaces: recognition using class specific linear projection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1997)
  • X. Shu et al.

    Personalized age progression with bi-level aging dictionary learning

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • T. Balachander et al.

    Kernel based subspace pattern classification

    Proceedings of the International Joint Conference on Neural Networks (IJCNN'99)

    (1999)
  • L.K. Saul et al.

    Think globally, fit locally: unsupervised learning of low dimensional manifolds

    J. Mach. Learn. Res.

    (2003)
  • J.B. Tenenbaum et al.

    A global geometric framework for nonlinear dimensionality reduction

    Science

    (2000)
  • L. van der Maaten

    Accelerating t-SNE using tree-based algorithms

    J. Mach. Learn. Res.

    (2014)
  • J. Wright et al.

    Robust face recognition via sparse representation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • X. Shu et al.

    Image classification with tailored fine-grained dictionaries

    IEEE Trans. Circuits Syst. Video Technol.

    (2018)
  • X. Lan et al.

    Learning common and feature-specific patterns: a novel multiple-sparse-representation-based tracker

    IEEE Trans. Image Process.

    (2018)
  • J. Gu et al.

    Random subspace based ensemble sparse representation

    Pattern Recognit.

    (2017)
  • S. Hochreiter et al.

    Gradient flow in recurrent nets: the difficulty of learning long-term dependencies

  • G.E. Hinton et al.

    A fast learning algorithm for deep belief nets

    Neural Comput.

    (2006)

Meenu Ajith received the bachelor’s degree in Electronics and Communication Engineering from Amrita school of Engineering in 2015 and the master’s degree in Electrical Engineering from The University of New Mexico in 2017. She is currently working towards her Ph.D. degree in Electrical Engineering from The University of New Mexico. Her research interests are Machine Learning, Computer Vision, Pattern Recognition and Image Processing.

Manel Martínez Ramón is a professor with the ECE department of The University of New Mexico. He holds the King Felipe VI Endowed Chair of the University of New Mexico, a chair sponsored by the Household of the King of Spain. He is a Telecommunications Engineer (Universitat Politècnica de Catalunya, Spain, 1996) and Ph.D. in Communications Technologies (Universidad Carlos III de Madrid, Spain, 1999). His research interests are in Machine Learning applications to smart antennas, neuroimage, first responders and other cyber-human systems, smart grid and others. His most recent work is the monographic book “Signal Processing with Kernel Methods”, Wiley, 2018.
