
Pattern Recognition

Volume 61, January 2017, Pages 610-628

Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order

https://doi.org/10.1016/j.patcog.2016.07.026

Highlights

  • A CNN-based approach for facial expression recognition.

  • A set of pre-processing steps allowing for a simpler CNN architecture.

  • A study of the impact of each pre-processing step on accuracy.

  • A study on lowering the impact of the sample presentation order during training.

  • High facial expression recognition accuracy (96.76%) with real-time evaluation.

Abstract

Facial expression recognition has been an active research area over the past 10 years, with growing application areas including avatar animation, neuromarketing and sociable robots. The recognition of facial expressions is not an easy problem for machine learning methods, since people can vary significantly in the way they show their expressions. Even images of the same person with the same facial expression can vary in brightness, background and pose, and these variations are emphasized when considering different subjects (because of variations in shape, ethnicity, among others). Although facial expression recognition has been widely studied in the literature, few works perform a fair evaluation that avoids mixing subjects between the training and testing sets. Hence, facial expression recognition is still a challenging problem in computer vision. In this work, we propose a simple solution for facial expression recognition that uses a combination of a Convolutional Neural Network and specific image pre-processing steps. Convolutional Neural Networks achieve better accuracy with big data. However, there are no publicly available datasets with sufficient data for facial expression recognition with deep architectures. Therefore, to tackle this problem, we apply pre-processing techniques to extract only expression-specific features from a face image and explore the presentation order of the samples during training. The experiments employed to evaluate our technique were carried out using three widely used public databases (CK+, JAFFE and BU-3DFE). A study of the impact of each image pre-processing operation on the accuracy rate is presented. The proposed method achieves competitive results when compared with other facial expression recognition methods (96.76% accuracy on the CK+ database), is fast to train, and allows for real-time facial expression recognition with standard computers.

Introduction

Facial expression is one of the most important features of human emotion recognition [1]. It was introduced as a research field by Darwin in his book “The Expression of the Emotions in Man and Animals” [2]. According to Li and Jain [3], it can be defined as the facial changes in response to a person's internal emotional state, intentions, or social communication. Nowadays, automated facial expression recognition has a large variety of applications, such as data-driven animation, neuromarketing, interactive games, sociable robotics and many other human–computer interaction systems.

Expression recognition is a task that humans perform daily and effortlessly [3], but it is not yet easily performed by computers, even though recent methods have reported accuracies above 95% under some conditions (frontal face, controlled environments, and high-resolution images). Many works in the literature do not follow a consistent evaluation methodology (e.g. one without subject overlap between training and testing) and therefore report a misleadingly high accuracy that does not represent most real scenarios of the facial expression recognition problem. On the other hand, low accuracy has been reported on databases with uncontrolled environments and in cross-database evaluations. Trying to cope with these limitations, several research works have tried to make computers reach the same accuracy as humans; some examples of these works are highlighted below. This problem is still a challenge for computers because it is very hard to separate the expressions' feature space: facial features from one subject in two different expressions may be very close in the feature space, while facial features from two subjects with the same expression may be very far from each other. In addition, some expressions, such as “sad” and “fear”, are in some cases very similar.

Fig. 1 shows three subjects with a happy expression. As can be seen in the figure, the images vary considerably not only in the way the subjects show their expression, but also in lighting, brightness, pose and background. The figure also exemplifies another challenge related to facial expression recognition: uncontrolled training–testing scenarios (training images can be very different from the testing images in terms of environmental conditions and subject ethnicity). One approach to evaluating facial expression recognition under these scenarios is to train the method with one database and to test it with another (possibly from different ethnic groups). We present results following this approach.

Facial expression recognition systems can be divided into two main categories: those that work with static images [7], [8], [9], [10], [11], [12], [13] and those that work with dynamic image sequences [14], [15], [16], [17]. Static-based methods do not use temporal information, i.e. the feature vector comprises information about the current input image only. Sequence-based methods, on the other hand, use temporal information from images to recognize the expression captured in one or more frames. Automated systems for facial expression recognition receive the expected input (a static image or an image sequence) and typically give as output one of six basic expressions (e.g. anger, sadness, surprise, happiness, disgust and fear); some systems also recognize the neutral expression. This work focuses on methods based on static images and considers both the six-expression and the seven-expression (six basic plus neutral) sets, for controlled and uncontrolled scenarios.

As described by Li and Jain [3], automatic facial expression analysis comprises three steps: face acquisition, facial data extraction and representation, and facial expression recognition. Face acquisition can be split into two major steps: face detection [18], [19], [20], [21] and head pose estimation [22], [23], [24]. After face acquisition, the facial changes caused by facial expressions need to be extracted. These changes are usually extracted using geometric feature-based methods [25], [26], [27], [21] or appearance-based methods [8], [9], [10], [11], [13], [25], [28]. The extracted features are often represented as vectors, referred to as feature vectors. Geometric feature-based methods work with the shape and location of facial components such as the mouth, eyes, nose and eyebrows; the feature vector that represents the face geometry is composed of facial components or facial feature points. Appearance-based methods work with feature vectors extracted from the whole face or from specific regions; these feature vectors are acquired using image filters applied to the whole face image [3].
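
To ground the face acquisition step, below is a hypothetical sketch using OpenCV's bundled Haar cascades for face and eye detection. The cascade files and parameter values are stock OpenCV choices, not the detectors used in this paper.

```python
import cv2

# Generic face acquisition sketch using OpenCV's bundled Haar cascades.
# This is an illustrative stand-in, not the detection method of this paper.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def acquire_face(image_bgr):
    """Return the cropped grayscale face and the detected eye boxes (if any)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, []
    x, y, w, h = faces[0]                       # keep the first detected face
    face = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face)   # eye boxes, usable for alignment later
    return face, eyes
```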

Once feature vectors related to the facial expression are available, expression recognition can be performed. According to Liu et al. [7], expression recognition systems basically use a three-stage training procedure: feature learning, feature selection and classifier construction, in this order. The feature learning stage is responsible for extracting all features related to the facial expression. The feature selection stage selects the best features to represent the facial expression; they should minimize the intra-class variation of expressions while maximizing the inter-class variation [8]. Minimizing the intra-class variation is difficult because images of different individuals with the same expression can be far from each other in pixel space, while maximizing the inter-class variation is also difficult because images of the same person in different expressions may be very close to one another in pixel space [29]. At the end of the whole process, a classifier (or a set of classifiers, one for each expression) is used to infer the facial expression, given the selected features.
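
The intra-/inter-class difficulty described above can be measured directly in pixel space. The helper below is our illustration; the function name and the Euclidean-distance choice are assumptions, not taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_distances(images, labels):
    """Compare mean pairwise pixel-space distances within and across classes.

    images: array of shape (n, h, w); labels: one expression id per image.
    A small gap between intra and inter illustrates why raw pixels
    separate expressions poorly.
    """
    labels = np.asarray(labels)
    flat = images.reshape(len(images), -1).astype(np.float64)
    dists = cdist(flat, flat)                  # Euclidean by default
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(images), dtype=bool)
    intra = dists[same & off_diag].mean()      # same expression, mostly different subjects
    inter = dists[~same].mean()                # different expressions
    return intra, inter
```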

One of the techniques that has been successfully applied to the facial expression recognition problem is the deep multi-layer neural network [30], [31], [32], [14], [10], [11], [13]. This technique comprises the three steps of facial expression recognition (feature learning, feature selection and classification) in one single step. In the last decade, neural network research has been motivated by finding ways to train deep multi-layer neural networks (i.e. networks with more than one or two hidden layers) in order to increase their accuracy [33], [34]. According to Bengio [35], until 2006 many such attempts showed little success. Although somewhat old, the Convolutional Neural Network (CNN) proposed in 1998 by LeCun et al. [36] has shown to be very effective in learning features with a high level of abstraction when using deeper architectures (i.e. with many layers) and new training techniques. In general, this type of hierarchical network has alternating types of layers, including convolutional layers, sub-sampling layers and fully connected layers. Convolutional layers are characterized by the kernel size and the number of generated maps; the kernel is shifted over the valid region of the input image, generating one map. Sub-sampling layers are used to increase the position invariance of the kernels by reducing the map size [37]; the main types of sub-sampling layers are maximum pooling and average pooling [37]. Fully connected layers in CNNs are similar to the ones in general neural networks: their neurons are fully connected with the previous layer (generally a convolutional layer, a sub-sampling layer or even another fully connected layer). The learning procedure of CNNs consists of finding the best synapse weights; supervised learning can be performed using a gradient descent method, like the one proposed by LeCun et al. [36]. One of the main advantages of CNNs is that the model's input is a raw image rather than a set of hand-coded features.
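
To make the layer vocabulary concrete, here is a minimal LeNet-style sketch in PyTorch (our choice of framework for illustration; this paper itself uses ConvNet/Caffe). It assumes 32×32 grayscale inputs and six expression classes, and it is not the architecture proposed in this paper.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """LeNet-style alternation of convolution and sub-sampling layers,
    followed by fully connected layers (a generic illustration)."""

    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 1 grayscale map -> 6 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                  # sub-sampling by max-pooling
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # 32x32 input -> 5x5 maps after two conv/pool pairs
            nn.ReLU(),
            nn.Linear(120, num_classes),      # logits; the softmax lives in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

Training would pair the logits with torch.nn.CrossEntropyLoss, which folds in the softmax, analogous to the Caffe SoftmaxWithLoss layer mentioned later in this paper.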

Besides the methods using deep architectures, there are many others in the literature, but some aspects of their evaluation still deserve attention. For example, the validation methodology could be improved in [10], [30], [38], [39], [40], [41], [42] to consider situations where the subjects in the test set are not in the training set (i.e. testing without subject overlap); the accuracy is somewhat low in [38], [1], [43], [44]; and the recognition time in [7], [43], [16] could be improved so as to allow real-time evaluation.
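
One way to enforce the subject-independent evaluation advocated here is to split the data on subject identity. The sketch below uses scikit-learn's GroupKFold, which keeps all samples of a subject in a single fold; it is a generic illustration, not the exact protocol of any cited work.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def subject_independent_splits(X, y, subjects, n_splits=10):
    """Yield train/test indices with no subject shared between the sets.

    X: samples, y: expression labels, subjects: NumPy array with one
    subject id per sample. GroupKFold keeps every sample of a given
    subject inside a single fold.
    """
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, y, groups=subjects):
        # Sanity check: training and test subjects are disjoint.
        assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
        yield train_idx, test_idx
```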

Trying to cope with some of these limitations while keeping a simple solution, in this paper we present a deep learning approach that combines standard methods, such as image normalizations, synthetic training-sample generation (e.g. real images with artificial rotations, translations and scalings) and a Convolutional Neural Network, into a simple solution that achieves a very high accuracy rate of 96.76% on the CK+ database for 6 expressions, which is the state of the art. The training time is significantly smaller than that of other methods in the literature, and the whole facial expression recognition system can operate in real time on standard computers. We have examined the performance of our system using the Extended Cohn–Kanade (CK+) database [4], the Japanese Female Facial Expression (JAFFE) database [5] and the Binghamton University 3D Facial Expression (BU-3DFE) database [6], achieving the best accuracy on the CK+ database, which contains more samples (important for deep learning techniques) than the JAFFE and BU-3DFE databases. In addition, we have performed an extensive validation, with cross-database tests (i.e. training the method using one database and evaluating its accuracy using another one). In summary, the main contributions of this work are:

  • (i) an efficient method for facial expression recognition that operates in real time;

  • (ii) a study of the effects of image pre-processing operations on the facial expression recognition problem;

  • (iii) a set of pre-processing operations for face normalization (spatial and intensity) that decreases the need for controlled environments and copes with the lack of data;

  • (iv) a study of how to handle the variability in accuracy caused by the presentation order of the samples during training; and

  • (v) a study of the performance of the proposed system with different cultures and environments (cross-database evaluation).

This work extends the one presented in the 28th SIBGRAPI (Conference on Graphics, Patterns and Images) [45] as follows:

  • (i) it presents a deeper literature review, which was used to broaden the comparison of results;

  • (ii) it presents results using a new implementation based on a different framework (from ConvNet, a Matlab-based implementation [46], to Caffe, a C++ based implementation [47]), which reduced the total training time by almost a factor of four;

  • (iii) it presents results showing a reduced recognition time, which now allows real-time operation;

  • (iv) it presents results showing better accuracy due to longer training and small changes in the method (see below);

  • (v) it includes an improved experimental methodology, for example using training, validation and test sets instead of training and test sets only; and

  • (vi) it presents a more complete evaluation, including cross-database tests.

The changes in the method that allowed better accuracy were as follows (a sketch of the resulting sample generation is given after this list):

  • (a) The synthetic samples have a slightly different generation process, allowing larger variation among them (a synthetic sample can now be the original image rotated, scaled or translated, instead of only rotated);

  • (b) We increased the number of synthetic samples from 30 to 70 (motivated by the previous item); and

  • (c) The logistic regression loss function was replaced by a SoftmaxWithLoss function (described in Section 3).
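
As a rough illustration of items (a) and (b), the sketch below generates synthetic samples by applying a random rotation, scale and translation to a face image with OpenCV. The parameter ranges are our guesses, not the values used in the paper.

```python
import cv2
import numpy as np

def synthesize(image, n_samples=70, rng=None):
    """Generate synthetic samples by random rotation, scaling and
    translation of one face image (parameter ranges are illustrative)."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    center = (w / 2.0, h / 2.0)
    samples = []
    for _ in range(n_samples):
        angle = rng.uniform(-10.0, 10.0)            # rotation in degrees
        scale = rng.uniform(0.9, 1.1)               # isotropic scaling
        m = cv2.getRotationMatrix2D(center, angle, scale)
        m[:, 2] += rng.uniform(-5.0, 5.0, size=2)   # translation in pixels
        samples.append(cv2.warpAffine(image, m, (w, h)))
    return samples
```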

The remainder of this paper is organized as follows: the next section presents the most recent related work, while Section 3 describes the proposed approach. In Section 4, the experiments we have performed to evaluate our system are presented and compared with several recent facial expression recognition methods. Finally, we conclude in Section 5.

Section snippets

Related work

Several facial expression recognition approaches were developed in the last decades with increasing progress in recognition performance. An important part of this recent progress was achieved thanks to the emergence of deep learning methods [7], [10], [12] and, more specifically, Convolutional Neural Networks [14], [11], which are one of the deep learning approaches. These approaches became feasible due to the larger amount of data available nowadays to train learning methods and the

Facial expression recognition system

Our system for facial expression recognition performs the three learning stages in just one classifier (a CNN). The proposed system operates in two main phases: training and test. During training, the system receives training data comprising grayscale images of faces with their respective expression id and eye center locations, and learns a set of weights for the network. To ensure that the training performance is not affected by the order of presentation of the examples, a few images are
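
Since the snippet above mentions eye center locations, a plausible spatial normalization is to rotate the face so the eyes are horizontal and crop around them. The sketch below does this with OpenCV; the crop geometry, the 32×32 output size and the use of histogram equalization for intensity normalization are our assumptions, not necessarily the paper's exact procedure.

```python
import cv2
import numpy as np

def normalize_face(gray, left_eye, right_eye, out_size=(32, 32)):
    """Rotate so the eye centers lie on a horizontal line, crop around
    them, resize and equalize intensities (8-bit grayscale input)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))   # eye-line tilt
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)        # mid-point between eyes
    m = cv2.getRotationMatrix2D(center, angle, 1.0)
    h, w = gray.shape
    rotated = cv2.warpAffine(gray, m, (w, h))
    d = int(np.hypot(rx - lx, ry - ly))                # inter-ocular distance
    x0 = max(int(center[0] - d), 0)
    y0 = max(int(center[1] - d // 2), 0)
    crop = rotated[y0:y0 + 2 * d, x0:x0 + 2 * d]       # crop box is an assumption
    return cv2.equalizeHist(cv2.resize(crop, out_size))
```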

Experiments and discussions

The experiments were performed using three publicly available databases in the facial expression recognition research field: the Extended Cohn–Kanade (CK+) database [4], the Japanese Female Facial Expression (JAFFE) database [5] and the Binghamton University 3D Facial Expression (BU-3DFE) database [6]. Accuracy is computed considering one classifier to classify all learned expressions. In addition, to allow for a fair comparison with some methods in the literature, accuracy is also computed

Conclusion

In this paper, we propose a facial expression recognition system that uses a combination of standard methods, such as a Convolutional Neural Network and specific image pre-processing steps. Experiments showed that the combination of the normalization procedures significantly improves the method's accuracy. As shown in the results, in comparison with recent methods in the literature that use the same facial expression database and experimental methodology, our method achieves competitive results

Conflict of interest

We wish to confirm that there are no known conflicts of interest associated with this work.

Acknowledgment

We would like to thank Universidade Federal do Espírito Santo – UFES (project SIEEPEF, 5911/2015), Fundação de Amparo à Pesquisa do Espírito Santo – FAPES (grants 65883632/14, 53631242/11, and 60902841/13), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – CAPES (grant 11012/13-7 and scholarship) and Conselho Nacional de Desenvolvimento Científico e Tecnológico – CNPq (grants 552630/2011-0 and 312786/2013-1).


References (82)

  • S.H. Lee et al., Collaborative expression representation using peak expression and intra class variation face images for practical subject-independent emotion recognition in videos, Pattern Recognit. (2016)
  • T. Sha et al., Feature level analysis for 3d facial expression recognition, Neurocomputing (2011)
  • A. Maalej et al., Shape analysis of local facial patches for 3d facial expression recognition, Pattern Recognit. (2011)
  • D. Mery et al., Automatic facial attribute analysis via adaptive sparse representation of random patches, Pattern Recognit. Lett. (2015)
  • Y. Wu, H. Liu, H. Zha, Modeling facial expression space for recognition, in: 2005 IEEE/RSJ International Conference on...
  • C. Darwin, The Expression of the Emotions in Man and Animals, CreateSpace Independent Publishing Platform,...
  • S.Z. Li, A.K. Jain, Handbook of Face Recognition, Springer Science & Business Media, Secaucus, NJ, USA,...
  • P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn–Kanade dataset (CK+): a complete...
  • M. Lyons et al., Automatic classification of single facial images, IEEE Trans. Pattern Anal. Mach. Intell. (1999)
  • L. Yin, X. Wei, Y. Sun, J. Wang, M. Rosato, A 3d facial expression database for facial behavior research, in: 7th...
  • P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, in: 2014 IEEE...
  • W. Liu, C. Song, Y. Wang, Facial expression recognition based on discriminative dictionary learning, in: 2012 21st...
  • I. Song, H.-J. Kim, P.B. Jeon, Deep learning for real-time robust facial expression recognition on a smartphone, in:...
  • P. Burkert, F. Trier, M.Z. Afzal, A. Dengel, M. Liwicki, Dexpression: Deep Convolutional Neural Network for Expression...
  • Y.-H. Byeon, K.-C. Kwak, Facial expression recognition using 3d convolutional neural network, International Journal of...
  • J.-J.J. Lien, T. Kanade, J. Cohn, C. Li, Detection, tracking, and classification of action units in facial expression,...
  • C.-R. Chen et al., A 0.64 mm² real-time cascade face detection design based on reduced two-field extraction, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. (2011)
  • C. Garcia et al., Convolutional face finder: a neural architecture for fast and robust face detection, IEEE Trans. Pattern Anal. Mach. Intell. (2004)
  • Z. Zhang et al., Regularized transfer boosting for face detection across spectrum, IEEE Signal Process. Lett. (2012)
  • M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, J. Movellan, Recognizing facial expression: machine...
  • P. Liu, M. Reale, L. Yin, 3d head pose estimation based on scene flow and generic head model, in: 2012 IEEE...
  • W.W. Kim, S. Park, J. Hwang, S. Lee, Automatic head pose estimation from a single camera using projective geometry, in:...
  • M. Demirkus, D. Precup, J. Clark, T. Arbel, Multi-layer temporal graphical model for head pose estimation in real-world...
  • Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu, Comparison between geometry-based and gabor-wavelets-based facial...
  • P. Yang, Q. Liu, D. Metaxas, Boosting coded dynamic features for facial action units and facial expression recognition,...
  • S. Jain, C. Hu, J. Aggarwal, Facial expression recognition with temporal modeling of shapes, in: 2011 IEEE...
  • Y. Lin et al., Sparse coding for flexible, robust 3d facial-expression synthesis, IEEE Comput. Graph. Appl. (2012)
  • S. Rifai, Y. Bengio, A. Courville, P. Vincent, M. Mirza, Disentangling factors of variation for facial expression...
  • B. Fasel, Robust face analysis using convolutional neural networks, in: Proceedings of the 16th International...
  • B. Fasel, Head-pose invariant facial expression recognition using convolutional neural networks, in: Proceedings of the...
  • Y. Bengio, Y. LeCun, Scaling learning algorithms towards AI, in: L. Bottou, O. Chapelle, D. DeCoste, J. Weston (Eds.),...

André Teixeira Lopes was born in Cachoeiro de Itapemirim, ES, Brazil, on March 19, 1992. He received the B.Sc. degree in computer science in 2013 from the Universidade Federal do Espírito Santo (UFES). He received the M.Sc. degree in computer science from the same university in 2016. Currently, he is a Ph.D. student and a member of the Laboratório de Computação de Alto Desempenho (LCAD – High Performance Computing Laboratory), both at UFES, in Vitória, ES, Brazil. His research interests include the following topics: computer vision, image processing, machine learning and computer graphics.

    Edilson de Aguiar was born in Vila Velha, ES, Brazil, on June 11, 1979. In March 2002, he received the B.Sc. degree in computer engineering from the Universidade Federal do Espírito Santo (UFES), in Vitória, ES, Brazil. He received the M.Sc. degree in computer science in December 2003 and the Ph.D. degree in computer science in December 2008, both from the Saarland University and the Max-Planck Institute for Computer Science, in Saarbrücken, Saarland, Germany. After that he worked as researcher from 2009 to 2010 at the Disney Research laboratory in Pittsburgh, USA. Since then, he has been with the Departamento de Computação e Eletrônica of UFES, in São Mateus, ES, Brazil, where he is an adjunct professor and researcher at the Laboratório de Computação de Alto Desempenho (LCAD – High Performance Computing Laboratory). His research interests are in the areas of computer graphics, computer vision, image processing and robotics. He has been involved in research projects financed through Brazilian research agencies, such as State of Espírito Santo Research Foundation (Fundação de Apoio a Pesquisa do Estado do Espírito Santo – FAPES). He has also been in the program committee and organizing committee of national and international conferences in computer science.

    Alberto F. De Souza was born in Cachoeiro de Itapemirim, ES, Brazil, on October 27, 1963. He received the B. Eng. (Cum Laude) degree in electronics engineering and M.Sc. in systems engineering in computer science from the Universidade Federal do Rio de Janeiro (COPPE/UFRJ), in Rio de Janeiro, RJ, Brazil, in 1988 and 1993, respectively; and Doctor of Philosophy (Ph.D.) in computer science from the University College London, in London, United Kingdom, in 1999. He is a professor of computer science and coordinator of the Laboratório de Computação de Alto Desempenho (LCAD – High Performance Computing Laboratory) at the Universidade Federal do Espírito Santo (UFES), in Vitória, ES, Brazil. He has authored/co-authored one USA patent and over 90 publications. He has edited proceedings of four conferences (two IEEE sponsored conferences), is a standing member of the Steering Committee of the International Conference in Computer Architecture and High Performance Computing (SBAC-PAD), senior member of the IEEE, and comendador of the order of Rubem Braga.

Thiago Oliveira-Santos was born in Vitória, ES, Brazil, on December 13, 1979. In 2004, he received the B.Sc. degree in computer engineering from the Universidade Federal do Espírito Santo (UFES), in Vitória, ES, Brazil. He received the M.Sc. degree in computer science from the same university in 2006. In 2011, he received a Ph.D. degree in biomedical engineering from the University of Bern in Switzerland, where he also worked as a post-doctoral researcher until 2013. Since then, he has been working as an adjunct professor at the Department of Computer Science of UFES in Vitória, ES, Brazil. His research activities are performed at the Laboratório de Computação de Alto Desempenho (LCAD – High Performance Computing Laboratory) and include the following topics: computer vision, image processing, computer graphics, and robotics.
