MEG: Texture operators for multi-expert gender classification

https://doi.org/10.1016/j.cviu.2016.09.004

Abstract

In this paper we focus on gender classification from face images. Despite advances in equipment as well as methods, automatic face image processing for recognition, or even just for the extraction of demographics, is still a challenging task in unrestricted scenarios. Our tests are aimed at carrying out an extensive comparison of a feature-based approach with two score-based ones. When directly using features, we first apply different operators to extract the corresponding feature vectors, and then stack such vectors. These are classified by an SVM-based approach. When using scores, the different operators are applied in a completely separate way, so that each of them produces the corresponding scores. These scores are then either fed to an SVM, or compared pairwise to exploit the Likelihood Ratio. The testbeds used for experiments are the EGA database, which presents a good balance with respect to demographic features of stored face images, and GROUPS, an increasingly popular benchmark for massive experiments. The obtained performances confirm that feature level fusion often achieves better classification accuracy. However, it is computationally expensive.
We contribute to the research on this topic in three ways: 1) we show that the proposed score level fusion approaches, though less demanding, can achieve results that are comparable to feature level fusion, or even slightly better provided that we fuse a particular set of experts; the main advantage over the feature-based approach relying on chained vectors is that it does not require evaluating a complex multi-feature distribution in the training process: thanks to the individual training of experts the overall process is more efficient and flexible, since experts can be easily added or discarded from the final architecture; 2) we evaluate the number of uncertain/ambiguous cases, i.e., those that might cause classification errors depending on the classification thresholds used, and show that with our score level fusion this number significantly decreases; regardless of the final rate of correct classifications, this results in a more robust system; 3) we achieve very good results with operators that are not computationally expensive.

Introduction

Demographic classification of people appearing in photos as well as in (real time) videos is attracting increasing interest in the scientific community, and especially among biometrics researchers. The different possible uses span a wide range. Among the commercially attractive ones, we find the possibility to improve marketing strategies and recommendation systems, which can be better tailored to the present user without explicit inquiry (for an example of the possible impact, see Kim et al., 2007). Possible further applications relate to Human-Computer Interaction and Ambient Intelligence, by allowing the tailoring of interfaces or ambient services to classes of users. For instance, recognizing an elderly user might automatically trigger a suitable visualization of elements on the screen, or appropriate events in the environment (Casas et al., 2008). Finally, as demonstrated in the literature, it is possible to increase the accuracy of biometric recognition in forensic and security-related applications. It is interesting to point out that demographics, and gender in particular, are always mentioned in the first place in almost all works dealing with the so-called soft biometric traits. Those traits are defined as soft biometrics, since they are not able to uniquely identify a subject, but rather a subclass of the population. However, they can support new recognition strategies (Nixon et al., 2015), though they are somewhat reminiscent of the earliest biometrics approaches derived from the work by Bertillon in the 19th century, namely Bertillonage. As a matter of fact, in one of the earliest mentions of soft traits in biometric recognition, Wayman (Wayman et al., 2005) proposes their use just for filtering a large biometric database. Limiting the number of entries to search in a database can greatly improve the speed of the response. However, errors in filtering can degrade the recognition performance. A different strategy to exploit soft biometrics is in Jain et al. (2004).
In this case they are used to improve the accuracy of recognition in both verification (identity claim, 1:1 matching) and identification (no identity claim, 1:N matching) modes, by using them in combination with strong biometric traits (e.g., fingerprints in the cited work) to reinforce the system response. Soft biometrics are exploited to improve face verification also in the very recent work presented in Zhang et al. (2015).

The work in Klare et al. (2012) goes further, by presenting experiments showing to what extent the preliminary determination of those demographics can improve the accuracy of identity recognition carried out by strong traits, e.g., face. In this case, there is neither filtering (in the sense of applying the same system to a subset of the gallery), nor a-posteriori reinforcement of the response. Rather, the idea is to train different systems on different traits/combinations of traits, and to choose the right one to which to submit the incoming sample. The approach presented in the cited paper entails a human-in-the-loop approach, where an operator submits the biometric sample to the most suited system in a set trained beforehand on different combinations of demographic traits. An alternative is presented in De Marsico et al. (2013), where no human intervention is required. The common outcome is a significant increase of performance. In this paper, we focus on the gender classification (GC) problem from face images.

It should be noted that the influence of demographics on human appearance cannot be sharply identified, and yet is to be taken into account (Bekios-Calfa et al., 2011). Moreover, GC in turn can be affected by age and ethnicity. The first issue has been investigated in Guo et al. (2009), while the second one can also be related to the more general problem of the “other-race” effect. This denomination refers to the fact that humans are less proficient in discriminating demographics of people from other races, if not helped by elements like clothes or hair. This is often explained by psychologists by the “contact hypothesis” (Chiroro and Valentine, 1995). This hypothesis suggests that the effect occurs as a result of a longer and wider experience with one’s own- versus other-race faces, especially during childhood, when cognitive categories are acquired and consolidated. A somewhat similar hypothesis may be formulated for computational approaches, whose performance in GC can be positively affected by a suitable, ethnicity-balanced training set, when a training phase is required. As a matter of fact, the work in Gao and Ai (2009) demonstrates that ethnicity-specific gender classifiers can improve the GC accuracy in a multiethnic environment. It is interesting to notice that, at a different level of detail/classification (gender vs. specific subject), this result is conceptually similar to the mentioned work in Klare et al. (2012). Regardless of the specific demographics under investigation, the benchmark dataset should be fairly balanced with respect to each factor (Torralba and Efros, 2011), or specific classifiers should be trained on specific features and then combined (De Marsico et al., 2013). This is the main motivation for choosing the EGA (Ethnicity, Gender and Age) dataset (Riccio et al., 2012) as one of the datasets used in our experiments, since it takes particular care of this aspect.
All EGA images are annotated with the corresponding demographic information, which, in the present study, is used as the ground truth for assessing demographic classification performance. It should be underlined that images in the collection come from other popular face datasets, and are selected so as to keep to a minimum the distortions due to pose, illumination and expression (PIE), to better concentrate on demographics. As we will show in the experimental results, this kind of controlled conditions seems to produce such homogeneity among images that it also hinders GC. The 2015 NIST evaluation on GC (Ngan and Grother, 2015) highlights the difference between GC with constrained or controlled datasets, and with unconstrained or in-the-wild ones. A clear example of the second group is The Images of Groups (GROUPS) (Gallagher and Chen, 2009), which we use for further experiments.

We propose to address the problem of automatic GC from face images by a multi-expert approach. We investigated the most appropriate choice and combination strategy from a set of local operators, each able to capture different aspects of images, to achieve accurate GC. As for combinations, we tested both feature level and score level fusion. According to current literature, the former is expected to provide more accurate classification (Ross and Jain, 2004). The downside is the use of more computational resources and of a more demanding training process, since the feature vectors obtained are of larger size, namely the sum of the sizes provided by the single experts, unless a further expensive step of feature selection/learning is performed. This also calls for more samples during training. On the contrary, when exploiting score-level fusion, experts can be trained individually, on smaller vectors, and this makes the training process both easier and parallelizable, since it does not require evaluating a complex multi-feature distribution. When the scores provided by the single experts are used as elements of a new feature vector for a further training/classification step, the resulting size is equal to the number of experts. Depending on the quality of the achieved results, we might accept this as a good compromise between accuracy and cost.
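The size argument above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the per-descriptor vector lengths are hypothetical, and a toy nearest-centroid scorer stands in for the SVM so that the example runs with the Python standard library only. It is not the classifier or the pipeline used in the paper.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class CentroidExpert:
    """Toy stand-in for an SVM (assumption): nearest-centroid scorer."""
    def fit(self, X, y):
        self.c0 = centroid([x for x, t in zip(X, y) if t == 0])
        self.c1 = centroid([x for x, t in zip(X, y) if t == 1])
        return self
    def score(self, x):
        # Positive score: x is closer to the class-1 centroid than to class 0.
        return sq_dist(x, self.c0) - sq_dist(x, self.c1)
    def predict(self, x):
        return 1 if self.score(x) > 0 else 0

def make_sample(label, sizes, rng):
    """Synthetic sample: one feature vector per (hypothetical) descriptor."""
    shift = 1.0 if label == 1 else -1.0
    return [[rng.gauss(shift, 1.0) for _ in range(s)] for s in sizes]

def accuracy(clf, X, y):
    return sum(clf.predict(x) == t for x, t in zip(X, y)) / len(y)

rng = random.Random(0)
sizes = [8, 16, 4]                         # hypothetical descriptor lengths
labels = [i % 2 for i in range(200)]
samples = [make_sample(t, sizes, rng) for t in labels]
half = len(samples) // 2
train, test = slice(0, half), slice(half, None)

# Feature-level fusion: stack all descriptor vectors, train a single classifier.
fl_X = [[v for vec in s for v in vec] for s in samples]
fl_clf = CentroidExpert().fit(fl_X[train], labels[train])

# Score-level fusion: one independently trained expert per descriptor,
# then a second-stage classifier on the vector of expert scores.
experts = [CentroidExpert().fit([s[k] for s in samples[train]], labels[train])
           for k in range(len(sizes))]
sl_X = [[e.score(s[k]) for k, e in enumerate(experts)] for s in samples]
sl_clf = CentroidExpert().fit(sl_X[train], labels[train])

fl_dim, sl_dim = len(fl_X[0]), len(sl_X[0])
fl_acc = accuracy(fl_clf, fl_X[test], labels[test])
sl_acc = accuracy(sl_clf, sl_X[test], labels[test])
print(f"feature-level: dim={fl_dim}, accuracy={fl_acc:.2f}")
print(f"score-level:   dim={sl_dim}, accuracy={sl_acc:.2f}")
```

The fused feature vector has as many components as the sum of the descriptor sizes, while the score vector has one component per expert. Since the experts are fitted independently of one another, each of them could be trained in parallel or swapped out without retraining the others.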

Before proceeding, it is worth underlining that we did not aim at demonstrating the performance of either new operators or new fusion strategies. Rather, our contribution can be summarized in the following three points. 1) The achieved performance demonstrates that, when suitably applied, score fusion can provide results that are comparable to those obtained by feature fusion. 2) We further take into account the number of uncertain/ambiguous cases, and even when the accuracy is similar in percentage, this number significantly decreases, i.e., fewer situations arise that possibly require manual decision. This means that the obtained system is overall more efficient. 3) The satisfying results are achieved by the use of quite light/popular operators, and we consider this a further added value. We find it especially worthwhile to underline the novelty of point 2). To the best of our knowledge, no investigation in the literature has thoroughly taken into account the effect of different (combined) classification operators on the number of ambiguous responses. Of course, this depends on both the operator(s) exploited and the classification thresholds. Given a similar accuracy, this characteristic can be used to further differentiate among classification performances and to appreciate a possibly higher robustness from this point of view.
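To make point 2) concrete, the sketch below counts ambiguous responses under one possible definition, which is an assumption and not necessarily the one adopted in the paper: a response is treated as ambiguous when its score lies within a given margin of the decision threshold, so that a small shift of the threshold could flip the decision. The scores, the threshold, and the margin are hypothetical values.

```python
def count_ambiguous(scores, threshold=0.0, margin=0.1):
    """Count scores lying within `margin` of the decision threshold."""
    return sum(1 for s in scores if abs(s - threshold) <= margin)

def accuracy(scores, labels, threshold=0.0):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

# Hypothetical outputs of two systems on the same six samples.
labels   = [1, 1, 1, 0, 0, 0]
system_a = [0.05, 0.90, 0.08, -0.04, -0.70, -0.02]
system_b = [0.60, 0.90, 0.55, -0.50, -0.70, -0.45]

for name, scores in (("A", system_a), ("B", system_b)):
    print(name, accuracy(scores, labels), count_ambiguous(scores))
```

Both hypothetical systems classify every sample correctly, yet system A places four of its six scores close to the threshold while system B places none there. Under this definition the two are indistinguishable by accuracy alone and differ only in the number of ambiguous responses.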

This work extends Castrillón-Santana et al. (2015b) in three respects. First, we add GROUPS as a new dataset for our experiments. The use of this new testbed of a significantly different size allowed us to assess the possible influence of larger scale datasets on performance, and to analyze and discuss some issues regarding scalability, robustness (in particular, performance degradation/stability across different data sources), and computational costs. Second, we test a larger number of local operators. Third, a deeper analysis of robustness to classification ambiguity (uncertain cases) is presented. The paper continues as follows. Section 2 summarizes some related work. Section 3 briefly describes the operation of the different local descriptors used. Section 4 illustrates score computation algorithms as well as fusion strategies, and introduces the problem of ambiguous classification. Section 5 briefly describes the datasets exploited for the experiments. We present the results of our experiments in Section 6. Section 7 closes the paper by drawing some conclusions and sketching future work.

Section snippets

Some related work

Face sex is particularly relevant for human interactions, and for this reason the cognitive mechanisms driving the process of GC, when carried out by humans, have often been investigated. Some interesting studies are mentioned in Fellous (1997). In Bruce and Young (1998) the authors discuss how face GC is an extremely efficient cognitive process, acquired early during childhood, and able to achieve almost 100% correct guesses for unknown frontal pictures. A subset of the experiments presented

Local descriptors/experts

We consider the following collection of local descriptors, that have already been applied in different scenarios of facial analysis: 1) Local Binary Patterns (LBP) (Pietikäinen et al., 2011); 2) Local Gradient Patterns (LGP) (Jun and Kim, 2012); 3) Local Ternary Patterns (LTP) (Tan and Triggs, 2007); 4) Local Derivative Patterns (LDP) (Zhang et al., 2010); 5) Weber Local Descriptor (WLD) (Chen et al., 2010); 6) Local Phase Quantization (LPQ) (Ojansivu and Heikkilä, 2008); 7) Histogram of
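As an illustration of the first descriptor in the list, the following sketch computes the basic 3x3 LBP code and a grid of per-cell histograms. It is a minimal version written under assumptions: the neighbour ordering, the 256-bin non-uniform histogram, and the hypothetical helper `grid_descriptor` are illustrative choices, and the paper's actual LBP variant and grid configurations are not reproduced here.

```python
def lbp_code(img, r, c):
    """8-bit basic LBP code of pixel (r, c) from its 3x3 neighbourhood."""
    center = img[r][c]
    # Neighbours visited clockwise, starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the center sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over the interior pixels of img."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

def grid_descriptor(img, rows, cols):
    """Concatenate the LBP histograms of a rows x cols grid of cells."""
    h, w = len(img), len(img[0])
    feat = []
    for i in range(rows):
        for j in range(cols):
            cell = [row[j * w // cols:(j + 1) * w // cols]
                    for row in img[i * h // rows:(i + 1) * h // rows]]
            feat.extend(lbp_histogram(cell))
    return feat

bright_center = [[5, 5, 5], [5, 9, 5], [5, 5, 5]]
dark_center = [[9, 9, 9], [9, 5, 9], [9, 9, 9]]
toy_image = [[(r * 7 + c * 3) % 11 for c in range(6)] for r in range(6)]
feature = grid_descriptor(toy_image, 2, 2)
print(lbp_code(bright_center, 1, 1), lbp_code(dark_center, 1, 1), len(feature))
```

The descriptor length grows with the number of grid cells (here 2 x 2 cells of 256 bins each), which is why the choice of grid directly affects the size of the vectors that a feature-level fusion has to stack.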

Score computation

Score computation by likelihood ratio (LR). The Likelihood Ratio (LR) is used to evaluate the membership of a sample to a specific class, after learning the class statistics. It has been introduced in biometrics to separate the class of genuine probes (those belonging to users enrolled in the system), from that of impostor probes (those belonging to unregistered users). The authors of Ulery et al. (2006) experimentally assess that, consistently with the Neyman-Pearson lemma, if, when False
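A likelihood-ratio decision on expert scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the score densities of the two classes are modelled as univariate Gaussians estimated from hypothetical training scores, the class names are placeholders, and the fusion of several experts by summing their log-ratios assumes that the experts are independent.

```python
import math

def fit_gaussian(values):
    """Sample mean and unbiased variance of a list of scores."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return mu, var

def log_pdf(x, mu, var):
    """Log-density of a univariate Gaussian (avoids underflow of the pdf)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def log_lr(x, model_a, model_b):
    """Log likelihood ratio log p(x | class A) - log p(x | class B)."""
    return log_pdf(x, *model_a) - log_pdf(x, *model_b)

def fused_log_lr(scores, models_a, models_b):
    """Sum of per-expert log-ratios (product of ratios, independence assumed)."""
    return sum(log_lr(s, a, b) for s, a, b in zip(scores, models_a, models_b))

# Hypothetical training scores of one expert for the two classes.
class_a_scores = [1.0, 1.2, 0.8, 1.1, 0.9]
class_b_scores = [-1.0, -0.8, -1.2, -0.9, -1.1]
model_a = fit_gaussian(class_a_scores)
model_b = fit_gaussian(class_b_scores)

for x in (0.5, -0.5):
    # Decide class A when the ratio exceeds 1, i.e. when the log-ratio is > 0.
    decision = "A" if log_lr(x, model_a, model_b) > 0 else "B"
    print(x, decision)
```

Comparing the log-ratio with 0 corresponds to a ratio threshold of 1; moving that threshold trades one kind of error for the other, which is the setting in which the Neyman-Pearson lemma states that the likelihood-ratio test is optimal.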

The image datasets

Two different datasets are considered below for this particular problem, to provide conclusions in different scenarios of applications.

Experiments and results

Below we summarize the proposed strategies in terms of descriptors and fusion policies, and the results achieved for both datasets. For each of them, we first analyze each descriptor in the whole collection individually, to evaluate their respective best grid configurations and their relative performance. Then, we evaluate score-level (SL) and feature-level (FL) fusion, and report the combinations providing the best performance.

All results are reported in terms of accuracy, defined as the number of correct classifications in

Conclusions

In this paper we have analyzed a wide collection of local descriptors for the GC problem, increasing the variety of operators, and the nature of the evaluation datasets, with respect to our previous work. Our proposal deals with the fusion of the scores output by the different operators, which we show offers a better trade-off between accuracy and cost than the fusion of features. We search for possible improvements in terms of both accuracy and reliability (number of ambiguous

Acknowledgments

Work partially funded by the project TIN2015 64395-R from the Spanish Ministry of Economy and Competitiveness.

References (67)

  • A. Ross et al.

    Multimodal biometrics: An overview

    Proc. 12th European Signal Processing Conference (EUSIPCO)

    (2004)
  • A. Torralba et al.

    Unbiased look at dataset bias

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2011)
  • B. Ulery et al.

    Studies of Biometric Fusion.

    Technical Report

    (2006)
  • CASIA-FaceV5....
  • The FEI face database....
  • T. Ahonen et al.

    Face description with local binary patterns: application to face recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2006)
  • L.A. Alexandre

    Gender recognition: A multiscale decision fusion approach

    Pattern Recog. Lett.

    (2010)
  • J. Bekios-Calfa et al.

    Revisiting linear discriminant techniques in gender recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • J. Bekios-Calfa et al.

    Robust gender recognition by exploiting facial attributes dependencies

    Pattern Recog. Lett.

    (2014)
  • V. Bruce et al.

    In the Eye of the Beholder: The Science of Face Perception

    (1998)
  • A.M. Burton et al.

    What’s the difference between men and women? Evidence from facial measurement

    Perception

    (1993)
  • L. Cao et al.

    Gender recognition from body

    Proceedings of the 16th ACM international conference on Multimedia

    (2008)
  • R. Casas et al.

    User Modelling in Ambient Intelligence for Elderly and Disabled People

    (2008)
  • M. Castrillón-Santana et al.

    Descriptors and regions of interest fusion for gender classification in the wild. Comparison and combination with convolutional neural networks

    ArXiv e-prints

    (2015)
  • M. Castrillón-Santana et al.

    Multi-scale score level fusion of local descriptors for gender classification in the wild

    Multimedia Tools and Applications

    (2016)
  • M. Castrillón-Santana et al.

    MEG: Multi-Expert Gender classification in a demographics-balanced dataset

    18th International Conference on Image Analysis and Processing (ICIAP)

    (2015)
  • A. Cellerino et al.

    Sex differences in face gender recognition in humans

    Brain Res. Bull.

    (2004)
  • Z. Chai et al.

    Local salient patterns - a novel local descriptor for face recognition

    International Conference on Biometrics (ICB)

    (2013)
  • D.G. Childers et al.

    Gender recognition from speech. Part II: Fine analysis

    J. Acoust. Soc. Am.

    (1991)
  • P. Chiroro et al.

    An investigation of the contact hypothesis of the own-race bias in face recognition

    Q. J. Exp. Psychol.

    (1995)
  • P. Dago-Casas et al.

    Single- and cross- database benchmarks for gender classification under unconstrained settings

    Proc. First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies

    (2011)
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

  • M. De Marsico et al.

    Demographics versus biometric automatic interoperability

    Image Analysis and Processing–ICIAP 2013

    (2013)