MEG: Texture operators for multi-expert gender classification
Introduction
Demographic classification of people appearing in photos and in (real-time) videos is attracting increasing interest in the scientific community, especially among biometrics researchers. Possible uses span a wide range. Among the commercially attractive ones is the possibility of improving marketing strategies and recommendation systems, which can be better tailored to the current user without explicit inquiry (for an example of the possible impact, see Kim et al., 2007). Further applications relate to Human-Computer Interaction and Ambient Intelligence, where interfaces or ambient services can be tailored to classes of users. For instance, recognizing an elderly user might automatically trigger a suitable visualization of elements on the screen, or appropriate events in the environment (Casas et al., 2008). Finally, as demonstrated in the literature, demographic information can increase the accuracy of biometric recognition in forensic and security-related applications. It is interesting to point out that demographics, and gender in particular, are mentioned first in almost all works dealing with so-called soft biometric traits. These traits are called soft because they cannot uniquely identify a subject, but only a subclass of the population. However, they can support new recognition strategies (Nixon et al., 2015), though they are somewhat reminiscent of the earliest biometric approaches derived from the work of Bertillon in the 19th century, namely Bertillonage. As a matter of fact, in one of the earliest mentions of soft traits in biometric recognition, Wayman et al. (2005) propose their use just for filtering a large biometric database. Limiting the number of entries to search in a database can greatly improve the speed of the response. However, errors in filtering can degrade the recognition performance. A different strategy to exploit soft biometrics is presented in Jain et al. (2004). In this case they are used to improve the accuracy of recognition in both verification (identity claim, 1:1 matching) and identification (no identity claim, 1:N matching) modes, by combining them with strong biometric traits (e.g., fingerprints in the cited work) to reinforce the system response. Soft biometrics are also exploited to improve face verification in the very recent work presented in Zhang et al. (2015).
The work in Klare et al. (2012) goes further, presenting experiments that show to what extent the preliminary determination of those demographics can improve the accuracy of identity recognition carried out with strong traits, e.g., face. In this case, there is neither filtering (in the sense of applying the same system to a subset of the gallery) nor a-posteriori reinforcement of the response. Rather, the idea is to train different systems on different traits or combinations of traits, and to choose the right one to which the incoming sample is submitted. The cited paper entails a human-in-the-loop approach, where an operator submits the biometric sample to the most suitable system in a set trained beforehand on different combinations of demographic traits. An alternative is presented in De Marsico et al. (2013), where no human intervention is required. The common outcome is a significant increase in performance. In this paper, we focus on the gender classification (GC) problem from face images.
It should be noted that the influence of demographics on human appearance cannot be sharply identified, and yet must be taken into account (Bekios-Calfa et al., 2011). Moreover, GC in turn can be affected by age and ethnicity. The first issue has been investigated in Guo et al. (2009), while the second one can also be related to the more general problem of the “other-race” effect. This term refers to the fact that humans are less proficient in discriminating the demographics of people from other races, unless helped by elements like clothes or hair. Psychologists often explain this by the “contact hypothesis” (Chiroro and Valentine, 1995), which suggests that the effect results from a longer and wider experience with own-race versus other-race faces, especially during childhood, when cognitive categories are acquired and consolidated. A somewhat similar hypothesis may be formulated for computational approaches, whose performance in GC can be positively affected by a suitable, ethnicity-balanced training set, when a training phase is required. As a matter of fact, the work in Gao and Ai (2009) demonstrates that ethnicity-specific gender classifiers can improve GC accuracy in a multiethnic environment. It is interesting to notice that, at a different level of detail/classification (gender vs. specific subject), this result is conceptually similar to the mentioned work in Klare et al. (2012). Regardless of the specific demographics under investigation, the benchmark dataset should be fairly balanced with respect to each factor (Torralba and Efros, 2011), or specific classifiers should be trained on specific features and then combined (De Marsico et al., 2013). This is the main motivation for choosing the EGA (Ethnicity, Gender and Age) dataset (Riccio et al., 2012) as one of the datasets used in our experiments, since particular care is devoted to this aspect.
All EGA images are annotated with the corresponding demographic information, which, in the present study, is used as the ground truth for assessing demographic classification performance. It should be underlined that the images in the collection come from other popular face datasets, and are selected so as to keep the distortions due to pose, illumination and expression (PIE) to a minimum, in order to better concentrate on demographics. As we will show in the experimental results, such controlled conditions seem to produce a homogeneity of images that also hinders GC. The 2015 NIST evaluation on GC (Ngan and Grother, 2015) highlights the difference between GC on constrained (controlled) datasets and on unconstrained (in-the-wild) ones. A clear example of the second group is The Images of Groups (GROUPS) (Gallagher and Chen, 2009), which we use for further experiments.
We propose to address the problem of automatic GC from face images by a multi-expert approach. We investigated the most appropriate choice and combination strategy from a set of local operators, each able to capture different aspects of images, to achieve accurate GC. As for combinations, we tested both feature-level and score-level fusion. According to the current literature, the former is expected to provide more accurate classification (Ross and Jain, 2004). The downside is the use of more computational resources and a more demanding training process, since the resulting feature vectors are larger, namely the sum of the sizes provided by the single experts, unless a further expensive step of feature selection/learning is performed. This also calls for more samples during training. On the contrary, when exploiting score-level fusion, experts can be trained individually on smaller vectors, which makes the training process both easier and parallelizable, since it does not require evaluating a complex multi-feature distribution. When the scores provided by the single experts are used as elements of a new feature vector for a further training/classification step, the resulting size is equal to the number of experts. Given the quality of the achieved results, we might accept this as a good compromise between accuracy and cost.
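The difference in vector size between the two fusion levels can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the three feature sizes, the synthetic data, and the nearest-centroid scorer `expert_score`, which merely stands in for whatever classifier each expert actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three experts, each yielding a feature vector of a
# different (illustrative) length for every face image.
FEATURE_SIZES = [59, 256, 128]
N_SAMPLES = 200


def make_features(n_samples, sizes, rng):
    """Return binary labels and one synthetic feature matrix per expert."""
    labels = rng.integers(0, 2, size=n_samples)
    feats = [
        rng.normal(loc=0.3 * labels[:, None], scale=1.0, size=(n_samples, d))
        for d in sizes
    ]
    return labels, feats


def expert_score(train_x, train_y, x):
    """Toy per-expert score (assumption, not the paper's classifier):
    distance to the class-0 centroid minus distance to the class-1 centroid,
    so that a positive value leans towards class 1."""
    c0 = train_x[train_y == 0].mean(axis=0)
    c1 = train_x[train_y == 1].mean(axis=0)
    return np.linalg.norm(x - c0, axis=1) - np.linalg.norm(x - c1, axis=1)


labels, feats = make_features(N_SAMPLES, FEATURE_SIZES, rng)

# Feature-level fusion: concatenate the experts' features, so the fused
# vector length is the sum of the individual sizes.
fl_vectors = np.hstack(feats)

# Score-level fusion: each expert is scored on its own features; the fused
# vector has one element per expert.
sl_vectors = np.column_stack([expert_score(f, labels, f) for f in feats])

print(fl_vectors.shape)  # (200, 443)
print(sl_vectors.shape)  # (200, 3)
```

In this sketch a second-stage classifier would be trained on 443-dimensional vectors in the feature-level case and on 3-dimensional vectors in the score-level case, and the three calls to `expert_score` are independent of each other, so they could run in parallel.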
Before proceeding, it is worth underlining that we did not aim at demonstrating the performance of either new operators or new fusion strategies. Rather, our contribution can be summarized in the following three points. 1) The achieved performance demonstrates that, when suitably applied, score fusion can provide results that are comparable to those obtained by feature fusion. 2) We further take into account the number of uncertain/ambiguous cases: even when the accuracy is similar in percentage, this number significantly decreases, i.e., fewer situations arise that possibly require a manual decision. This means that the obtained system is overall more efficient. 3) The satisfying results are achieved with quite light and popular operators, and we consider this a further added value. We find it worth underlining the novelty of point 2) in particular. To the best of our knowledge, no investigation in the literature has thoroughly taken into account the effect of different (combined) classification operators on the number of ambiguous responses. Of course, this depends on both the operator(s) exploited and the classification thresholds. Given a similar accuracy, this characteristic can be used to further differentiate classification performance and to appreciate a possibly higher robustness from this point of view.
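To make the notion of ambiguous responses concrete, the sketch below counts them for a thresholded score. The rule is an assumption for illustration only: a response is treated as ambiguous when its score falls within a band of half-width `margin` around the decision threshold, and accuracy is computed over the remaining decided cases. The function names, the margin value and the toy scores are all hypothetical; the exact definition used in the experiments is not reproduced here.

```python
import numpy as np


def classify_with_reject(scores, threshold=0.0, margin=0.1):
    """Return 1 or 0 for decided samples and -1 for ambiguous ones.

    Assumed rule: a sample is ambiguous when |score - threshold| < margin.
    """
    scores = np.asarray(scores, dtype=float)
    decisions = np.where(scores >= threshold, 1, 0)
    decisions[np.abs(scores - threshold) < margin] = -1
    return decisions


def summarize(decisions, labels):
    """Accuracy over decided samples and number of ambiguous samples."""
    labels = np.asarray(labels)
    decided = decisions != -1
    n_ambiguous = int((~decided).sum())
    if decided.any():
        accuracy = float((decisions[decided] == labels[decided]).mean())
    else:
        accuracy = float("nan")
    return accuracy, n_ambiguous


# Toy scores and ground-truth labels (illustrative values only).
scores = [0.9, 0.05, -0.4, -0.02, 0.3, -0.7]
labels = [1, 1, 0, 0, 0, 0]

decisions = classify_with_reject(scores)
accuracy, n_ambiguous = summarize(decisions, labels)
print(decisions.tolist())     # [1, -1, 0, -1, 1, 0]
print(accuracy, n_ambiguous)  # 0.75 2
```

Two systems with the same accuracy can then be compared on `n_ambiguous`: the one leaving fewer samples inside the band requires fewer manual decisions.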
This work extends Castrillón-Santana et al. (2015b) in three respects. First, we add GROUPS as a new dataset for our experiments. The use of this new testbed of a significantly different size allowed us to assess the possible influence of larger-scale datasets on performance, and to analyze and discuss some issues regarding scalability, robustness (in particular, performance degradation/stability across different data sources), and computational costs. Second, we test a larger number of local operators. Third, a deeper analysis of robustness to classification ambiguity (uncertain cases) is presented. The paper continues as follows. Section 2 summarizes some related work. Section 3 briefly describes the operation of the different local descriptors used. Section 4 illustrates score computation algorithms as well as fusion strategies, and introduces the problem of ambiguous classification. Section 5 briefly describes the datasets exploited for the experiments. We present the results of our experiments in Section 6. Section 7 closes the paper by drawing some conclusions and sketching future work.
Some related work
Face sex is particularly relevant for human interactions, and for this reason the cognitive mechanisms driving the process of GC, when carried out by humans, have often been investigated. Some interesting studies are mentioned in Fellous (1997). In Bruce and Young (1998) the authors discuss how face GC is an extremely efficient cognitive process, acquired early during childhood, able to achieve almost 100% correct guesses for unknown frontal pictures. A subset of the experiments presented
Local descriptors/experts
We consider the following collection of local descriptors, that have already been applied in different scenarios of facial analysis: 1) Local Binary Patterns (LBP) (Pietikäinen et al., 2011); 2) Local Gradient Patterns (LGP) (Jun and Kim, 2012); 3) Local Ternary Patterns (LTP) (Tan and Triggs, 2007); 4) Local Derivative Patterns (LDP) (Zhang et al., 2010); 5) Weber Local Descriptor (WLD) (Chen et al., 2010); 6) Local Phase Quantization (LPQ) (Ojansivu and Heikkilä, 2008); 7) Histogram of
Score computation
Score computation by likelihood ratio (LR). The Likelihood Ratio (LR) is used to evaluate the membership of a sample to a specific class, after learning the class statistics. It has been introduced in biometrics to separate the class of genuine probes (those belonging to users enrolled in the system), from that of impostor probes (those belonging to unregistered users). The authors of Ulery et al. (2006) experimentally assess that, consistently with the Neyman-Pearson lemma, if, when False
The image datasets
Two different datasets are considered below for this particular problem, in order to draw conclusions for different application scenarios.
Experiments and results
Below we summarize the proposed strategies in terms of descriptors and fusion policies, and the results achieved for both datasets. For each dataset, we first analyze each descriptor in the collection individually, to evaluate its best grid configuration and its relative performance. Then, we evaluate score-level (SL) and feature-level (FL) fusion, and report the combinations providing the best performance.
All results are reported in terms of accuracy, defined as the number of correct classifications in
Conclusions
In this paper we have analyzed a wide collection of local descriptors for the GC problem, increasing the variety of operators and the nature of the evaluation datasets with respect to our previous work. Our proposal deals with the fusion of the scores output by the different operators, which, as we demonstrate, offers a better trade-off between accuracy and cost than the fusion of features. We search for possible improvements in terms of both accuracy and reliability (number of ambiguous
Acknowledgments
Work partially funded by the project TIN2015 64395-R from the Spanish Ministry of Economy and Competitiveness.
References (67)
- et al. Boosting sex identification performance. Int. J. Comput. Vision (2007)
- et al. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. (2002)
- et al. WLD: A robust local image descriptor. IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- et al. Face gender classification on consumer images in a multiethnic environment. Advances in Biometrics (2009)
- et al. Soft biometric traits for personal recognition systems. ICBA 2004 (2004)
- et al. Face recognition performance: Role of demographic information. IEEE Trans. Inf. Forensics Secur. (2012)
- et al. Gender classification by combining clothing, hair and facial component classifiers. Neurocomputing (2012)
- et al. Multi-view gender classification using local binary patterns and support vector machines. Advances in Neural Networks - ISNN (2006)
- et al. Trainable classifier-fusion schemes: An application to pedestrian detection. 12th International IEEE Conference on Intelligent Transportation Systems (ITSC) (2009)
- et al. Overview of the Face Recognition Grand Challenge. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2005)