
About this Book

The four-volume set LNCS 8925, 8926, 8927, and 8928 comprises the thoroughly refereed post-workshop proceedings of the Workshops that took place in conjunction with the 13th European Conference on Computer Vision, ECCV 2014, held in Zurich, Switzerland, in September 2014.

The 203 workshop papers were carefully reviewed and selected for inclusion in the proceedings. They were presented at workshops with the following themes: where computer vision meets art; computer vision in vehicle technology; spontaneous facial behavior analysis; consumer depth cameras for computer vision; "chalearn" looking at people: pose recovery, action/interaction, gesture recognition; video event categorization, tagging and retrieval towards big data; computer vision with local binary pattern variants; visual object tracking challenge; computer vision + ontology applied cross-disciplinary technologies; visual perception of affordance and functional visual primitives for scene analysis; graphical models in computer vision; light fields for computer vision; computer vision for road scene understanding and autonomous driving; soft biometrics; transferring and adapting source knowledge in computer vision; surveillance and re-identification; color and photometry in computer vision; assistive computer vision and robotics; computer vision problems in plant phenotyping; and non-rigid shape analysis and deformable image alignment. Additionally, a panel discussion on video segmentation is included.



W06 - Video Event Categorization, Tagging and Retrieval towards Big Data


Grading Tai Chi Performance in Competition with RGBD Sensors

To grade objectively, referees of Tai Chi competitions must concentrate on every posture of the performer. This makes the referees prone to fatigue and thus to occasional grading mistakes. In this paper, we propose using Kinect sensors to grade automatically. First, we record the joint movements of the performer's skeleton. Then we use temporal and spatial joint differences to model the joint dynamics and configuration. We apply Principal Component Analysis (PCA) to the joint differences to reduce redundancy and noise. We then employ the non-parametric Naive-Bayes Nearest-Neighbor (NBNN) classifier to recognize the multiple categories of Tai Chi forms. To grade each form, we study the grading criteria and convert them into decisions on angles or distances between vectors. Experiments on several Tai Chi forms show the feasibility of our method.

Hui Zhang, Haipeng Guo, Chaoyun Liang, Ximin Yan, Jun Liu, Jie Weng
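The NBNN decision rule named in the abstract can be sketched in a few lines; the 2-D descriptors and form names below are invented toy data, not the paper's.

```python
def nbnn_classify(query_descs, class_descs):
    """Naive-Bayes Nearest-Neighbor: for each query descriptor, take the
    squared distance to its nearest descriptor in each class, sum those
    distances per class, and return the class with the smallest total."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    totals = {label: sum(min(d2(q, d) for d in descs) for q in query_descs)
              for label, descs in class_descs.items()}
    return min(totals, key=totals.get)

# toy "forms" described by 2-D joint-difference descriptors
classes = {
    "form_A": [(0.0, 0.1), (0.1, 0.0)],
    "form_B": [(1.0, 1.0), (0.9, 1.1)],
}
print(nbnn_classify([(0.05, 0.05), (0.1, 0.1)], classes))  # → form_A
```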

Human Action Recognition by Random Features and Hand-Crafted Features: A Comparative Study

One popular approach to human action recognition is to extract features from videos as representations and then classify those representations. In this paper, we investigate and compare hand-crafted and random feature representations for human action recognition on the YouTube dataset. The former is built on 3D HoG/HoF and SIFT descriptors, while the latter is based on random projection. Three encoding methods are adopted: Bag of Features (BoF), Sparse Coding (SC) and VLAD. A spatial-temporal pyramid and a two-layer SVM classifier are employed for classification. Our experiments demonstrate that: 1) Sparse Coding is confirmed to outperform Bag of Features; 2) a hybrid model incorporating frame-static features can significantly improve the overall recognition accuracy; 3) the frame-static features work surprisingly better than motion features alone; 4) in contrast to the success of the hand-crafted feature representation, the random feature representation does not perform well on this dataset.

Haocheng Shen, Jianguo Zhang, Hui Zhang
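A random feature representation of the kind compared above can be sketched as a fixed random Gaussian projection; the dimensions and seed below are arbitrary illustrative choices, not the paper's settings.

```python
import random

def random_projection(desc, k, seed=0):
    """Project a descriptor to k dimensions with a fixed random Gaussian
    matrix (Johnson-Lindenstrauss-style random features)."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0.0, 1.0) for _ in desc] for _ in range(k)]
    return [sum(w * x for w, x in zip(row, desc)) for row in proj]

v = [0.2, 0.5, 0.1, 0.7]
print(random_projection(v, 2))  # deterministic for a fixed seed
```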

Modeling Supporting Regions for Close Human Interaction Recognition

This paper addresses the problem of recognizing human interactions with close physical contact from videos. Different from conventional human interaction recognition, recognizing close interactions faces the problems of ambiguities in feature-to-person assignments and frequent occlusions. Therefore, it is infeasible to accurately extract the interacting people, and the recognition performance of an interaction model is degraded. We propose a patch-aware model to overcome the two problems in close interaction recognition. Our model learns discriminative supporting regions for each interacting individual. The learned supporting regions accurately extract individuals at patch level, and explicitly indicate feature assignments. In addition, our model encodes a set of body part configurations for one interaction class, which provide rich representations for frequent occlusions. Our approach is evaluated on the UT-Interaction dataset and the BIT-Interaction dataset, and achieves promising results.

Yu Kong, Yun Fu

W07 - Computer Vision with Local Binary Patterns Variants


Fast Features Invariant to Rotation and Scale of Texture

A family of novel texture representations called Ffirst, the Fast Features Invariant to Rotation and Scale of Texture, is introduced. New rotation invariants are proposed, extending the LBP-HF features and improving the recognition accuracy. Using the full set of LBP features, as opposed to uniform patterns only, leads to further improvement. Linear Support Vector Machines with an approximate $$\chi^2$$-kernel map are used for fast and precise classification.

Experimental results show that Ffirst exceeds the best reported results in texture classification on three difficult texture datasets, KTH-TIPS2a, KTH-TIPS2b and ALOT, achieving 88%, 76% and 96% accuracy respectively. The recognition rates are above 99% on the standard texture datasets KTH-TIPS, Brodatz32, UIUCTex, UMD and CUReT.

Milan Sulc, Jiri Matas
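The rotation invariance exploited by LBP-style descriptors such as Ffirst can be illustrated with the classic trick of reducing an 8-bit LBP code to its minimal circular rotation; this is a simplification for illustration, not the LBP-HF machinery the paper actually uses.

```python
def lbp_code(center, neighbors):
    """Basic 8-neighbor LBP: set bit i when neighbor i >= center."""
    return sum(1 << i for i, n in enumerate(neighbors) if n >= center)

def rotation_invariant(code, bits=8):
    """Map a code to the minimum over all circular bit rotations, so a
    rotated texture patch yields the same label."""
    mask = (1 << bits) - 1
    return min(((code >> r) | (code << (bits - r))) & mask for r in range(bits))

code = lbp_code(5, [6, 4, 4, 7, 5, 3, 2, 8])
print(rotation_invariant(code))
```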

Local Binary Patterns to Evaluate Trabecular Bone Structure from Micro-CT Data: Application to Studies of Human Osteoarthritis

Osteoarthritis (OA) causes progressive degeneration of articular cartilage and pathological changes in subchondral bone. These changes can be assessed volumetrically using micro-computed tomography ($$\mu$$CT) imaging. The local binary pattern (LBP) descriptor is a new alternative for analysing local bone structures from $$\mu$$CT scans. In this study, different trabecular bone samples were prepared from patients diagnosed with OA and treated with total knee arthroplasty. The LBP descriptor was applied to correlate the distribution of local patterns with the severity of the disease. The results obtained suggest the appearance and disappearance of specific oriented patterns with OA, as an adaptation of the bone to the decrease in cartilage thickness. The experimental results suggest that the LBP descriptor can be used to assess changes in the trabecular bone due to OA.

Jérôme Thevenot, Jie Chen, Mikko Finnilä, Miika Nieminen, Petri Lehenkari, Simo Saarakkala, Matti Pietikäinen

Impact of Topology-Related Attributes from Local Binary Patterns on Texture Classification

A general texture description model is proposed, using topology-related attributes calculated from Local Binary Patterns (LBP). The proposed framework extends and generalises existing LBP-based descriptors such as LBP rotation-invariant uniform patterns ($$\mathrm{LBP}^{riu2}$$) and Local Binary Count (LBC). Like them, it allows contrast- and rotation-invariant image description using more compact descriptors than classic LBP. However, its expressiveness, and hence its discrimination capability, is higher, since it includes additional information such as the number of connected components. The impact of the different attributes on texture classification performance is assessed through a systematic comparative evaluation on three texture datasets. The results validate the interest of the proposed approach by showing that some combinations of attributes outperform state-of-the-art LBP-based texture descriptors.

Thanh Phuong Nguyen, Antoine Manzanera, Walter G. Kropatsch

Gait-Based Person Identification Using Motion Interchange Patterns

Understanding human motion in unconstrained 2D videos has been a central theme in Computer Vision research, and over the years many attempts have been made to design effective representations of video content. In this paper, we apply to gait recognition the Motion Interchange Patterns (MIP) framework, a 3D extension of the LBP descriptors to videos that was successfully employed in action recognition. This effective framework encodes motion by capturing local changes in motion directions. Our scheme does not rely on silhouettes commonly used in gait recognition, and benefits from the capability of MIP encoding to model real world videos. We empirically demonstrate the effectiveness of this modeling of human motion on several challenging gait recognition datasets.

Gil Freidlin, Noga Levy, Lior Wolf

Micro-Facial Movements: An Investigation on Spatio-Temporal Descriptors

This paper investigates whether micro-facial movement sequences can be distinguished from neutral face sequences. As a micro-facial movement tends to be very quick and subtle, classifying when a movement occurs, compared to a face without movement, is a challenging computer vision problem. Using local binary patterns on three orthogonal planes and Gaussian derivatives, local features, when interpreted by machine learning algorithms, can accurately describe when movement and non-movement occur. This method can then be applied to aid humans in detecting the small movements when they occur. This also differs from the current literature, most of which concentrates only on emotional expression recognition. Using the CASME II dataset, the investigation of different descriptors has shown higher accuracy compared to state-of-the-art methods.

Adrian K. Davison, Moi Hoon Yap, Nicholas Costen, Kevin Tan, Cliff Lansley, Daniel Leightley

Analysis of Sampling Techniques for Learning Binarized Statistical Image Features Using Fixations and Salience

This paper studies the role of different sampling techniques in the process of learning Binarized Statistical Image Features (BSIF). It considers various sampling approaches, including random sampling and selective sampling. The selective sampling utilizes either human eye-tracking data or artificially generated fixations. To generate artificial fixations, this paper exploits salience models applied to key point localization, proposing a framework grounded on the hypothesis that the most salient points convey important information. Furthermore, it investigates the possible performance gain from training BSIF filters on class-specific data. To summarize, the contributions of this paper are as follows: 1) it studies different sampling strategies to learn BSIF filters, 2) it employs human fixations in the design of a binary operator, 3) it proposes an attention model to replicate human fixations, and 4) it studies the performance of learning application-specific BSIF filters using attention modeling.

Hamed Rezazadegan Tavakoli, Esa Rahtu, Janne Heikkilä
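The random vs. selective sampling contrast studied above can be sketched as follows; local max-min contrast stands in for a salience model, and all names and parameters are illustrative assumptions, not the paper's method.

```python
import random

def sample_patches(image, n, k=3, selective=False, rng=None):
    """Draw n patch centres either uniformly at random or, for selective
    sampling, biased toward high local contrast (a stand-in for salience)."""
    rng = rng or random.Random(0)
    h, w = len(image), len(image[0])
    candidates = [(y, x) for y in range(k, h - k) for x in range(k, w - k)]
    if not selective:
        return rng.sample(candidates, n)
    def contrast(c):
        y, x = c
        vals = [image[y + dy][x + dx] for dy in (-k, 0, k) for dx in (-k, 0, k)]
        return max(vals) - min(vals)
    return sorted(candidates, key=contrast, reverse=True)[:n]

img = [[0] * 10 for _ in range(10)]
img[5][5] = 255  # a single high-contrast spot
print(sample_patches(img, 3, k=1, selective=True))  # centres cluster near (5, 5)
```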

Facial Expression Analysis Based on High Dimensional Binary Features

High-dimensional engineered features have yielded high performance on a variety of visual recognition tasks and have attracted significant recent attention. Here, we examine the problem of expression recognition in static facial images. We first present a technique to build high-dimensional, $$\sim 60\mathrm{k}$$-element features composed of dense Census-transformed vectors based on locations defined by facial keypoint predictions. The approach yields state-of-the-art performance: 96.8% accuracy for detecting facial expressions on the well-known Cohn-Kanade plus (CK+) evaluation and 93.2% for smile detection on the GENKI dataset. We also find that the subsequent application of a linear discriminative dimensionality reduction technique can make the approach more robust when keypoint locations are less precise. We go on to explore the recognition of expressions captured under more challenging pose and illumination conditions, testing this representation on the GENKI smile detection dataset, where our high-dimensional feature technique again yields state-of-the-art performance.

Samira Ebrahimi Kahou, Pierre Froumenty, Christopher Pal
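A Census transform like the one underlying these features compares every pixel of a patch to the centre and packs the comparisons into a bit string. This minimal sketch uses a toy 3×3 patch and our own "less than centre" convention, which may differ from the paper's.

```python
def census_transform(patch):
    """Census transform of a square patch: compare every pixel to the
    centre and emit one bit per neighbor (row-major, centre skipped)."""
    n = len(patch)
    c = patch[n // 2][n // 2]
    bits = []
    for i in range(n):
        for j in range(n):
            if (i, j) == (n // 2, n // 2):
                continue
            bits.append(1 if patch[i][j] < c else 0)
    return bits

patch = [[1, 9, 2],
         [8, 5, 3],
         [4, 7, 6]]
print(census_transform(patch))  # → [1, 0, 1, 0, 1, 1, 0, 0]
```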

Weight-Optimal Local Binary Patterns

In this work, we propose a learning paradigm for obtaining weight-optimal local binary patterns (WoLBP). We first re-formulate the LBP problem as a matrix multiplication with all the bitmaps flattened, and then resort to the Fisher ratio criterion to obtain the optimal weight matrix for LBP encoding. The solution is closed-form and can be easily obtained using one eigen-decomposition. Experimental results on the FRGC ver2.0 database show that WoLBP gains significant performance improvement over traditional LBP, and that the WoLBP learning procedure can be directly ported to many other LBP variants to further improve their performance.

Felix Juefei-Xu, Marios Savvides

Some Faces are More Equal than Others: Hierarchical Organization for Accurate and Efficient Large-Scale Identity-Based Face Retrieval

This paper presents a novel method for hierarchically organizing large face databases, with application to efficient identity-based face retrieval. The method relies on metric learning with local binary pattern (LBP) features. On the one hand, LBP features have proved to be highly resilient to various appearance changes due to illumination and contrast variations while being extremely efficient to calculate. On the other hand, metric learning (ML) approaches have proved very successful for face verification 'in the wild', i.e. in uncontrolled face images with large variations in pose, expression, appearance and lighting. While such ML-based approaches compress high-dimensional features into low-dimensional spaces using discriminatively learned projections, the complexity of retrieval is still significant for large-scale databases (with millions of faces). The present paper shows that learning such discriminative projections locally while organizing the database hierarchically leads to a more accurate and efficient system. The proposed method is validated on the standard Labeled Faces in the Wild (LFW) benchmark dataset with millions of additional distracting face images collected from photos on the internet.

Binod Bhattarai, Gaurav Sharma, Frédéric Jurie, Patrick Pérez

On the Effects of Illumination Normalization with LBP-Based Watchlist Screening

Still-to-video face recognition (FR) is an important function in several video surveillance applications like watchlist screening, where faces captured over a network of video cameras are matched against reference stills belonging to target individuals. Screening of faces against a watchlist is a challenging problem due to variations in capture conditions (e.g., pose and illumination), to camera inter-operability, and to the limited number of reference stills. In holistic approaches to FR, Local Binary Pattern (LBP) descriptors are often used to represent facial captures and reference stills. Despite their efficiency, LBP descriptors are known to be sensitive to illumination changes. In this paper, the performance of still-to-video FR is compared when different passive illumination normalization techniques are applied prior to LBP feature extraction. This study focuses on representative retinex, self-quotient, diffusion, filtering, means de-noising, retina, wavelet and frequency-based techniques that are suitable for fast and accurate face screening. Experimental results obtained with videos from the Chokepoint dataset indicate that, although the Multi-Scale Weberfaces and Tan and Triggs techniques tend to outperform others, the benefits of these techniques vary considerably according to the individual and the illumination conditions. The results suggest that a combination of these techniques should be selected dynamically based on changing capture conditions.

Ibtihel Amara, Eric Granger, Abdenour Hadid

W09 - Visual Object Tracking Challenge


The Visual Object Tracking VOT2014 Challenge Results

The Visual Object Tracking challenge 2014, VOT2014, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 38 trackers are presented. The number of tested trackers makes VOT2014 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2014 challenge that go beyond its VOT2013 predecessor are introduced: (i) a new VOT2014 dataset with full annotation of targets by rotated bounding boxes and per-frame attributes, (ii) extensions of the VOT2013 evaluation methodology, (iii) a new unit for tracking speed assessment that is less dependent on the hardware, and (iv) the VOT2014 evaluation toolkit that significantly speeds up the execution of experiments. The dataset, the evaluation kit and the results are publicly available at the challenge website.


Matej Kristan, Roman Pflugfelder, Aleš Leonardis, Jiri Matas, Luka Čehovin, Georg Nebehay, Tomáš Vojíř, Gustavo Fernández, Alan Lukežič, Aleksandar Dimitriev, Alfredo Petrosino, Amir Saffari, Bo Li, Bohyung Han, CherKeng Heng, Christophe Garcia, Dominik Pangeršič, Gustav Häger, Fahad Shahbaz Khan, Franci Oven, Horst Possegger, Horst Bischof, Hyeonseob Nam, Jianke Zhu, JiJia Li, Jin Young Choi, Jin-Woo Choi, João F. Henriques, Joost van de Weijer, Jorge Batista, Karel Lebeda, Kristoffer Öfjäll, Kwang Moo Yi, Lei Qin, Longyin Wen, Mario Edoardo Maresca, Martin Danelljan, Michael Felsberg, Ming-Ming Cheng, Philip Torr, Qingming Huang, Richard Bowden, Sam Hare, Samantha YueYing Lim, Seunghoon Hong, Shengcai Liao, Simon Hadfield, Stan Z. Li, Stefan Duffner, Stuart Golodetz, Thomas Mauthner, Vibhav Vineet, Weiyao Lin, Yang Li, Yuankai Qi, Zhen Lei, Zhi Heng Niu

Weighted Update and Comparison for Channel-Based Distribution Field Tracking

There are three major issues for visual object trackers: model representation, search and model update. In this paper we address the last two issues for a specific model representation: grid-based distribution models by means of channel-based distribution fields. In particular, we address the comparison part of searching. Previous work in the area has used standard methods for comparison and update, not exploiting all the possibilities of the representation. In this work we propose two comparison schemes and one update scheme adapted to the distribution model. The proposed schemes significantly improve the accuracy and robustness on the Visual Object Tracking (VOT) 2014 Challenge dataset.

Kristoffer Öfjäll, Michael Felsberg
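Channel-based distribution fields rest on channel coding: a scalar is encoded as coefficients of overlapping cos² basis functions, a kind of soft histogram. A minimal sketch, assuming unit channel spacing; the function and parameter names are ours.

```python
import math

def channel_encode(value, centers, width=1.0):
    """Encode a scalar into overlapping cos^2 channel coefficients: each
    channel responds only within 1.5*width of its centre, so a value
    activates (at most) three neighbouring channels."""
    coeffs = []
    for c in centers:
        d = abs(value - c) / width
        coeffs.append(math.cos(math.pi * d / 3.0) ** 2 if d < 1.5 else 0.0)
    return coeffs

print(channel_encode(2.3, centers=[0, 1, 2, 3, 4]))
```

A convenient property of this basis is that the active coefficients always sum to 3/2, so the representation preserves total "mass" like a histogram bin does.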

Exploiting Contextual Motion Cues for Visual Object Tracking

In this paper, we propose an algorithm for on-line, real-time tracking of arbitrary objects in videos from unconstrained environments. The method is based on a particle filter framework using different visual features and motion prediction models. We effectively integrate a discriminative on-line learning classifier into the model and propose a new method to collect negative training examples for updating the classifier at each video frame. Instead of taking negative examples only from the surroundings of the object region, or from specific distracting objects, our algorithm samples the negatives from a contextual motion density function. We experimentally show that this type of learning improves the overall performance of the tracking algorithm. Finally, we present quantitative and qualitative results on four challenging public datasets that show the robustness of the tracking algorithm with respect to appearance and view changes, lighting variations, partial occlusions as well as object deformations.

Stefan Duffner, Christophe Garcia
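The particle filter framework the paper builds on can be sketched as one predict-weight-resample cycle; the 1-D state, likelihood, and all constants below are toy assumptions, not the paper's multi-feature model.

```python
import random

def particle_filter_step(particles, observe, motion_std=1.0, rng=None):
    """One predict-weight-resample cycle of a minimal 1-D particle filter;
    observe(x) returns the observation likelihood of state x."""
    rng = rng or random.Random(0)
    # predict: diffuse each particle with a Gaussian motion model
    moved = [p + rng.gauss(0.0, motion_std) for p in particles]
    # weight: score each particle against the observation
    w = [observe(p) for p in moved]
    total = sum(w) or 1.0
    w = [x / total for x in w]
    # resample: draw particles proportionally to their weights
    return rng.choices(moved, weights=w, k=len(moved))

# toy target near x = 10; the particle cloud starts centred at 5
likelihood = lambda x: 1.0 / (1.0 + (x - 10.0) ** 2)
parts = [float(i) for i in range(11)]
rng = random.Random(42)
for _ in range(10):
    parts = particle_filter_step(parts, likelihood, rng=rng)
print(sum(parts) / len(parts))  # the cloud drifts toward the target
```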

Clustering Local Motion Estimates for Robust and Efficient Object Tracking

We present a new short-term tracking algorithm called Best Displacement Flow (BDF). This approach is based on the idea of a 'Flock of Trackers', with two main contributions. The first is the adoption of an efficient clustering approach to identify what we term the 'Best Displacement' vector, used to update the object's bounding box. This clustering procedure is more robust than the median filter to a high percentage of outliers. The second is a procedure that we term 'Consensus-Based Reinitialization', used to reinitialize trackers that have previously been classified as outliers. For this purpose we define a new tracker state called 'transition', used to sample new trackers according to the current inlier trackers.

Mario Edoardo Maresca, Alfredo Petrosino
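The clustering idea can be sketched as picking the displacement supported by the densest neighbourhood rather than the coordinate-wise median; the eps radius and the toy flow vectors are our own illustrative choices.

```python
def best_displacement(vectors, eps=1.0):
    """Pick the displacement supported by the largest cluster: count, for
    each vector, its neighbours within eps, then average the densest group."""
    def d(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    best = max(vectors, key=lambda v: sum(d(v, u) <= eps for u in vectors))
    group = [u for u in vectors if d(best, u) <= eps]
    return (sum(x for x, _ in group) / len(group),
            sum(y for _, y in group) / len(group))

# a majority of scattered outliers would drag the median away;
# the densest cluster stays at (1, 0)
flows = [(1.0, 0.0), (1.1, 0.1), (0.9, -0.1),
         (8.0, 8.0), (9.0, 7.5), (7.5, 9.0), (10.0, 10.0)]
print(best_displacement(flows))  # → close to (1.0, 0.0)
```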

A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration

Although correlation filter-based trackers achieve competitive results in both accuracy and robustness, there is still a need to improve the overall tracking capability. In this paper, we present a very appealing tracker based on the correlation filter framework. To tackle the problem of the fixed template size in the kernel correlation filter tracker, we suggest an effective scale-adaptive scheme. Moreover, powerful features including HoG and color-naming are integrated to further boost the overall tracking performance. Extensive empirical evaluations on the benchmark videos and the VOT 2014 dataset demonstrate that the proposed tracker is very promising for various challenging scenarios. Our method successfully tracked the targets in about 72% of the videos and outperformed the state-of-the-art trackers on the benchmark dataset with 51 sequences.

Yang Li, Jianke Zhu

W10 - Computer Vision + ONTology Applied Cross-Disciplinary Technologies


Uncertainty Modeling Framework for Constraint-Based Elementary Scenario Detection in Vision Systems

Event detection has advanced significantly in the past decades, relying on pixel- and feature-level representations of video clips. Although effective, those representations have difficulty incorporating scene semantics. Ontology- and description-based approaches can explicitly embed scene semantics, but their deterministic nature is susceptible to noise from the underlying components of vision systems. We propose a probabilistic framework to handle uncertainty in a constraint-based ontology framework for event detection. This work focuses on elementary event (scenario) uncertainty and proposes probabilistic constraints to quantify the spatial relationship between a person and contextual objects. The uncertainty modeling framework is demonstrated on the detection of activities of daily living of participants in an Alzheimer's disease study, monitored by a vision system using an RGB-D sensor (Kinect, Microsoft) as input. Two evaluations were carried out: the first, a 3-fold cross-validation focusing on elementary scenario detection (n: 10 participants); the second devoted to complex scenario detection (semi-probabilistic approach, n: 45). Results show that the uncertainty modeling improves the detection of elementary scenarios in recall (e.g., In zone phone: 84 to 100%) and precision (e.g., In zone Reading: 54.5 to 85.7%), as well as the recall of complex scenarios.

Carlos Fernando Crispim-Junior, Francois Bremond

Mixing Low-Level and Semantic Features for Image Interpretation

A Framework and a Simple Case Study

Semantic Content-Based Image Retrieval (SCBIR) allows users to retrieve images via complex expressions of some ontological language describing a domain of interest. SCBIR adds flexibility to the state-of-the-art methods for image retrieval, which support queries either by keywords or by image examples. The price for this additional flexibility is the generation of a semantically rich description of the image content reflecting the ontology constraints. Generating these semantic interpretations is an open research problem. This paper contributes to this research line by proposing an approach to SCBIR based on the somewhat natural idea that the interpretation of a picture is an (onto)logical model of an ontology that describes the domain of the picture. We implement this idea in an unsupervised method that jointly exploits the ontological constraints and the low-level features of the image. The preliminary evaluation, presented in the paper, shows promising results.

Ivan Donadello, Luciano Serafini

Events Detection Using a Video-Surveillance Ontology and a Rule-Based Approach

In this paper, we propose the use of a video-surveillance ontology and a rule-based approach to detect events. The scene is described using the concepts presented in the ontology. Then, the blobs are extracted from the video stream and represented by the bounding boxes that enclose them. Finally, a set of rules is proposed and applied to videos selected from the PETS 2012 challenge that contain multiple-object events (e.g. group walking, group splitting, etc.).

Mohammed Yassine Kazi Tani, Adel Lablack, Abdelghani Ghomari, Ioan Marius Bilasco

Semantic-Analysis Object Recognition: Automatic Training Set Generation Using Textual Tags

Training sets of images for object recognition are the pillars on which classifiers base their performance. We have built a framework to support the entire process of image and textual retrieval from search engines which, given an input keyword, performs a statistical and a semantic analysis and automatically builds a training set. We have focused our attention on textual information and have explored, with several experiments, three different approaches to automatically discriminate between positive and negative images: keyword position, tag frequency and semantic analysis. We present the best results for each approach.

Sami Abduljalil Abdulhak, Walter Riviera, Nicola Zeni, Matteo Cristani, Roberta Ferrario, Marco Cristani

Characterizing Predicate Arity and Spatial Structure for Inductive Learning of Game Rules

Where do the predicates in a game ontology come from? We use RGBD vision to learn a) the spatial structure of a board, and b) the number of parameters in a move or transition. These are used to define state-transition predicates for a logical description of each game state. Given a set of videos for a game, we use an improved 3D multi-object tracking to obtain the positions of each piece in games such as 4-peg solitaire or Towers of Hanoi. The spatial positions occupied by pieces over the entire game are clustered, revealing the structure of the board. Each frame is represented as a Semantic Graph with edges encoding spatial relations between pieces. Changes in the graphs between game states reveal the structure of a “move”. Knowledge from the spatial structure and semantic graphs is mapped to FOL descriptions of the moves and used in an Inductive Logic framework to infer the valid moves and other rules of the game. Discovered predicate structures and induced rules are demonstrated for several games with varying board layouts and move structures.

Debidatta Dwibedi, Amitabha Mukerjee
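The move-structure discovery described above, diffing consecutive semantic states to expose a move predicate, can be sketched with toy states; the piece and cell names are invented for illustration.

```python
def infer_move(state_before, state_after):
    """Infer a move predicate from two frame-level semantic states:
    each piece whose cell changed yields move(piece, from, to)."""
    moved = {p: (state_before[p], state_after[p])
             for p in state_before if state_before[p] != state_after[p]}
    return [("move", p, src, dst) for p, (src, dst) in moved.items()]

before = {"peg1": "A1", "peg2": "B2"}
after  = {"peg1": "A3", "peg2": "B2"}
print(infer_move(before, after))  # → [('move', 'peg1', 'A1', 'A3')]
```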

Perceptual Narratives of Space and Motion for Semantic Interpretation of Visual Data

We propose a commonsense theory of space and motion for the high-level semantic interpretation of dynamic scenes. The theory provides primitives for commonsense representation and reasoning with qualitative spatial relations, depth profiles, and spatio-temporal change; these may be combined with probabilistic methods for modelling and hypothesising event and object relations. The proposed framework has been implemented as a general activity abstraction and reasoning engine, which we demonstrate by generating declaratively grounded visuo-spatial narratives of perceptual input from vision and depth sensors for a benchmark scenario.

Our long-term goal is to provide general tools (integrating different aspects of space, action, and change) necessary for tasks such as real-time human activity interpretation and dynamic sensor control within the purview of cognitive vision, interaction, and control.

Jakob Suchan, Mehul Bhatt, Paulo E. Santos

Multi-Entity Bayesian Networks for Knowledge-Driven Analysis of ICH Content

In this paper we introduce Multi-Entity Bayesian Networks (MEBNs) as a means to combine first-order logic with probabilistic inference and facilitate the semantic analysis of Intangible Cultural Heritage (ICH) content. First, we mention the need to capture and maintain ICH manifestations for the safeguarding of cultural treasures. Second, we present the MEBN models and stress the key features that make them a powerful tool for the aforementioned cause. Third, we present the methodology followed to build a MEBN model for the analysis of a traditional dance. Finally, we compare the efficiency of our MEBN model with that of a simple Bayesian network and demonstrate its superiority in cases that demand situation-specific treatment.

Giannis Chantas, Alexandros Kitsikidis, Spiros Nikolopoulos, Kosmas Dimitropoulos, Stella Douka, Ioannis Kompatsiaris, Nikos Grammalidis

$$\mathcal{ALC}(\mathbf{F})$$: A New Description Logic for Spatial Reasoning in Images

In image interpretation and computer vision, spatial relations between objects and spatial reasoning are of prime importance for recognition and interpretation tasks. Quantitative representations of spatial knowledge have been proposed in the literature. In the Artificial Intelligence community, logical formalisms such as ontologies have also been proposed for spatial knowledge representation and reasoning, and a challenging open problem consists in bridging the gap between these ontological representations and the quantitative ones used in image interpretation. In this paper, we propose a new description logic, named $$\mathcal{ALC}(\mathbf{F})$$, dedicated to spatial reasoning for image understanding. Our logic relies on the family of description logics equipped with concrete domains, a widely accepted way to integrate quantitative and qualitative qualities of real-world objects in the conceptual domain, into which we have integrated mathematical morphological operators as predicates. Merging description logics with mathematical morphology enables us to provide new mechanisms to derive useful concrete representations of spatial concepts and new qualitative and quantitative spatial reasoning tools. It also enables the imprecision and uncertainty of spatial knowledge to be taken into account through the fuzzy representation of spatial relations. We illustrate the benefits of our formalism on a model-guided cerebral image interpretation task.

Céline Hudelot, Jamal Atif, Isabelle Bloch
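The morphological operators that the logic embeds as concrete-domain predicates can be illustrated with a binary dilation on a grid; the "near" relation below is our toy example of turning dilation into a spatial predicate, not the paper's formalism.

```python
def dilate(region, structuring, grid):
    """Binary dilation of a set of cells by a structuring element: the
    morphological building block used to ground spatial relations."""
    w, h = grid
    out = set()
    for (x, y) in region:
        for (dx, dy) in structuring:
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                out.add((nx, ny))
    return out

# "near(ref)": dilate the reference object by a 3x3 structuring element;
# a point satisfies the predicate iff it falls in the dilated region
ref = {(2, 2)}
se = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
near = dilate(ref, se, (5, 5))
print(sorted(near))  # the 3x3 neighbourhood around (2, 2)
```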

SceneNet: A Perceptual Ontology for Scene Understanding

Scene recognition systems that attempt to deal with a large number of scene categories currently lack proper knowledge about the perceptual ontology of scene categories and would gain a significant advantage from a perceptually meaningful scene representation. In this work we perform a large-scale human study to create “SceneNet”, an online ontology database for scene understanding that organizes scene categories according to their perceptual relationships. This perceptual ontology suggests that perceptual relationships do not always conform to the semantic structure between categories, and it entails a lower-dimensional perceptual space with a “perceptually meaningful” Euclidean distance, where each embedded category is represented by a single prototype. Using the SceneNet ontology and database we derive a computational scheme for learning a non-linear mapping of scene images into the perceptual space, where each scene image is closer to its category prototype than to any other prototype by a large margin. We then demonstrate how this approach facilitates improvements in large-scale scene categorization over state-of-the-art methods and existing semantic ontologies, and how it reveals novel perceptual findings about the discriminative power of visual attributes and the typicality of scenes.

Ilan Kadar, Ohad Ben-Shahar

W11 - Visual Perception of Affordances and Functional Visual Primitives for Scene Analysis


Affordances in Video Surveillance

This paper articulates the use of affordances as the building block of an automated video surveillance system that learns and evolves over time. It grounds its arguments in visual attention hardware and affordances.

Agheleh Yaghoobi, Hamed Rezazadegan-Tavakoli, Juha Röning

Affordance-Based Object Recognition Using Interactions Obtained from a Utility Maximization Principle

The interaction of biological agents with the real world is based on their abilities and the affordances of the environment. By contrast, the classical view of perception considers only sensory features, as do most object recognition models. Only a few models make use of the information provided by the integration of sensory information with possible or executed actions. Neither the relations shaping such an integration nor the methods for using this integrated information in appropriate representations are yet entirely clear. We propose a probabilistic model integrating the two information sources in one system. The recognition process is equipped with a utility maximization principle to obtain optimal interactions with the environment.

Tobias Kluth, David Nakath, Thomas Reineking, Christoph Zetzsche, Kerstin Schill

Detecting Fine-Grained Affordances with an Anthropomorphic Agent Model

In this paper we propose an approach to distinguish affordances on a fine-grained scale. We define an anthropomorphic agent model and parameterized affordance models. The agent model is transformed according to affordance parameters to detect affordances in the input data. We present first results on distinguishing two closely related affordances derived from [...]. The promising results support our concept of fine-grained affordance detection.

Viktor Seib, Nicolai Wojke, Malte Knauf, Dietrich Paulus

A Bio-Inspired Robot with Visual Perception of Affordances

We present a visual robot whose associated neural controller develops a realistic perception of affordances. The controller uses known insect brain principles, particularly the time-stabilized sparse code communication between the Antennal Lobe and the Mushroom Body. The robot perceives the world through a webcam and Canny edge detection OpenCV routines. Self-controlled neural agents process this massive raw data and produce a time-stabilized sparse version, in which implicit time-space information is encoded. Preprocessed information is relayed to a population of neural agents specialized in cognitive activities and trained under self-critical isolated conditions. Isolation induces an emergent behavior which makes possible the invariant visual recognition of objects. This latter capacity is assembled into cognitive strings which incorporate the activation of time-elapse learning resources. By using this assembled capacity during an extended learning period, the robot finally achieves perception of affordances. The system has been tested in real time with real-world elements.

Oscar Chang

Integrating Object Affordances with Artificial Visual Attention

Affordances, e.g., grasping possibilities, play a role in the guidance of human attention. We report experiments on the integration of affordance estimation with artificial visual attention in a prototypical model. Furthermore, Growing Neural Gas is discussed as a potential framework for future attention models that deeply integrate affordance, saliency and further attentional mechanisms.

Jan Tünnermann, Christian Born, Bärbel Mertsching

Modelling Primate Control of Grasping for Robotics Applications

The neural circuits that control grasping and perform related visual processing have been studied extensively in macaque monkeys. We are developing a computational model of this system, in order to better understand its function, and to explore applications to robotics. We recently modelled the neural representation of three-dimensional object shapes, and are currently extending the model to produce hand postures so that it can be tested on a robot. To train the extended model, we are developing a large database of object shapes and corresponding feasible grasps. Finally, further extensions are needed to account for the influence of higher-level goals on hand posture. This is essential because often the same object must be grasped in different ways for different purposes. The present paper focuses on a method of incorporating such higher-level goals. A proof-of-concept exhibits several important behaviours, such as choosing from multiple approaches to the same goal. Finally, we discuss a neural representation of objects that supports fast searching for analogous objects.

Ashley Kleinhans, Serge Thill, Benjamin Rosman, Renaud Detry, Bryan Tripp

OBEliSK: Novel Knowledgebase of Object Features and Exchange Strategies

This paper presents the design and development of a system intended for storing, querying and managing all the data required for a fluent human-robot object handover process. Our system acts as a bridge between the visual perception and control systems in a robotic setup intended to collaborate with human partners, while the perception module provides information about the exchange environment. To achieve these goals, a semantic-ontological approach has been selected, favouring the system's interoperability and extensibility, complemented with a set of utilities developed ad hoc to ease knowledge inference, querying and management. As a result, the proposed knowledgebase provides a completeness level not previously reached in related state-of-the-art approaches.

David Cabañeros Blanco, Ana Belén Rodríguez Arias, Víctor Fernández-Carbajales Cañete, Joaquín Canseco Suárez

How Industrial Robots Benefit from Affordances

In this paper we discuss the potential of Gibson's affordance concept in industrial robotics. Recent advances in robotics are introducing more and more robots that collaborate with human co-workers in industrial environments; the more ambitious development of mobile manipulators for such environments has also received widespread attention. We investigate how the conventional robotic affordance concept fits pragmatic industrial robotic applications, with a focus on flexibility, re-purposing and safety.

Kai Zhou, Martijn Rooker, Sharath Chandra Akkaladevi, Gerald Fritz, Andreas Pichler

The Aspect Transition Graph: An Affordance-Based Model

In this work we introduce the Aspect Transition Graph (ATG), an affordance-based model that is grounded in the robot's own actions and perceptions. An ATG summarizes how observations of an object or the environment change in the course of interaction. Using the Robonaut 2 simulator, we demonstrate that by exploiting these learned models the robot can recognize objects and manipulate them to reach a certain goal state.

Li Yang Ku, Shiraj Sen, Erik G. Learned-Miller, Roderic A. Grupen

W12 - Graphical Models in Computer Vision


MAP-Inference on Large Scale Higher-Order Discrete Graphical Models by Fusion Moves

Many computer vision problems can be cast as optimization problems over discrete graphical models, also known as Markov or conditional random fields. Standard methods are able to solve those problems quite efficiently. However, problems with huge label spaces and/or higher-order structure remain challenging or intractable even for approximate methods.

We reconsider the work of Lempitsky et al. (2010) on fusion moves and apply it to general discrete graphical models. We propose two alternatives for calculating fusion moves that outperform the standard approach in several applications. Our generic software framework allows us to easily use different proposal generators, which span a large class of inference algorithms and thus make exhaustive evaluation feasible.

Because these fusion algorithms can be applied to models with huge label spaces and higher-order terms, they might stimulate and support research on such models that may not have been possible so far due to the lack of adequate inference methods.

Jörg Hendrik Kappes, Thorsten Beier, Christoph Schnörr
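As a toy illustration of the fusion-move idea described above: given two proposal labellings of a chain-structured MRF with Potts pairwise terms, each node makes a binary choice between its two candidates, and on a chain that binary subproblem can be solved exactly by dynamic programming. The names and the tiny energy model below are illustrative, not the authors' implementation.

```python
def energy(labels, unary, pair_w):
    """Chain-MRF energy: unary costs plus a Potts penalty pair_w
    for every pair of unequal neighbouring labels."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += sum(pair_w for a, b in zip(labels, labels[1:]) if a != b)
    return e

def fuse(prop_a, prop_b, unary, pair_w):
    """Fusion move: per node choose prop_a[i] or prop_b[i] so that the
    fused labelling minimises the chain energy (binary Viterbi DP).
    The result is never worse than either proposal."""
    n = len(prop_a)
    cand = [(prop_a[i], prop_b[i]) for i in range(n)]
    cost = [[unary[i][cand[i][z]] for z in (0, 1)] for i in range(n)]
    dp, back = [cost[0][:]], []
    for i in range(1, n):
        row, bk = [], []
        for z in (0, 1):
            # best predecessor choice, accounting for the Potts term
            best, arg = min(
                (dp[-1][zp] + (pair_w if cand[i - 1][zp] != cand[i][z] else 0), zp)
                for zp in (0, 1))
            row.append(best + cost[i][z])
            bk.append(arg)
        dp.append(row)
        back.append(bk)
    z = 0 if dp[-1][0] <= dp[-1][1] else 1
    choice = [z]
    for bk in reversed(back):   # backtrack the optimal binary choices
        z = bk[z]
        choice.append(z)
    choice.reverse()
    return [cand[i][choice[i]] for i in range(n)]
```

On general graphs the binary subproblem is solved with QPBO or graph cuts rather than Viterbi, but the guarantee is the same: the fused energy never exceeds that of either proposal.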

Feedback Loop Between High Level Semantics and Low Level Vision

High level semantic analysis typically involves constructing a Markov network over detections from low level detectors to encode context and model relationships between them. In complex higher-order networks (e.g. Markov Logic Networks), each detection can be part of many factors, and the network size grows rapidly as a function of the number of detections. Hence, to keep the network size small, a threshold is applied to the confidence measures of the detections to discard the less likely ones. A practical challenge is to decide what thresholds to use to discard noisy detections. A high threshold leads to a high false dismissal rate, while a low threshold can admit many detections, mostly noisy ones, which leads to a large network size and increased computational requirements. We propose a feedback-based incremental technique to keep the network size small. We initialize the network with detections above a high confidence threshold, and then, based on the high level semantics in the initial network, we incrementally select the relevant detections from the remaining ones that are below the threshold. We show three different ways of selecting detections, based on three scoring functions that bound the increase in the optimal value of the objective function of the network with varying degrees of accuracy and computational cost. We perform experiments with an event recognition task in one-on-one basketball videos using Markov Logic Networks.

Varun K. Nagaraja, Vlad I. Morariu, Larry S. Davis
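The seed-then-grow loop described above can be sketched as follows. The `relevance` callback is a hypothetical stand-in for the paper's three bound-based scoring functions; detections are simply `(id, confidence)` pairs.

```python
def incremental_selection(detections, high_thr, budget, relevance):
    """Seed the network with detections at or above high_thr, then greedily
    admit below-threshold detections ranked by a relevance score computed
    against the current network, up to a size budget."""
    network = [d for d in detections if d[1] >= high_thr]   # confident seeds
    pool = [d for d in detections if d[1] < high_thr]       # held-back detections
    while pool and len(network) < budget:
        # re-score the pool against the network grown so far
        best = max(pool, key=lambda d: relevance(d, network))
        pool.remove(best)
        network.append(best)
    return network
```

With `relevance = lambda d, net: d[1]` this degenerates to confidence-ordered growth; the paper's contribution is precisely in making `relevance` reflect the high level semantics of the partial network.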

How to Supervise Topic Models

Supervised topic models are important machine learning tools which have been widely used in computer vision as well as in other domains. However, there is a gap in the understanding of the impact of supervision on the model. In this paper, we present a thorough analysis of the behaviour of supervised topic models using Supervised Latent Dirichlet Allocation (SLDA) and propose two factorized supervised topic models, which factorize the topics into signal and noise. Experimental results on both synthetic data and real-world data for computer vision tasks show that supervision needs to be boosted to be effective, and that factorized topic models are able to enhance performance.

Cheng Zhang, Hedvig Kjellström

W14 - Light Fields for Computer Vision


Barcode Imaging Using a Light Field Camera

We present a method to capture sharp barcode images, using a microlens-based light field camera. Relative to standard barcode readers, which typically use fixed-focus cameras in order to reduce mechanical complexity and shutter lag, employing a light field camera significantly increases the scanner’s depth of field. However, the increased computational complexity that comes with software-based focusing is a major limitation on these approaches. Whereas traditional light field rendering involves time-consuming steps intended to produce a focus stack in which all objects appear sharply-focused, a scanner only needs to produce an image of the barcode region that falls within the decoder’s inherent robustness to defocus. With this in mind, we speed up image processing by segmenting the barcode region before refocus is applied. We then estimate the barcode’s depth directly from the raw sensor image, using a lookup table characterizing a relationship between depth and the code’s spatial frequency. Real image experiments with the Lytro camera illustrate that our system can produce a decodable image with a fraction of the computational complexity.

Xinqing Guo, Haiting Lin, Zhan Yu, Scott McCloskey
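The depth-from-spatial-frequency lookup described above can be illustrated with a minimal interpolating table. The calibration pairs below are made up for illustration; the paper builds the real table by characterizing the camera.

```python
def depth_from_frequency(freq, lut):
    """Estimate depth from the barcode's measured spatial frequency using a
    calibration lookup table lut: (frequency, depth) pairs sorted by
    frequency.  Linear interpolation between bracketing entries; queries
    outside the calibrated range are clamped to the nearest endpoint."""
    if freq <= lut[0][0]:
        return lut[0][1]
    if freq >= lut[-1][0]:
        return lut[-1][1]
    for (f0, d0), (f1, d1) in zip(lut, lut[1:]):
        if f0 <= freq <= f1:
            t = (freq - f0) / (f1 - f0)   # fractional position in the bracket
            return d0 + t * (d1 - d0)
```

Because the barcode's bar pitch is known, its apparent spatial frequency in the raw sensor image is a monotone cue for depth, which is what makes a simple table like this sufficient.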

Depth Estimation for Glossy Surfaces with Light-Field Cameras

Light-field cameras have now become available in both consumer and industrial applications, and recent papers have demonstrated practical algorithms for depth recovery from a passive single-shot capture. However, current light-field depth estimation methods are designed for Lambertian objects and fail or degrade for glossy or specular surfaces. Because light-field cameras have an array of micro-lenses, the captured data allows modification of both focus and perspective viewpoints. In this paper, we develop an iterative approach to use the benefits of light-field data to estimate and remove the specular component, improving the depth estimation. The approach enables light-field data depth estimation to support both specular and diffuse scenes. We present a physically-based method that estimates one or multiple light source colors. We show our method outperforms current state-of-the-art diffuse and specular separation and depth estimation algorithms in multiple real world scenarios.

Michael W. Tao, Ting-Chun Wang, Jitendra Malik, Ravi Ramamoorthi

Accurate Disparity Estimation for Plenoptic Images

In this paper we propose a post-processing pipeline to accurately recover the views (light field) from the raw data of a plenoptic camera such as the Lytro, and to estimate disparity maps in a novel way from such a light field. First, the microlens centers are estimated and the raw image is demultiplexed without demosaicking it beforehand. Then, we present a new block-matching algorithm to estimate disparities for the mosaicked plenoptic views. Our algorithm fully exploits the configuration given by the plenoptic camera: (i) the views are horizontally and vertically rectified and have the same baseline, and therefore (ii) at each point, the vertical and horizontal disparities are the same. Our strategy of demultiplexing without demosaicking avoids image artifacts due to view cross-talk and helps estimate more accurate disparity maps. Finally, we compare our results with state-of-the-art methods.

Neus Sabater, Mozhdeh Seifi, Valter Drazic, Gustavo Sandri, Patrick Pérez

SocialSync: Sub-Frame Synchronization in a Smartphone Camera Network

SocialSync is a sub-frame synchronization protocol for capturing images simultaneously using a smartphone camera network. By synchronizing image captures to within a frame period, multiple smartphone cameras, which are often in use in social settings, can be used for a variety of applications including light field capture, depth estimation, and free viewpoint television. Currently, smartphone camera networks are limited to capturing static scenes due to motion artifacts caused by frame misalignment. Because frame misalignment in smartphone camera networks is caused by variability in the camera system, we characterize frame capture on mobile devices by analyzing the statistics of camera setup latency and frame delivery within an Android app. Next, we develop the SocialSync protocol to achieve sub-frame synchronization between devices by estimating frame capture timestamps to within millisecond accuracy. Finally, we demonstrate the effectiveness of SocialSync on mobile devices by reducing motion-induced artifacts when recovering the light field.

Richard Latimer, Jason Holloway, Ashok Veeraraghavan, Ashutosh Sabharwal

Depth and Arbitrary Motion Deblurring Using Integrated PSF

In recent years, research on recovering depth blur and motion blur in images has made significant progress. In particular, progress in computational photography has enabled us to generate all-in-focus images and control the depth of field in images. However, the simultaneous recovery of depth and motion blurs remains a difficult problem, and the recoverable motion blurs are limited.

In this paper, we show that by moving a camera during the exposure, the PSF of the whole image becomes invariant, and motion deblurring and all-in-focus imaging can be achieved simultaneously. In particular, motion blurs caused by arbitrary multiple motions can be recovered. The validity and the advantages of the proposed method are shown by real image experiments and synthetic image experiments.

Takeyuki Kobayashi, Fumihiko Sakaue, Jun Sato

Acquiring 4D Light Fields of Self-Luminous Light Sources Using Programmable Filter

Self-luminous light sources in the real world often have nonnegligible sizes and radiate light inhomogeneously. Acquiring the model of such a light source is highly important for accurate image synthesis and understanding. In this paper, we propose a method for measuring 4D light fields of self-luminous extended light sources by using a liquid crystal (LC) panel, i.e. a programmable filter, and a diffuse-reflection board. The proposed method recovers the 4D light field from the images of the board illuminated by the light radiated from a light source and passing through the LC panel. We exploit the fact that the transmittance of the LC panel can be controlled both spatially and temporally. The proposed method enables us to utilize multiplexed sensing, and is therefore able to acquire 4D light fields more efficiently and densely than the straightforward method. We implemented a prototype setup, and confirmed through a number of experiments that the proposed method is effective for modeling self-luminous extended light sources in the real world.

Motohiro Nakamura, Takahiro Okabe, Hendrik P. A. Lensch

Light Field from Smartphone-Based Dual Video

In this work, we introduce a light field acquisition approach for standard smartphones. The smartphone is manually translated along a horizontal rail while recording synchronized video with the front and rear cameras. The front camera captures a control pattern, mounted parallel to the direction of translation, to determine the smartphone's current position. This information is used during a post-processing step to identify an equally spaced subset of recorded frames from the rear camera, which captures the actual scene. From this data we assemble a light field representation of the scene. For subsequent disparity estimation, we apply a structure tensor approach to the epipolar plane images.

We evaluate our method by comparing the light fields resulting from manual translation of the smartphone against those recorded with a constantly moving translation stage.

Bernd Krolla, Maximilian Diebold, Didier Stricker
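Selecting the equally spaced subset of rear-camera frames might look like the sketch below, under the assumption that a per-frame rail position has already been recovered from the front-camera control pattern (the function and its arguments are illustrative, not the authors' code).

```python
def equally_spaced_frames(positions, n_views):
    """positions: per-frame rail position (e.g. in mm) of the smartphone,
    recovered from the front-camera control pattern.  Returns the indices
    of the n_views frames closest to equally spaced target positions,
    yielding an (approximately) uniform-baseline light field."""
    lo, hi = min(positions), max(positions)
    targets = [lo + k * (hi - lo) / (n_views - 1) for k in range(n_views)]
    return [min(range(len(positions)), key=lambda i: abs(positions[i] - t))
            for t in targets]
```

The residual deviation between the chosen frame's position and its target is exactly what the paper's comparison against a constantly moving translation stage quantifies.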

W15 - Computer Vision for Road Scene Understanding and Autonomous Driving


Ten Years of Pedestrian Detection, What Have We Learned?

Paper-by-paper results make it easy to miss the forest for the trees. We analyse the remarkable progress of the last decade by discussing the main ideas explored in the 40+ detectors currently present in the Caltech pedestrian detection benchmark. We observe that there exist three families of approaches, all currently reaching similar detection quality. Based on our analysis, we study the complementarity of the most promising ideas by combining multiple published strategies. This new decision forest detector achieves the current best known performance on the challenging Caltech-USA dataset.

Rodrigo Benenson, Mohamed Omran, Jan Hosang, Bernt Schiele

Fast 3-D Urban Object Detection on Streaming Point Clouds

Efficient and fast object detection from continuously streamed 3-D point clouds has a major impact on many related research tasks, such as autonomous driving, self-localization and mapping, and understanding large-scale environments. This paper presents a LIDAR-based framework which provides fast detection of 3-D urban objects from point cloud sequences of a Velodyne HDL-64E terrestrial LIDAR scanner installed on a moving platform. The pipeline of our framework receives raw streams of 3-D data and produces distinct groups of points which belong to different urban objects. In the proposed framework we present a simple yet efficient hierarchical grid data structure and corresponding algorithms that significantly improve the processing speed of the object detection task. Furthermore, we show that this approach confidently handles streaming data and provides a speedup of two orders of magnitude, with increased detection accuracy, compared to a baseline connected component analysis algorithm.

Attila Börcs, Balázs Nagy, Csaba Benedek

Relative Pose Estimation and Fusion of Omnidirectional and Lidar Cameras

This paper presents a novel approach for the extrinsic parameter estimation of omnidirectional cameras with respect to a 3D Lidar coordinate frame. The method works without a specific setup or calibration targets, using only a pair of 2D-3D data. Pose estimation is formulated as a 2D-3D nonlinear shape registration task which is solved without point correspondences or complex similarity metrics. It relies on a set of corresponding regions, and pose parameters are obtained by solving a small system of nonlinear equations. The efficiency and robustness of the proposed method were confirmed on both synthetic and real data in urban environments.

Levente Tamas, Robert Frohlich, Zoltan Kato

Good Edgels to Track: Beating the Aperture Problem with Epipolar Geometry

An open issue in multiple view geometry and structure from motion, applied to real-life scenarios, is the sparsity of the matched key-points and of the reconstructed point cloud. We present an approach that can significantly improve the density of measured displacement vectors in a sparse matching or tracking setting, exploiting the partial information of the motion field provided by linear oriented image patches (edgels). Our approach assumes that the epipolar geometry of an image pair has already been computed, either in an earlier feature-based matching step, or by a robustified differential tracker. We exploit key-points of a lower order, edgels, which cannot provide a unique 2D matching, but can be employed if a constraint on the motion is already given. We present a method to extract edgels, which can be effectively tracked given a known camera motion scenario, and show how a constrained version of the Lucas-Kanade tracking procedure can efficiently exploit epipolar geometry to reduce the classical KLT optimization to a 1D search problem. The potential of the proposed methods is shown by experiments performed on real driving sequences.

Tommaso Piccini, Mikael Persson, Klas Nordberg, Michael Felsberg, Rudolf Mester

W16 - Soft Biometrics


Facial Age Estimation Through the Fusion of Texture and Local Appearance Descriptors

Automatic extraction of soft biometric characteristics from face images is a very prolific field of research. Among these soft biometrics, age estimation can be very useful for several applications, such as advanced video surveillance, demographic statistics collection, business intelligence and customer profiling, and search optimization in large databases. However, estimating age in uncontrolled environments, with insufficient and incomplete training data, dealing with strong person-specificity and high within-range variance, can be very challenging. These difficulties have been addressed in the past with complex and strongly hand-crafted descriptors, which make it difficult to replicate and compare the validity of posterior classification schemes. This paper presents a simple yet effective approach which fuses and exploits texture- and local appearance-based descriptors to achieve faster and more accurate results. A series of local descriptors and their combinations have been evaluated under a diversity of settings, and the extensive experiments carried out on two large databases (MORPH and FRGC) demonstrate state-of-the-art results over previous work.

Ivan Huerta, Carles Fernández, Andrea Prati

Privacy of Facial Soft Biometrics: Suppressing Gender But Retaining Identity

We consider the problem of perturbing a face image in such a way that it cannot be used to ascertain soft biometric attributes such as age, gender and race, but can be used for automatic face recognition. Such an exercise is useful for extending different levels of privacy to a face image in a central database. In this work, we focus on masking the gender information in a face image with respect to an automated gender estimation scheme, while retaining its ability to be used by a face matcher. To facilitate this privacy-enhancing technique, the input face image is combined with another face image via a morphing scheme resulting in a mixed image. The mixing process can be used to progressively modify the input image such that its gender information is progressively suppressed; however, the modified images can still be used for recognition purposes if necessary. Preliminary experiments on the MUCT database suggest the potential of the scheme in imparting “differential privacy” to face images.

Asem Othman, Arun Ross
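The progressive-mixing idea can be illustrated with a plain convex pixel blend. The actual paper uses a landmark-based morphing scheme, so this is only the intensity-blend intuition; `mix_faces` and its arguments are illustrative names.

```python
def mix_faces(probe, donor, alpha):
    """Convex blend of two aligned, equal-size grayscale images
    (lists of pixel rows).  alpha=1 returns the probe unchanged;
    lowering alpha progressively suppresses probe-specific cues
    (e.g. gender) while the identity signal degrades more slowly."""
    return [[alpha * p + (1 - alpha) * d for p, d in zip(rp, rd)]
            for rp, rd in zip(probe, donor)]
```

Sweeping `alpha` from 1 toward 0 is the knob that trades gender suppression against face-matcher utility, which is the trade-off the paper evaluates on MUCT.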

Exploring the Magnitude of Human Sexual Dimorphism in 3D Face Gender Classification

Human faces demonstrate clear Sexual Dimorphism (SD) in conveying gender. Different faces, even of the same gender, convey different magnitudes of sexual dimorphism. However, in gender classification, gender has been interpreted discretely as either male or female; the exact magnitude of the sexual dimorphism in each gender is ignored. In this paper, we propose to evaluate the SD magnitude using the ratio of votes from the Random Forest algorithm performed on 3D geometric features related to the face morphology. Faces are then separated into two groups according to their SD magnitude. In a first set of experiments, when training is performed with scans of SD magnitude similar to that of the testing scan, the classification accuracy improves. In a second set of experiments, the scans with a low magnitude of SD demonstrate higher gender discrimination power than the ones with a high SD magnitude. With a fusion method, our method achieves a 97.46 % gender classification rate on the 466 earliest 3D scans of FRGCv2 (mainly neutral), and 97.18 % on the whole FRGCv2 dataset (with expressions).

Baiqiang Xia, Boulbaba Ben Amor, Mohamed Daoudi
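The vote-ratio idea is easy to sketch: given the per-tree gender votes of a trained Random Forest, the winning-vote fraction serves as a continuous SD score. This is a minimal stand-in, not the authors' pipeline; the 3D geometric features and the forest training are assumed to have happened upstream.

```python
def sd_magnitude(tree_votes):
    """tree_votes: one 'M'/'F' vote per tree in the Random Forest.
    Returns (predicted_gender, vote_ratio); the winning-vote ratio in
    [0.5, 1.0] acts as a continuous sexual-dimorphism (SD) score that
    can be thresholded to split faces into SD-magnitude groups."""
    m = tree_votes.count('M')
    f = len(tree_votes) - m
    pred = 'M' if m >= f else 'F'
    return pred, max(m, f) / len(tree_votes)
```

A face on which the forest is nearly unanimous carries strong SD; a near 50/50 split marks a weakly dimorphic face.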

Towards Predicting Good Users for Biometric Recognition Based on Keystroke Dynamics

This paper studies ways to detect good users for biometric recognition based on keystroke dynamics. Keystroke dynamics is an active research field for the biometric scientific community. Despite the great efforts made during the last decades, the performance of keystroke dynamics recognition systems is far from the performance achieved by traditional hard biometrics. This is very pronounced for some users, who generate many recognition errors even with the most sophisticated recognition algorithms. On the other hand, previous works have demonstrated that some other users behave particularly well even with the simplest recognition algorithms. Our purpose here is to study ways to distinguish such classes of users using only the genuine enrollment data. The experiments comprise a public database and two popular recognition algorithms. The results show the effectiveness of the Kullback-Leibler divergence as a quality measure for categorizing users, in comparison with four other statistical measures.

Aythami Morales, Julian Fierrez, Javier Ortega-Garcia
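A minimal sketch of the Kullback-Leibler quality measure over binned keystroke timing histograms; the binning and smoothing choices here are assumptions, not the paper's exact settings.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q) between two binned timing histograms (raw counts),
    with additive smoothing so empty bins do not produce log(0).
    A user whose enrollment histogram diverges strongly from the
    population histogram can be flagged as atypical."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)   # normalize counts to probabilities
    return sum((x / zp) * math.log((x / zp) / (y / zq))
               for x, y in zip(p, q))
```

Note that KL divergence is asymmetric and non-negative, vanishing only when the two normalized histograms coincide, which is what makes it usable as a per-user quality score.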

How Much Information Kinect Facial Depth Data Can Reveal About Identity, Gender and Ethnicity?

Human face images acquired using conventional 2D cameras may have inherent restrictions that hinder the inference of some specific information in the face. Low-cost depth sensors such as the Microsoft Kinect, introduced in late 2010, allow extracting 3D information directly, together with RGB color images. This provides new opportunities for computer vision and face analysis research. Although more accurate sensors for detailed facial image analysis are expected to be available soon (e.g. Kinect 2), this paper investigates the usefulness of the depth images provided by current Microsoft Kinect sensors in different face analysis tasks. We conduct an in-depth study comparing the performance of the depth images provided by Microsoft Kinect sensors against their RGB counterparts in three face analysis tasks, namely identity, gender and ethnicity. Four local feature extraction methods are investigated for both face texture and shape description. Moreover, the two modalities (i.e. depth and RGB) are fused to gain insight into their complementarity. The experimental analysis conducted on two publicly available Kinect face databases, EurecomKinect and Curtinfaces, yields interesting results.

Elhocine Boutellaa, Messaoud Bengherabi, Samy Ait-Aoudia, Abdenour Hadid

An Overview of Research Activities in Facial Age Estimation Using the FG-NET Aging Database

The FG-NET aging database was released in 2004 in an attempt to support research activities related to facial aging. Since then a number of researchers have used the database for carrying out research in various disciplines related to facial aging. Based on an analysis of published work where the FG-NET aging database was used, we present conclusions related to the type of research carried out and the impact of the dataset in shaping up the research topic of facial aging. In particular we focus our attention on the topic of age estimation, which proved to be the most popular among users of the FG-NET aging database. Through the review of key papers in age estimation and the presentation of benchmark results, the main approaches and directions in facial aging are outlined, and future trends, requirements and research directions are drafted.

Gabriel Panis, Andreas Lanitis

Gender Classification from Iris Images Using Fusion of Uniform Local Binary Patterns

This paper is concerned with analyzing iris texture in order to determine “soft biometric” attributes of a person, rather than identity. In particular, it is concerned with predicting the gender of a person based on analysis of features of the iris texture. Previous researchers have explored various approaches for predicting the gender of a person based on iris texture. We explore different implementations of Local Binary Patterns on the iris image using the masked information. Uniform LBP with concatenated histograms significantly improves the accuracy of gender prediction relative to using the whole iris image. Using a subject-disjoint test set, we are able to achieve over 91 % correct gender prediction using the texture of the iris. To our knowledge, this is the highest accuracy yet achieved for predicting gender from iris texture.

Juan E. Tapia, Claudio A. Perez, Kevin W. Bowyer
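A minimal sketch of uniform LBP coding (8 neighbours, radius 1). For the concatenated-histogram variant used above, the iris image is divided into blocks and the per-block histograms computed below are concatenated into one feature vector; masking of occluded pixels is omitted here for brevity.

```python
def lbp_code(img, y, x):
    """8-neighbour LBP code at (y, x): bit i is set when neighbour i
    (clockwise from top-left) is >= the centre pixel."""
    c = img[y][x]
    nb = [img[y-1][x-1], img[y-1][x], img[y-1][x+1], img[y][x+1],
          img[y+1][x+1], img[y+1][x], img[y+1][x-1], img[y][x-1]]
    return sum(1 << i for i, v in enumerate(nb) if v >= c)

def is_uniform(code):
    """Uniform patterns have at most 2 circular 0/1 transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

UNIFORM = sorted(c for c in range(256) if is_uniform(c))   # 58 codes
BIN = {c: i for i, c in enumerate(UNIFORM)}                # + 1 misc bin

def uniform_lbp_histogram(img):
    """59-bin uniform-LBP histogram over the interior pixels of img;
    all non-uniform codes share the last bin."""
    hist = [0] * (len(UNIFORM) + 1)
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[BIN.get(lbp_code(img, y, x), len(UNIFORM))] += 1
    return hist
```

Concatenating block-wise histograms preserves coarse spatial layout, which is what gives the uniform-LBP variant its edge over a single whole-image histogram.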

Evaluation of Texture Descriptors for Automated Gender Estimation from Fingerprints

Gender is an important demographic attribute. In the context of biometrics, gender information can be used to index databases or enhance the recognition accuracy of primary biometric traits. A number of studies have demonstrated that gender can be automatically deduced from face images. However, few studies have explored the possibility of automatically estimating gender information from fingerprint images; consequently, there is a limited understanding of this topic. Fingerprints being a widely adopted biometric, gender cues from the fingerprint image would significantly aid commercial applications and forensic investigations. This study explores the use of classical texture descriptors - Local Binary Pattern (LBP), Local Phase Quantization (LPQ), Binarized Statistical Image Features (BSIF) and Local Ternary Pattern (LTP) - to estimate gender from fingerprint images. The robustness of these descriptors to various types of image degradation is evaluated. Experiments conducted on the WVU fingerprint dataset suggest the efficacy of the LBP descriptor in encoding gender information from good-quality fingerprints. The BSIF descriptor is observed to be robust to partial fingerprints, while LPQ is observed to work well on blurred fingerprints. However, gender estimation accuracy for fingerprints is much lower than that for face, suggesting that more work is necessary on this topic.

Ajita Rattani, Cunjian Chen, Arun Ross

Recognition of Facial Attributes Using Adaptive Sparse Representations of Random Patches

It is well known that some facial attributes, like soft biometric traits, can increase the performance of traditional biometric systems and help recognition based on human descriptions. In addition, other facial attributes, like facial expressions, can be used in human-computer interfaces, image retrieval, talking heads and human emotion analysis. This paper addresses the problem of automated recognition of facial attributes by proposing a new general approach called Adaptive Sparse Representation of Random Patches (ASR+). In the learning stage, random patches are extracted from representative face images of each class (e.g., in gender recognition, a two-class problem, images of females/males) in order to construct representative dictionaries. In the testing stage, random test patches of the query image are extracted, and for each test patch a dictionary is built by concatenating the ‘best’ representative dictionary of each class. Using this adapted dictionary, each test patch is classified following the Sparse Representation Classification (SRC) methodology. Finally, the query image is classified by patch voting. Our approach is thus able to learn a model for each recognition task while dealing with a large degree of variability in ambient lighting, pose, expression, occlusion, face size and distance from the camera. Experiments were carried out on seven face databases in order to recognize facial expression, gender, race and disguise. Results show that ASR+ deals well with unconstrained conditions, outperforming various representative methods from the literature in many complex scenarios.

Domingo Mery, Kevin Bowyer
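The SRC-plus-voting testing stage described above can be sketched in a few lines of NumPy. This is a generic illustration, not the paper's implementation: the sparse coder here is a simple orthogonal matching pursuit stand-in, and the per-class dictionaries are assumed to be given as matrices whose columns are patch descriptors:

```python
import numpy as np

def omp(D, y, k):
    # greedy orthogonal matching pursuit: k-sparse code of y over dictionary D
    residual, idx = y.astype(float).copy(), []
    x = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in idx:
            idx.append(j)
        x, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ x
    code = np.zeros(D.shape[1])
    code[idx] = x
    return code

def src_label(dicts, patch, k):
    # SRC step: code the patch over the concatenated class dictionaries,
    # then keep the class whose own atoms leave the smallest residual
    D = np.concatenate(dicts, axis=1)
    code = omp(D, patch, k)
    errors, start = [], 0
    for Dc in dicts:
        part = np.zeros_like(code)
        part[start:start + Dc.shape[1]] = code[start:start + Dc.shape[1]]
        errors.append(np.linalg.norm(patch - D @ part))
        start += Dc.shape[1]
    return int(np.argmin(errors))

def classify_by_voting(dicts, patches, k):
    # the query image takes the majority label over its random patches
    votes = [src_label(dicts, p, k) for p in patches]
    return max(set(votes), key=votes.count)
```

The residual comparison is the core of SRC: a patch from a female face should be reconstructed well by atoms from the female dictionary and poorly by the others, and voting over many random patches makes the image-level decision robust to occluded or uninformative patches.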

Person Identification in Natural Static Postures Using Kinect

Automatic person identification using unobtrusive methods is of immense importance in computer vision. Anthropometric approaches are robust to external factors including environmental illumination and obstructions due to hair, spectacles, hats or any other wearable. Recently, there have been efforts on people identification using the walking pattern of skeleton data obtained from Kinect. In this paper we investigate the possibility of identification using static postures, namely sitting and standing. Existing gait-based identification mostly relies on the dynamics of the joints of the skeleton data. In the case of static postures no motion information is available, so identification mainly relies on the static distance information between the joints. Moreover, the variation of pose within a particular posture makes identification more challenging. The proposed methodology first sub-divides the body parts into static, dynamic and noisy parts, followed by a combinatorial element responsible for selectively extracting features for each of those parts. Finally a radial basis function support vector machine classifier is used for training and testing. Results indicate an identification accuracy of more than 97 % in terms of F-score for 10 people, using a dataset created with various poses of natural sitting and standing postures.

Vempada Ramu Reddy, Kingshuk Chakravarty, S. Aniruddha
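The static-distance idea above can be made concrete with a small sketch: turn one Kinect skeleton frame into a vector of all pairwise inter-joint distances, then match it against enrolled subjects. The nearest-centroid matcher below is a simple stand-in for the paper's RBF-SVM, and the joint count and layout are illustrative assumptions:

```python
import numpy as np

def static_pose_feature(joints):
    # joints: (J, 3) Kinect joint positions for one static frame.
    # With no motion available, the identity cue is anthropometric, so
    # use all pairwise Euclidean inter-joint distances as the descriptor.
    diff = joints[:, None, :] - joints[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return dist[np.triu_indices(joints.shape[0], k=1)]

def identify(train_feats, train_labels, query):
    # stand-in for the paper's RBF-SVM classifier: nearest class centroid
    # on the same features, enough to show the pipeline end to end
    labels = sorted(set(train_labels))
    centroids = [np.mean([f for f, l in zip(train_feats, train_labels) if l == lab],
                         axis=0) for lab in labels]
    return labels[int(np.argmin([np.linalg.norm(query - c) for c in centroids]))]
```

Note that the distances are deliberately not normalised by body scale: overall limb length is itself a discriminative anthropometric cue between subjects.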

Activity-Based Person Identification Using Discriminative Sparse Projections and Orthogonal Ensemble Metric Learning

In this paper, we propose an activity-based human identification approach using discriminative sparse projections (DSP) and orthogonal ensemble metric learning (OEML). Unlike gait recognition, which recognizes a person only from his/her walking activity, this study aims to identify people from more general types of human activities such as eating, drinking and running, because people may not always walk in the scene, in which case gait recognition fails. Given an activity video, the human body mask in each frame is first extracted by background subtraction. Then, we propose a DSP method that maps these body masks into a low-dimensional subspace and simultaneously clusters them into a number of clusters to form a dictionary. Subsequently, each video clip is pooled into a histogram feature for activity representation. Lastly, we propose an OEML method to learn a similarity distance metric that exploits discriminative information for recognition. Experimental results show the effectiveness of our approach, which achieves a better recognition rate than state-of-the-art methods.

Haibin Yan, Jiwen Lu, Xiuzhuang Zhou
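The pooling step in the pipeline above, with the projection and dictionary already learned, reduces to projecting each frame's body mask, assigning it to its nearest cluster, and normalising the counts. This sketch assumes flattened binary masks and a given projection matrix; how DSP actually learns the projection and clusters jointly is not shown:

```python
import numpy as np

def pool_activity_histogram(frames, projection, centers):
    # frames:     (T, D) flattened per-frame body masks from one video clip
    # projection: (D, d) learned low-dimensional map (DSP in the paper)
    # centers:    (K, d) cluster centres acting as the dictionary
    z = frames @ projection                                   # project each frame
    assign = ((z[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    hist = np.bincount(assign, minlength=len(centers)).astype(float)
    return hist / hist.sum()                                  # clip-level descriptor
```

The resulting fixed-length histogram is what the learned distance metric (OEML in the paper) would then compare between video clips.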

Facial Ethnic Appearance Synthesis

In this work, we explore several subspace reconstruction methods for facial ethnic appearance synthesis (FEAS). In our experiments, our proposed dual subspace modeling using the Fukunaga-Koontz transform (FKT) yields much better facial ethnic synthesis results than $$\ell _1$$ minimization, $$\ell _2$$ minimization and the principal component analysis (PCA) reconstruction method. With that, we are able to automatically and efficiently synthesize different facial ethnic appearances and alter the facial ethnic appearance of a query image to any other ethnic appearance as desired. Our technique preserves the facial structure of the query image well while simultaneously synthesizing the skin tone and ethnic features that best match the target ethnicity group. Facial ethnic appearance synthesis can be applied to synthesizing facial images of a particular ethnicity group for an unbalanced database, and can be used to train ethnicity-invariant classifiers by generating multiple ethnic appearances of the same subject in the training stage.

Felix Juefei-Xu, Marios Savvides

