
2017 | Book

Computer Vision – ACCV 2016

13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part III


About this book

The five-volume set LNCS 10111-10115 constitutes the thoroughly refereed post-conference proceedings of the 13th Asian Conference on Computer Vision, ACCV 2016, held in Taipei, Taiwan, in November 2016.

The total of 143 contributions presented in these volumes was carefully reviewed and selected from 479 submissions. The papers are organized in topical sections on Segmentation and Classification; Segmentation and Semantic Segmentation; Dictionary Learning, Retrieval, and Clustering; Deep Learning; People Tracking and Action Recognition; People and Actions; Faces; Computational Photography; Face and Gestures; Image Alignment; Computational Photography and Image Processing; Language and Video; 3D Computer Vision; Image Attributes, Language, and Recognition; Video Understanding; and 3D Vision.

Table of Contents

Frontmatter

Computational Photography

Frontmatter
Layered Scene Reconstruction from Multiple Light Field Camera Views

We propose a framework to infer complete geometry of a scene with strong reflections or hidden by partially transparent occluders from a set of 4D light fields captured with a hand-held light field camera. For this, we first introduce a variant of bundle adjustment specifically tailored to 4D light fields to obtain improved pose parameters. Geometry is recovered in a global framework based on convex optimization for a weighted minimal surface. To allow for non-Lambertian materials and semi-transparent occluders, the point-wise costs are not based on the principle of photo-consistency. Instead, we perform a layer analysis of the light field obtained by finding superimposed oriented patterns in epipolar plane image space to obtain a set of depth hypotheses and confidence scores, which are integrated into a single functional.

Ole Johannsen, Antonin Sulc, Nico Marniok, Bastian Goldluecke
A Dataset and Evaluation Methodology for Depth Estimation on 4D Light Fields

In computer vision communities such as stereo, optical flow, or visual tracking, commonly accepted and widely used benchmarks have enabled objective comparison and boosted scientific progress. In the emergent light field community, a comparable benchmark and evaluation methodology is still missing. The performance of newly proposed methods is often demonstrated qualitatively on a handful of images, making quantitative comparison and targeted progress very difficult. To overcome these difficulties, we propose a novel light field benchmark. We provide 24 carefully designed synthetic, densely sampled 4D light fields with highly accurate disparity ground truth. We thoroughly evaluate four state-of-the-art light field algorithms and one multi-view stereo algorithm using existing and novel error measures. This consolidated state of the art may serve as a baseline to stimulate and guide further scientific progress. We publish the benchmark website http://www.lightfield-analysis.net, an evaluation toolkit, and our rendering setup to encourage submissions of both algorithms and further datasets.

Katrin Honauer, Ole Johannsen, Daniel Kondermann, Bastian Goldluecke
Radial Lens Distortion Correction Using Convolutional Neural Networks Trained with Synthesized Images

Radial lens distortion often exists in images taken by common cameras, violating the assumption of the pinhole camera model. Estimating the radial lens distortion of an image is an important preprocessing step for many vision applications. This paper employs convolutional neural networks (CNNs) to achieve radial distortion correction. However, the main issue hindering progress is the scarcity of training data with radial distortion annotations. Inspired by the growing availability of image datasets free of radial distortion, we propose a framework that addresses this issue by synthesizing radially distorted images for CNN training. In this way, a large number of images covering a wide range of radial distortion can be generated and exploited by a deep CNN with high learning capacity. We present quantitative results that demonstrate the ability of our technique to estimate radial distortion, with comparisons against several baseline methods, including an automatic method based on Hough transforms of distorted line images.

Jiangpeng Rong, Shiyao Huang, Zeyu Shang, Xianghua Ying
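
A minimal sketch of the synthesis step described in the abstract above, assuming a one-parameter division model for the radial distortion and an OpenCV-based warp; the model, its sign convention, and the focal normalization are illustrative choices rather than the authors' exact pipeline:

    import numpy as np
    import cv2

    def synthesize_radial_distortion(image, k):
        # Division model (illustrative): a pixel at distorted radius r_d
        # samples the source at undistorted radius r_u = r_d / (1 + k * r_d^2).
        h, w = image.shape[:2]
        cx, cy, f = w / 2.0, h / 2.0, max(w, h) / 2.0
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        x, y = (xs - cx) / f, (ys - cy) / f      # normalized image coordinates
        s = 1.0 / (1.0 + k * (x * x + y * y))    # radial rescaling factor
        map_x = (x * s * f + cx).astype(np.float32)
        map_y = (y * s * f + cy).astype(np.float32)
        return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)

    # Training pairs (synthesize_radial_distortion(img, k), k) for sampled k
    # let a CNN regress the distortion parameter directly from image content.
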
Ultrasound Speckle Reduction via $$L_{0}$$ Minimization

Speckle reduction is a crucial prerequisite of many computer-aided ultrasound diagnosis and treatment systems. However, most existing speckle reduction filters concentrate blurring near features and introduce hole artifacts, complicating subsequent processing. Optimization-based methods can globally distribute such blurring, leading to better feature preservation. Motivated by this, we propose a novel optimization framework based on $$L_{0}$$ minimization for feature-preserving ultrasound speckle reduction. We observe that the GAP, which integrates gradient and phase information, is far sparser in despeckled images than in speckled images. Based on this observation, we propose an $$L_{0}$$ minimization framework to remove speckle noise while simultaneously preserving features in ultrasound images. It seeks $$L_{0}$$ sparsity of the GAP values, and this sparsity is achieved by reducing small GAP values to zero in an iterative manner. Since features have larger GAP magnitudes than speckle noise, the proposed $$L_{0}$$ minimization effectively suppresses speckle noise, while the remaining GAP values corresponding to prominent features are kept unchanged, leading to better preservation of those features. In addition, we propose an efficient and robust numerical scheme that transforms the original intractable $$L_{0}$$ minimization into several sub-optimizations with quickly computable closed-form solutions. Experiments on synthetic and clinical ultrasound images demonstrate that our approach outperforms other state-of-the-art despeckling methods in terms of noise removal and feature preservation.

Lei Zhu, Weiming Wang, Xiaomeng Li, Qiong Wang, Jing Qin, Kin-Hong Wong, Pheng-Ann Heng
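
In our notation (the paper's exact functional may differ), the objective sketched in the abstract above can be written as

$$\min_{u} \sum_{p} (u_p - f_p)^2 + \lambda \, \| \mathrm{GAP}(u) \|_{0},$$

where f is the speckled input, u the despeckled output, and the $$L_{0}$$ term counts non-zero GAP values. A standard route to the closed-form sub-optimizations mentioned in the abstract is half-quadratic splitting with an auxiliary variable g:

$$\min_{u,g} \sum_{p} (u_p - f_p)^2 + \beta \sum_{p} \big(\mathrm{GAP}(u)_p - g_p\big)^2 + \lambda \|g\|_{0},$$

whose g-subproblem is solved by hard thresholding: $$g_p = 0$$ if $$\mathrm{GAP}(u)_p^2 < \lambda/\beta$$ and $$g_p = \mathrm{GAP}(u)_p$$ otherwise, so small GAP values (speckle) are driven to zero while large ones (features) are kept.
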
A Variational Model for Intrinsic Light Field Decomposition

We present a novel variational model for intrinsic light field decomposition, which is performed on four-dimensional ray space instead of a traditional 2D image. As most existing intrinsic image algorithms are designed for Lambertian objects, their performance suffers when considering scenes which exhibit glossy surfaces. In contrast, the rich structure of the light field with many densely sampled views allows us to cope with non-Lambertian objects by introducing an additional decomposition term that models specularity. Regularization along the epipolar plane images further encourages albedo and shading consistency across views. In evaluations of our method on real-world data sets captured with a Lytro Illum plenoptic camera, we demonstrate the advantages of our approach with respect to intrinsic image decomposition and specular removal.

Anna Alperovich, Bastian Goldluecke
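
A hedged sketch of such a functional in our own notation (the paper's exact terms and regularizers may differ): writing the measured light field L in the log domain as a sum of albedo a, shading s, and a specular term h over the 4D ray space,

$$\min_{a,s,h} \int_{\mathcal{R}} \big( \log L - a - s - h \big)^2 \, d\mathcal{R} + \lambda_a R_{\mathrm{EPI}}(a) + \lambda_s R_{\mathrm{EPI}}(s) + \lambda_h \| h \|_{1},$$

where $$R_{\mathrm{EPI}}$$ denotes regularization along epipolar plane images that encourages albedo and shading consistency across views; the sparsity term on h is an illustrative choice for the specular component.
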
Dense Depth-Map Estimation and Geometry Inference from Light Fields via Global Optimization

A light field camera captures abundant, densely sampled angular information in a single shot. The surface camera (SCam) model is an image that gathers the angular sample rays passing through a 3D point. By analyzing the statistics of the SCam, a consistency-depth measure is evaluated for depth estimation. However, local depth estimation still has limitations. This paper presents a global method with pixel-wise plane labels. Plane model inference at each pixel recovers not only depth but also the local geometry of the scene, which is well suited to light fields with floating disparities and continuous view variation. Second-order surface smoothness is enforced to allow locally curved surfaces. We use a random strategy to generate candidate plane parameters and refine the plane labels to avoid falling into local minima. We cast the selection of the defined labels as fusion moves with sequential proposals. The proposals are elaborately constructed to satisfy the submodularity condition under the second-order smoothness regularizer, so that the minimization can be efficiently solved by graph cuts (GC). Our method is evaluated on public light field datasets and achieves state-of-the-art accuracy.

Lipeng Si, Qing Wang
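
In a plausible notation for the energy described above (the paper's exact potentials may differ), each pixel p carries a plane label $$\omega_p = (a, b, c)$$ that induces the disparity $$d_p(\omega_p) = a x_p + b y_p + c$$, and the labeling minimizes

$$E(\omega) = \sum_{p} C_p\big(d_p(\omega_p)\big) + \lambda \sum_{(p,q,r) \in \mathcal{N}} \big| d_p(\omega_p) - 2\, d_q(\omega_q) + d_r(\omega_r) \big|,$$

where $$C_p$$ is the SCam-based consistency-depth cost and the triple-clique term is a second-order smoothness penalty that vanishes on locally planar disparity, allowing curved surfaces to be approximated by piecewise planes.
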
Direct and Global Component Separation from a Single Image Using Basis Representation

Previous research showed that the separation of direct and global components can be done with a single image by assuming that neighboring scene points have similar direct and global components, but this normally causes a loss of spatial resolution in the results. To tackle this problem, we present a novel approach for separating the direct and global components of a scene at full spatial resolution from a single captured image, which employs a linear basis representation to approximate the direct and global components. Due to the basis dependency of these two components, a high-frequency light pattern is utilized to modulate the frequency of the direct component, which effectively improves the stability of the linear model between the direct and global components. The effectiveness of our approach is demonstrated on both simulated and real images captured by a standard off-the-shelf camera and a projector mounted in a coaxial system. Our results show better visual quality and lower error compared with those obtained by the conventional single-shot approach, on both still and moving objects.

Art Subpa-asa, Ying Fu, Yinqiang Zheng, Toshiyuki Amano, Imari Sato
Ultra-Shallow DoF Imaging Using Faced Paraboloidal Mirrors

We propose a new imaging method that achieves an ultra-shallow depth of field (DoF) to clearly visualize a particular depth in a 3-D scene. The key optical device consists of a pair of faced paraboloidal mirrors with holes around their vertices. In the device, a lens-less image sensor is set at one of the holes and an object is set at the opposite side. The characteristic of the device is that the shape of the point spread function varies depending on both the position of the target 3-D point and that of the image sensor. By leveraging this characteristic, we reconstruct a clear image for a particular depth by solving a linear system involving position-dependent point spread functions. In experiments, we demonstrate the effectiveness of the proposed method using both simulation and an actual prototype imaging system.

Ryoichiro Nishi, Takahito Aoto, Norihiko Kawai, Tomokazu Sato, Yasuhiro Mukaigawa, Naokazu Yokoya
ConvNet-Based Depth Estimation, Reflection Separation and Deblurring of Plenoptic Images

In this paper, we address the problem of reflection removal and deblurring from a single image captured by a plenoptic camera. We develop a two-stage approach to recover the scene depth and high resolution textures of the reflected and transmitted layers. For depth estimation in the presence of reflections, we train a classifier through convolutional neural networks. For recovering high resolution textures, we assume that the scene is composed of planar regions and perform the reconstruction of each layer by using an explicit form of the plenoptic camera point spread function. The proposed framework also recovers the sharp scene texture with different motion blurs applied to each layer. We demonstrate our method on challenging real and synthetic images.

Paramanand Chandramouli, Mehdi Noroozi, Paolo Favaro
Learning a Mixture of Deep Networks for Single Image Super-Resolution

Single image super-resolution (SR) is an ill-posed problem which aims to recover high-resolution (HR) images from their low-resolution (LR) observations. The crux of this problem lies in learning the complex mapping between low-resolution patches and the corresponding high-resolution patches. Prior art has used either a mixture of simple regression models or a single non-linear neural network for this purpose. This paper proposes learning a mixture of SR inference modules in a unified framework to tackle this problem. Specifically, a number of SR inference modules specialized in different local image patterns are first independently applied to the LR image to obtain various HR estimates, and the resulting HR estimates are adaptively aggregated to form the final HR image. By selecting neural networks as the SR inference modules, the whole procedure can be incorporated into a unified network and optimized jointly. Extensive experiments are conducted to investigate the relation between restoration performance and different network architectures. Compared with other current image SR approaches, our proposed method consistently achieves state-of-the-art restoration results on a wide range of images while allowing more flexible design choices.

Ding Liu, Zhaowen Wang, Nasser Nasrabadi, Thomas Huang
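
A minimal sketch of the adaptive aggregation step, assuming the per-module HR estimates and per-pixel gating scores have already been computed (in the paper both come from one jointly trained network; the array shapes and softmax gating here are illustrative):

    import numpy as np

    def aggregate_sr_estimates(hr_estimates, gate_logits):
        # hr_estimates: (M, H, W) stack of HR images, one per SR inference module.
        # gate_logits:  (M, H, W) unnormalized per-pixel module-selection scores.
        w = np.exp(gate_logits - gate_logits.max(axis=0, keepdims=True))
        w /= w.sum(axis=0, keepdims=True)      # per-pixel softmax over modules
        return (w * hr_estimates).sum(axis=0)  # weighted fusion -> final HR image
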
A Fast Blind Spatially-Varying Motion Deblurring Algorithm with Camera Poses Estimation

Most existing non-uniform deblurring algorithms model the blurry image as a weighted summation of several sharp images, each warped from one latent image with a different homography. These algorithms usually suffer from high computational cost due to the huge number of homographies to be considered. To solve this problem, we introduce a novel single-image deblurring algorithm to remove spatially-varying blur. Since the real motion blur kernel is very sparse, we first estimate a feasible active set of homographies that may hold large weights in the blur kernel, and then compute the corresponding weights on these homographies to reconstruct the blur kernel. Since the size of the active set is quite small, the deblurring algorithm becomes much faster. Experimental results show that the proposed algorithm can effectively and efficiently remove the non-uniform blur caused by camera shake.

Yuquan Xu, Seiichi Mita, Silong Peng
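
The projective motion blur model the abstract refers to can be written as

$$B = \sum_{i=1}^{N} w_i \,(L \circ H_i) + n, \qquad w_i \ge 0, \quad \| \mathbf{w} \|_{0} \ll N,$$

where B is the blurry image, $$L \circ H_i$$ is the latent image L warped by homography $$H_i$$, w are the kernel weights, and n is noise. Restricting the summation to a small active set $$\mathcal{A} = \{ i : w_i > 0 \}$$, estimated before the weights themselves, is what makes the algorithm fast; the notation here is ours, not necessarily the paper's.
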
Removing Shadows from Images of Documents

In this work, we automatically detect and remove distracting shadows from photographs of documents and other text-based items. Documents typically have a constant colored background; based on this observation, we propose a technique to estimate background and text color in local image blocks. We match these local background color estimates to a global reference to generate a shadow map. Correcting the image with this shadow map produces the final unshadowed output. We demonstrate that our algorithm is robust and produces high-quality results, qualitatively and quantitatively, in both controlled and real-world settings containing large regions of significant shadow.

Steve Bako, Soheil Darabi, Eli Shechtman, Jue Wang, Kalyan Sunkavalli, Pradeep Sen
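
A rough sketch of block-wise shadow-map correction in the spirit of the abstract above; the block size, the percentile-based background estimator, and the per-channel division are illustrative stand-ins for the paper's actual estimator:

    import numpy as np
    import cv2

    def remove_document_shadow(img, block=32):
        img = img.astype(np.float32)
        h, w = img.shape[:2]
        bh, bw = h // block, w // block
        bg = np.zeros((bh, bw, 3), np.float32)
        for i in range(bh):                     # per-block background color:
            for j in range(bw):                 # paper dominates each block,
                patch = img[i*block:(i+1)*block, j*block:(j+1)*block]
                bg[i, j] = np.percentile(patch.reshape(-1, 3), 90, axis=0)
        shadow_map = cv2.resize(bg, (w, h), interpolation=cv2.INTER_LINEAR)
        ref = np.percentile(img.reshape(-1, 3), 90, axis=0)  # global reference
        out = img * ref / (shadow_map + 1e-6)   # scale each pixel to the reference
        return np.clip(out, 0, 255).astype(np.uint8)
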
Video Enhancement via Super-Resolution Using Deep Quality Transfer Network

Streaming at a low bitrate while preserving high-quality video content is a crucial topic in multimedia and video surveillance. In this work, we explore the problem of spatially and temporally reconstructing high-resolution (HR) frames from a high frame-rate low-resolution (LR) sequence and a few temporally subsampled HR frames. The targeted problem is essentially different from the problems handled by typical super-resolution (SR) methods such as single-image SR and video SR, which attempt to reconstruct HR images using only LR images. To tackle the targeted problem, we propose a deep quality transfer network, based on the convolutional neural network (CNN), which consists of modules for generation and selection of HR pixel candidates, fusion with the LR input, residual learning, and a bidirectional architecture. The proposed CNN model achieves real-time performance at the inference stage. Empirical studies verify the generality of the proposed CNN model, showing significant quality gains for video enhancement.

Pai-Heng Hsiao, Ping-Lin Chang

Face and Gestures

Frontmatter
Age Estimation Based on a Single Network with Soft Softmax of Aging Modeling

In this paper, we propose a novel approach based on a single convolutional neural network (CNN) for age estimation. In our proposed network architecture, we first model the randomness of aging with a Gaussian distribution, which is used to calculate the Gaussian integral over an age interval. Then, we present a soft softmax regression function used in the network. The new function applies the aging model to compute the loss. Compared with the traditional softmax function, the new function considers not only the chronological age but also the interval near the true age. Moreover, owing to the complexity of the Gaussian integral in the soft softmax function, a lookup table is built to accelerate this process: all the integrals over age values are calculated offline in advance. We evaluate our method on two public datasets: MORPH II and the Cross-Age Celebrity Dataset (CACD). Experimental results show that the proposed method achieves superior performance compared to the state of the art.

Zichang Tan, Shuai Zhou, Jun Wan, Zhen Lei, Stan Z. Li
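
A hedged sketch of the Gaussian-integral soft labels (one-year age intervals; the interval width, the sigma value, and the final normalization are our assumptions):

    import numpy as np
    from scipy.stats import norm

    def soft_age_labels(true_age, ages=np.arange(0, 101), sigma=2.0):
        # Probability mass of N(true_age, sigma^2) over each interval
        # [a - 0.5, a + 0.5]: a soft target instead of a one-hot label.
        p = (norm.cdf(ages + 0.5, loc=true_age, scale=sigma)
             - norm.cdf(ages - 0.5, loc=true_age, scale=sigma))
        return p / p.sum()

    # Precomputing soft_age_labels(a) for every integer age a gives the
    # lookup table used to avoid evaluating the integrals during training.
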
Illumination-Recovered Pose Normalization for Unconstrained Face Recognition

Identifying subjects with pose variations is still considered one of the most challenging problems in face recognition, despite the great progress achieved in unconstrained face recognition in recent years. The pose problem is essentially a misalignment problem combined with self-occlusion (information loss). In this paper, we propose a continuous identity-preserving face pose normalization method that produces natural results, preserving the illumination condition of the query face, based on only five fiducial landmarks. “Raw” frontalization is performed by aligning a generic 3D face model to the query face and rendering it at frontal pose, with an accurate estimate of the self-occluded part based on face borderline detection. We then apply the Quotient Image as a face-symmetry feature that is robust to illumination to fill the self-occluded part. A natural normalization result is obtained in which the self-occluded part keeps the illumination conditions of the query face. Large-scale face recognition experiments on LFW and MultiPIE achieve results comparable to state-of-the-art methods, verifying the effectiveness of the proposed method, which has the advantages of being database-independent and suitable for both face identification and face verification.

Zhongjun Wu, Weihong Deng, Zhanfu An
Local Fractional Order Derivative Vector Quantization Pattern for Face Recognition

Previous works have shown that fractional order derivatives can give a better image description than conventional integral order ones in applications such as edge detection, image segmentation, and image restoration. Motivated by this conclusion, we propose a novel local image descriptor, the local fractional order derivative vector quantization pattern (fVQP), based on local directional fractional order derivative feature vectors and vector quantization, for face recognition. Compared with integral order derivative based descriptors such as the local binary pattern (LBP), local derivative pattern (LDP), and local directional derivative pattern (LDDP), our fVQP descriptor offers better recognition performance and robustness to noise. Extensive experiments on four benchmark face databases demonstrate the superior recognition rates of fVQP compared with existing state-of-the-art descriptors for face recognition.

Jing Li, Nong Sang, Changxin Gao
Learning Facial Point Response for Alignment by Purely Convolutional Network

Face alignment is important for most facial analysis systems. Regression-based methods directly map the input face to the shape space, which makes them sensitive to the face bounding boxes. In this work, we aim to develop a model that can deal with complex non-linear variations and be invariant to face bounding box distributions, while preserving high alignment accuracy. We define a response map for each facial point: a 2D probability map indicating the likelihood that the facial point is present at each location. We solve the face alignment problem in two stages. The first is the response mapping stage, where we use a deep Purely Convolutional Network (a specialised convolutional neural network designed for the face alignment problem) to reconstruct the response maps. The second is the shape mapping stage, which processes the response maps to obtain the locations of the facial key points. We explore four functions for this stage: max, max + PCA, mean, and mean + PCA. Experiments on the 300-W dataset show that our algorithm outperforms state-of-the-art methods.

Zhenqi Xu, Weihong Deng, Jiani Hu
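
The max and mean shape-mapping functions can be sketched as follows (the PCA variants additionally project the decoded shape onto a learned shape subspace, which is omitted here):

    import numpy as np

    def max_decode(response):
        # response: (H, W) probability map for one facial point.
        y, x = np.unravel_index(np.argmax(response), response.shape)
        return np.array([x, y], dtype=np.float32)

    def mean_decode(response):
        # Expected location under the normalized response map.
        h, w = response.shape
        p = response / response.sum()
        ys, xs = np.mgrid[0:h, 0:w]
        return np.array([(p * xs).sum(), (p * ys).sum()], dtype=np.float32)
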
Random Forest with Suppressed Leaves for Hough Voting

Random forest based Hough-voting techniques have been widely used in a variety of computer vision problems. As random forests are an ensemble learning method, the voting weights of their leaf nodes play a critical role in generating reliable estimates. We propose to improve Hough voting with random forests by simultaneously optimizing the weights of the leaf votes and pruning unreliable leaf nodes in the forest. After constructing the random forest, the weight assignment problem at each tree is formulated as an $$L_{0}$$-regularized optimization problem, in which unreliable leaf nodes are suppressed with zero voting weights and trees are pruned to ignore sub-trees that contain only suppressed leaves. We apply the proposed techniques to several regression and classification problems such as hand gesture recognition, head pose estimation, and articulated pose estimation. The experimental results demonstrate that suppressing unreliable leaf nodes not only improves prediction accuracy but also reduces both the prediction time and the model complexity of the random forest.

Hui Liang, Junhui Hou, Junsong Yuan, Daniel Thalmann
Sign-Correlation Partition Based on Global Supervised Descent Method for Face Alignment

Face alignment is an essential task for facial performance capture and expression analysis. As a complex nonlinear problem in computer vision, face alignment across poses is still not well studied. Although the state-of-the-art Supervised Descent Method (SDM) has shown good performance, it learns conflicting descent directions in the whole complex space owing to the various poses and expressions. Global SDM has been presented to deal with this case by domain partition in feature and shape PCA spaces for face tracking and pose estimation. However, it is not suitable for the face alignment problem, because the ground truth shapes are unknown. In this paper we propose a sign-correlation subspace method for the domain partition of global SDM. In our method only one reduced low-dimensional subspace is needed for domain partition, thus adapting global SDM efficiently for face alignment. Unlike previous methods, we analyze the sign correlation between features and shapes and project both into a mutual sign-correlation subspace. Each pair of projected shape and feature keeps consistent signs in each dimension of the subspace, so that each hyperoctant satisfies the condition that one general descent direction exists. A set of general descent directions is then learned from the samples in different hyperoctants. Our sign-correlation partition method is validated on public face datasets covering a range of poses, showing that it can reveal their latent relationships to pose. Comparison with state-of-the-art face alignment methods demonstrates that our method outperforms them, especially in uncontrolled conditions with various poses, while maintaining comparable speed.

Yongqiang Zhang, Shuang Liu, Xiaosong Yang, Daming Shi, Jian Jun Zhang
Deep Video Code for Efficient Face Video Retrieval

In this paper, we address the problem of face video retrieval: given one face video of a person as a query, we search the database and return the most relevant face videos, i.e., those with the same class label as the query. This problem is highly challenging. For one thing, faces in videos have large intra-class variations; for another, retrieval places strict demands on space and time efficiency. To handle these challenges, this paper proposes a novel Deep Video Code (DVC) method which encodes face videos into compact binary codes. Specifically, we devise a multi-branch CNN architecture that takes face videos as training inputs, models each of them as a unified representation via a temporal feature pooling operation, and finally projects the high-dimensional representations into Hamming space to generate a single binary code per video, such that the distance between dissimilar pairs is larger than that between similar pairs by a margin. To this end, a smooth upper bound on the triplet loss function, which avoids bad local optima, is elaborately designed to preserve the relative similarity among face videos in the output space. Extensive experiments with comparison to the state of the art verify the effectiveness of our method.

Shishi Qiao, Ruiping Wang, Shiguang Shan, Xilin Chen
From Face Images and Attributes to Attributes

The face is an important part of a person's identity. Numerous applications benefit from the recent advances in the prediction of face attributes, including biometrics (such as age, gender, ethnicity) and accessories (eyeglasses, hat). We study the attributes' relations to other attributes and to face images and propose prediction models for them. We show that handcrafted features can be as good as deep features, that the attributes themselves are powerful enough to predict other attributes, and that clustering the samples according to their attributes can reduce the training complexity of deep learning. We set new state-of-the-art results on two of the largest datasets to date, CelebA and Facebook BIG5, by predicting attributes either from face images, from other attributes, or from both. In particular, on the Facebook dataset, we show that we can accurately predict personality traits (BIG5) from tens of ‘likes’ or from only a profile picture and a couple of ‘likes’, comparing favorably to a human reference.

Robert Torfason, Eirikur Agustsson, Rasmus Rothe, Radu Timofte
Learning with Ambiguous Label Distribution for Apparent Age Estimation

Annotating age classes for human facial images according to their appearance is very challenging because of dynamic, person-specific ageing patterns, which leads to a set of unreliable apparent age labels for each image. To utilise ambiguous label annotations, an intuitive strategy is to generate a pseudo age for each image, typically the average of the manually annotated ages, which is then fed into standard supervised learning frameworks designed for chronological age estimation. Alternatively, inspired by the recent success of label distribution learning, this paper introduces the novel concept of an ambiguous label distribution for apparent age estimation, developed from the following observations: (1) soft labelling alleviates the effect of inaccurate annotations, and (2) more reliable annotations should contribute more. To achieve this, the label distributions of the sparse age annotations for each image are weighted according to their reliability and then combined to construct an ambiguous label distribution. In this light, the proposed learning framework not only inherits the advantage of conventional label distribution learning in capturing latent label correlations but also exploits annotation reliability to improve robustness against inconsistent age annotations. Experimental evaluation on the FG-NET age estimation benchmark verifies its effectiveness and superior performance over state-of-the-art frameworks for apparent age estimation.

Ke Chen, Joni-Kristian Kämäräinen
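
A small sketch of constructing an ambiguous label distribution, assuming each annotation comes with a reliability score; the Gaussian form of the per-annotation distribution and the sigma value are illustrative:

    import numpy as np
    from scipy.stats import norm

    def ambiguous_label_distribution(annotations, reliabilities,
                                     ages=np.arange(0, 101), sigma=3.0):
        w = np.asarray(reliabilities, dtype=np.float64)
        w = w / w.sum()                 # more reliable annotations contribute more
        d = np.zeros(len(ages))
        for age, wi in zip(annotations, w):
            g = norm.pdf(ages, loc=age, scale=sigma)   # soft label per annotation
            d += wi * g / g.sum()
        return d / d.sum()              # combined soft target over the age range
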
Prototype Discriminative Learning for Face Image Set Classification

This paper presents a novel Prototype Discriminative Learning (PDL) method to solve the problem of face image set classification. We aim to simultaneously learn a set of prototypes for each image set and a linear discriminative transformation such that, after projection onto the target subspace, each image set is optimally classified to the same class as its nearest-neighbor prototype. For an image set, its prototypes are actually “virtual”: they do not necessarily appear in the set but are only assumed to belong to the corresponding affine hull, i.e., affine combinations of samples in the set. Thus, the proposed method not only inherits the merit of the classical affine hull in implicitly revealing unseen appearance variations in an image set, but, more importantly, overcomes the flaw caused by its overly loose affine approximation by efficiently shrinking each affine hull with a set of discriminative prototypes. The proposed method is evaluated on face identification and verification tasks on three challenging large-scale databases, YouTube Celebrities, COX and Point-and-Shoot Challenge, demonstrating its superiority over the state of the art.

Wen Wang, Ruiping Wang, Shiguang Shan, Xilin Chen
Collaborative Learning Network for Face Attribute Prediction

This paper proposes a facial attribute learning algorithm based on deep convolutional neural networks (CNNs). Instead of jointly predicting all the facial attributes (40 in our case) with a fully shared CNN feature extraction hierarchy, we cluster the facial attributes into groups, and the CNN shares features only within each group in the later feature extraction stages, predicting the attributes of each group jointly. This paper also proposes a simple yet effective attribute clustering algorithm, based on the observation that some attributes are more collaborative than others (their prediction accuracy improves more when they are jointly learned); the proposed deep network is accordingly referred to as the collaborative learning network. Contrary to previous state-of-the-art facial attribute recognition methods, which require pre-training on external datasets, the proposed collaborative learning network is trained for attribute recognition from scratch without external data, while achieving the best attribute recognition accuracy on the challenging CelebA dataset and the second best on the LFW dataset.

Shiyao Wang, Zhidong Deng, Zhenyang Wang
Facial Expression-Aware Face Frontalization

Face frontalization is a rising technique for view-invariant face analysis. It enables a non-frontal facial image to recover its general facial appearance in frontal view. A few pioneering works have been proposed very recently. However, face frontalization with detailed recovery of facial expression is still very challenging due to the non-linear relationships between head-pose and expression variations. In this paper, we propose a novel facial expression-aware face frontalization method that aims to reconstruct the frontal view while maintaining vivid appearance with regard to facial expressions. First, we design multiple face shape models as reference templates in order to fit the various shapes of facial expressions. Each template describes a set of typical facial actions from the Facial Action Coding System (FACS). Then a template matching strategy is applied by measuring a weighted Chi-square error, such that the input image is matched with the most appropriate template. Finally, the Robust Statistical face Frontalization (RSF) method is employed for the task of frontal view recovery. The method is validated on a spontaneous facial expression database, and the experimental results show that it outperforms state-of-the-art methods.

Yiming Wang, Hui Yu, Junyu Dong, Brett Stevens, Honghai Liu
Eigen-Aging Reference Coding for Cross-Age Face Verification and Retrieval

Recent works have achieved near or above human performance in traditional face recognition under PIE (pose, illumination and expression) variation. However, few works focus on the cross-age face recognition task, i.e., identifying faces of the same person at different ages. Taking human ageing into consideration broadens the application area of face recognition, but it makes it hard for existing algorithms to maintain their effectiveness. This paper presents a new reference-based approach to the cross-age problem, called Eigen-Aging Reference Coding (EARC). Different from other existing reference-based methods, our reference traces eigen faces instead of specific individuals. The proposed reference has a smaller size and contains more useful information. To the best of our knowledge, we achieve state-of-the-art performance and speed on the CACD dataset, the largest public face dataset containing significant aging information.

Kaihua Tang, Sei-ichiro Kamata, Xiaonan Hou, Shouhong Ding, Lizhuang Ma
Consistent Sparse Representation for Video-Based Face Recognition

This paper presents a novel method named Consistent Sparse Representation (CSR) to solve the problem of video-based face recognition. We treat the face images from each set as an ensemble. For each probe set, our goal is that the non-zero elements of the coefficient matrix focus, ideally, on the gallery examples from one or a few subjects. To obtain the sparse representation of a probe set, we simultaneously consider the group sparsity of the gallery sets and the probe sets. A new matrix norm (the $$l_{F,0}$$-mixed norm) is designed to describe the number of gallery sets selected to represent the probe set. The coefficient matrix is obtained by minimizing the $$l_{F,0}$$-mixed norm, which directly counts the number of gallery sets used to represent the probe set and characterizes the relations among classes better than previous sparse representation based methods. Meanwhile, a special alternating optimization strategy based on the idea of introducing auxiliary variables is adopted to solve the discontinuous optimization problem. We conduct extensive experiments on Honda, COX and several other image set databases. The results demonstrate that our method is more competitive than state-of-the-art video-based face recognition methods.

Xiuping Liu, Aihong Shen, Jie Zhang, Junjie Cao, Yanfang Zhou
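
One plausible reading of the mixed norm described above: if X is the coefficient matrix and $$X_k$$ the block of coefficients associated with the k-th gallery set, then

$$\| X \|_{F,0} = \#\big\{\, k : \| X_k \|_{F} \neq 0 \,\big\},$$

i.e., the number of gallery sets with a non-zero coefficient block, so minimizing it selects as few gallery sets as possible to represent the probe set.
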
Unconstrained Gaze Estimation Using Random Forest Regression Voting

In this paper we address the problem of automatic gaze estimation using a depth sensor under unconstrained head pose motion and large user-sensor distances. To achieve robustness, we formulate gaze estimation as a regression problem and solve it with a regression forest, chosen for its strong generalization ability when handling large training sets. We train our trees on a large synthetic training set generated with a statistical model of the human face with integrated parametric 3D eyeballs. Unlike previous works that learn the mapping function using only RGB cues represented by the eye image appearance, we integrate depth information around the face into the input vector. In our experiments, we show that our approach can handle real-data scenarios with strong head pose changes even though it is trained only on synthetic data, and we illustrate the importance of the depth information for estimation accuracy, especially in unconstrained scenarios.

Amine Kacete, Renaud Séguier, Michel Collobert, Jérôme Royan
A Novel Time Series Kernel for Sequences Generated by LTI Systems

The recent introduction of Hankelets to describe time series relies on the assumption that the time series has been generated by a vector autoregressive (VAR) model of order p. The success of Hankelet-based time series representations, prevalently in nearest-neighbor classifiers, raises the questions of whether and how this representation can be used in kernel machines without the usual adoption of mid-level representations (such as codebook-based representations). It is also of interest to investigate how this representation relates to probabilistic approaches for time series modeling, and which characteristics of the VAR model a Hankelet can capture. This paper aims at filling these gaps by deriving a time series kernel function for Hankelets (TSK4H), demonstrating the relations between the derived TSK4H and former dissimilarity/similarity scores, and highlighting an alternative probabilistic interpretation of Hankelets. Experiments with an off-the-shelf SVM implementation and extensive validation in action classification and emotion recognition on several feature representations show that the proposed TSK4H achieves state-of-the-art or even superior classification accuracy compared with past work. In contrast to state-of-the-art time series kernel functions that suffer from numerical issues and tend to produce diagonally dominant kernel matrices, empirical results suggest that the TSK4H has limited numerical issues in high-dimensional spaces. On three widely used public benchmarks, TSK4H consistently outperforms other time series kernel functions despite its simplicity and limited time complexity.

Liliana Lo Presti, Marco La Cascia
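
A sketch of one Hankelet-based kernel consistent with how Hankelets are compared; the block-row count m, the Frobenius normalization, and the specific form $$k(H_1, H_2) = \| \hat{H}_1^{\top} \hat{H}_2 \|_F^2$$ (an inner product of the $$\hat{H}\hat{H}^{\top}$$ Gram matrices, hence positive semi-definite) are our assumptions, not the published TSK4H definition:

    import numpy as np

    def hankelet(ts, m):
        # ts: (T, d) multivariate time series; column j of H stacks the
        # window ts[j], ..., ts[j + m - 1] into a single vector.
        T, d = ts.shape
        H = np.stack([ts[j:j + m].reshape(-1) for j in range(T - m + 1)], axis=1)
        return H / np.linalg.norm(H)    # Frobenius-normalized Hankelet

    def tsk4h(ts1, ts2, m=4):
        H1, H2 = hankelet(ts1, m), hankelet(ts2, m)
        # <H1 H1^T, H2 H2^T>_F = ||H1^T H2||_F^2, a valid PSD kernel value.
        return float(np.linalg.norm(H1.T @ H2) ** 2)
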
Hand Pose Regression via a Classification-Guided Approach

Hand pose estimation from a single depth image has achieved great progress in recent years; however, up-to-date methods still do not satisfy the requirements of applications such as human-computer interaction. One possible reason is that existing methods try to learn a general regression function for all types of hand depth images. To handle this problem, we propose a novel “divide-and-conquer” method comprising a classification step and a regression step. First, a convolutional neural network classifier is used to classify the input hand depth image into different types. Then, an effective and efficient multiway cascaded random forest regressor is used to estimate the 3D positions of the hand joints. Experiments demonstrate that the proposed method achieves state-of-the-art performance on a challenging dataset. Moreover, the proposed method can easily be combined with other regression methods.

Hongwei Yang, Juyong Zhang
Who’s that Actor? Automatic Labelling of Actors in TV Series Starting from IMDB Images

In this work, we aim at automatically labelling actors in a TV series. Rather than relying on transcripts and subtitles, as has been demonstrated in the past, we show how to achieve this goal starting from a set of example images of each of the main actors involved, collected from the Internet Movie Database (IMDB). The problem then becomes one of domain adaptation: actors’ IMDB photos are typically taken at awards ceremonies and are quite different from their appearances in TV series. Within each series as well, there is considerable variation in actor appearance due to makeup, lighting, ageing, etc. To bridge this gap, we propose a graph-matching based self-labelling algorithm, which we coin HSL (Hungarian Self Labelling). Further, we propose a new metric to be used in this context, as well as an extension that is more robust to outliers, in which prototypical faces for each actor are selected based on a hierarchical clustering procedure. We conduct experiments with 15 episodes from 3 different TV series and demonstrate automatic annotation with an accuracy of 90% and up.

Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars
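
One round of the Hungarian matching at the core of HSL can be sketched as below, assuming L2-comparable face descriptors for the detected face tracks and the IMDB actor prototypes; the descriptor, the cost, and the self-labelling loop around this step are our assumptions (scipy's linear_sum_assignment implements the Hungarian algorithm and accepts rectangular cost matrices):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def hungarian_match(track_feats, actor_prototypes):
        # track_feats: (T, D) one descriptor per face track;
        # actor_prototypes: (A, D) one descriptor per actor.
        cost = np.linalg.norm(track_feats[:, None, :]
                              - actor_prototypes[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
        return dict(zip(rows.tolist(), cols.tolist()))  # track index -> actor index
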
Backmatter
Metadata
Title: Computer Vision – ACCV 2016
Editors: Shang-Hong Lai, Vincent Lepetit, Ko Nishino, Yoichi Sato
Copyright Year: 2017
Electronic ISBN: 978-3-319-54187-7
Print ISBN: 978-3-319-54186-0
DOI: https://doi.org/10.1007/978-3-319-54187-7
