About this book

The four-volume set LNCS 7724--7727 constitutes the thoroughly refereed post-conference proceedings of the 11th Asian Conference on Computer Vision, ACCV 2012, held in Daejeon, Korea, in November 2012. The total of 226 contributions presented in these volumes was carefully reviewed and selected from 869 submissions. The papers are organized in topical sections on object detection, learning and matching; object recognition; feature, representation, and recognition; segmentation, grouping, and classification; image representation; image and video retrieval and medical image analysis; face and gesture analysis and recognition; optical flow and tracking; motion, tracking, and computational photography; video analysis and action recognition; shape reconstruction and optimization; shape from X and photometry; applications of computer vision; low-level vision and applications of computer vision.



Poster Session 3 (Continued): Segmentation, Grouping, and Classification

Multi-layer Spectral Clustering for Video Segmentation

Video segmentation with spatial priority suffers from an incoherence problem, since the presegments of consecutive frames may be very different. To address this problem, this paper proposes an effective and scalable approach for video segmentation that aims to cluster video pixels that are coherent in both appearance and motion. We build a multi-layer graph based on multiple segmentations of the video frames, where each presegment corresponds to a vertex in the graph and each layer corresponds to the segmentation result of the mean shift algorithm at a specific granularity. Three types of edges are connected in the graph, and the corresponding affinities convey local grouping cues from intra-frame, inter-frame and inter-layer neighborhoods. The task of video segmentation is then formulated as graph partitioning, which can be solved efficiently by the power iteration clustering algorithm. Both qualitative and quantitative experimental results demonstrate the efficacy of the proposed method.

Xiaofei Di, Hong Chang, Xilin Chen
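
The graph-partition step named in the abstract above can be illustrated with a minimal sketch of power iteration clustering on a tiny hand-made affinity matrix. This is not the authors' implementation: the function names are hypothetical, the affinities are invented for illustration, and the final grouping uses a simple largest-gap split into two clusters where k-means would normally be used.

```python
def power_iteration_embedding(affinity, iterations=10):
    """Return a 1-D embedding from truncated power iteration on D^-1 A."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    # Row-normalise the affinity matrix: W = D^-1 A.
    w = [[affinity[i][j] / degrees[i] for j in range(n)] for i in range(n)]
    # Start from the degree vector scaled to unit L1 norm.
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(iterations):
        v = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v


def split_by_largest_gap(values):
    """Two-way split of a 1-D embedding at its largest gap (simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[k + 1]] - values[order[k]]
            for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for k in range(cut + 1, len(order)):
        labels[order[k]] = 1
    return labels
```

The iteration is stopped early on purpose: run to convergence, the embedding becomes constant, whereas after a few steps vertices in the same tightly connected group already share nearly the same value.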

Video Co-segmentation

Segmentation of a single image is in general a highly underconstrained problem. A frequent approach to solving it is to provide prior knowledge or constraints on what the objects of interest look like (in terms of their shape, size, color, location or structure). Image co-segmentation trades the need for such knowledge for something much easier to obtain, namely, additional images showing the object from other viewpoints. The segmentation problem is then posed as one of differentiating the similar object regions in all the images from the more varying background. In this paper, for the first time, we extend this approach to video segmentation: given two or more video sequences showing the same object (or objects belonging to the same class) moving in a similar manner, we aim to outline its region in all the frames. In addition, the method works in an unsupervised manner, by learning to segment at testing time. We compare favorably with two state-of-the-art methods on video segmentation and report results on benchmark videos.

Jose C. Rubio, Joan Serrat, Antonio López

Globally Minimal Path Method Using Dynamic Speed Functions Based on Progressive Wave Propagation

In this paper, we propose a novel framework which extends the classical minimal path methods. Usually, minimal path methods can be interpreted as the simulation of the outward propagation of a wavefront emanating from a specific start point at a certain speed derived from an image. In previous methods, either a speed is computed before the wavefront starts to propagate, or the normal of the wavefront is used to update the speed. We generalize the latter methods by introducing more general dynamic speed functions: During the outward propagation of the wavefront, features of the region already visited by the wavefront are used to update the speed dynamically. Our framework can incorporate both the fast marching method and Dijkstra’s algorithm. We prove that the global optimum can be found using our approach and demonstrate its advantage experimentally by applying it for segmentation of tubular structures in synthetic and real images.

Wei Liao, Stefan Wörz, Karl Rohr
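
As background for the abstract above, the following is a minimal sketch of the classical setting it starts from: a minimal path found with Dijkstra's algorithm on a 4-connected pixel grid, where the speed is fixed before propagation. It does not implement the dynamic speed functions proposed in the paper. The function name, the grid representation and the cost of `1 / speed` for entering a pixel are assumptions made for illustration.

```python
import heapq


def minimal_path(speed, start, goal):
    """Dijkstra on a 4-connected grid; entering a cell costs 1 / speed.

    Assumes every speed value is strictly positive, so the goal is
    always reachable and all edge costs are finite.
    """
    rows, cols = len(speed), len(speed[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break  # the arrival time at the goal is final once it is popped
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + 1.0 / speed[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Backtrack from the goal to the start.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]
```

With a high speed along a bright tubular structure and a low speed elsewhere, the returned path follows the structure, which is the behaviour that minimal path segmentation relies on.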

Vanishing Points Estimation and Line Classification in a Manhattan World

The problem of estimating vanishing points for visual scenes under the Manhattan world assumption [1, 2] has been addressed for more than a decade. Surprisingly, the special characteristic of the Manhattan world that lines should be orthogonal or parallel to each other is seldom well utilized. In this paper, we present an algorithm that accurately and efficiently estimates vanishing points and classifies lines by thoroughly taking advantage of this simple fact in the Manhattan world with a calibrated camera. We first present a one-unknown-parameter representation of the 3D line direction in the camera frame. We then derive a quadratic which is employed to solve for three orthogonal vanishing points formed by a line triplet. Finally, we develop a RANSAC-based approach to fulfill the task. The performance of the proposed approach is demonstrated on the York Urban Database [3] and compared to the state-of-the-art method.

Lilian Zhang, Reinhard Koch

Online Learning for Fast Segmentation of Moving Objects

This work addresses the problem of fast, online segmentation of moving objects in video. We pose this as a discriminative online semi-supervised appearance learning task, where supervising labels are autonomously generated by a motion segmentation algorithm. The computational complexity of the approach is significantly reduced by performing learning and classification on oversegmented image regions (superpixels), rather than per pixel. In addition, we further exploit the sparse trajectories from the motion segmentation to obtain a simple model that encodes the spatial properties and location of objects at each frame. Fusing these complementary cues produces good object segmentations at very low computational cost. In contrast to previous work, the proposed approach (1) performs segmentation on-the-fly (allowing for applications where data arrives sequentially), (2) has no prior model of object types or ‘objectness’, and (3) operates at significantly reduced computational cost. The approach and its ability to learn, disambiguate and segment the moving objects in the scene is evaluated on a number of benchmark video sequences.

Liam Ellis, Vasileios Zografos

A Linear Programming Based Method for Joint Object Region Matching and Labeling

Object matching can be achieved by finding the superpixels matched across the image and the object template. It can therefore be used for detecting or labeling the object region. However, the matched superpixels are often sparsely distributed within the image domain, and there could therefore be a significant proportion of incorrectly detected or labeled regions even though there are few outlier matches. Consequently, the labeled regions may be unreliable for locating, extracting or representing the object. To address these problems, we propose to impose label priors that were previously incorporated in segmentation on the object matching. Specifically, in order to label as many regions as possible on the object, we propose to adopt the boundary-weighted smoothness prior. To reduce the singular outlier matches as much as possible, we propose to adopt the minimum description length principle adopted in segmentation. We then linearize the priors and incorporate them in the linear programming (LP) formulation of matching. The above gives rise to a novel general LP model for joint object region matching and labeling. This work extends the scope of conventional LP based object matching. The experimental results show that our method compares favorably to the LP based matching methods for object region labeling on a challenging dataset.

Junyan Wang, Li Wang, Kap Luk Chan, Martin Constable

Using Models of Objects with Deformable Parts for Joint Categorization and Segmentation of Objects

Several formulations based on Random Fields (RFs) have been proposed for joint categorization and segmentation (JCaS) of objects in images. The RF’s sites correspond to pixels or superpixels of an image and one defines potential functions (typically over local neighborhoods) which define costs for the different possible assignments of labels to several different sites. Since the segmentation is unknown a priori, one cannot define potential functions over arbitrarily large neighborhoods as that may cross object boundaries. Categorization algorithms extract a set of interest points from the entire image and solve the categorization problem by optimizing cost functions that depend on the feature descriptors extracted from these interest points. There is some disconnect between segmentation algorithms which consider local neighborhoods and categorization algorithms which consider non-local neighborhoods. In this work, we propose to bridge this gap by introducing a novel formulation which uses models of objects with deformable parts, classically used for object categorization, to solve the JCaS problem. We use these models to introduce two new classes of potential functions for JCaS: (a) the first class of potential functions encodes the model score for detecting an object as a function of its visible parts only, and (b) the second class of potential functions encodes shape priors for each visible part and is used to bias the segmentation of the pixels in the support region of the part towards the foreground object label. We show that most existing deformable parts formulations can be used to define these potential functions and that the resulting potential functions can be optimized exactly using min-cut. As a result, these new potential functions can be integrated with most existing RF-based formulations for JCaS.

Nikhil Naikal, Dheeraj Singaraju, S. Shankar Sastry

A Robust Stereo Prior for Human Segmentation

The emergence of affordable depth cameras has enabled significant advances in human segmentation and pose estimation in recent years. While it leads to impressive results in many tasks, the use of infra-red cameras has its drawbacks, in particular the fact that they do not work in direct sunlight. One alternative is to use a stereo pair of cameras to produce a disparity space image. In this work, we propose a robust method of using a disparity space image to create a prior for human segmentation. This new prior leads to greatly improved segmentation results; it can be applied to any task where a stereo pair of cameras is available and segmentation results are desired. As an application, we show how the prior can be inserted into a dual decomposition formulation for stereo, segmentation and human pose estimation.

Glenn Sheasby, Julien Valentin, Nigel Crook, Philip Torr

Voronoi-Based Extraction of a Feature Skeleton from Noisy Triangulated Surfaces

Recent advances in 3D reconstruction allow highly detailed and complex geometry to be acquired quickly. However, the outcome of such systems is usually unstructured, noisy and redundant. In order to enable further processing such as CAD modeling, physical measurement or rendering, semantic information about shape and topology needs to be derived from the data. In this paper, a robust approach to the extraction of a feature skeleton is presented. The skeleton reflects the overall structure of an object. It is given by a set of lines that run along ridges or valleys and meet at umbilical points. The computed data is useful not only for building semantic-driven CAD models in reverse engineering disciplines but also for identifying geometrical features for tasks like object recognition, registration, rendering or re-meshing. Based on the mean curvature, a Markov random field is used to robustly classify each vertex as belonging to a convex, concave or flat region. The boundaries of the regions are described by a set of points that are robustly estimated using linear interpolation. A novel algorithm is used to extract the feature skeleton based on the Voronoi decomposition of the boundary points. The method has been successfully tested on real world examples and the paper concludes with a detailed evaluation.

Tilman Wekel, Olaf Hellwich

Maximal Cliques Based Rigid Body Motion Segmentation with a RGB-D Camera

Motion segmentation is a key underlying problem in computer vision for dynamic scenes. Given 3D data from an RGB-D camera, this paper presents a novel method for motion segmentation without explicitly estimating motions. Building on recent work [1] that proposes a similarity measure between two 3D points belonging to a rigid body, we show that identifying rigid motion groups corresponds to a maximal clique enumeration problem on the similarity graph. Using efficient maximal clique enumeration algorithms, we show that it is practically feasible to find all the unique candidate motion groups in a deterministic fashion. We investigate the relationship to traditional hypothesis sampling and show that under certain conditions the inliers to a hypothesis form a clique in the similarity graph. Further, we show that identifying true motions from the candidate motions can be cast as a minimum set cover problem (for outlier-free data) or a max k-cover problem (for data with outliers). This allows us to use the greedy algorithm for max k-cover to segment the motion groups. Results on synthetic and real RGB-D data show the validity of our approach.

Samunda Perera, Nick Barnes
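
The abstract above selects motion groups with the greedy algorithm for max k-cover. A minimal sketch of that standard greedy rule follows; the function name and the representation of candidate motion groups as Python sets of point indices are assumptions for illustration, and clique enumeration itself is not shown.

```python
def greedy_max_k_cover(candidates, k):
    """Pick up to k sets, each time taking the one covering most new points."""
    covered = set()
    chosen = []
    for _ in range(k):
        # Index of the candidate with the largest number of uncovered points.
        best = max(range(len(candidates)),
                   key=lambda i: len(candidates[i] - covered))
        if not candidates[best] - covered:
            break  # nothing new can be covered
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered
```

Points that remain uncovered after k picks play the role of outliers in this reading.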

Online Web-Data-Driven Segmentation of Selected Moving Objects in Videos

We present an online Web-data-driven framework for segmenting moving objects in videos. This framework uses object shape priors learned in an online fashion from relevant labeled images ranked in a large-scale Web image set. The method for online prior learning has three steps: (1) relevant silhouette images for training are selected online using a user-provided bounding box and an object class annotation; (2) image patches containing the annotated object for testing are obtained via an online trained tracker; (3) a holistic shape energy term is learned for the object, while the object and background seed labels are propagated between frames. Finally, the segmentation is optimized via 3-D Graph cuts with the shape term and soft assignments of seeds. The system’s performance is evaluated on the challenging YouTube dataset and found to be competitive with the state of the art, which requires offline modeling based on pre-selected templates and a pre-trained person detector. Comparison experiments have verified that tracking and seed label propagation both induce less distraction, while the shape prior induces more complete segments.

Xiang Xiang, Hong Chang, Jiebo Luo

Multi-Level Structured Image Coding on High-Dimensional Image Representation

Robust image representations such as classemes [1], Object Bank (OB) [2], and spatial pyramid representation (SPM) [3] have been proposed, showing superior performance in various high-level visual recognition tasks. Our work is motivated by the need to explore the rich structural information encoded by these image representations. In this paper, we propose a novel Multi-Level Structured Image Coding approach to uncover the structure embedded in representations with rich regular structural information by learning a structured dictionary from them. Specifically, we choose Object Bank [2] to demonstrate our algorithm since it encodes both semantics and spatial location as structural information. By using the structured dictionary learned from Object Bank, we can compute a lower-dimensional and more compact encoding of the image features while preserving and accentuating the rich semantic and spatial information of OB. Our framework is an unsupervised method based on minimizing the reconstruction error of the image and object codes, with an innovative multi-level structural regularization scheme. The object dictionary and the image code obtained by our model offer intriguing intuition of real-world image structures while preserving informative structure of the original OB. We show that our more compact representation outperforms several state-of-the-art representations (including the original OB) on a wide range of high-level visual tasks such as scene classification, image retrieval and annotation.

Li-Jia Li, Jun Zhu, Hao Su, Eric P. Xing, Li Fei-Fei

Incremental Slow Feature Analysis with Indefinite Kernel for Online Temporal Video Segmentation

Slow Feature Analysis (SFA) is a subspace learning method inspired by the human visual system; however, it is seldom seen in computer vision. Motivated by its application for unsupervised activity analysis, we develop SFA’s first implementation of online temporal video segmentation to detect episodes of motion changes. We utilize a domain-specific indefinite kernel which takes the data representation into account to introduce robustness. As our kernel is indefinite (it defines a Krein space instead of a Hilbert space), we formulate SFA in Krein space. We propose an incremental kernel SFA framework which utilizes the special properties of our kernel. Finally, we apply our framework to online temporal video segmentation and perform qualitative and quantitative evaluation.

Stephan Liwicki, Stefanos Zafeiriou, Maja Pantic

Texture Classification Based on BIMF Monogenic Signals

This paper proposes a new texture feature based on HHT, the Riesz transform and LBP. The Hilbert-Huang transform (HHT) is a novel, efficient signal analysis method proposed by N. E. Huang. It consists of two parts: Empirical Mode Decomposition (EMD) and the Hilbert transform. Images are decomposed into several Bidimensional Intrinsic Mode Functions (BIMFs) by BEMD, which exhibit new multi-scale characteristics and are illumination invariant. Then, for the two-dimensional BIMF signals, we propose using the Riesz transform instead of the Hilbert transform to generate monogenic signals, which are rotation invariant. After that, Local Binary Pattern (LBP) features are extracted from the Monogenic-BIMF space. Experiments demonstrate that the LBP histogram of Monogenic-BIMFs gives a better classification result than other state-of-the-art texture representation methods.

JianJia Pan, Yuan Yan Tang
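
The last stage described in the abstract above builds a Local Binary Pattern histogram. A minimal sketch of the basic 3x3 LBP operator is given here, applied to a plain intensity grid rather than to Monogenic-BIMF responses. The function names, the neighbour ordering and the `>=` comparison are assumptions for illustration; the paper may use a different LBP variant.

```python
def lbp_code(image, r, c):
    """8-bit LBP code of pixel (r, c) from its 3x3 neighbourhood."""
    center = image[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

The histogram, not the per-pixel code, is what would serve as the texture descriptor passed to a classifier.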

Toward Perception-Based Shape Decomposition

The aim of this work is to decompose shapes into parts which are consistent with human perception. We propose a novel shape decomposition method which utilizes the three perception rules suggested by psychological study: the Minima rule, the Short-cut rule and the Convexity rule. Unlike previous work, we focus on improving the convexity of the decomposed parts while minimizing the cut length as much as possible. The problem is formulated as a combinatorial optimization problem and solved by a quadratic programming method. In addition, we consider curved branches, which introduce “false” concavity. To solve this problem, we straighten the curved branches before shape decomposition, which makes the results more consistent with human perception. We test our approach on the MPEG-7 shape dataset, and comparison with previous work shows that the proposed method improves part convexity while keeping the cuts short, and that the decomposition is more consistent with human perception.

Tingting Jiang, Zhongqian Dong, Chang Ma, Yizhou Wang

Co-regularized PLSA for Multi-view Clustering

Multi-view data is common in a wide variety of application domains. Properly exploiting the relations among different views helps to alleviate the difficulty of a learning problem of interest. To this end, we propose an extended Probabilistic Latent Semantic Analysis (PLSA) model for multi-view clustering, named Co-regularized PLSA (CoPLSA). CoPLSA integrates individual PLSAs in different views by pairwise co-regularization. The central idea behind the co-regularization is that the sample similarities in the topic space from one view should agree with those from another view. An EM-based scheme is employed for parameter estimation, and a locally optimal solution is obtained through an iterative process. Extensive experiments are conducted on three real-world datasets and the comparative results demonstrate the superiority of our approach.

Yu Jiang, Jing Liu, Zechao Li, Peng Li, Hanqing Lu

Oral Session 4: Image Representation

Quadra-Embedding: Binary Code Embedding with Low Quantization Error

Thanks to compact data representations and fast similarity computation, many binary code embedding techniques have recently been proposed for large-scale similarity search used in many computer vision applications, including image retrieval. Most prior techniques have centered around optimizing a set of projections for accurate embedding. In spite of active research efforts, existing solutions suffer both from diminishing marginal efficiency as more code bits are used and from high quantization errors that arise naturally from the binarization.

In order to reduce both quantization error and diminishing efficiency, we propose a novel binary code embedding scheme, Quadra-Embedding, that assigns two bits to each projection to define four quantization regions, and a novel binary code distance function tailored specifically to our encoding scheme. Our method is directly applicable to a wide variety of binary code embedding methods. Our scheme combined with four state-of-the-art embedding methods has been evaluated with three public image benchmarks. We have observed that our scheme achieves meaningful accuracy improvement in most experimental configurations under two types of nearest-neighbor (NN) search.

Youngwoon Lee, Jae-Pil Heo, Sung-Eui Yoon
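
The idea of spending two bits on each projection to define four quantization regions can be made concrete with a small sketch. Everything specific here is hypothetical: the three thresholds, the bit assignment and the region-index distance are chosen only for illustration and are not the encoding or distance function defined in the paper.

```python
def two_bit_code(value, thresholds=(-1.0, 0.0, 1.0)):
    """Quantise one projection value into one of four regions (two bits).

    The thresholds are hypothetical; a real scheme would derive them
    from the data or from the underlying embedding method.
    """
    region = sum(value >= t for t in thresholds)  # region index 0..3
    return (region >> 1) & 1, region & 1


def region_distance(codes_a, codes_b):
    """Illustrative distance: summed difference of region indices."""
    total = 0
    for (a_hi, a_lo), (b_hi, b_lo) in zip(codes_a, codes_b):
        total += abs((2 * a_hi + a_lo) - (2 * b_hi + b_lo))
    return total
```

Compared with one bit per projection, two points on the same side of a hyperplane but far apart along the projection can now receive different codes, which is the kind of quantization error the abstract aims to reduce.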

Learning a Context Aware Dictionary for Sparse Representation

Recent successes in the use of sparse coding for many computer vision applications have triggered the attention towards the problem of how an over-complete dictionary should be learned from data. This is because the quality of a dictionary greatly affects performance in many respects, including computational. While so far the focus has been on learning compact, reconstructive, and discriminative dictionaries, in this work we propose to retain the previous qualities, and further enhance them by learning a dictionary that is able to predict the contextual information surrounding a sparsely coded signal. The proposed framework leverages the K-SVD for learning, fully inheriting its benefits of simplicity and efficiency. A model of structured prediction is designed around this approach, which leverages contextual information to improve the combined recognition and localization of multiple objects from multiple classes within one image. Results on the PASCAL VOC 2007 dataset are in line with the state-of-the-art, and clearly indicate that this is a viable approach for learning a context aware dictionary for sparse representation.

Farzad Siyahjani, Gianfranco Doretto

Robust Multiple-Instance Learning with Superbags

Multiple-instance learning consists of two alternating optimization steps: learning a classifier with missing labels and finding the missing labels with the classifier. These steps are iteratively performed on the same training data, thus imputing labels by evaluating the classifier on the data it is trained upon. Consequently this alternating optimization is prone to self-amplification and overfitting. To resolve this crucial issue of popular multiple-instance learning we propose to establish a random ensemble of sets of bags, i.e., superbags. Classifier training and label inference are then decoupled by performing them on different superbags. Label inference is performed on samples from separate superbags, and thus avoids label imputation on training samples in the same superbag. Experimental evaluations on standard datasets show consistent improvement over widely used approaches for multiple-instance learning.

Borislav Antić, Björn Ommer

Poster Session 4: Image/Video Retrieval and Medical Image Analysis

Automatic Grading of Cortical and PSC Cataracts Using Retroillumination Lens Images

In this paper, we propose an automatic approach to grade cortical and Posterior Sub-Capsular (PSC) cataracts using retroillumination images. Low-level vision features are used to characterize the photometric appearances and geometric structures of cortical and PSC cataracts in retroillumination images. The prediction result gives an opacity score that serves as an estimation of cataract severity. The system is tested on 434 pairs of lens images with ground truth labels from professional graders. Five-fold cross-validation with random partition of the data shows that the mean correlation between the proposed method and the grader’s result is 0.7392 with variance 0.0003, which is promising.

The proposed prediction approach can also be used as a preliminary estimation to improve existing detection systems. Most existing detection systems apply one method to all types of lens images. Such a single method may fail for one or more types of lens images, as cataracts of different severity not only have different levels of opacity but also have different photometric appearances and geometric structures. We demonstrate an improved cortical cataract detection system that employs different strategies to address the challenges in cataract detection for lenses with different levels of estimated opacity. The strategies simultaneously overcome the over-detection issue for clear lenses and the under-detection issue for lenses with high opacity. The results show an improvement in the accuracy of cortical cataract detection from 51% to 62%. The corresponding Kappa agreement score is improved from 0.25 to 0.40.

Xinting Gao, Damon Wing Kee Wong, Tian-Tsong Ng, Carol Yim Lui Cheung, Ching-Yu Cheng, Tien Yin Wong
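
The abstract above reports a Kappa agreement score without stating which variant is used. Assuming the common unweighted Cohen's kappa, the score compares observed agreement with the agreement expected by chance, as in this minimal sketch (the function name and the list-of-labels input are assumptions).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equally long label sequences.

    Undefined (division by zero) when chance agreement equals 1, i.e.
    when both raters use one and the same label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which puts the reported move from 0.25 to 0.40 in context.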

Registration of Pre-Operative CT and Non-Contrast-Enhanced C-Arm CT: An Application to Trans-Catheter Aortic Valve Implantation (TAVI)

Trans-catheter Aortic Valve Implantation (TAVI) has proven to be an effective minimally invasive alternative to traditional open-heart valve replacement surgery. Despite the success of contrast-enhanced C-arm CT for intra-operative guidance during TAVI, utilization of pre-operative CT in the Hybrid Operating Room provides the additional advantages of an improved workflow and minimized usage of contrast agent. In this paper, we propose a framework for CT/non-contrast-enhanced C-arm CT volume registration so that pre-operative CT can be used intra-operatively without additional contrast medium. The proposed method consists of two steps: rigid-body coarse alignment followed by deformable fine registration. Our contribution is twofold. First, robust heart center detection on both image modalities is used to boost the success rate of rigid-body registration. Second, a structural encoded similarity measure and anatomical correlation-regularized deformation fields are proposed to improve the performance of intensity-based deformable registration using the variational framework. Experiments were performed on ten sets of TAVI patient data, and the results show that the proposed method provides a highly robust and accurate registration. The resulting accuracy of 1.83 mm mean mesh-to-mesh error at the aortic root and the high efficiency of an average running time of 2 minutes on a common computer make it potentially feasible for clinical usage in TAVI. The proposed heart registration method is generic and hence can be easily applied to other cardiac applications.

Yongning Lu, Ying Sun, Rui Liao, Sim Heng Ong

Automatic Webcam-Based Human Heart Rate Measurements Using Laplacian Eigenmap

Non-contact, long-term monitoring of human heart rate is of great importance to home health care. Recent studies show that Photoplethysmography (PPG) can provide a means of heart rate measurement by detecting the blood volume pulse (BVP) in the human face. However, most existing methods use linear analysis to uncover the underlying BVP, which may not be adequate for physiological signals. They also lack rigorous mathematical and physiological models for the subsequent heart rate calculation. In this paper, we present a novel webcam-based heart rate measurement method using Laplacian Eigenmap (LE). Usually, the webcam captures the PPG signal mixed with other sources of fluctuations in light. Thus, exactly separating the PPG signal from the collected data is crucial for heart rate measurement. In our method, a more accurate BVP can be extracted by applying LE to efficiently discover the embedding ties of PPG with the nonlinear mixed data. We also apply effective data filtering to the BVP and obtain the heart rate from the interbeat intervals (IBIs). Experimental results show that LE obtains higher degrees of agreement with measurements using finger blood oximetry than Independent Component Analysis (ICA), Principal Component Analysis (PCA) and five other alternative methods. Moreover, filtering and processing of IBIs are shown to increase the measuring accuracy in experiments.

Lan Wei, Yonghong Tian, Yaowei Wang, Touradj Ebrahimi, Tiejun Huang
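
The final step in the abstract above derives the heart rate from interbeat intervals. A minimal sketch of that arithmetic follows, assuming beat timestamps in seconds are already available from the BVP signal. The function name and the plausibility range of 0.3 s to 2.0 s are hypothetical stand-ins for the paper's unspecified IBI filtering.

```python
def heart_rate_bpm(beat_times, min_ibi=0.3, max_ibi=2.0):
    """Heart rate in beats per minute from beat timestamps in seconds.

    Intervals outside [min_ibi, max_ibi] are dropped as implausible;
    these bounds are illustrative assumptions, not values from the paper.
    """
    ibis = [t1 - t0 for t0, t1 in zip(beat_times, beat_times[1:])]
    kept = [ibi for ibi in ibis if min_ibi <= ibi <= max_ibi]
    if not kept:
        raise ValueError("no plausible interbeat intervals")
    # 60 seconds divided by the mean interbeat interval.
    return 60.0 * len(kept) / sum(kept)
```

For example, beats one second apart give 60 beats per minute, and a spurious extra detection shortly after a true beat is discarded rather than inflating the estimate.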

Superpixel Classification Based Optic Disc Segmentation

Optic disc segmentation in retinal fundus images is important in computer-aided diagnosis. In this paper, an optic disc segmentation method based on superpixel classification is proposed. In the classification, histograms from contrast-enhanced image channels and center surround statistics from center surround difference maps are proposed as features to classify each superpixel as disc or non-disc. In the training step, bootstrapping is adopted to handle the unbalanced cluster issue due to the presence of peripapillary atrophy. A self-assessment reliability score is computed to evaluate the quality of the automated optic disc segmentation. The proposed method has been tested on a database of 650 images with optic disc boundaries manually marked by trained professionals. The experimental results show a mean overlapping error of 9.5%, better than previous methods. The results also show an increase in overlapping error as the reliability score is reduced, which justifies the effectiveness of the self-assessment. The method can be used in computer-aided diagnosis systems, and the self-assessment can be used as an indicator of results with large errors and thus enhance the clinical deployment of the automatic segmentation.

Jun Cheng, Jiang Liu, Yanwu Xu, Fengshou Yin, Damon Wing Kee Wong, Ngan-Meng Tan, Ching-Yu Cheng, Yih Chung Tham, Tien Yin Wong
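
The abstract above reports an overlapping error but does not define it. A common definition, assumed here, is one minus the ratio of the intersection to the union of the segmented and ground-truth regions. The function name and the representation of regions as sets of pixel coordinates are also assumptions.

```python
def overlap_error(segmented, ground_truth):
    """1 - |S intersect G| / |S union G| for two sets of pixel coordinates.

    Assumes at least one of the two regions is non-empty.
    """
    intersection = len(segmented & ground_truth)
    union = len(segmented | ground_truth)
    return 1.0 - intersection / union
```

Under this definition, 0 means a perfect match and 1 means the two regions do not overlap at all.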

Anterior Cruciate Ligament Segmentation from Knee MR Images Using Graph Cuts with Geometric and Probabilistic Shape Constraints

Automatic segmentation of the anterior cruciate ligament (ACL) in 3D knee magnetic resonance (MR) images is a challenging task due to its intensity similarity to adjacent soft tissues and its internal inhomogeneity. In this paper, an automatic ACL segmentation method for 3D knee MR images using graph cuts is proposed. The proposed method consists of two steps. First, in the rough segmentation, adaptive thresholding using GMM fitting and ACL candidate extraction are performed to extract initial object and background candidates. Second, in the fine segmentation, iterative graph cut segmentation is combined with additional constraints, including geometric and probabilistic shape costs, to prevent the segmented ACL label from leaking into adjacent soft tissues, e.g. the posterior cruciate ligament (PCL) and cartilage. In the experimental results, compared to the preceding work [1], the proposed method shows overall improvements in sensitivity, specificity, and Dice similarity coefficient of 25%, 0.1%, and 29% for the whole ACL, and 34%, 0.5%, and 41% for the major stem of the ACL, respectively.

Hansang Lee, Helen Hong, Junmo Kim
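
The three evaluation measures named in the abstract above have standard definitions in terms of true and false positives and negatives. A minimal sketch, assuming the predicted and reference segmentations are given as flat sequences of 0/1 voxel labels (the function name is hypothetical):

```python
def segmentation_scores(predicted, reference):
    """Return (sensitivity, specificity, Dice) for binary label sequences.

    Assumes the reference contains both foreground and background
    labels, so that no denominator is zero.
    """
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    tn = sum(1 for p, r in zip(predicted, reference) if not p and not r)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    dice = 2 * tp / (2 * tp + fp + fn)
    return sensitivity, specificity, dice
```

Because a ligament occupies a small fraction of an MR volume, true negatives dominate and specificity stays close to 1, which is consistent with the small specificity gains quoted next to much larger gains in sensitivity and Dice.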

Robust Mid-Sagittal Plane Extraction in 3-D Ultrasound Fetal Volume for First Trimester Screening

Extraction of the mid-sagittal plane (MSP) in 3-D ultrasound fetal volume is an important procedure for first trimester screening. We present a robust semi-automated MSP extraction method for 3-D ultrasound fetal volume based on parametric template matching of an ellipse to the skull edge and on the combination of a local similarity measure and an adaptive support-weight map. The algorithm is intended to reduce the variability of manual MSP detection. Our semi-automatically extracted MSPs are compared to those extracted manually by experts, and experimental results demonstrate that our method is robust even in the presence of asymmetry, outliers, and speckle noise, with excellent detection accuracy for a large set of fetal volume data.

Kwang Hee Lee, Sang Wook Lee

Automatic Skin Lesion Segmentation Based on Texture Analysis and Supervised Learning

Automatic skin lesion detection is a key step in computer-aided diagnosis (CAD) of skin cancers, since the accuracy of the subsequent steps in CAD crucially depends on it. In this paper, a novel method of automatic skin lesion segmentation based on texture analysis and supervised learning is proposed. First, the training images are clustered into homogeneous regions using mean shift; then, fused texture features based on Gabor and GLCM features are extracted from each clustered region; next, a classifier model is generated through supervised learning based on LIBSVM; finally, the lesion regions of an unseen image are automatically predicted by the resulting classifier. Comprehensive experiments have been performed on a dataset of 125 dermoscopy images. The proposed method is compared with three state-of-the-art methods, and the results demonstrate that it achieves both robust and accurate lesion segmentation in dermoscopy images.

Yingding He, Fengying Xie

Multiscale Convolutional Neural Networks for Vision–Based Classification of Cells

We present a Multiscale Convolutional Neural Network (MCNN) approach for vision–based classification of cells. Based on several deep Convolutional Neural Networks (CNN) acting at different resolutions, the proposed architecture avoids the classical handcrafted feature extraction step by treating feature extraction and classification as a whole. The proposed approach gives better classification rates than classical state–of–the–art methods, allowing a safer Computer–Aided Diagnosis of pleural cancer.

Pierre Buyssens, Abderrahim Elmoataz, Olivier Lézoray

Semantic-Context-Based Augmented Descriptor for Image Feature Matching

This paper proposes an augmented version of a local feature that enhances the discriminative power of the feature without affecting its invariance to image deformations. The idea is to learn local features, aiming to estimate their semantics, which are then exploited in conjunction with the bag-of-words paradigm to build an augmented feature descriptor. Basically, any local descriptor can be cast in the proposed context, and thus the approach can easily be generalized to fit any local approach. The semantic-context signature is a 2D histogram which accumulates the spatial distribution of the visual words around each local feature. The obtained semantic-context component is concatenated with the local feature to generate our proposed feature descriptor. This is expected to handle ambiguities occurring in images with multiple similar motifs and depicting slightly complicated non-affine distortions, outliers, and detector errors. The approach is evaluated on two data sets. The first one is intentionally selected with images containing multiple similar regions and depicting slight non-affine distortions. The second is the standard data set of […]. The evaluation results showed that our approach performs significantly better than expected, as well as in comparison with other methods.

Samir Khoualed, Thierry Chateau, Umberto Castellani

Image Synthesis and Occlusion Removal of Intermediate Views by Stereo Matching

We suggest an efficient real-time method for removing occlusions when synthesizing intermediate views from two stereo images by stereo matching algorithm. The proposed method can carry out the correction of noise resulting from stereo matching, removal of occlusions to achieve high fidelity rendering, and parallel computing to achieve real-time synthesis. The proposed algorithm is compared with state-of-the-art algorithms to show its improved performance in terms of rendering and computational speed.

Seongyun Cho, JeongMok Ha, Hong Jeong

PEDIVHANDI: Multimodal Indexation and Retrieval System for Lecture Videos

Since the text in slides and the teacher’s speech represent lecture contents in a complementary way, lecture videos can be indexed and retrieved by using a fully automatic and complete system based on the multimodal analysis of speech and text. In this paper, we present the multimodal lecture content indexing approach used in the PEDIVHANDI project. We use the discretization of speech and changes in slide text to identify lecture slides in the video. We also propose a duplicate verification step to remove near-duplicate slides. After using the Stroke Width Transform (SWT) text detector to obtain text regions, a standard OCR engine is used for text recognition. Finally, a context-based spell check is proposed to correct the recognized words. Our system achieves 71% recognition precision and 57% recall on a corpus of 6 presentation videos with a total duration of 8 hours.

Nhu Van Nguyen, Jean-Marc Ogier, Franck Charneau

Digitization of Deformed Documents Using a High-Speed Multi-camera Array

Digitization of documents recently has become an important technology. However, it is difficult for existing scanners to read books at high speed and at high resolution simultaneously. In order to realize a promising new book scanning system, we aimed to scan a book containing many pages by using multiple high-speed cameras to acquire images while continuously flipping through the pages, then integrating the images viewed by different cameras to digitize all of the pages. However, high-accuracy integration with the non-uniform rectification required for such input images is a challenging task because the sheets of the document are deformed and the image resolution is so high that misalignment can easily occur. This paper proposes a new multi-camera-array book scanning system and a method of achieving high-accuracy three-dimensional deformation estimation and high-resolution rectification of the distorted document images with a system configuration in which multiple high-speed cameras are arranged with small overlapping captured areas. Experiments using the developed system showed that high-accuracy document images were reconstructed.

Yoshihiro Watanabe, Kotaro Itoyama, Masahiro Yamada, Masatoshi Ishikawa

A Phase-Based Approach for Caption Detection in Videos

The captions in videos are closely related to the video contents, so research on automatic caption detection contributes to video content analysis and content-based retrieval. In this paper, a novel phase-based static caption detection approach is proposed. Our phase-based algorithm consists of two processes: candidate caption region detection and candidate caption region refinement. First, the candidate caption regions are extracted from the caption saliency map, which is mainly generated by phase-only Fourier synthesis. Second, the candidate regions are refined using text region shape features. Comparative experiments against existing methods show that the proposed approach performs better.

Shu Wen, Yonghong Song, Yuanlin Zhang, Yu Yu
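
As a rough illustration of the phase-only Fourier synthesis mentioned in the abstract above, the sketch below discards the magnitude spectrum of a grayscale frame and reconstructs the image from the phase alone. This is only the generic technique: the function name `phase_only_saliency` is hypothetical, and the paper's actual saliency map, post-filtering, and region refinement are not described in the abstract and are not reproduced here.

```python
import numpy as np

def phase_only_saliency(gray):
    """Squared magnitude of the image reconstructed from its Fourier phase only."""
    spectrum = np.fft.fft2(np.asarray(gray, dtype=float))
    # Keep the phase and force every frequency to unit magnitude.
    phase_only = np.exp(1j * np.angle(spectrum))
    reconstruction = np.fft.ifft2(phase_only)
    return np.abs(reconstruction) ** 2
```

In practice such a map is usually smoothed before thresholding; that step is omitted here to keep the sketch short.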

Efficient Clothing Retrieval with Semantic-Preserving Visual Phrases

In this paper, we address the problem of large scale cross-scenario clothing retrieval with semantic-preserving visual phrases (SPVP). Since the human parts are important cues for clothing detection and segmentation, we firstly detect human parts as the semantic context, and refine the regions of human parts with sparse background reconstruction. Then, the semantic parts are encoded into the vocabulary tree under the bag-of-visual-word (BOW) framework, and the contextual constraint of visual words among different human parts is exploited through the SPVP. Moreover, the SPVP is integrated into the inverted index structure for accelerating the retrieval process. Experiments and comparisons on our clothing dataset indicate that the SPVP significantly enhances the discriminative power of local features with a slight increase of memory usage or runtime consumption compared to the BOW model. Therefore, the approach is superior to both the state-of-the-art approach and two clothing search engines.

Jianlong Fu, Jinqiao Wang, Zechao Li, Min Xu, Hanqing Lu

VISOR: Towards On-the-Fly Large-Scale Object Category Retrieval

This paper addresses the problem of object category retrieval in large unannotated image datasets. Our aim is to enable both fast learning of an object category model, and fast retrieval over the dataset. With these elements we show that new visual concepts can be learnt on-the-fly, given a text description, and so images of that category can then be retrieved from the dataset in realtime.

To this end we compare state of the art encoding methods and introduce a novel cascade retrieval architecture, with a focus on achieving the best trade-off between three important performance measures for a realtime system of this kind, namely: (i) class accuracy, (ii) memory footprint, and (iii) speed.

We show that an on-the-fly system is possible and compare its performance (using noisy training images) to that of using carefully curated images. For this evaluation we use the VOC 2007 dataset together with 100k images from ImageNet to act as distractors.

Ken Chatfield, Andrew Zisserman

A Picture Is Worth a Thousand Tags: Automatic Web Based Image Tag Expansion

We present an approach to automatically expand the annotation of images using the internet as an additional information source. The novelty of the work is in the expansion of image tags by automatically introducing new unseen complex linguistic labels which are collected unsupervised from associated webpages. Taking a small subset of existing image tags, a web based search retrieves additional textual information. Both a textual bag of words model and a visual bag of words model are combined and symbolised for data mining. Association rule mining is then used to identify rules which relate words to visual contents. Unseen images that fit these rules are re-tagged. This approach allows a large number of additional annotations to be added to unseen images, on average 12.8 new tags per image, with an 87.2% true positive rate. Results are shown on two datasets including a new 2800 image annotation dataset of landmarks, the results include pictures of buildings being tagged with the architect, the year of construction and even events that have taken place there. This widens the tag annotation impact and their use in retrieval. This dataset is made available along with tags and the 1970 webpages and additional images which form the information corpus. In addition, results for a common state-of-the-art dataset MIRFlickr25000 are presented for comparison of the learning framework against previous works.

Andrew Gilbert, Richard Bowden

Image Retrieval Using Eigen Queries

Category-based image search, where the goal is to retrieve images of a specific category from a large database, is becoming increasingly popular. In such a setting, the query is often a classifier. However, the complexity of the classifiers (often SVMs) used for this purpose hinders the use of such a solution in practice. The problem becomes acute when the database is huge and/or the dimensionality of the feature representation is also very large. In this paper, we address this issue by proposing a novel method which decomposes the query classifier into a set of known eigen queries. We use their precomputed results (or scores) for computing the ranked list corresponding to novel queries. We also propose an approximate algorithm which accesses only a fraction of the data to perform fast retrieval. Experiments on various datasets show that our method achieves high accuracy and efficiency. Apart from retrieval, the proposed method can also be used to discover interesting new concepts from the given dataset.

Nisarg Raval, Rashmi Vilas Tonge, C. V. Jawahar
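
The idea of answering a new query from precomputed scores can be sketched for linear classifiers as follows. This is an illustrative assumption, not the authors' algorithm: the basis here is simply taken from an SVD of previously seen query vectors, and all names (`build_eigen_queries`, `rank_with_eigen_queries`) and sizes are hypothetical.

```python
import numpy as np

def build_eigen_queries(known_queries, database, k):
    """Return k orthonormal basis queries and their precomputed database scores."""
    _, _, vt = np.linalg.svd(known_queries, full_matrices=False)
    basis = vt[:k]                   # (k, d) rows are orthonormal
    basis_scores = database @ basis.T  # (n, k) computed once, offline
    return basis, basis_scores

def rank_with_eigen_queries(w, basis, basis_scores):
    """Approximate database scores for a new linear query w without touching features."""
    coeffs = basis @ w                       # project the query onto the basis
    approx_scores = basis_scores @ coeffs    # combine the precomputed scores
    return np.argsort(-approx_scores), approx_scores
```

When the new query lies in the span of the basis, the combined scores equal the exact scores; otherwise the result is an approximation whose quality depends on how well the basis covers the query.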

Fast and Effective Retrieval of Plant Leaf Shapes

In this paper, a novel shape description and matching method based on multi-level curve segment measures (MLCSM) is proposed for plant leaf image retrieval. MLCSM describes a shape by extracting statistical features of its contour, measuring the bending, convexity, and concavity of contour curve segments of different lengths. This method not only finely captures global and local features, but is also very compact and has very low computational complexity. The performance of the proposed method is evaluated on the widely used Swedish leaf database and on a leaf database collected by ourselves, which contains 1200 images of 100 plant leaf species. All the experiments show the superiority of our method over state-of-the-art shape retrieval methods.

Bin Wang, Yongsheng Gao

Complete Generic Camera Calibration and Modeling Using Spline Surfaces

The generic camera model considered in this work can be regarded as a mapping between image pixels and viewing rays. These rays are independent of each other which prohibits a standard parametric approach for calibration and modeling of these cameras. Spline surfaces are used here to calibrate and model generic imaging devices. This allows the utilization of sparse planar calibration boards and facilitates general forward projection as well as subpixel back projection. In contrast to other works the complete image area is to be calibrated, not only a part of it. This is done by adding further views of calibration patterns after an initial calibration step, which expands the calibrated region of the camera image. Results with two different imaging devices prove the general applicability of the proposed method and the comparison to an established parametric calibration procedure shows its superiority.

Dennis Rosebrock, Friedrich M. Wahl

q-Gaussian Mixture Models Based on Non-extensive Statistics for Image and Video Semantic Indexing

Gaussian mixture models (GMMs) which extend the bag-of-visual-words (BoW) to a probabilistic framework have been proved to be effective for image and video semantic indexing. Recently, the q-Gaussian distribution, which is derived in the non-extensive statistics, has been shown to be useful for representing patterns in many […] systems in physics such as fractals and cosmology. We propose q-Gaussian mixture models (q-GMMs), which are mixture models of q-Gaussian distributions, for image and video semantic indexing. It has a parameter q to control its tail-heaviness. The long-tailed distributions obtained for q > 1 are expected to effectively represent complexly correlated data, and hence, to improve robustness against outliers. In our experiments, our proposed method outperformed the BoW method and achieved 49.4% and 10.9% in Mean Average Precision on the PASCAL VOC 2010 dataset and the TRECVID 2010 Semantic Indexing dataset, respectively.

Nakamasa Inoue, Koichi Shinoda
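
To make the role of the tail parameter q concrete, the sketch below evaluates an unnormalized one-dimensional q-Gaussian using the standard q-exponential from non-extensive statistics, e_q(x) = [1 + (1 - q) x]^(1/(1 - q)). The function names and the `beta` scale parameter are assumptions for illustration; the paper's exact parameterization and its mixture-model training are not given in the abstract.

```python
import numpy as np

def q_exponential(x, q):
    """q-exponential; reduces to the ordinary exponential as q approaches 1."""
    if np.isclose(q, 1.0):
        return np.exp(x)
    base = np.maximum(1.0 + (1.0 - q) * x, 0.0)  # clipped at zero (matters for q < 1)
    return base ** (1.0 / (1.0 - q))

def q_gaussian_unnormalized(x, q, beta=1.0):
    """Unnormalized q-Gaussian e_q(-beta * x^2); q > 1 gives power-law tails."""
    x = np.asarray(x, dtype=float)
    return q_exponential(-beta * x ** 2, q)
```

For q > 1 the density decays polynomially rather than exponentially, which is the heavier tail that the abstract links to robustness against outliers.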

Oral Session 5: Object Recognition II

Rapid Uncertainty Computation with Gaussian Processes and Histogram Intersection Kernels

An important advantage of Gaussian processes is the ability to directly estimate classification uncertainties in a Bayesian manner. In this paper, we develop techniques that allow for estimating these uncertainties with a runtime linear or even constant with respect to the number of training examples. Our approach makes use of all training data without any sparse approximation technique while needing only a linear amount of memory. To incorporate new information over time, we further derive online learning methods leading to significant speed-ups and allowing for hyperparameter optimization on-the-fly. We conduct several experiments on public image datasets for the tasks of one-class classification and active learning, where computing the uncertainty is an essential task. The experimental results highlight that we are able to compute classification uncertainties within microseconds even for large-scale datasets with tens of thousands of training examples.

Alexander Freytag, Erik Rodner, Paul Bodesheim, Joachim Denzler
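
For context, the quantity being accelerated in the abstract above is the Gaussian process predictive variance. The sketch below computes it in the naive way with a histogram intersection kernel, which costs cubic time in the number of training examples; it is a baseline for orientation only, not the paper's linear or constant-time technique. The function names and the `noise` value are assumptions.

```python
import numpy as np

def hik(a, b):
    """Histogram intersection kernel: sum of element-wise minima for every row pair."""
    return np.minimum(a[:, None, :], b[None, :, :]).sum(axis=2)

def gp_predictive_variance(train, test, noise=0.1):
    """Naive GP predictive variance k(x*, x*) - k*^T (K + noise * I)^-1 k*."""
    k_train = hik(train, train) + noise * np.eye(len(train))
    k_cross = hik(train, test)          # shape (n_train, n_test)
    k_self = test.sum(axis=1)           # min(x, x) summed equals the sum of x
    solved = np.linalg.solve(k_train, k_cross)
    return k_self - np.sum(k_cross * solved, axis=0)
```

A test histogram that shares no mass with any training histogram keeps its full prior variance, while one that matches a training example has its variance reduced.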

Histogram of Oriented Normal Vectors for Object Recognition with a Depth Sensor

We propose a feature, the Histogram of Oriented Normal Vectors (HONV), designed specifically to capture local geometric characteristics for object recognition with a depth sensor. Through our derivation, the normal vector orientation represented as an ordered pair of azimuthal angle and zenith angle can be easily computed from the gradients of the depth image. We form the HONV as a concatenation of local histograms of azimuthal angle and zenith angle. Since the HONV is inherently the local distribution of the tangent plane orientation of an object surface, we use it as a feature for object detection/classification tasks. The object detection experiments on the standard RGB-D dataset [1] and a self-collected Chair-D dataset show that the HONV significantly outperforms traditional features such as HOG on the depth image and HOG on the intensity image, with an improvement of 11.6% in average precision. For object classification, the HONV achieved 5.0% improvement over state-of-the-art approaches.

Shuai Tang, Xiaoyu Wang, Xutao Lv, Tony X. Han, James Keller, Zhihai He, Marjorie Skubic, Shihong Lao
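
The computation described in the abstract above, normal orientation from depth gradients followed by an angle histogram, can be sketched as below. The angle convention (azimuth from the gradient direction, zenith from the gradient magnitude) is one plausible choice and is an assumption, as are the function name and bin count. For brevity the sketch builds a single histogram over the whole image rather than concatenating per-cell histograms.

```python
import numpy as np

def honv_histogram(depth, bins=8):
    """Azimuth/zenith maps from depth gradients and one normalized 2D histogram."""
    gy, gx = np.gradient(np.asarray(depth, dtype=float))  # d/drow, d/dcol
    azimuth = np.arctan2(gy, gx)           # in [-pi, pi]
    zenith = np.arctan(np.hypot(gx, gy))   # in [0, pi/2)
    hist, _, _ = np.histogram2d(
        azimuth.ravel(), zenith.ravel(), bins=bins,
        range=[[-np.pi, np.pi], [0.0, np.pi / 2]],
    )
    return azimuth, zenith, hist.ravel() / hist.sum()
```

A full descriptor in the spirit of the abstract would split the image into cells, compute one such histogram per cell, and concatenate them.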

Globally Optimal Consensus Set Maximization through Rotation Search

A popular approach to detect outliers in a data set is to find the largest consensus set, that is to say maximizing the number of inliers and estimating the underlying model. RANSAC is the most widely used method for this aim but is non-deterministic and does not guarantee to return the optimal solution. In this paper, we consider a rotation model and we present a new approach that performs consensus set maximization in a mathematically guaranteed globally optimal way. We solve the problem by a branch-and-bound framework associated with a rotation space search. Our mathematical formulation can be applied for various computer vision tasks such as panoramic image stitching, 3D registration with a rotating range sensor and line clustering and vanishing point estimation. Experimental results with synthetic and real data sets have successfully confirmed the validity of our approach.

Jean-Charles Bazin, Yongduek Seo, Marc Pollefeys
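
The objective being maximized in the abstract above is the size of the consensus set for a candidate rotation. The sketch below only evaluates that objective for one given rotation, assuming the data are matched pairs of unit direction vectors and an angular inlier threshold; the branch-and-bound search over rotation space, which is the paper's contribution, is not implemented. The function name and the threshold value are hypothetical.

```python
import numpy as np

def consensus_set(rotation, src, dst, max_angle_deg=1.0):
    """Indices i where the angle between rotation @ src[i] and dst[i] is within the threshold.

    src and dst are (n, 3) arrays of unit vectors; rotation is a 3x3 rotation matrix.
    """
    rotated = src @ rotation.T
    cosines = np.clip(np.sum(rotated * dst, axis=1), -1.0, 1.0)
    return np.flatnonzero(np.arccos(cosines) <= np.deg2rad(max_angle_deg))
```

A globally optimal method would bound this count over whole regions of rotation space instead of testing randomly sampled rotations as RANSAC does.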

Graspable Parts Recognition in Man-Made 3D Shapes

We address the problem of automatic recognition of graspable parts in man-made 3D shapes, which exhibit high intra-class variability that cannot be captured with geometric descriptors alone. We observe that, in the presence of significant geometric and topological variations, the context of a part within a 3D shape provides important cues about its functionality. We propose to model the context as structural relationships between shape parts and use them, in addition to part geometry, as cues for identifying automatically the graspable parts. We design a set of spatial relationships that can be extracted from general shapes. Then, we propose a new similarity measure that captures a part context and enables better recognition of graspable parts. We use this property to design a classifier that learns the semantics of a shape part. We demonstrate that our approach outperforms the state-of-the-art approaches that are purely geometric-based.

Hamid Laga

Poster Session 5: Face/Gesture Analysis and Recognition

Face Recognition after Plastic Surgery: A Comprehensive Study

It has been observed that many face recognition algorithms fail to recognize faces after plastic surgery, which thus poses a new challenge to automatic face recognition. This paper first gives a comprehensive study on Face Recognition After Plastic Surgery (FRAPS), with careful analysis of the effects of plastic surgery on face appearance and its challenges to face recognition. Then, to address the FRAPS problem, an ensemble of Gabor Patch classifiers via Rank-Order list Fusion (GPROF) is proposed, inspired by the assumption of the interior consistency of face components in terms of identity. On the face database of plastic surgery, GPROF achieves a much higher face identification rate than the best known results in the literature. Furthermore, in light of these results, we suggest that plastic surgery detection deserves more attention. To address this problem, a partial matching based plastic surgery detection algorithm is proposed, aiming to detect four distinct types of surgery, i.e., eyelid surgery, nose surgery, forehead surgery, and face lift surgery. Our experimental results demonstrate that plastic surgery detection is a nontrivial task, and thus deserves more research effort.

Xin Liu, Shiguang Shan, Xilin Chen

Enhancing Expression Recognition in the Wild with Unlabeled Reference Data

Facial expression recognition is an important task in human-computer interaction. Some methods work well on "lab-controlled" data. However, their performance degrades dramatically on real-world data, as expressions cover large variations, including pose, illumination, occlusion, and even cultural differences. To deal with this problem, large-scale data is needed. On the other hand, collecting and labeling wild expression data can be difficult and time consuming. In this paper, aiming at robust expression recognition in the wild, which suffers from the problems mentioned above, we propose a semi-supervised method that makes use of large-scale unlabeled data in two steps: 1) We enrich reference manifolds using selected unlabeled data which are close to a certain kind of expression. The learned manifolds can help smooth the variation of the original data and provide a reliable metric to maintain semantic similarity of expressions; 2) To augment the original labeled set for enhanced training, we iteratively employ semi-supervised clustering to assign labels to unlabeled data and add the most discriminant ones to the labeled set. Experiments on the latest wild expression databases, SFEW and GENKI, show that the proposed method can effectively exploit unlabeled data to improve performance on real-world expression recognition.

Mengyi Liu, Shaoxin Li, Shiguang Shan, Xilin Chen

Benchmarking Still-to-Video Face Recognition via Partial and Local Linear Discriminant Analysis on COX-S2V Dataset

In this paper, we explore the real-world Still-to-Video (S2V) face recognition scenario, where only very few (single, in many cases) still images per person are enrolled into the gallery while it is usually possible to capture one or multiple video clips as probe. A typical application of S2V is mug-shot based watch list screening. Generally, in this scenario, the still images are collected in a controlled environment and are thus of high quality and resolution, in frontal view, with normal lighting and neutral expression. In contrast, the testing video frames are of low resolution and low quality, possibly with blur, and captured under poor lighting, in non-frontal view. We find that S2V face recognition has been heavily overlooked in the past. Therefore, we provide a benchmark in terms of both a large-scale dataset and a new solution to the problem. Specifically, we collect (and release) a new dataset named COX-S2V, which contains 1,000 subjects, each with a high-quality photo and four video clips captured to simulate a video surveillance scenario. Together with the database, a clear evaluation protocol is designed for benchmarking. In addition, to address this problem, we further propose a novel method named Partial and Local Linear Discriminant Analysis (PaLo-LDA). We then evaluate the method on COX-S2V and compare it with several classic methods including LDA, LPP, and ScSR. Evaluation results not only show the grand challenges of COX-S2V, but also validate the effectiveness of the proposed PaLo-LDA method over the competing methods.

Zhiwu Huang, Shiguang Shan, Haihong Zhang, Shihong Lao, Alifu Kuerban, Xilin Chen

Fusing Magnitude and Phase Features for Robust Face Recognition

Highly accurate face recognition is of great importance for real-world applications such as identity authentication, watch list screening, and human-computer interaction. Despite tremendous progress made in the last decades, fully automatic face recognition systems are still far from the goal of surpassing the human vision system, especially in uncontrolled conditions. In this paper, we propose an approach for robust face recognition by fusing two complementary features: one is the Gabor magnitude of multiple scales and orientations and the other is the Fourier phase encoded by Local Phase Quantization (LPQ). To reduce the high dimensionality of both features, patch-wise Fisher Linear Discriminant Analysis is applied to each, and the results are combined by score-level fusion. In addition, multi-scale face models are exploited to make use of more information and improve the robustness of the proposed approach. Experimental results show that the proposed approach achieves 96.09%, 95.64% and 95.15% verification rates (when FAR=0.1%) on ROC1/2/3 of Face Recognition Grand Challenge (FRGC) version 2 Experiment 4, surpassing the best known results, i.e. 93.91%, 93.55%, and 93.12%.

Yan Li, Shiguang Shan, Haihong Zhang, Shihong Lao, Xilin Chen

Finding Happiest Moments in a Social Context

We study the problem of expression analysis for a group of people. Automatic facial expression analysis has seen much research in recent times. However, little attention has been given to the estimation of the overall expression theme conveyed by an image of a group of people. Specifically, this work focuses on formulating a framework for happiness intensity estimation for groups based on social context information. The main contributions of this paper are: a) defining automatic frameworks for group expressions; b) social features, which compute weights on expression intensities; c) an automatic face occlusion intensity detection method; and d) an ‘in the wild’ labelled database containing images having multiple subjects from different scenarios. The experiments show that the global and local contexts provide useful information for theme expression analysis, with results similar to human perception results.

Abhinav Dhall, Jyoti Joshi, Ibrahim Radwan, Roland Goecke

Sparsity Sharing Embedding for Face Verification

Face verification in an uncontrolled environment is a challenging task due to the possibility of large variations in pose, illumination, expression, occlusion, age, scale, and misalignment. To account for these intra-personal settings, this paper proposes a sparsity sharing embedding (SSE) method for face verification that takes into account a pair of input faces under different settings. The proposed SSE method measures the distance between two input faces ${\mathbf x}_A$ and ${\mathbf x}_B$ under intra-personal settings […] in two steps: 1) in the association step, ${\mathbf x}_A$ and ${\mathbf x}_B$ are represented in terms of a reconstructive weight vector and identity under settings […], respectively, from the generic identity dataset; 2) in the prediction step, the associated faces are replaced by embedding vectors that conserve their identity but are embedded to preserve the inter-personal structures of the intra-personal settings. Experiments on a MultiPIE dataset show that the SSE method performs better than the AP model in terms of the verification rate.

Donghoon Lee, Hyunsin Park, Junyoung Chung, Youngook Song, Chang D. Yoo

Semantic Pixel Sets Based Local Binary Patterns for Face Recognition

Feature extraction plays an important role in face recognition. Based on local binary patterns (LBP), we propose a novel face representation method which obtains histograms of semantic pixel sets based LBP (spsLBP) with a robust code voting (rcv). By clustering according to the semantic pixel relations before the histogram estimation, the spsLBP makes better use of spatial information than the original LBP. In this paper, we use a simple rule to exploit the semantic information: we cluster by pixel intensity values, which is also invariant to monotonic grayscale changes and is particularly useful when there are occlusions and expression variations in face images. In addition, the proposed representation adopts a new code voting strategy for LBP histogram computation, which makes it more robust. The proposed method is evaluated on three widely used face recognition databases: AR, FERET and LFW. Experimental results show that the proposed method can outperform the original uniform LBP and its extensions.

Zhenhua Chai, Heydi Mendez-Vazquez, Ran He, Zhenan Sun, Tieniu Tan
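
As background for the abstract above, the sketch below computes the plain 8-neighbour LBP code that spsLBP builds on. It is the textbook operator only: the semantic pixel sets and the robust code voting proposed in the paper are not reproduced, and the function name `lbp_codes` and the neighbour ordering are assumptions.

```python
import numpy as np

def lbp_codes(gray):
    """Basic 3x3 LBP code for every interior pixel of a grayscale image."""
    g = np.asarray(gray, dtype=float)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Clockwise neighbour offsets (row, col), starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes |= (neighbor >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes
```

Because each bit depends only on the ordering of two intensities, the codes are unchanged by any strictly increasing grayscale transform. A face descriptor is typically the histogram of these codes, for example `np.bincount(codes.ravel(), minlength=256)`.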

An Adaptation Framework for Head-Pose Classification in Dynamic Multi-view Scenarios

Multi-view head-pose estimation in low-resolution, dynamic scenes is difficult due to blurred facial appearance and perspective changes as targets move around freely in the environment. Under these conditions, acquiring sufficient training examples to learn the dynamic relationship between […] face appearance […] can be very expensive. Instead, a transfer learning approach is proposed in this work. Upon learning a weighted-distance function from many examples where the target position is […], we […] these weights to the scenario where target positions are […]. The adaptation framework incorporates reliability of the different face regions for pose estimation under positional variation, by transforming the target appearance to a canonical appearance corresponding to a […] scene location. Experimental results confirm the effectiveness of the proposed approach, which outperforms the state of the art by 9.5% under relevant conditions. To aid further research on this topic, we also make DPOSE, a dynamic, multi-view head-pose dataset with ground truth, publicly available with this paper.

Anoop K. Rajagopal, Ramanathan Subramanian, Radu L. Vieriu, Elisa Ricci, Oswald Lanz, Kalpathi Ramakrishnan, Nicu Sebe

Face Parts Localization Using Structured-Output Regression Forests

In this paper, we propose a method for face parts localization called Structured-Output Regression Forests (SO-RF). We assume that the spatial graph of face parts structure can be partitioned into star graphs associated with individual parts. At each leaf, a regression model for an individual part as well as an interdependency model between parts in the star graph is learned. During testing, individual part positions are determined by the product of two voting maps, corresponding to two different models. The part regression model captures local feature evidence while the interdependency model captures the structure configuration. Our method has shown state of the art results on the publicly available BioID dataset and competitive results on a more challenging dataset, namely Labeled Face Parts in the Wild.

Heng Yang, Ioannis Patras

Human Face Super-Resolution Based on NSCT

The nonsubsampled contourlet transform (NSCT) is a useful tool for vision and image processing that is multi-scale, multi-directional, and shift-invariant. In this paper, based on the relations between NSCT coefficients, we propose a novel strategy for single-frame human face super-resolution (SR). Both the high resolution (HR) and low resolution (LR) images in the training set are decomposed beforehand by NSCT. Given an LR image, we first decompose it by NSCT, and then use locally linear embedding (LLE) to learn the target SR image’s NSCT coefficients via our proposed SR strategies. Finally, we use the inverse transformation to compose the final SR image. Extensive experiments on the CAS-PEAL Face Database demonstrate that our SR method outperforms the state-of-the-art methods both visually and in terms of SSIM and PSNR. Results on real-world images further verify the effectiveness and superiority of our method.

Jing Liu, Guangda Su, Xiaolong Ren, Jiansheng Chen

Digital Paparazzi: Spotting Celebrities in Professional Photo Libraries

We propose a scalable solution to the problem of real-world face recognition when both training and test faces are under varying pose and illumination. Our proposed classifier solves a sparse approximation problem in a learned transform domain. Our algorithm uses a cascaded solution to significantly reduce the computational cost of the classification process. The cascaded solution first applies a more efficient Subspace Pursuit algorithm to the test image, and only runs a more accurate ℓ1-minimization algorithm on those face images for which Subspace Pursuit does not have enough confidence in its prediction. We also show the application of our algorithm to automatic face annotation of media objects, and show that on average our algorithm achieves about 94% annotation accuracy on the celebrity benchmark dataset.

Sina Jafarpour, Li-Jia Li, Roelof van Zwol
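
The cascade described in the abstract above follows a common pattern: run a cheap classifier first and fall back to an expensive one only when the cheap one is not confident. The sketch below shows that control flow only. The two toy classifiers and the threshold are hypothetical stand-ins, not Subspace Pursuit or ℓ1-minimization.

```python
def cascaded_classify(sample, fast_classifier, accurate_classifier, threshold):
    """Return (label, stage) where stage records which classifier decided.

    Both classifiers are callables returning (label, confidence).
    """
    label, confidence = fast_classifier(sample)
    if confidence >= threshold:
        return label, "fast"
    # The fast stage is not confident enough: pay for the slower model.
    label, _ = accurate_classifier(sample)
    return label, "accurate"


def toy_fast(x):
    # Hypothetical cheap model: label by sign, confidence by distance
    # from the decision boundary at 0.
    return ("A" if x >= 0 else "B"), abs(x)


def toy_accurate(x):
    # Hypothetical slower model with a slightly different boundary.
    return ("A" if x >= 0.1 else "B"), 1.0


easy = cascaded_classify(2.0, toy_fast, toy_accurate, threshold=0.5)
hard = cascaded_classify(0.05, toy_fast, toy_accurate, threshold=0.5)
```

The cost saving depends on how many samples clear the threshold in the first stage, so the threshold trades accuracy against speed.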

Nighttime Face Recognition at Long Distance: Cross-Distance and Cross-Spectral Matching

Automatic face recognition capability in surveillance systems is important for security applications. However, few studies have addressed the problem of outdoor face recognition at a long distance (over 100 meters) in both daytime and nighttime environments. In this paper, we first report on a system that we have designed to collect a long-distance face image database, called the Long Distance Heterogeneous Face Database (LDHF-DB), to advance research on this topic. The LDHF-DB contains face images collected in an outdoor environment at distances of 60 meters, 100 meters, and 150 meters, with visible light (VIS) face images captured in daytime and near infrared (NIR) face images captured at nighttime. Given this database, we have conducted two types of cross-distance face matching experiments (matching a long-distance probe to a 1-meter gallery): (i) intra-spectral (VIS to VIS) face matching, and (ii) cross-spectral (NIR to VIS) face matching. The proposed face recognition algorithm consists of the following three major steps: (i) Gaussian filtering to remove high-frequency noise, (ii) Scale Invariant Feature Transform (SIFT) in local image regions for feature representation, and (iii) a random subspace method to build discriminant subspaces for face recognition. Experimental results show that the proposed face recognition algorithm outperforms two commercial state-of-the-art face recognition SDKs (FaceVACS and PittPatt) for long-distance face recognition in both daytime and nighttime operations. These results highlight the need for better data capture setups and robust face matching algorithms for cross-spectral matching at distances greater than 100 meters.

Hyunju Maeng, Shengcai Liao, Dongoh Kang, Seong-Whan Lee, Anil K. Jain

Hand Posture Recognition from Disparity Cost Map

In this paper, we address the problem of hand posture recognition with a binocular camera. As a bare hand has few landmarks for matching, instead of using accurate matching between two views, we define a kind of mapping score, the Disparity Cost Map. The disparity cost map serves as the final hand representation for recognition. Because we use the disparity cost map, an explicit segmentation stage is not necessary. Local Binary Pattern (LBP) is used as the feature for classification in this paper. In order to align the LBP feature, we further design an annular mask to deal with the problem of scaling, rotation, and translation (RST) and to search for an accurate bounding box of the hand. The experimental results demonstrate the efficiency and robustness of our method. For 15 hand postures in various cluttered backgrounds, the proposed method achieves an average recognition rate of 95% with an SVM classifier.

Hanjie Wang, Qi Wang, Xilin Chen
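
For readers unfamiliar with the LBP feature mentioned in the abstract above, here is a minimal sketch of the basic 8-neighbour LBP code and its histogram. It assumes a plain 2D list of grey values as input. The neighbour ordering is one common convention chosen for illustration, and the annular mask and disparity cost map from the abstract are not modelled.

```python
# Offsets of the 8 neighbours, clockwise starting at the top-left pixel.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Basic LBP: set bit k when neighbour k is >= the centre pixel."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    histogram = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, r, c)] += 1
    return histogram


# Toy 3x3 patch (made-up values) with a single interior pixel.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]

center_code = lbp_code(patch, 1, 1)
patch_histogram = lbp_histogram(patch)
```

In a recognition pipeline, a histogram like this (often computed per image cell and concatenated) would be the feature vector handed to a classifier such as an SVM.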

Qualitative Pose Estimation by Discriminative Deformable Part Models

We present a discriminative deformable part model for the recovery of qualitative pose, inferring coarse pose labels (e.g., left, front-right, back), a task which we expect to be more robust to common confounding factors that hinder the inference of exact 2D or 3D joint locations. Our approach automatically selects parts that are predictive of qualitative pose and trains their appearance and deformation costs to best discriminate between qualitative poses. Unlike previous approaches, our parts are both selected and trained to improve qualitative pose discrimination and are shared by all the qualitative pose models. This leads to both increased accuracy and higher efficiency, since fewer part models are evaluated for each image. In comparisons with two state-of-the-art approaches on a public dataset, our model shows superior performance.

Hyungtae Lee, Vlad I. Morariu, Larry S. Davis

Learning Discriminant Face Descriptor for Face Recognition

The face descriptor is a critical component of face recognition. Many local face descriptors, such as Gabor and LBP, have exhibited good discriminative ability for face recognition. However, most existing face descriptors are handcrafted, and the extracted features may not be optimal for face representation and recognition. In this paper, we propose a learning-based mechanism to learn a discriminant face descriptor (DFD) that is optimal for face recognition in a data-driven way. In particular, the discriminant image filters and the optimal weight assignments of neighboring pixels are learned simultaneously to enhance the discriminative ability of the descriptor. In this way, more useful information is extracted and face recognition performance is improved. Extensive experiments on the FERET, CAS-PEAL-R1 and LFW face databases validate the effectiveness and good generalization of the proposed method.

Zhen Lei, Stan Z. Li

Optimal Operator Space Pursuit: A Framework for Video Sequence Data Analysis

High dimensional data sequences, such as video clips, can be modeled as trajectories in a high dimensional space and usually exhibit a low dimensional structure intrinsic to each distinct class of data sequence [1]. In this paper, we exploit a fibre bundle formalism to model various realizations of each trajectory, and characterize these high dimensional data sequences by an optimal operator subspace. Each operator is calculated as a matched filter corresponding to a standard Gaussian output with the data as input. The low dimensional structure intrinsic to the data is further explored by minimizing the dimension of the operator space under data-driven constraints. The dimension minimization problem is reformulated as a convex nuclear norm minimization problem, and an associated algorithm is proposed. Moreover, a fast method with superior performance for video-based human activity classification is implemented by searching for an optimal operator space adapted to the data. Illustrative examples demonstrating the performance of this approach are presented.

Xiao Bian, Hamid Krim

Cross-view Graph Embedding

Recently, more and more approaches have emerged to solve the cross-view matching problem, where reference samples and query samples come from different views. In this paper, inspired by Graph Embedding, we propose a unified framework for these cross-view methods called Cross-view Graph Embedding. The proposed framework can not only reformulate most traditional cross-view methods (e.g., CCA, PLS and CDFE), but also extend typical single-view algorithms (e.g., PCA, LDA and LPP) to cross-view versions. Furthermore, our general framework also facilitates the development of new cross-view methods. In this paper, we present a new algorithm named Cross-view Local Discriminant Analysis (CLODA) under the proposed framework. Unlike previous cross-view methods, which preserve only the inter-view discriminant information or the intra-view local structure, CLODA preserves both the local structure and the discriminant information, within each view and across views. Extensive experiments are conducted to evaluate our algorithms on two cross-view face recognition problems: face recognition across poses and face recognition across resolutions. These real-world face recognition experiments demonstrate that our framework achieves impressive performance on cross-view problems.

Zhiwu Huang, Shiguang Shan, Haihong Zhang, Shihong Lao, Xilin Chen

Fast Training of Effective Multi-class Boosting Using Coordinate Descent Optimization

We present a novel column-generation-based boosting method for multi-class classification. Our multi-class boosting is formulated as a single optimization problem, as in [1]. Unlike most existing multi-class boosting methods, which use the same set of weak learners for all the classes, we train class-specific weak learners (i.e., each class has a different set of weak learners). We show that using separate weak learner sets for each class leads to fast convergence without introducing additional computational overhead in the training procedure. To make the training more efficient and scalable, we also propose a fast coordinate descent method for solving the optimization problem at each boosting iteration. The proposed coordinate descent method is conceptually simple and easy to implement, in that each coordinate update has a closed-form solution. Experimental results on a variety of datasets show that, compared to a range of existing multi-class boosting methods, the proposed method has a much faster convergence rate and better generalization performance in most cases. We also show empirically that the proposed fast coordinate descent algorithm needs less training time than the MultiBoost algorithm in [1].

Guosheng Lin, Chunhua Shen, Anton van den Hengel, David Suter
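
The idea of coordinate descent with a closed-form update per coordinate, mentioned in the abstract above, can be shown on a much simpler problem. The sketch below minimizes a least-squares objective 0.5 * ||Ax - b||^2 one coordinate at a time. This objective is a stand-in chosen for illustration and is not the boosting objective of the paper; the function name and the toy data are hypothetical.

```python
def coordinate_descent_least_squares(A, b, sweeps=100):
    """Minimise 0.5 * ||A x - b||^2 by cyclic coordinate descent.

    A is a list of rows, b a list of targets. Each coordinate update is
    the exact (closed-form) minimiser along that coordinate.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = list(b)  # equals b - A x, since x starts at zero
    for _ in range(sweeps):
        for j in range(n_cols):
            column = [A[i][j] for i in range(n_rows)]
            norm_sq = sum(v * v for v in column)
            if norm_sq == 0.0:
                continue  # this coordinate has no effect on the objective
            # Closed-form step: project the residual onto column j.
            step = sum(column[i] * residual[i] for i in range(n_rows)) / norm_sq
            x[j] += step
            for i in range(n_rows):
                residual[i] -= step * column[i]
    return x


# Toy system with exact solution x = [0.8, 1.4].
A = [[2.0, 1.0],
     [1.0, 3.0]]
b = [3.0, 5.0]
solution = coordinate_descent_least_squares(A, b)
```

Keeping the residual up to date makes each coordinate update cost one pass over a single column, which is the property that makes this family of methods attractive for large problems.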

Hierarchical Space Tiling for Scene Modeling

A typical scene category, e.g., street and beach, contains an enormous number (e.g., on the order of 10^[…] to 10^[…]) of distinct scene configurations that are composed of objects and regions of varying shapes in different layouts. A well-known representation that can effectively address such complexity is the family of compositional models; however, learning the structures of hierarchical compositional models remains a challenging task in vision. The objective of this paper is to present an efficient method for learning such models from a set of scene configurations. We start with an over-complete representation called Hierarchical Space Tiling (HST), which quantizes the huge and continuous scene configuration space in an And-Or tree (AOT). This hierarchical AOT can generate a combinatorial number of configurations (on the order of 10^[…]) through a small dictionary of elements. Then we estimate the HST/AOT model through a learning-by-parsing strategy, which iteratively updates the HST/AOT parameters while constructing the optimal parse trees for each training configuration. Finally, we prune the branches with zero or low probability to obtain a much smaller HST/AOT. The HST quantization allows us to transfer the challenging […] problem to a tractable […] problem. We evaluate the representation in three aspects. (i) Coding efficiency. We show that the learned representation can approximate valid configurations with fewer errors using a smaller number of primitives than other popular representations. (ii) Semantic power of learning. The learned representation is less ambiguous in parsing configurations and has semantically meaningful inner concepts. It captures both the diversity and the frequency (prior) of the scene configurations. (iii) Scene classification. The model is not only fully generative but also yields discriminative scene classification performance that outperforms state-of-the-art methods.

Shuo Wang, Yizhou Wang, Song-Chun Zhu

