
About this book

The four-volume set LNCS 7724--7727 constitutes the thoroughly refereed post-conference proceedings of the 11th Asian Conference on Computer Vision, ACCV 2012, held in Daejeon, Korea, in November 2012. The total of 226 contributions presented in these volumes was carefully reviewed and selected from 869 submissions. The papers are organized in topical sections on object detection, learning and matching; object recognition; feature, representation, and recognition; segmentation, grouping, and classification; image representation; image and video retrieval and medical image analysis; face and gesture analysis and recognition; optical flow and tracking; motion, tracking, and computational photography; video analysis and action recognition; shape reconstruction and optimization; shape from X and photometry; applications of computer vision; low-level vision and applications of computer vision.



Oral Session 1: Object Detection and Learning

Beyond Dataset Bias: Multi-task Unaligned Shared Knowledge Transfer

Many visual datasets are traditionally used to analyze the performance of different learning techniques. The evaluation is usually done within each dataset, therefore it is questionable if such results are a reliable indicator of true generalization ability. We propose here an algorithm to exploit the existing data resources when learning on a new multiclass problem. Our main idea is to identify an image representation that decomposes orthogonally into two subspaces: a part specific to each dataset, and a part generic to, and therefore shared between, all the considered source sets. This allows us to use the generic representation as un-biased reference knowledge for a novel classification task. By casting the method in the multi-view setting, we also make it possible to use different features for different databases. We call the algorithm MUST, Multitask Unaligned Shared knowledge Transfer. Through extensive experiments on five public datasets, we show that MUST consistently improves the cross-datasets generalization performance.

Tatiana Tommasi, Novi Quadrianto, Barbara Caputo, Christoph H. Lampert

Cross-Database Transfer Learning via Learnable and Discriminant Error-Correcting Output Codes

We present a transfer learning approach that transfers knowledge across two multi-class, unconstrained domains (source and target), and accomplishes object recognition with few training samples in the target domain. Unlike most previous work, we make no assumption about the relatedness of these two domains. Namely, data of the two domains can be from different databases and of distinct categories. To overcome the domain variations, we propose to learn a set of commonly shared and discriminant attributes in the form of error-correcting output codes. Upon each of the attributes, the unrelated, multi-class recognition tasks of the two domains are transformed into correlative, binary-class ones. The extra source knowledge can alleviate the high risk of overfitting caused by the lack of training data in the target domain. Our approach is evaluated on several benchmark datasets, and leads to about 40% relative improvement in accuracy when only one training sample is available.

Feng-Ju Chang, Yen-Yu Lin, Ming-Fang Weng
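As a rough illustration of how error-correcting output codes turn a multi-class decision into several binary ones, the sketch below decodes a vector of binary attribute predictions by nearest Hamming distance. The codebook, class names and bit values are hypothetical; the paper learns its codes, whereas this sketch assumes a fixed, hand-written table.

```python
# Hypothetical codebook: 3 classes described by 5 binary attributes.
# The values are illustrative only and are not taken from the paper.
CODEBOOK = {
    "cat":  (0, 0, 1, 1, 0),
    "car":  (1, 0, 0, 1, 1),
    "tree": (0, 1, 1, 0, 1),
}

def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits, codebook=CODEBOOK):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(bits, codebook[label]))

# One of the five binary classifiers is wrong (fourth bit flipped from "car"),
# but the codewords are at least 3 bits apart, so decoding still recovers it.
prediction = decode((1, 0, 0, 0, 1))
print(prediction)
```

Because every pair of codewords in this toy table differs in at least three positions, any single wrong attribute prediction is corrected.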

Human Reidentification with Transferred Metric Learning

Human reidentification is the task of matching persons observed in non-overlapping camera views using visual features for inter-camera tracking. The ambiguity increases with the number of candidates to be distinguished. Simple temporal reasoning can simplify the problem by pruning the candidate set to be matched. Existing approaches adopt a fixed metric for matching all the subjects. Our approach is motivated by the insight that different visual metrics should be optimally learned for different candidate sets. We tackle this problem under a transfer learning framework. Given a large training set, the training samples are selected and reweighted according to their visual similarities with the query sample and its candidate set. A weighted maximum margin metric is learned online and transferred from a generic metric to a candidate-set-specific metric. The whole online reweighting and learning process takes less than two seconds per candidate set. Experiments on the VIPeR dataset and our dataset show that the proposed transferred metric learning significantly outperforms directly matching visual features or using a single generic metric learned from the whole training set.

Wei Li, Rui Zhao, Xiaogang Wang

Poster Session 1: Object Detection, Learning and Matching

Tell Me What You Like and I’ll Tell You What You Are: Discriminating Visual Preferences on Flickr Data

John Ruskin’s 19th-century adage suggests that personal taste is not merely an absolute set of aesthetic principles valid for everyone: rather, it is a process of interpretation that also has roots in one’s life experiences. This aspect is nowadays a major obstacle to automatically inferring the quality of a picture. In this paper, instead of trying to solve this age-old problem, we consider an intriguing, orthogonal direction, aimed at discovering how personal tastes differ. Given a set of preferred images of a user, obtained from Flickr, we extract a pool of low- and high-level features; LASSO regression is then exploited to learn the most discriminative ones, considering a group of 200 random Flickr users. Such aspects can be easily recovered, making it possible to understand what is the “what we like” that distinguishes us from others. We then perform multi-class classification, where a test sample is a set of preferred pictures of an unknown user, and the classes are all the users. The results are surprising: given only 1 image as test, we can match the user preferences well above chance, and with 20 images we reach an nAUC of 91%, considering the cumulative matching characteristic curve. Extensive experiments support our approach, suggesting intriguing new perspectives in the study of computational aesthetics.

Pietro Lovato, Alessandro Perina, Nicu Sebe, Omar Zandonà, Alessio Montagnini, Manuele Bicego, Marco Cristani
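The evaluation measure mentioned in the abstract, a cumulative matching characteristic (CMC) curve summarized by a normalized area under the curve, can be sketched as follows. The rank values are made up, and treating nAUC as the mean of the CMC values is an assumption of this sketch; the paper may use a different normalization.

```python
def cmc_curve(ranks, num_classes):
    """CMC value at rank k = fraction of probes whose true class is ranked <= k.

    ranks: 1-based rank of the correct user for each test probe.
    """
    n = len(ranks)
    return [sum(r <= k for r in ranks) / n for k in range(1, num_classes + 1)]

def normalized_auc(curve):
    """Area under the CMC curve divided by its length (assumed normalization)."""
    return sum(curve) / len(curve)

# Hypothetical ranks for four probes in a 4-user problem.
ranks = [1, 2, 1, 4]
curve = cmc_curve(ranks, num_classes=4)
print(curve, normalized_auc(curve))
```

A perfect matcher would place every probe at rank 1, giving a curve of all ones and an nAUC of 1.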

Local Context Priors for Object Proposal Generation

State-of-the-art methods for object detection are mostly based on an expensive exhaustive search over the image at different scales. In order to reduce the computational time, one can perform a selective search to obtain a small subset of relevant object hypotheses that need to be evaluated by the detector. For that purpose, we employ a regression to predict possible object scales and locations by exploiting the local context of an image. Furthermore, we show how a priori information, if available, can be integrated to improve the prediction. The experimental results on three datasets including the Caltech pedestrian and PASCAL VOC dataset show that our method achieves the detection performance of an exhaustive search approach with much less computational load. Since we model the prior distribution over the proposals locally, it generalizes well and can be successfully applied across datasets.

Marko Ristin, Juergen Gall, Luc Van Gool

Arbitrary-Shape Object Localization Using Adaptive Image Grids

Sliding-window based search is a widely used technique for object localization. However, for objects of non-rectangular shape, noise in the windows may mislead the localization, causing unsatisfactory results. In this paper, we propose an efficient bottom-up approach for detecting arbitrary-shape objects using image grids as basic components. First, a test image is partitioned into grids and the object is localized by finding a set of connected grids which maximize the classifier’s response. Then, graph cut segmentation is used to improve the object boundary by utilizing local image context. Instead of using bounding boxes, the proposed approach searches connected regions of any shape. With the graph cut refinement, our approach can start with coarse image grids and is robust to noise. To make image grids better cover an object of arbitrary shape, we also propose a fast adaptive grid partition method which takes image content into account and can be efficiently implemented by dynamic programming. The use of adaptive partition further improves the localization accuracy of our approach. Experiments on the PASCAL VOC 2007 and VOC 2008 datasets demonstrate the effectiveness of our approach.

Chunluan Zhou, Junsong Yuan

Disambiguation in Unknown Object Detection by Integrating Image and Speech Recognition Confidences

This paper presents a new method to detect unknown objects and their unknown names in object manipulation through man-robot dialog. In the method, the detection is carried out by using the information of object images and the user’s speech in an integrated way. The originality of the method lies in using logistic regression to discriminate between unknown and known objects. The accuracy of the unknown object detection was 97% when there were about fifty known objects.

Yuko Ozasa, Yasuo Ariki, Mikio Nakano, Naoto Iwahashi
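To make the idea of integrating the two confidences concrete, here is a minimal sketch of a logistic regression decision on an image recognition confidence and a speech recognition confidence. The weights, bias, threshold and input values are hypothetical placeholders, not parameters from the paper, where they would be fitted to training data.

```python
import math

def unknown_probability(image_conf, speech_conf, weights=(-4.0, -4.0), bias=4.0):
    """Logistic model: low recognition confidences push the probability of 'unknown' up.

    weights and bias are illustrative values chosen by hand for this sketch.
    """
    z = bias + weights[0] * image_conf + weights[1] * speech_conf
    return 1.0 / (1.0 + math.exp(-z))

def is_unknown(image_conf, speech_conf, threshold=0.5):
    """Decide 'unknown object' when the modelled probability exceeds the threshold."""
    return unknown_probability(image_conf, speech_conf) > threshold

print(is_unknown(0.9, 0.9))  # both recognizers confident
print(is_unknown(0.2, 0.3))  # both recognizers unsure
```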

Class-Specific Weighted Dominant Orientation Templates for Object Detection

We present a class-specific weighted Dominant Orientation Template (DOT) for class-specific object detection to exploit the fast DOT, although the original DOT is intended for instance-specific object detection. We use an automatic selection algorithm to select representative DOTs from training images of an object class and use three types of 2D Haar wavelets to construct weight templates of the object class. To generate class-specific weighted DOTs, we use a modified similarity measure to combine the representative DOTs with the weight templates. In experiments, the proposed method achieved object detection that was better than or at least comparable to that of existing methods while being very fast for both training and testing.

Hui-Jin Lee, Ki-Sang Hong

Salient Object Detection via Color Contrast and Color Distribution

In this paper, we take advantage of color contrast and color distribution to obtain high-quality saliency maps. The overall procedure of our unified framework comprises superpixel pre-segmentation, color contrast and color distribution computation, combination, final refinement and then object segmentation. During color contrast saliency computation, we combine two color systems and then introduce the use of a distribution prior before saliency smoothing, which serves to select the correct color components. In addition, we propose a novel saliency smoothing procedure that is based on superpixel regions and is realized in color space. This processing step leads to the whole object being highlighted evenly, contributing to high-quality color contrast saliency maps. Finally, a new refinement approach is utilized to eliminate artifacts and recover unconnected parts in the combined saliency maps. In visual comparison, our method produces higher-quality saliency maps which emphasize the whole object while suppressing background clutter. Both qualitative and quantitative experiments show that our approach outperforms 8 state-of-the-art methods, achieving the highest precision rate of 96% (a 3% improvement over the current highest) when evaluated on one of the most popular data sets [1]. Excellent content-aware image resizing can also be achieved with our saliency maps.

Keren Fu, Chen Gong, Jie Yang, Yue Zhou

Data Decomposition and Spatial Mixture Modeling for Part Based Model

This paper presents a system of data decomposition and spatial mixture modeling for part based models. Recently, many enhanced part based models (with multiple features, more components or parts) have been proposed. Nevertheless, those enhanced models bring high computation cost together with the risk of over-fitting. To tackle this problem, we propose a data decomposition method for part based models which not only accelerates training and testing process but also improves the performance on average. Besides, the original part based model uses a strict rigid structural model to describe the distribution of each part location. It is not “deformable” enough, especially for those instances with different viewpoints or poses in the same aspect ratio. To address this problem, we present a novel spatial mixture modeling method. The spatial mixture embedded model is then integrated into the proposed data decomposition framework. We evaluate our system on the challenging PASCAL VOC2007 and PASCAL VOC2010 datasets, demonstrating the state-of-the-art performance compared with other related methods in terms of accuracy and efficiency.

Junge Zhang, Yongzhen Huang, Kaiqi Huang, Zifeng Wu, Tieniu Tan

Appearance Sharing for Collective Human Pose Estimation

While human pose estimation (HPE) techniques usually process each test image independently, in real applications images come in collections containing interdependent images. Often several images have similar backgrounds or show persons wearing similar clothing (foreground). We present a novel human pose estimation technique to exploit these dependencies by sharing appearance models between images. Our technique automatically determines which images in the collection should share appearance. We extend the state-of-the-art HPE model of Yang and Ramanan to include our novel appearance sharing cues and demonstrate on the highly challenging Leeds Sports Poses dataset that they lead to better results than traditional single-image pose estimation.

Marcin Eichner, Vittorio Ferrari

Max-Margin Regularization for Reducing Accidentalness in Chamfer Matching

Standard chamfer matching techniques and their state-of-the-art extensions utilize object contours but measure only the mere sum of location and orientation differences of contour pixels. In our approach we increase the specificity of the model contour by learning the relative importance of all model points instead of treating them as independent. However, chamfer matching is still prone to accidental matches in dense clutter. To detect such accidental matches we learn the co-occurrence of generic background contours to further reduce the number of false detections. Since clutter only interferes with the foreground model contour, we learn where to place the background contours with respect to the foreground object boundary. The co-occurrence of foreground model points and background contours are both integrated into a single max-margin framework. Thus our approach combines the advantages of accurately detecting objects or parts via chamfer matching and the robustness of max-margin learning. Our results on standard benchmark datasets show that our method significantly outperforms current directional chamfer matching, thus redefining the state-of-the-art in this field.

Angela Eigenstetter, Pradeep Krishna Yarlagadda, Björn Ommer
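For readers unfamiliar with the baseline, plain chamfer matching scores a template placement by the average distance from each template point to its nearest image edge pixel. The sketch below is a brute-force version with an optional per-point weight vector, which only hints at the idea of learned point importance; it ignores orientation, and the points, offsets and weights are hypothetical rather than taken from the paper.

```python
import math

def chamfer_score(template, edges, offset=(0, 0), weights=None):
    """Weighted mean distance from each shifted template point to its nearest edge pixel."""
    if weights is None:
        weights = [1.0] * len(template)  # plain chamfer matching: all points count equally
    dx, dy = offset
    total = 0.0
    for (x, y), w in zip(template, weights):
        nearest = min(math.hypot(x + dx - ex, y + dy - ey) for ex, ey in edges)
        total += w * nearest
    return total / sum(weights)

def best_offset(template, edges, candidates, weights=None):
    """Pick the candidate translation with the lowest chamfer score."""
    return min(candidates, key=lambda off: chamfer_score(template, edges, off, weights))

# Toy data: a 3-point horizontal template, an identical edge segment, one clutter pixel.
template = [(0, 0), (1, 0), (2, 0)]
edges = [(5, 3), (6, 3), (7, 3), (0, 9)]
candidates = [(0, 0), (5, 3), (0, 9)]
print(best_offset(template, edges, candidates))
```

In practice the nearest-edge distances are precomputed with a distance transform instead of the brute-force search used here.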

Coupling-and-Decoupling: A Hierarchical Model for Occlusion-Free Car Detection

Handling occlusions in object detection is a long-standing problem. This paper addresses the problem of X-to-X-occlusion-free object detection (e.g. car-to-car occlusions in our experiment) by utilizing an intuitive coupling-and-decoupling strategy. In the “coupling” stage, we model the pair of occluding X’s (e.g. car pairs) directly to account for the statistically strong co-occurrence (i.e. coupling). Then, we learn a hierarchical And-Or directed acyclic graph (AOG) model under the latent structural SVM (LSSVM) framework. The learned AOG consists of, from the top to bottom, (i) a root Or-node representing different compositions of occluding X pairs, (ii) a set of And-nodes each of which represents a specific composition of occluding X pairs, (iii) another set of And-nodes representing single X’s decomposed from occluding X pairs, and (iv) a set of terminal-nodes which represent the appearance templates for the X pairs, single X’s and latent parts of the single X’s, respectively. The part appearance templates can also be shared among different single X’s. In detection, a dynamic programming (DP) algorithm is used and as a natural consequence we decouple the two single X’s from the X-to-X occluding pairs. In experiments, we test our method on roadside cars which are collected from real traffic video surveillance environment by ourselves. We compare our model with the state-of-the-art deformable part-based model (DPM) and obtain better detection performance.

Bo Li, Tianfu Wu, Wenze Hu, Mingtao Pei

The Pooled NBNN Kernel: Beyond Image-to-Class and Image-to-Image

While most image classification methods to date are based on image-to-image comparisons, Boiman et al. have shown that better generalization can be obtained by performing image-to-class comparisons. Here, we show that these are just two special cases of a more general formulation, where the feature space is partitioned into subsets of different granularity. This way, a series of representations can be derived that trade off generalization against specificity. Thereby we show a connection between NBNN classification and different pooling strategies, where, in contrast to traditional pooling schemes that perform spatial pooling of the features, pooling is performed in feature space. Moreover, rather than picking a single partitioning, we propose to combine them in a multi-kernel framework. We refer to our method as the Pooled NBNN kernel. This new scheme leads to significant improvement over the standard image-to-image and image-to-class baselines, with only a small increase in computational cost.

Konstantinos Rematas, Mario Fritz, Tinne Tuytelaars
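The image-to-class comparison that the abstract builds on can be sketched in a few lines: each local descriptor of the query image is matched to its nearest descriptor among all training descriptors of a class, and the squared distances are summed per class. This is only the standard NBNN baseline, not the pooled kernel proposed in the paper, and the two-dimensional descriptors below are toy values.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def image_to_class_distance(query_descriptors, class_pool):
    """Sum over query descriptors of the distance to the nearest descriptor in the class pool."""
    return sum(min(sq_dist(d, p) for p in class_pool) for d in query_descriptors)

def nbnn_classify(query_descriptors, class_descriptors):
    """class_descriptors maps each label to descriptors pooled from all its training images."""
    return min(
        class_descriptors,
        key=lambda label: image_to_class_distance(query_descriptors, class_descriptors[label]),
    )

# Toy example with hypothetical 2-D descriptors.
class_descriptors = {
    "A": [(0.0, 0.0), (1.0, 0.0)],
    "B": [(5.0, 5.0), (6.0, 5.0)],
}
query = [(0.2, 0.1), (0.9, 0.1)]
print(nbnn_classify(query, class_descriptors))
```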

Local Hypersphere Coding Based on Edges between Visual Words

Local feature coding has drawn much attention in recent years. Many excellent coding algorithms have been proposed to improve the bag-of-words model. This paper proposes a new local feature coding method called local hypersphere coding (LHC) which possesses two distinctive differences from traditional coding methods. Firstly, we describe local features by the edges between visual words. Secondly, the reconstruction center is moved from the origin to the nearest visual word, thus feature coding is performed on the hypersphere of feature space. We evaluate our coding method on several benchmark datasets for image classification. The experimental results of the proposed method outperform several state-of-the-art coding methods, indicating the effectiveness of our method.

Weiqiang Ren, Yongzhen Huang, Xin Zhao, Kaiqi Huang, Tieniu Tan

Spatially Local Coding for Object Recognition

The spatial pyramid and its variants have been among the most popular and successful models for object recognition. In these models, local visual features are coded across elements of a visual vocabulary, and then these codes are pooled into histograms at several spatial granularities. We introduce spatially local coding, an alternative way to include spatial information in the image model. Instead of only coding visual appearance and leaving the spatial coherence to be represented by the pooling stage, we include location as part of the coding step. This is a more flexible spatial representation as compared to the fixed grids used in the spatial pyramid models and we can use a simple, whole-image region during the pooling stage. We demonstrate that combining features with multiple levels of spatial locality performs better than using just a single level. Our model performs better than all previous single-feature methods when tested on the Caltech 101 and 256 object recognition datasets.

Sancho McCann, David G. Lowe

Semantic Segmentation with Millions of Features: Integrating Multiple Cues in a Combined Random Forest Approach

In this paper, we present a new combined approach for feature extraction, classification, and context modeling in an iterative framework based on random decision trees and a huge amount of features. A major focus of this paper is to integrate different kinds of feature types like color, geometric context, and auto context features in a joint, flexible and fast manner. Furthermore, we perform an in-depth analysis of multiple feature extraction methods and different feature types. Extensive experiments are performed on challenging facade recognition datasets, where we show that our approach significantly outperforms previous approaches with a performance gain of more than 15% on the most difficult dataset.

Björn Fröhlich, Erik Rodner, Joachim Denzler

Semi-Supervised Learning on a Budget: Scaling Up to Large Datasets

Internet data sources provide us with large image datasets which are mostly without any explicit labeling. This setting is ideal for semi-supervised learning, which seeks to exploit labeled data as well as a large pool of unlabeled data points to improve learning and classification. While we have made considerable progress on the theory and algorithms, we have seen limited success in translating such progress to the large-scale datasets which inspired these methods. We investigate the computational complexity of popular graph-based semi-supervised learning algorithms together with different possible speed-ups. Our findings lead to a new algorithm that scales up to 40 times larger datasets in comparison to previous approaches and even increases the classification performance. Our method is based on the key insight that, by employing a density-based measure, unlabeled data points can be selected in a manner similar to an active learning scheme. This leads to a compact graph, resulting in performance improved by up to 11.6% at reduced computational cost.

Sandra Ebert, Mario Fritz, Bernt Schiele

One-Class Multiple Instance Learning via Robust PCA for Common Object Discovery

Principal component analysis (PCA), as a key component in statistical learning, has been adopted in a wide variety of applications in computer vision and machine learning. From a different angle, weakly supervised learning, more specifically multiple instance learning (MIL), allows fine-grained information to be exploited from coarsely-grained label information. In this paper, we propose an algorithm using the robust PCA (RPCA) [1] in an iterative way to perform simultaneous common object discovery and model learning under a one-class multiple instance learning setting. We show the advantage of our method on common object discovery and model learning, which needs no fine/coarse alignment in the input data; in addition, it achieves results comparable with standard two-class MIL learning algorithms even though our method learns from one-class data only.

Xinggang Wang, Zhengdong Zhang, Yi Ma, Xiang Bai, Wenyu Liu, Zhuowen Tu

Online Semi-Supervised Discriminative Dictionary Learning for Sparse Representation

We present an online semi-supervised dictionary learning algorithm for classification tasks. Specifically, we integrate the reconstruction error of labeled and unlabeled data, the discriminative sparse-code error, and the classification error into an objective function for online dictionary learning, which enhances the dictionary’s representative and discriminative power. In addition, we propose a probabilistic model over the sparse codes of input signals, which allows us to expand the labeled set. As a consequence, the dictionary and the classifier learned from the enlarged labeled set yield lower generalization error on unseen data. Our approach learns a single dictionary and a predictive linear classifier jointly. Experimental results demonstrate the effectiveness of our approach in face and object category recognition applications.

Guangxiao Zhang, Zhuolin Jiang, Larry S. Davis

Efficient Discriminative Learning of Class Hierarchy for Many Class Prediction

Recently the maximum margin criterion has been employed to learn a discriminative class hierarchical model, which shows promising performance for rapid multi-class prediction. Specifically, at each node of this hierarchy, a separating hyperplane is learned to split its associated classes from all of the corresponding training data, leading to a time-consuming training process in computer vision applications with many classes such as large-scale object recognition and scene classification. To address this issue, in this paper we propose a new efficient discriminative class hierarchy learning approach for many class prediction. We first present a general objective function to unify the two state-of-the-art methods for multi-class tasks. When there are many classes, this objective function reveals that some classes are indeed redundant. Thus, omitting these redundant classes will not degrade the prediction performance of the learned class hierarchical model. Based on this observation, we decompose the original optimization problem into a sequence of much smaller sub-problems by developing an adaptive classifier updating method and an active class selection strategy. Specifically, we iteratively update the separating hyperplane by efficiently using the training samples only from a limited number of selected classes that are well separated by the current separating hyperplane. Comprehensive experiments on three large-scale datasets demonstrate that our approach can significantly accelerate the training process of the two state-of-the-art methods while achieving comparable prediction performance in terms of both classification accuracy and testing speed.

Lin Chen, Lixin Duan, Ivor W. Tsang, Dong Xu

Oral Session 2: Object Recognition I

Grouping Active Contour Fragments for Object Recognition

In this paper, we try to address the challenging problem of combining local shape features to describe long and continuous shape characteristics. To this end, we first propose a novel type of local shape feature, namely Active Contour Fragment (ACF), to encode the shape deformation in a local region. An ACF is automatically learnt from the contours of a specific object class and is capable of describing the intra-class shape characteristics based on the point distribution model. Secondly, we combine multiple ACFs into a group, namely Active Contour Group (ACG), to describe the long shape characteristics. We model the ACFs in an ACG using an undirected chain model and estimate the parameters of the chain model in a subspace for accelerating the learning and matching processes of ACGs. Finally, we discriminatively train the classifiers based on ACFs and ACGs in a boosting framework for localizing objects as well as delineating object boundaries. Both qualitative and quantitative evaluations show that our approach is capable of describing long shapes and the proposed recognition algorithm achieves promising performance on the public datasets.

Wei Zheng, Songlin Song, Hong Chang, Xilin Chen

Detecting Partially Occluded Objects with an Implicit Shape Model Random Field

In this paper, we introduce a formulation for the task of detecting objects based on the information gathered from a standard Implicit Shape Model (ISM). We describe a probabilistic approach in a general random field setting, which makes it possible to detect object instances effectively and additionally identifies all local patches contributing to the different instances. We propose a sparse graph structure and define a semantic label space, specifically tuned to the task of localizing objects. The design of the graph structure then allows us to define a novel inference process that efficiently returns a good local minimum of our energy minimization problem. A key benefit of our method is that we do not have to fix a range for local neighborhood suppression, as is necessary, for instance, in related non-maximum suppression approaches. Our inference process is implicitly capable of separating even strongly overlapping object instances. Experimental evaluation compares our method to the state-of-the-art in this field on challenging sequences, showing competitive and improved results.

Paul Wohlhart, Michael Donoser, Peter M. Roth, Horst Bischof
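The abstract contrasts its inference with non-maximum suppression, which needs a fixed suppression range. The sketch below shows a common greedy, overlap-based form of that baseline (not the authors' random field method) and how the outcome depends on the chosen threshold. Boxes, scores and thresholds are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much, repeat."""
    keep = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in keep):
            keep.append((box, score))
    return keep

# Two strongly overlapping detections (IoU = 81/119) and one isolated detection.
detections = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
print(len(nms(detections, iou_threshold=0.5)))
print(len(nms(detections, iou_threshold=0.7)))
```

With the stricter threshold the second box is suppressed; with the looser one it survives, so two truly overlapping objects are either merged or kept depending on a hand-set parameter.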

Relative Forest for Attribute Prediction

Human-nameable visual attributes are promising for leveraging various recognition tasks. Intuitively, the more accurate the attribute prediction is, the more the recognition tasks can benefit. Relative attributes [1] learn a ranking function per attribute, which can provide more accurate attribute prediction and thus shows clear advantages over previous binary attributes. In this paper, we inherit the idea of learning a ranking function per attribute but propose to improve the algorithm in two aspects. First, we propose a Relative Tree algorithm which facilitates more accurate nonlinear ranking to capture the semantic relationships. Second, we develop a Relative Forest algorithm which resorts to randomized learning to reduce the training time of Relative Tree. Benefiting from a multiple-tree ensemble, Relative Forest can achieve even more accurate final ranking. To show the effectiveness of the proposed method, we first compare the Relative Tree method with Relative Attribute on the PubFig and OSR datasets. Then, to verify the efficiency of the Relative Forest algorithm, we conduct an age estimation evaluation on the FG-NET dataset. With much less training time compared to Relative Attribute and Relative Tree, the proposed Relative Forest achieves state-of-the-art age estimation accuracy. Finally, experiments on the large-scale SUN Attribute database show the scalability of the proposed Relative Forest.


Shaoxin Li, Shiguang Shan, Xilin Chen

Discriminative Dictionary Learning with Pairwise Constraints

In computer vision problems such as pair matching, only binary information - ‘same’ or ‘different’ label for pairs of images - is given during training. This is in contrast to classification problems, where the category labels of training images are provided. We propose a unified discriminative dictionary learning approach for both pair matching and multiclass classification tasks. More specifically, we introduce a new discriminative term called ‘pairwise sparse code error’ for the discriminativeness in sparse representation of pairs of signals, and then combine it with the classification error for discriminativeness in classifier construction to form a unified objective function. The solution to the new objective function is achieved by employing the efficient feature-sign search algorithm. The learned dictionary encourages feature points from a similar pair (or the same class) to have similar sparse codes. We validate the effectiveness of our approach through a series of experiments on face verification and recognition problems.

Huimin Guo, Zhuolin Jiang, Larry S. Davis

Poster Session 2: Feature, Representation, and Recognition

Adaptive Unsupervised Multi-view Feature Selection for Visual Concept Recognition

To reveal and leverage the correlated and complementary information between different views, a great number of multi-view learning algorithms have been proposed in recent years. However, unsupervised feature selection in multi-view learning is still a challenge due to the lack of data labels that could be utilized to select the discriminative features. Moreover, most of the traditional feature selection methods are developed for single-view data, and are not directly applicable to multi-view data. Therefore, we propose an unsupervised learning method called Adaptive Unsupervised Multi-view Feature Selection (AUMFS) in this paper. AUMFS attempts to jointly utilize three kinds of vital information, i.e., data cluster structure, data similarity and the correlations between different views, contained in the original data together for feature selection. To achieve this goal, a robust sparse regression model with a norm penalty is introduced to predict data cluster labels, and at the same time, multiple view-dependent visual similarity graphs are constructed to flexibly model the visual similarity in each view. Then, AUMFS integrates data cluster label prediction and adaptive multi-view visual similarity graph learning into a unified framework. To solve the objective function of AUMFS, a simple yet efficient iterative method is proposed. We apply AUMFS to three visual concept recognition applications (i.e., social image concept recognition, object recognition and video-based human action recognition) on four benchmark datasets. Experimental results show the proposed method significantly outperforms several state-of-the-art feature selection methods. More importantly, our method is not very sensitive to the parameters and the optimization method converges very fast.

Yinfu Feng, Jun Xiao, Yueting Zhuang, Xiaoming Liu

Iris Recognition Using Consistent Corner Optical Flow

This paper proposes an efficient iris-based authentication system. Iris segmentation is done using an improved circular Hough transform and a robust integro-differential operator to detect the inner and outer iris boundaries, respectively. The segmented iris is normalized to polar coordinates and preprocessed using LGBP (Local Gradient Binary Pattern). Corner features are extracted and matched using the dissimilarity measure CIOF (Corners having Inconsistent Optical Flow). The proposed approach has been tested on the publicly available CASIA 4.0 Interval and Lamp databases, consisting of 2,639 and 16,212 images respectively. A segmentation accuracy of more than 99.6% is achieved on both databases. This paper also provides an error classification for wrong segmentations and determines the parameters that influence the errors. The proposed system achieves recognition rates of 99.75% and 99.87% with error rates of 0.108% and 1.29% on the Interval and Lamp databases respectively.

Aditya Nigam, Phalguni Gupta

Face Recognition in Videos – A Graph Based Modified Kernel Discriminant Analysis

Grassmannian manifolds have been an effective way to represent image sets (video), which are mapped as data points on the manifold. Recognition can then be performed by applying Discriminant Analysis (DA) on such manifolds. However, the local structure of the data points is not exploited in DA. This paper proposes a modified Kernel Discriminant Analysis (KDA) approach on Grassmannian manifolds that utilizes the local structure of the data points on the manifold. The KDA exploits the local structure using between-class and within-class adjacency graphs that represent the between-class and within-class similarities, respectively. The maximum within-class correlation and the minimum between-class correlation are utilized to define the connectivity between points in the graph, thus exploiting the geometrical structure of the data. The discriminability is further improved by an effective feature representation using LBP, which can discriminate data across illumination, pose, and minor expressions. Effective recognition is performed by using only the cluster representatives extracted by clustering the frames of a video sequence. Experiments on several video datasets (Honda, MoBo, ChokePoint, NRC-IIT, and MOBIO) show that the proposed approach obtains better recognition rates than state-of-the-art approaches.

Gayathri Mahalingam, Chandra Kambhamettu

Learning Hierarchical Bag of Words Using Naive Bayes Clustering

Image analysis tasks such as classification, clustering, detection, and retrieval are only as good as the feature representation of the images they use. Much research in computer vision is focused on finding semantically richer image representations. Bag of visual Words (BoW) is a representation that has emerged as an effective one for a variety of computer vision tasks. BoW methods traditionally use low level features. We have devised a strategy to use these low level features to create “higher level” features by making use of the spatial context in images. In this paper, we propose a novel hierarchical feature learning framework that uses a Naive Bayes Clustering algorithm to convert a 2-D symbolic image at one level to a 2-D symbolic image at the next level with richer features. On two popular datasets, Pascal VOC 2007 and Caltech 101, we empirically show that classification accuracy obtained from the hierarchical features computed using our approach is significantly higher than the traditional SIFT based BoW representation of images even though our image representations are more compact.

Siddhartha Chandra, Shailesh Kumar, C. V. Jawahar

Efficient Human Parsing Based on Sketch Representation

In this paper, we present an efficient human parsing method which estimates human body poses from 2D images. First, we propose an edge sketch representation, which enhances information critical for pose estimation and prunes redundant information. The sketch representation is generated by applying two sets of filters to extracted edges. Based on the sketch representation, body part candidates can be located easily using parallel line detection in Hough space. Then we use specifically trained linear SVM classifiers to detect each body part candidate based on parallel line features. A dynamic programming algorithm is applied to calculate the MAP estimate based on the standard pictorial structure model, which uses a kinematic tree to describe human pose. To evaluate the representational ability of the proposed sketch representation, as well as the accuracy and efficiency of our entire human pose estimation method, we run two sets of experiments on a sports image dataset. Experimental results demonstrate that the human body parts in the images can be well described by our proposed sketch representation. Furthermore, our human pose estimation method is efficient and achieves accuracy comparable to the state of the art.

Meng Wang, Zhaoxiang Zhang, Yunhong Wang

Exclusive Visual Descriptor Quantization

Vector quantization (VQ) using exhaustive nearest neighbor (NN) search is the speed bottleneck in classic bag of visual words (BOV) models. Approximate NN (ANN) search methods are still time-consuming in VQ, since they check multiple regions in the search space to reduce VQ errors. In this paper, we propose ExVQ, an exclusive NN search method to speed up BOV models. Given a visual descriptor, a portion of the search regions is excluded from the whole search space by a linear projection. We ensure that minimal VQ errors are introduced in the exclusion by learning an accurate classifier. Multiple exclusions are organized in a tree structure in ExVQ, whose VQ speed and VQ error rate can be reliably estimated. We show that ExVQ is much faster than state-of-the-art ANN methods in BOV models while maintaining almost the same classification accuracy. In addition, we empirically show that even with a VQ error rate as high as 30%, the classification accuracy of some ANN methods, including ExVQ, is similar to that of exhaustive search (which has zero VQ error). In some cases, ExVQ has even higher classification accuracy than exhaustive search.

Yu Zhang, Jianxin Wu, Weiyao Lin
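To make the baseline in the abstract above concrete, the following is a minimal sketch of vector quantization by exhaustive nearest-neighbor search, the step described there as the speed bottleneck. It is not ExVQ itself. The function names `quantize` and `bov_histogram`, the toy codebook, and the toy descriptors are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def quantize(descriptor, codebook):
    """Return the index of the codeword nearest to the descriptor.

    Exhaustive search: every codeword is compared, so the cost grows
    linearly with the codebook size.
    """
    return min(range(len(codebook)), key=lambda i: dist(descriptor, codebook[i]))


def bov_histogram(descriptors, codebook):
    """Count how many descriptors fall on each codeword (bag of visual words)."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[quantize(descriptor, codebook)] += 1
    return histogram


# Hypothetical 2-D codebook and descriptors, for illustration only.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.1), (0.9, 0.2), (0.8, -0.1), (0.2, 1.1)]
histogram = bov_histogram(descriptors, codebook)
```

Each call to `quantize` touches the whole codebook, which is why approximate or exclusion-based searches are attractive for large vocabularies.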

Underwater Live Fish Recognition Using a Balance-Guaranteed Optimized Tree

Live fish recognition in the open sea is a challenging multi-class classification task. We propose a novel method to recognize fish in an unrestricted natural environment recorded by underwater cameras. This method extracts 66 types of features, which are a combination of color, shape and texture properties from different parts of the fish, and reduces the feature dimensions with a forward sequential feature selection (FSFS) procedure. The features selected by FSFS are used by an SVM. We present a Balance-Guaranteed Optimized Tree (BGOT) to control the error accumulation in hierarchical classification and, therefore, achieve better performance. A BGOT of 10 fish species is automatically constructed using the inter-class similarities and a heuristic method. The proposed BGOT-based hierarchical classification method achieves about 4% better accuracy compared to state-of-the-art techniques on a live fish image dataset.

Phoenix X. Huang, Bastiaan J. Boom, Robert B. Fisher
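The forward sequential feature selection mentioned in the abstract above can be sketched as a greedy loop. This is a generic illustration under stated assumptions, not the authors' procedure. The names `forward_selection` and `toy_score`, the feature names, and the integer utilities are hypothetical. In practice the score would typically be a classifier's validation accuracy, which is also an assumption here.

```python
def forward_selection(features, score, max_features):
    """Greedily add the feature that most improves score(subset).

    Stops when max_features is reached or no candidate improves the score.
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        top = max(candidates, key=lambda f: score(selected + [f]))
        top_score = score(selected + [top])
        if top_score <= best_score:
            break  # no remaining feature helps
        selected.append(top)
        best_score = top_score
    return selected, best_score


# Hypothetical utilities; "color" and "texture" are treated as redundant.
UTILITY = {"color": 5, "shape": 4, "texture": 3, "noise": -1}


def toy_score(subset):
    """Toy stand-in for a validation score (integers avoid rounding issues)."""
    total = sum(UTILITY[f] for f in subset)
    if "color" in subset and "texture" in subset:
        total -= 3  # redundancy penalty
    return total


selected, best = forward_selection(list(UTILITY), toy_score, max_features=3)
```

On this toy score the loop keeps `color` and `shape` and then stops, because adding `texture` or `noise` does not raise the score.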

Local 3D Symmetry for Visual Saliency in 2.5D Point Clouds

Many models of visual attention have been proposed in the past, and proved to be very useful, e.g. in robotic applications. Recently it has been shown in the literature that not only single visual features, such as color, orientation, curvature, etc., attract attention, but complete objects do. Symmetry is a feature of many man-made and also natural objects and has thus been identified as a candidate for attentional operators. However, not many techniques exist to date that exploit symmetry-based saliency. So far these techniques work mainly on 2D data. Furthermore, methods, which work on 3D data, assume complete object models. This limits their use as bottom-up attentional operators working on RGBD images, which only provide partial views of objects. In this paper, we present a novel local symmetry-based operator that works on 3D data and does not assume any object model. The estimation of symmetry saliency maps is done on different scales to detect objects of various sizes. For evaluation a Winner-Take-All neural network is used to calculate attention points. We evaluate the proposed approach on two datasets and compare to state-of-the-art methods. Experimental results show that the proposed algorithm outperforms current state-of-the-art in terms of quality of fixation points.

Ekaterina Potapova, Michael Zillich, Markus Vincze

Exploiting Features – Locally Interleaved Sequential Alignment for Object Detection

We exploit image features multiple times in order to make the sequential decision process faster and better performing. In the decision process, features providing knowledge about the presence or absence of an object in a given detection window are successively evaluated. We show that these features also provide information about the object position within the evaluated window. The classification process is sequentially interleaved with estimating the correct position. The position estimate is used for steering the features yet to be evaluated. This locally interleaved sequential alignment (LISA) allows running an object detector on a sparser grid, which speeds up the process. The position alignment is jointly learned with the detector. We achieve a better detection rate since the method allows for training the detector on perfectly aligned image samples. For estimation of the alignment we propose a learnable regressor that approximates a non-linear regression function and runs in negligible time.

Karel Zimmermann, David Hurych, Tomáš Svoboda

Efficient and Scalable 4th-Order Match Propagation

We propose a robust method to match image feature points taking into account geometric consistency. It is a careful adaptation of the match propagation principle to 4th-order geometric constraints (match quadruple consistency). With our method, a set of matches is explained by a network of locally-similar affinities. This approach is useful when simple descriptor-based matching strategies fail, in particular for highly ambiguous data, e.g., with repetitive patterns or where texture is lacking. As it scales easily to hundreds of thousands of matches, it is also useful when denser point distributions are sought, e.g., for high-precision rigid model estimation. Experiments show that our method is competitive (efficient, scalable, accurate, robust) against state-of-the-art methods in deformable object matching, camera calibration and pattern detection.

David Ok, Renaud Marlet, Jean-Yves Audibert

Hierarchical Object Representations for Visual Recognition via Weakly Supervised Learning

In this paper, we propose a weakly supervised approach to learn hierarchical object representations for visual recognition. The learning process is carried out in a bottom-up manner to discover latent visual patterns in multiple scales. To relieve the disturbance of complex backgrounds in natural images, bounding boxes of foreground objects are adopted as weak knowledge in the learning stage to promote those visual patterns which are more related to the target objects. The difference between the patterns of foreground objects and backgrounds is relatively vague at low-levels, but becomes more distinct along with the feature transformations to high-levels. In the test stage, an input image is verified against the learnt patterns level-by-level, and the responses at each level construct a hierarchy of representations which indicates the occurring possibilities of the target object at various scales. Experiments on two PASCAL datasets showed encouraging results for visual recognition.

Tianzhu Zhang, Rui Cai, Zhiwei Li, Lei Zhang, Hanqing Lu

Invariant Surface-Based Shape Descriptor for Dynamic Surface Encoding

This paper presents a novel approach to represent spatio-temporal visual information. We introduce a surface-based shape model whose structure is invariant to surface variations over time to describe 3D dynamic surfaces (e.g., obtained from multiview video capture). The descriptor is defined as a graph lying on object surfaces and anchored to invariant local features (e.g., extremal points). Geodesic-consistency-based priors are used as cues within a probabilistic framework to maintain the graph invariant, even though the surfaces undergo non-rigid deformations. Our contribution brings to 3D geometric data a temporally invariant structure that relies only on intrinsic surface properties, and is independent of surface parameterization (i.e., surface mesh connectivity). The proposed descriptor can therefore be used for efficient dynamic surface encoding, through transformation into 2D (geometry) images, as its structure can provide an invariant representation for 3D mesh models. Various experiments on challenging publicly available datasets are performed to assess invariant property and performance of the descriptor.

Tony Tung, Takashi Matsuyama

Linear Discriminant Analysis with Maximum Correntropy Criterion

Linear Discriminant Analysis (LDA) is a well-known supervised feature extraction method for subspace learning in computer vision and pattern recognition. In this paper, a novel LDA method based on a new Maximum Correntropy Criterion optimization technique is proposed. Conventional LDA, which is based on the L2-norm, is sensitive to the presence of outliers. The proposed method has several advantages. First, it is robust to large outliers. Second, it is invariant to rotations. Third, it can be effectively solved by a half-quadratic optimization algorithm: in each iteration step, the complex optimization problem is reduced to a quadratic problem that can be efficiently solved by a weighted eigenvalue optimization method. The proposed method is capable of handling non-Gaussian noise and substantially reduces the influence of large outliers, resulting in robust classification. Performance assessment on several datasets shows that the proposed approach addresses outliers more effectively than traditional ones.

Wei Zhou, Sei-ichiro Kamata
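The robustness claim in the abstract above can be illustrated with the standard sample estimator of correntropy using a Gaussian kernel, compared against the mean squared error. This is a sketch of the similarity measure only, not of the proposed LDA method. The kernel's normalizing constant is omitted, and the bandwidth `sigma` and the toy vectors are assumptions.

```python
from math import exp


def correntropy(x, y, sigma=1.0):
    """Sample correntropy with an unnormalized Gaussian kernel.

    Each term lies in (0, 1], so a single large error can lower the
    average by at most 1/len(x).
    """
    return sum(exp(-((a - b) ** 2) / (2 * sigma ** 2)) for a, b in zip(x, y)) / len(x)


def mean_squared_error(x, y):
    """L2-based error: one large outlier can dominate the average."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical data: the last entry of `corrupted` is a gross outlier.
clean = [1.0, 2.0, 3.0, 4.0]
corrupted = [1.0, 2.0, 3.0, 104.0]

similarity = correntropy(clean, corrupted)   # bounded effect of the outlier
error = mean_squared_error(clean, corrupted) # dominated by the outlier
```

The outlier moves the mean squared error from 0 to 2500, while the correntropy only drops from 1 to about 0.75, which is the behavior a correntropy-based criterion relies on.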

AfNet: The Affordance Network

There has been a growing need to build an object recognition system that can successfully characterize object constancy, irrespective of lighting, shading, occlusions and viewpoint variations, and most importantly, deal with the multitude of shapes, colors and sizes in which objects are found. Affordances, on the other hand, provide symbolic grounding mechanisms that enable linking features obtained from visual perception with the functionality of the objects, which provides the most consistent and holistic characterization of an object. Recognition by Component Affordances (RBCA) is a recent theory that builds affordance features for recognition. As an extension of the psychophysical theory of Recognition by Components (RBC) to generic visual perception, RBCA is well suited for cognitive visual processing systems which are required to perform implicit cognitive tasks. A common task is to substitute a cup for a mug, bottle, jug, pitcher, pilsner, beaker, chalice, goblet or any other unlabeled object, but with a physical part affording the ability to hold liquid and a part affording grasping by a human hand, given the goal of “finding an empty cup” when no cups are available in the work environment of interest. In this paper, we present affordance features for recognition of objects. Using a set of 25 structural and 10 material affordances, we define a database of over 250 common household objects. This database, called the Affordance Network or AfNet, is available as a community development framework and is well suited for deployment on domestic robots. Sample object recognition results using AfNet and the associated inference engine that grounds the affordances through visual perception features demonstrate the effectiveness of the approach.

Karthik Mahesh Varadarajan, Markus Vincze

A Directed Graphical Model for Linear Barcode Scanning from Blurred Images

Image blur is one of the major issues deteriorating the capability of a linear barcode scanning system. In this work, linear barcode scanning is treated from the perspective of stochastic modeling and inference. A directed graphical model is proposed to characterize the relationship between the barcode value and its out-of-focus waveforms, based on which a highly effective inference process can be implemented, allowing barcodes to be decoded in real time on mobile devices, directly from blurred images. The value of the proposed model is its potential to enlarge the operating range of current linear barcode scanning systems with no need for dedicated hardware components, making linear barcode scanning at close-up distance with fixed-focus lenses a reality.

Ling Chen

A Probabilistic 3D Model Retrieval System Using Sphere Image

View-based 3D model retrieval systems represent a 3D model using its projected views, and retrieve 3D models by comparing the projected views. Most of the existing view-based 3D model retrieval systems only analyze the features of the projected views, while the spatial arrangements of the viewpoints are not well considered. In this paper, we propose a new 3D model descriptor called sphere image, which is defined as a sphere with a large number of viewpoints distributed on it. Each viewpoint is regarded as a “pixel”, associated with a projected view. The feature of the projected view is quantized into a vector, regarded as the “intensity”. We also propose a probabilistic graphical model for 3D model matching, and develop a 3D model retrieval system to test our approach. The proposed approach was evaluated on the Princeton shape benchmark. Experimental results indicate that our approach outperforms most of the existing 3D model retrieval systems in terms of retrieval precision and computation cost.

Ke Ding, Yunhui Liu

Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes

We propose a framework for automatic modeling, detection, and tracking of 3D objects with a Kinect. The detection part is mainly based on the recent template-based LINEMOD approach [1] for object detection. We show how to build the templates automatically from 3D models, and how to estimate the 6 degrees-of-freedom pose accurately and in real-time. The pose estimation and the color information allow us to check the detection hypotheses and improve the correct detection rate by 13% with respect to the original LINEMOD. These improvements make our framework suitable for object manipulation in robotics applications. Moreover, we propose a new dataset made of 15 registered, 1100+ frame video sequences of 15 various objects for the evaluation of future competing methods.

Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, Nassir Navab

Boosting with Side Information

In many problems of machine learning and computer vision, there exists side information, i.e., information contained in the training data and not available in the testing phase. This motivates the recent development of a new learning approach known as learning with side information that aims to incorporate side information for improved learning algorithms. In this work, we describe a new training method of boosting classifiers that uses side information, which we term AdaBoost+. In particular, AdaBoost+ employs a novel classification label imputation method to construct extra weak classifiers from the available information that simulate the performance of better weak classifiers obtained from the features in side information. We apply our method to two problems, namely handwritten digit recognition and facial expression recognition from low resolution images, where it demonstrates its effectiveness in classification performance.

Jixu Chen, Xiaoming Liu, Siwei Lyu

Generalized Mutual Subspace Based Methods for Image Set Classification

Subspace-based methods are effectively applied to classify sets of feature vectors by modeling them as subspaces. It is, however, difficult to appropriately determine the subspace dimensionality in advance for better performance. To alleviate this issue, we present a generalized mutual subspace method by introducing soft weighting across the basis vectors of the subspace. The bases are effectively combined via the soft weights to measure the subspace similarities (angles) without fixing the subspace dimensionality. By using the soft weighting, we consequently propose a novel mutual subspace-based method to construct a discriminative space which renders more discriminative subspace similarities. In the experiments on 3D object recognition using image sets, the proposed methods exhibit stably favorable performance compared to the other subspace-based methods.

Takumi Kobayashi

Oral Session 3: Segmentation and Grouping

Simultaneous Monocular 2D Segmentation, 3D Pose Recovery and 3D Reconstruction

We propose a novel framework for joint 2D segmentation and 3D pose and 3D shape recovery, for images coming from a single monocular source. In the past, integration of all three has proven difficult, largely because of the high degree of ambiguity in the 2D - 3D mapping. Our solution is to learn nonlinear and probabilistic low dimensional latent spaces, using the Gaussian Process Latent Variable Models dimensionality reduction technique. These act as class or activity constraints to a simultaneous and variational segmentation – recovery – reconstruction process. We define an image and level set based energy function, which we minimise with respect to 3D pose and shape, 2D segmentation resulting automatically as the projection of the recovered shape under the recovered pose. We represent 3D shapes as zero levels of 3D level set embedding functions, which we project down directly to probabilistic 2D occupancy maps, without the requirement of an intermediary explicit contour stage. Finally, we detail a fast, open-source, GPU-based implementation of our algorithm, which we use to produce results on both real and artificial video sequences.

Victor Adrian Prisacariu, Aleksandr V. Segal, Ian Reid

Joint Kernel Learning for Supervised Image Segmentation

This paper considers a supervised image segmentation algorithm based on joint-kernelized structured prediction. In the proposed algorithm, correlation clustering over a superpixel graph is conducted using a non-linear discriminant function, where the parameters are learned by a kernelized-structured support vector machine (SSVM). For an input superpixel image, correlation clustering is used to predict the superpixel-graph edge labels that determine whether adjacent superpixel pairs should be merged or not. In previous works, the discriminant functions for structured prediction were generally chosen to be linear with the model parameter and joint feature map. However, the linear model has two limitations: complex correlations between two input-output pairs are ignored, and the joint feature map should be explicitly designed. To cope with these limitations, a nonlinear discriminant function based on a joint kernel, which eliminates the need for explicit design of the joint feature map, is considered. The proposed joint kernel is defined as a combination of an image similarity kernel and an edge-label similarity kernel, which measure the resemblance of two input images and the similarity between two edge-label pairs, respectively. Each kernel function is designed for fast computation and efficient inference. The proposed algorithm is evaluated using two segmentation benchmark datasets: the Berkeley segmentation dataset (BSDS) and Microsoft Research Cambridge dataset (MSRC). It is observed that the joint feature map implicitly embedded in the proposed joint kernel performs comparably or even better than the explicitly designed joint feature map for a linear model.

Jongmin Kim, Youngjoo Seo, Sanghyuk Park, Sungrack Yun, Chang D. Yoo

Application of Heterogenous Motion Models towards Structure Recovery from Motion

Non-rigid structure estimates are often performed under the assumption that a linear combination of a few rigid basis shapes can describe the deformation. However, the quality of reconstruction suffers as the number of basis shapes increases. When a natural video (which may contain rigid, articulated and non-rigid objects together with camera motion) is to be processed, the complexity of motion precludes the use of rigid SFM methods. We propose that this problem may be approached using the notions of heterogeneity, articulation and stationarity. In this paper, we present a scheme for structure recovery based on motion classification and automatic selection of reconstruction algorithms for each scene object. Rigid, low-rank non-rigid and articulated structures are reconstructed separately. Using sub-sequence stationarity graphs, these are stitched together to form a coherent structure. We tested our method on data from human motion capture for objective analysis and provide results on natural videos.

Rohith M.V., Chandra Kambhamettu

Poster Session 3: Segmentation, Grouping, and Classification

Locality-Constrained Active Appearance Model

Although the conventional Active Appearance Model (AAM) has achieved some success for face alignment, it still suffers from the generalization problem when applied to unseen subjects and images. In this paper, a novel Locality-Constrained AAM (LC-AAM) algorithm is proposed to tackle the generalization problem of AAM. Theoretically, the proposed LC-AAM is a fast approximation for a sparsity-regularized AAM problem, where sparse representation is exploited for non-linear face modeling. Specifically, for an input image, its k-nearest neighbors are selected as the shape and appearance bases, which are adaptively fitted to the input image by solving a constrained AAM-like fitting problem. Essentially, the effectiveness of our LC-AAM algorithm comes from learning a strong localized shape and appearance prior for the input facial image through exploiting its k-similar patterns. To validate the effectiveness of our algorithm, comprehensive experiments are conducted on two publicly available face databases. Experimental results demonstrate that our method greatly outperforms the original AAM method and its variants. In addition, our method is better than the state-of-the-art face alignment methods and generalizes well to unseen subjects and images.

Xiaowei Zhao, Shiguang Shan, Xiujuan Chai, Xilin Chen

Modeling Hidden Topics with Dual Local Consistency for Image Analysis

Image representation is the crucial component in image analysis and understanding. However, the widely used low-level features cannot correctly represent the high-level semantic content of images in many situations due to the “semantic gap”. In order to bridge the “semantic gap”, in this brief, we present a novel topic model, which can learn an effective and robust mid-level representation in the latent semantic space for image analysis. In our model, the ℓ1-graph is constructed to model the local image neighborhood structure and the word co-occurrence is computed to capture the local word consistency. Then, the local information is incorporated into the model for topic discovery. Finally, the generalized EM algorithm is used to estimate the parameters. As our model considers both the local image structure and local word consistency simultaneously when estimating the probabilistic topic distributions, the image representations can have more powerful description ability in the learned latent semantic space. Extensive experiments on the publicly available databases demonstrate the effectiveness of our approach.

Peng Li, Jian Cheng, Hanqing Lu

Design of Non-Linear Discriminative Dictionaries for Image Classification

In recent years there has been growing interest in designing dictionaries for image classification. These methods, however, neglect the fact that data of interest often has non-linear structure. Motivated by the fact that this non-linearity can be handled by the kernel trick, we propose learning of dictionaries in the high-dimensional feature space which are simultaneously reconstructive and discriminative. The proposed optimization approach consists of two main stages- coefficient update and dictionary update. We propose a kernel driven simultaneous orthogonal matching pursuit algorithm for the task of sparse coding in the feature space. The dictionary update step is performed using an approximate but efficient KSVD algorithm in feature space. Extensive experiments on image classification demonstrate that the proposed non-linear dictionary learning method is robust and can perform significantly better than many competitive discriminative dictionary learning algorithms.

Ashish Shrivastava, Hien V. Nguyen, Vishal M. Patel, Rama Chellappa

Efficient Background Subtraction under Abrupt Illumination Variations

Background subtraction techniques require high segmentation quality and low computational cost. Achieving high accuracy is difficult under abrupt illumination changes. We develop a new background subtraction method in an expectation maximization (EM) framework. We describe foreground colors and illumination ratios using a few Gaussian mixture models. EM convergence is dependent on its initialization. We propose a novel initialization method that considers reflectance and illumination implicitly. Scene points occluded by a foreground object tend to have prominent illumination ratios since both the reflectance and illumination are different. We introduce a topological approach based on Morse theory to pre-classify pixels into foreground and background. Moreover, we only decompose the probability distributions in the initial step in our EM. Later iterations do not consider the probability distribution decomposition anymore. The experimental results demonstrate that our EM formulation provides high accuracy under abrupt variations in illumination. Additionally, in comparison with one of the state-of-the-art methods based on EM, our approach converges in fewer iterations, yielding computational savings.

Junqiu Wang, Yasushi Yagi

Naive Bayes Image Classification: Beyond Nearest Neighbors

Naive Bayes Nearest Neighbor (NBNN) has been proposed as a powerful, learning-free, non-parametric approach for object classification. Its good performance is mainly due to the avoidance of a vector quantization step, and the use of image-to-class comparisons, yielding good generalization. In this paper we study the replacement of the nearest neighbor part with more elaborate and robust (sparse) representations, as well as trading performance for speed for practical purposes. The representations investigated are k-Nearest Neighbors (kNN), Iterative Nearest Neighbors (INN) solving a constrained least squares (LS) problem, Local Linear Embedding (LLE), a Sparse Representation obtained by ℓ1-regularized LS (SR), and a Collaborative Representation obtained as the solution of an ℓ2-regularized LS problem (CR). In particular, NIMBLE and K-DES descriptors proved viable alternatives to SIFT, and two of the resulting Naive Bayes classifiers provide significant improvements over NBNN, obtaining competitive results on Scene-15, Caltech-101, and PASCAL VOC 2007 datasets, while remaining learning-free approaches (i.e., no parameters need to be learned).

Radu Timofte, Tinne Tuytelaars, Luc Van Gool
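The NBNN baseline that the abstract above starts from can be sketched as an image-to-class decision rule: for each class, sum the squared distance from every query descriptor to its nearest descriptor in that class, then pick the class with the smallest total. This shows only the plain nearest-neighbor baseline, not the kNN, INN, LLE, SR or CR variants studied in the paper. The function name `nbnn_classify`, the class labels, and the toy descriptors are assumptions.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def nbnn_classify(query_descriptors, class_descriptors):
    """Classify an image from its local descriptors, without quantization.

    class_descriptors maps each label to the pooled descriptors of all
    training images of that class (image-to-class comparison).
    """
    totals = {}
    for label, pool in class_descriptors.items():
        totals[label] = sum(
            min(dist(q, d) ** 2 for d in pool) for q in query_descriptors
        )
    best_label = min(totals, key=totals.get)
    return best_label, totals


# Hypothetical 2-D descriptors for two classes, for illustration only.
class_descriptors = {
    "cat": [(0.0, 0.0), (0.0, 1.0)],
    "dog": [(5.0, 5.0), (6.0, 5.0)],
}
query = [(0.1, 0.2), (0.2, 0.9)]
label, totals = nbnn_classify(query, class_descriptors)
```

No parameters are fitted anywhere in this rule, which is what the abstract means by a learning-free approach.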

Contextual Pooling in Image Classification

The original bag-of-words (BoW) model for image classification treats each local feature independently, and thus ignores the spatial relationships between a feature and its neighboring features, namely, the feature’s context. However, our intuition and empirical studies indicate the importance of such spatial information. Although global spatial information can be captured with the spatial pyramid matching scheme, capturing local spatial relationships between features is still an open problem. In this paper, we propose a new method to embed such local spatial (context) information into the BoW model. A vector reflecting context information is first extracted along with each feature, context patterns are then trained specifically for each code, and the context information is embedded into the BoW model by contextual pooling according to the different context patterns. Extensive experiments on the PASCAL VOC 2007 dataset show that our method greatly enhances the BoW model and achieves state-of-the-art performance.

Zifeng Wu, Yongzhen Huang, Liang Wang, Tieniu Tan

Spatial Graph for Image Classification

Spatial information in images is considered to be of great importance in the process of object recognition. Recent studies show that human classification accuracy might drop dramatically if the spatial information of an image is removed. The original bag-of-words (BoW) model is actually a system simulating such a classification process with incomplete information. To handle the spatial information, spatial pyramid matching (SPM) was proposed, which has become the most widely used scheme for spatial modeling. Given an image, SPM divides it into a series of spatial blocks on several levels and concatenates the representations obtained separately within all the blocks. SPM greatly improves the performance since it embeds spatial information into BoW. However, SPM ignores the relationships between the spatial blocks. To address this problem, we propose a new scheme based on a spatial graph, whose nodes correspond to the spatial blocks in SPM, and whose edges correspond to the relationships between the blocks. Thorough experiments on several popular datasets verify the advantages of the proposed scheme.

Zifeng Wu, Yongzhen Huang, Liang Wang, Tieniu Tan
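
The SPM step described in the abstract above (divide the image into blocks on several levels, then concatenate the per-block representations) can be sketched as follows. This is a minimal illustration under stated assumptions: features are given as hypothetical `(x, y, word)` triples with coordinates normalized to [0, 1), each block is represented by a plain codeword histogram, and the per-level weights used in the usual SPM kernel are omitted. The spatial graph proposed in the paper is not modeled here.

```python
def spatial_pyramid(features, vocab_size, levels=2):
    """Concatenate per-block codeword histograms over a spatial pyramid.

    features: list of (x, y, word) with x, y in [0, 1) and
        word in range(vocab_size) (hypothetical input layout).
    levels: the pyramid uses grids of 1x1, 2x2, ..., 2^levels x 2^levels.
    """
    pyramid = []
    for level in range(levels + 1):
        cells = 2 ** level
        hists = [[0] * vocab_size for _ in range(cells * cells)]
        for x, y, word in features:
            # Clamp so that a coordinate of exactly 1.0 stays in the last cell.
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hists[row * cells + col][word] += 1
        for hist in hists:
            pyramid.extend(hist)
    return pyramid


if __name__ == "__main__":
    feats = [(0.1, 0.1, 0), (0.9, 0.9, 1)]
    print(spatial_pyramid(feats, vocab_size=2, levels=1))
```

Because each block is pooled independently, nothing in this vector encodes how neighboring blocks relate to one another, which is the limitation the abstract points out.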

Knowledge Leverage from Contours to Bounding Boxes: A Concise Approach to Annotation

In the class-based image segmentation problem, one of the major concerns is to provide large training data for learning complex graphical models. To alleviate the labeling effort, a concise annotation approach working on bounding boxes is introduced. The main idea is to leverage the knowledge learned from a few object contours for the inference of unknown contours in bounding boxes. To this end, we incorporate the bounding box prior into the concept of multiple image segmentations to generate a set of distinctive tight segments, with the condition that at least one tight segment approaches the true object contour. A good tight segment is then selected via semi-supervised regression, which bears the augmented knowledge transferred from object contours to bounding boxes. The experimental results on the challenging Pascal VOC dataset corroborate that our new annotation method can potentially replace the manual annotations.

Jie-Zhi Cheng, Feng-Ju Chang, Kuang-Jui Hsu, Yen-Yu Lin

Efficient Pixel-Grouping Based on Dempster’s Theory of Evidence for Image Segmentation

In this paper we propose an algorithm for image segmentation using graph cuts which can be used to efficiently solve labeling problems on high-resolution images or image sequences. The basic idea of our method is to group large homogeneous regions into a single variable. To this end, we combine the appearance and the task-specific similarity with Dempster’s theory of evidence to compute the basic belief that two pixels/groups will have the same label in the minimum energy state. Experiments on image and video segmentation show that our grouping leads to a significant speedup and memory reduction of the labeling problem. Thus large-scale labeling problems can be solved in an efficient manner with a low approximation loss.

Björn Scheuermann, Markus Schlosser, Bodo Rosenhahn
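
To make the evidence-combination idea in the abstract above concrete, here is a minimal sketch of Dempster's rule of combination on a two-hypothesis frame: two pixels either take the "same" label or a "diff" label, and "either" carries the mass assigned to ignorance. The key names, the toy mass values, and the interpretation of the two sources as an appearance cue and a task-specific cue are assumptions for illustration; the abstract does not specify how the paper constructs its mass functions.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions over {"same", "diff", "either"}.

    "either" is the whole frame (ignorance). Each input must sum to 1.
    """
    # Conflict: one source supports "same" while the other supports "diff".
    conflict = m1["same"] * m2["diff"] + m1["diff"] * m2["same"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    combined = {}
    for h in ("same", "diff"):
        # Products of focal sets whose intersection is exactly {h}.
        combined[h] = (m1[h] * m2[h] + m1[h] * m2["either"]
                       + m1["either"] * m2[h]) / norm
    combined["either"] = m1["either"] * m2["either"] / norm
    return combined


if __name__ == "__main__":
    appearance = {"same": 0.6, "diff": 0.1, "either": 0.3}  # toy values
    task_cue = {"same": 0.5, "diff": 0.2, "either": 0.3}    # toy values
    print(dempster_combine(appearance, task_cue))
```

When both sources lean towards "same", the combined belief in "same" exceeds either input, which is the kind of signal a grouping step could threshold before merging pixels into one variable.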

Video Segmentation with Superpixels

Due to its importance, video segmentation has regained interest recently. However, there is no common agreement about the necessary ingredients for best performance. This work contributes a thorough analysis of various within- and between-frame affinities suitable for video segmentation. Our results show that a frame-based superpixel segmentation combined with a few motion- and appearance-based affinities is sufficient to obtain good video segmentation performance. A second contribution of the paper is the extension of [1] to include motion cues, which makes the algorithm globally aware of motion, thus improving its performance for video sequences. Finally, we contribute an extension of an established image segmentation benchmark [1] to videos, allowing coarse-to-fine video segmentations and multiple human annotations. Our results are tested on BMDS [2], and compared to existing methods.

Fabio Galasso, Roberto Cipolla, Bernt Schiele

A Noise Tolerant Watershed Transformation with Viscous Force for Seeded Image Segmentation

The watershed transform was proposed as a novel method for image segmentation over 30 years ago. Today it is still used as an elementary step in many powerful segmentation procedures. The watershed transform constitutes one of the main concepts of mathematical morphology as an important region-based image segmentation approach. However, the original watershed transform is highly sensitive to noise and is incapable of detecting objects with broken edges. Consequently, its adoption in domains where imaging is subject to high noise is limited. By incorporating a high-order energy term into the original watershed transform, we propose the viscous force watershed transform, which is more immune to noise and able to detect objects with broken edges.

Di Yang, Stephen Gould, Marcus Hutter

Active Learning for Interactive Segmentation with Expected Confidence Change

Using human prior information to perform interactive segmentation plays a significant role in figure/ground segmentation. In this paper, we propose an active-learning-based approach that guides the user to interact on crucial regions and quickly achieves accurate segmentation results. To select the crucial regions from unlabeled candidates, we propose a new criterion, i.e. selecting the ones which maximize the expected confidence change (ECC) over all unlabeled regions. Given an image represented by oversegmented regions, our active-learning-based approach iterates the following three steps: 1) selecting crucial unlabeled regions with maximal ECC; 2) refining the selected regions; 3) updating appearance models based on the refined regions and performing image segmentation. Specifically, a constrained random walks algorithm is employed for segmentation, since it can efficiently produce the confidence needed for computing ECC during active learning. Compared to conventional interactive segmentation methods, the experimental results demonstrate that our method can largely reduce the interaction effort while maintaining high figure/ground segmentation accuracy.

Dan Wang, Canxiang Yan, Shiguang Shan, Xilin Chen

Cross Anisotropic Cost Volume Filtering for Segmentation

We study an advanced method for supervised multi-label image segmentation. To this end, we adopt a classic framework which recently has been revitalised by Rhemann et al. (2011). Instead of the usual global energy minimisation step, it relies on a mere evaluation of a cost function for every solution label, which is followed by a spatial smoothing step of these costs. While Rhemann et al. concentrate on efficiency, the goal of this paper is to equip the general framework with sophisticated subcomponents in order to develop a high-quality method for multi-label image segmentation: First, we present a substantially improved cost computation scheme which incorporates texture descriptors, as well as an automatic feature selection strategy. This leads to a high-dimensional feature space, from which we extract the label costs using a support vector machine. Second, we present a novel anisotropic diffusion scheme for the filtering step. In this PDE-based process, the smoothing of the cost volume is steered along the structures of the previously computed feature space. Experiments on widely used image databases show that our scheme produces segmentations of clearly superior quality.

Vladislav Kramarev, Oliver Demetz, Christopher Schroers, Joachim Weickert
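
The general cost-volume framework summarized in the abstract above (evaluate a cost for every label at every pixel, smooth each label's cost slice spatially, then pick the cheapest label per pixel) can be sketched as below. This is only an illustration of the framework: a plain 3x3 box filter stands in for the paper's anisotropic diffusion, the costs are toy numbers rather than SVM outputs, and all function names are hypothetical.

```python
def box_smooth(slice_2d):
    """Average each pixel with its 3x3 neighborhood (clipped at borders)."""
    h, w = len(slice_2d), len(slice_2d[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [slice_2d[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def segment(cost_volume):
    """cost_volume[label][y][x] -> label map after per-label smoothing."""
    smoothed = [box_smooth(s) for s in cost_volume]
    h, w = len(smoothed[0]), len(smoothed[0][0])
    labels = range(len(smoothed))
    # Winner-takes-all: no global energy minimization is performed.
    return [[min(labels, key=lambda l: smoothed[l][y][x]) for x in range(w)]
            for y in range(h)]


if __name__ == "__main__":
    # Label 0 is cheap everywhere except one noisy center pixel.
    cost0 = [[0.2, 0.2, 0.2], [0.2, 0.9, 0.2], [0.2, 0.2, 0.2]]
    cost1 = [[0.5] * 3 for _ in range(3)]
    print(segment([cost0, cost1]))
```

In this toy case the unsmoothed costs would assign the center pixel to label 1, while smoothing lets its neighbors outvote the outlier. An isotropic filter like this one also blurs across object boundaries, which is the motivation for structure-aware filtering.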

