
2015 | Book

Computer Vision -- ACCV 2014

12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, Revised Selected Papers, Part II


About this book

The five-volume set LNCS 9003--9007 constitutes the thoroughly refereed post-conference proceedings of the 12th Asian Conference on Computer Vision, ACCV 2014, held in Singapore, Singapore, in November 2014.

The total of 227 contributions presented in these volumes was carefully reviewed and selected from 814 submissions. The papers are organized in topical sections on recognition; 3D vision; low-level vision and features; segmentation; face and gesture, tracking; stereo, physics, video and events; and poster sessions 1-3.

Table of Contents

Frontmatter

Poster Session 1

Frontmatter
Multi-view Geometry Compression

For large-scale and highly redundant photo collections, eliminating statistical redundancy in multi-view geometry is of great importance to efficient 3D reconstruction. Our approach takes the full set of images with initial calibration and recovered sparse 3D points as inputs, and obtains a subset of views that preserve the final reconstruction accuracy and completeness well. We first construct an image quality graph, in which each vertex represents an input image, and the problem is then to determine a connected sub-graph guaranteeing a consistent reconstruction and maximizing the accuracy and completeness of the final reconstruction. Unlike previous works, which only address the problem of efficient structure from motion (SfM), our technique is highly applicable to the whole reconstruction pipeline, and solves the problems of efficient bundle adjustment, multi-view stereo (MVS), and subsequent variational refinement.

Siyu Zhu, Tian Fang, Runze Zhang, Long Quan
Camera Calibration Based on the Common Self-polar Triangle of Sphere Images

Spheres have been used for camera calibration in recent years. In this paper, a new linear calibration method is proposed using the common self-polar triangle of sphere images. It is shown that any two sphere images have a common self-polar triangle. Accordingly, a simple method for locating the vertices of such triangles is presented. An algorithm for recovering the vanishing line of the support plane using these vertices is developed. This makes it possible to find the imaged circular points, which are used to calibrate the camera. The proposed method starts from an existing theory in projective geometry and recovers five intrinsic parameters without calculating the projected circle center, which makes it more intuitive and simpler than previous linear methods. Experiments with simulated data, as well as real images, show that our technique is robust and accurate.

Haifei Huang, Hui Zhang, Yiu-ming Cheung
Multi-scale Tetrahedral Fusion of a Similarity Reconstruction and Noisy Positional Measurements

The fusion of a 3D reconstruction, known up to a similarity transformation, from monocular videos with metric positional measurements from GPS usually relies on the alignment of the two coordinate systems. When the positional measurements provided by a low-cost GPS are corrupted by high levels of noise, this approach becomes problematic. In this paper, we introduce a novel framework that uses similarity invariants to form a tetrahedral network of views for the fusion. Such a tetrahedral network decouples the alignment from the fusion to combat the high noise levels. We then update the similarity transformation each time a well-conditioned motion of cameras is successfully identified. Moreover, we develop a multi-scale sampling strategy to reduce the computational load and to adapt the algorithm to different levels of noise. It is important to note that our optimization framework can be applied in both batch and incremental manners. Experiments on simulations and real datasets demonstrate the robustness and the efficiency of our method.

Runze Zhang, Tian Fang, Siyu Zhu, Long Quan
DEPT: Depth Estimation by Parameter Transfer for Single Still Images

In this paper, we propose a new method for automatic depth estimation from color images using parameter transfer. By modeling the correlation between color images and their depth maps with a set of parameters, we build a database of parameter sets. Given an input image, we compute its high-level features to find the best-matched image set in the database. The set of parameters corresponding to the best match is then used to estimate the depth of the input image. Compared to previous learning-based methods, our trained database consists only of learned features and parameter sets, which occupy little space. We evaluate our depth estimation method on benchmark RGB-D (RGB + depth) datasets. The experimental results are comparable to the state-of-the-art, demonstrating the promising performance of our proposed method.

Xiu Li, Hongwei Qin, Yangang Wang, Yongbing Zhang, Qionghai Dai
Object Ranking on Deformable Part Models with Bagged LambdaMART

Object detection based on sliding windows has long been treated as a binary classification problem, but this formulation ignores the ordering of examples. Deformable part models, which have achieved great success in object detection, share the same problem. This paper aims to give a better ordering to the detections produced by deformable part models. We use bagged LambdaMART to model both pair-wise and list-wise relationships between detections. Experiments show that our ranking models not only significantly improve detection rates compared to basic deformable part model detectors, but also outperform classification methods using the same features.

Chaobo Sun, Xiaojie Wang, Peng Lu
Representation Learning with Smooth Autoencoder

In this paper, we propose a novel autoencoder variant, the smooth autoencoder (SmAE), to learn robust and discriminative feature representations. Different from conventional autoencoders, which reconstruct each sample from its encoding, we use the encoding of each sample to reconstruct its local neighbors. In this way, the learned representations are consistent among local neighbors and robust to small variations of the inputs. When trained with supervisory information, our approach forces samples from the same class to become more compact in the vicinity of data manifolds in the new representation space, where the samples are easier to discriminate. Experimental results verify the effectiveness of the representations learned by our approach in image classification and face recognition tasks.

Kongming Liang, Hong Chang, Zhen Cui, Shiguang Shan, Xilin Chen
Single Image Smoke Detection

Despite the recent advances in smoke detection from video, detection of smoke from single images is still a challenging problem with both practical and theoretical implications. However, there is hardly any reported research on this topic in the literature. This paper addresses this problem by proposing a novel feature to detect smoke in a single image. An image formation model that expresses an image as a linear combination of smoke and non-smoke (background) components is derived based on the atmospheric scattering models. The separation of the smoke and non-smoke components is formulated as convex optimization that solves a sparse representation problem. Using the separated quasi-smoke and quasi-background components, the feature is constructed as a concatenation of the respective sparse coefficients. Extensive experiments were conducted and the results have shown that the proposed feature significantly outperforms the existing features for smoke detection.

Hongda Tian, Wanqing Li, Philip Ogunbona, Lei Wang
Adaptive Sparse Coding for Painting Style Analysis

Inspired by the outstanding performance of sparse coding in applications such as image denoising, restoration, and classification, we propose an adaptive sparse coding method for painting style analysis, a task traditionally carried out by art connoisseurs and experts. A significant improvement over previous sparse coding methods, which heavily rely on comparisons between query paintings, our method is able to determine the authenticity of a single query painting based on an estimated decision boundary. Firstly, discriminative patches containing the most representative characteristics of the given authentic samples are extracted by exploiting the statistical information of their representation on the DCT basis. Subsequently, an adaptive sparsity constraint, which assigns a higher sparsity weight to patches with a higher discriminative level, is enforced to make the dictionary trained on such patches more exclusively adapted to the authentic samples than with previous sparse coding algorithms. Relying on the learnt dictionary, the query painting is authenticated if both better denoising performance and a sparser representation are obtained; otherwise it is denied. Extensive experiments on impressionist style paintings demonstrate the efficiency and effectiveness of our method.

Zhi Gao, Mo Shan, Loong-Fah Cheong, Qingquan Li
Efficient Image Detail Mining

Two novel problems straddling the boundary between image retrieval and data mining are formulated: for every pixel in the query image, (i) find the database image with the maximum resolution depicting the pixel and (ii) find the frequency with which it is photographed in detail.

An efficient and reliable solution for both problems is proposed based on two novel techniques, the hierarchical query expansion that exploits the document at a time (DAAT) inverted file and a geometric consistency verification sufficiently robust to prevent topic drift within a zooming search.

Experiments show that the proposed method finds surprisingly fine details on landmarks, even those that are hardly noticeable for humans.

Andrej Mikulík, Filip Radenović, Ondřej Chum, Jiří Matas
Accuracy and Specificity Trade-off in k-nearest Neighbors Classification

The k-NN rule is a simple, flexible and widely used non-parametric decision method, also connected to many problems in image classification and retrieval such as annotation and content-based search. As the number of classes increases and finer classification is considered (e.g. a specific dog breed), high accuracy is often not possible in such challenging conditions, resulting in a system that will often suggest a wrong label. However, predicting a broader concept (e.g. dog) is much more reliable, and still useful in practice. Thus, sacrificing some specificity for a more secure prediction is often desirable. This problem has recently been posed in terms of an accuracy-specificity trade-off. In this paper we study the accuracy-specificity trade-off in k-NN classification, evaluating the impact of related techniques (posterior probability estimation and metric learning). Experimental results show that a proper combination of k-NN and metric learning can be very effective and obtains good performance.

Luis Herranz, Shuqiang Jiang
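The broader-concept fallback described in the abstract above is easy to illustrate. The following is a minimal sketch, not the authors' method: a plain k-NN classifier that estimates a posterior from the fraction of neighbor votes and backs off to a coarse label (e.g. "dog" instead of a specific breed) when the vote is not decisive. All names, thresholds and the toy data are hypothetical.

```python
import numpy as np

def knn_predict_with_fallback(X_train, fine_labels, coarse_of, x, k=5, threshold=0.6):
    """Predict a fine-grained label with k-NN; fall back to the broader
    (coarse) concept when the estimated posterior is too low."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]
    votes = fine_labels[nn]
    # Posterior estimate: fraction of neighbors voting for the top fine label
    labels, counts = np.unique(votes, return_counts=True)
    best = labels[np.argmax(counts)]
    posterior = counts.max() / k
    if posterior >= threshold:
        return best                      # confident: keep the specific label
    # otherwise vote on the coarse labels instead
    coarse_votes = np.array([coarse_of[l] for l in votes])
    c_labels, c_counts = np.unique(coarse_votes, return_counts=True)
    return c_labels[np.argmax(c_counts)]

# toy data: two dog breeds and a cat, mapped to coarse classes
X = np.array([[0.0], [0.1], [0.9], [1.0], [5.0], [5.1]])
y = np.array([0, 0, 1, 1, 2, 2])        # 0=beagle, 1=husky, 2=tabby
coarse = {0: "dog", 1: "dog", 2: "cat"}
print(knn_predict_with_fallback(X, y, coarse, np.array([0.5]), k=4))  # prints "dog"
```

With the query at 0.5 the two breeds tie (posterior 0.5), so the classifier declines the specific prediction and returns the reliable broader concept, which is exactly the trade-off the paper studies.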
Multi-view Point Cloud Registration Using Affine Shape Distributions

Registration is crucial for the reconstruction of multi-view single plane illumination microscopy. By using fluorescent beads as fiduciary markers, this registration problem can be reduced to the problem of point cloud registration. We present a novel method for registering point clouds across views. It is based on a new local geometric descriptor, the affine shape distribution, which represents the random spatial pattern of each point and its neighbourhood. To enhance its robustness and discriminative power against missing data and outliers, a permutation and voting scheme based on affine shape distributions is developed to establish putative correspondence pairs across views. The underlying affine transformations are estimated from the putative correspondence pairs via random sample consensus. The proposed method is evaluated on three types of datasets including 3D random points, benchmark datasets and datasets from multi-view microscopy. Experiments show that the proposed method outperforms the state-of-the-art when both point sets are contaminated by an extremely large number of outliers. Its robustness against anisotropic z-stretching is also demonstrated in the registration of multi-view microscopy data.

Jia Du, Wei Xiong, Wenyu Chen, Jierong Cheng, Yue Wang, Ying Gu, Shue-Ching Chia
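The final estimation step described above, fitting an affine transformation to putative correspondences via random sample consensus, can be sketched generically. The snippet below is an illustrative RANSAC loop for a 2D affine transform; the paper works with 3D points and affine shape distributions, whereas this toy version assumes correspondences are already given and only shows the robust estimation stage.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine map src -> dst (both N x 2); returns a 3 x 2
    matrix whose columns hold the affine coefficients and translation."""
    n = src.shape[0]
    M = np.hstack([src, np.ones((n, 1))])          # N x 3 design matrix
    sol, *_ = np.linalg.lstsq(M, dst, rcond=None)  # 3 x 2 solution
    return sol

def ransac_affine(src, dst, iters=200, tol=0.05, seed=0):
    """RANSAC: repeatedly fit a minimal 3-point sample, keep the model
    with the most correspondences within `tol` of their prediction."""
    rng = np.random.default_rng(seed)
    best_sol, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)   # minimal sample
        sol = fit_affine(src[idx], dst[idx])
        pred = np.hstack([src, np.ones((len(src), 1))]) @ sol
        inliers = np.sum(np.linalg.norm(pred - dst, axis=1) < tol)
        if inliers > best_inliers:
            best_sol, best_inliers = sol, inliers
    return best_sol, best_inliers

# toy data: rotate + translate 20 points, corrupt 5 with gross outliers
rng = np.random.default_rng(1)
src = rng.random((20, 2))
A = np.array([[0.0, -1.0], [1.0, 0.0]])   # 90-degree rotation
dst = src @ A.T + np.array([2.0, 3.0])
dst[:5] += rng.random((5, 2)) * 10 + 1    # 5 gross outlier correspondences
sol, inliers = ransac_affine(src, dst)
print(inliers)   # prints 15, the number of uncorrupted correspondences
```

Because a minimal sample of three non-collinear inlier pairs determines the affine map exactly, the consensus set recovers precisely the uncorrupted correspondences despite 25% outliers.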
Part Detector Discovery in Deep Convolutional Neural Networks

Current fine-grained classification approaches often rely on a robust localization of object parts to extract localized feature representations suitable for discrimination. However, part localization is a challenging task due to the large variation in appearance and pose. In this paper, we show how pre-trained convolutional neural networks can be used for robust and efficient object part discovery and localization without the need to actually train the network on the current dataset. Our approach, called “part detector discovery” (PDD), is based on analyzing the gradient maps of the network outputs and finding activation centers spatially related to annotated semantic parts or bounding boxes. This allows us not only to obtain excellent performance on the CUB200-2011 dataset, but also, in contrast to previous approaches, to perform detection and bird classification jointly without requiring a bounding box annotation during testing or ground-truth parts during training.

Marcel Simon, Erik Rodner, Joachim Denzler
Performance Evaluation of 3D Local Feature Descriptors

A number of 3D local feature descriptors have been proposed in the literature. It is, however, unclear which descriptors are more appropriate for a particular application. This paper compares nine popular local descriptors in the context of 3D shape retrieval, 3D object recognition, and 3D modeling. We first evaluate these descriptors on six popular datasets in terms of descriptiveness. We then test their robustness with respect to support radius, Gaussian noise, shot noise, varying mesh resolution, image boundary, and keypoint localization errors. Our extensive tests show that Tri-Spin-Images (TriSI) has the best overall performance across all datasets. Unique Shape Context (USC), Rotational Projection Statistics (RoPS), 3D Shape Context (3DSC), and Signature of Histograms of OrienTations (SHOT) also achieved acceptable overall results.

Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, Jianwei Wan, Jun Zhang
Scene Text Detection Based on Robust Stroke Width Transform and Deep Belief Network

Text detection in natural scene images is an open and challenging problem due to the significant variations in the appearance of the text itself and in its interaction with the context. In this paper, we present a novel text detection method combining two main ingredients: a robust extension of the Stroke Width Transform (SWT) and Deep Belief Network (DBN) based discrimination of text objects from other scene components. In the former, smoothness-based edge information is combined with gradients to generate high-quality edge images, and various edge cues are exploited in Connected Component (CC) analysis on the basis of SWT to eliminate inter-character and intra-character errors. In the latter, a DBN is exploited to learn efficient representations discriminating character from non-character CCs, resulting in improved detection accuracy. The proposed method is evaluated on the public ICDAR and SVT datasets and achieves state-of-the-art results, which reveal the effectiveness of the method.

Hailiang Xu, Like Xue, Feng Su
Cross-Modal Face Matching: Beyond Viewed Sketches

Matching face images across different modalities is a challenging open problem for various reasons, notably feature heterogeneity, and particularly in the case of sketch recognition – abstraction, exaggeration and distortion. Existing studies have attempted to address this task by engineering invariant features, or learning a common subspace between the modalities. In this paper, we take a different approach and explore learning a mid-level representation within each domain that allows faces in each modality to be compared in a domain invariant way. In particular, we investigate sketch-photo face matching and go beyond the well-studied viewed sketches to tackle forensic sketches and caricatures where representations are often symbolic. We approach this by learning a facial attribute model independently in each domain that represents faces in terms of semantic properties. This representation is thus more invariant to heterogeneity and distortions, and robust to mis-alignment. Our intermediate-level attribute representation is then integrated synergistically with the original low-level features using CCA. Our framework shows impressive results on cross-modal matching tasks using forensic sketches, and even more challenging caricature sketches. Furthermore, we create a new dataset with ≈59,000 attribute annotations for evaluation and to facilitate future research.

Shuxin Ouyang, Timothy Hospedales, Yi-Zhe Song, Xueming Li
3D Aware Correction and Completion of Depth Maps in Piecewise Planar Scenes

RGB-D sensors are popular in the computer vision community, especially for problems of scene understanding, semantic scene labeling, and segmentation. However, most of these methods depend on reliable input depth measurements, while discarding unreliable ones. This paper studies how reliable depth values can be used to correct the unreliable ones, and how to complete (or extend) the available depth data beyond the raw measurements of the sensor (i.e. infer depth at pixels with unknown depth values), given a prior model on the 3D scene. We consider piecewise planar environments in this paper, since many indoor scenes with man-made objects can be modeled as such. We propose a framework that uses the RGB-D sensor’s noise profile to adaptively and robustly fit plane segments (e.g. floor and ceiling) and iteratively complete the depth map, when possible. Depth completion is formulated as a discrete labeling problem (MRF) with hard constraints and solved efficiently using graph cuts. To regularize this problem, we exploit 3D and appearance cues that encourage pixels to take on depth values that will be compatible in 3D to the piecewise planar assumption. Extensive experiments, on a new large-scale and challenging dataset, show that our approach results in more accurate depth maps (with 20% more depth values) than those recorded by the RGB-D sensor. Additional experiments on the NYUv2 dataset show that our method generates more 3D aware depth. These generated depth maps can also be used to improve the performance of a state-of-the-art RGB-D SLAM method.

Ali K. Thabet, Jean Lahoud, Daniel Asmar, Bernard Ghanem
Regularity Guaranteed Human Pose Correction

Benefiting from the advantages provided by depth sensors, 3D human pose estimation has become feasible. However, current estimation systems usually yield poor results due to severe occlusion and sensor noise in depth data. In this paper, we focus on a post-processing step, pose correction, which takes the initial estimated poses as input and delivers more reliable results. Although the regression-based correction approach [1] has shown its effectiveness in decreasing the estimation errors, it cannot guarantee the regularity of corrected poses. To address this issue, we formulate pose correction as an optimization problem, which combines the output of the regression model with a pose prior model learned on a pre-captured motion data set. By considering the complexity and the geometric properties of the pose data, the pose prior is estimated by von Mises-Fisher distributions in subspaces following divide-and-conquer strategies. By introducing the pose prior into our optimization framework, the regularity of the corrected poses is guaranteed. The experimental results on a challenging data set demonstrate that the proposed pose correction approach not only improves the accuracy, but also outputs more regular poses, compared to the state-of-the-art.

Wei Shen, Rui Lei, Dan Zeng, Zhijiang Zhang
Accelerated Kmeans Clustering Using Binary Random Projection

Codebooks have been widely used for image retrieval and image indexing, and they are the core elements of mobile visual search. Building a vocabulary tree is carried out offline, because clustering a large amount of training data takes a long time. Recently proposed adaptive vocabulary trees do not require offline training, but suffer from the burden of online computation. The necessity of clustering large amounts of high-dimensional data arises in both offline and online training. In this paper, we present a novel clustering method that reduces the computational burden without losing accuracy. Feature selection is used to reduce the computational complexity with high-dimensional data, and an ensemble learning model is used to improve the efficiency with a large number of data points. We demonstrate that the proposed method outperforms state-of-the-art approaches in terms of computational complexity on various synthetic and real datasets.

Yukyung Choi, Chaehoon Park, In So Kweon
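As a rough illustration of clustering after dimensionality reduction, here is a minimal sketch, not the authors' algorithm: plain Lloyd's k-means run on data projected through a random ±1 ("binary") matrix, which cuts the per-distance cost from the original dimension down to the projected one. The ensemble-learning component of the paper is omitted, and all names and parameters are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers, keeping old ones for empty clusters
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def projected_kmeans(X, k, d_low, seed=0):
    """Cluster after a binary (+/-1) random projection to d_low dims."""
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], d_low)) / np.sqrt(d_low)
    return kmeans(X @ R, k, seed=seed)

# toy: two well-separated blobs in 100-D, clustered in only 8-D
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (30, 100)),
               rng.normal(3, 0.1, (30, 100))])
labels, _ = projected_kmeans(X, k=2, d_low=8)
# all points in each blob end up with a single, distinct label
```

Because random sign projections approximately preserve pairwise distances (the Johnson-Lindenstrauss intuition), well-separated clusters remain separable in the low-dimensional space while each distance computation touches 8 coordinates instead of 100.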
Divide and Conquer: Efficient Large-Scale Structure from Motion Using Graph Partitioning

Despite significant advances in recent years, structure-from-motion (SfM) pipelines suffer from two important drawbacks. Apart from requiring significant computational power to solve the large-scale computations involved, such pipelines sometimes fail to reconstruct correctly when the accumulated error in incremental reconstruction is large or when the number of 3D-to-2D correspondences is insufficient. In this paper we present a novel approach to mitigate the above-mentioned drawbacks. Using an image match graph based on matching features, we partition the image data set into smaller sets or components which are reconstructed independently. Following such reconstructions, we utilise the available epipolar relationships that connect images across components to correctly align the individual reconstructions in a global frame of reference. This results in a significant speed-up of at least one order of magnitude and also mitigates the problem of reconstruction failures, with a marginal loss in accuracy. The effectiveness of our approach is demonstrated on several large-scale real-world data sets.

Brojeshwar Bhowmick, Suvam Patra, Avishek Chatterjee, Venu Madhav Govindu, Subhashis Banerjee
A Homography Formulation to the 3pt Plus a Common Direction Relative Pose Problem

In this paper we present an alternative formulation for the minimal solution to the 3pt plus a common direction relative pose problem. Instead of the commonly used epipolar constraint we use the homography constraint to derive a novel formulation for the 3pt problem. This formulation allows the computation of the normal vector of the plane defined by the three input points without any additional computation in addition to the standard motion parameters of the camera. We show the working of the method on synthetic and real data sets and compare it to the standard 3pt method and the 5pt method for relative pose estimation. In addition we analyze the degenerate conditions for the proposed method.

Olivier Saurer, Pascal Vasseur, Cedric Demonceaux, Friedrich Fraundorfer
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation

In this work, we propose a novel and efficient method for articulated human pose estimation in videos using a convolutional network architecture, which incorporates both color and motion features. We propose a new human body pose dataset, FLIC-motion (this dataset can be downloaded from http://cs.nyu.edu/~ajain/accv2014/), that extends the FLIC dataset [1] with additional motion features. We apply our architecture to this dataset and report significantly better performance than current state-of-the-art pose detection systems.

Arjun Jain, Jonathan Tompson, Yann LeCun, Christoph Bregler
Accelerating Cost Volume Filtering Using Salient Subvolumes and Robust Occlusion Handling

Several fundamental computer vision problems, such as depth estimation from stereo, optical flow computation, etc., can be formulated as a discrete pixel labeling problem. Traditional Markov Random Fields (MRF) based solutions to these problems are computationally expensive. Cost Volume Filtering (CF) presents a compelling alternative. Still these methods must filter the entire cost volume to arrive at a solution. In this paper, we propose a new CF method for depth estimation by stereo. First, we propose the Accelerated Cost Volume Filtering (ACF) method which identifies salient subvolumes in the cost volume. Filtering is restricted to these subvolumes, resulting in significant performance gains. The proposed method does not consider the entire cost volume and results in a marginal increase in unlabeled (occluded) pixels. We address this by developing an Occlusion Handling (OH) technique, which uses superpixels and performs label propagation via a simulated annealing inspired method. We evaluate the proposed method (ACF+OH) on the Middlebury stereo benchmark and on high resolution images from Middlebury 2005/2006 stereo datasets, and our method achieves state-of-the-art results. Our occlusion handling method, when used as a post-processing step, also significantly improves the accuracy of two recent cost volume filtering methods.

Mohamed A. Helala, Faisal Z. Qureshi
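The cost-volume-filtering baseline that this paper accelerates can be sketched compactly. Below is a toy winner-take-all stereo pipeline, not the proposed ACF+OH method: it builds an absolute-difference cost volume, filters every disparity slice (a box filter stands in for the edge-preserving guided filter), and takes the per-pixel argmin. Filtering every slice of the full volume is exactly the cost that the salient-subvolume idea avoids.

```python
import numpy as np

def box_filter(img, r):
    """Mean over a (2r+1)x(2r+1) window, edge-padded: the aggregation
    step of cost volume filtering, with a box kernel for simplicity."""
    p = np.pad(img, r, mode='edge')
    out = np.zeros(img.shape)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (2 * r + 1) ** 2

def wta_stereo(left, right, max_disp, r=1):
    """Winner-take-all disparity from a filtered absolute-difference
    cost volume (the whole volume is filtered, slice by slice)."""
    H, W = left.shape
    cost = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        diff = np.abs(left[:, d:] - right[:, :W - d])   # cost slice at d
        cost[d, :, d:] = box_filter(diff, r)
    return cost.argmin(axis=0)                          # label per pixel

# toy pair: the left image is the right image shifted by 2 pixels
rng = np.random.default_rng(0)
right = rng.random((10, 20))
left = np.empty_like(right)
left[:, 2:] = right[:, :-2]
left[:, :2] = right[:, :2]
disp = wta_stereo(left, right, max_disp=4)
# disp[:, 2:] is 2 everywhere: the true shift is recovered
```

Even this toy shows where the compute goes: one filtering pass per disparity label, which is what restricting the work to salient subvolumes would shortcut.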
3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network

In this paper, we propose a deep convolutional neural network for 3D human pose estimation from monocular images. We train the network using two strategies: (1) a multi-task framework that jointly trains pose regression and body part detectors; (2) a pre-training strategy where the pose regressor is initialized using a network trained for body part detection. We evaluate our network on a large data set and achieve significant improvement over baseline methods. Human pose estimation is a structured prediction problem, i.e., the locations of the body parts are highly correlated. Although we do not add constraints about the correlations between body parts to the network, we empirically show that the network has disentangled the dependencies among different body parts and learned their correlations.

Sijin Li, Antoni B. Chan
Plant Leaf Identification via a Growing Convolution Neural Network with Progressive Sample Learning

Plant identification is an important problem for ecologists, amateur botanists, educators, and others. The leaf, which can be easily obtained, is usually one of the most informative parts of a plant. In this paper, we propose a growing convolution neural network (GCNN) for plant leaf identification and report promising results on the ImageCLEF2012 Plant Identification database. The GCNN has a growing structure that starts training from a single convolution kernel, to which new convolution neurons are gradually added. Simultaneously, the growing connection weights are modified until the squared error reaches the desired level. Moreover, we propose a progressive learning method to determine the number of learning samples, which can further improve the recognition rate. Experiments and analyses show that our proposed GCNN outperforms other state-of-the-art algorithms such as the traditional CNN and hand-crafted features with SVM classifiers.

Zhong-Qiu Zhao, Bao-Jian Xie, Yiu-ming Cheung, Xindong Wu
Understanding Convolutional Neural Networks in Terms of Category-Level Attributes

It has been recently reported that convolutional neural networks (CNNs) show good performance in many image recognition tasks. They significantly outperform previous approaches that are not based on neural networks, particularly for object category recognition. These performances are arguably owing to their ability to discover better image features for recognition tasks through learning, resulting in the acquisition of better internal representations of the inputs. However, in spite of the good performances, it remains an open question why CNNs work so well and how they can learn such good representations. In this study, we conjecture that the learned representation can be interpreted as category-level attributes that have good properties. We conducted several experiments using the dataset AwA (Animals with Attributes) and a CNN trained for ILSVRC-2012 in a fully supervised setting to examine this conjecture. We report that there exist units in the CNN that can predict some of the 85 semantic attributes fairly accurately, along with a detailed observation that this is true only for visual attributes and not for non-visual ones. It is natural to think that the CNN may discover not only semantic attributes but also non-semantic ones (or ones that are difficult to represent as a word). To explore this possibility, we perform zero-shot learning by regarding the activation patterns of upper layers as attributes describing the categories. The result shows that it outperforms the state-of-the-art by a significant margin.

Makoto Ozeki, Takayuki Okatani
Robust Scene Classification with Cross-Level LLC Coding on CNN Features

Convolutional Neural Network (CNN) features have demonstrated outstanding performance as global representations for image classification, but they lack invariance to scale transformations, which makes it difficult to adapt them to various complex tasks such as scene classification. To strengthen the scale invariance of CNN features while retaining their powerful discrimination for scene classification, we propose a framework combining cross-level Locality-constrained Linear Coding and cascaded fine-tuned CNN features, which we abbreviate as cross-level LLC-CNN. Specifically, this framework first fine-tunes multi-level CNNs in a cascaded way, then extracts multi-level CNN features to learn a cross-level universal codebook, and finally performs locality-constrained linear coding (LLC) and max-pooling on the patches of all levels to form the final representation. It is experimentally verified that the LLC responses on the universal codebook outperform the CNN features and achieve state-of-the-art performance on the two currently largest scene classification benchmarks, MIT Indoor Scenes and SUN 397.

Zequn Jie, Shuicheng Yan
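Locality-constrained linear coding, the building block used above, has a well-known analytical approximation. The sketch below codes a single descriptor against a codebook by solving a small constrained least-squares system on its k nearest atoms; it is a generic LLC illustration under toy data, not the paper's cross-level pipeline, and the regularization constant is an assumed value.

```python
import numpy as np

def llc_code(x, codebook, k=5, lam=1e-4):
    """Locality-constrained linear coding for one descriptor: reconstruct
    x from its k nearest codewords under a sum-to-one constraint."""
    d = np.linalg.norm(codebook - x, axis=1)
    nn = np.argsort(d)[:k]                    # locality: k nearest atoms
    B = codebook[nn] - x                      # shift selected atoms to x
    C = B @ B.T                               # local covariance (rank-deficient)
    C += lam * np.trace(C) * np.eye(k)        # ridge term makes it solvable
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                              # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[nn] = w                              # sparse code over the full codebook
    return code

# toy usage: code a 3-D descriptor against a random 50-atom codebook
rng = np.random.default_rng(0)
codebook = rng.random((50, 3))
x = rng.random(3)
c = llc_code(x, codebook, k=5)
# c has exactly 5 nonzeros and sums to one
```

The resulting code is sparse by construction (only the k local atoms are active), which is what makes max-pooling over such codes cheap at the patch level.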
A Graphical Model for Rapid Obstacle Image-Map Estimation from Unmanned Surface Vehicles

Obstacle detection plays an important role in unmanned surface vehicles (USV). Continuous detection from images taken onboard the vessel poses a particular challenge due to the diversity of the environment and of obstacle appearance. An obstacle may be a floating piece of wood, a scuba diver, a pier, or some other part of a shoreline. In this paper we tackle this problem by proposing a new graphical model that affords fast and continuous obstacle image-map estimation from a single video stream captured onboard a USV. The model accounts for the semantic structure of the marine environment as observed from the USV by imposing weak structural constraints. A Markov random field framework is adopted and a highly efficient algorithm for simultaneous optimization of model parameters and segmentation mask estimation is derived. Our approach does not require computationally intensive extraction of texture features and runs faster than real-time. We also present a new, challenging dataset for segmentation and obstacle detection in marine environments, which is the largest annotated dataset of its kind. Results on this dataset show that our model compares favorably in accuracy to the related approaches while requiring a fraction of the computational effort.

Matej Kristan, Janez Perš, Vildana Sulič, Stanislav Kovačič
On the Performance of Pose-Based RGB-D Visual Navigation Systems

This paper presents a thorough performance analysis of several variants of a feature-based visual navigation system that uses RGB-D data to estimate the trajectory of a freely moving sensor in real time. The evaluation focuses on the advantages and problems associated with choosing a particular structure of the sensor-tracking front-end, employing particular feature detectors/descriptors, and optimizing the resulting trajectory, treated as a graph of sensor poses. Moreover, a novel yet simple graph pruning algorithm is introduced, which enables the removal of spurious edges from the pose-graph. The experimental evaluation is performed on two publicly available RGB-D data sets to ensure that our results are scientifically verifiable.

Dominik Belter, Michał Nowicki, Piotr Skrzypczyński
Elastic Shape Analysis of Boundaries of Planar Objects with Multiple Components and Arbitrary Topologies

We consider boundaries of planar objects as level set distance functions and present a Riemannian metric for their comparison and analysis. The metric is based on a parameterization-invariant framework for shape analysis of quadrilateral surfaces. Most previous Riemannian formulations of 2D shape analysis are restricted to curves that can be parameterized with a single parameter domain. However, 2D shapes may contain multiple connected components and many internal details that cannot be captured with such parameterizations. In this paper we propose to register planar curves of arbitrary topologies by utilizing the re-parameterization group of quadrilateral surfaces. The criterion used for computing this registration is a proper distance, which can be used to quantify differences between the level set functions and is especially useful in classification. We demonstrate this framework with multiple examples using toy curves, medical imaging data, subsets of the TOSCA data set, 2D hand-drawn sketches, and a 2D version of the SHREC07 data set. We demonstrate that our method outperforms the state-of-the-art in the classification of 2D sketches and performs well compared to other state-of-the-art methods on complex planar shapes.

Sebastian Kurtek, Hamid Laga, Qian Xie

3D Vision

Frontmatter
A Minimal Solution to Relative Pose with Unknown Focal Length and Radial Distortion

In this paper, we study the minimal problem of estimating the essential matrix between two cameras with constant but unknown focal length and radial distortion. This problem is of both theoretical and practical interest, and it has not been solved previously. We derive a fast and stable polynomial solver based on the Gröbner basis method. This solver enables simultaneous auto-calibration of focal length and radial distortion. In the experiments, the numerical stability of the solver is demonstrated on synthetic data. We also evaluate on real images using either RANSAC or kernel voting. Compared with the standard minimal solver, which does not model radial distortion, our proposed solver both finds a larger set of geometrically correct correspondences on distorted images and gives accurate estimates of the radial distortion and focal length.

Fangyuan Jiang, Yubin Kuang, Jan Erik Solem, Kalle Åström
Simultaneous Entire Shape Registration of Multiple Depth Images Using Depth Difference and Shape Silhouette

This paper proposes a method for simultaneous global registration of multiple depth images obtained from multiple viewpoints. Unlike previous methods, the proposed method fully utilizes a silhouette-based cost function that takes out-of-view and non-overlapping regions into account, in addition to depth differences in overlapping areas. By combining these cost functions with a recent, powerful meta-heuristic named self-adaptive Differential Evolution, it achieves entire-shape reconstruction from a relatively small number (three or four) of depth images, which do not contain enough overlapping regions for Iterative Closest Point even if they are prealigned. In addition, to make the technique applicable not only to time-of-flight sensors but also to projector-camera systems, whose silhouettes are deficient due to occlusions, we propose a simple solution based on color-based silhouettes. Experimental results show that the proposed method can reconstruct the entire shape from only three depth images, for both synthetic and real data. The influence of noise and inaccurate silhouettes is also evaluated.

Takuya Ushinohama, Yosuke Sawai, Satoshi Ono, Hiroshi Kawasaki
Joint Camera Pose Estimation and 3D Human Pose Estimation in a Multi-camera Setup

In this paper we propose an approach to jointly perform camera pose estimation and human pose estimation from videos recorded by a set of cameras separated by wide baselines. Multi-camera pose estimation is very challenging in the case of wide baselines, or more generally when patch-based feature correspondences are difficult to establish across images.

For this reason, we propose to exploit the motion of an articulated structure in the scene, such as a human, to relate these cameras. More precisely, we first run a part-based human pose estimation for each camera and each frame independently. Correctly detected joints are then used to compute an initial estimate of the epipolar geometry between pairs of cameras. In a combined optimization over all the recorded sequences, the multi-camera configuration and the 3D motion of the kinematic structure in the scene are inferred. The optimization accounts for time continuity, part-based detection scores, optical flow, and body part visibility.

Our approach was evaluated on four publicly available datasets, assessing the accuracy of both the camera poses and the human poses.
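The epipolar-geometry initialization from detected joints can be sketched with the standard linear eight-point algorithm. This is a generic sketch, not the paper's combined optimization, and the synthetic two-camera setup below is an assumption for testing:

```python
import numpy as np

def fundamental_from_joints(pts1, pts2):
    """Linear eight-point estimate of the fundamental matrix F from >= 8
    corresponding 2D joint detections, so that x2^T F x1 = 0 holds for
    homogeneous correspondences (x1, x2)."""
    A = np.array([[x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2, x1, y1, 1.0]
                  for (x1, y1), (x2, y2) in zip(pts1, pts2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)   # null vector of A, reshaped to 3x3
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0                 # enforce the rank-2 constraint on F
    return U @ np.diag(s) @ Vt

# synthetic check: cameras P1 = [I|0] and P2 = [I|t] observing 3D joints
rng = np.random.default_rng(0)
X = rng.uniform([-1, -1, 4], [1, 1, 6], size=(8, 3))
pts1 = X[:, :2] / X[:, 2:]                    # projections in camera 1
pts2 = (X[:, :2] + [1.0, 0.0]) / X[:, 2:]     # camera 2 shifted by t = (1,0,0)
F = fundamental_from_joints(pts1, pts2)
res = [abs(np.r_[p2, 1] @ F @ np.r_[p1, 1]) for p1, p2 in zip(pts1, pts2)]
print(max(res) < 1e-8)  # True: epipolar constraint holds on exact data
```

In the paper's setting, such an initial estimate would come from correctly detected joints and then be refined in the combined optimization over all sequences.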

Jens Puwein, Luca Ballan, Remo Ziegler, Marc Pollefeys
Singly-Bordered Block-Diagonal Form for Minimal Problem Solvers

The Gröbner basis method for solving systems of polynomial equations has become very popular in the computer vision community, as it helps to find fast and numerically stable solutions to difficult problems. In this paper, we present a method that can significantly speed up Gröbner basis solvers. We show that the elimination template matrices used in these solvers are usually quite sparse, and that by permuting the rows and columns they can be transformed into matrices with a convenient block-diagonal structure known as the singly-bordered block-diagonal (SBBD) form. The diagonal blocks of the SBBD matrices constitute independent subproblems and can therefore be solved, i.e. eliminated or factored, independently. The computational time can be further reduced on a parallel computer by distributing these blocks to different processors. The speedup is visible even for serial processing, since we perform $$O(n^3)$$ Gauss-Jordan eliminations on smaller matrices (usually two of size approximately $${n \over 2} \times {n \over 2}$$ and one of size $${n \over 3} \times {n \over 3}$$). We propose to compute the SBBD form of the elimination template in an offline preprocessing phase using hypergraph partitioning. The final online Gröbner basis solver works directly with the permuted block-diagonal matrices and can be efficiently parallelized. We demonstrate the usefulness of the presented method by speeding up solvers for several important minimal computer vision problems.
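The cost argument can be sketched numerically: eliminating two $$n/2 \times n/2$$ diagonal blocks costs about $$n^3/4$$ operations versus $$n^3$$ for the full matrix, and the block-wise result agrees with processing the full matrix. Inversion stands in for Gauss-Jordan elimination in this toy sketch; it is not the paper's SBBD solver:

```python
import numpy as np

def block_diag(a, b):
    """Assemble a two-block block-diagonal matrix."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

def eliminate_blocks(blocks):
    """Eliminate (here: invert) each diagonal block independently.
    Two n/2 x n/2 eliminations cost ~2*(n/2)^3 = n^3/4 operations instead
    of n^3, and the blocks can be dispatched to different processors."""
    return [np.linalg.inv(b) for b in blocks]

rng = np.random.default_rng(0)
n = 8
# two well-conditioned n/2 x n/2 diagonal blocks
blocks = [np.eye(n // 2) * 4 + rng.standard_normal((n // 2, n // 2))
          for _ in range(2)]
full = block_diag(*blocks)
invs = eliminate_blocks(blocks)
# block-wise elimination agrees with eliminating the full matrix
print(np.allclose(np.linalg.inv(full), block_diag(*invs)))  # True
```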

Zuzana Kukelova, Martin Bujnak, Jan Heller, Tomáš Pajdla
Stereo Fusion Using a Refractive Medium on a Binocular Base

The performance of depth reconstruction in binocular stereo relies on how adequate the predefined baseline is for a target scene. Wide-baseline stereo discriminates depth better than narrow-baseline stereo, but it often suffers from spatial artifacts. Narrow-baseline stereo can provide a more elaborate depth map with fewer artifacts, while its depth resolution tends to be biased or coarse due to the short disparity. In this paper, we propose a novel optical design for heterogeneous stereo fusion on a binocular imaging system with a refractive medium, where the binocular stereo part operates as wide-baseline stereo and the refractive stereo module works as narrow-baseline stereo. We then introduce a stereo fusion workflow that combines the refractive and binocular stereo algorithms to estimate fine depth information through this fusion design. Quantitative and qualitative results validate the performance of our stereo fusion system in measuring depth, compared with homogeneous stereo approaches.

Seung-Hwan Baek, Min H. Kim

Low-Level Vision and Features

Frontmatter
Saliency Detection via Nonlocal $$L_{0}$$ Minimization

In this paper, motivated by the intrinsic sparsity of an image's saliency map, we propose a novel nonlocal $$L_{0}$$ minimization framework to extract the sparse geometric structure of saliency maps for natural images. Specifically, we first use the $$k$$-nearest neighbors of superpixels to construct a graph in the feature space. A novel $$L_{0}$$-regularized nonlocal minimization model is then developed on the proposed graph to describe the sparsity of saliency maps. Finally, we develop a first-order optimization scheme to solve the proposed non-convex and discrete variational problem. Experimental results on four publicly available data sets show that the proposed approach yields significant improvements over state-of-the-art saliency detection methods.
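The graph-construction step can be sketched as follows, assuming superpixel features are already given as vectors. Brute-force distances are used here, and the function name is illustrative:

```python
import numpy as np

def knn_graph(features, k=2):
    """Build a k-nearest-neighbor graph over superpixel feature vectors.
    Returns an (N, k) index array: row i lists the k superpixels nearest
    to superpixel i in feature space (excluding itself)."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a node is never its own neighbor
    return np.argsort(d, axis=1)[:, :k]

# four toy superpixel features forming two tight clusters
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(knn_graph(feats, k=1).ravel().tolist())  # [1, 0, 3, 2]
```

In the paper, the $$L_{0}$$-regularized model is then defined on the edges of such a graph rather than on a pixel lattice.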

Yiyang Wang, Risheng Liu, Xiaoliang Song, Zhixun Su
$$N^4$$ -Fields: Neural Network Nearest Neighbor Fields for Image Transforms

We propose a new architecture for difficult image processing operations, such as natural edge detection or thin object segmentation. The architecture is based on a simple combination of convolutional neural networks with the nearest neighbor search.

We focus our attention on situations in which the desired image transformation is too hard for a neural network to learn explicitly. We show that in such situations, applying a nearest neighbor search on top of the network output improves the results considerably and compensates for the underfitting effect during neural network training. The approach is validated on three challenging benchmarks, where the performance of the proposed architecture matches or exceeds the state-of-the-art.

Yaroslav Ganin, Victor Lempitsky
Super-Resolution Using Sub-Band Self-Similarity

A popular approach to single-image super-resolution (SR) is to use scaled-down versions of the given image to build an internal training dictionary of pairs of low-resolution (LR) and high-resolution (HR) image patches, which is then used to predict the HR image. This self-similarity approach has the advantage of not requiring a separate external training database. However, due to their limited size, internal dictionaries are often inadequate for finding good matches for patches containing complex structures such as textures. Furthermore, the quality of the matches found is quite sensitive to factors like patch size (larger patches contain structures of greater complexity and may be difficult to match) and the dimensions of the given image (smaller images yield smaller internal dictionaries). In this paper we propose a self-similarity based SR algorithm that addresses these drawbacks. Instead of seeking similar patches directly in the image domain, we apply the self-similarity principle independently to each of a set of sub-band images, obtained using a bank of orientation-selective band-pass filters. The different directional frequency components of a patch thus find matches independently, possibly in different image locations. Essentially, we decompose local image structure into component patches defined by different sub-bands, with the following advantages: (1) the sub-band image patches are simpler and therefore easier to match than the more complex textural patches of the original image; (2) the size of the dictionary defined by patches from the sub-band images is exponential in the number of sub-bands, thus increasing the effective size of the internal dictionary; and (3) as a result, our algorithm exhibits a greater degree of invariance to parameters like patch size and the dimensions of the LR image. We demonstrate these advantages and show that our results are richer in textural content and appear more natural than those of several state-of-the-art methods.
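The decomposition idea can be sketched with simple orientation-selective frequency masks. This is a toy stand-in for the paper's band-pass filter bank (the masks here tile orientation only, not scale), and the function name is illustrative:

```python
import numpy as np

def oriented_subbands(img, n_orient=4):
    """Split an image into orientation-selective sub-bands using hard
    FFT masks. The masks partition the frequency plane by orientation,
    so summing the sub-bands reconstructs the image exactly, and patch
    matching can run independently in each sub-band."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    theta = np.arctan2(yy - h // 2, xx - w // 2) % np.pi  # frequency orientation
    bands = []
    for i in range(n_orient):
        lo, hi = i * np.pi / n_orient, (i + 1) * np.pi / n_orient
        mask = (theta >= lo) & (theta < hi)  # one orientation wedge
        bands.append(np.fft.ifft2(np.fft.ifftshift(F * mask)).real)
    return bands

img = np.random.default_rng(1).standard_normal((32, 32))
bands = oriented_subbands(img)
print(np.allclose(sum(bands), img))  # True: the sub-bands tile the spectrum
```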

Abhishek Singh, Narendra Ahuja
Raindrop Detection and Removal from Long Range Trajectories

In rainy scenes, visibility can be degraded by raindrops that have adhered to the windscreen or camera lens. To resolve this degradation, we propose a method that automatically detects and removes adherent raindrops. The idea is to use long-range trajectories to discover the motion and appearance features of raindrops locally along the trajectories. These motion and appearance features are obtained through our analysis of trajectory behavior when encountering raindrops. Detection is then cast as a labeling problem whose cost function can be optimized efficiently. Having detected raindrops, removal is achieved by utilizing the patches indicated by the trajectories, preserving motion consistency. Our trajectory-based video completion method not only removes the raindrops but also completes the motion field, which can enable motion estimation algorithms to work in rainy scenes. Experimental results on real videos show the effectiveness of the proposed method.

Shaodi You, Robby T. Tan, Rei Kawakami, Yasuhiro Mukaigawa, Katsushi Ikeuchi
Interest Points via Maximal Self-Dissimilarities

We propose a novel interest point detector stemming from the intuition that image patches which are highly dissimilar over a relatively large extent of their surroundings hold the property of being repeatable and distinctive. This concept of contextual self-dissimilarity reverses the key paradigm of recent successful techniques, such as the Local Self-Similarity descriptor and the Non-Local Means filter, which build upon the presence of similar, rather than dissimilar, patches. Moreover, our approach extends to contextual information the local self-dissimilarity notion embedded in established detectors of corner-like interest points, thereby achieving enhanced repeatability, distinctiveness, and localization accuracy.

Federico Tombari, Luigi Di Stefano
Improving Local Features by Dithering-Based Image Sampling

The recent trend of structure-guided feature detectors, as opposed to blob and corner detectors, has led to a family of methods that exploit image edges to accurately capture local shape. Among them, the W$$\alpha $$SH detector combines binary edge sampling with gradient strength and computational-geometry representations to obtain distinctive and repeatable local features. In this work, we provide alternative, variable-density sampling schemes based on dithering over smooth functions of image intensity. These methods are parameter-free and more invariant to geometric transformations than uniform sampling. The resulting detectors compare well to the state-of-the-art, while achieving higher performance in a series of matching and retrieval experiments.

Christos Varytimidis, Konstantinos Rapantzikos, Yannis Avrithis, Stefanos Kollias

Poster Session 2

Frontmatter
Sparse Kernel Learning for Image Set Classification

No single universal image set representation can efficiently encode all types of image set variations. In the absence of expensive validation data, automatically ranking representations with respect to performance is a challenging task. We propose a sparse kernel learning algorithm for automatic selection and integration of the most discriminative subset of kernels derived from different image set representations. By optimizing a sparse linear discriminant analysis criterion, we learn a unified kernel from the linear combination of the best kernels only. Kernel discriminant analysis is then performed on the unified kernel. Experiments on four standard datasets show that the proposed algorithm outperforms current state-of-the-art image set classification and kernel learning algorithms.

Muhammad Uzair, Arif Mahmood, Ajmal Mian
Automatic Feature Learning to Grade Nuclear Cataracts Based on Deep Learning

Cataracts, a clouding of the lens, are the leading cause of blindness worldwide. Assessing the presence and severity of cataracts is essential for diagnosis and progression monitoring, as well as for facilitating clinical research and management of the disease. Existing automatic methods for cataract grading utilize a predefined set of image features that may provide an incomplete, redundant, or even noisy representation. In this work, we propose a system that automatically learns features for grading the severity of nuclear cataracts from slit-lamp images. Local filters learned from image patches are fed into a convolutional neural network, followed by a set of recursive neural networks that extract higher-order features. With these features, support vector regression is applied to determine the cataract grade. The proposed system is validated on a large population-based dataset of $$5378$$ images, where it outperforms the state-of-the-art by yielding, with respect to clinical grading, a mean absolute error ($$\varepsilon $$) of $$0.322$$, a $$68.6\,\%$$ exact integral agreement ratio ($$R_0$$), a $$86.5\,\%$$ decimal grading error $$\le 0.5$$ ($$R_{e0.5}$$), and a $$99.1\,\%$$ decimal grading error $$\le 1.0$$ ($$R_{e1.0}$$).

Xinting Gao, Stephen Lin, Tien Yin Wong
Texture Classification Using Dense Micro-block Difference (DMD)

This paper proposes a novel image representation for texture classification. Recent advances in patch-based features, compressive sensing, and feature encoding are combined to design a robust image descriptor. In our approach, we first propose a local feature, the Dense Micro-block Difference (DMD), which captures local structure from image patches at high scales. Instead of individual pixels, we process small blocks from the image, which capture its micro-structure. DMD can be computed efficiently using integral images. The features are then encoded using the Fisher Vector method to obtain an image descriptor that accounts for higher-order statistics. The proposed image representation is combined with a linear SVM classifier. Experiments are conducted on standard texture datasets (KTH-TIPS-2a, Brodatz, and CUReT). On the KTH-TIPS-2a dataset the proposed method outperforms the best reported results by $$5.5\,\%$$, and it performs comparably to state-of-the-art methods on the other datasets.

Rakesh Mehta, Karen Egiazarian
Nuclear- $$L_1$$ Norm Joint Regression for Face Reconstruction and Recognition

Recognizing a face under significant lighting, disguise, and occlusion variations is an interesting and challenging problem in pattern recognition. To address this problem, many regression-based methods, represented by the sparse representation classifier (SRC), have been presented recently. SRC uses the $$L_1$$-norm to characterize pixel-level sparse noise but ignores the spatial structure of the noise. In this paper, we find that the nuclear norm is well suited to characterizing image-wise structural noise, and we therefore use the nuclear norm and the $$L_1$$-norm to jointly characterize the error image in the regression model. Our experimental results demonstrate that the proposed method is more effective than state-of-the-art regression methods for face reconstruction and recognition.
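The intuition behind the nuclear norm can be checked numerically: a contiguous occlusion block and scattered pixel noise with identical $$L_1$$ norm have very different nuclear norms (the sum of singular values), which is why combining the two norms separates structural from pixel-level noise. This is a toy sketch, not the paper's regression model:

```python
import numpy as np

def nuclear_norm(E):
    """Nuclear norm (sum of singular values) of an error image E."""
    return np.linalg.svd(E, compute_uv=False).sum()

# a contiguous occlusion block is low-rank; scattered pixel noise is not
block = np.zeros((20, 20))
block[:4, :5] = 1.0    # rank-1 structural error, nuclear norm sqrt(20) ~ 4.47
scattered = np.eye(20)  # 20 isolated error pixels, nuclear norm 20
assert np.isclose(np.abs(block).sum(), np.abs(scattered).sum())  # equal L1 norm
print(round(nuclear_norm(block), 2), round(nuclear_norm(scattered), 2))  # 4.47 20.0
```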

Lei Luo, Jian Yang, Jianjun Qian, Ying Tai
Segmentation of X-ray Images by 3D-2D Registration Based on Multibody Physics

X-ray imaging is commonly used in clinical routine. In radiotherapy, spatial information is extracted from X-ray images to correctly position patients before treatment. Similarly, orthopedic surgeons assess the positioning and migration of implants after Total Hip Replacement (THR) with X-ray images. However, the projective nature of X-ray imaging hinders the reliable extraction of rigid structures, such as bones or metallic components, from X-ray images. We developed an approach based on multibody physics that simultaneously registers multiple 3D shapes with one or more 2D X-ray images. Considered as physical bodies, shapes are driven by image forces, which exploit the image gradient, and by constraints, which enforce spatial dependencies between shapes. Our method was tested on post-operative radiographs of THR and thoroughly validated with gold standard datasets. The final target registration error was on average $$0.3\pm 0.16$$ mm, and the capture range improved by more than 40 % relative to reference registration methods.

Jérôme Schmid, Christophe Chênes
View-Adaptive Metric Learning for Multi-view Person Re-identification

Person re-identification is a challenging problem due to drastic variations in viewpoint, illumination, and pose. Most previous works on metric learning learn a single global distance metric to handle these variations. In contrast, we propose a view-adaptive metric learning (VAML) method, which adaptively adopts different metrics for different image pairs under varying views. Specifically, given a pair of images (or extracted features), VAML first estimates their view vectors (consisting of the probabilities of belonging to each view), and then adaptively generates a specific metric for the two images. To achieve this, we encode the automatically estimated view vector into an augmented representation of the input feature, with which the distance can be learned analytically and computed simply. Furthermore, we contribute a new large-scale multi-view pedestrian dataset containing 1000 subjects and 8 view angles. Extensive experiments show that the proposed method achieves state-of-the-art performance on the public VIPeR dataset and on the new dataset.
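One simple way such an augmented representation can work is sketched below: taking the Kronecker product of the view vector with the feature stacks view-weighted copies of the feature, so a single global linear metric on the augmented space behaves like a mixture of per-view metrics. This is an illustrative construction; the paper's exact encoding may differ:

```python
import numpy as np

def augment(x, v):
    """Encode a view vector v (probabilities over V views) into feature x.
    np.kron(v, x) stacks V view-weighted copies of x, so one linear metric
    on the augmented feature acts as a view-weighted mix of per-view metrics."""
    return np.kron(v, x)

x = np.array([1.0, 2.0])
v_front = np.array([1.0, 0.0])  # image confidently seen from the "front" view
v_side = np.array([0.0, 1.0])   # image confidently seen from the "side" view
# the two augmented vectors occupy disjoint coordinate blocks, so a single
# learned metric is free to treat the two views differently
print(augment(x, v_front).tolist())  # [1.0, 2.0, 0.0, 0.0]
print(augment(x, v_side).tolist())   # [0.0, 0.0, 1.0, 2.0]
```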

Canxiang Yan, Shiguang Shan, Dan Wang, Hao Li, Xilin Chen
Backmatter
Metadata
Title
Computer Vision -- ACCV 2014
Edited by
Daniel Cremers
Ian Reid
Hideo Saito
Ming-Hsuan Yang
Copyright Year
2015
Electronic ISBN
978-3-319-16808-1
Print ISBN
978-3-319-16807-4
DOI
https://doi.org/10.1007/978-3-319-16808-1