
2017 | Book

Computer Vision – ACCV 2016

13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part IV


About this book

The five-volume set LNCS 10111-10115 constitutes the thoroughly refereed post-conference proceedings of the 13th Asian Conference on Computer Vision, ACCV 2016, held in Taipei, Taiwan, in November 2016.

The total of 143 contributions presented in these volumes was carefully reviewed and selected from 479 submissions. The papers are organized in topical sections on Segmentation and Classification; Segmentation and Semantic Segmentation; Dictionary Learning, Retrieval, and Clustering; Deep Learning; People Tracking and Action Recognition; People and Actions; Faces; Computational Photography; Face and Gestures; Image Alignment; Computational Photography and Image Processing; Language and Video; 3D Computer Vision; Image Attributes, Language, and Recognition; Video Understanding; and 3D Vision.

Table of contents

Frontmatter

Computational Photography and Image Processing

Frontmatter
Blind Image Quality Assessment Based on Natural Redundancy Statistics

Blind image quality assessment (BIQA) aims to evaluate the perceptual quality of a distorted image without information about its reference image or the distortion type. Existing BIQA methods usually predict image quality by employing natural scene statistics (NSS), which are derived from the statistical distributions of image coefficients after reducing the redundancies in a transformed domain. Contrary to these methods, we directly measure the redundancy present in a natural image and compute natural redundancy statistics (NRS) to capture the degree of distortion. Specifically, we utilize singular value decomposition (SVD) and asymmetric generalized Gaussian distribution (AGGD) modeling to obtain NRS from opponent color spaces, and learn a regression model to map the NRS features to the subjective quality score. Extensive experiments demonstrate the very competitive quality prediction performance and generalization ability of the proposed method.
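For intuition only, here is a minimal sketch of one way image redundancy could be summarized from block-wise singular values; the paper's actual NRS features (opponent color spaces, AGGD fits) are richer, and the function and parameter names below are hypothetical.

```python
import numpy as np

def svd_redundancy_features(gray, block=32, k=4):
    """Fraction of energy captured by the top-k singular values per block.

    Natural images are highly redundant, so a few singular values dominate;
    distortions tend to flatten this spectrum.
    """
    h, w = gray.shape
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            s = np.linalg.svd(gray[y:y + block, x:x + block], compute_uv=False)
            feats.append(s[:k].sum() / (s.sum() + 1e-12))
    feats = np.asarray(feats)
    # Summary statistics that a regressor (e.g. SVR) could map to a quality score.
    return np.array([feats.mean(), feats.std(), feats.min(), feats.max()])

# Example: print(svd_redundancy_features(np.random.rand(256, 256)))
```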

Jia Yan, Weixia Zhang, Tianpeng Feng
Sparse Coding on Cascaded Residuals

This paper seeks to combine dictionary learning and hierarchical image representation in a principled way. To let dictionary atoms capture additional information from extended receptive fields and attain improved descriptive capacity, we present a two-pass multi-resolution cascade framework for dictionary learning and sparse coding. The cascade allows collaborative reconstructions at different resolutions using dictionary atoms of the same dimension. Our jointly learned dictionary comprises atoms that adapt to the information available at the coarsest layer, where the support of the atoms reaches its maximum range, and to the residual images, where supplementary details progressively refine the reconstruction objective. The residual at a layer is computed as the difference between the aggregated reconstructions of the previous layers and the downsampled original image at that layer. Our method generates more flexible and accurate representations using far fewer coefficients. Its computational efficiency stems from encoding at the coarsest resolution, which is minuscule, and encoding the residuals, which are comparatively much sparser. Our extensive experiments on multiple datasets demonstrate that this new method is powerful in image coding, denoising, inpainting and artifact removal tasks, outperforming state-of-the-art techniques.
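As a rough illustration of the residual cascade described above, the sketch below uses a generic `reconstruct` callback standing in for sparse coding over the learned dictionary, with simple decimation and nearest-neighbour resampling as assumptions (and even image dimensions).

```python
import numpy as np

def downsample2(img):
    return img[::2, ::2]                      # simple decimation for illustration

def upsample2(img):
    return np.kron(img, np.ones((2, 2)))      # nearest-neighbour upsampling

def cascade_residuals(img, reconstruct):
    """Return the coarse reconstruction and the residual the next layer encodes."""
    coarse = downsample2(img)                 # coarsest-layer input
    coarse_rec = reconstruct(coarse)          # sparse reconstruction at coarse scale
    # Residual at the finer layer = image at that scale minus the aggregated
    # reconstruction of the previous (coarser) layers.
    residual = img - upsample2(coarse_rec)
    return coarse_rec, residual

# Example with an identity "reconstruction":
# rec, res = cascade_residuals(np.random.rand(64, 64), lambda x: x)
```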

Tong Zhang, Fatih Porikli
End-to-End Learning for Image Burst Deblurring

We present a neural network approach for multi-frame blind deconvolution. The discriminative approach adopts and combines two recent techniques for image deblurring into a single neural network architecture. Our proposed hybrid architecture combines the explicit prediction of a deconvolution filter with non-trivial averaging of Fourier coefficients in the frequency domain. In order to make full use of the information contained in all images of one burst, the proposed network embeds smaller networks which explicitly allow the model to transfer information between images in early layers. Our system is trained end-to-end using standard backpropagation on a set of artificially generated training examples, enabling competitive performance in multi-frame blind deconvolution with respect to both quality and runtime.

Patrick Wieschollek, Bernhard Schölkopf, Hendrik P. A. Lensch, Michael Hirsch
Spectral Reflectance Recovery with Interreflection Using a Hyperspectral Image

The capture of scene spectral reflectance (SR) provides a wealth of information about the material properties of objects, and has proven useful for applications including classification, synthetic relighting, medical imaging, and more. Thus, many methods for SR capture have been proposed. While effective, past methods do not consider the effects of indirectly bounced light from within the scene, and the SR estimated by traditional techniques is largely affected by interreflection. For example, different lighting directions can cause different SR estimates. On the other hand, past work has shown that accurate interreflection separation in hyperspectral images is possible, but the SR of all surface points needs to be known a priori. Thus the estimation of SR and interreflection, in its current form, constitutes a chicken-and-egg dilemma. In this work, we propose the challenging and novel problem of simultaneously performing SR recovery and interreflection removal from a single hyperspectral image, and develop the first strategy to address it. Specifically, we model this problem using a compact sparsity-regularized nonnegative matrix factorization (NMF) formulation, and introduce a scalable optimization algorithm based on the alternating direction method of multipliers (ADMM). Our experiments demonstrate its effectiveness on scenes with one or two reflectance colors, possibly containing concave surfaces that lead to interreflection.
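The snippet below is a simplified stand-in that conveys the factorization idea only: sparsity-regularized NMF solved with standard multiplicative updates rather than the ADMM scheme developed in the paper, with hypothetical variable names.

```python
import numpy as np

def sparse_nmf(V, r, lam=0.1, iters=200, eps=1e-9):
    """Approximate V (bands x pixels) as W (basis spectra) @ H (sparse abundances)."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(iters):
        # L1 penalty lam on H promotes sparse abundances.
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Example: W, H = sparse_nmf(np.abs(np.random.rand(31, 500)), r=3)
```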

Hiroki Okawa, Yinqiang Zheng, Antony Lam, Imari Sato
Learning Contextual Dependencies for Optical Flow with Recurrent Neural Networks

Pixel-level prediction tasks, such as optical flow estimation, play an important role in computer vision. Recent approaches have attempted to use the feature learning capability of Convolutional Neural Networks (CNNs) to tackle dense per-pixel predictions. However, CNNs have not been as successful in optical flow estimation as they are in many other vision tasks, such as image classification and object detection. It is challenging to adapt CNNs designed for high-level vision tasks to handle pixel-level predictions. First, CNNs do not have a mechanism to explicitly model contextual dependencies among image units. Second, the convolutional filters and pooling operations result in reduced feature maps and hence produce coarse outputs when upsampled to the original resolution. These two aspects leave CNNs with limited ability to delineate object details, which often results in inconsistent predictions. In this paper, we propose a recurrent neural network to alleviate this issue. Specifically, a row convolutional long short-term memory (RC-LSTM) network is introduced to model contextual dependencies of local image features. This recurrent network can be integrated with CNNs, giving rise to an end-to-end trainable network. The experimental results demonstrate that our model can learn context-aware features for optical flow estimation and achieve accuracy competitive with state-of-the-art algorithms at a frame rate of 5 to 10 fps.

Minlong Lu, Zhiwei Deng, Ze-Nian Li

Language and Video

Frontmatter
Auto-Illustrating Poems and Songs with Style

We develop an optimization-based framework to automatically illustrate poems and songs. Our method is able to produce semantically relevant and visually coherent illustrations, all while matching a particular user-selected visual style. We demonstrate our method on a selection of 200 popular poems and songs collected from the internet, operating on around 14M Flickr images. A user study evaluates variations on our optimization procedure. Finally, we present two applications: identifying textual style and automatic music video generation.

Katharina Schwarz, Tamara L. Berg, Hendrik P. A. Lensch
Spatio-Temporal Attention Models for Grounded Video Captioning

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to a high-level textual description, bypassing localization and recognition, thus discarding potentially valuable information for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results on the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects), with no grounding supervision, over space and time.

Mihai Zanfir, Elisabeta Marinoiu, Cristian Sminchisescu
Variational Convolutional Networks for Human-Centric Annotations

Modeling how a human would annotate an image is an important and interesting task relevant to image captioning. Its main challenge is that the same visual concept may be important in some images but less salient in other situations. Further, the subjective viewpoints of a human annotator also play a crucial role in finalizing the annotations. To deal with such high variability, we introduce a new deep net model that integrates a CNN with a variational auto-encoder (VAE). With the latent features embedded in a VAE, the model becomes more flexible in tackling the uncertainty of human-centric annotations. On the other hand, the supervised generalization further enables the discriminative power of the generative VAE model. The resulting model can be end-to-end fine-tuned to further improve the performance on predicting visual concepts. The provided experimental results show that our method is state-of-the-art on two benchmark datasets, MS COCO and Flickr30K, producing mAP of 36.6 and 23.49, and PHR (Precision at Human Recall) of 49.9 and 32.04, respectively.

Tsung-Wei Ke, Che-Wei Lin, Tyng-Luh Liu, Davi Geiger
Anticipating Accidents in Dashcam Videos

We propose a Dynamic-Spatial-Attention (DSA) Recurrent Neural Network (RNN) for anticipating accidents in dashcam videos (Fig. 1). Our DSA-RNN learns to (1) dynamically distribute soft attention to candidate objects to gather subtle cues and (2) model the temporal dependencies of all cues to robustly anticipate an accident. Anticipating accidents is much less addressed than anticipating events such as changing a lane or making a turn, since accidents are rarely observed and can happen in many different ways, mostly suddenly. To overcome these challenges, we (1) utilize a state-of-the-art object detector [3] to detect candidate objects, and (2) incorporate full-frame and object-based appearance and motion features in our model. We also harvest a diverse dataset of 678 dashcam accident videos from the web (Fig. 3). The dataset is unique, since various accidents (e.g., a motorbike hits a car, a car hits another car, etc.) occur in all videos. We manually mark the time-location of accidents and use them as supervision to train and evaluate our method. We show that our method anticipates accidents about 2 s before they occur with 80% recall and 56.14% precision. Most importantly, it achieves the highest mean average precision (74.35%), outperforming other baselines without attention or RNN.

Fu-Hsiang Chan, Yu-Ting Chen, Yu Xiang, Min Sun
Pano2Vid: Automatic Cinematography for Watching 360 Videos

We introduce the novel task of Pano2Vid: automatic cinematography in panoramic 360° videos. Given a 360° video, the goal is to direct an imaginary camera to virtually capture natural-looking normal field-of-view (NFOV) video. By selecting “where to look” within the panorama at each time step, Pano2Vid aims to free both the videographer and the end viewer from the task of determining what to watch. Towards this goal, we first compile a dataset of 360° videos downloaded from the web, together with human-edited NFOV camera trajectories to facilitate evaluation. Next, we propose AutoCam, a data-driven approach to solve the Pano2Vid task. AutoCam leverages NFOV web video to discriminatively identify space-time “glimpses” of interest at each time instant, and then uses dynamic programming to select optimal human-like camera trajectories. Through experimental evaluation on multiple newly defined Pano2Vid performance measures against several baselines, we show that our method successfully produces informative videos that could conceivably have been captured by human videographers.
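The trajectory selection step can be pictured with a small Viterbi-style dynamic program; the sketch below is a toy version with assumed per-glimpse interest scores and a viewpoint-change penalty, not AutoCam's actual scoring.

```python
import numpy as np

def select_trajectory(scores, jump_cost):
    """Pick one glimpse per time step maximizing total interest minus jump costs.

    scores: (T, K) interest score of each candidate glimpse per time step.
    jump_cost: (K, K) penalty for switching from glimpse i to glimpse j.
    """
    T, K = scores.shape
    best = scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        trans = best[:, None] - jump_cost       # (K, K): previous -> current
        back[t] = trans.argmax(axis=0)
        best = trans.max(axis=0) + scores[t]
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):               # backtrack the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Example:
# penalty = 0.2 * np.abs(np.subtract.outer(np.arange(8), np.arange(8)))
# traj = select_trajectory(np.random.rand(50, 8), penalty)
```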

Yu-Chuan Su, Dinesh Jayaraman, Kristen Grauman
PicMarker: Data-Driven Image Categorization Based on Iterative Clustering

Facing the explosive growth of personal photos, an effective classification tool is becoming an urgent need for users to categorize images efficiently according to personal preferences. As previous research mainly focuses on the accuracy of automatic classification within a pre-defined label space, it cannot be used directly for personalized categorization. In this paper, we propose a data-driven classification method for personalized image classification tasks which categorizes images group by group. First, we describe images from both an appearance view and a semantic view. Then, an iterative framework which incorporates spectral clustering with user intervention is utilized to categorize images group by group. To improve the quality of clustering, we propose an online multi-view metric learning algorithm to learn the similarity metrics in accordance with the user’s criterion, and constraint propagation is integrated to adjust the similarity matrix. In addition, we build a system named PicMarker based on the proposed method. Experimental results demonstrate the effectiveness of the proposed method.

Jiagao Hu, Zhengxing Sun, Bo Li, Shuang Wang

Image Alignment

Frontmatter
Adaptive Direct RGB-D Registration and Mapping for Large Motions

Dense direct RGB-D registration methods are widely used in tasks ranging from localization and tracking to 3D scene reconstruction. This work addresses a peculiar aspect that drastically limits the applicability of direct registration, namely the small size of the convergence domain. First, we propose an activation function based on the conditioning of the RGB and ICP point-to-plane error terms. This function strengthens the influence of the geometric error in the first coarse iterations, while the intensity data term dominates in the finer increments. The information gathered from the geometric and photometric cost functions is considered not only for improving the system observability, but also for exploiting the different convergence properties and convexity of each data term. Next, we develop a set of strategies, such as flexible regularization and pixel saliency selection, to further improve the quality and robustness of this approach. The methodology is formulated for a generic warping model, and results are given using perspective and spherical sensor models. Finally, our method is validated on different RGB-D spherical datasets, including both indoor and outdoor real sequences, and on the KITTI VO/SLAM benchmark dataset. We show that the different proposed techniques (weighted activation function, regularization, saliency pixel selection) lead to faster convergence and larger convergence domains, which are the main limitations to the use of direct methods.

Renato Martins, Eduardo Fernandez-Moral, Patrick Rives
Deep Discrete Flow

Motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network’s output forms the data term for discrete MAP inference in a pairwise Markov random field. We provide an extensive empirical investigation of network architectures and model parameters. At the time of submission, our method ranks second on the challenging MPI Sintel test set.
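The exhaustive matching step lends itself to a one-line sketch: with per-pixel descriptors stacked as matrices, all pairwise matching costs come from a single matrix multiplication (the feature shapes below are assumptions).

```python
import numpy as np

def matching_cost_volume(feat_ref, feat_tgt):
    """feat_*: (H*W, D) L2-normalized per-pixel descriptors.

    Cosine similarity for all pixel pairs in one GEMM; negate so that lower
    values mean better matches, giving a (H*W, H*W) cost volume.
    """
    return -(feat_ref @ feat_tgt.T)

# Example:
# f1 = np.random.randn(1024, 64); f1 /= np.linalg.norm(f1, axis=1, keepdims=True)
# f2 = np.random.randn(1024, 64); f2 /= np.linalg.norm(f2, axis=1, keepdims=True)
# costs = matching_cost_volume(f1, f2)
```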

Fatma Güney, Andreas Geiger
Dense Motion Estimation for Smoke

Motion estimation for highly dynamic phenomena such as smoke is an open challenge for computer vision. Traditional dense motion estimation algorithms have difficulties with non-rigid and large motions, both of which are frequently observed in smoke motion. We propose an algorithm for dense motion estimation of smoke. Our algorithm is robust, fast, and performs better across different types of smoke than other dense motion estimation algorithms, including state-of-the-art and neural network approaches. The key to our contribution is the use of skeletal flow, without explicit point matching, to provide a sparse flow. This sparse flow is then upgraded to a dense flow. In this paper we describe our algorithm in detail and provide experimental evidence to support our claims.

Da Chen, Wenbin Li, Peter Hall
Data Association Based Multi-target Tracking Using a Joint Formulation

We revisit the classical conditional random field (CRF) based tracking-by-detection framework for multi-target tracking, in which factor functions associating pairs of short tracklets over a long term are modeled to produce the final tracks. Unlike most previous approaches, which only focus on modeling feature differences for distinguishing pairs of targets, we propose to directly model the joint formulation of pairs of tracklets for association in the CRF framework. To this end, we use a Hough Forest (HF) based learning framework to effectively learn a discriminative codebook of features among tracklets by utilizing appearance and motion cues stored in the leaf nodes. Given the learned codebook, the joint formulation of tracklet pairs can be directly modeled in a nonparametric manner by defining a sharing and excluding matrix. Then all of the statistics required for CRF inference can be directly estimated. Extensive experiments have been conducted on several public datasets, and the performance is comparable to the state of the art.

Jun Xiang, Jianhua Hou, Changxin Gao, Nong Sang
Combining Texture and Shape Cues for Object Recognition with Minimal Supervision

We present a novel approach to object classification and detection which requires minimal supervision and which combines visual texture cues and shape information learned from freely available unlabeled web search results. The explosion of visual data on the web can potentially make visual examples of almost any object easily accessible via web search. Previous unsupervised methods have utilized either large scale sources of texture cues from the web, or shape information from data such as crowdsourced CAD models. We propose a two-stream deep learning framework that combines these cues, with one stream learning visual texture cues from image search data, and the other stream learning rich shape information from 3D CAD models. To perform classification or detection for a novel image, the predictions of the two streams are combined using a late fusion scheme. We present experiments and visualizations for both tasks on the standard benchmark PASCAL VOC 2007 to demonstrate that texture and shape provide complementary information in our model. Our method outperforms previous web image based models, 3D CAD model based approaches, and weakly supervised models.
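A minimal sketch of the late-fusion step, assuming each stream outputs per-class scores and using a hypothetical mixing weight alpha:

```python
import numpy as np

def late_fusion(texture_scores, shape_scores, alpha=0.5):
    """Combine texture-stream and shape-stream class scores for one image.

    Both inputs: (num_classes,) softmax scores; alpha is an assumed weight.
    """
    fused = alpha * texture_scores + (1.0 - alpha) * shape_scores
    return int(np.argmax(fused)), fused

# Example:
# label, probs = late_fusion(np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.6, 0.1]))
```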

Xingchao Peng, Kate Saenko
Video Temporal Alignment for Object Viewpoint

We address the problem of temporally aligning semantically similar videos, for example two videos of cars on different tracks. We present an alignment method that establishes frame-to-frame correspondences such that the two cars are seen from a similar viewpoint (e.g. facing right), while also being temporally smooth and visually pleasing. Unlike previous works, we do not assume that the videos show the same scripted sequence of events. We compare against three alternative methods, including the popular DTW algorithm, on a new dataset of realistic videos collected from the internet. We perform a comprehensive evaluation using a novel protocol that includes both quantitative measures and a user study on visual pleasingness.

Anestis Papazoglou, Luca Del Pero, Vittorio Ferrari

3D Computer Vision

Frontmatter
Recovering Pose and 3D Deformable Shape from Multi-instance Image Ensembles

In recent years, there has been growing interest in tackling the Non-Rigid Structure from Motion (NRSfM) problem, where the shape of a deformable object and the pose of a moving camera are simultaneously estimated from a monocular video sequence. Existing solutions are limited to single objects and continuous, smoothly changing sequences. In this paper we extend NRSfM to a multi-instance domain, in which the images do not need to have temporal consistency, allowing, for instance, the joint reconstruction of the faces of multiple persons from an unordered list of images. For this purpose, we present a new formulation of the problem based on a dual low-rank shape representation that simultaneously captures the between- and within-individual deformations. The parameters of this model are learned using a variant of probabilistic linear discriminant analysis that requires consecutive batches of expectation and maximization steps. The resulting approach estimates the 3D deformable shape and pose of multiple instances from only 2D point observations on a collection of images, without requiring pre-trained 3D data, and is shown to be robust to noisy measurements and missing points. We provide quantitative and qualitative evaluation on both synthetic and real data, and show consistent benefits compared to the current state of the art.

Antonio Agudo, Francesc Moreno-Noguer
Robust Multi-Model Fitting Using Density and Preference Analysis

Robust multi-model fitting problems are often solved using consensus based or preference based methods, each of which captures largely independent information from the data. However, most existing techniques still adhere to either of these approaches. In this paper, we bring these two paradigms together and present a novel robust method for discovering multiple structures from noisy, outlier corrupted data. Our method adopts a random sampling based hypothesis generation and works on the premise that inliers are densely packed around the structure, while the outliers are sparsely spread out. We leverage consensus maximization by defining the residual density, which is a simple and efficient measure of density in the 1-D residual space. We locate the inlier-outlier boundary by using preference based point correlations together with the disparity in residual density of inliers and outliers. Finally, we employ a simple strategy that uses preference based hypothesis correlation and residual density to identify one hypothesis representing each structure and their corresponding inliers. The strength of the proposed approach is evaluated empirically by comparing with state-of-the-art techniques over synthetic data and the AdelaideRMF dataset.
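To illustrate the residual-density idea, the toy sketch below measures how many points accumulate per unit width in the sorted 1-D residual space of one hypothesis and picks a crude inlier/outlier boundary; the paper's exact definition and boundary criterion may differ.

```python
import numpy as np

def residual_density(residuals):
    """Points accumulated per unit residual width, for sorted absolute residuals."""
    r = np.sort(np.abs(residuals))
    ranks = np.arange(1, len(r) + 1)
    return ranks / (r + 1e-12)

def estimate_inliers(residuals, frac=0.5):
    """Crude inlier count: where the density drops below a fraction of its peak."""
    d = residual_density(residuals)
    peak = int(d.argmax())
    drop = np.nonzero(d[peak:] < frac * d[peak])[0]
    if drop.size:
        return peak + int(drop[0])
    return len(d)

# Example: inliers packed near zero, outliers spread wide.
# res = np.concatenate([0.01 * np.random.randn(80), 5 * np.random.rand(20)])
# print(estimate_inliers(res))
```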

Lokender Tiwari, Saket Anand, Sushil Mittal
Photometric Bundle Adjustment for Vision-Based SLAM

We propose a novel algorithm for the joint refinement of structure and motion parameters directly from image data, without relying on fixed and known correspondences. In contrast to traditional bundle adjustment (BA), where the optimal parameters are determined by minimizing the reprojection error using tracked features, the proposed algorithm relies on maximizing photometric consistency and estimates the correspondences implicitly. Since the proposed algorithm does not require correspondences, its application is not limited to corner-like structure; any pixel with a nonvanishing gradient can be used in the estimation process. Furthermore, we demonstrate the feasibility of refining the motion and structure parameters simultaneously using the photometric error in unconstrained scenes, without requiring restrictive assumptions such as planarity. The proposed algorithm is evaluated on a range of challenging outdoor datasets, and it is shown to improve upon the accuracy of state-of-the-art VSLAM methods obtained by minimizing the reprojection error with traditional BA as well as loop closure.
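A bare-bones sketch of the photometric objective for one candidate pose is given below, assuming a pinhole camera, known intrinsics and per-pixel depths, and ignoring occlusion and robust weighting; it illustrates the error term, not the paper's solver.

```python
import numpy as np

def bilinear(img, x, y):
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])

def photometric_error(I_ref, I_tgt, K, R, t, px, depth):
    """px: (N, 2) integer pixel coords in the reference image; depth: (N,) depths."""
    Kinv = np.linalg.inv(K)
    rays = (Kinv @ np.hstack([px, np.ones((len(px), 1))]).T).T   # normalized rays
    X = rays * depth[:, None]                                    # 3D points, ref frame
    Xt = (R @ X.T).T + t                                         # points, target frame
    proj = (K @ Xt.T).T
    u, v = proj[:, 0] / proj[:, 2], proj[:, 1] / proj[:, 2]
    ok = (u >= 0) & (u < I_tgt.shape[1] - 1) & (v >= 0) & (v < I_tgt.shape[0] - 1)
    r = bilinear(I_tgt, u[ok], v[ok]) - I_ref[px[ok, 1], px[ok, 0]]
    return 0.5 * np.sum(r ** 2)       # minimized jointly over (R, t) and depths
```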

Hatem Alismail, Brett Browning, Simon Lucey
Can Computer Vision Techniques be Applied to Automated Forensic Examinations? A Study on Sex Identification from Human Skulls Using Head CT Scans

Sex determination from human skeletal remains is a challenging problem in forensic anthropology. The human skull has been regarded as the second best predictor of sex because it contains several sexually dimorphic traits. Previous studies have shown that morphological assessment and morphometric analysis can be used to assess sex variation from dried skulls. With the availability of CT scanners, the field has seen increasing use of computer-aided techniques to assist these traditional forensic examinations. However, they largely remain at the level of providing a digital interface for landmarking for morphometric analysis. Recent research has applied shape analysis techniques for morphological analysis of a specific part of the skull. In this paper, we explore the application of computer vision techniques that have prominently been used in the field of 3D object recognition and retrieval, to provide an alternative method for identifying sex from human skulls automatically. We suggest a possible framework for the whole process, including multi-region representation of the skull with 3D shape descriptors, and particularly examine the role of 3D descriptors in sex identification accuracy. The experimental results on 100 post mortem head CT scans indicate the potential of 3D descriptors for skull sex classification. To the best of our knowledge, this is the first work to approach skull sex prediction from this novel perspective.

Olasimbo Ayodeji Arigbabu, Iman Yi Liao, Nurliza Abdullah, Mohamad Helmee Mohamad Noor
Deep Depth Super-Resolution: Learning Depth Super-Resolution Using Deep Convolutional Neural Network

Depth image super-resolution is an extremely challenging task due to the information loss in sub-sampling. Deep convolutional neural networks have been widely applied to color image super-resolution. Quite surprisingly, this success has not been matched in depth super-resolution, mainly due to the inherent difference between color and depth images. In this paper, we bridge the gap and extend the success of deep convolutional neural networks to depth super-resolution. The proposed deep depth super-resolution method learns the mapping from a low-resolution depth image to a high-resolution one in an end-to-end style. Furthermore, to better regularize the learned depth map, we propose to exploit depth field statistics and the local correlation between the depth image and the color image. These priors are integrated in an energy minimization formulation, where the deep neural network learns the unary term, the depth field statistics act as a global model constraint, and the color-depth correlation is utilized to enforce local structure in the depth image. Extensive experiments on various depth super-resolution benchmark datasets show that our method outperforms state-of-the-art depth image super-resolution methods by a clear margin.

Xibin Song, Yuchao Dai, Xueying Qin
3D Watertight Mesh Generation with Uncertainties from Ubiquitous Data

In this paper, we propose a generic framework for watertight mesh generation with uncertainties that provides a confidence measure on each reconstructed mesh triangle. Its input is a set of vision-based or Lidar-based 3D measurements which are converted to a set of mass functions that characterize, based on Dempster-Shafer theory, the level of confidence in the occupancy of the scene as occupied, empty or unknown. The output is a multi-label segmentation of the ambient 3D space expressing the confidence for each resulting volume element to be occupied or empty. While existing methods either sacrifice watertightness (local methods) or need to introduce a smoothness prior (global methods), we derive a per-triangle confidence measure that gradually characterizes when the resulting surface patches are certain, due to dense and coherent measurements, and when these patches are more uncertain and are mainly present to ensure smoothness and/or watertightness. The surface mesh reconstruction is formulated as a global energy minimization problem efficiently optimized with the α-expansion algorithm. We claim that the resulting confidence measure is a good estimate of the local lack of sufficiently dense and coherent input measurements, which would be a valuable input for the next-best-view scheduling of a complementary acquisition. Besides the new formulation, the proposed approach achieves state-of-the-art results on a surface reconstruction benchmark. It is robust to noise, manages high scale disparity and produces a watertight surface with a small Hausdorff distance in uncertain areas thanks to the multi-label formulation. By simply thresholding the result, the method shows good reconstruction quality compared to local algorithms on high density data. This is demonstrated on a large-scale reconstruction combining real-world datasets from airborne and terrestrial Lidar and on an indoor scene reconstructed from images.
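As a small illustration of the Dempster-Shafer fusion underlying the confidence measure, the sketch below combines two occupancy mass functions over {occupied, empty, unknown} with Dempster's rule; the field names and example masses are assumptions, and the paper's construction of masses from Lidar/vision measurements is more involved.

```python
def combine_masses(m1, m2):
    """Dempster's rule over the frame {occupied, empty}; 'unk' is the full set.

    m1, m2: dicts with keys 'occ', 'emp', 'unk' summing to 1.
    """
    conflict = m1['occ'] * m2['emp'] + m1['emp'] * m2['occ']
    norm = 1.0 - conflict          # renormalize after discarding conflicting mass
    return {
        'occ': (m1['occ'] * m2['occ'] + m1['occ'] * m2['unk'] + m1['unk'] * m2['occ']) / norm,
        'emp': (m1['emp'] * m2['emp'] + m1['emp'] * m2['unk'] + m1['unk'] * m2['emp']) / norm,
        'unk': (m1['unk'] * m2['unk']) / norm,
    }

# Example: a confident Lidar return fused with a weak photo-consistency cue.
# print(combine_masses({'occ': 0.8, 'emp': 0.1, 'unk': 0.1},
#                      {'occ': 0.3, 'emp': 0.2, 'unk': 0.5}))
```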

Laurent Caraffa, Mathieu Brédif, Bruno Vallet
Color Correction for Image-Based Modeling in the Large

Current texture creation methods for image-based modeling suffer from color discontinuity issues due to drastically varying conditions of illumination, exposure and time during the image capturing process. This paper proposes a novel system that generates consistent textures for triangular meshes. The key to our system is a color correction framework for large-scale unordered image collections. We model the problem as a graph-structured optimization over the overlapping regions of image pairs. After reconstructing the mesh of the scene, we accurately calculate matched image regions by re-projecting images onto the mesh. Then the image collection is robustly adjusted using a non-linear least squares solver over color histograms in an unsupervised fashion. Finally, a connectivity-preserving edge pruning method is introduced to accelerate the color correction process. The system is evaluated with crowdsourced image collections containing medium-sized scenes and city-scale urban datasets. To the best of our knowledge, this is the first consistent texturing system for image-based modeling that is capable of handling thousands of input images.
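The graph-structured adjustment can be illustrated with a heavily simplified version: one scalar log-gain per image, solved by linear least squares over overlap constraints with the first image anchored. The paper adjusts full color histograms non-linearly, so this is only a sketch of the optimization structure.

```python
import numpy as np

def solve_gains(num_images, pairs):
    """pairs: list of (i, j, mean_log_i, mean_log_j) over matched overlap regions."""
    rows, rhs = [], []
    for i, j, mi, mj in pairs:
        row = np.zeros(num_images)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        rhs.append(mj - mi)                  # want g_i + mi == g_j + mj
    anchor = np.zeros(num_images)
    anchor[0] = 1.0
    rows.append(anchor)
    rhs.append(0.0)                          # gauge fix: g_0 = 0
    g, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return np.exp(g)                         # multiplicative gain per image

# Example: gains = solve_gains(3, [(0, 1, 4.1, 4.3), (1, 2, 4.0, 3.9)])
```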

Tianwei Shen, Jinglu Wang, Tian Fang, Siyu Zhu, Long Quan
Bringing 3D Models Together: Mining Video Liaisons in Crowdsourced Reconstructions

The recent advances in large-scale scene modeling have enabled the automatic 3D reconstruction of landmark sites from crowdsourced photo collections. Here, we address the challenge of leveraging crowdsourced video collections to identify connecting visual observations that enable the alignment, and subsequent aggregation, of disjoint 3D models. We denote these connecting image sequences as video liaisons and develop a data-driven framework for their fully unsupervised extraction and exploitation. Towards this end, we represent video contents in terms of a histogram representation of the iconic imagery contained within existing 3D models attained from a photo collection. We then use this representation to efficiently identify and prioritize the analysis of individual videos within a large-scale video collection, in an effort to determine camera motion trajectories connecting different landmarks. Results on crowdsourced data illustrate the efficiency and effectiveness of our proposed approach.

Ke Wang, Enrique Dunn, Mikel Rodriguez, Jan-Michael Frahm
Planar Markerless Augmented Reality Using Online Orientation Estimation

This paper presents a fast and accurate online orientation estimation method that estimates the normal direction of an arbitrary planar target from small-baseline images using efficient bundle adjustment. The estimated normal direction is used for planar metric rectification, and the rectified target images are registered as recognition targets for markerless tracking on the fly. Conventional planar metric rectification methods estimate the normal direction very efficiently, and recently proposed depth map estimation methods estimate the normal direction accurately. However, they suffer from either poor estimation accuracy for small-baseline images or high computational cost, which degrades usability for untrained end-users in terms of the shooting procedure or the waiting time for new target registration. In contrast, the proposed method achieves both high estimation accuracy and fast processing speed in order to improve usability, by reducing the number of degrees of freedom in the bundle adjustment of conventional depth map estimation methods. We compare the proposed method with these conventional methods using both artificially generated keypoints and real camera sequences with small-baseline motion. The experimental results show the high estimation accuracy of the proposed method relative to the conventional planar metric rectification methods, and significantly greater speed compared to the conventional depth map estimation methods without sacrificing estimation accuracy.

Tatsuya Kobayashi, Haruhisa Kato, Masaru Sugano
Simultaneous Independent Image Display Technique on Multiple 3D Objects

We propose a new system to visualize depth-dependent patterns and images on solid objects with complex geometry using multiple projectors. The system, despite consisting of conventional passive LCD projectors, is able to project different images and patterns depending on the spatial location of the object. The technique is based on the simple principle that patterns projected from multiple projectors interfere constructively with each other when they are projected on the same object. Previous techniques based on the same principle can only achieve (1) low-resolution volume colorization or (2) high-resolution images, but only on a limited number of flat planes. In this paper, we discretize a 3D object into a number of 3D points so that high-resolution images can be projected onto complex shapes. We also propose a dynamic range expansion technique as well as an efficient optimization procedure based on epipolar constraints. Such a technique can be used to extend projection mapping to have spatial dependency, which is desirable for practical applications. We also demonstrate the system's potential as a visual instructor for object placement and assembly. Experiments prove the effectiveness of our method.

Takuto Hirukawa, Marco Visentini-Scarzanella, Hiroshi Kawasaki, Ryo Furukawa, Shinsaku Hiura
ZigzagNet: Efficient Deep Learning for Real Object Recognition Based on 3D Models

Effective utilization of texture-less 3D models for deep learning is significant for recognition on real photos. We eliminate the reliance on massive real training data by modifying the convolutional neural network in three aspects: synthetic data rendering for generating training data in large quantities, a multi-triplet cost function for multi-task learning, and a compact micro-architecture design that produces a tiny parametric model while overcoming the over-fitting problem on texture-less models. The network is initialized with the multi-triplet cost function, establishing a sphere-like distribution of descriptors in each category, which is helpful for recognition on regular photos according to the pose, lighting condition, background and category information of the rendered images. Fine-tuning with additional data further meets the aim of classification on special real photos based on the initial model. We propose a 6.2 MB compact parametric model called ZigzagNet, based on SqueezeNet, which improves recognition performance by applying moving normalization inside the micro-architecture and adding a channel-wise convolutional bypass through the macro-architecture. Moving batch normalization is used to obtain good performance in both convergence speed and recognition accuracy. The accuracy of our compact parametric model in experiments on ImageNet and PASCAL samples provided by PASCAL3D+, using a simple nearest neighbor classifier, is close to the result of the 240 MB AlexNet trained with real images. A model trained on texture-less models, which consume less time for rendering and collection, outperforms the result of training with more textured models from ShapeNet.

Yida Wang, Can Cui, Xiuzhuang Zhou, Weihong Deng
Precise Measurement of Cargo Boxes for Gantry Robot Palletization in Large Scale Workspaces Using Low-Cost RGB-D Sensors

This paper presents a novel algorithm for extracting the pose and dimensions of cargo boxes in the large measurement space of a robotic gantry, with sub-centimetre accuracy, using multiple low-cost RGB-D Kinect sensors. This information is used by bin-packing and path-planning software to build up a pallet. The robotic gantry workspace can be up to 10 m in all dimensions, and the cameras cannot be placed top-down since the components of the gantry actuate within this space. This presents a challenge, as occlusion and sensor noise are more likely. This paper presents the system integration components: how point cloud information is extracted from multiple cameras and fused in real time, how primitives and contours are extracted and corrected using RGB image features, and how cargo parameters are extracted from the cluttered cloud and optimized using graph-based segmentation and particle filter based techniques. This is done with sub-centimetre accuracy irrespective of occlusion or camera noise at such camera placements and ranges to the cargo.

Yaadhav Raaj, Suraj Nair, Alois Knoll
Visual Place Recognition Using Landmark Distribution Descriptors

Recent work by Sünderhauf et al. [1] demonstrated improved visual place recognition using proposal regions coupled with features from convolutional neural networks (CNN) to match landmarks between views. In this work we extend the approach by introducing descriptors built from landmark features which also encode the spatial distribution of the landmarks within a view. Matching descriptors then enforces consistency of the relative positions of landmarks between views. This has a significant impact on performance. For example, in experiments on 10 image-pair datasets, each consisting of 200 urban locations with significant differences in viewing positions and conditions, we recorded average precision of around 70% (at 100% recall), compared with 58% obtained using whole image CNN features and 50% for the method in [1].
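One plausible way to encode both landmark appearance and layout is to pool landmark features over a coarse spatial grid and concatenate the cells, as in the hypothetical sketch below; the grid size, pooling and normalization are assumptions rather than the paper's exact descriptor.

```python
import numpy as np

def landmark_distribution_descriptor(boxes, feats, img_w, img_h, grid=4):
    """boxes: (N, 4) proposal boxes (x1, y1, x2, y2); feats: (N, D) CNN features."""
    D = feats.shape[1]
    desc = np.zeros((grid, grid, D))
    counts = np.zeros((grid, grid, 1))
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0 / img_w      # normalized box centres
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0 / img_h
    gx = np.clip((cx * grid).astype(int), 0, grid - 1)
    gy = np.clip((cy * grid).astype(int), 0, grid - 1)
    for i in range(len(boxes)):
        desc[gy[i], gx[i]] += feats[i]
        counts[gy[i], gx[i]] += 1
    desc = desc / np.maximum(counts, 1)                  # mean-pool per cell
    v = desc.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)               # compare views by cosine similarity
```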

Pilailuck Panphattarasap, Andrew Calway
Real Time Direct Visual Odometry for Flexible Multi-camera Rigs

We present a Direct Visual Odometry (VO) algorithm for multi-camera rigs that allows for flexible connections between cameras and runs in real time at high frame rate on a GPU for stereo setups. In contrast to feature-based VO methods, Direct VO aligns images directly to depth-enhanced previous images based on the photoconsistency of all high-contrast pixels. By using a multi-camera setup we can introduce an absolute scale into our reconstruction. Multiple views also allow us to obtain depth from multiple disparity sources: static disparity between the different cameras of the rig, and temporal disparity obtained by exploiting rig motion. We propose a joint optimization of the rig poses and the camera poses within the rig, which enables working with flexible rigs. We show that sub-pixel rigidity is difficult to manufacture for 720p or higher resolution cameras, which makes this feature important, particularly in current and future (semi-)autonomous cars or drones. Consequently, we evaluate our approach on our own real-world and synthetic datasets that exhibit flexibility in the rig, in addition to sequences from the established KITTI dataset.

Benjamin Resch, Jian Wei, Hendrik P. A. Lensch
Analysis and Practical Minimization of Registration Error in a Spherical Fish Tank Virtual Reality System

We describe the design, implementation and detailed visual error analysis of a 3D perspective-corrected spherical display that uses calibrated, multiple rear-projected pico-projectors. The display system is calibrated via 3D reconstruction using a single inexpensive camera, which enables both view-independent and view-dependent applications, also known as Fish Tank Virtual Reality (FTVR). We perform error analysis of the system in terms of display calibration error and head-tracking error using a mathematical model. We found that head-tracking error causes significantly more eye angular error than display calibration error; that angular error becomes more sensitive to tracking error when the viewer moves closer to the sphere; and that angular error is sensitive to the distance between the virtual object and its corresponding pixel on the surface. Taken together, these results provide practical guidelines for building a spherical FTVR display and can be applied to other configurations of geometric displays.

Qian Zhou, Gregor Miller, Kai Wu, Ian Stavness, Sidney Fels
Enhancing Direct Camera Tracking with Dense Feature Descriptors

Direct camera tracking is a popular tool for motion estimation. It promises more precise estimates, enhanced robustness, and efficient dense reconstruction. However, most direct tracking algorithms rely on the brightness constancy assumption, which is seldom satisfied in the real world. This means that direct tracking is unsuitable when dealing with sudden and arbitrary illumination changes. In this work, we propose a non-parametric approach to address illumination variations in direct tracking. Instead of modeling illumination, or relying on difficult-to-optimize robust similarity metrics, we propose to directly minimize the squared distance between densely evaluated local feature descriptors. Our approach is shown to perform well in terms of robustness and runtime. The algorithm is evaluated on two direct tracking problems, template tracking and direct visual odometry, using a variety of feature descriptors proposed in the literature.
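As a stand-in example of a densely evaluated local descriptor used with a squared-distance cost, the sketch below builds a 3x3 census-style descriptor per pixel; the paper evaluates several descriptors from the literature, so this particular choice is illustrative only.

```python
import numpy as np

def census_channels(img):
    """Return an (8, H-2, W-2) binary descriptor: centre pixel vs. its 8 neighbours."""
    c = img[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    ch = [(img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx] > c)
          for dy, dx in shifts]
    return np.stack(ch).astype(np.float32)

def descriptor_ssd(img_a, img_b):
    """Squared distance between dense descriptors instead of raw intensities."""
    da, db = census_channels(img_a), census_channels(img_b)
    return np.sum((da - db) ** 2)

# Brightness changes that preserve local intensity orderings leave this cost
# unchanged, unlike a raw intensity SSD.
```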

Hatem Alismail, Brett Browning, Simon Lucey
Backmatter
Metadata
Title
Computer Vision – ACCV 2016
Edited by
Shang-Hong Lai
Vincent Lepetit
Ko Nishino
Yoichi Sato
Copyright year
2017
Electronic ISBN
978-3-319-54190-7
Print ISBN
978-3-319-54189-1
DOI
https://doi.org/10.1007/978-3-319-54190-7