## About this book

The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.
The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action, activity and tracking; 3D; and 9 poster sessions.

## Table of contents

### Reflection Symmetry Detection via Appearance of Structure Descriptor

Symmetry in visual data represents repeated patterns or shapes that are easily found in natural and human-made objects. A symmetry pattern on an object works as a salient visual feature that attracts human attention and lets the object be easily recognized. Most existing symmetry detection methods are based on sparsely detected local features describing the appearance of their neighborhood, which have difficulty capturing object structure that is mostly supported by edges and contours. In this work, we propose a new reflection symmetry detection method that extracts robust 4-dimensional Appearance of Structure descriptors based on a set of outstanding neighbourhood edge segments at multiple scales. Our experimental evaluations on multiple public symmetry detection datasets show promising reflection symmetry detection results on challenging real-world and synthetic images.

### Faceless Person Recognition: Privacy Implications in Social Media

As we shift more of our lives into the virtual domain, the volume of data shared on the web keeps increasing and presents a threat to our privacy. This work contributes to the understanding of privacy implications of such data sharing by analysing how well people are recognisable in social media data. To facilitate a systematic study we define a number of scenarios considering factors such as how many heads of a person are tagged and whether those heads are obfuscated. We propose a robust person recognition system that can handle large variations in pose and clothing, and can be trained with few training samples. Our results indicate that a handful of images is enough to threaten users’ privacy, even in the presence of obfuscation. We show detailed experimental results, and discuss their implications.

Seong Joon Oh, Rodrigo Benenson, Mario Fritz, Bernt Schiele

### Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation

Joint segmentation and classification of fine-grained actions is important for applications of human-robot interaction, video surveillance, and human skill evaluation. However, despite substantial recent progress in large-scale action classification, the performance of state-of-the-art fine-grained action recognition approaches remains low. We propose a model for action segmentation which combines low-level spatiotemporal features with a high-level segmental classifier. Our spatiotemporal CNN is comprised of a spatial component that represents relationships between objects and a temporal component that uses large 1D convolutional filters to capture how object relationships change across time. These features are used in tandem with a semi-Markov model that captures transitions from one action to another. We introduce an efficient constrained segmental inference algorithm for this model that is orders of magnitude faster than the current approach. We highlight the effectiveness of our Segmental Spatiotemporal CNN on cooking and surgical action datasets for which we observe substantially improved performance relative to recent baseline methods.

Colin Lea, Austin Reiter, René Vidal, Gregory D. Hager

### Structure from Motion on a Sphere

We describe a special case of structure from motion where the camera rotates on a sphere. The camera’s optical axis lies perpendicular to the sphere’s surface. In this case, the camera’s pose is minimally represented by three rotation parameters. From analysis of the epipolar geometry we derive a novel and efficient solution for the essential matrix relating two images, requiring only three point correspondences in the minimal case. We apply this solver in a structure-from-motion pipeline that aggregates pairwise relations by rotation averaging followed by bundle adjustment with an inverse depth parameterization. Our methods enable scene modeling with an outward-facing camera and object scanning with an inward-facing camera.

Jonathan Ventura

### Evaluation of LBP and Deep Texture Descriptors with a New Robustness Benchmark

In recent years, a wide variety of different texture descriptors has been proposed, including many LBP variants. New types of descriptors based on multistage convolutional networks and deep learning have also emerged. In different papers the performance comparison of the proposed methods to earlier approaches is mainly done with some well-known texture datasets, with differing classifiers and testing protocols, and often not using the best sets of parameter values and multiple scales for the comparative methods. Very important aspects such as computational complexity and effects of poor image quality are often neglected. In this paper, we propose a new extensive benchmark (RoTeB) for measuring the robustness of texture operators against different classification challenges, including changes in rotation, scale, illumination, viewpoint, number of classes, different types of image degradation, and computational complexity. Fourteen datasets from the eight most commonly used texture sources are used in the benchmark. An extensive evaluation of the most promising recent LBP variants and some non-LBP descriptors based on deep convolutional networks is carried out. The best overall performance is obtained for the Median Robust Extended Local Binary Pattern (MRELBP) feature. For textures with very large appearance variations, Fisher vector pooling of deep Convolutional Neural Networks is clearly the best, but at the cost of very high computational complexity. The sensitivity to image degradations and computational complexity are among the key problems for most of the methods considered.

Li Liu, Paul Fieguth, Xiaogang Wang, Matti Pietikäinen, Dewen Hu

### MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition

In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base. More specifically, we propose a benchmark task to recognize one million celebrities from their face images, using all the face images of each individual that can be collected on the web as training data. The rich information provided by the knowledge base helps to conduct disambiguation and improve the recognition accuracy, and contributes to various real-world applications, such as image captioning and news video analysis. Associated with this task, we design and provide a concrete measurement set, an evaluation protocol, and training data. We also present our experiment setup in detail and report promising baseline results. Our benchmark task could lead to one of the largest classification problems in computer vision. To the best of our knowledge, our training dataset, which contains 10M images in version 1, is the largest publicly available one in the world.

Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, Jianfeng Gao

### Hierarchical Beta Process with Gaussian Process Prior for Hyperspectral Image Super Resolution

Hyperspectral cameras acquire precise spectral information; however, their resolution is very low due to hardware constraints. We propose an image fusion based hyperspectral super resolution approach that employs a Bayesian representation model. The proposed model accounts for spectral smoothness and spatial consistency of the representation by using Gaussian Processes and a spatial kernel in a hierarchical formulation of the Beta Process. The model is employed by our approach to first infer Gaussian Processes for the spectra present in the hyperspectral image. Then, it is used to estimate the activity level of the inferred processes in a sparse representation of a high resolution image of the same scene. Finally, we use the model to compute multiple sparse codes of the high resolution image, which are merged with the samples of the Gaussian Processes for an accurate estimate of the high resolution hyperspectral image. We perform experiments with remotely sensed and ground-based hyperspectral images to establish the effectiveness of our approach.

Naveed Akhtar, Faisal Shafait, Ajmal Mian

### A 4D Light-Field Dataset and CNN Architectures for Material Recognition

We introduce a new light-field dataset of materials, and take advantage of the recent success of deep learning to perform material recognition on the 4D light-field. Our dataset contains 12 material categories, each with 100 images taken with a Lytro Illum, from which we extract about 30,000 patches in total. To the best of our knowledge, this is the first mid-size dataset for light-field images. Our main goal is to investigate whether the additional information in a light-field (such as multiple sub-aperture views and view-dependent reflectance effects) can aid material recognition. Since recognition networks have not been trained on 4D images before, we propose and compare several novel CNN architectures to train on light-field images. In our experiments, the best performing CNN architecture achieves a 7 % boost compared with 2D image classification ($$70\,\%\rightarrow 77\,\%$$). These results constitute important baselines that can spur further research in the use of CNNs for light-field applications. Upon publication, our dataset also enables other novel applications of light-fields, including object detection, image segmentation and view interpolation.

Ting-Chun Wang, Jun-Yan Zhu, Ebi Hiroaki, Manmohan Chandraker, Alexei A. Efros, Ravi Ramamoorthi

### Graph-Based Consistent Matching for Structure-from-Motion

Pairwise image matching of unordered image collections greatly affects the efficiency and accuracy of Structure-from-Motion (SfM). Insufficient match pairs may result in disconnected structures or incomplete components, while costly redundant pairs containing erroneous ones may lead to folded and superimposed structures. This paper presents a graph-based image matching method that tackles the issues of completeness, efficiency and consistency in a unified framework. Our approach starts by chaining all but singleton images using a visual-similarity-based minimum spanning tree. Then the minimum spanning tree is incrementally expanded to form locally consistent strong triplets. Finally, a global community-based graph algorithm is introduced to strengthen the global consistency by reinforcing potentially large connected components. We demonstrate the superior performance of our method in terms of accuracy and efficiency on both benchmark and Internet datasets. Our method also performs remarkably well on the challenging datasets of highly ambiguous and duplicated scenes.

Tianwei Shen, Siyu Zhu, Tian Fang, Runze Zhang, Long Quan
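
The first step described in this abstract, chaining images with a visual-similarity-based minimum spanning tree, can be sketched with Kruskal's algorithm. This is only an illustration under stated assumptions, not the authors' implementation: the similarity scores, the `1 - similarity` edge cost, and the function name `similarity_mst` are all hypothetical.

```python
def similarity_mst(num_images, similarities):
    """Kruskal's algorithm on edges weighted by (1 - similarity).

    similarities: dict mapping (i, j) image-index pairs to a score in [0, 1].
    Returns the chosen (i, j) pairs (a spanning forest if the similarity
    graph is disconnected).
    """
    parent = list(range(num_images))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumed cost: the more similar two images are, the cheaper the edge.
    edges = sorted(similarities.items(), key=lambda kv: 1.0 - kv[1])
    tree = []
    for (i, j), _score in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


# Toy similarity scores between four images (made-up values).
scores = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.3, (2, 3): 0.7, (1, 3): 0.2}
mst_pairs = similarity_mst(4, scores)
```

Only the pairs in `mst_pairs` would need to be matched in this first pass; the triplet expansion and community-based steps mentioned in the abstract are not covered by the sketch.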

### All-Around Depth from Small Motion with a Spherical Panoramic Camera

With the growing use of head-mounted displays for virtual reality (VR), generating 3D content for these devices has become an important topic in computer vision. For capturing full 360 degree panoramas in a single shot, Spherical Panoramic Cameras (SPCs) are gaining in popularity. However, estimating depth from an SPC remains a challenging problem. In this paper, we propose a practical method that generates an all-around dense depth map using a narrow-baseline video clip captured by an SPC. While existing methods for depth from small motion rely on perspective cameras, we introduce a new bundle adjustment approach tailored for SPCs that minimizes the re-projection error directly on the unit sphere. This makes it possible to estimate approximate metric camera poses and 3D points. Additionally, we present a novel dense matching method called the sphere sweeping algorithm. This allows us to take advantage of the overlapping regions between the cameras. To validate the effectiveness of the proposed method, we evaluate our approach on both synthetic and real-world data. As an example of the applications, we also present stereoscopic panorama images generated from our depth results.

Sunghoon Im, Hyowon Ha, François Rameau, Hae-Gon Jeon, Gyeongmin Choe, In So Kweon
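
To make the idea of a re-projection error measured on the unit sphere more concrete, here is a minimal sketch. The abstract does not give the exact residual, so the chordal distance between unit rays used here, the world-to-camera convention `R * X + t`, and the name `sphere_reprojection_error` are assumptions for illustration only.

```python
import math


def unit(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sphere_reprojection_error(R, t, point, observed_ray):
    """Chordal distance between the predicted and the observed unit ray.

    R: 3x3 rotation as nested lists, t: translation (assumed world-to-camera).
    point: 3D point in world coordinates.
    observed_ray: measured viewing direction in the camera frame.
    """
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    predicted = unit(cam)       # projection onto the unit sphere
    observed = unit(observed_ray)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_error = sphere_reprojection_error(identity, [0.0, 0.0, 0.0],
                                       [0.0, 0.0, 2.0], [0.0, 0.0, 1.0])
right_angle = sphere_reprojection_error(identity, [0.0, 0.0, 0.0],
                                        [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because the residual is defined on rays rather than on an image plane, it stays well behaved for directions behind or beside the camera, which a perspective re-projection error cannot represent.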

### On Volumetric Shape Reconstruction from Implicit Forms

In this paper we report on the evaluation of volumetric shape reconstruction methods that consider as input implicit forms in 3D. Many visual applications build implicit representations of shapes that are converted into explicit shape representations using geometric tools such as the Marching Cubes algorithm. This is the case with image based reconstructions that produce point clouds from which implicit functions are computed, with for instance a Poisson reconstruction approach. While the Marching Cubes method is a versatile solution with proven efficiency, alternative solutions exist with different and complementary properties that are of interest for shape modeling. In this paper, we propose a novel strategy that builds on Centroidal Voronoi Tessellations (CVTs). These tessellations provide volumetric and surface representations with strong regularities in addition to provably more accurate approximations of the implicit forms considered. In order to compare the existing strategies, we present an extensive evaluation that analyzes various properties of the main strategies for implicit to explicit volumetric conversions: Marching cubes, Delaunay refinement and CVTs, including accuracy and shape quality of the resulting shape mesh.

Li Wang, Franck Hétroy-Wheeler, Edmond Boyer

### Multi-attributed Graph Matching with Multi-layer Random Walks

This paper addresses the multi-attributed graph matching problem, considering multiple attributes jointly while preserving the characteristics of each attribute. Since most conventional graph matching algorithms integrate multiple attributes to construct a single attribute in an oversimplified way, the information from multiple attributes is often not fully exploited. In order to solve this problem, we propose a novel multi-layer graph structure that can preserve the particularities of each attribute in separate layers. We also propose a multi-attributed graph matching algorithm based on random walk centrality for the proposed multi-layer graph structure. We compare the proposed algorithm with other state-of-the-art graph matching algorithms based on the single-layer structure using synthetic and real datasets, and demonstrate the superior performance of the proposed multi-layer graph structure and matching algorithm.

Han-Mu Park, Kuk-Jin Yoon

### Deep Learning of Local RGB-D Patches for 3D Object Detection and 6D Pose Estimation

We present a 3D object detection method that uses regressed descriptors of locally-sampled RGB-D patches for 6D vote casting. For regression, we employ a convolutional auto-encoder that has been trained on a large collection of random local patches. During testing, scene patch descriptors are matched against a database of synthetic model view patches and cast 6D object votes which are subsequently filtered to refined hypotheses. We evaluate on three datasets to show that our method generalizes well to previously unseen input data, delivers robust detection results that compete with and surpass the state-of-the-art while being scalable in the number of objects.

Wadim Kehl, Fausto Milletari, Federico Tombari, Slobodan Ilic, Nassir Navab

### A Neural Approach to Blind Motion Deblurring

We present a new method for blind motion deblurring that uses a neural network trained to compute estimates of sharp image patches from observations that are blurred by an unknown motion kernel. Instead of regressing directly to patch intensities, this network learns to predict the complex Fourier coefficients of a deconvolution filter to be applied to the input patch for restoration. For inference, we apply the network independently to all overlapping patches in the observed image, and average its outputs to form an initial estimate of the sharp image. We then explicitly estimate a single global blur kernel by relating this estimate to the observed image, and finally perform non-blind deconvolution with this kernel. Our method exhibits accuracy and robustness close to state-of-the-art iterative methods, while being much faster when parallelized on GPU hardware.

Ayan Chakrabarti
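
The inference step in which the network is applied to all overlapping patches and its outputs are averaged can be sketched as follows. The callable `restore` is a hypothetical stand-in for the per-patch network prediction, and the stride of one pixel is an assumption; the kernel estimation and non-blind deconvolution stages are not shown.

```python
def average_overlapping_patches(height, width, patch, restore):
    """Apply `restore` to every patch x patch window and average the overlaps.

    restore: callable taking (top, left) and returning a patch x patch list
    of lists; it stands in for the network's estimate of the sharp patch.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in range(height - patch + 1):
        for left in range(width - patch + 1):
            out = restore(top, left)
            for dy in range(patch):
                for dx in range(patch):
                    total[top + dy][left + dx] += out[dy][dx]
                    count[top + dy][left + dx] += 1
    # Each pixel is the mean of all patch predictions that cover it.
    return [[total[y][x] / count[y][x] for x in range(width)]
            for y in range(height)]


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]


def identity_restore(top, left):
    """Toy restorer that simply returns the 2 x 2 crop unchanged."""
    return [row[left:left + 2] for row in image[top:top + 2]]


averaged = average_overlapping_patches(3, 3, 2, identity_restore)
```

With the identity restorer the averaged result reproduces the input, which is a quick sanity check that the overlap bookkeeping is right.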

### Joint Face Representation Adaptation and Clustering in Videos

Clustering faces in movies or videos is extremely challenging since characters’ appearance can vary drastically under different scenes. In addition, the various cinematic styles make it difficult to learn a universal face representation for all videos. Unlike previous methods that assume fixed handcrafted features for face clustering, in this work, we formulate a joint face representation adaptation and clustering approach in a deep learning framework. The proposed method allows face representation to gradually adapt from an external source domain to a target video domain. The adaptation of deep representation is achieved without any strong supervision but through iteratively discovered weak pairwise identity constraints derived from potentially noisy face clustering result. Experiments on three benchmark video datasets demonstrate that our approach generates character clusters with high purity compared to existing video face clustering methods, which are either based on deep face representation (without adaptation) or carefully engineered features.

Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang

### Uncovering Symmetries in Polynomial Systems

In this paper we study symmetries in polynomial equation systems and how they can be integrated into the action matrix method. The main contribution is a generalization of the partial p-fold symmetry and we provide new theoretical insights as to why these methods work. We show several examples of how to use this symmetry to construct more compact polynomial solvers. As a second contribution we present a simple and automatic method for finding these symmetries for a given problem. Finally we show two examples where these symmetries occur in real applications.

### ATGV-Net: Accurate Depth Super-Resolution

In this work we present a novel approach for single depth map super-resolution. Modern consumer depth sensors, especially Time-of-Flight sensors, produce dense depth measurements, but are affected by noise and have a low lateral resolution. We propose a method that combines the benefits of recent advances in machine learning based single image super-resolution, i.e. deep convolutional networks, with a variational method to recover accurate high-resolution depth maps. In particular, we integrate a variational method that models the piecewise affine structures apparent in depth data via an anisotropic total generalized variation regularization term on top of a deep network. We call our method ATGV-Net and train it end-to-end by unrolling the optimization procedure of the variational method. To train deep networks, a large corpus of training data with accurate ground-truth is required. We demonstrate that it is feasible to train our method solely on synthetic data that we generate in large quantities for this task. Our evaluations show that we achieve state-of-the-art results on three different benchmarks, as well as on a challenging Time-of-Flight dataset, all without utilizing an additional intensity image as guidance.

Gernot Riegler, Matthias Rüther, Horst Bischof

### Indoor-Outdoor 3D Reconstruction Alignment

Structure-from-Motion can achieve accurate reconstructions of urban scenes. However, reconstructing the inside and the outside of a building into a single model is very challenging due to the lack of visual overlap and the change of lighting conditions between the two scenes. We propose a solution to align disconnected indoor and outdoor models of the same building into a single 3D model. Our approach leverages semantic information, specifically window detections, in multiple scenes to obtain candidate matches from which an alignment hypothesis can be computed. To determine the best alignment, we propose a novel cost function that takes both the number of window matches and the intersection of the aligned models into account. We evaluate our solution on multiple challenging datasets.

Andrea Cohen, Johannes L. Schönberger, Pablo Speciale, Torsten Sattler, Jan-Michael Frahm, Marc Pollefeys

### The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition

Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes. Second, train a model utilizing this data. Toward the goal of solving fine-grained recognition, we introduce an alternative approach, leveraging free, noisy data from the web and simple, generic methods of recognition. This approach has benefits in both performance and scalability. We demonstrate its efficacy on four fine-grained datasets, greatly exceeding existing state of the art without the manual collection of even a single label, and furthermore show first results at scaling to more than 10,000 fine-grained categories. Quantitatively, we achieve top-1 accuracies of $$92.3\,\%$$ on CUB-200-2011, $$85.4\,\%$$ on Birdsnap, $$93.4\,\%$$ on FGVC-Aircraft, and $$80.8\,\%$$ on Stanford Dogs without using their annotated training sets. We compare our approach to an active learning approach for expanding fine-grained datasets.

Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, Li Fei-Fei

### A Simple Hierarchical Pooling Data Structure for Loop Closure

We propose a data structure obtained by hierarchically pooling Bag-of-Words (BoW) descriptors during a sequence of views that achieves average speedups in large-scale loop closure applications ranging from 2 to 20 times on benchmark datasets. Although simple, the method works as well as sophisticated agglomerative schemes at a fraction of the cost with minimal loss of performance.

Xiaohan Fei, Konstantine Tsotsos, Stefano Soatto
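
A rough sketch of how hierarchically pooled Bag-of-Words descriptors can speed up a loop-closure query is given below. The abstract does not state the pooling operator or the similarity measure, so element-wise max pooling, histogram intersection, the fixed branching factor, and all function names are assumptions chosen because they make the pruning rule easy to verify.

```python
def intersection(a, b):
    """Histogram-intersection similarity between two BoW histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


def build_hierarchy(histograms, branch=2):
    """Pool consecutive histograms by element-wise max, level by level.

    Level 0 holds the per-view histograms; the last level holds a single
    pooled histogram for the whole sequence.
    """
    levels = [list(histograms)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        pooled = [
            [max(col) for col in zip(*prev[i:i + branch])]
            for i in range(0, len(prev), branch)
        ]
        levels.append(pooled)
    return levels


def search(levels, query, threshold, branch=2):
    """Return indices of views whose similarity to `query` is >= threshold.

    With max pooling, a pooled node upper-bounds the intersection score of
    every view below it, so a subtree is skipped once its score falls short.
    """
    hits = []
    stack = [(len(levels) - 1, 0)]
    while stack:
        level, idx = stack.pop()
        if intersection(levels[level][idx], query) < threshold:
            continue  # prune the whole subtree
        if level == 0:
            hits.append(idx)
        else:
            first = idx * branch
            last = min(first + branch, len(levels[level - 1]))
            stack.extend((level - 1, child) for child in range(first, last))
    return sorted(hits)


views = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [2, 2, 0], [0, 1, 2]]
levels = build_hierarchy(views)
matches = search(levels, query=[0, 3, 1], threshold=2)
```

Under these assumptions the pruned search returns the same views as a brute-force scan; other pooling or scoring choices may trade some recall for speed.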

### A Versatile Approach for Solving PnP, PnPf, and PnPfr Problems

This paper proposes a versatile approach for solving three kinds of absolute camera pose estimation problems: the PnP problem for calibrated cameras, the PnPf problem for cameras with unknown focal length, and the PnPfr problem for cameras with unknown focal length and unknown radial distortion. This is not only the first least squares solution to the PnPfr problem, but also the first approach formulating the three problems in the same theoretical manner. We show that all problems have a common subproblem represented as multivariate polynomial equations. Solving these equations by the Gröbner basis method, we derive a linear form for the remaining parameters of each problem. Finally, we apply root polishing to strictly satisfy the original KKT condition. The proposed PnP and PnPf solvers have comparable performance to the state-of-the-art methods on synthetic distortion-free data. Moreover, the novel PnPfr solver gives the best result on distorted point data and demonstrates real image rectification against significant distortion.

Gaku Nakano

### Depth Map Super-Resolution by Deep Multi-Scale Guidance

Depth boundaries often lose sharpness when upsampling from low-resolution (LR) depth maps, especially at large upscaling factors. We present a new method to address the problem of depth map super resolution in which a high-resolution (HR) depth map is inferred from an LR depth map and an additional HR intensity image of the same scene. We propose a Multi-Scale Guided convolutional network (MSG-Net) for depth map super resolution. MSG-Net complements LR depth features with HR intensity features using a multi-scale fusion strategy. Such multi-scale guidance allows the network to better adapt to upsampling of both fine- and large-scale structures. Specifically, the rich hierarchical HR intensity features at different levels progressively resolve ambiguity in depth map upsampling. Moreover, we employ a high-frequency domain training method to not only reduce training time but also facilitate the fusion of depth and intensity features. With the multi-scale guidance, MSG-Net achieves state-of-the-art performance for depth map upsampling.

Tak-Wai Hui, Chen Change Loy, Xiaoou Tang

### SEAGULL: Seam-Guided Local Alignment for Parallax-Tolerant Image Stitching

Image stitching with large parallax is a challenging problem. Global alignment usually introduces noticeable artifacts. A common strategy is to perform partial alignment to facilitate the search for a good seam for stitching. Different from existing approaches where the seam estimation process is performed sequentially after alignment, we explicitly use the estimated seam to guide the process of optimizing local alignment so that the seam quality gets improved over each iteration. Furthermore, a novel structure-preserving warping method is introduced to preserve salient curve and line structures during the warping. These measures substantially improve the effectiveness of our method in dealing with a wide range of challenging images with large parallax.

Kaimo Lin, Nianjuan Jiang, Loong-Fah Cheong, Minh Do, Jiangbo Lu

### Grid Loss: Detecting Occluded Faces

Detection of partially occluded objects is a challenging computer vision problem. Standard Convolutional Neural Network (CNN) detectors fail if parts of the detection window are occluded, since not every sub-part of the window is discriminative on its own. To address this issue, we propose a novel loss layer for CNNs, named grid loss, which minimizes the error rate on sub-blocks of a convolution layer independently rather than over the whole feature map. This results in parts being more discriminative on their own, enabling the detector to recover if the detection window is partially occluded. By mapping our loss layer back to a regular fully connected layer, no additional computational cost is incurred at runtime compared to standard CNNs. We demonstrate our method for face detection on several public face detection benchmarks and show that our method outperforms regular CNNs, is suitable for realtime applications and achieves state-of-the-art performance.

Michael Opitz, Georg Waltner, Georg Poier, Horst Possegger, Horst Bischof
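
The core idea of a loss evaluated on sub-blocks independently can be illustrated with a small sketch. It is an assumption-laden toy, not the authors' layer: it uses a hinge loss, non-overlapping blocks, one linear scorer per block taken from a shared weight map, and the hypothetical names `hinge` and `grid_loss`.

```python
def hinge(score, label):
    """Hinge loss for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)


def grid_loss(features, weights, label, block):
    """Sum hinge losses computed separately on each block x block sub-block.

    features, weights: 2D lists of equal size whose sides are multiples of
    `block`; each sub-block is scored by its own slice of `weights`.
    """
    rows, cols = len(features), len(features[0])
    loss = 0.0
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            score = sum(
                features[y][x] * weights[y][x]
                for y in range(top, top + block)
                for x in range(left, left + block)
            )
            loss += hinge(score, label)
    return loss


# Left half of the window responds, right half is blank (as if occluded).
features = [[1.0, 1.0, 0.0, 0.0],
            [1.0, 1.0, 0.0, 0.0]]
weights = [[0.5] * 4, [0.5] * 4]

whole_score = sum(f * w for fr, wr in zip(features, weights)
                  for f, w in zip(fr, wr))
holistic = hinge(whole_score, +1)
per_block = grid_loss(features, weights, +1, block=2)
```

In this toy case the holistic loss is zero while the per-block loss still penalizes the uninformative half, which is the kind of pressure that pushes each part to be discriminative on its own.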

### Large-Scale R-CNN with Classifier Adaptive Quantization

This paper extends R-CNN, a state-of-the-art object detection method, to larger scales. To apply R-CNN to a large database storing thousands to millions of images, the SVM classification of millions to billions of DCNN features extracted from object proposals is indispensable, which imposes unrealistic computational and memory costs. Our method dramatically narrows down the number of object proposals by using an inverted index and efficiently searches by using residual vector quantization (RVQ). Instead of k-means that has been used in inverted indices, we present a novel quantization method designed for linear classification wherein the quantization error is re-defined for linear classification. It approximates the error as the empirical error with pre-defined multiple exemplar classifiers and captures the variance and common attributes of object category classifiers effectively. Experimental results show that our method achieves comparable performance to that of applying R-CNN to all images while achieving a 250 times speed-up and 180 times memory reduction. Moreover, our approach significantly outperforms the state-of-the-art large-scale category detection method, with about a 40$$\sim$$58 % increase in top-K precision. Scalability is also validated, and we demonstrate that our method can process 100 K images in 0.13 s while retaining precision.

Ryota Hinami, Shin’ichi Satoh
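
For readers unfamiliar with residual vector quantization (RVQ), the following sketch shows the generic technique with plain Euclidean nearest-codeword assignment. It does not implement the classifier-adaptive quantization error proposed in the paper; the codebooks are made-up values and the function names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to `vec` in squared Euclidean distance."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Two toy stages: a coarse codebook followed by a fine one.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
codes = rvq_encode([4.0, 5.0], codebooks)
approx = rvq_decode(codes, codebooks)
```

Each vector is stored as a short list of codeword indices, which is where the memory savings of quantization-based indexing come from.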

### Face Detection with End-to-End Integration of a ConvNet and a 3D Model

This paper presents a method for face detection in the wild, which integrates a ConvNet and a 3D mean face model in an end-to-end multi-task discriminative learning framework. The 3D mean face model is predefined and fixed (e.g., we used the one provided in the AFLW dataset). The ConvNet consists of two components: (i) The face proposal component computes face bounding box proposals via estimating facial key-points and the 3D transformation (rotation and translation) parameters for each predicted key-point w.r.t. the 3D mean face model. (ii) The face verification component computes detection results by pruning and refining proposals based on facial key-points based configuration pooling. The proposed method addresses two issues in adapting state-of-the-art generic object detection ConvNets (e.g., faster R-CNN) for face detection: (i) One is to eliminate the heuristic design of predefined anchor boxes in the region proposals network (RPN) by exploiting a 3D mean face model. (ii) The other is to replace the generic RoI (Region-of-Interest) pooling layer with a configuration pooling layer to respect underlying object structures. The multi-task loss consists of three terms: the classification Softmax loss and the location smooth $$l_1$$-losses of both the facial key-points and the face bounding boxes. In experiments, our ConvNet is trained on the AFLW dataset only and tested on the FDDB benchmark with fine-tuning and on the AFW benchmark without fine-tuning. The proposed method obtains very competitive state-of-the-art performance in the two benchmarks.

Yunzhu Li, Benyuan Sun, Tianfu Wu, Yizhou Wang
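
The smooth $$l_1$$ loss named in this abstract is commonly defined as quadratic for small residuals and linear for large ones. The sketch below uses that common definition with a switch point at 1; whether the paper uses exactly this threshold or any per-term weighting is not stated in the abstract, so treat both as assumptions.

```python
def smooth_l1(pred, target):
    """Sum of smooth-L1 penalties over paired coordinates.

    Per element: 0.5 * d**2 if |d| < 1, otherwise |d| - 0.5.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


# One small residual (quadratic branch) and one large (linear branch).
loss = smooth_l1([0.5, 3.0], [0.0, 0.0])
```

The linear branch keeps gradients bounded for badly localized key-points or boxes, while the quadratic branch gives smooth behaviour near the optimum.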

### Large Scale Asset Extraction for Urban Images

Object proposals are currently used for increasing the computational efficiency of object detection. We propose a novel adaptive pipeline for interleaving object proposals with object classification and use it as a formulation for asset detection. We first preprocess the images using a novel and efficient rectification technique. We then employ a particle filter approach to keep track of three priors, which guide proposed samples and get updated using classifier output. Tests performed on over 1000 urban images demonstrate that our rectification method is faster than existing methods without loss in quality, and that our interleaved proposal method outperforms current state-of-the-art. We further demonstrate that other methods can be improved by incorporating our interleaved proposals.

Lama Affara, Liangliang Nan, Bernard Ghanem, Peter Wonka

### Multi-label Active Learning Based on Maximum Correntropy Criterion: Towards Robust and Discriminative Labeling

Multi-label learning is a challenging problem in the computer vision field. In this paper, we propose a novel active learning approach to greatly reduce the annotation costs for multi-label classification. State-of-the-art active learning methods either annotate all the relevant samples without diagnosing discriminative information in the labels or annotate only limited discriminative samples manually, which gives weak immunity to outlier labels. To overcome these problems, we propose a multi-label active learning method based on the Maximum Correntropy Criterion (MCC) by merging uncertainty and representativeness. We use the labels of labeled data and the prediction labels of unknown data to enhance the uncertainty and representativeness measurement through a merging strategy, and use the MCC to alleviate the influence of outlier labels for discriminative labeling. Experiments on several challenging benchmark multi-label datasets show the superior performance of our proposed method over state-of-the-art methods.

Zengmao Wang, Bo Du, Lefei Zhang, Liangpei Zhang, Meng Fang, Dacheng Tao
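
As a rough illustration of why a correntropy-based criterion is less sensitive to outlier labels than a squared-error criterion, the sketch below computes the standard empirical correntropy with a Gaussian kernel. It is not the authors' active learning method; the function name `correntropy`, the kernel width `sigma`, and the toy vectors are assumptions made for this example.

```python
import math


def correntropy(a, b, sigma=1.0):
    """Empirical correntropy of two equal-length sequences.

    Uses an unnormalized Gaussian kernel, so each term lies in (0, 1]
    and a single large deviation can lower the mean by at most 1/n.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(a, b):
        total += math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))
    return total / len(a)


def mean_squared_error(a, b):
    """Plain mean squared error, for comparison (unbounded per term)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical toy data: the second entry of `noisy` is an outlier.
clean = [0.0, 0.0]
noisy = [0.0, 100.0]
similarity = correntropy(clean, noisy)      # bounded contribution from the outlier
error = mean_squared_error(clean, noisy)    # dominated by the outlier
```

Maximizing correntropy therefore down-weights samples with very large residuals, whereas a squared-error objective lets them dominate.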

### Shading-Aware Multi-view Stereo

We present a novel multi-view reconstruction approach that effectively combines stereo and shape-from-shading energies into a single optimization scheme. Our method uses image gradients to transition between stereo-matching (which is more accurate at large gradients) and Lambertian shape-from-shading (which is more robust in flat regions). In addition, we show that our formulation is invariant to spatially varying albedo without explicitly modeling it. We show that the resulting energy function can be optimized efficiently using a smooth surface representation based on bicubic patches, and demonstrate that this algorithm outperforms both previous multi-view stereo algorithms and shading-based refinement approaches on a number of datasets.

Fabian Langguth, Kalyan Sunkavalli, Sunil Hadap, Michael Goesele

### Fine-Scale Surface Normal Estimation Using a Single NIR Image

We present surface normal estimation using a single near infrared (NIR) image. We focus on reconstructing fine-scale surface geometry using an image captured with an uncalibrated light source. To tackle this ill-posed problem, we adopt a generative adversarial network, which is effective in recovering sharp outputs essential for fine-scale surface normal estimation. We incorporate the angular error and an integrability constraint into the objective function of the network so that the estimated normals reflect physical characteristics. We train and validate our network on a recent NIR dataset, and also evaluate the generality of our trained model by using new external datasets that are captured with a different camera under different environments.

Youngjin Yoon, Gyeongmin Choe, Namil Kim, Joon-Young Lee, In So Kweon

### Pixelwise View Selection for Unstructured Multi-View Stereo

This work presents a Multi-View Stereo system for robust and efficient dense modeling from unstructured image collections. Our core contributions are the joint estimation of depth and normal information, pixelwise view selection using photometric and geometric priors, and a multi-view geometric consistency term for the simultaneous refinement and image-based depth and normal fusion. Experiments on benchmarks and large-scale Internet photo collections demonstrate state-of-the-art performance in terms of accuracy, completeness, and efficiency.

Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, Marc Pollefeys

### Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation

CNN architectures have terrific recognition performance but rely on spatial pooling which makes it difficult to adapt them to tasks that require dense, pixel-accurate labeling. This paper makes two contributions: (1) We demonstrate that while the apparent spatial resolution of convolutional feature maps is low, the high-dimensional feature representation contains significant sub-pixel localization information. (2) We describe a multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps. This approach yields state-of-the-art semantic segmentation results on the PASCAL VOC and Cityscapes segmentation benchmarks without resorting to more complex random-field inference or instance detection driven architectures.

Golnaz Ghiasi, Charless C. Fowlkes
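
For readers unfamiliar with the term, the sketch below shows the classic Laplacian pyramid idea on a 1D signal: each level stores the detail lost by downsampling, and reconstruction proceeds from coarse to fine by upsampling and adding that detail back. It illustrates only the signal-processing construct, not the network architecture described in the abstract; the averaging downsampler, the nearest-neighbour upsampler, and all function names are assumptions chosen for brevity.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs (length must be even)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating each sample (nearest neighbour)."""
    return [value for value in signal for _ in range(2)]


def build_pyramid(signal, levels):
    """Return (detail_bands, coarse_residual); bands are ordered fine to coarse.

    The signal length must be divisible by 2 ** levels.
    """
    bands = []
    current = [float(v) for v in signal]
    for _ in range(levels):
        low = downsample(current)
        predicted = upsample(low)
        # The band holds what the low-resolution copy cannot represent.
        bands.append([c - p for c, p in zip(current, predicted)])
        current = low
    return bands, current


def reconstruct(bands, residual):
    """Refine from coarse to fine: upsample, then add the stored detail band."""
    current = list(residual)
    for band in reversed(bands):
        current = [u + d for u, d in zip(upsample(current), band)]
    return current


signal = [1, 3, 2, 6, 5, 5, 0, 4]
bands, residual = build_pyramid(signal, levels=2)
restored = reconstruct(bands, residual)
```

In a learned setting, the detail added at each level would be predicted from higher-resolution features rather than stored, which is where skip connections come in.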

### Generic 3D Representation via Pose Estimation and Matching

Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross modality pose estimation). In the context of the core supervised tasks, we demonstrate our representation achieves state-of-the-art wide baseline feature matching results without requiring a priori rectification (unlike SIFT and the majority of learnt features). We also show 6DOF camera pose estimation given a pair of local image patches. The accuracy on both supervised tasks is comparable to that of humans. Finally, we contribute a large-scale dataset composed of object-centric street view scenes along with point correspondences and camera pose information, and conclude with a discussion on the learned representation and open research questions.

Amir R. Zamir, Tilman Wekel, Pulkit Agrawal, Colin Wei, Jitendra Malik, Silvio Savarese

### Hand Pose Estimation from Local Surface Normals

We present a hierarchical regression framework for estimating hand joint positions from single depth images based on local surface normals. The hierarchical regression follows the tree-structured topology of the hand from wrist to finger tips. We propose a conditional regression forest, the Frame Conditioned Regression Forest (FCRF), which uses a new normal difference feature. At each stage of the regression, the frame of reference is established from either the local surface normal or previously estimated hand joints. By making the regression with respect to the local frame, the pose estimation is more robust to rigid transformations. We also introduce a new efficient approximation to estimate surface normals. We verify the effectiveness of our method by conducting experiments on two challenging real-world datasets and show consistent improvements over previous discriminative pose estimation methods.

Chengde Wan, Angela Yao, Luc Van Gool

### Abundant Inverse Regression Using Sufficient Reduction and Its Applications

Statistical models such as linear regression drive numerous applications in computer vision and machine learning. The landscape of practical deployments of these formulations is dominated by forward regression models that estimate the parameters of a function mapping a set of $$p$$ covariates, $$\boldsymbol{x}$$, to a response variable, $$y$$. The lesser-known alternative, Inverse Regression, offers various benefits that are much less explored in vision problems. The goal of this paper is to show how Inverse Regression in the “abundant” feature setting (i.e., many subsets of features are associated with the target label or response, as is the case for images), together with a statistical construction called Sufficient Reduction, yields highly flexible models that are a natural fit for model estimation tasks in vision. Specifically, we obtain formulations that provide relevance of individual covariates used in prediction, at the level of specific examples/samples, in a sense explaining why a particular prediction was made. With no compromise in performance relative to other methods, an ability to interpret why a learning algorithm is behaving in a specific way for each prediction adds significant value in numerous applications. We illustrate these properties and the benefits of Abundant Inverse Regression on three distinct applications.

Hyunwoo J. Kim, Brandon M. Smith, Nagesh Adluru, Charles R. Dyer, Sterling C. Johnson, Vikas Singh

### Learning Diverse Models: The Coulomb Structured Support Vector Machine

In structured prediction, it is standard procedure to discriminatively train a single model that is then used to make a single prediction for each input. This practice is simple but risky in many ways. For instance, models are often designed with tractability rather than faithfulness in mind. To hedge against such model misspecification, it may be useful to train multiple models that all are a reasonable fit to the training data, but at least one of which may hopefully make more valid predictions than the single model in standard procedure. We propose the Coulomb Structured SVM (CSSVM) as a means to obtain at training time a full ensemble of different models. At test time, these models can run in parallel and independently to make diverse predictions. We demonstrate on challenging tasks from computer vision that some of these diverse predictions have significantly lower task loss than that of a single model, and improve over state-of-the-art diversity encouraging approaches.

Martin Schiegg, Ferran Diego, Fred A. Hamprecht

### Pose Hashing with Microlens Arrays

We design and demonstrate a passive physical object whose appearance changes to give a discrete encoding of its pose. This object is created with a microlens array that is placed on top of a black and white pattern; when viewed from a particular viewpoint, the lenses appear black or white depending on the part of the pattern that each microlens projects towards that viewpoint. We analyze different design considerations that impact the information gained from the appearance of the microlens array. In addition, we introduce the process through which the discrete microlens pattern can be turned into a viewpoint and a pose estimate. We empirically evaluate factors that impact viewpoint and pose estimation accuracy. Finally, we compare the pose estimation accuracy of the microlens array to other related fiducial markers.

Ian Schillebeeckx, Robert Pless

### The Fast Bilateral Solver

We present the bilateral solver, a novel algorithm for edge-aware smoothing that combines the flexibility and speed of simple filtering approaches with the accuracy of domain-specific optimization algorithms. Our technique is capable of matching or improving upon state-of-the-art results on several different computer vision tasks (stereo, depth superresolution, colorization, and semantic segmentation) while being 10–1000$$\times$$ faster than baseline techniques with comparable accuracy, and producing lower-error output than techniques with comparable runtimes. The bilateral solver is fast, robust, straightforward to generalize to new domains, and simple to integrate into deep learning pipelines.

Jonathan T. Barron, Ben Poole

### Phase-Based Modification Transfer for Video

We present a novel phase-based method for propagating modifications of one video frame to an entire sequence. Instead of computing accurate pixel correspondences between frames, e.g. extracting sparse features or optical flow, we use the assumption that small motion can be represented as the phase shift of individual pixels. In order to successfully apply this idea to transferring image edits, we propose a correction algorithm, which adapts the phase shift as well as the amplitude of the modified images. As our algorithm avoids expensive global optimization and all computational steps are performed per-pixel, it allows for a simple and efficient implementation. We evaluate the flexibility of the approach by applying it to various types of image modifications, ranging from compositing and colorization to image filters.

Simone Meyer, Alexander Sorkine-Hornung, Markus Gross

### Colorful Image Colorization

Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a “colorization Turing test,” asking human participants to choose between a generated and ground truth color image. Our method successfully fools humans on 32 % of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. This approach results in state-of-the-art performance on several feature learning benchmarks.

Richard Zhang, Phillip Isola, Alexei A. Efros

### Focal Flow: Measuring Distance and Velocity with Defocus and Differential Motion

We present the focal flow sensor. It is an unactuated, monocular camera that simultaneously exploits defocus and differential motion to measure a depth map and a 3D scene velocity field. It does so using an optical-flow-like, per-pixel linear constraint that relates image derivatives to depth and velocity. We derive this constraint, prove its invariance to scene texture, and prove that it is exactly satisfied only when the sensor’s blur kernels are Gaussian. We analyze the inherent sensitivity of the ideal focal flow sensor, and we build and test a prototype. Experiments produce useful depth and velocity information for a broader set of aperture configurations, including a simple lens with a pillbox aperture.

Emma Alexander, Qi Guo, Sanjeev Koppal, Steven Gortler, Todd Zickler

### An Evaluation of Computational Imaging Techniques for Heterogeneous Inverse Scattering

Inferring internal scattering parameters for general, heterogeneous materials remains a challenging inverse problem. Its difficulty arises from the complex way in which scattering materials interact with light, as well as the very high dimensionality of the material space implied by heterogeneity. The recent emergence of diverse computational imaging techniques, together with the widespread availability of computing power, presents a renewed opportunity for tackling this problem. We take first steps in this direction by deriving theoretical results, developing an algorithmic framework, and performing quantitative evaluations for the problem of heterogeneous inverse scattering from simulated measurements of different computational imaging configurations.

Ioannis Gkioulekas, Anat Levin, Todd Zickler

### Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks

This paper proposes Markovian Generative Adversarial Networks (MGANs), a method for training generative networks for efficient texture synthesis. While deep neural network approaches have recently demonstrated remarkable results in terms of synthesis quality, they still come at considerable computational costs (minutes of run-time for low-res images). Our paper addresses this efficiency issue. Instead of the numerical deconvolution used in previous work, we precompute a feed-forward, strided convolutional network that captures the feature statistics of Markovian patches and is able to directly generate outputs of arbitrary dimensions. Such a network can directly decode brown noise to realistic texture, or photos to artistic paintings. With adversarial training, we obtain quality comparable to recent neural texture synthesis methods. As no optimization is required at generation time, our run-time performance (0.25 M pixel images at 25 Hz) surpasses previous neural texture synthesizers by a significant margin (at least 500 times faster). We apply this idea to texture synthesis, style transfer, and video stylization.

Chuan Li, Michael Wand

### Fast Guided Global Interpolation for Depth and Motion

We study the problems of upsampling a low-resolution depth map and interpolating an initial set of sparse motion matches, with guidance from a corresponding high-resolution color image. The common objective for both tasks is to densify a set of sparse data points, either regularly distributed or scattered, to a full image grid through a 2D guided interpolation process. We propose a unified approach that casts the fundamental guided interpolation problem into a hierarchical, global optimization framework. Built on a weighted least squares (WLS) formulation and its recent fast solver, the fast global smoothing (FGS) technique, our method progressively densifies the input data set by efficiently performing cascaded, global interpolation (or smoothing) with alternating guidances. Our cascaded scheme effectively addresses the potential structure inconsistency between the sparse input data and the guidance image, while preserving depth or motion boundaries. To prevent new data points of low confidence from contaminating the next interpolation process, we also prudently evaluate the consensus of the interpolated intermediate data. Experiments show that our general interpolation approach successfully tackles several notorious challenges. Our method achieves quantitatively competitive results on various benchmark evaluations, while running much faster than other competing methods designed specifically for either depth upsampling or motion interpolation.

Yu Li, Dongbo Min, Minh N. Do, Jiangbo Lu

### Learning High-Order Filters for Efficient Blind Deconvolution of Document Photographs

Photographs of text documents taken by hand-held cameras can be easily degraded by camera motion during exposure. In this paper, we propose a new method for blind deconvolution of document images. Observing that document images are usually dominated by small-scale high-order structures, we propose to learn a multi-scale, interleaved cascade of shrinkage fields model, which contains a series of high-order filters to facilitate joint recovery of blur kernel and latent image. With extensive experiments, we show that our method produces high quality results and is highly efficient at the same time, making it a practical choice for deblurring high resolution text images captured by modern mobile devices.

Lei Xiao, Jue Wang, Wolfgang Heidrich, Michael Hirsch

### Multi-view Inverse Rendering Under Arbitrary Illumination and Albedo

3D shape reconstruction with multi-view stereo (MVS) relies on a robust evaluation of photo consistencies across images. The robustness is ensured by isolating surface albedo and scene illumination from the shape recovery, i.e. shading and colour variation are regarded as a nuisance in MVS. This yields a gap in quality between the recovered shape and the images used. We present a method to address it by jointly estimating detailed shape, illumination and albedo using the initial shape robustly recovered by MVS. This is achieved by solving the multi-view inverse rendering problem using geometric and photometric smoothness terms and the normalized spherical harmonics illumination model. Our method allows spatially-varying albedo and per-image illumination without any prerequisites such as training data or image segmentation. We demonstrate that our method can clearly improve the 3D shape and recover illumination and albedo on real world scenes.

Kichang Kim, Akihiko Torii, Masatoshi Okutomi

### DAPs: Deep Action Proposals for Action Understanding

Object proposals have contributed significantly to recent advances in object understanding in images. Inspired by the success of this approach, we introduce Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. We show how to take advantage of the vast capacity of deep learning models and memory cells to retrieve from untrimmed videos temporal segments, which are likely to contain actions. A comprehensive evaluation indicates that our approach outperforms previous work on a large scale action benchmark, runs at 134 FPS making it practical for large-scale scenarios, and exhibits an appealing ability to generalize, i.e. to retrieve good quality temporal proposals of actions unseen in training.

Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, Bernard Ghanem

### A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning

We have created a large, diverse set of cars from overhead images, which is useful for training a deep learner to perform binary classification, detection and counting of cars (data sets, annotations, networks and scripts are available from http://gdo-datasci.ucllnl.org/cowc/). The dataset and all related material will be made publicly available. The set contains contextual matter to aid in identification of difficult targets. We demonstrate classification and detection on this dataset using a neural network we call ResCeption. This network combines residual learning with Inception-style layers and is used to count cars in one look. This is a new way to count objects rather than by localization or density estimation. It is fairly accurate, fast and easy to implement. Additionally, the counting method is not car or scene specific. It would be easy to train this method to count other kinds of objects, and counting over new scenes requires no extra setup or assumptions about object locations.

T. Nathan Mundhenk, Goran Konjevod, Wesam A. Sakla, Kofi Boakye

### Reliable Attribute-Based Object Recognition Using High Predictive Value Classifiers

We consider the problem of object recognition in 3D using an ensemble of attribute-based classifiers. We propose two new concepts to improve classification in practical situations, and show their implementation in an approach for recognition from point-cloud data. First, the viewing conditions can have a strong influence on classification performance. We study the impact of the distance between the camera and the object and propose an approach to fusing multiple attribute classifiers, which incorporates distance into the decision making. Second, a lack of representative training samples often makes it difficult to learn the optimal threshold value for the best positive and negative detection rates. We address this issue by setting two threshold values in our attribute classifiers instead of just one, to distinguish a positive, a negative and an uncertainty class, and we prove the theoretical correctness of this approach. Empirical studies demonstrate the effectiveness and feasibility of the proposed concepts.

Wentao Luan, Yezhou Yang, Cornelia Fermüller, John S. Baras
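
The two-threshold idea in the abstract above can be sketched as a three-way decision rule: scores above an upper threshold are accepted, scores below a lower threshold are rejected, and anything in between is left undecided. This is only a minimal illustration under assumed names (`classify_with_uncertainty`, `low`, `high`) and assumed threshold values; it does not reproduce the authors' classifiers or how they select the thresholds.

```python
def classify_with_uncertainty(score, low, high):
    """Map a classifier score to 'positive', 'negative', or 'uncertain'.

    `low` and `high` are hypothetical thresholds with low <= high.
    Scores strictly between them are not forced into either class.
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "uncertain"


# Hypothetical scores from one attribute classifier.
scores = [0.95, 0.50, 0.05]
decisions = [classify_with_uncertainty(s, low=0.2, high=0.8) for s in scores]
```

Setting `low == high` recovers the usual single-threshold classifier, so the uncertain band can be widened only as far as the available training data justifies.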

### Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition

3D action recognition – analysis of human actions based on 3D skeleton data – has recently become popular due to its succinctness, robustness, and view-invariant representation. Recent attempts at this problem suggested developing RNN-based learning methods to model the contextual dependency in the temporal domain. In this paper, we extend this idea to spatio-temporal domains to analyze the hidden sources of action-related information within the input data over both domains concurrently. Inspired by the graphical structure of the human skeleton, we further propose a more powerful tree-structure based traversal method. To handle the noise and occlusion in 3D skeleton data, we introduce a new gating mechanism within LSTM to learn the reliability of the sequential input data and accordingly adjust its effect on updating the long-term context information stored in the memory cell. Our method achieves state-of-the-art performance on 4 challenging benchmark datasets for 3D human action analysis.

Jun Liu, Amir Shahroudy, Dong Xu, Gang Wang

### Going Further with Point Pair Features

Point Pair Features (PPFs) are widely used to detect 3D objects in point clouds; however, they are prone to fail in the presence of sensor noise and background clutter. We introduce novel sampling and voting schemes that significantly reduce the influence of clutter and sensor noise. Our experiments show that with our improvements, PPFs become competitive against state-of-the-art methods, as they outperform them on several objects from challenging benchmarks, at a low computational cost.

Stefan Hinterstoisser, Vincent Lepetit, Naresh Rajkumar, Kurt Konolige

### Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames

Video recognition usually requires a large number of training samples, which are expensive to collect. An alternative and cheap solution is to draw from the large-scale images and videos on the Web. With modern search engines, the top-ranked images or videos are usually highly correlated to the query, implying the potential to harvest labeling-free Web images and videos for video recognition. However, there are two key difficulties that prevent us from using the Web data directly. First, they are typically noisy and may be from a completely different domain from that of users’ interest (e.g. cartoons). Second, Web videos are usually untrimmed and very lengthy, and some query-relevant frames are often hidden in between the irrelevant ones. A question thus naturally arises: to what extent can such noisy Web images and videos be utilized for labeling-free video recognition? In this paper, we propose a novel approach to mutually vote for relevant Web images and video frames, where two forces are balanced, i.e. aggressive matching and passive video frame selection. We validate our approach on three large-scale video recognition datasets.

Chuang Gan, Chen Sun, Lixin Duan, Boqing Gong

### HFS: Hierarchical Feature Selection for Efficient Image Segmentation

In this paper, we propose a real-time system, Hierarchical Feature Selection (HFS), that performs image segmentation at a speed of 50 frames per second. We make an attempt to improve the performance of previous image segmentation systems by focusing on two aspects: (1) a careful system implementation on modern GPUs for efficient feature computation; and (2) an effective hierarchical feature selection and fusion strategy with learning. Compared with classic segmentation algorithms, our system demonstrates its particular advantage in speed, with comparable results in segmentation quality. Adopting HFS in applications like salient object detection and object proposal generation results in a significant performance boost. Our proposed HFS system (to be open-sourced) can be used in a variety of computer vision tasks that are built on top of image segmentation and superpixel extraction.

Ming-Ming Cheng, Yun Liu, Qibin Hou, Jiawang Bian, Philip Torr, Shi-Min Hu, Zhuowen Tu

### Backmatter
