
2016 | Book

Computer Vision – ECCV 2016

14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II

About this book

The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.
The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action, activity and tracking; 3D; and 9 poster sessions.

Table of Contents

Frontmatter

Poster Session 2

Frontmatter
The Curious Robot: Learning Visual Representations via Physical Interactions

What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in the case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations, unlike current vision systems, which use only passive observations (images and videos downloaded from the web). For example, babies push objects, poke them, put them in their mouth and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision to a shared ConvNet architecture, allowing us to learn visual representations. We show the quality of learned representations by observing neuron activations and performing nearest neighbor retrieval on this learned representation. Quantitatively, we evaluate our learned ConvNet on image classification tasks and show improvements compared to learning without external data. Finally, on the task of instance retrieval, our network outperforms the ImageNet network on recall@1 by 3 %.

Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, Abhinav Gupta
Image Co-localization by Mimicking a Good Detector’s Confidence Score Distribution

Given a set of images containing objects from the same category, the task of image co-localization is to identify and localize each instance. This paper shows that this problem can be solved by a simple but intriguing idea, that is, a common object detector can be learnt by making its detection confidence scores distributed like those of a strongly supervised detector. More specifically, we observe that given a set of object proposals extracted from an image that contains the object of interest, an accurate strongly supervised object detector should give high scores to only a small minority of proposals, and low scores to most of them. Thus, we devise an entropy-based objective function to enforce the above property when learning the common object detector. Once the detector is learnt, we resort to a segmentation approach to refine the localization. We show that despite its simplicity, our approach outperforms the state of the art.

Yao Li, Lingqiao Liu, Chunhua Shen, Anton van den Hengel
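
The peaked-score property used in the entropy objective is easy to illustrate in a few lines of NumPy. The softmax normalisation over proposal scores below is an illustrative choice to make the entropy well defined, not the authors' exact formulation.

```python
import numpy as np

def score_entropy(scores):
    """Shannon entropy of the softmax-normalised proposal scores of one image."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.sum(p * np.log(p + 1e-12))

# A peaked score vector (one confident detection among the proposals) has low
# entropy, while a flat one (no clear detection) has high entropy; minimising
# an entropy-based objective pushes the learnt detector towards the former.
peaked = np.array([8.0, 0.1, 0.2, 0.0, 0.1])
flat = np.array([1.0, 1.1, 0.9, 1.0, 1.05])
print(score_entropy(peaked), score_entropy(flat))  # low vs. high
```
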
Facilitating and Exploring Planar Homogeneous Texture for Indoor Scene Understanding

Indoor scenes tend to be abundant with planar homogeneous texture, manifesting as regularly repeating scene elements along a plane. In this work, we propose to exploit such structure to facilitate high-level scene understanding. By robustly fitting a texture projection model to optimal dominant frequency estimates in image patches, we arrive at a projective-invariant method to localize such semantically meaningful regions in multi-planar scenes. The recovered projective parameters also allow an affine-ambiguous rectification in real-world images marred with outliers, room clutter, and photometric severities. Qualitative and quantitative results show our method outperforms existing representative work for both rectification and detection. We then explore the potential of homogeneous texture for two indoor scene understanding tasks. In scenes where vanishing points cannot be reliably detected, or the Manhattan assumption is not satisfied, homogeneous texture detected by the proposed approach provides alternative cues to obtain an indoor scene geometric layout. Second, low-level feature descriptors extracted upon affine rectification of detected texture are found to be not only class-discriminative but also complementary to features without rectification, improving recognition performance on the MIT Indoor67 benchmark.

Shahzor Ahmad, Loong-Fah Cheong
An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild

We investigate the problem of generalized zero-shot learning (GZSL). GZSL relaxes the unrealistic assumption in conventional zero-shot learning (ZSL) that test data belong only to unseen novel classes. In GZSL, test data might also come from seen classes and the labeling space is the union of both types of classes. We show empirically that a straightforward application of classifiers provided by existing ZSL approaches does not perform well in the setting of GZSL. Motivated by this, we propose a surprisingly simple but effective method to adapt ZSL approaches for GZSL. The main idea is to introduce a calibration factor to calibrate the classifiers for both seen and unseen classes so as to balance two conflicting forces: recognizing data from seen classes and those from unseen ones. We develop a new performance metric called the Area Under Seen-Unseen accuracy Curve to characterize this trade-off. We demonstrate the utility of this metric by analyzing existing ZSL approaches applied to the generalized setting. Extensive empirical studies reveal strengths and weaknesses of those approaches on three well-studied benchmark datasets, including the large-scale ImageNet with more than 20,000 unseen categories. We complement our comparative studies in learning methods by further establishing an upper bound on the performance limit of GZSL. In particular, our idea is to use class-representative visual features as the idealized semantic embeddings. We show that there is a large gap between the performance of existing approaches and the performance limit, suggesting that improving the quality of class semantic embeddings is vital to improving ZSL.

Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, Fei Sha
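
A minimal sketch of the calibration idea and of the Area Under Seen-Unseen accuracy Curve it induces: seen-class scores are reduced by a factor gamma before the argmax, and sweeping gamma traces the seen-vs-unseen accuracy trade-off. Function names, array layouts and the trapezoidal integration are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def calibrated_predict(scores, seen_mask, gamma):
    """scores: (n_samples, n_classes); seen_mask: bool array, True for seen classes."""
    adjusted = scores - gamma * seen_mask.astype(float)  # penalise seen classes
    return adjusted.argmax(axis=1)

def ausuc(scores, labels, seen_mask, gammas):
    """Area under the seen-vs-unseen accuracy curve traced by sweeping gamma."""
    acc_seen, acc_unseen = [], []
    sample_is_seen = seen_mask[labels]
    for g in gammas:
        correct = calibrated_predict(scores, seen_mask, g) == labels
        acc_seen.append(correct[sample_is_seen].mean())
        acc_unseen.append(correct[~sample_is_seen].mean())
    xs, ys = np.array(acc_seen), np.array(acc_unseen)
    order = np.argsort(xs)
    xs, ys = xs[order], ys[order]
    return np.sum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2.0)  # trapezoidal area
```
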
Modeling Context in Referring Expressions

Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images. In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to other objects within an image helps improve performance significantly. We also develop methods to tie the language generation process together, so that we generate expressions for all objects of a particular category jointly. Evaluation on three recent datasets - RefCOCO, RefCOCO+, and RefCOCOg (Datasets and toolbox can be downloaded from https://github.com/lichengunc/refer), shows the advantages of our methods for both referring expression generation and comprehension.

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg
Taxonomy-Regularized Semantic Deep Convolutional Neural Networks

We propose a novel convolutional network architecture that abstracts and differentiates the categories based on a given class hierarchy. We exploit grouped and discriminative information provided by the taxonomy, by focusing on the general and specific components that comprise each category, through the min- and difference-pooling operations. Without using any additional parameters or substantial increase in time complexity, our model is able to learn the features that are discriminative for classifying often confused sub-classes belonging to the same superclass, and thus improve the overall classification performance. We validate our method on CIFAR-100, Places-205, and ImageNet Animal datasets, on which our model obtains significant improvements over the base convolutional networks.

Wonjoon Goo, Juyong Kim, Gunhee Kim, Sung Ju Hwang
Playing for Data: Ground Truth from Computer Games

Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this paper, we present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data to supplement real-world images significantly increases accuracy and that the acquired data enables reducing the amount of hand-labeled real-world data: models trained with game data and just $\tfrac{1}{3}$ of the CamVid training set outperform models trained on the complete CamVid training set.

Stephan R. Richter, Vibhav Vineet, Stefan Roth, Vladlen Koltun
Human Re-identification in Crowd Videos Using Personal, Social and Environmental Constraints

This paper addresses the problem of human re-identification in videos of dense crowds. Re-identification in crowded scenes is a challenging problem due to the large number of people and frequent occlusions, coupled with changes in their appearance due to different properties and exposures of the cameras. To solve this problem, we model multiple Personal, Social and Environmental (PSE) constraints on human motion across cameras in crowded scenes. The personal constraints include appearance and preferred speed of each individual, while the social influences are modeled by grouping and collision avoidance. Finally, the environmental constraints model the transition probabilities between gates (entrances/exits). We incorporate these constraints into an energy minimization for solving human re-identification. Assigning 1–1 correspondence while modeling PSE constraints is NP-hard. We optimize using a greedy local neighborhood search algorithm to restrict the search space of hypotheses. We evaluated the proposed approach on several thousand frames of the PRID and Grand Central datasets, and obtained significantly better results compared to existing methods.

Shayan Modiri Assari, Haroon Idrees, Mubarak Shah
Revisiting Additive Quantization

We revisit Additive Quantization (AQ), an approach to vector quantization that uses multiple, full-dimensional, and non-orthogonal codebooks. Despite its elegant and simple formulation, AQ has failed to achieve state-of-the-art performance on standard retrieval benchmarks, because the encoding problem, which amounts to MAP inference in multiple fully-connected Markov Random Fields (MRFs), has proven to be hard to solve. We demonstrate that the performance of AQ can be improved to surpass the state of the art by leveraging iterated local search, a stochastic local search approach known to work well for a range of NP-hard combinatorial problems. We further show a direct application of our approach to a recent formulation of vector quantization that enforces sparsity of the codebooks. Unlike previous work, which required specialized optimization techniques, our formulation can be plugged directly into state-of-the-art lasso optimizers. This results in a conceptually simple, easily implemented method that outperforms the previous state of the art in solving sparse vector quantization. Our implementation is publicly available (https://github.com/jltmtz/local-search-quantization).

Julieta Martinez, Joris Clement, Holger H. Hoos, James J. Little
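
The encoding step can be sketched as iterated local search over the codes: greedy ICM sweeps that re-pick the best codeword in one codebook given the others, restarted from small random perturbations. Codebook shapes and the perturbation scheme below are illustrative assumptions, not the released implementation.

```python
import numpy as np

def icm_sweep(x, codebooks, codes):
    """One pass of iterated conditional modes over the M codebooks."""
    M = codebooks.shape[0]
    for m in range(M):
        residual = x - sum(codebooks[j, codes[j]] for j in range(M) if j != m)
        codes[m] = np.argmin(((codebooks[m] - residual) ** 2).sum(axis=1))
    return codes

def encode_ils(x, codebooks, n_restarts=8, n_perturb=1, seed=0):
    """codebooks: (M, K, D). Returns one code index per codebook for vector x."""
    rng = np.random.default_rng(seed)
    M, K, _ = codebooks.shape
    reconstruct = lambda c: codebooks[np.arange(M), c].sum(axis=0)
    best = icm_sweep(x, codebooks, rng.integers(K, size=M))
    best_err = np.sum((x - reconstruct(best)) ** 2)
    for _ in range(n_restarts):
        cand = best.copy()
        cand[rng.integers(M, size=n_perturb)] = rng.integers(K, size=n_perturb)  # perturb
        cand = icm_sweep(x, codebooks, cand)                                      # re-optimise
        err = np.sum((x - reconstruct(cand)) ** 2)
        if err < best_err:
            best, best_err = cand.copy(), err
    return best
```
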
Single Image Dehazing via Multi-scale Convolutional Neural Networks

The performance of existing image dehazing methods is limited by hand-designed features, such as the dark channel, color disparity and maximum contrast, with complex fusion schemes. In this paper, we propose a multi-scale deep neural network for single-image dehazing by learning the mapping between hazy images and their corresponding transmission maps. The proposed algorithm consists of a coarse-scale net which predicts a holistic transmission map based on the entire image, and a fine-scale net which refines results locally. To train the multi-scale deep network, we synthesize a dataset comprised of hazy images and corresponding transmission maps based on the NYU Depth dataset. Extensive experiments demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods on both synthetic and real-world images in terms of quality and speed.

Wenqi Ren, Si Liu, Hua Zhang, Jinshan Pan, Xiaochun Cao, Ming-Hsuan Yang
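
For context, the standard haze formation model that a predicted transmission map plugs into is I = J·t + A·(1 − t); given t and an estimate of the atmospheric light A, the clear image is recovered as below. The clipping threshold and known airlight are illustrative assumptions; the network itself is not reproduced here.

```python
import numpy as np

def recover_scene(hazy, transmission, airlight, t_min=0.1):
    """Invert I = J*t + A*(1 - t).
    hazy: HxWx3 image in [0, 1]; transmission: HxW map; airlight: length-3 vector."""
    t = np.clip(transmission, t_min, 1.0)[..., None]  # avoid division by tiny t
    clear = (hazy - airlight) / t + airlight
    return np.clip(clear, 0.0, 1.0)
```
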
Photometric Stereo Under Non-uniform Light Intensities and Exposures

This paper studies the effects of non-uniform light intensities and sensor exposures across observed images in photometric stereo. While conventional photometric stereo methods typically assume that light intensities are identical and sensor exposure is constant across observed images taken under varying lightings, these assumptions easily break down in practical settings due to individual light bulbs’ characteristics and limited control over sensors. We explicitly model these non-uniformities and develop a method for accurately determining surface normals without being affected by these factors. In addition, we show that our method is advantageous for general photometric stereo settings, where auto-exposure control is desirable. We compare our method with conventional least-squares and robust photometric stereo methods, and the experimental results show the superior accuracy of our method in this practical circumstance.

Donghyeon Cho, Yasuyuki Matsushita, Yu-Wing Tai, Inso Kweon
Visual Motif Discovery via First-Person Vision

Visual motifs are images of visual experiences that are significant and shared across many people, such as an image of an informative sign viewed by many people and that of a familiar social situation such as when interacting with a clerk at a store. The goal of this study is to discover visual motifs from a collection of first-person videos recorded by a wearable camera. To achieve this goal, we develop a commonality clustering method that leverages three important aspects: inter-video similarity, intra-video sparseness, and people’s visual attention. The problem is posed as normalized spectral clustering, and is solved efficiently using a weighted covariance matrix. Experimental results suggest the effectiveness of our method over several state-of-the-art methods in terms of both accuracy and efficiency of visual motif discovery.

Ryo Yonetani, Kris M. Kitani, Yoichi Sato
A Cluster Sampling Method for Image Matting via Sparse Coding

In this paper, we present a new image matting algorithm which solves two major problems encountered by previous sampling-based algorithms. The first is that existing sampling-based approaches typically rely on certain spatial assumptions in collecting samples from known regions, and thus their performance deteriorates if the underlying assumptions are not satisfied. Here, we propose a method that collects a more representative set of samples so as not to miss true samples. This is accomplished by clustering the foreground and background pixels and collecting samples from each of the clusters. The second problem is that the quality of the matting result is determined by the goodness of a single sample pair, which causes errors when sampling-based methods fail to select the best pairs. In this paper, we derive a new objective function for directly obtaining the estimate of the alpha matte from a set of samples. Comparison on a standard benchmark dataset demonstrates that the proposed approach generates more robust and accurate alpha mattes than state-of-the-art methods.

Xiaoxue Feng, Xiaohui Liang, Zili Zhang
Fundamental Matrices from Moving Objects Using Line Motion Barcodes

Computing the epipolar geometry between cameras with very different viewpoints is often very difficult. The appearance of objects can vary greatly, and it is difficult to find corresponding feature points. Prior methods searched for corresponding epipolar lines using points on the convex hull of the silhouette of a single moving object. These methods fail when the scene includes multiple moving objects. This paper extends previous work to scenes having multiple moving objects by using “Motion Barcodes”, a temporal signature of lines. Corresponding epipolar lines have similar motion barcodes, and candidate pairs of corresponding epipolar lines are found by the similarity of their motion barcodes. As in previous methods we assume that cameras are relatively stationary and that moving objects have already been extracted using background subtraction.

Yoni Kasten, Gil Ben-Artzi, Shmuel Peleg, Michael Werman
Fashion Landmark Detection in the Wild

Visual fashion analysis has attracted much attention in recent years. Previous work represented clothing regions by either bounding boxes or human joints. This work presents fashion landmark detection or fashion alignment, which is to predict the positions of functional key points defined on the fashion items, such as the corners of neckline, hemline, and cuff. To encourage future studies, we introduce a fashion landmark dataset (The dataset is available at http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion/LandmarkDetection.html.) with over 120K images, where each image is labeled with eight landmarks. With this dataset, we study fashion alignment by cascading multiple convolutional neural networks in three stages. These stages gradually improve the accuracies of landmark predictions. Extensive experiments demonstrate the effectiveness of the proposed method, as well as its generalization ability to pose estimation. Fashion landmarks are also compared to clothing bounding boxes and human joints in two applications, fashion attribute prediction and clothes retrieval, showing that fashion landmarks are a more discriminative representation for understanding fashion images.

Ziwei Liu, Sijie Yan, Ping Luo, Xiaogang Wang, Xiaoou Tang
Human Pose Estimation Using Deep Consensus Voting

In this paper we consider the problem of human pose estimation from a single still image. We propose a novel approach where each location in the image votes for the position of each keypoint using a convolutional neural net. The voting scheme allows us to utilize information from the whole image, rather than rely on a sparse set of keypoint locations. Using dense, multi-target votes not only produces good keypoint predictions, but also enables us to compute image-dependent joint keypoint probabilities by looking at consensus voting. This differs from most previous methods where joint probabilities are learned from relative keypoint locations and are independent of the image. We finally combine the keypoint votes and joint probabilities in order to identify the optimal pose configuration. We show our competitive performance on the MPII Human Pose and Leeds Sports Pose datasets.

Ita Lifshitz, Ethan Fetaya, Shimon Ullman
Leveraging Visual Question Answering for Image-Caption Ranking

Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a “feature extraction” module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1 % and on image retrieval by 4.4 % on the MSCOCO dataset.

Xiao Lin, Devi Parikh
DAVE: A Unified Framework for Fast Vehicle Detection and Annotation

Vehicle detection and annotation for streaming video data with complex scenes is an interesting but challenging task for urban traffic surveillance. In this paper, we present a fast framework of Detection and Annotation for Vehicles (DAVE), which effectively combines vehicle detection and attributes annotation. DAVE consists of two convolutional neural networks (CNNs): a fast vehicle proposal network (FVPN) for vehicle-like objects extraction and an attributes learning network (ALN) aiming to verify each proposal and infer each vehicle’s pose, color and type simultaneously. These two nets are jointly optimized so that abundant latent knowledge learned from the ALN can be exploited to guide FVPN training. Once the system is trained, it can achieve efficient vehicle detection and annotation for real-world traffic surveillance data. We evaluate DAVE on a new self-collected UTS dataset and the public PASCAL VOC2007 car and LISA 2010 datasets, with consistent improvements over existing algorithms.

Yi Zhou, Li Liu, Ling Shao, Matt Mellor
Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input

Real-time simultaneous tracking of hands manipulating and interacting with external objects has many potential applications in augmented reality, tangible computing, and wearable computing. However, due to difficult occlusions, fast motions, and uniform hand appearance, jointly tracking hand and object pose is more challenging than tracking either of the two separately. Many previous approaches resort to complex multi-camera setups to remedy the occlusion problem and often employ expensive segmentation and optimization steps which makes real-time tracking impossible. In this paper, we propose a real-time solution that uses a single commodity RGB-D camera. The core of our approach is a 3D articulated Gaussian mixture alignment strategy tailored to hand-object tracking that allows fast pose optimization. The alignment energy uses novel regularizers to address occlusions and hand-object contacts. For added robustness, we guide the optimization with discriminative part classification of the hand and segmentation of the object. We conducted extensive experiments on several existing datasets and introduce a new annotated hand-object dataset. Quantitative and qualitative results show the key advantages of our method: speed, accuracy, and robustness.

Srinath Sridhar, Franziska Mueller, Michael Zollhöfer, Dan Casas, Antti Oulasvirta, Christian Theobalt
DeepWarp: Photorealistic Image Resynthesis for Gaze Manipulation

In this work, we consider the task of generating highly-realistic images of a given face with a redirected gaze. We treat this problem as a specific instance of conditional image generation and suggest a new deep architecture that can handle this task very well, as revealed by numerical comparison with prior art and a user study. Our deep architecture performs coarse-to-fine warping with an additional intensity correction of individual pixels. All these operations are performed in a feed-forward manner, and the parameters associated with different operations are learned jointly in an end-to-end fashion. After learning, the resulting neural network can synthesize images with manipulated gaze, while the redirection angle can be selected arbitrarily from a certain range and provided as an input to the network.

Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, Victor Lempitsky
Non-rigid 3D Shape Retrieval via Large Margin Nearest Neighbor Embedding

In this paper, we propose a highly efficient metric learning approach to non-rigid 3D shape analysis. From a training set of 3D shapes from different classes, we learn a transformation of the shapes which optimally enforces a clustering of shapes from the same class. In contrast to existing approaches, we do not perform a transformation of individual local point descriptors, but a linear embedding of the entire distribution of shape descriptors. It turns out that this embedding of the input shapes is sufficiently powerful to enable state of the art retrieval performance using a simple nearest neighbor classifier. We demonstrate experimentally that our approach substantially outperforms the state of the art non-rigid 3D shape retrieval methods on the recent benchmark data set SHREC’14 Non-Rigid 3D Human Models, both in classification accuracy and runtime.

Ioannis Chiotellis, Rudolph Triebel, Thomas Windheuser, Daniel Cremers
Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation

Zero-Shot Learning (ZSL) promises to scale visual recognition by bypassing the conventional model training requirement of annotated examples for every category. This is achieved by establishing a mapping connecting low-level features and a semantic description of the label space, referred to as the visual-semantic mapping, on auxiliary data. Re-using the learned mapping to project target videos into an embedding space thus allows novel classes to be recognised by nearest neighbour inference. However, existing ZSL methods suffer from auxiliary-target domain shift intrinsically induced by assuming the same mapping for the disjoint auxiliary and target classes. This compromises the generalisation accuracy of ZSL recognition on the target data. In this work, we improve the ability of ZSL to generalise across this domain shift in both model- and data-centric ways by formulating a visual-semantic mapping with better generalisation properties and a dynamic data re-weighting method to prioritise auxiliary data that are relevant to the target classes. Specifically: (1) We introduce a multi-task visual-semantic mapping to improve generalisation by constraining the semantic mapping parameters to lie on a low-dimensional manifold, (2) We explore prioritised data augmentation by expanding the pool of auxiliary data with additional instances weighted by relevance to the target domain. The proposed new model is applied to the challenging zero-shot action recognition problem to demonstrate its advantages over existing ZSL models.

Xun Xu, Timothy M. Hospedales, Shaogang Gong
Head Reconstruction from Internet Photos

3D face reconstruction from Internet photos has recently produced exciting results. A person’s face, e.g., Tom Hanks, can be modeled and animated in 3D from a completely uncalibrated photo collection. Most methods, however, focus solely on face area and mask out the rest of the head. This paper proposes that head modeling from the Internet is a problem we can solve. We target reconstruction of the rough shape of the head. Our method is to gradually “grow” the head mesh starting from the frontal face and extending to the rest of views using photometric stereo constraints. We call our method boundary-value growing algorithm. Results on photos of celebrities downloaded from the Internet are presented.

Shu Liang, Linda G. Shapiro, Ira Kemelmacher-Shlizerman
Support Discrimination Dictionary Learning for Image Classification

Dictionary learning has been successfully applied in image classification. However, many dictionary learning methods encode only a single image at a time during training and thus ignore the correlation and other useful information contained within the entire training set. In this paper, we propose a new principle that uses the support of the coefficients to measure the similarity between pairs of coefficients, instead of using Euclidean distance directly. More specifically, we propose a support discrimination dictionary learning method, which finds a dictionary under which the coefficients of images from the same class have a common sparse structure while the size of the overlapped signal support of different classes is minimised. In addition, adopting a shared dictionary in a multi-task learning setting, this method can find the number and position of associated dictionary atoms for each class automatically by using structured sparsity on a group of images. The proposed model is extensively evaluated using various image datasets, and it shows superior performance to many state-of-the-art dictionary learning methods.

Yang Liu, Wei Chen, Qingchao Chen, Ian Wassell
Accelerating the Super-Resolution Convolutional Neural Network

As a successful deep model applied in image super-resolution (SR), the Super-Resolution Convolutional Neural Network (SRCNN) [1, 2] has demonstrated superior performance to the previous hand-crafted models in both speed and restoration quality. However, the high computational cost still hinders it from practical usage that demands real-time performance (24 fps). In this paper, we aim at accelerating the current SRCNN, and propose a compact hourglass-shape CNN structure for faster and better SR. We re-design the SRCNN structure mainly in three aspects. First, we introduce a deconvolution layer at the end of the network, so that the mapping is learned directly from the original low-resolution image (without interpolation) to the high-resolution one. Second, we reformulate the mapping layer by shrinking the input feature dimension before mapping and expanding back afterwards. Third, we adopt smaller filter sizes but more mapping layers. The proposed model achieves a speed-up of more than 40 times with even superior restoration quality. Further, we present the parameter settings that can achieve real-time performance on a generic CPU while still maintaining good performance. A corresponding transfer strategy is also proposed for fast training and testing across different upscaling factors.

Chao Dong, Chen Change Loy, Xiaoou Tang
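
A hedged PyTorch sketch of an hourglass-shaped SR network along the lines described: feature extraction, channel shrinking, a few small mapping layers, expansion, and a final deconvolution that upsamples directly from the low-resolution input. Layer widths, kernel sizes and the class name are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HourglassSR(nn.Module):
    def __init__(self, scale=3, d=56, s=12, m=4):
        super().__init__()
        layers = [nn.Conv2d(1, d, 5, padding=2), nn.PReLU(),   # feature extraction
                  nn.Conv2d(d, s, 1), nn.PReLU()]              # shrink channels
        for _ in range(m):                                     # small mapping layers
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU()]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU()]             # expand channels
        self.body = nn.Sequential(*layers)
        # The deconvolution learns the upsampling, so no bicubic pre-interpolation is needed.
        self.deconv = nn.ConvTranspose2d(d, 1, 9, stride=scale,
                                         padding=4, output_padding=scale - 1)

    def forward(self, x):
        return self.deconv(self.body(x))

# y = HourglassSR(scale=3)(torch.zeros(1, 1, 24, 24))  # -> shape (1, 1, 72, 72)
```
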
Symmetric Non-rigid Structure from Motion for Category-Specific Object Structure Estimation

Many objects, especially those made by humans, are symmetric, e.g. cars and aeroplanes. This paper addresses the estimation of 3D structures of symmetric objects from multiple images of the same object category, e.g. different cars, seen from various viewpoints. We assume that the deformation between different instances from the same object category is non-rigid and symmetric. In this paper, we extend two leading non-rigid structure from motion (SfM) algorithms to exploit symmetry constraints. We model both methods as energy minimization, in which we also recover the missing observations caused by occlusions. In particular, we show that by rotating the coordinate system, the energy can be decoupled into two independent terms, which still exploit symmetry, to apply matrix factorization separately on each of them for initialization. The results on the Pascal3D+ dataset show that our methods significantly improve performance over baseline methods.

Yuan Gao, Alan L. Yuille
Peak-Piloted Deep Network for Facial Expression Recognition

Objective functions for training of deep networks for face-related recognition tasks, such as facial expression recognition (FER), usually consider each sample independently. In this work, we present a novel peak-piloted deep network (PPDN) that uses a sample with peak expression (easy sample) to supervise the intermediate feature responses for a sample of non-peak expression (hard sample) of the same type and from the same subject. The expression evolving process from non-peak expression to peak expression can thus be implicitly embedded in the network to achieve the invariance to expression intensities. A special-purpose back-propagation procedure, peak gradient suppression (PGS), is proposed for network training. It drives the intermediate-layer feature responses of non-peak expression samples towards those of the corresponding peak expression samples, while avoiding the inverse. This avoids degrading the recognition capability for samples of peak expression due to interference from their non-peak expression counterparts. Extensive comparisons on two popular FER datasets, Oulu-CASIA and CK+, demonstrate the superiority of the PPDN over state-of-the-art FER methods, as well as the advantages of both the network structure and the optimization strategy. Moreover, it is shown that PPDN is a general architecture, extensible to other tasks by proper definition of peak and non-peak samples. This is validated by experiments that show state-of-the-art performance on pose-invariant face recognition, using the Multi-PIE dataset.

Xiangyun Zhao, Xiaodan Liang, Luoqi Liu, Teng Li, Yugang Han, Nuno Vasconcelos, Shuicheng Yan
Is Faster R-CNN Doing Well for Pedestrian Detection?

Pedestrian detection has arguably been addressed as a special topic beyond general object detection. Although recent deep learning object detectors such as Fast/Faster R-CNN have shown excellent performance for general object detection, they have limited success for detecting pedestrians, and previous leading pedestrian detectors were in general hybrid methods combining hand-crafted and deep convolutional features. In this paper, we investigate issues involving Faster R-CNN for pedestrian detection. We discover that the Region Proposal Network (RPN) in Faster R-CNN indeed performs well as a stand-alone pedestrian detector, but surprisingly, the downstream classifier degrades the results. We argue that two reasons account for the unsatisfactory accuracy: (i) insufficient resolution of feature maps for handling small instances, and (ii) lack of any bootstrapping strategy for mining hard negative examples. Driven by these observations, we propose a very simple but effective baseline for pedestrian detection, using an RPN followed by boosted forests on shared, high-resolution convolutional feature maps. We comprehensively evaluate this method on several benchmarks (Caltech, INRIA, ETH, and KITTI), presenting competitive accuracy and good speed. Code will be made publicly available.

Liliang Zhang, Liang Lin, Xiaodan Liang, Kaiming He
Coarse-to-fine Planar Regularization for Dense Monocular Depth Estimation

Simultaneous localization and mapping (SLAM) using the whole image data is an appealing framework to address shortcomings of sparse feature-based methods – in particular frequent failures in textureless environments. Hence, direct methods bypassing the need for feature extraction and matching have recently become popular. Many of these methods operate by alternating between pose estimation and computing (semi-)dense depth maps, and are therefore not fully exploiting the advantages of joint optimization with respect to depth and pose. In this work, we propose a framework for monocular SLAM, and its local model in particular, which optimizes simultaneously over depth and pose. In addition to a planarity enforcing smoothness regularizer for the depth we also constrain the complexity of depth map updates, which provides a natural way to avoid poor local minima and reduces unknowns in the optimization. Starting from a holistic objective we develop a method suitable for online and real-time monocular SLAM. We evaluate our method quantitatively in pose and depth on the TUM dataset, and qualitatively on our own video sequences.

Stephan Liwicki, Christopher Zach, Ondrej Miksik, Philip H. S. Torr
Deep Attributes Driven Multi-camera Person Re-identification

The visual appearance of a person is easily affected by many factors like pose variations, viewpoint changes and camera parameter differences. This makes person Re-Identification (ReID) among multiple cameras a very challenging task. This work is motivated by learning mid-level human attributes which are robust to such visual appearance variations. We propose a semi-supervised attribute learning framework which progressively boosts the accuracy of attributes using only a limited number of labeled data. Specifically, this framework involves a three-stage training. A deep Convolutional Neural Network (dCNN) is first trained on an independent dataset labeled with attributes. Then it is fine-tuned on another dataset only labeled with person IDs using our defined triplet loss. Finally, the updated dCNN predicts attribute labels for the target dataset, which is combined with the independent dataset for the final round of fine-tuning. The predicted attributes, namely deep attributes, exhibit superior generalization ability across different datasets. By directly using the deep attributes with simple Cosine distance, we have obtained surprisingly good accuracy on four person ReID datasets. Experiments also show that a simple distance metric learning module further boosts our method, making it significantly outperform many recent works.

Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, Qi Tian
An Occlusion-Resistant Ellipse Detection Method by Joining Coelliptic Arcs

In this study, we propose an ellipse detection method which gives promising results in occlusion cases. The method starts with detection of edge segments. Then we extract elliptical arcs by computing corners and fitting an ellipse to the pixels between two consecutive corners. Once the elliptical arcs are extracted, we aim to test all possible arc subsets. However, this requires exponential complexity and the runtime diverges as the number of arcs increases. To accelerate the process, an arc pairing strategy is deployed by using conic properties of arcs. If any pair is found to be non-coelliptic, then arc combinations including that pair are eliminated. Therefore the number of possible arc subsets is reduced and the computation time is improved. In the end, ellipse fitting is applied to the remaining arc combinations to decide on final ellipses. Performance of the proposed algorithm is tested on real datasets, and better results have been obtained compared to state-of-the-art algorithms.

Halil Ibrahim Cakir, Cihan Topal, Cuneyt Akinlar
Branching Path Following for Graph Matching

Recently, graph matching algorithms utilizing the path following strategy have exhibited state-of-the-art performances. However, the paths computed in these algorithms often contain singular points, which usually hurt the matching performance. To deal with this issue, in this paper we propose a novel path following strategy, named branching path following (BPF), which consequently improves graph matching performance. In particular, we first propose a singular point detector by solving a KKT system, and then design a branch switching method to seek better paths at singular points. Using BPF, a new graph matching algorithm named BPF-G is developed by applying BPF to a recently proposed path following algorithm named GNCCP (Liu & Qiao 2014). For evaluation, we compare BPF-G with several recently proposed graph matching algorithms on a synthetic dataset and four public benchmark datasets. Experimental results show that our approach achieves remarkable improvement in matching accuracy and outperforms other algorithms.

Tao Wang, Haibin Ling, Congyan Lang, Jun Wu
Higher Order Conditional Random Fields in Deep Neural Networks

We address the problem of semantic segmentation using deep learning. Most segmentation systems include a Conditional Random Field (CRF) to produce a structured output that is consistent with the image’s visual features. Recent deep learning approaches have incorporated CRFs into Convolutional Neural Networks (CNNs), with some even training the CRF end-to-end with the rest of the network. However, these approaches have not employed higher order potentials, which have previously been shown to significantly improve segmentation performance. In this paper, we demonstrate that two types of higher order potential, based on object detections and superpixels, can be included in a CRF embedded within a deep network. We design these higher order potentials to allow inference with the differentiable mean field algorithm. As a result, all the parameters of our richer CRF model can be learned end-to-end with our pixelwise CNN classifier. We achieve state-of-the-art segmentation performance on the PASCAL VOC benchmark with these trainable higher order potentials.

Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, Philip H. S. Torr
LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling

Semantic labeling of RGB-D scenes is crucial to many intelligent applications including perceptual robotics. It generates pixelwise and fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses this problem by (i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) Model that captures and fuses contextual information from multiple channels of photometric and depth data, and (ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specifically, contexts in photometric and depth channels are, respectively, captured by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer is set up to integrate the contexts along the vertical direction from different channels, and perform bi-directional propagation of the fused vertical contexts along the horizontal direction to obtain true 2D global contexts. At last, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art, i.e., 48.1% and 49.4% average class accuracy over 37 categories (2.2% and 5.4% improvement) on the large-scale SUNRGBD dataset and the NYUDv2 dataset, respectively.

Zhen Li, Yukang Gan, Xiaodan Liang, Yizhou Yu, Hui Cheng, Liang Lin
Stereo Video Deblurring

Videos acquired in low-light conditions often exhibit motion blur, which depends on the motion of the objects relative to the camera. This is not only visually unpleasing, but can hamper further processing. With this paper we are the first to show how the availability of stereo video can aid the challenging video deblurring task. We leverage 3D scene flow, which can be estimated robustly even under adverse conditions. We go beyond simply determining the object motion in two ways: First, we show how a piecewise rigid 3D scene flow representation allows to induce accurate blur kernels via local homographies. Second, we exploit the estimated motion boundaries of the 3D scene flow to mitigate ringing artifacts using an iterative weighting scheme. Being aware of 3D object motion, our approach can deal robustly with an arbitrary number of independently moving objects. We demonstrate its benefit over state-of-the-art video deblurring using quantitative and qualitative experiments on rendered scenes and real videos.

Anita Sellent, Carsten Rother, Stefan Roth
Robust Image and Video Dehazing with Visual Artifact Suppression via Gradient Residual Minimization

Most existing image dehazing methods tend to boost local image contrast for regions with heavy haze. Without special treatment, these methods may significantly amplify existing image artifacts such as noise, color aliasing and blocking, which are mostly invisible in the input images but are visually intrusive in the results. This is especially the case for low quality cellphone shots or compressed video frames. The recent work of Li et al. (2014) addresses blocking artifacts for dehazing, but is insufficient to handle other artifacts. In this paper, we propose a new method for reliable suppression of different types of visual artifacts in image and video dehazing. Our method makes contributions in both the haze estimation step and the image recovery step. Firstly, an image-guided, depth-edge-aware smoothing algorithm is proposed to refine the initial atmospheric transmission map generated by local priors. In the image recovery process, we propose Gradient Residual Minimization (GRM) for jointly recovering the haze-free image while explicitly minimizing possible visual artifacts in it. Our evaluation suggests that the proposed method can generate results with far fewer visual artifacts than previous approaches for lower quality inputs such as compressed video clips.

Chen Chen, Minh N. Do, Jue Wang
Smooth Neighborhood Structure Mining on Multiple Affinity Graphs with Applications to Context-Sensitive Similarity

Due to its ability to capture geometric structures of the data manifold, the diffusion process has demonstrated impressive performance in retrieval tasks by spreading the similarities on the affinity graph. In view of robustness to noisy edges, the diffusion process is usually localized, i.e., only propagating similarities via neighbors. However, selecting neighbors smoothly on graph-based manifolds has been more or less ignored by previous works. In this paper, we propose a new algorithm called Smooth Neighborhood (SN) that mines the neighborhood structure to satisfy the manifold assumption. By doing so, nearby points on the underlying manifold are guaranteed to yield similar neighbors as much as possible. Moreover, SN is adjusted to tackle multiple affinity graphs by imposing a weight learning paradigm, and this is the primary difference compared with related works, which are only applicable to one affinity graph. Extensive experimental results and comparisons against other algorithms demonstrate the effectiveness of the proposed algorithm.

Song Bai, Shaoyan Sun, Xiang Bai, Zhaoxiang Zhang, Qi Tian
Title Generation for User Generated Videos

A great video title describes the most salient event compactly and captures the viewer’s attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title automatically is a very useful task, it is much less addressed than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task. First, we make video captioners highlight-sensitive by priming them with a highlight detector. Our framework allows for jointly training a model for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. This means that a large number of sentences might be required to learn the sentence structure of titles. Hence, we propose a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos. We collected a large-scale Video Titles in the Wild (VTW) dataset of 18100 automatically crawled user-generated videos and titles. On VTW, our methods consistently improve title prediction accuracy, and achieve the best performance in both automatic and human evaluation. Finally, our sentence augmentation method also outperforms the baselines on the M-VAD dataset.

Kuo-Hao Zeng, Tseng-Hung Chen, Juan Carlos Niebles, Min Sun
Natural Image Matting Using Deep Convolutional Neural Networks

We propose a deep Convolutional Neural Network (CNN) method for natural image matting. Our method takes results of the closed-form matting, results of the KNN matting and normalized RGB color images as inputs, and directly learns an end-to-end mapping between the inputs and reconstructed alpha mattes. We analyze the pros and cons of the closed-form matting and the KNN matting in terms of local and nonlocal principles, and show that they are complementary to each other. A major benefit of our method is that it can “recognize” different local image structures, and then combine results of local (closed-form matting) and nonlocal (KNN matting) matting effectively to achieve higher quality alpha mattes than both of its inputs. Extensive experiments demonstrate that our proposed deep CNN matting produces visually and quantitatively high-quality alpha mattes. In addition, our method has achieved the highest ranking in the public alpha matting evaluation dataset in terms of the sum of absolute differences, mean squared errors, and gradient errors.

Donghyeon Cho, Yu-Wing Tai, Inso Kweon
Double-Opponent Vectorial Total Variation

We present a new vectorial total variation (VTV) method that addresses the problem of color consistent image filtering. Our approach combines insights based on the double-opponent cell representation in the visual cortex with state-of-the-art variational modelling using VTV regularization. Existing methods of vectorial total variation regularizers have insufficient (even no) coupling between the color channels and thus may introduce color artifacts. We address this problem by introducing a novel color channel coupling inspired by a pullback metric from an opponent space to the observation space. We show existence and uniqueness of a solution in the space of vectorial functions of bounded variation. In experiments, we demonstrate that our novel approach compares favorably to state-of-the-art methods w.r.t. structure coherence and color consistency.

Freddie Åström, Christoph Schnörr
Learning to Count with CNN Boosting

In this paper, we address the task of object counting in images. We follow modern learning approaches in which a density map is estimated directly from the input image. We employ CNNs and incorporate two significant improvements to state-of-the-art methods: layered boosting and selective sampling. As a result, we manage both to increase the counting accuracy and to reduce processing time. Moreover, we show that the proposed method is effective, even in the presence of labeling errors. Extensive experiments on five different datasets demonstrate the efficacy and robustness of our approach. Mean absolute error was reduced by 20 % to 35 %. At the same time, the training time of each CNN has been reduced by 50 %.

Elad Walach, Lior Wolf
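
The density-map target such counting networks regress to can be built in a few lines: each annotated object contributes a small Gaussian, so integrating the map recovers the count. The kernel width is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, sigma=4.0):
    """points: iterable of (row, col) object annotations; shape: (H, W)."""
    impulses = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        impulses[int(r), int(c)] += 1.0
    return gaussian_filter(impulses, sigma)  # smoothing preserves the total mass

dm = density_map([(10, 12), (30, 40), (31, 42)], (64, 64))
print(dm.sum())  # ~3.0, i.e. the object count
```
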
Amodal Instance Segmentation

We consider the problem of amodal instance segmentation, the objective of which is to predict the region encompassing both visible and occluded parts of each object. Thus far, the lack of publicly available amodal segmentation annotations has stymied the development of amodal segmentation methods. In this paper, we sidestep this issue by relying solely on standard modal instance segmentation annotations to train our model. The result is a new method for amodal instance segmentation, which represents the first such method to the best of our knowledge. We demonstrate the proposed method’s effectiveness both qualitatively and quantitatively.

Ke Li, Jitendra Malik
Perceptual Losses for Real-Time Style Transfer and Super-Resolution

We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al. in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.

Justin Johnson, Alexandre Alahi, Li Fei-Fei
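
A sketch of a feature-reconstruction (perceptual) loss of the kind described, comparing activations of a fixed pretrained VGG-16 at one layer instead of per-pixel differences. The layer index, the torchvision weights string, and the assumption of ImageNet-normalised inputs are illustrative; this is not the authors' training code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layer=8):  # index 8 is roughly relu2_2 in VGG-16's feature stack
        super().__init__()
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layer + 1].eval()
        for p in self.features.parameters():   # the loss network stays fixed
            p.requires_grad_(False)

    def forward(self, output, target):
        # Both inputs are assumed to be ImageNet-normalised RGB batches.
        return F.mse_loss(self.features(output), self.features(target))
```
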

Optimization

Frontmatter
An Efficient Fusion Move Algorithm for the Minimum Cost Lifted Multicut Problem

Many computer vision problems can be cast as an optimization problem whose feasible solutions are decompositions of a graph. The minimum cost lifted multicut problem is such an optimization problem. Its objective function can penalize or reward all decompositions for which any given pair of nodes are in distinct components. While this property has many potential applications, such applications are hampered by the fact that the problem is NP-hard. We propose a fusion move algorithm that computes feasible solutions better and more efficiently than existing algorithms. We demonstrate this and applications to image segmentation, obtaining a new state of the art for a problem in biological image analysis.

Thorsten Beier, Björn Andres, Ullrich Köthe, Fred A. Hamprecht
$\ell^0$-Sparse Subspace Clustering

Subspace clustering methods with sparsity prior, such as Sparse Subspace Clustering (SSC) [1], are effective in partitioning the data that lie in a union of subspaces. Most of those methods require certain assumptions, e.g. independence or disjointness, on the subspaces. These assumptions are not guaranteed to hold in practice and they limit the application of existing sparse subspace clustering methods. In this paper, we propose $\ell^0$-induced sparse subspace clustering ($\ell^0$-SSC). In contrast to the required assumptions, such as independence or disjointness, on subspaces for most existing sparse subspace clustering methods, we prove that subspace-sparse representation, a key element in subspace clustering, can be obtained by $\ell^0$-SSC for arbitrary distinct underlying subspaces almost surely under the mild i.i.d. assumption on the data generation. We also present the “no free lunch” theorem that obtaining the subspace representation under our general assumptions cannot be much computationally cheaper than solving the corresponding $\ell^0$ problem of $\ell^0$-SSC. We develop a novel approximate algorithm named Approximate $\ell^0$-SSC (A$\ell^0$-SSC) that employs proximal gradient descent to obtain a sub-optimal solution to the optimization problem of $\ell^0$-SSC with theoretical guarantee, and the sub-optimal solution is used to build a sparse similarity matrix for clustering. Extensive experimental results on various data sets demonstrate the superiority of A$\ell^0$-SSC compared to other competing clustering methods.

Yingzhen Yang, Jiashi Feng, Nebojsa Jojic, Jianchao Yang, Thomas S. Huang
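
The proximal-gradient idea can be sketched for a generic $\ell^0$-regularised least-squares problem $\min_c \|x - Dc\|^2 + \lambda\|c\|_0$, whose proximal operator is hard thresholding. Step size, penalty weight and iteration count below are illustrative; this is not the paper's A$\ell^0$-SSC implementation.

```python
import numpy as np

def hard_threshold(c, thresh):
    out = c.copy()
    out[np.abs(out) < thresh] = 0.0
    return out

def l0_prox_gradient(x, D, lam=0.1, n_iter=200):
    """Proximal gradient descent for min_c ||x - D c||^2 + lam * ||c||_0."""
    t = 0.5 / np.linalg.norm(D, 2) ** 2        # 1/L, L = Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ c - x)
        c = hard_threshold(c - t * grad, np.sqrt(2.0 * t * lam))  # prox of the l0 penalty
    return c
```
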
Normalized Cut Meets MRF

We propose a new segmentation or clustering model that combines Markov Random Field (MRF) and Normalized Cut (NC) objectives. Both NC and MRF models are widely used in machine learning and computer vision, but they were not combined before due to significant differences in the corresponding optimization, e.g. spectral relaxation and combinatorial max-flow techniques. On the one hand, we show that many common applications for multi-label MRF segmentation energies can benefit from a high-order NC term, e.g. enforcing balanced clustering of arbitrary high-dimensional image features combining color, texture, location, depth, motion, etc. On the other hand, standard NC applications benefit from an inclusion of common pairwise or higher-order MRF constraints, e.g. edge alignment, bin-consistency, label cost, etc. To address NC+MRF energy, we propose two efficient multi-label combinatorial optimization techniques, spectral cut and kernel cut, using new unary bounds for different NC formulations.

Meng Tang, Dmitrii Marin, Ismail Ben Ayed, Yuri Boykov
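
For reference, the combined objective has the form "Normalized Cut term plus MRF term". The sketch below evaluates one plausible instantiation, a multi-label NC term over an affinity matrix plus a pairwise Potts penalty, for a fixed labeling. It is a hypothetical illustration of such an energy, not the paper's spectral-cut or kernel-cut optimizers.

```python
import numpy as np

def nc_plus_potts_energy(A, P, labels, gamma=1.0):
    """Evaluate an NC + Potts-MRF energy for a fixed labeling.

    A      : (n, n) symmetric affinity matrix for the NC term.
    P      : (n, n) symmetric pairwise MRF weights (Potts).
    labels : np.ndarray of integer labels in {0, ..., K-1}.
    Hypothetical illustration only.
    """
    K = labels.max() + 1
    deg = A.sum(axis=1)
    nc = 0.0
    for k in range(K):
        mask = (labels == k)
        assoc = deg[mask].sum()                   # assoc(S_k, V)
        within = A[np.ix_(mask, mask)].sum()      # assoc(S_k, S_k)
        if assoc > 0:
            nc += (assoc - within) / assoc        # cut(S_k, V \ S_k) / assoc(S_k, V)
    # Potts term: pay P[i, j] whenever i and j take different labels
    potts = gamma * (P * (labels[:, None] != labels[None, :])).sum() / 2.0
    return nc + potts
```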
Fast Global Registration

We present an algorithm for fast global registration of partially overlapping 3D surfaces. The algorithm operates on candidate matches that cover the surfaces. A single objective is optimized to align the surfaces and disable false matches. The objective is defined densely over the surfaces and the optimization achieves tight alignment with no initialization. No correspondence updates or closest-point queries are performed in the inner loop. An extension of the algorithm can perform joint global registration of many partially overlapping surfaces. Extensive experiments demonstrate that the presented approach matches or exceeds the accuracy of state-of-the-art global registration pipelines, while being at least an order of magnitude faster. Remarkably, the presented approach is also faster than local refinement algorithms such as ICP. It provides the accuracy achieved by well-initialized local refinement algorithms, without requiring an initialization and at lower computational cost.

Qian-Yi Zhou, Jaesik Park, Vladlen Koltun
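
The core idea, optimizing a single robust objective over fixed candidate matches with no closest-point queries in the inner loop, can be illustrated with iteratively reweighted least squares under a Geman-McClure penalty: false matches receive weights near zero and stop influencing the alignment. The sketch below is our simplified rendering (fixed scale parameter `mu`, no annealing schedule), not the authors' implementation.

```python
import numpy as np

def weighted_rigid_transform(P, Q, w):
    """Closed-form weighted least-squares rigid alignment (Kabsch/Umeyama)."""
    w = w / w.sum()
    mp, mq = (w[:, None] * P).sum(0), (w[:, None] * Q).sum(0)
    H = (w[:, None] * (P - mp)).T @ (Q - mq)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = mq - R @ mp
    return R, t

def robust_register(P, Q, mu=1.0, n_iter=50):
    """Align fixed candidate matches P[i] <-> Q[i] (both (N, 3) arrays) by
    minimizing a Geman-McClure penalty via iteratively reweighted least squares.
    Simplified sketch of the idea, not the paper's exact solver."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(n_iter):
        r2 = ((P @ R.T + t - Q) ** 2).sum(1)   # squared residuals of all matches
        w = (mu / (mu + r2)) ** 2              # Geman-McClure line-process weights
        R, t = weighted_rigid_transform(P, Q, w)
    return R, t
```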

Poster Session 3

Frontmatter
Polysemous Codes

This paper considers the problem of approximate nearest neighbor search in the compressed domain. We introduce polysemous codes, which offer both the distance estimation quality of product quantization and the efficient comparison of binary codes with the Hamming distance. Their design is inspired by algorithms introduced in the 1990s to construct channel-optimized vector quantizers. At search time, this dual interpretation accelerates the search: most of the indexed vectors are filtered out with the Hamming distance, leaving only a fraction of the vectors to be ranked with an asymmetric distance estimator. The method is complementary to a coarse partitioning of the feature space such as the inverted multi-index. This is shown by our experiments on several public benchmarks, such as the BIGANN dataset comprising one billion vectors, for which we report state-of-the-art results for query times below 0.3 ms per core. Last but not least, our approach allows the approximate computation of the k-NN graph associated with the Yahoo Flickr Creative Commons 100M collection, described by CNN image descriptors, in less than 8 hours on a single machine.

Matthijs Douze, Hervé Jégou, Florent Perronnin
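
The search strategy described above, filter with the Hamming distance and re-rank the survivors with an asymmetric distance estimator, can be sketched as follows. `adc_distance` is a hypothetical callback standing in for a product-quantization-style estimator, and the code is an illustration of the two-stage structure, not the authors' implementation.

```python
import numpy as np

def two_stage_search(db_codes, query_code, adc_distance, hamming_thresh, k=10):
    """Hamming filtering followed by asymmetric re-ranking.

    db_codes      : (N, B) uint8 array of packed binary codes.
    query_code    : (B,) uint8 packed binary code of the query.
    adc_distance  : hypothetical callback; adc_distance(i) returns the
                    asymmetric (finer) distance of database item i to the query.
    """
    # Hamming distances via XOR + bit counting on the packed codes
    ham = np.unpackbits(db_codes ^ query_code, axis=1).sum(axis=1)
    survivors = np.flatnonzero(ham <= hamming_thresh)
    # Re-rank only the small surviving fraction with the finer estimator
    scores = np.array([adc_distance(i) for i in survivors])
    order = np.argsort(scores)[:k]
    return survivors[order], scores[order]
```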
Binary Hashing with Semidefinite Relaxation and Augmented Lagrangian

This paper proposes two approaches for inferring binary codes in two-step (supervised and unsupervised) hashing. We first introduce a unified formulation for both supervised and unsupervised hashing. Then, we cast the learning of one bit as a Binary Quadratic Problem (BQP) and propose two approaches to solve it. In the first approach, we relax the BQP to a semidefinite programming problem whose global optimum can be achieved. We theoretically prove that the objective value of the binary solution obtained by this approach is well bounded. In the second approach, we propose an augmented Lagrangian based approach that solves the BQP directly, without relaxing the binary constraint. Experimental results on three benchmark datasets show that our proposed methods compare favorably with the state of the art.

Thanh-Toan Do, Anh-Dzung Doan, Duc-Thanh Nguyen, Ngai-Man Cheung
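
As an illustration of the first approach, the sketch below relaxes a generic homogeneous BQP, max x^T A x over x in {-1, +1}^n, to a semidefinite program and rounds the relaxed solution with random hyperplanes (Goemans-Williamson style). It uses CVXPY for the SDP and is a hedged sketch of the relax-and-round idea, not the paper's specific formulation or its bound analysis.

```python
import numpy as np
import cvxpy as cp

def bqp_sdp_round(A, n_samples=100):
    """SDP relaxation of  max x^T A x,  x in {-1, +1}^n,  plus randomized rounding.
    Illustrative sketch only (A is assumed symmetric)."""
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.diag(X) == 1]          # positive semidefinite, unit diagonal
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    # Factor the relaxed solution and round random hyperplane projections
    L = np.linalg.cholesky(X.value + 1e-6 * np.eye(n))
    best_x, best_val = None, -np.inf
    for _ in range(n_samples):
        x = np.sign(L @ np.random.randn(n))
        x[x == 0] = 1
        val = x @ A @ x
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```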
Efficient Continuous Relaxations for Dense CRF

Dense conditional random fields (CRF) with Gaussian pairwise potentials have emerged as a popular framework for several computer vision applications such as stereo correspondence and semantic segmentation. By modeling long-range interactions, dense CRFs provide a more detailed labelling compared to their sparse counterparts. Variational inference in these dense models is performed using a filtering-based mean-field algorithm in order to obtain a fully-factorized distribution minimising the Kullback-Leibler divergence to the true distribution. In contrast to the continuous relaxation-based energy minimisation algorithms used for sparse CRFs, the mean-field algorithm fails to provide strong theoretical guarantees on the quality of its solutions. To address this deficiency, we show that it is possible to use the same filtering approach to speed up the optimisation of several continuous relaxations. Specifically, we solve a convex quadratic programming (QP) relaxation using the efficient Frank-Wolfe algorithm. This also allows us to solve difference-of-convex relaxations via the iterative concave-convex procedure, where each iteration requires solving a convex QP. Finally, we develop a novel divide-and-conquer method to compute the subgradients of a linear programming relaxation that provides the best theoretical bounds for energy minimisation. We demonstrate the advantage of continuous relaxations over the widely used mean-field algorithm on publicly available datasets.

Alban Desmaison, Rudy Bunel, Pushmeet Kohli, Philip H. S. Torr, M. Pawan Kumar
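
The quadratic-programming relaxation mentioned above is optimized with Frank-Wolfe, whose linear minimization oracle over a product of simplices simply picks, per pixel, the label with the smallest gradient entry. Below is a generic sketch; the `grad_pairwise` callback is hypothetical (in the dense-CRF setting it would be evaluated with the same fast Gaussian filtering the paper exploits), and the code is not the authors' implementation.

```python
import numpy as np

def frank_wolfe_qp(unary, grad_pairwise, n_iter=100):
    """Generic Frank-Wolfe loop for a QP relaxation over relaxed labelings.

    unary         : (n_pixels, n_labels) unary potentials.
    grad_pairwise : hypothetical callback returning the gradient of the
                    pairwise term at the current relaxed labeling Y.
    Returns a hard labeling obtained from the relaxed solution.
    """
    n, L = unary.shape
    Y = np.full((n, L), 1.0 / L)                   # start at the uniform relaxed labeling
    for t in range(n_iter):
        G = unary + grad_pairwise(Y)               # gradient of the QP objective
        S = np.zeros_like(Y)
        S[np.arange(n), G.argmin(axis=1)] = 1.0    # LMO over the product of simplices
        gamma = 2.0 / (t + 2.0)                    # standard Frank-Wolfe step size
        Y = (1.0 - gamma) * Y + gamma * S
    return Y.argmax(axis=1)
```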
Complexity of Discrete Energy Minimization Problems

Discrete energy minimization is widely used in computer vision and machine learning for problems such as MAP inference in graphical models. The problem, in general, is notoriously intractable, and finding the global optimal solution is known to be NP-hard. However, is it possible to approximate this problem with a reasonable ratio bound on the solution quality in polynomial time? We show in this paper that the answer is no. Specifically, we show that general energy minimization, even in the 2-label pairwise case, and planar energy minimization with three or more labels are exp-APX-complete. This finding rules out the existence of any approximation algorithm with a sub-exponential approximation ratio in the input size for these two problems, including constant factor approximations. Moreover, we collect and review the computational complexity of several subclass problems and arrange them on a complexity scale consisting of three major complexity classes – PO, APX, and exp-APX, corresponding to problems that are solvable, approximable, and inapproximable in polynomial time. Problems in the first two complexity classes can serve as alternative tractable formulations to the inapproximable ones. This paper can help vision researchers to select an appropriate model for an application or guide them in designing new algorithms.

Mengtian Li, Alexander Shekhovtsov, Daniel Huber
A Convex Solution to Spatially-Regularized Correspondence Problems

We propose a convex formulation of the correspondence problem between two images with respect to an energy function measuring data consistency and spatial regularity. To this end, we formulate the general correspondence problem as the search for a minimal two-dimensional surface in $$\mathbb{R}^4$$. We then use tools from geometric measure theory and introduce 2-vector fields as a representation of two-dimensional surfaces in $$\mathbb{R}^4$$. We propose a discretization of this surface formulation that gives rise to a convex minimization problem and compute a globally optimal solution using an efficient primal-dual algorithm.

Thomas Windheuser, Daniel Cremers
A Deep Learning-Based Approach to Progressive Vehicle Re-identification for Urban Surveillance

While re-identification (Re-Id) of persons has attracted intensive attention, vehicle re-identification, despite vehicles being a significant object class in urban video surveillance, is often overlooked by the vision community. Most existing methods for vehicle Re-Id achieve only limited performance, as they predominantly focus on the generic appearance of a vehicle while neglecting its unique identifiers (e.g., the license plate). In this paper, we propose a novel deep learning-based approach to PROgressive Vehicle re-ID, called “PROVID”. Our approach treats vehicle Re-Id as two progressive search processes: coarse-to-fine search in the feature space, and near-to-distant search in the real-world surveillance environment. The first search process employs the appearance attributes of the vehicle for coarse filtering, and then exploits a Siamese neural network for license plate verification to accurately identify vehicles. The near-to-distant search process retrieves vehicles the way a human would, by searching from near to faraway cameras and from close to distant times. Moreover, to facilitate progressive vehicle Re-Id research, we collect VeRi-776, to date the largest dataset of its kind, from large-scale urban surveillance videos; it contains not only massive numbers of vehicles with diverse attributes and a high recurrence rate, but also sufficient license plates and spatiotemporal labels. A comprehensive evaluation on VeRi-776 shows that our approach outperforms the state-of-the-art methods by 9.28 % in terms of mAP.

Xinchen Liu, Wu Liu, Tao Mei, Huadong Ma
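
The coarse-to-fine part of the progressive search can be sketched as two-stage re-ranking: filter the gallery by appearance-feature distance, then re-rank the short list with a plate-verification score. The `plate_score` callback below is hypothetical (e.g. a Siamese-network similarity for a gallery item against the query); this is an illustration of the search structure, not the PROVID pipeline.

```python
import numpy as np

def progressive_rerank(query_app, gallery_app, plate_score, top_k=100):
    """Coarse-to-fine re-ranking sketch.

    query_app   : (d,) appearance feature of the query vehicle.
    gallery_app : (N, d) appearance features of the gallery.
    plate_score : hypothetical callback; plate_score(i) returns a plate
                  verification similarity for gallery item i (higher is better).
    """
    d_app = np.linalg.norm(gallery_app - query_app, axis=1)  # coarse appearance distance
    shortlist = np.argsort(d_app)[:top_k]                    # keep the top_k closest items
    fine = np.array([plate_score(i) for i in shortlist])     # fine plate verification
    return shortlist[np.argsort(-fine)]                      # most similar plates first
```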
Backmatter
Metadata
Title
Computer Vision – ECCV 2016
Edited by
Bastian Leibe
Jiri Matas
Nicu Sebe
Max Welling
Copyright Year
2016
Electronic ISBN
978-3-319-46475-6
Print ISBN
978-3-319-46474-9
DOI
https://doi.org/10.1007/978-3-319-46475-6