2014 | Book

Computer Vision – ECCV 2014

13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI

Edited by: David Fleet, Tomas Pajdla, Bernt Schiele, Tinne Tuytelaars

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

The seven-volume set comprising LNCS volumes 8689-8695 constitutes the refereed proceedings of the 13th European Conference on Computer Vision, ECCV 2014, held in Zurich, Switzerland, in September 2014. The 363 revised papers presented were carefully reviewed and selected from 1444 submissions. The papers are organized in topical sections on tracking and activity recognition; recognition; learning and inference; structure from motion and feature matching; computational photography and low-level vision; vision; segmentation and saliency; context and 3D scenes; motion and 3D scene analysis; and poster sessions.

Table of contents

Frontmatter

Poster Session 6 (continued)

All-In-Focus Synthetic Aperture Imaging

Heavy occlusions in cluttered scenes impose significant challenges on many computer vision applications. Recent light field imaging systems provide new see-through capabilities via synthetic aperture imaging (SAI) to overcome the occlusion problem. Existing synthetic aperture imaging methods, however, emulate focusing at a specific depth layer and are incapable of producing an all-in-focus see-through image. Alternative in-painting algorithms can generate visually plausible results but cannot guarantee their correctness. In this paper, we present a novel depth-free all-in-focus SAI technique based on light-field visibility analysis. Specifically, we partition the scene into multiple visibility layers to directly deal with layer-wise occlusion and apply an optimization framework to propagate visibility information between layers. On each layer, visibility and optimal focus depth estimation is formulated as a multi-label energy minimization problem. The energy integrates the visibility mask from previous layers, multi-view intensity consistency, and a depth smoothness constraint. We compare our method with state-of-the-art solutions. Extensive experimental results with qualitative and quantitative analysis demonstrate the effectiveness and superiority of our approach.

Tao Yang, Yanning Zhang, Jingyi Yu, Jing Li, Wenguang Ma, Xiaomin Tong, Rui Yu, Lingyan Ran
Photo Uncrop

We address the problem of extending the field of view of a photo—an operation we call uncrop. Given a reference photograph to be uncropped, our approach selects, reprojects, and composites a subset of Internet imagery taken near the reference into a larger image around the reference using the underlying scene geometry. The proposed Markov Random Field based approach is capable of handling large Internet photo collections with arbitrary viewpoints, dramatic appearance variation, and complicated scene layout. We show results that are visually compelling on a wide range of real-world landmarks.

Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, Steven M. Seitz
Solving Square Jigsaw Puzzles with Loop Constraints

We present a novel algorithm based on “loop constraints” for assembling non-overlapping square-piece jigsaw puzzles where the rotation and the position of each piece are unknown. Our algorithm finds small loops of puzzle pieces which form consistent cycles. These small loops are in turn aggregated into higher order “loops of loops” in a bottom-up fashion. In contrast to previous puzzle solvers which avoid or ignore puzzle cycles, we specifically seek out and exploit these loops as a form of outlier rejection. Our algorithm significantly outperforms state-of-the-art algorithms in puzzle reconstruction accuracy. For the most challenging type of image puzzles with unknown piece rotation we reduce the reconstruction error by up to 70%. We determine an upper bound on reconstruction accuracy for various data sets and show that, in some cases, our algorithm nearly matches the upper bound.
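To make the loop-consistency idea concrete, the sketch below composes pairwise piece placements (a 90°-quantized rotation plus an integer translation) around a small cycle and accepts the loop only if the composition is the identity. The placement representation is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

# Rotation matrices for 0, 90, 180, 270 degrees (counter-clockwise).
ROT = [np.array([[1, 0], [0, 1]]), np.array([[0, -1], [1, 0]]),
       np.array([[-1, 0], [0, -1]]), np.array([[0, 1], [-1, 0]])]

def compose(a, b):
    """Compose two relative placements a, b = (rot90, tx, ty),
    applying a first and then b: p -> R_b (R_a p + t_a) + t_b."""
    ra, ta = a[0], np.array(a[1:])
    rb, tb = b[0], np.array(b[1:])
    t = ROT[rb % 4] @ ta + tb
    return ((ra + rb) % 4, int(t[0]), int(t[1]))

def loop_is_consistent(edges):
    """A cycle of pairwise placements is consistent iff composing them
    around the loop yields the identity placement."""
    acc = (0, 0, 0)
    for e in edges:
        acc = compose(acc, e)
    return acc == (0, 0, 0)

# A trivially consistent 2x2 loop: right, down, left, up with no rotation.
print(loop_is_consistent([(0, 1, 0), (0, 0, 1), (0, -1, 0), (0, 0, -1)]))   # True
# Rotating one edge breaks the cycle constraint.
print(loop_is_consistent([(0, 1, 0), (0, 0, 1), (0, -1, 0), (1, 0, -1)]))   # False
```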

Kilho Son, James Hays, David B. Cooper
Geometric Calibration of Micro-Lens-Based Light-Field Cameras Using Line Features

We present a novel method of geometric calibration of micro-lens-based light-field cameras. Accurate geometric calibration is a basis of various applications. Instead of using sub-aperture images, we utilize raw images directly for calibration. We select proper regions in raw images and extract line features from micro-lens images in those regions. For the whole process, we formulate a new projection model of micro-lens-based light-field cameras. It is transformed into a linear form using line features. We compute an initial solution of both intrinsic and extrinsic parameters by a linear computation, and refine it via a non-linear optimization. Experimental results show the accuracy of the correspondences between rays and pixels in raw images, estimated by the proposed method.

Yunsu Bok, Hae-Gon Jeon, In So Kweon
Spatio-temporal Matching for Human Detection in Video

Detecting and tracking humans in videos are long-standing problems in computer vision. Most successful approaches (e.g., deformable parts models) rely heavily on discriminative models to build appearance detectors for body joints and generative models to constrain possible body configurations (e.g., trees). While these 2D models have been successfully applied to images (and with less success to videos), a major challenge is to generalize them to cope with different camera views. In order to achieve view-invariance, these 2D models typically require a large amount of training data across views that is difficult to gather and time-consuming to label. Unlike existing 2D models, this paper formulates the problem of human detection in videos as spatio-temporal matching (STM) between a 3D motion capture model and trajectories in videos. Our algorithm estimates the camera view and selects a subset of tracked trajectories that matches the motion of the 3D model. The STM is efficiently solved with linear programming, and it is robust to tracking mismatches, occlusions and outliers. To the best of our knowledge this is the first paper that solves the correspondence between video and 3D motion capture data for human pose detection. Experiments on the Human3.6M and Berkeley MHAD databases illustrate the benefits of our method over state-of-the-art approaches.

Feng Zhou, Fernando De la Torre
Collaborative Facial Landmark Localization for Transferring Annotations Across Datasets

In this paper we make the first effort, to the best of our knowledge, to combine multiple face landmark datasets with different landmark definitions into a super dataset, with a union of all landmark types computed in each image as output. Our approach is flexible, and our system can optionally use known landmarks in the target dataset to constrain the localization. Our novel pipeline is built upon variants of state-of-the-art facial landmark localization methods. Specifically, we propose to label images in the target dataset jointly rather than independently and exploit exemplars from both the source datasets and the target dataset. This approach integrates nonparametric appearance and shape modeling and graph matching together to achieve our goal.

Brandon M. Smith, Li Zhang
Facial Landmark Detection by Deep Multi-task Learning

Facial landmark detection has long been impeded by the problems of occlusion and pose variation. Instead of treating the detection task as a single and independent problem, we investigate the possibility of improving detection robustness through multi-task learning. Specifically, we wish to optimize facial landmark detection together with heterogeneous but subtly correlated tasks, e.g. head pose estimation and facial attribute inference. This is non-trivial since different tasks have different learning difficulties and convergence rates. To address this problem, we formulate a novel tasks-constrained deep model, with task-wise early stopping to facilitate learning convergence. Extensive evaluations show that the proposed task-constrained learning (i) outperforms existing methods, especially in dealing with faces with severe occlusion and pose variation, and (ii) reduces model complexity drastically compared to the state-of-the-art method based on cascaded deep model [21].

Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang
Joint Cascade Face Detection and Alignment

We present a new state-of-the-art approach for face detection. The key idea is to combine face alignment with detection, observing that aligned face shapes provide better features for face classification. To make this combination more effective, our approach learns the two tasks jointly in the same cascade framework, by exploiting recent advances in face alignment. Such joint learning greatly enhances the capability of cascade detection and still retains its realtime performance. Extensive experiments show that our approach achieves the best accuracy on challenging datasets, where all existing solutions are either inaccurate or too slow.

Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, Jian Sun
Weighted Block-Sparse Low Rank Representation for Face Clustering in Videos

In this paper, we study the problem of face clustering in videos. Specifically, given automatically extracted faces from videos and two kinds of prior knowledge (the face track that each face belongs to, and the pairs of faces that appear in the same frame), the task is to partition the faces into a given number of disjoint groups, such that each group is associated with one subject. To deal with this problem, we propose a new method called weighted block-sparse low rank representation (WBSLRR) which considers the available prior knowledge while learning a low rank data representation, and also develop a simple but effective approach to obtain the clustering result of faces. Moreover, after using several acceleration techniques, our proposed method is suitable for solving large-scale problems. The experimental results on two benchmark datasets demonstrate the effectiveness of our approach.

Shijie Xiao, Mingkui Tan, Dong Xu
Crowd Tracking with Dynamic Evolution of Group Structures

Crowd tracking generates trajectories of a set of particles for further analysis of crowd motion patterns. In this paper, we try to answer the following questions: which particles are appropriate for crowd tracking, and how can they be tracked robustly through a crowd? Unlike existing approaches that compute optical flow, track keypoints, or track pedestrians, we propose to discover distinctive and stable mid-level patches and track them jointly with the dynamic evolution of group structures. This is achieved through the integration of low-level keypoint tracking, mid-level patch tracking, and high-level group evolution. Keypoint tracking guides the generation of patches with stable internal motions, and also organizes patches into hierarchical groups with collective motions. Patches are tracked together through occlusions with spatial constraints imposed by hierarchical tree structures within groups. Coherent groups are dynamically updated through merge and split events guided by keypoint tracking. The dynamically structured patches not only substantially improve the tracking of the patches themselves, but can also assist the tracking of any other target in the crowd. The effectiveness of the proposed approach is shown through experiments and comparison with state-of-the-art trackers.

Feng Zhu, Xiaogang Wang, Nenghai Yu
Tracking Using Multilevel Quantizations

Most object tracking methods only exploit a single quantization of the image space: pixels, superpixels, or bounding boxes, each of which has advantages and disadvantages. It is highly unlikely that a common optimal quantization level, suitable for tracking all objects in all environments, exists. We therefore propose a hierarchical appearance representation model for tracking, based on a graphical model that exploits shared information across multiple quantization levels. The tracker aims to find the most probable position of the target by jointly classifying the pixels and superpixels and obtaining the best configuration across all levels. The motion of the bounding box is taken into consideration, while Online Random Forests are used to provide pixel- and superpixel-level quantizations and are progressively updated on-the-fly. By appropriately considering the multilevel quantizations, our tracker exhibits not only excellent performance in handling non-rigid object deformation, but also robustness to occlusions. A quantitative evaluation is conducted on two benchmark datasets: a non-rigid object tracking dataset (11 sequences) and the CVPR2013 tracking benchmark (50 sequences). Experimental results show that our tracker overcomes various tracking challenges and is superior to a number of other popular tracking methods.

Zhibin Hong, Chaohui Wang, Xue Mei, Danil Prokhorov, Dacheng Tao
Occlusion and Motion Reasoning for Long-Term Tracking

Object tracking is a recurring problem in computer vision. Tracking-by-detection approaches, in particular Struck [20], have been shown to be competitive in recent evaluations. However, such approaches fail in the presence of long-term occlusions as well as severe viewpoint changes of the object. In this paper we propose a principled way to combine occlusion and motion reasoning with a tracking-by-detection approach. Occlusion and motion reasoning is based on state-of-the-art long-term trajectories, which are labeled as object or background tracks with an energy-based formulation. The overlap between labeled tracks and detected regions allows occlusions to be identified. The motion changes of the object between consecutive frames can be estimated robustly from the geometric relation between object trajectories. If this geometric change is significant, an additional detector is trained. Experimental results show that our tracker obtains state-of-the-art results and handles occlusion and viewpoint changes better than competing tracking methods.

Yang Hua, Karteek Alahari, Cordelia Schmid
MEEM: Robust Tracking via Multiple Experts Using Entropy Minimization

We propose a multi-expert restoration scheme to address the model drift problem in online tracking. In the proposed scheme, a tracker and its historical snapshots constitute an expert ensemble, where the best expert is selected to restore the current tracker when needed based on a minimum entropy criterion, so as to correct undesirable model updates. The base tracker in our formulation exploits an online SVM on a budget algorithm and an explicit feature mapping method for efficient model update and inference. In experiments, our tracking method achieves substantially better overall performance than 32 trackers on a benchmark dataset of 50 video sequences under various evaluation settings. In addition, in experiments with a newly collected dataset of challenging sequences, we show that the proposed multi-expert restoration scheme significantly improves the robustness of our base tracker, especially in scenarios with frequent occlusions and repetitive appearance variations.
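To illustrate the minimum-entropy selection idea described in the abstract, here is a minimal numpy sketch; the softmax normalization and variable names are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def select_expert(expert_scores):
    """Pick the expert (tracker snapshot) whose score distribution over
    the candidate locations has the lowest entropy, i.e., is most confident."""
    entropies = []
    for scores in expert_scores:
        p = np.exp(scores - scores.max())   # softmax, numerically stable
        p /= p.sum()
        entropies.append(-(p * np.log(p + 1e-12)).sum())
    return int(np.argmin(entropies))

# Expert 1 produces a more peaked (lower-entropy) response and is selected.
e0 = np.array([0.10, 0.20, 0.15, 0.18])
e1 = np.array([0.10, 3.00, 0.20, 0.10])
print(select_expert([e0, e1]))  # -> 1
```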

Jianming Zhang, Shugao Ma, Stan Sclaroff
Robust Motion Segmentation with Unknown Correspondences

Motion segmentation can be addressed as a subspace clustering problem, assuming that the trajectories of interest points are known. However, establishing point correspondences is in itself a challenging task. Most existing approaches tackle the correspondence estimation and motion segmentation problems separately. In this paper, we introduce an approach to performing motion segmentation without any prior knowledge of point correspondences. We formulate this problem in terms of Partial Permutation Matrices (PPMs) and aim to match feature descriptors while simultaneously encouraging point trajectories to satisfy subspace constraints. This lets us handle outliers in both point locations and feature appearance. The resulting optimization problem can be solved via the Alternating Direction Method of Multipliers (ADMM), where each subproblem has an efficient solution. Our experimental evaluation on synthetic and real sequences clearly evidences the benefits of our formulation over the traditional sequential approach that first estimates correspondences and then performs motion segmentation.

Pan Ji, Hongdong Li, Mathieu Salzmann, Yuchao Dai
Monocular Multiview Object Tracking with 3D Aspect Parts

In this work, we focus on the problem of tracking objects under significant viewpoint variations, which poses a big challenge to traditional object tracking methods. We propose a novel method to track an object and estimate its continuous pose and part locations under severe viewpoint change. In order to handle the change in topological appearance introduced by viewpoint transformations, we represent objects with 3D aspect parts and model the relationship between viewpoint and 3D aspect parts in a part-based particle filtering framework. Moreover, we show that instance-level online-learned part appearance can be incorporated into our model, which makes it more robust in difficult scenarios with occlusions. Experiments are conducted on a new dataset of challenging YouTube videos and a subset of the KITTI dataset [14] that include significant viewpoint variations, as well as a standard sequence for car tracking. We demonstrate that our method is able to track the 3D aspect parts and the viewpoint of objects accurately despite significant changes in viewpoint.

Yu Xiang, Changkyu Song, Roozbeh Mottaghi, Silvio Savarese
Modeling Blurred Video with Layers

Videos contain complex spatially-varying motion blur due to the combination of object motion, camera motion, and depth variation with finite shutter speeds. Existing methods to estimate optical flow, deblur the images, and segment the scene fail in such cases. In particular, boundaries between differently moving objects cause problems, because here the blurred images are a combination of the blurred appearances of multiple surfaces. We address this with a novel layered model of scenes in motion. From a motion-blurred video sequence, we jointly estimate the layer segmentation and each layer’s appearance and motion. Since the blur is a function of the layer motion and segmentation, it is completely determined by our generative model. Given a video, we formulate the optimization problem as minimizing the pixel error between the blurred frames and images synthesized from the model, and solve it using gradient descent. We demonstrate our approach on synthetic and real sequences.

Jonas Wulff, Michael Julian Black
Efficient Image and Video Co-localization with Frank-Wolfe Algorithm

In this paper, we tackle the problem of performing efficient co-localization in images and videos. Co-localization is the problem of simultaneously localizing (with bounding boxes) objects of the same class across a set of distinct images or videos. Building upon recent state-of-the-art methods, we show how we are able to naturally incorporate temporal terms and constraints for video co-localization into a quadratic programming framework. Furthermore, by leveraging the Frank-Wolfe algorithm (or conditional gradient), we show how our optimization formulations for both images and videos can be reduced to solving a succession of simple integer programs, leading to increased efficiency in both memory and speed. To validate our method, we present experimental results on the PASCAL VOC 2007 dataset for images and the YouTube-Objects dataset for videos, as well as a joint combination of the two.
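As a rough illustration of why Frank-Wolfe (conditional gradient) reduces a quadratic program to a succession of simple linear subproblems, here is a generic sketch for minimizing a quadratic over the probability simplex. The feasible set and objective are illustrative stand-ins, not the paper's actual co-localization formulation.

```python
import numpy as np

def frank_wolfe_simplex(Q, c, n_iters=200):
    """Minimize 0.5*x^T Q x + c^T x over the probability simplex with the
    Frank-Wolfe (conditional gradient) method. Each iteration only solves a
    trivial linear subproblem: the minimizer of a linear function over the
    simplex is a single vertex (an indicator vector)."""
    n = len(c)
    x = np.full(n, 1.0 / n)                  # feasible starting point
    for k in range(n_iters):
        grad = Q @ x + c
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0             # linear subproblem: best vertex
        gamma = 2.0 / (k + 2.0)              # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A.T @ A                                   # a random convex quadratic
c = rng.standard_normal(5)
print(frank_wolfe_simplex(Q, c).round(3))
```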

Armand Joulin, Kevin Tang, Li Fei-Fei
Non-parametric Higher-Order Random Fields for Image Segmentation

Models defined using higher-order potentials are becoming increasingly popular in computer vision. However, the exact representation of a general higher-order potential defined over many variables is computationally infeasible. This has led prior works to adopt parametric potentials that can be compactly represented. This paper proposes a non-parametric higher-order model for image labeling problems that uses a patch-based representation of its potentials. We use the transformation scheme of [11, 25] to convert the higher-order potentials to a pair-wise form that can be handled using traditional inference algorithms. This representation is able to capture structural, geometric and topological information of labels from training data and to provide more precise segmentations. Other tasks such as image denoising and reconstruction are also possible. We evaluate our method on denoising and segmentation problems with synthetic and real images.

Pablo Márquez-Neila, Pushmeet Kohli, Carsten Rother, Luis Baumela
Co-Sparse Textural Similarity for Interactive Segmentation

We propose an algorithm for segmenting natural images based on texture and color information, which leverages the co-sparse analysis model for image segmentation. As a key ingredient of this method, we introduce a novel textural similarity measure, which builds upon the co-sparse representation of image patches. We propose a statistical MAP inference approach to merge textural similarity with information about color and location. Combined with recently developed convex multilabel optimization methods this leads to an efficient algorithm for interactive segmentation, which is easily parallelized on graphics hardware. The provided approach outperforms state-of-the-art interactive segmentation methods on the Graz Benchmark.

Claudia Nieuwenhuis, Simon Hawe, Martin Kleinsteuber, Daniel Cremers
A Convergent Incoherent Dictionary Learning Algorithm for Sparse Coding

Recently, sparse coding has been widely used in many applications ranging from image recovery to pattern recognition. The low mutual coherence of a dictionary is an important property that ensures the optimality of the sparse code generated from this dictionary. Indeed, most existing dictionary learning methods for sparse coding either implicitly or explicitly try to learn an incoherent dictionary, which requires solving a very challenging non-convex optimization problem. In this paper, we propose a hybrid alternating proximal algorithm for incoherent dictionary learning and establish its global convergence property. Such a convergent incoherent dictionary learning method is not only of theoretical interest, but might also benefit many sparse coding based applications.
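For reference, mutual coherence, the quantity that incoherent dictionary learning tries to keep low, can be computed as in the minimal numpy sketch below; this is a generic definition, not part of the proposed algorithm.

```python
import numpy as np

def mutual_coherence(D):
    """Mutual coherence of a dictionary D (columns are atoms): the largest
    absolute inner product between two distinct l2-normalized atoms.
    Lower values correspond to a more incoherent dictionary."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))   # 128 atoms in R^64
print(mutual_coherence(D))
```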

Chenglong Bao, Yuhui Quan, Hui Ji
Free-Shape Polygonal Object Localization

Polygonal objects are prevalent in man-made scenes. Early approaches to detecting them relied mainly on geometry, while subsequent ones also incorporated appearance-based cues. It has recently been shown that this can be done quickly by searching for cycles in graphs of line-fragments, provided that the cycle scoring function can be expressed as additive terms attached to individual fragments. In this paper, we propose an approach that eliminates this restriction. Given a weighted line-fragment graph, we use its cyclomatic number to partition the graph into manageably sized sub-graphs that preserve nodes and edges with a high weight and are most likely to contain object contours. Object contours are then detected as maximally scoring elementary circuits enumerated in each sub-graph. Our approach can be used with any cycle scoring function, and multiple candidates that share line fragments can be found, unlike in other approaches, which rely on a greedy strategy to find candidates. We demonstrate that our approach significantly outperforms the state-of-the-art for the detection of building rooftops in aerial images and polygonal object categories from ImageNet.

Xiaolu Sun, C. Mario Christoudias, Pascal Fua
Interactively Guiding Semi-Supervised Clustering via Attribute-Based Explanations

Unsupervised image clustering is a challenging and often ill-posed problem. Existing image descriptors fail to capture the clustering criterion well, and more importantly, the criterion itself may depend on (unknown) user preferences. Semi-supervised approaches such as distance metric learning and constrained clustering thus leverage user-provided annotations indicating which pairs of images belong to the same cluster (must-link) and which ones do not (cannot-link). These approaches require many such constraints before achieving good clustering performance because each constraint only provides weak cues about the desired clustering. In this paper, we propose to use image attributes as a modality for the user to provide more informative cues. In particular, the clustering algorithm iteratively and actively queries a user with an image pair. Instead of the user simply providing a must-link/cannot-link constraint for the pair, the user also provides an attribute-based reasoning e.g. “these two images are similar because both are natural and have still water” or “these two people are dissimilar because one is way older than the other”. Under the guidance of this explanation, and equipped with attribute predictors, many additional constraints are automatically generated. We demonstrate the effectiveness of our approach by incorporating the proposed attribute-based explanations in three standard semi-supervised clustering algorithms: Constrained K-Means, MPCK-Means, and Spectral Clustering, on three domains: scenes, shoes, and faces, using both binary and relative attributes.
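A minimal sketch of how an attribute-based explanation could be expanded into many additional must-link constraints using attribute predictors; the attribute names, scores, and threshold are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from itertools import combinations

def expand_must_links(attr_scores, attr_names, explanation_attrs, thresh=0.5):
    """Given predicted attribute scores for every image (n x a) and the
    attributes named in a user's explanation ("similar because both are
    natural and have still water"), generate must-link constraints among
    all image pairs that confidently share those attributes."""
    idx = [attr_names.index(a) for a in explanation_attrs]
    mask = (attr_scores[:, idx] > thresh).all(axis=1)   # images with all attributes
    images = np.flatnonzero(mask)
    return list(combinations(images.tolist(), 2))       # pairwise must-links

attr_names = ["natural", "still_water", "has_people"]
scores = np.array([[0.90, 0.80, 0.10],
                   [0.70, 0.90, 0.00],
                   [0.20, 0.10, 0.90],
                   [0.95, 0.60, 0.30]])
print(expand_must_links(scores, attr_names, ["natural", "still_water"]))
# -> [(0, 1), (0, 3), (1, 3)]
```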

Shrenik Lad, Devi Parikh
Attributes Make Sense on Segmented Objects

In this paper we aim for object classification and segmentation by attributes. Where existing work considers attributes either for the global image or for the parts of the object, we propose, as our first novelty, to learn and extract attributes on segments containing the entire object. Object-level attributes suffer less from accidental content around the object and accidental image conditions such as partial occlusions, scale changes and viewpoint changes. As our second novelty, we propose joint learning for simultaneous object classification and segment proposal ranking, solely on the basis of attributes. This naturally brings us to our third novelty: object-level attributes for zero-shot learning, where we use attribute descriptions of unseen classes for localizing their instances in new images and classifying them accordingly. Results on the Caltech UCSD Birds, Leeds Butterflies, and an a-Pascal subset demonstrate that i) extracting attributes at the oracle object level brings substantial benefits, ii) our joint learning model leads to accurate attribute-based classification and segmentation, approaching the oracle results, and iii) object-level attributes also allow for zero-shot classification and segmentation. We conclude that attributes make sense on segmented objects.

Zhenyang Li, Efstratios Gavves, Thomas Mensink, Cees G. M. Snoek
Towards Transparent Systems: Semantic Characterization of Failure Modes

Today’s computer vision systems are not perfect. They fail frequently. Even worse, they fail abruptly and seemingly inexplicably. We argue that making our systems more transparent via an explicit human understandable characterization of their failure modes is desirable. We propose characterizing the failure modes of a vision system using semantic attributes. For example, a face recognition system may say “If the test image is blurry, or the face is not frontal, or the person to be recognized is a young white woman with heavy make up, I am likely to fail.” This information can be used at training time by researchers to design better features, models or collect more focused training data. It can also be used by a downstream machine or human user at test time to know when to ignore the output of the system, in turn making it more reliable. To generate such a “specification sheet”, we discriminatively cluster incorrectly classified images in the semantic attribute space using L1-regularized weighted logistic regression. We show that our specification sheets can predict oncoming failures for face and animal species recognition better than several strong baselines. We also show that lay people can easily follow our specification sheets.

Aayush Bansal, Ali Farhadi, Devi Parikh
Orientation Covariant Aggregation of Local Descriptors with Embeddings

Image search systems based on local descriptors typically achieve orientation invariance by aligning the patches on their dominant orientations. Albeit successful, this choice introduces too much invariance because it does not guarantee that the patches are rotated consistently.

This paper introduces an aggregation strategy of local descriptors that achieves this covariance property by jointly encoding the angle in the aggregation stage in a continuous manner. It is combined with an efficient monomial embedding to provide a codebook-free method to aggregate local descriptors into a single vector representation.

Our strategy is also compatible with several popular encoding methods, in particular bag-of-words, VLAD and the Fisher vector, and is employed in combination with them. Our geometry-aware aggregation strategy is effective for image search, as shown by experiments performed on standard benchmarks for image and particular object retrieval, namely Holidays and Oxford buildings.

Giorgos Tolias, Teddy Furon, Hervé Jégou
Similarity-Invariant Sketch-Based Image Retrieval in Large Databases

Proliferation of touch-based devices has made the idea of sketch-based image retrieval practical. While many methods exist for sketch-based image retrieval on small datasets, little work has been done on large (web)-scale image retrieval. In this paper, we present an efficient approach for image retrieval from millions of images based on user-drawn sketches. Unlike existing methods, which are sensitive to even translation or scale variations, our method handles translation, scale, rotation (similarity) and small deformations. To make online retrieval fast, each database image is preprocessed to extract sequences of contour segments (chains) that capture sufficient shape information and are represented by succinct variable-length descriptors. Chain similarities are computed by a fast Dynamic Programming-based approximate substring matching algorithm, which enables partial matching of chains. Finally, hierarchical k-medoids based indexing is used for very fast retrieval, within a few seconds, on databases with millions of images. Qualitative and quantitative results clearly demonstrate the superiority of the approach over existing methods.
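The partial-matching structure of approximate substring matching can be illustrated with the standard dynamic program below, shown here over character sequences; the paper matches chains of contour-segment descriptors rather than characters, so this is only a structural sketch.

```python
def approx_substring_distance(pattern, text):
    """Minimum edit distance between `pattern` and ANY substring of `text`
    (starting anywhere in the text is free), via the standard DP recurrence."""
    m, n = len(pattern), len(text)
    prev = [0] * (n + 1)                      # free start: first row is zero
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # skip a pattern symbol
                         cur[j - 1] + 1,      # skip a text symbol
                         prev[j - 1] + cost)  # match / substitute
        prev = cur
    return min(prev)                          # best ending position in the text

print(approx_substring_distance("chain", "a long chains of segments"))  # -> 0
```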

Sarthak Parui, Anurag Mittal
Discovering Object Classes from Activities

In order to avoid an expensive manual labelling process or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small or medium sized objects is largely unexplored. We observe that videos with activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in such videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.

Abhilash Srikantha, Juergen Gall
Weakly Supervised Object Localization with Latent Category Learning

Localizing objects in cluttered backgrounds is a challenging task in weakly supervised localization. Due to large object variations in cluttered images, objects are highly ambiguous with respect to their backgrounds. However, backgrounds contain useful latent information, e.g., the sky for aeroplanes. If we can learn this latent information, object-background ambiguity can be reduced to suppress the background. In this paper, we propose latent category learning (LCL), which is an unsupervised learning problem given only image-level class labels. Firstly, inspired by latent semantic discovery, we use probabilistic Latent Semantic Analysis (pLSA) to learn the latent categories, which can represent objects, object parts or backgrounds. Secondly, to determine which category contains the target object, we propose a category selection method that evaluates each category's discrimination. We evaluate the method on the PASCAL VOC 2007 database and the ILSVRC 2013 detection challenge. On VOC 2007, the proposed method yields an annotation accuracy of 48%, which outperforms previous results by 10%. More importantly, we achieve a detection average precision of 30.9%, which improves on previous results by 8% and is competitive with the supervised deformable part model (DPM) 5.0 baseline of 33.7%. On ILSVRC 2013 detection, the method yields a precision of 6.0%, which is also competitive with DPM 5.0.

Chong Wang, Weiqiang Ren, Kaiqi Huang, Tieniu Tan
Food-101 – Mining Discriminative Components with Random Forests

In this paper we address the problem of automatically recognizing pictured dishes. To this end, we introduce a novel method to mine discriminative parts using Random Forests (RF), which allows us to mine parts simultaneously for all classes and to share knowledge among them. To improve the efficiency of mining and classification, we only consider patches that are aligned with image superpixels, which we call components. To measure the performance of our RF component mining for food recognition, we introduce a novel and challenging dataset of 101 food categories with 101,000 images. With an average accuracy of 50.76%, our model outperforms alternative classification methods except for CNNs, including SVM classification on Improved Fisher Vectors and existing discriminative part-mining algorithms, by 11.88% and 8.13%, respectively. On the challenging MIT-Indoor dataset, our method compares favorably with other state-of-the-art component-based classification methods.

Lukas Bossard, Matthieu Guillaumin, Luc Van Gool
Latent-Class Hough Forests for 3D Object Detection and Pose Estimation

In this paper we propose a novel framework, Latent-Class Hough Forests, for 3D object detection and pose estimation in heavily cluttered and occluded scenes. Firstly, we adapt the state-of-the-art template matching feature, LINEMOD [14], into a scale-invariant patch descriptor and integrate it into a regression forest using a novel template-based split function. In training, rather than explicitly collecting representative negative samples, our method is trained on positive samples only, and we treat the class distributions at the leaf nodes as latent variables. During the inference process we iteratively update these distributions, providing accurate estimation of background clutter and foreground occlusions and thus a better detection rate. Furthermore, as a by-product, the latent class distributions can provide accurate occlusion-aware segmentation masks, even in the multi-instance scenario. In addition to an existing public dataset, which contains only single-instance sequences with large amounts of clutter, we have collected a new, more challenging dataset for multiple-instance detection containing heavy 2D and 3D clutter as well as foreground occlusions. We evaluate the Latent-Class Hough Forest on both of these datasets, where we outperform state-of-the-art methods.

Alykhan Tejani, Danhang Tang, Rigas Kouskouridas, Tae-Kyun Kim
FPM: Fine Pose Parts-Based Model with 3D CAD Models

We introduce a novel approach to the problem of localizing objects in an image and estimating their fine-pose. Given exact CAD models, and a few real training images with aligned models, we propose to leverage the geometric information from CAD models and appearance information from real images to learn a model that can accurately estimate fine pose in real images. Specifically, we propose FPM, a fine pose parts-based model, that combines geometric information in the form of shared 3D parts in deformable part based models, and appearance information in the form of objectness to achieve both fast and accurate fine pose estimation. Our method significantly outperforms current state-of-the-art algorithms in both accuracy and speed.

Joseph J. Lim, Aditya Khosla, Antonio Torralba
Learning High-Level Judgments of Urban Perception

Human observers make a variety of perceptual inferences about pictures of places based on prior knowledge and experience. In this paper we apply computational vision techniques to the task of predicting the perceptual characteristics of places by leveraging recent work on visual features along with a geo-tagged dataset of images associated with crowd-sourced urban perception judgments for wealth, uniqueness, and safety. We perform extensive evaluations of our models, training and testing on images of the same city as well as training and testing on images of different cities to demonstrate generalizability. In addition, we collect a new densely sampled dataset of streetview images for 4 cities and explore joint models to collectively predict perceptual judgments at city scale. Finally, we show that our predictions correlate well with ground truth statistics of wealth and crime.

Vicente Ordonez, Tamara L. Berg
CollageParsing: Nonparametric Scene Parsing by Adaptive Overlapping Windows

Scene parsing is the problem of assigning a semantic label to every pixel in an image. Though the task is ambitious, impressive advances have been made in recent years, in particular in scalable nonparametric techniques suitable for open-universe databases. This paper presents the CollageParsing algorithm for scalable nonparametric scene parsing. In contrast to common practice in recent nonparametric approaches, CollageParsing reasons about mid-level windows that are designed to capture entire objects, instead of low-level superpixels that tend to fragment objects. On a standard benchmark consisting of outdoor scenes from the LabelMe database, CollageParsing achieves state-of-the-art nonparametric scene parsing results, with 7 to 11% higher average per-class accuracy than recent nonparametric approaches.

Frederick Tung, James J. Little
Discovering Video Clusters from Visual Features and Noisy Tags

We present an algorithm for automatically clustering tagged videos. Collections of tagged videos are commonplace, however, it is not trivial to discover video clusters therein. Direct methods that operate on visual features ignore the regularly available, valuable source of tag information. Solely clustering videos on these tags is error-prone since the tags are typically noisy. To address these problems, we develop a structured model that considers the interaction between visual features, video tags and video clusters. We model tags from visual features, and correct noisy tags by checking visual appearance consistency. In the end, videos are clustered from the refined tags as well as the visual features. We learn the clustering through a max-margin framework, and demonstrate empirically that this algorithm can produce more accurate clustering results than baseline methods based on tags or visual features, or both. Further, qualitative results verify that the clustering results can discover sub-categories and more specific instances of a given video category.

Arash Vahdat, Guang-Tong Zhou, Greg Mori
Category-Specific Video Summarization

In large video collections with clusters of typical categories, such as “birthday party” or “flash-mob”, category-specific video summarization can produce higher quality video summaries than unsupervised approaches that are blind to the video category.

Given a video from a known category, our approach first efficiently performs a temporal segmentation into semantically-consistent segments, delimited not only by shot boundaries but also general change points. Then, equipped with an SVM classifier, our approach assigns importance scores to each segment. The resulting video assembles the sequence of segments with the highest scores. The obtained video summary is therefore both short and highly informative. Experimental results on videos from the multimedia event detection (MED) dataset of TRECVID’11 show that our approach produces video summaries with higher relevance than the state of the art.
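A minimal sketch of the final assembly step described above: score-sorted segments are greedily kept under a length budget and then restored to temporal order. The segment representation and the budget are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def assemble_summary(segments, scores, budget_sec):
    """Keep the highest-scoring segments that fit within the length budget
    and return their indices in temporal order.

    segments: list of (start_sec, end_sec) pairs from temporal segmentation.
    scores:   importance score per segment (e.g., from an SVM classifier).
    """
    kept, used = [], 0.0
    for i in np.argsort(scores)[::-1]:        # most important first
        duration = segments[i][1] - segments[i][0]
        if used + duration <= budget_sec:
            kept.append(int(i))
            used += duration
    return sorted(kept)                        # restore temporal order

segs = [(0, 4), (4, 9), (9, 12), (12, 20)]
print(assemble_summary(segs, np.array([0.2, 0.9, 0.1, 0.7]), budget_sec=13))
# -> [1, 3]
```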

Danila Potapov, Matthijs Douze, Zaid Harchaoui, Cordelia Schmid
Assessing the Quality of Actions

While recent advances in computer vision have provided reliable methods to recognize actions in both images and videos, the problem of assessing how well people perform actions has been largely unexplored in computer vision. Since methods for assessing action quality have many real-world applications in healthcare, sports, and video retrieval, we believe the computer vision community should begin to tackle this challenging problem. To spur progress, we introduce a learning-based framework that takes steps towards assessing how well people perform actions in videos. Our approach works by training a regression model from spatiotemporal pose features to scores obtained from expert judges. Moreover, our approach can provide interpretable feedback on how people can improve their action. We evaluate our method on a new Olympic sports dataset, and our experiments suggest our framework is able to rank the athletes more accurately than a non-expert human. While promising, our method is still a long way from rivaling the performance of expert judges, indicating that there is significant opportunity in computer vision research to improve on this difficult yet important task.
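A minimal sketch of the core regression step, mapping spatiotemporal pose features to judge scores, with a simple ridge regressor standing in for the paper's actual model; the data and dimensions are illustrative.

```python
import numpy as np

def fit_quality_regressor(X, y, lam=1.0):
    """Ridge regression from spatiotemporal pose features X (n x d) to
    expert judge scores y (n,); returns the weight vector."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))               # 50 training clips, 16-dim pose features
true_w = rng.standard_normal(16)
y = X @ true_w + 0.1 * rng.standard_normal(50)  # noisy "judge scores"
w = fit_quality_regressor(X, y)
print((X[:3] @ w).round(2))                     # predicted quality for 3 clips
```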

Hamed Pirsiavash, Carl Vondrick, Antonio Torralba
HiRF: Hierarchical Random Field for Collective Activity Recognition in Videos

This paper addresses the problem of recognizing and localizing coherent activities of a group of people, called collective activities, in video. Related work has argued the benefits of capturing long-range and higher-order dependencies among video features for robust recognition. To this end, we formulate a new deep model, called Hierarchical Random Field (HiRF). HiRF models only hierarchical dependencies between model variables. This effectively amounts to modeling higher-order temporal dependencies of video features. We specify an efficient inference procedure for HiRF that iterates linear-programming steps for estimating latent variables. Learning of HiRF parameters is specified within the max-margin framework. Our evaluation on the benchmark New Collective Activity and Collective Activity datasets demonstrates that HiRF yields superior recognition and localization compared to the state of the art.

Mohamed Rabie Amer, Peng Lei, Sinisa Todorovic
Part Bricolage: Flow-Assisted Part-Based Graphs for Detecting Activities in Videos

Space-time detection of human activities in videos can significantly enhance visual search. For such tasks, low-level features alone have been found somewhat insufficient on complex datasets, while the mid-level features (such as body parts) that are normally considered are not robustly accounted for in terms of their inaccuracy. Moreover, existing activity detection mechanisms do not constructively utilize the importance and trustworthiness of the features.

This paper addresses these problems and introduces a unified formulation for robustly detecting activities in videos. Our first contribution is the formulation of the detection task as an undirected node- and edge-weighted graphical structure called Part Bricolage (PB), where the node weights represent the type of features along with their importance, and the edge weights incorporate the probability of the features belonging to a known activity class while also accounting for the trustworthiness of the features connected by the edge. The Prize-Collecting Steiner Tree (PCST) problem [19] is solved for such a graph, which yields the best connected subgraph comprising the activity of interest. Our second contribution is a novel technique for robust body part estimation, which uses two types of state-of-the-art pose detectors and resolves plausible detection ambiguities with pre-trained classifiers that predict the trustworthiness of the pose detectors. Our third contribution is the fusion of the low-level descriptors with the mid-level ones, while maintaining the spatial structure between the features.

For a quantitative evaluation of the detection power of PB, we run PB on the Hollywood and MSR-Actions datasets and outperform the state of the art by a significant margin for various detection paradigms.

Sukrit Shankar, Vijay Badrinarayanan, Roberto Cipolla
GIS-Assisted Object Detection and Geospatial Localization

Geographical Information System (GIS) databases contain information about many objects, such as traffic signals, road signs, fire hydrants, etc. in urban areas. This wealth of information can be utilized for assisting various computer vision tasks. In this paper, we propose a method for improving object detection using a set of priors acquired from GIS databases. Given a database of object locations from GIS and a query image with metadata, we compute the expected spatial location of the visible objects in the image. We also perform object detection in the query image (e.g., using DPM) and obtain a set of candidate bounding boxes for the objects. Then, we fuse the GIS priors with the potential detections to find the final object bounding boxes. To cope with various inaccuracies and practical complications, such as noisy metadata, occlusion, inaccuracies in GIS, and poor candidate detections, we formulate our fusion as a higher-order graph matching problem which we robustly solve using RANSAC. We demonstrate that this approach outperforms well established object detectors, such as DPM, with a large margin.

Furthermore, we propose that the GIS objects can be used as cues for discovering the location where an image was taken. Our hypothesis is based on the idea that the objects visible in one image, along with their relative spatial location, provide distinctive cues for the geo-location. In order to estimate the geo-location based on the generic objects, we perform a search on a dense grid of locations over the covered area. We assign a score to each location based on the similarity of its GIS objects and the imperfect object detections in the image. We demonstrate that over a broad urban area of >10 square kilometers, this semantic approach can significantly narrow down the localization search space, and occasionally, even find the correct location.

Shervin Ardeshir, Amir Roshan Zamir, Alejandro Torroella, Mubarak Shah
Context-Based Pedestrian Path Prediction

We present a novel Dynamic Bayesian Network for pedestrian path prediction in the intelligent vehicle domain. The model incorporates the pedestrian situational awareness, situation criticality and spatial layout of the environment as latent states on top of a Switching Linear Dynamical System (SLDS) to anticipate changes in the pedestrian dynamics. Using computer vision, situational awareness is assessed by the pedestrian head orientation, situation criticality by the distance between vehicle and pedestrian at the expected point of closest approach, and spatial layout by the distance of the pedestrian to the curbside. Our particular scenario is that of a crossing pedestrian, who might stop or continue walking at the curb. In experiments using stereo vision data obtained from a vehicle, we demonstrate that the proposed approach results in more accurate path prediction than only SLDS, at the relevant short time horizon (1 s), and slightly outperforms a computationally more demanding state-of-the-art method.

Julian Francisco Pieter Kooij, Nicolas Schneider, Fabian Flohr, Dariu M. Gavrila

Context and 3D Scenes

Sliding Shapes for 3D Object Detection in Depth Images

The depth information of RGB-D sensors has greatly simplified some common challenges in computer vision and enabled breakthroughs for several tasks. In this paper, we propose to use depth maps for object detection and design a 3D detector to overcome the major difficulties for recognition, namely the variations of texture, illumination, shape, viewpoint, clutter, occlusion, self-occlusion and sensor noise. We take a collection of 3D CAD models and render each CAD model from hundreds of viewpoints to obtain synthetic depth maps. For each depth rendering, we extract features from the 3D point cloud and train an Exemplar-SVM classifier. During testing and hard-negative mining, we slide a 3D detection window in 3D space. Experiment results show that our 3D detector significantly outperforms the state-of-the-art algorithms for both RGB and RGB-D images, and achieves about a 1.7× improvement in average precision compared to DPM and R-CNN. All source code and data are available online.

Shuran Song, Jianxiong Xiao
Integrating Context and Occlusion for Car Detection by Hierarchical And-Or Model

This paper presents a method of learning reconfigurable hierarchical And-Or models to integrate context and occlusion for car detection. The And-Or model represents the regularities of car-to-car context and occlusion patterns at three levels: (i) layouts of spatially-coupled N cars, (ii) single cars with different viewpoint-occlusion configurations, and (iii) a small number of parts. The learning process consists of two stages. We first learn the structure of the And-Or model with three components: (a) mining N-car contextual patterns based on layouts of annotated single car bounding boxes, (b) mining occlusion configurations based on the overlapping statistics between single cars, and (c) learning visible parts based on car 3D CAD simulation or by heuristically mining latent car parts. The And-Or model is organized into a directed acyclic graph, which enables a Dynamic Programming algorithm for inference. In the second stage, we jointly train the model parameters (for appearance, deformation and bias) using Weak-Label Structural SVM. In experiments, we test our model on four car datasets: the KITTI dataset [11], the street parking dataset [19], the PASCAL VOC2007 car dataset [7], and a self-collected parking lot dataset. We compare with state-of-the-art variants of deformable part-based models and other methods. Our model obtains significant improvement consistently on the four datasets.

Bo Li, Tianfu Wu, Song-Chun Zhu
PanoContext: A Whole-Room 3D Context Model for Panoramic Scene Understanding

The field-of-view of standard cameras is very small, which is one of the main reasons that contextual information is not as useful as it should be for object detection. To overcome this limitation, we advocate the use of 360° full-view panoramas in scene understanding, and propose a whole-room context model in 3D. For an input panorama, our method outputs 3D bounding boxes of the room and all major objects inside, together with their semantic categories. Our method generates 3D hypotheses based on contextual constraints and ranks the hypotheses holistically, combining both bottom-up and top-down context information. To train our model, we construct an annotated panorama dataset and reconstruct the 3D model from single-view using manual annotation. Experiments show that solely based on 3D context without any image region category classifier, we can achieve a comparable performance with the state-of-the-art object detector. This demonstrates that when the FOV is large, context is as powerful as object appearance. All data and source code are available online.

Yinda Zhang, Shuran Song, Ping Tan, Jianxiong Xiao
Unfolding an Indoor Origami World

In this work, we present a method for single-view reasoning about 3D surfaces and their relationships. We propose the use of mid-level constraints for 3D scene understanding in the form of convex and concave edges and introduce a generic framework capable of incorporating these and other constraints. Our method takes a variety of cues and uses them to infer a consistent interpretation of the scene. We demonstrate improvements over the state of the art and produce interpretations of the scene that link large planar surfaces.

David Ford Fouhey, Abhinav Gupta, Martial Hebert

Poster Session 7

Joint Semantic Segmentation and 3D Reconstruction from Monocular Video

We present an approach for joint inference of 3D scene structure and semantic labeling for monocular video. Starting from a monocular image stream, our framework produces a 3D volumetric semantic + occupancy map, which is much more useful than the series of 2D semantic label images or the sparse point cloud produced by traditional semantic segmentation and Structure from Motion (SfM) pipelines, respectively. We derive a Conditional Random Field (CRF) model defined in 3D space that jointly infers the semantic category and occupancy for each voxel. Such joint inference in the 3D CRF paves the way for more informed priors and constraints, which would otherwise not be possible if the problems were solved separately in their traditional frameworks. We make use of class-specific semantic cues that constrain the 3D structure in areas where multiview constraints are weak. Our model comprises higher-order factors, which help when the depth is unobservable. We also make use of class-specific semantic cues to reduce either the degree of such higher-order factors, or to approximately model them with unaries if possible. We demonstrate improved 3D structure and temporally consistent semantic segmentation for difficult, large-scale, forward-moving monocular image sequences.

Abhijit Kundu, Yin Li, Frank Dellaert, Fuxin Li, James M. Rehg
A New Variational Framework for Multiview Surface Reconstruction

The creation of surfaces from overlapping images taken from different vantages is a hard and important problem in computer vision. Recent developments fall primarily into two categories: the use of dense matching to produce point clouds from which surfaces are built, and the construction of surfaces from images directly. This paper presents a new method for surface reconstruction falling in the second category. First, a strongly motivated variational framework is built from the ground up based on a limiting case of photo-consistency. The framework includes a powerful new edge preserving smoothness term and exploits the input images exhaustively, directly yielding high quality surfaces instead of dealing with issues (such as noise or misalignment) after the fact. Numeric solution is accomplished with a combination of Gauss-Newton descent and the finite element method, yielding deep convergence in few iterates. The method is fast, robust, very insensitive to view/scene configurations, and produces state-of-the-art results in the Middlebury evaluation.

Ben Semerjian
Multi-body Depth-Map Fusion with Non-intersection Constraints

Depthmap fusion is the problem of computing dense 3D reconstructions from a set of depthmaps. Whereas this problem has received a lot of attention for purely rigid scenes, there is remarkably little prior work on dense reconstructions of scenes consisting of several moving rigid bodies or parts. This paper therefore explores this multi-body depthmap fusion problem. A first observation in the multi-body setting is that, when treated naively, ghosting artifacts will emerge, i.e., the same part will be reconstructed multiple times at different positions. We therefore introduce non-intersection constraints which resolve these issues: at any point in time, a point in space can be occupied by at most one part. Interestingly enough, these constraints can be expressed as linear inequalities and as such define a convex set. We therefore propose to phrase the multi-body depthmap fusion problem in a convex voxel labeling framework. Experimental evaluation shows that our approach succeeds in computing artifact-free dense reconstructions of the individual parts with minimal overhead due to the non-intersection constraints.

Bastien Jacquet, Christian Häne, Roland Angst, Marc Pollefeys
Shape from Light Field Meets Robust PCA

In this paper we propose a new type of matching term for multi-view stereo reconstruction. Our model is based on the assumption that, if one warps the images of the various views to a common warping center and considers each warped image as one row in a matrix, then this matrix will have low rank. This also implies that we assume a certain amount of overlap between the views after the warping has been performed. Such an assumption is obviously met in the case of light field data, which motivated us to demonstrate the proposed model for this type of data. Our final model is a large-scale convex optimization problem, where the low-rank minimization is relaxed via the nuclear norm. We present qualitative and quantitative experiments, in which the proposed model achieves excellent results.
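For context, when a low-rank term is relaxed via the nuclear norm, its proximal operator is singular value soft-thresholding; the sketch below shows that operator on a toy low-rank matrix. This is a generic illustration, not the paper's full optimization.

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of the nuclear norm:
    argmin_X 0.5*||X - M||_F^2 + tau*||X||_* soft-thresholds M's singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

# Toy example: rows play the role of views warped to a common center,
# so the stacked matrix is (approximately) low rank.
rng = np.random.default_rng(0)
M = rng.standard_normal((9, 2)) @ rng.standard_normal((2, 200))
M += 0.05 * rng.standard_normal(M.shape)           # small noise
X = singular_value_threshold(M, tau=1.0)
print(np.linalg.matrix_rank(X, tol=1e-6))          # close to rank 2
```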

Stefan Heber, Thomas Pock
Cross-Age Reference Coding for Age-Invariant Face Recognition and Retrieval

Recently, promising results have been shown in face recognition research. However, face recognition and retrieval across age remain challenging. Unlike prior methods that use complex models with strong parametric assumptions to model the aging process, we use a data-driven method to address this problem. We propose a novel coding framework called Cross-Age Reference Coding (CARC). By leveraging a large-scale image dataset freely available on the Internet as a reference set, CARC is able to encode the low-level feature of a face image with an age-invariant reference space. In the testing phase, the proposed method only requires a linear projection to encode the feature and therefore it is highly scalable. To thoroughly evaluate our work, we introduce a new large-scale dataset for face recognition and retrieval across age called the Cross-Age Celebrity Dataset (CACD). The dataset contains more than 160,000 images of 2,000 celebrities with ages ranging from 16 to 62. To the best of our knowledge, it is by far the largest publicly available cross-age face dataset. Experimental results show that the proposed method achieves state-of-the-art performance on both our dataset and the widely used MORPH dataset for face recognition across age.

Bor-Chun Chen, Chu-Song Chen, Winston H. Hsu
Reverse Training: An Efficient Approach for Image Set Classification

This paper introduces a new approach, called reverse training, to efficiently extend binary classifiers to the task of multi-class image set classification. Unlike existing binary-to-multi-class extension strategies, which require multiple binary classifiers, the proposed approach is very efficient since it trains a single binary classifier to optimally discriminate the class of the query image set from all others. For this purpose, the classifier is trained with the images of the query set (labelled positive) and a randomly sampled subset of the training data (labelled negative). The trained classifier is then evaluated on the rest of the training images. The class whose images have the largest percentage classified as positive is predicted as the class of the query image set. The confidence level of the prediction is also computed and integrated into the proposed approach to further enhance its robustness and accuracy. Extensive experiments and comparisons with existing methods show that the proposed approach achieves state-of-the-art performance for face and object recognition on a number of datasets.
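A toy numpy sketch of the prediction procedure described above, with a regularized least-squares linear classifier standing in for the paper's binary classifier; all names, the negative-sampling size, and the classifier choice are illustrative assumptions.

```python
import numpy as np

def reverse_train_predict(query_feats, train_feats, train_labels, n_neg=60, seed=0):
    """Reverse-training sketch: train ONE binary classifier with the query
    image set as positives and a random subset of the training data as
    negatives, evaluate it on the remaining training images, and predict
    the class with the largest fraction classified as positive."""
    rng = np.random.default_rng(seed)
    train_labels = np.asarray(train_labels)
    neg_idx = rng.choice(len(train_feats), size=n_neg, replace=False)
    X = np.vstack([query_feats, train_feats[neg_idx]])
    y = np.r_[np.ones(len(query_feats)), -np.ones(n_neg)]
    Xb = np.hstack([X, np.ones((len(X), 1))])            # add a bias feature
    # Regularized least squares on +/-1 targets (stand-in binary classifier).
    w = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(Xb.shape[1]), Xb.T @ y)
    rest = np.setdiff1d(np.arange(len(train_feats)), neg_idx)
    scores = np.hstack([train_feats[rest], np.ones((len(rest), 1))]) @ w
    classes = np.unique(train_labels)
    frac_pos = [np.mean(scores[train_labels[rest] == c] > 0) for c in classes]
    return classes[int(np.argmax(frac_pos))]

# Toy usage: three Gaussian classes; the query set comes from class 1.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
train_labels = np.repeat([0, 1, 2], 30)
train_feats = means[train_labels] + rng.standard_normal((90, 2))
query = means[1] + rng.standard_normal((30, 2))
print(reverse_train_predict(query, train_feats, train_labels))   # expected: 1
```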

Munawar Hayat, Mohammed Bennamoun, Senjian An
Real-Time Exemplar-Based Face Sketch Synthesis

This paper proposes a simple yet effective face sketch synthesis method. Similar to existing exemplar-based methods, a training dataset containing photo-sketch pairs is required, and a K-NN photo patch search is performed between a test photo and every training exemplar for sketch patch selection. Instead of using the Markov Random Field to optimize global sketch patch selection, this paper formulates face sketch synthesis as an image denoising problem which can be solved efficiently using the proposed method. Real-time performance can be obtained on a state-of-the-art GPU. Meanwhile, quantitative evaluations on face sketch recognition and a user study demonstrate the effectiveness of the proposed method. In addition, the proposed method can be directly extended to the temporal domain for consistent video sketch synthesis, which is of great importance in digital entertainment.

Yibing Song, Linchao Bao, Qingxiong Yang, Ming-Hsuan Yang
Domain-Adaptive Discriminative One-Shot Learning of Gestures

The objective of this paper is to recognize gestures in videos – both localizing the gesture and classifying it into one of multiple classes.

We show that the performance of a gesture classifier learnt from a single (strongly supervised) training example can be boosted significantly using a ‘reservoir’ of weakly supervised gesture examples (and that the performance exceeds learning from the one-shot example or reservoir alone). The one-shot example and weakly supervised reservoir are from different ‘domains’ (different people, different videos, continuous or non-continuous gesturing, etc.), and we propose a domain adaptation method for human pose and hand shape that enables gesture learning methods to generalise between them. We also show the benefits of using the recently introduced Global Alignment Kernel [12], instead of the standard Dynamic Time Warping that is generally used for time alignment.

The domain adaptation and learning methods are evaluated on two large-scale challenging gesture datasets: one for sign language, and the other for Italian hand gestures. In both cases performance exceeds the previously published results, including the best skeleton-classification-only entry in the 2013 ChaLearn challenge.

Tomas Pfister, James Charles, Andrew Zisserman
Backmatter
Metadata
Title
Computer Vision – ECCV 2014
Edited by
David Fleet
Tomas Pajdla
Bernt Schiele
Tinne Tuytelaars
Copyright year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-10599-4
Print ISBN
978-3-319-10598-7
DOI
https://doi.org/10.1007/978-3-319-10599-4
