
About this book

This book describes the fundamental building block of many new computer vision systems: dense and robust correspondence estimation. Dense correspondence estimation techniques are now being used successfully to solve a wide range of computer vision problems, many of them very different from the traditional applications for which such techniques were originally developed. The book introduces the techniques used for establishing correspondences between challenging image pairs, the novel features used to make these techniques robust, and the many problems dense correspondences are now being used to solve. It is intended for anyone attempting to use dense correspondences to solve new or existing computer vision problems, and the editors describe how many such problems can be addressed with dense correspondence estimation. Finally, it surveys the resources, code, and data needed to expedite the development of effective correspondence-based computer vision systems.



Establishing Dense Correspondences


Introduction to Dense Optical Flow

Before the notion of motion is generalized to arbitrary images, we first give a brief introduction to motion analysis for videos. We review how motion is estimated when the underlying motion is slow and smooth, focusing on the Horn–Schunck (Artif Intell 17:185–203, 1981) formulation with robust functions. We show step-by-step how to optimize the optical flow objective function using iteratively reweighted least squares (IRLS), which is equivalent to the conventional Euler–Lagrange variational approach but more succinct to derive. We then briefly discuss how motion is estimated when the slow and smooth assumption becomes invalid, in particular how large displacement motion is estimated.
Ce Liu
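
The IRLS idea mentioned in this abstract can be illustrated with a deliberately reduced sketch. The code below is not the chapter's dense Horn–Schunck solver: it assumes a single global translation (u, v) for the whole image, drops the smoothness term, and uses a Charbonnier penalty as the robust function. The function name `irls_global_flow` and its parameters are hypothetical.

```python
import numpy as np

def irls_global_flow(Ix, Iy, It, n_iters=50, eps=1e-3):
    """Estimate one global flow vector (u, v) by IRLS.

    Minimizes sum_p psi((Ix*u + Iy*v + It)^2) with the Charbonnier
    penalty psi(s) = sqrt(s + eps^2). Simplifying assumption: one
    translation for all pixels and no smoothness term.
    """
    Ix, Iy, It = (np.asarray(a, dtype=float).ravel() for a in (Ix, Iy, It))
    A = np.stack([Ix, Iy], axis=1)   # one brightness-constancy row per pixel
    b = -It
    flow = np.zeros(2)
    for _ in range(n_iters):
        r = A @ flow - b             # current residuals
        # psi'(r^2) = 1 / (2 * sqrt(r^2 + eps^2)); the constant factor 1/2
        # cancels in the normal equations, so it is omitted here.
        w = 1.0 / np.sqrt(r ** 2 + eps ** 2)
        AtWA = A.T @ (w[:, None] * A)
        AtWb = A.T @ (w * b)
        flow = np.linalg.solve(AtWA, AtWb)   # weighted least-squares step
    return flow
```

Each iteration freezes the weights computed from the current residuals and solves an ordinary weighted least-squares problem, so pixels with large residuals (outliers) contribute less in the next step.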

SIFT Flow: Dense Correspondence Across Scenes and Its Applications

While image alignment has been studied in different areas of computer vision for decades, aligning images depicting different scenes remains a challenging problem. Analogous to optical flow where an image is aligned to its temporally adjacent frame, we propose scale-invariant feature transform (SIFT) flow, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes. The SIFT flow algorithm consists of matching densely sampled, pixel-wise SIFT features between two images while preserving spatial discontinuities. The SIFT features allow robust matching across different scene/object appearances, whereas the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. Experiments show that the proposed approach robustly aligns complex scene pairs containing significant spatial differences. Based on SIFT flow, we propose an alignment-based large database framework for image analysis and synthesis, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence. This framework is demonstrated through concrete applications, such as motion field prediction from a single image, motion synthesis via object transfer, satellite image registration, and face recognition.
Ce Liu, Jenny Yuen, Antonio Torralba
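
As a rough illustration of what "matching densely sampled SIFT features while preserving spatial discontinuities" can mean in practice, the sketch below evaluates an energy of the kind commonly associated with SIFT flow: a truncated L1 data term, a small-displacement term, and a truncated L1 smoothness term over 4-connected neighbors. The function name, the parameter names (`t`, `eta`, `alpha`, `d`), their default values, and the clamping of out-of-bounds targets are assumptions made for this example. The code only scores a given integer flow field and does not perform the optimization.

```python
import numpy as np

def sift_flow_energy(s1, s2, u, v, t=1.0, eta=0.01, alpha=1.0, d=5.0):
    """Score an integer flow field (u, v) between dense descriptor maps.

    s1, s2: arrays of shape (H, W, D) holding per-pixel descriptors.
    u, v:   integer arrays of shape (H, W), horizontal/vertical displacement.
    Targets falling outside the image are clamped to the border (assumption).
    """
    u = np.asarray(u, dtype=int)
    v = np.asarray(v, dtype=int)
    H, W, _ = s1.shape
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.clip(ys + v, 0, H - 1)
    tx = np.clip(xs + u, 0, W - 1)
    # Data term: truncated L1 distance between matched descriptors.
    data = np.minimum(np.abs(s1 - s2[ty, tx]).sum(axis=2), t).sum()
    # Small-displacement term: prefer short flow vectors.
    small = eta * (np.abs(u) + np.abs(v)).sum()
    # Smoothness term: truncated L1 on neighboring flow differences,
    # applied to u and v separately; truncation tolerates discontinuities.
    smooth = 0.0
    for f in (u, v):
        smooth += np.minimum(alpha * np.abs(np.diff(f, axis=0)), d).sum()
        smooth += np.minimum(alpha * np.abs(np.diff(f, axis=1)), d).sum()
    return float(data + small + smooth)
```

The truncation thresholds are what make the model discontinuity-preserving: once neighboring flows differ by more than `d / alpha`, a larger jump costs nothing extra.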

Dense, Scale-Less Descriptors

Establishing correspondences between two images requires matching similar image regions. To do this effectively, local representations must be designed to allow for meaningful comparisons. As discussed in previous chapters, one such representation is the SIFT descriptor used by SIFT flow. The scale selection required to make SIFT scale invariant, however, is only known to be possible at sparse interest points, where local image information varies sufficiently. SIFT flow and similar methods consequently settle for descriptors extracted using manually determined scales, kept constant for all image pixels. In this chapter we discuss alternative representations designed to capture multiscale appearance, even in image regions where existing methods for scale selection are not effective. We make three contributions. (1) We show that SIFTs extracted from different scales, even in low-contrast areas, vary in their values, so single scale selection often results in poor matches when images show content at different scales. (2) We propose representing pixel appearances with sets of SIFTs extracted at multiple scales. (3) We show that low-dimensional linear subspaces accurately represent such SIFT sets. By mapping these subspaces to points we obtain a novel representation, the Scale-Less SIFT (SLS), which can be used in a dense manner, throughout the image, to represent multiscale image appearances. We demonstrate how the SLS descriptor can act as an alternative to existing single-scale representations, allowing for accurate dense correspondences between images with scale-varying content.
Tal Hassner, Viki Mayzels, Lihi Zelnik-Manor
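
The step of representing a set of multiscale descriptors by a linear subspace and then mapping that subspace to a single point can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the subspace is taken from an uncentered SVD, and the point is formed from the upper triangle of the projection matrix with the diagonal scaled by 1/sqrt(2), so that Euclidean distance between points is proportional to the Frobenius distance between projection matrices. The function name `subspace_to_point` and the `dim` parameter are hypothetical.

```python
import numpy as np

def subspace_to_point(descs, dim=2):
    """Map a set of descriptors to one vector via their spanning subspace.

    descs: array of shape (n_scales, D), one descriptor per scale.
    dim:   subspace dimension to keep (assumed <= min(n_scales, D)).
    Returns a vector of length D * (D + 1) / 2.
    """
    X = np.asarray(descs, dtype=float).T            # D x n_scales
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    B = U[:, :dim]                                  # orthonormal basis
    P = B @ B.T                                     # projection matrix (basis-independent)
    rows, cols = np.triu_indices(P.shape[0])
    point = P[rows, cols].copy()
    point[rows == cols] /= np.sqrt(2.0)             # scale diagonal entries
    return point
```

Because the projection matrix depends only on the subspace and not on the particular basis, two descriptor sets spanning the same subspace map to the same point, and points can be compared with ordinary Euclidean distance inside a dense matching method.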

Scale-Space SIFT Flow

The SIFT flow algorithm has been widely used for image matching/registration tasks and is particularly effective in handling image pairs from similar scenes but with different object configurations. Because the dense SIFT features are computed at a fixed scale, however, the SIFT flow method may be limited in its ability to deal with scenes exhibiting large scale changes. In this work, we propose a simple, intuitive, and effective approach, Scale-Space SIFT flow, to deal with large object scale differences. We introduce a scale field into the SIFT flow function to automatically explore the scale changes. Our approach achieves performance similar to that of the SIFT flow method for natural scenes but obtains significant improvement for images with large scale differences. Compared with a recent method that addresses a similar problem, our approach is both more effective and more efficient.
Weichao Qiu, Xinggang Wang, Xiang Bai, Alan Yuille, Zhuowen Tu

Dense Segmentation-Aware Descriptors

Dense descriptors are becoming increasingly popular in a host of tasks, such as dense image correspondence, bag-of-words image classification, and label transfer. However, the extraction of descriptors on generic image points, rather than selecting geometric features, requires rethinking how to achieve invariance to nuisance parameters. In this work we pursue invariance to occlusions and background changes by introducing segmentation information within dense feature construction. The core idea is to use the segmentation cues to downplay the features coming from image areas that are unlikely to belong to the same region as the feature point. We show how to integrate this idea with dense SIFT, as well as with the dense scale- and rotation-invariant descriptor (SID). We thereby deliver dense descriptors that are invariant to background changes, rotation, and/or scaling. We explore the merit of our technique in conjunction with large displacement motion estimation and wide-baseline stereo, and demonstrate that exploiting segmentation information yields clear improvements.
Eduard Trulls, Iasonas Kokkinos, Alberto Sanfeliu, Francesc Moreno-Noguer

SIFTpack: A Compact Representation for Efficient SIFT Matching

Computing distances between large sets of SIFT descriptors is a basic step in numerous algorithms in computer vision. When the number of descriptors is large, as is often the case, computing these distances can be extremely time consuming. We propose the SIFTpack: a compact way of storing SIFT descriptors, which enables significantly faster calculations between sets of SIFTs than the current solutions. SIFTpack can be used to represent SIFTs densely extracted from a single image or sparsely from multiple different images. We show that the SIFTpack representation saves both storage space and run time, for both finding nearest neighbors and computing all distances between all descriptors. The usefulness of SIFTpack is demonstrated as an alternative implementation for K-means dictionaries of visual words and for image retrieval.
Alexandra Gilinsky, Lihi Zelnik-Manor

In Defense of Gradient-Based Alignment on Densely Sampled Sparse Features

In this chapter, we explore the surprising result that gradient-based continuous optimization methods perform well for the alignment of image/object models when using densely sampled sparse features (HOG, dense SIFT, etc.). Gradient-based approaches for image/object alignment have many desirable properties—inference is typically fast and exact, and diverse constraints can be imposed on the motion of points. However, the presumption that gradients predicted on sparse features would be poor estimators of the true descent direction has meant that gradient-based optimization is often overlooked in favor of graph-based optimization. We show that this intuition is only partly true: sparse features are indeed poor predictors of the error surface, but this has no impact on the actual alignment performance. In fact, for general object categories that exhibit large geometric and appearance variation, sparse features are integral to achieving any convergence whatsoever. How the descent directions are predicted becomes an important consideration for these descriptors. We explore a number of strategies for estimating gradients, and show that estimating gradients via regression in a manner that explicitly handles outliers improves alignment performance substantially. To illustrate the general applicability of gradient-based methods to the alignment of challenging object categories, we perform unsupervised ensemble alignment on a series of nonrigid animal classes from ImageNet.
Hilton Bristow, Simon Lucey

Dense Correspondences and Their Applications


From Images to Depths and Back

This chapter describes what is possibly the earliest use of dense correspondence estimation for transferring semantic information between images of different scenes. The method described in this chapter was designed for non-parametric, “example-based” depth estimation of objects appearing in single photos. It consults a database of example 3D geometries and associated appearances, searching for those which look similar to the object in the photo. This is performed at the pixel level, in similar spirit to the more recent methods described in the following chapters. Those newer methods, however, use robust, generic dense correspondence estimation engines. By contrast, the method described here uses a hard-EM optimization to optimize a well-defined target function over the similarity of appearance/depth pairs in the database to appearance/estimated-depth pairs of a query photo. Results are presented demonstrating how depths associated with diverse reference objects may be assigned to different objects appearing in query photos. Going beyond visible shape, we show that the method can be employed for the surprising task of estimating the shapes of occluded objects’ backsides, so long as the reference database contains examples of mappings from appearances to backside shapes. Finally, we show how the duality of appearance and shape may be exploited in order to “paint colors” on query shapes (“colorize” them) by simply reversing the matching from appearances to depths.
Tal Hassner, Ronen Basri

Depth Transfer: Depth Extraction from Videos Using Nonparametric Sampling

In this chapter, we discuss a technique that automatically generates plausible depth maps from videos using nonparametric depth sampling. We demonstrate this method in cases where existing methods fail (nontranslating cameras and dynamic scenes). The technique is applicable to single images as well as videos. For videos, local motion cues are used to improve the inferred depth maps, while optical flow is used to ensure temporal depth consistency. For training and evaluation, a Microsoft Kinect-based system is developed to collect a large dataset containing stereoscopic videos with known depths. The depth estimation technique outperforms the state of the art on benchmark databases. The method can be used to automatically convert a monoscopic video into stereo for 3D visualization, as demonstrated through a variety of visually pleasing results for indoor and outdoor scenes, including results from the feature film Charade.
Kevin Karsch, Ce Liu, Sing Bing Kang

Nonparametric Scene Parsing via Label Transfer

While there has been a lot of recent work on object recognition and image understanding, the focus has been on carefully establishing mathematical models for images, scenes, and objects. In this chapter, we propose a novel, nonparametric approach for object recognition and scene parsing using a new technology we name label transfer. For an input image, our system first retrieves its nearest neighbors from a large database containing fully annotated images. Then, the system establishes dense correspondences between the input image and each of the nearest neighbors using the dense SIFT flow algorithm (Liu et al., 33(5):978–994, 2011; Chap. 2), which aligns two images based on local image structures. Finally, based on the dense scene correspondences obtained from SIFT flow, our system warps the existing annotations and integrates multiple cues in a Markov random field framework to segment and recognize the query image. Our nonparametric scene parsing system achieves promising experimental results on challenging databases. Compared to existing object recognition approaches that require training classifiers or appearance models for each object category, our system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
Ce Liu, Jenny Yuen, Antonio Torralba
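
The retrieve, align, and warp pipeline described in this abstract can be sketched in a few lines once the dense flows are available. The sketch below assumes that nearest-neighbor retrieval and flow estimation have already been done elsewhere, and it replaces the chapter's Markov random field integration with a plain per-pixel majority vote, which is a much simpler stand-in. The function names `warp_labels` and `transfer_labels` are hypothetical.

```python
import numpy as np

def warp_labels(labels, u, v):
    """Pull a neighbor's label map into the query frame.

    Query pixel (y, x) takes the neighbor's label at (y + v, x + u);
    targets outside the image are clamped to the border (assumption).
    """
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.clip(ys + np.asarray(v, dtype=int), 0, H - 1)
    tx = np.clip(xs + np.asarray(u, dtype=int), 0, W - 1)
    return labels[ty, tx]

def transfer_labels(neighbor_labels, flows, n_classes):
    """Majority vote over warped neighbor annotations (stand-in for the MRF)."""
    shape = neighbor_labels[0].shape
    votes = np.zeros((n_classes,) + shape, dtype=int)
    for labels, (u, v) in zip(neighbor_labels, flows):
        warped = warp_labels(labels, u, v)
        for c in range(n_classes):
            votes[c] += (warped == c)
    # Ties are resolved in favor of the lowest class index by argmax.
    return votes.argmax(axis=0)
```

A vote treats each pixel independently, so it cannot enforce the spatial consistency or combine the additional cues that a random field model provides; it only shows how annotations travel along the dense correspondences.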

Joint Inference in Weakly-Annotated Image Datasets via Dense Correspondence

We present a principled framework for inferring pixel labels in weakly annotated image datasets. Most previous example-based approaches to computer vision rely on a large corpus of densely labeled images. However, for large, modern image datasets, such labels are expensive to obtain and are often unavailable. We establish a large-scale graphical model spanning all labeled and unlabeled images, then solve it to infer pixel labels jointly for all images in the dataset while enforcing consistent annotations over similar visual patterns. This model requires significantly less labeled data and helps resolve ambiguities by propagating inferred annotations from images with stronger local visual evidence to images with weaker local evidence. We apply our proposed framework to two computer vision problems: image annotation with semantic segmentation, and object discovery and co-segmentation (segmenting multiple images containing a common object). Extensive numerical evaluations and comparisons show that our method consistently outperforms the state of the art in automatic annotation and semantic labeling, while requiring significantly less labeled data. In contrast to previous co-segmentation techniques, our method manages to discover and segment objects well even in the presence of substantial amounts of noise images (images not containing the common object), as is typical for datasets collected from Internet search.
Michael Rubinstein, Ce Liu, William T. Freeman

Dense Correspondences and Ancient Texts

This chapter concerns applications of dense correspondences to images of a very different nature than those considered in previous chapters. Rather than images of natural or man-made scenes and objects, here we deal with images of texts. We present a novel, dense correspondence-based approach to text image analysis instead of the more traditional approach of analysis at the character level (e.g., existing optical character recognition methods) or word level (the so-called word spotting approach). We focus on the challenging domain of historical text image analysis. Such texts are handwritten and are often severely corrupted by noise and degradation, making them difficult to handle with existing methods. Our system is designed for the particular task of aligning such manuscript images to their transcripts. Our proposed alternative to performing this task manually is a system which directly matches the historical text image with a synthetic image rendered from the transcript. These matches are performed at the pixel level, by using SIFT flow applied to a novel per-pixel representation. Our pipeline is robust to document degradation, variations between script styles, and nonlinear image transformations. More importantly, this per-pixel matching approach does not require prior learning of the particular script used in the documents being processed, and so can easily be applied to manuscripts of widely varying origins, languages, and characteristics.
Tal Hassner, Lior Wolf, Nachum Dershowitz, Gil Sadeh, Daniel Stökl Ben-Ezra