2018 | Book

Computer Vision – ECCV 2018

15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III

Editors: Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, Yair Weiss

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The sixteen-volume set comprising the LNCS volumes 11205-11220 constitutes the refereed proceedings of the 15th European Conference on Computer Vision, ECCV 2018, held in Munich, Germany, in September 2018. The 776 revised papers presented were carefully reviewed and selected from 2439 submissions. The papers are organized in topical sections on learning for vision; computational photography; human analysis; human sensing; stereo and reconstruction; optimization; matching and recognition; video attention; and poster sessions.

Table of Contents

Frontmatter

Computational Photography

Frontmatter
Light Structure from Pin Motion: Simple and Accurate Point Light Calibration for Physics-Based Modeling

We present a practical method for geometric point light source calibration. Unlike in prior works that use Lambertian spheres, mirror spheres, or mirror planes, our calibration target consists of a Lambertian plane and small shadow casters at unknown positions above the plane. Due to their small size, the casters’ shadows can be localized more precisely than highlights on mirrors. We show that, given shadow observations from a moving calibration target and a fixed camera, the shadow caster positions and the light position or direction can be simultaneously recovered in a structure from motion framework. Our evaluation on simulated and real scenes shows that our method yields light estimates that are stable and more accurate than existing techniques while having a considerably simpler setup and requiring less manual labor. This project’s source code can be downloaded from: https://github.com/hiroaki-santo/light-structure-from-pin-motion.

Hiroaki Santo, Michael Waechter, Masaki Samejima, Yusuke Sugano, Yasuyuki Matsushita
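
As a hedged illustration of the geometry this calibration exploits (not the authors' estimation code), the NumPy sketch below computes where a point-like pin tip casts its shadow on the Lambertian plane z = 0 under a point light; the structure-from-motion step in the paper recovers the unknown light and pin positions from many such shadow observations. Function and variable names are ours.

    import numpy as np

    def pin_shadow(light, pin, plane_z=0.0):
        """Shadow of a point-like pin tip on the plane z = plane_z,
        cast by a point light. `light` and `pin` are 3D points."""
        light, pin = np.asarray(light, float), np.asarray(pin, float)
        d = pin - light                     # ray direction: light -> pin
        t = (plane_z - light[2]) / d[2]     # ray/plane intersection parameter
        return light + t * d                # shadow position on the plane

    # e.g. a light 1 m above the plane and a pin tip 5 cm above it
    print(pin_shadow(light=[0.2, 0.1, 1.0], pin=[0.0, 0.0, 0.05]))
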
Programmable Triangulation Light Curtains

A vehicle on a road or a robot in the field does not need a full-featured 3D depth sensor to detect potential collisions or monitor its blind spot. Instead, it needs to only monitor whether any object comes within its near proximity, which is an easier task than full depth scanning. We introduce a novel device that monitors the presence of objects on a virtual shell near the device, which we refer to as a light curtain. Light curtains offer a lightweight, resource-efficient and programmable approach to proximity awareness for obstacle avoidance and navigation. They also have additional benefits in terms of improving visibility in fog as well as flexibility in handling light fall-off. Our prototype for generating light curtains works by rapidly rotating a line sensor and a line laser in synchrony. The device is capable of generating light curtains of various shapes with a range of 20–30 m in sunlight (40 m under cloudy skies and 50 m indoors) and adapts dynamically to the demands of the task. We analyze properties of light curtains and various approaches to optimize their thickness as well as power requirements. We showcase the potential of light curtains using a range of real-world scenarios.

Jian Wang, Joseph Bartels, William Whittaker, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan
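
A minimal sketch of the triangulation principle behind light curtains, under a simplified 2D top-down geometry of our own choosing (line laser at the origin, line sensor offset by a baseline, each steered by one angle): a curtain point is wherever the two steered rays intersect, so programming the pair of angles over time sweeps out a curtain of the desired shape.

    import numpy as np

    def curtain_point(theta_laser, theta_cam, baseline):
        """Intersection (x, z) of a laser ray from (0, 0) at angle theta_laser
        and a camera ray from (baseline, 0) at angle theta_cam, both measured
        from the x-axis (radians, top-down 2D view)."""
        tl, tc = theta_laser, theta_cam
        # Ray 1: t * (cos tl, sin tl);  Ray 2: (baseline, 0) + s * (cos tc, sin tc)
        A = np.array([[np.cos(tl), -np.cos(tc)],
                      [np.sin(tl), -np.sin(tc)]])
        t, s = np.linalg.solve(A, np.array([baseline, 0.0]))
        return t * np.cos(tl), t * np.sin(tl)

    # e.g. the two units steered to 45 and 135 degrees, 0.5 m apart
    print(curtain_point(np.deg2rad(45), np.deg2rad(135), baseline=0.5))
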
Learning to Separate Object Sounds by Watching Unlabeled Video

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale “in the wild” videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/ .

Ruohan Gao, Rogerio Feris, Kristen Grauman
Coded Two-Bucket Cameras for Computer Vision

We introduce coded two-bucket (C2B) imaging, a new operating principle for computational sensors with applications in active 3D shape estimation and coded-exposure imaging. A C2B sensor modulates the light arriving at each pixel by controlling which of the pixel’s two “buckets” should integrate it. C2B sensors output two images per video frame—one per bucket—and allow rapid, fully-programmable, per-pixel control of the active bucket. Using these properties as a starting point, we (1) develop an image formation model for these sensors, (2) couple them with programmable light sources to acquire illumination mosaics, i.e., images of a scene under many different illumination conditions whose pixels have been multiplexed and acquired in one shot, and (3) show how to process illumination mosaics to acquire live disparity or normal maps of dynamic scenes at the sensor’s native resolution. We present the first experimental demonstration of these capabilities, using a fully-functional C2B camera prototype. Key to this unique prototype is a novel programmable CMOS sensor that we designed from the ground up, fabricated and turned into a working system.

Mian Wei, Navid Sarhangnejad, Zhengfan Xia, Nikita Gusev, Nikola Katic, Roman Genov, Kiriakos N. Kutulakos
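
A toy simulation of the two-bucket image formation described above, assuming each subframe's light is routed entirely to one of a pixel's two buckets by a per-pixel binary code (names and the noiseless model are our simplifications):

    import numpy as np

    def c2b_frame(subframe_images, codes):
        """Simulate one C2B video frame.
        subframe_images: (F, H, W) scene intensities under F illumination conditions.
        codes: (F, H, W) binary masks; 1 -> bucket 0 integrates, 0 -> bucket 1.
        Returns the two bucket images output for this frame."""
        subframe_images = np.asarray(subframe_images, float)
        codes = np.asarray(codes, float)
        bucket0 = (codes * subframe_images).sum(axis=0)
        bucket1 = ((1.0 - codes) * subframe_images).sum(axis=0)
        return bucket0, bucket1
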
Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image

We propose a material acquisition approach to recover the spatially-varying BRDF and normal map of a near-planar surface from a single image captured by a handheld mobile phone camera. Our method images the surface under arbitrary environment lighting with the flash turned on, thereby avoiding shadows while simultaneously capturing high-frequency specular highlights. We train a CNN to regress an SVBRDF and surface normals from this image. Our network is trained using a large-scale SVBRDF dataset and designed to incorporate physical insights for material estimation, including an in-network rendering layer to model appearance and a material classifier to provide additional supervision during training. We refine the results from the network using a dense CRF module whose terms are designed specifically for our task. The framework is trained end-to-end and produces high quality results for a variety of materials. We provide extensive ablation studies to evaluate our network on both synthetic and real data, while demonstrating significant improvements in comparisons with prior works.

Zhengqin Li, Kalyan Sunkavalli, Manmohan Chandraker

Poster Session

Frontmatter
Video Object Segmentation with Joint Re-identification and Attention-Aware Mask Propagation

The problem of video object segmentation can become extremely challenging when multiple instances co-exist. While each instance may exhibit large scale and pose variations, the problem is compounded when instances occlude each other causing failures in tracking. In this study, we formulate a deep recurrent network that is capable of segmenting and tracking objects in video simultaneously by their temporal continuity, yet able to re-identify them when they re-appear after a prolonged occlusion. We combine temporal propagation and re-identification functionalities into a single framework that can be trained end-to-end. In particular, we present a re-identification module with template expansion to retrieve missing objects despite their large appearance changes. In addition, we contribute an attention-based recurrent mask propagation approach that is robust to distractors not belonging to the target segment. Our approach achieves a new state-of-the-art G-mean of 68.2 on the challenging DAVIS 2017 benchmark (test-dev set), outperforming the winning solution. Project Page: http://mmlab.ie.cuhk.edu.hk/projects/DyeNet/.

Xiaoxiao Li, Chen Change Loy
Spatio-Temporal Transformer Network for Video Restoration

State-of-the-art video restoration methods integrate optical flow estimation networks to utilize temporal information. However, these networks typically consider only a pair of consecutive frames and hence are not capable of capturing long-range temporal dependencies and fall short of establishing correspondences across several timesteps. To alleviate these problems, we propose a novel Spatio-temporal Transformer Network (STTN) which handles multiple frames at once and thereby manages to mitigate the common nuisance of occlusions in optical flow estimation. Our proposed STTN comprises a module that estimates optical flow in both space and time and a resampling layer that selectively warps target frames using the estimated flow. In our experiments, we demonstrate the efficiency of the proposed network and show state-of-the-art restoration results in video super-resolution and video deblurring.

Tae Hyun Kim, Mehdi S. M. Sajjadi, Michael Hirsch, Bernhard Schölkopf
Dense Pose Transfer

In this work we integrate ideas from surface-based modeling with neural synthesis: we propose a combination of surface-based pose estimation and deep generative models that allows us to perform accurate pose transfer, i.e. synthesize a new image of a person based on a single image of that person and the image of a pose donor. We use a dense pose estimation system that maps pixels from both images to a common surface-based coordinate system, allowing the two images to be brought in correspondence with each other. We inpaint and refine the source image intensities in the surface coordinate system, prior to warping them onto the target pose. These predictions are fused with those of a convolutional predictive module through a neural synthesis module allowing for training the whole pipeline jointly end-to-end, optimizing a combination of adversarial and perceptual losses. We show that dense pose estimation is a substantially more powerful conditioning input than landmark-, or mask-based alternatives, and report systematic improvements over state of the art generators on DeepFashion and MVC datasets.

Natalia Neverova, Rıza Alp Güler, Iasonas Kokkinos
Memory Aware Synapses: Learning What (not) to Forget

Humans can learn in a continuous manner. Old rarely utilized knowledge can be overwritten by new incoming information while important, frequently used knowledge is prevented from being erased. In artificial learning systems, lifelong learning so far has focused mainly on accumulating knowledge over tasks and overcoming catastrophic forgetting. In this paper, we argue that, given the limited model capacity and the unlimited new information to be learned, knowledge has to be preserved or erased selectively. Inspired by neuroplasticity, we propose a novel approach for lifelong learning, coined Memory Aware Synapses (MAS). It computes the importance of the parameters of a neural network in an unsupervised and online manner. Given a new sample which is fed to the network, MAS accumulates an importance measure for each parameter of the network, based on how sensitive the predicted output function is to a change in this parameter. When learning a new task, changes to important parameters can then be penalized, effectively preventing important knowledge related to previous tasks from being overwritten. Further, we show an interesting connection between a local version of our method and Hebb’s rule, which is a model for the learning process in the brain. We test our method on a sequence of object recognition tasks and on the challenging problem of learning an embedding for predicting <subject, predicate, object> triplets. We show state-of-the-art performance and, for the first time, the ability to adapt the importance of the parameters based on unlabeled data towards what the network needs (not) to forget, which may vary depending on test conditions.

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, Tinne Tuytelaars
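
A minimal PyTorch-style sketch of the idea described above: accumulate, from unlabeled data, an importance weight per parameter as the sensitivity of the squared L2 norm of the network output to that parameter, then penalize changes to important parameters when training on the next task. Names, batching, and normalization details are ours and simplified.

    import torch

    def mas_importance(model, unlabeled_loader):
        """Per-parameter importance: average gradient magnitude of the
        squared L2 norm of the output. No labels are required."""
        omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        n_batches = 0
        for x in unlabeled_loader:
            model.zero_grad()
            out = model(x)
            out.pow(2).sum().backward()        # squared L2 norm of the outputs
            for n, p in model.named_parameters():
                if p.grad is not None:
                    omega[n] += p.grad.abs()
            n_batches += 1
        return {n: w / max(n_batches, 1) for n, w in omega.items()}

    def mas_penalty(model, omega, old_params, lam=1.0):
        """Regularizer added to the new-task loss: changing parameters that
        were important for previous tasks is penalized quadratically."""
        reg = sum((omega[n] * (p - old_params[n]) ** 2).sum()
                  for n, p in model.named_parameters())
        return lam * reg
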
Multi-view to Novel View: Synthesizing Novel Views With Self-learned Confidence

In this paper, we address the task of multi-view novel view synthesis, where we are interested in synthesizing a target image with an arbitrary camera pose from given source images. We propose an end-to-end trainable framework that learns to exploit multiple viewpoints to synthesize a novel view without any 3D supervision. Specifically, our model consists of a flow prediction module and a pixel generation module to directly leverage information presented in source views as well as hallucinate missing pixels from statistical priors. To merge the predictions produced by the two modules given multi-view source images, we introduce a self-learned confidence aggregation mechanism. We evaluate our model on images rendered from 3D object models as well as real and synthesized scenes. We demonstrate that our model is able to achieve state-of-the-art results as well as progressively improve its predictions when more source images are available.

Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, Joseph J. Lim
Multimodal Unsupervised Image-to-Image Translation

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any examples of corresponding image pairs. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT.

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz
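
The translation step described above can be sketched as follows (a hedged illustration; enc_content_a and dec_b stand in for the framework's content encoder and target-domain decoder, and the style-code shape is an assumption):

    import torch

    def translate_a_to_b(x_a, enc_content_a, dec_b, style_dim=8):
        """MUNIT-style translation: keep the domain-invariant content code,
        sample a random style code from the target domain's prior, decode."""
        content = enc_content_a(x_a)                       # domain-invariant content
        style = torch.randn(x_a.size(0), style_dim, 1, 1)  # style ~ N(0, I)
        return dec_b(content, style)                       # one of many possible outputs

Sampling several style codes for the same content code yields the diverse, multimodal outputs the abstract emphasizes.
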
Deeply Learned Compositional Models for Human Pose Estimation

Compositional models represent patterns with hierarchies of meaningful parts and subparts. Their ability to characterize high-order relationships among body parts helps resolve low-level ambiguities in human pose estimation (HPE). However, prior compositional models make unrealistic assumptions on subpart-part relationships, making them incapable of characterizing complex compositional patterns. Moreover, state spaces of their higher-level parts can be exponentially large, complicating both inference and learning. To address these issues, this paper introduces a novel framework, termed the Deeply Learned Compositional Model (DLCM), for HPE. It exploits deep neural networks to learn the compositionality of human bodies. This results in a novel network with a hierarchical compositional architecture and bottom-up/top-down inference stages. In addition, we propose a novel bone-based part representation. It not only compactly encodes orientations, scales and shapes of parts, but also avoids their potentially large state spaces. With significantly lower complexities, our approach outperforms state-of-the-art methods on three benchmark datasets.

Wei Tang, Pei Yu, Ying Wu
Unsupervised Video Object Segmentation with Motion-Based Bilateral Networks

In this work, we study the unsupervised video object segmentation problem where moving objects are segmented without prior knowledge of these objects. First, we propose a motion-based bilateral network to estimate the background based on the motion pattern of non-object regions. The bilateral network reduces false positive regions by accurately identifying background objects. Then, we integrate the background estimate from the bilateral network with instance embeddings into a graph, which allows multiple frame reasoning with graph edges linking pixels from different frames. We classify graph nodes by defining and minimizing a cost function, and segment the video frames based on the node labels. The proposed method outperforms previous state-of-the-art unsupervised video object segmentation methods on the DAVIS 2016 and FBMS-59 datasets.

Siyang Li, Bryan Seybold, Alexey Vorobyov, Xuejing Lei, C.-C. Jay Kuo
Monocular Depth Estimation with Affinity, Vertical Pooling, and Label Enhancement

Significant progress has been made in monocular depth estimation with Convolutional Neural Networks (CNNs). While absolute features, such as edges and textures, can be effectively extracted, the depth constraint of neighboring pixels, namely relative features, has been mostly ignored by recent CNN-based methods. To overcome this limitation, we explicitly model the relationships of different image locations with an affinity layer and combine absolute and relative features in an end-to-end network. In addition, we consider prior knowledge that major depth changes lie in the vertical direction, and thus it is beneficial to capture long-range vertical features for refined depth estimation. In the proposed algorithm, we introduce vertical pooling to aggregate image features vertically to improve the depth accuracy. Furthermore, since the Lidar depth ground truth is quite sparse, we enhance the depth labels by generating high-quality dense depth maps with an off-the-shelf stereo matching method that takes left-right image pairs as input. We also integrate a multi-scale structure into our network to obtain a global understanding of the image depth and exploit residual learning to help depth refinement. We demonstrate that the proposed algorithm performs favorably against state-of-the-art methods both qualitatively and quantitatively on the KITTI driving dataset.

Yukang Gan, Xiangyu Xu, Wenxiu Sun, Liang Lin
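
One plausible reading of the vertical pooling operation, sketched in PyTorch (the actual kernel sizes and where the layer sits in the network are the paper's design choices, which we do not reproduce here):

    import torch.nn.functional as F

    def vertical_pool(feat, kernel_h=8):
        """Aggregate features along the height axis only, reflecting the prior
        that major depth changes occur in the vertical direction.
        feat: (N, C, H, W) feature map."""
        return F.avg_pool2d(feat, kernel_size=(kernel_h, 1), stride=(kernel_h, 1))
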
ML-LocNet: Improving Object Localization with Multi-view Learning Network

This paper addresses Weakly Supervised Object Localization (WSOL) with only image-level supervision. We propose a Multi-view Learning Localization Network (ML-LocNet) by incorporating multi-view learning into a two-phase WSOL model. Multi-view learning benefits localization due to the complementary relationships among the features learned from different views and the consensus property among the instances mined from each view. In the first phase, the representation is augmented by integrating features learned from multiple views, and in the second phase, the model performs multi-view co-training to enhance the localization performance of one view with the help of instances mined from other views, which effectively avoids early fitting. ML-LocNet can be easily combined with existing WSOL models to further improve the localization accuracy. Its effectiveness has been verified experimentally. Notably, it achieves 68.6% CorLoc and 49.7% mAP on PASCAL VOC 2007, surpassing the state of the art by a large margin.

Xiaopeng Zhang, Yang Yang, Jiashi Feng
Diagnosing Error in Temporal Action Detectors

Despite the recent progress in video understanding and the continuous rate of improvement in temporal action localization throughout the years, it is still unclear how far (or close?) we are to solving the problem. To this end, we introduce a new diagnostic tool to analyze the performance of temporal action detectors in videos and compare different methods beyond a single scalar metric. We exemplify the use of our tool by analyzing the performance of the top rewarded entries in the latest ActivityNet action localization challenge. Our analysis shows that the most impactful areas to work on are: strategies to better handle temporal context around the instances, improving the robustness w.r.t. the instance absolute and relative size, and strategies to reduce the localization errors. Moreover, our experimental analysis finds that the lack of agreement among annotators is not a major roadblock to attaining progress in the field. Our diagnostic tool is publicly available to keep fueling the minds of other researchers with additional insights about their algorithms.

Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem
Improved Structure from Motion Using Fiducial Marker Matching

In this paper, we present an incremental structure from motion (SfM) algorithm that significantly outperforms existing algorithms when fiducial markers are present in the scene, and that matches the performance of existing algorithms when no markers are present. Our algorithm uses markers to limit potential incorrect image matches, change the order in which images are added to the reconstruction, and enforce new bundle adjustment constraints. To validate our algorithm, we introduce a new dataset with 16 image collections of large indoor scenes with challenging characteristics (e.g., blank hallways, glass facades, brick walls) and with markers placed throughout. We show that our algorithm produces complete, accurate reconstructions on all 16 image collections, most of which cause other algorithms to fail. Further, by selectively masking fiducial markers, we show that the presence of even a small number of markers can improve the results of our algorithm.

Joseph DeGol, Timothy Bretl, Derek Hoiem
Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-training

Recent deep networks have achieved state-of-the-art performance on a variety of semantic segmentation tasks. Despite such progress, these models often face challenges in real-world “wild tasks” where a large difference exists between the labeled training/source data and the unseen test/target data. In particular, this difference is often referred to as the “domain gap” and can cause significantly decreased performance that cannot be easily remedied by further increasing the representation power. Unsupervised domain adaptation (UDA) seeks to overcome this problem without target domain labels. In this paper, we propose a novel UDA framework based on an iterative self-training (ST) procedure, where the problem is formulated as latent variable loss minimization and can be solved by alternately generating pseudo-labels on target data and re-training the model with these labels. On top of ST, we also propose a novel class-balanced self-training (CBST) framework to avoid the gradual dominance of large classes in pseudo-label generation, and introduce spatial priors to refine the generated labels. Comprehensive experiments show that the proposed methods achieve state-of-the-art semantic segmentation performance under multiple major UDA settings.

Yang Zou, Zhiding Yu, B. V. K. Vijaya Kumar, Jinsong Wang
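
A simplified sketch of class-balanced pseudo-label generation in the spirit of CBST: rather than one global confidence threshold, each class keeps only its own most confident predictions, so large classes cannot dominate the pseudo-label set. Thresholding by a per-class quantile and the ignore-label convention are our simplifications.

    import numpy as np

    def class_balanced_pseudo_labels(probs, portion=0.2, ignore_label=255):
        """probs: (C, H, W) softmax output for one target-domain image.
        Returns per-pixel pseudo labels, with uncertain pixels ignored."""
        C, H, W = probs.shape
        pred = probs.argmax(axis=0)
        conf = probs.max(axis=0)
        labels = np.full((H, W), ignore_label, dtype=np.int64)
        for c in range(C):
            mask = pred == c
            if not mask.any():
                continue
            thresh = np.quantile(conf[mask], 1.0 - portion)  # per-class threshold
            labels[mask & (conf >= thresh)] = c
        return labels
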
Towards Human-Level License Plate Recognition

License plate recognition (LPR) is a fundamental component of various intelligent transport systems and is expected to be both accurate and efficient. In this paper, we propose a novel LPR framework consisting of semantic segmentation and character counting, towards achieving human-level performance. Benefiting from this structure, our method can recognize a whole license plate at once rather than performing character detection or sliding-window search followed by per-character recognition. Moreover, our method achieves higher recognition accuracy by exploiting global information more effectively and avoiding the sensitive character detection step, and is time-saving because it eliminates one-by-one character recognition. Finally, we experimentally verify the effectiveness of the proposed method on two public datasets (AOLP and Media Lab) and our License Plate Dataset. The results demonstrate that our method significantly outperforms the previous state-of-the-art methods and achieves accuracies of more than 99% for almost all settings.

Jiafan Zhuang, Saihui Hou, Zilei Wang, Zheng-Jun Zha
Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition

Recognizing visual relationships <subject-predicate-object> among any pair of localized objects is pivotal for image understanding. Previous studies have shown remarkable progress in exploiting linguistic priors or external textual information to improve the performance. In this work, we investigate an orthogonal perspective based on feature interactions. We show that by encouraging deep message propagation and interactions between local object features and global predicate features, one can achieve compelling performance in recognizing complex relationships without using any linguistic priors. To this end, we present two new pooling cells to encourage feature interactions: (i) Contrastive ROI Pooling Cell, which has a unique deROI pooling that inversely pools local object features to the corresponding area of global predicate features. (ii) Pyramid ROI Pooling Cell, which broadcasts global predicate features to reinforce local object features. The two cells constitute a Spatiality-Context-Appearance Module (SCA-M), which can be further stacked consecutively to form our final Zoom-Net. We further shed light on how one could resolve ambiguous and noisy object and predicate annotations with Intra-Hierarchical trees (IH-tree). Extensive experiments conducted on the Visual Genome dataset demonstrate the effectiveness of our feature-oriented approach compared to state-of-the-art methods (Acc@1 improved to 11.42% from 8.16%) that depend on explicit modeling of linguistic interactions. We further show that SCA-M can be incorporated seamlessly into existing approaches to improve the performance by a large margin.

Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, Jing Shao, Chen Change Loy
Quantized Densely Connected U-Nets for Efficient Landmark Localization

In this paper, we propose quantized densely connected U-Nets for efficient visual landmark localization. The idea is that features of the same semantic meaning are globally reused across the stacked U-Nets. This dense connectivity largely improves the information flow, yielding improved localization accuracy. However, a vanilla dense design would suffer from a critical efficiency issue in both training and testing. To solve this problem, we first propose order-K dense connectivity to trim off long-distance shortcuts; then, we use a memory-efficient implementation to significantly boost the training efficiency and investigate an iterative refinement that may slice the model size in half. Finally, to reduce the memory consumption and high-precision operations in both training and testing, we further quantize weights, inputs, and gradients of our localization network to low bit-width numbers. We validate our approach in two tasks: human pose estimation and face alignment. The results show that our approach achieves state-of-the-art localization accuracy while using ~70% fewer parameters, ~98% less model size, and ~32× less training memory compared with other benchmark localizers.

Zhiqiang Tang, Xi Peng, Shijie Geng, Lingfei Wu, Shaoting Zhang, Dimitris Metaxas
Grassmann Pooling as Compact Homogeneous Bilinear Pooling for Fine-Grained Visual Classification

Designing discriminative and invariant features is the key to visual recognition. Recently, the bilinear pooled feature matrix of a Convolutional Neural Network (CNN) has been shown to achieve state-of-the-art performance on a range of fine-grained visual recognition tasks. The bilinear feature matrix collects second-order statistics and is closely related to the covariance matrix descriptor. However, the bilinear feature could suffer from the visual burstiness phenomenon, similar to other visual representations such as VLAD and Fisher Vector. The reason is that the bilinear feature matrix is sensitive to the magnitudes and correlations of local CNN feature elements, which can be measured by its singular values. On the other hand, the singular vectors are more invariant and thus more suitable to adopt as the feature representation. Motivated by this point, we advocate an alternative pooling method which transforms the CNN feature matrix into an orthonormal matrix consisting of its principal singular vectors. Geometrically, such an orthonormal matrix lies on the Grassmann manifold, a Riemannian manifold whose points represent subspaces of the Euclidean space. Similarity measurement of images reduces to comparing the principal angles between these “homogeneous” subspaces and thus is independent of the magnitudes and correlations of local CNN activations. In particular, we demonstrate that the projection distance on the Grassmann manifold induces a bilinear feature mapping without explicitly computing the bilinear feature matrix, which enables a very compact feature and classifier representation. Experimental results show that our method achieves an excellent balance of model complexity and accuracy on a variety of fine-grained image classification datasets.

Xing Wei, Yue Zhang, Yihong Gong, Jiawei Zhang, Nanning Zheng
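
A compact sketch of the pooling step described above: reshape the CNN feature map into a matrix and keep its top-k left singular vectors, an orthonormal matrix (a point on the Grassmann manifold) that discards the singular-value magnitudes blamed for burstiness. The choice of k is ours.

    import numpy as np

    def grassmann_pool(feat, k=4):
        """feat: (C, H, W) CNN feature map, treated as a C x (H*W) matrix.
        Returns the C x k orthonormal matrix of principal singular vectors."""
        X = feat.reshape(feat.shape[0], -1)
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        return U[:, :k]   # invariant to singular-value magnitudes
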
CGIntrinsics: Better Intrinsic Image Decomposition Through Physically-Based Rendering

Intrinsic image decomposition is a challenging, long-standing computer vision problem for which ground truth data is very difficult to acquire. We explore the use of synthetic data for training CNN-based intrinsic image decomposition models, and then apply these learned models to real-world images. To that end, we present CGIntrinsics, a new, large-scale dataset of physically-based rendered images of scenes with full ground truth decompositions. The rendering process we use is carefully designed to yield high-quality, realistic images, which we find to be crucial for this problem domain. We also propose a new end-to-end training method that learns better decompositions by leveraging CGIntrinsics, and optionally IIW and SAW, two recent datasets of sparse annotations on real-world images. Surprisingly, we find that a decomposition network trained solely on our synthetic data outperforms the state-of-the-art on both IIW and SAW, and performance improves even further when IIW and SAW data is added during training. Our work demonstrates the surprising effectiveness of carefully-rendered synthetic data for the intrinsic images task.

Zhengqi Li, Noah Snavely
Simultaneous Edge Alignment and Learning

Edge detection is among the most fundamental vision problems for its role in perceptual grouping and its wide applications. Recent advances in representation learning have led to considerable improvements in this area. Many state-of-the-art edge detection models are learned with fully convolutional networks (FCNs). However, FCN-based edge learning tends to be vulnerable to misaligned labels due to the delicate structure of edges. While this problem has been considered in evaluation benchmarks, a similar issue has not been explicitly addressed in general edge learning. In this paper, we show that label misalignment can cause considerably degraded edge learning quality, and address this issue by proposing a simultaneous edge alignment and learning framework. To this end, we formulate a probabilistic model where edge alignment is treated as latent variable optimization and is learned end-to-end during network training. Experiments show several applications of this work, including improved edge detection with state-of-the-art performance, and automatic refinement of noisy annotations.

Zhiding Yu, Weiyang Liu, Yang Zou, Chen Feng, Srikumar Ramalingam, B. V. K. Vijaya Kumar, Jan Kautz
ICNet for Real-Time Semantic Segmentation on High-Resolution Images

We focus on the challenging task of real-time semantic segmentation in this paper. It has many practical applications, yet poses the fundamental difficulty of reducing a large portion of the computation required for pixel-wise label inference. We propose an image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide an in-depth analysis of our framework and introduce the cascade feature fusion unit to quickly achieve high-quality segmentation. Our system yields real-time inference on a single GPU card with decent-quality results on challenging datasets such as Cityscapes, CamVid and COCO-Stuff.

Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, Jiaya Jia
Part-Activated Deep Reinforcement Learning for Action Prediction

In this paper, we propose a part-activated deep reinforcement learning (PA-DRL) method for action prediction. Most existing methods for action prediction utilize the evolution of whole frames to model actions, which cannot avoid the noise of the current action, especially in early prediction. Moreover, the loss of structural information of the human body diminishes the capacity of features to describe actions. To address this, we design PA-DRL to exploit the structure of the human body by extracting skeleton proposals under a deep reinforcement learning framework. Specifically, we extract features from different parts of the human body individually and activate the action-related parts in the features to enhance the representation. Our method not only exploits the structural information of the human body, but also considers the salient parts for expressing actions. We evaluate our method on three popular action prediction datasets: UT-Interaction, BIT-Interaction and UCF101. Our experimental results demonstrate that our method achieves performance competitive with the state of the art.

Lei Chen, Jiwen Lu, Zhanjie Song, Jie Zhou
Lifelong Learning via Progressive Distillation and Retrospection

Lifelong learning aims at adapting a learned model to new tasks while retaining the knowledge gained earlier. A key challenge for lifelong learning is how to strike a balance between the preservation of old tasks and the adaptation to a new one within a given model. Approaches that combine both objectives in training have been explored in previous works. Yet the performance still suffers from considerable degradation in a long sequence of tasks. In this work, we propose a novel approach to lifelong learning, which tries to seek a better balance between preservation and adaptation via two techniques: Distillation and Retrospection. Specifically, the target model adapts to the new task by knowledge distillation from an intermediate expert, while the previous knowledge is more effectively preserved by caching a small subset of data for old tasks. The combination of Distillation and Retrospection leads to a more gentle learning curve for the target model, and extensive experiments demonstrate that our approach can bring consistent improvements on both old and new tasks (Project page: http://mmlab.ie.cuhk.edu.hk/projects/lifelong/).

Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, Dahua Lin
A Closed-Form Solution to Photorealistic Image Stylization

Photorealistic image stylization concerns transferring style of a reference photo to a content photo with the constraint that the stylized photo should remain photorealistic. While several photorealistic image stylization methods exist, they tend to generate spatially inconsistent stylizations with noticeable artifacts. In this paper, we propose a method to address these issues. The proposed method consists of a stylization step and a smoothing step. While the stylization step transfers the style of the reference photo to the content photo, the smoothing step ensures spatially consistent stylizations. Each of the steps has a closed-form solution and can be computed efficiently. We conduct extensive experimental validations. The results show that the proposed method generates photorealistic stylization outputs that are more preferred by human subjects as compared to those by the competing methods while running much faster. Source code and additional results are available at https://github.com/NVIDIA/FastPhotoStyle .

Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, Jan Kautz
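
As an illustration of what a closed-form feature transform of this kind can look like, here is a standard whitening-coloring transform on deep features; this is a sketch of the general recipe, not necessarily the exact formulation of the paper's stylization step.

    import numpy as np

    def whiten_color(content_feat, style_feat, eps=1e-5):
        """Closed-form whitening-coloring: re-color whitened content features
        so their covariance matches the style features'. Inputs are (C, N)."""
        def cov_power(X, power):
            C = X @ X.T / (X.shape[1] - 1) + eps * np.eye(X.shape[0])
            w, V = np.linalg.eigh(C)
            return V @ np.diag(np.maximum(w, eps) ** power) @ V.T

        Xc = content_feat - content_feat.mean(axis=1, keepdims=True)
        mu_s = style_feat.mean(axis=1, keepdims=True)
        Xs = style_feat - mu_s

        whitened = cov_power(Xc, -0.5) @ Xc      # identity covariance
        return cov_power(Xs, 0.5) @ whitened + mu_s
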
Visual Tracking via Spatially Aligned Correlation Filters Network

Correlation filter-based trackers rely on a periodic assumption of the search sample to efficiently distinguish the target from the background. This assumption, however, yields undesired boundary effects and restricts the aspect ratios of search samples. To handle these issues, an end-to-end deep architecture is proposed to incorporate geometric transformations into a correlation filter-based network. This architecture introduces a novel spatial alignment module, which provides continuous feedback for transforming the target from the border to the center with a normalized aspect ratio. It enables correlation filters to work on well-aligned samples for better tracking. The whole architecture not only learns a generic relationship between object geometric transformations and object appearances, but also learns robust representations coupled to correlation filters in case of various geometric transformations. This lightweight architecture permits real-time speed. Experiments show that our tracker effectively handles boundary effects and aspect ratio variations, achieving state-of-the-art tracking results on recent benchmarks.

Mengdan Zhang, Qiang Wang, Junliang Xing, Jin Gao, Peixi Peng, Weiming Hu, Steve Maybank
Online Dictionary Learning for Approximate Archetypal Analysis

Archetypal analysis is an unsupervised learning approach which represents data by convex combinations of a set of archetypes. The archetypes generally correspond to the extremal points in the dataset and are learned by requiring them to be convex combinations of the training data. In spite of its nice property of interpretability, the method is slow. We propose a variant of archetypal analysis which scales gracefully to large datasets. The core idea is to decouple the binding between data and archetypes and require them to be unit normalized. Geometrically, the method learns a convex hull inside the unit sphere and represents the data by their projections on the closest surfaces of the convex hull. By minimizing the representation error, the method pushes the convex hull surfaces close to the regions of the sphere where the data reside. The vertices of the convex hull are the learned archetypes. We apply the method to human faces and poses to validate its effectiveness in the context of reconstructions and classifications.

Jieru Mei, Chunyu Wang, Wenjun Zeng
Compositing-Aware Image Search

We present a new image search technique that, given a background image, returns compatible foreground objects for image compositing tasks. The compatibility of a foreground object and a background scene depends on various aspects such as semantics, surrounding context, geometry, style and color. However, existing image search techniques measure the similarities on only a few aspects, and may return many results that are not suitable for compositing. Moreover, the importance of each factor may vary for different object categories and image content, making it difficult to manually define the matching criteria. In this paper, we propose to learn feature representations for foreground objects and background scenes respectively, where image content and object category information are jointly encoded during training. As a result, the learned features can adaptively encode the most important compatibility factors. We project the features to a common embedding space, so that the compatibility scores can be easily measured using the cosine similarity, enabling very efficient search. We collect an evaluation set consisting of eight object categories commonly used in compositing tasks, on which we demonstrate that our approach significantly outperforms other search techniques.

Hengshuang Zhao, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Brian Price, Jiaya Jia
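
Once backgrounds and foregrounds are embedded in the shared space, retrieval reduces to cosine similarity; a sketch (embeddings are assumed to be precomputed by the learned networks):

    import numpy as np

    def search_foregrounds(bg_embedding, fg_embeddings, top_k=5):
        """Rank candidate foreground objects for a background scene by cosine
        similarity. bg_embedding: (D,), fg_embeddings: (M, D)."""
        bg = bg_embedding / np.linalg.norm(bg_embedding)
        fg = fg_embeddings / np.linalg.norm(fg_embeddings, axis=1, keepdims=True)
        scores = fg @ bg
        order = np.argsort(-scores)[:top_k]
        return order, scores[order]
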
Improving Sequential Determinantal Point Processes for Supervised Video Summarization

It is now easier than ever before to produce videos. While ubiquitous video data is a great source for information discovery and extraction, the computational challenges are unparalleled. Automatically summarizing videos has become a substantial need for browsing, searching, and indexing visual content. This paper is in the vein of supervised video summarization using sequential determinantal point processes (SeqDPPs), which model diversity with a probabilistic distribution. We improve this model in two ways. In terms of learning, we propose a large-margin algorithm to address the exposure bias problem in SeqDPP. In terms of modeling, we design a new probabilistic distribution such that, when it is integrated into SeqDPP, the resulting model accepts user input about the expected length of the summary. Moreover, we significantly extend a popular video summarization dataset with (1) more egocentric videos, (2) dense user annotations, and (3) a refined evaluation scheme. We conduct extensive experiments on this dataset (about 60 h of videos in total) and compare our approach to several competitive baselines.

Aidean Sharghi, Ali Borji, Chengtao Li, Tianbao Yang, Boqing Gong
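
For reference, the building block behind SeqDPP is the determinantal point process, whose subset probabilities have the closed form P(S) = det(L_S) / det(L + I). The sketch below evaluates this quantity; the sequential extension and the large-margin training proposed in the paper are not shown.

    import numpy as np

    def dpp_log_prob(L, subset):
        """Log-probability that a DPP with PSD kernel L (n x n) selects
        exactly the given non-empty index subset."""
        L_S = L[np.ix_(subset, subset)]
        _, logdet_S = np.linalg.slogdet(L_S)
        _, logdet_Z = np.linalg.slogdet(L + np.eye(L.shape[0]))
        return logdet_S - logdet_Z
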
Online Detection of Action Start in Untrimmed, Streaming Videos

We aim to tackle a novel task in action detection - Online Detection of Action Start (ODAS) in untrimmed, streaming videos. The goal of ODAS is to detect the start of an action instance, with high categorization accuracy and low detection latency. ODAS is important in many applications such as early alert generation to allow timely security or emergency response. We propose three novel methods to specifically address the challenges in training ODAS models: (1) hard negative samples generation based on Generative Adversarial Network (GAN) to distinguish ambiguous background, (2) explicitly modeling the temporal consistency between data around action start and data succeeding action start, and (3) adaptive sampling strategy to handle the scarcity of training data. We conduct extensive experiments using THUMOS’14 and ActivityNet. We show that our proposed methods lead to significant performance gains and improve the state-of-the-art methods. An ablation study confirms the effectiveness of each proposed method.

Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i-Nieto, Shih-Fu Chang
Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos

A major challenge in computer vision is scaling activity understanding to the long tail of complex activities without requiring collecting large quantities of data for new actions. The task of video retrieval using natural language descriptions seeks to address this through rich, unconstrained supervision about complex activities. However, while this formulation offers hope of leveraging underlying compositional structure in activity descriptions, existing approaches typically do not explicitly model compositional reasoning. In this work, we introduce an approach for explicitly and dynamically reasoning about compositional natural language descriptions of activity in videos. We take a modular neural network approach that, given a natural language query, extracts the semantic structure to assemble a compositional neural network layout and corresponding network modules. We show that this approach is able to achieve state-of-the-art results on the DiDeMo video retrieval dataset.

Bingbin Liu, Serena Yeung, Edward Chou, De-An Huang, Li Fei-Fei, Juan Carlos Niebles
Meta-tracker: Fast and Robust Online Adaptation for Visual Object Trackers

This paper improves state-of-the-art visual object trackers that use online adaptation. Our core contribution is an offline meta-learning-based method to adjust the initial deep networks used in online adaptation-based tracking. The meta-learning is driven by the goal of obtaining deep networks that can quickly be adapted to robustly model a particular target in future frames. Ideally, the resulting models focus on features that are useful for future frames, and avoid overfitting to background clutter, small parts of the target, or noise. By enforcing a small number of update iterations during meta-learning, the resulting networks train significantly faster. We demonstrate this approach on top of two high-performance tracking approaches: the tracking-by-detection-based MDNet [1] and the correlation-based CREST [2]. Experimental results on standard benchmarks, OTB2015 [3] and VOT2016 [4], show that our meta-learned versions of both trackers improve speed, accuracy, and robustness.

Eunbyung Park, Alexander C. Berg
Collaborative Deep Reinforcement Learning for Multi-object Tracking

In this paper, we propose a collaborative deep reinforcement learning (C-DRL) method for multi-object tracking. Most existing multi-object tracking methods employ the tracking-by-detection strategy, which first detects objects in each frame and then associates them across different frames. However, the performance of these methods relies heavily on the detection results, which are often unsatisfactory in many real applications, especially in crowded scenes. To address this, we develop a deep prediction-decision network in our C-DRL, which simultaneously detects and predicts objects under a unified network via deep reinforcement learning. Specifically, we consider each object as an agent and track it via the prediction network, and seek the optimal tracking results by exploiting the collaborative interactions of different agents and environments via the decision network. Experimental results on the challenging MOT15 and MOT16 benchmarks are presented to show the effectiveness of our approach.

Liangliang Ren, Jiwen Lu, Zifeng Wang, Qi Tian, Jie Zhou
Multi-scale Context Intertwining for Semantic Segmentation

Accurate semantic image segmentation requires the joint consideration of local appearance, semantic information, and global scene context. In today’s age of pre-trained deep networks and their powerful convolutional features, state-of-the-art semantic segmentation approaches differ mostly in how they choose to combine together these different kinds of information. In this work, we propose a novel scheme for aggregating features from different scales, which we refer to as Multi-Scale Context Intertwining (MSCI). In contrast to previous approaches, which typically propagate information between scales in a one-directional manner, we merge pairs of feature maps in a bidirectional and recurrent fashion, via connections between two LSTM chains. By training the parameters of the LSTM units on the segmentation task, the above approach learns how to extract powerful and effective features for pixel-level semantic segmentation, which are then combined hierarchically. Furthermore, rather than using fixed information propagation routes, we subdivide images into super-pixels, and use the spatial relationship between them in order to perform image-adapted context aggregation. Our extensive evaluation on public benchmarks indicates that all of the aforementioned components of our approach increase the effectiveness of information propagation throughout the network, and significantly improve its eventual segmentation accuracy.

Di Lin, Yuanfeng Ji, Dani Lischinski, Daniel Cohen-Or, Hui Huang
Second-Order Democratic Aggregation

Aggregated second-order features extracted from deep convolutional networks have been shown to be effective for texture generation, fine-grained recognition, material classification, and scene understanding. In this paper, we study a class of orderless aggregation functions designed to minimize interference or equalize contributions in the context of second-order features. We show that they can be computed just as efficiently as their first-order counterparts and that they have favorable properties over aggregation by summation. Another line of work has shown that matrix power normalization after aggregation can significantly improve the generalization of second-order representations. We show that matrix power normalization implicitly equalizes contributions during aggregation, thus establishing a connection between matrix normalization techniques and prior work on minimizing interference. Based on this analysis, we present γ-democratic aggregators that interpolate between sum (γ = 1) and democratic pooling (γ = 0), outperforming both on several classification tasks. Moreover, unlike power normalization, the γ-democratic aggregations can be computed in a low-dimensional space by sketching, which allows the use of very high-dimensional second-order features. This results in state-of-the-art performance on several datasets.

Tsung-Yu Lin, Subhransu Maji, Piotr Koniusz
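
A sketch of the sum-aggregated second-order representation with matrix power normalization that the paper analyzes (the γ-democratic aggregator itself adds a per-descriptor weighting that we do not reproduce here):

    import numpy as np

    def second_order_power_norm(feats, gamma=0.5, eps=1e-10):
        """feats: (N, D) local descriptors. Aggregate by summation into a
        D x D second-order matrix, then raise its eigenvalues to the power
        gamma, which implicitly equalizes descriptor contributions."""
        A = feats.T @ feats                    # aggregation by summation
        w, V = np.linalg.eigh(A)               # A is symmetric PSD
        w = np.maximum(w, eps) ** gamma
        return V @ np.diag(w) @ V.T
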
Occlusion-Aware R-CNN: Detecting Pedestrians in a Crowd

Pedestrian detection in crowded scenes is a challenging problem since the pedestrians often gather together and occlude each other. In this paper, we propose a new occlusion-aware R-CNN (OR-CNN) to improve the detection accuracy in the crowd. Specifically, we design a new aggregation loss to enforce proposals to be close to and located compactly around the corresponding objects. Meanwhile, we use a new part occlusion-aware region of interest (PORoI) pooling unit to replace the RoI pooling layer in order to integrate the prior structure information of the human body with visibility prediction into the network to handle occlusion. Our detector is trained in an end-to-end fashion and achieves state-of-the-art results on three pedestrian detection datasets, i.e., CityPersons, ETH, and INRIA, and performs on par with the state of the art on Caltech.

Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, Stan Z. Li
Seeing Deeply and Bidirectionally: A Deep Learning Approach for Single Image Reflection Removal

Reflections often obstruct the desired scene when taking photos through glass panels. Removing unwanted reflections from such photos automatically is highly desirable. Traditional methods often impose certain priors or assumptions to target particular types of reflection, such as shifted double reflections, and thus have difficulty generalizing to other types. Very recently, a deep learning approach has been proposed. It learns a deep neural network that directly maps a reflection-contaminated image to a background (target) image (i.e., a reflection-free image) in an end-to-end fashion, and outperforms the previous methods. We argue that, to remove reflections truly well, we should estimate the reflection and utilize it to estimate the background image. We propose a cascade deep neural network, which estimates both the background image and the reflection. This significantly improves reflection removal. In the cascade deep network, we use the estimated background image to estimate the reflection, and then use the estimated reflection to estimate the background image, facilitating our idea of seeing deeply and bidirectionally.

Jie Yang, Dong Gong, Lingqiao Liu, Qinfeng Shi
Long-Term Tracking in the Wild: A Benchmark

We introduce the OxUvA dataset and benchmark for evaluating single-object tracking algorithms. Benchmarks have enabled great strides in the field of object tracking by defining standardized evaluations on large sets of diverse videos. However, these works have focused exclusively on sequences that are just tens of seconds in length and in which the target is always visible. Consequently, most researchers have designed methods tailored to this “short-term” scenario, which is poorly representative of practitioners’ needs. Aiming to address this disparity, we compile a long-term, large-scale tracking dataset of sequences with average length greater than two minutes and with frequent target object disappearance. The OxUvA dataset is much larger than the object tracking datasets of recent years: it comprises 366 sequences spanning 14 h of video. We assess the performance of several algorithms, considering both the ability to locate the target and to determine whether it is present or absent. Our goal is to offer the community a large and diverse benchmark to enable the design and evaluation of tracking methods ready to be used “in the wild”. The project website is oxuva.net .

Jack Valmadre, Luca Bertinetto, João F. Henriques, Ran Tao, Andrea Vedaldi, Arnold W. M. Smeulders, Philip H. S. Torr, Efstratios Gavves
Affinity Derivation and Graph Merge for Instance Segmentation

We present an instance segmentation scheme based on pixel affinity information, which encodes whether two pixels belong to the same instance. In our scheme, we use two neural networks with similar structures: one predicts pixel-level semantic scores and the other is designed to derive pixel affinities. Regarding pixels as vertices and affinities as edges, we then propose a simple yet effective graph merge algorithm to cluster pixels into instances. Experiments show that our scheme generates fine-grained instance masks. With Cityscapes training data, the proposed scheme achieves 27.3 AP on the test set.

Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, Yan Lu
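
A toy version of the graph-merge step, assuming pairwise affinities have already been predicted: treat pixels as vertices and merge every pair whose affinity exceeds a threshold using union-find (the paper's actual merging algorithm is more elaborate).

    def merge_pixels(affinities, threshold=0.5, n_pixels=None):
        """affinities: list of (i, j, score) over pixel indices 0..n-1.
        Returns an instance id per pixel."""
        n = n_pixels or 1 + max(max(i, j) for i, j, _ in affinities)
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        for i, j, score in affinities:
            if score >= threshold:
                parent[find(i)] = find(j)       # union the two clusters

        return [find(x) for x in range(n)]

    # e.g. three pixels where only 0 and 1 are strongly affine -> [1, 1, 2]
    print(merge_pixels([(0, 1, 0.9), (1, 2, 0.2)]))
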
Generating 3D Faces Using Convolutional Mesh Autoencoders

Learned 3D representations of human faces are useful for computer vision problems such as 3D face tracking and reconstruction from images, as well as graphics applications such as character generation and animation. Traditional models learn a latent representation of a face using linear subspaces or higher-order tensor generalizations. Due to this linearity, they cannot capture extreme deformations and non-linear expressions. To address this, we introduce a versatile model that learns a non-linear representation of a face using spectral convolutions on a mesh surface. We introduce mesh sampling operations that enable a hierarchical mesh representation that captures non-linear variations in shape and expression at multiple scales within the model. In a variational setting, our model samples diverse realistic 3D faces from a multivariate Gaussian distribution. Our training data consists of 20,466 meshes of extreme expressions captured over 12 different subjects. Despite limited training data, our trained model outperforms state-of-the-art face models with 50% lower reconstruction error, while using 75% fewer parameters. We show that replacing the expression space of an existing state-of-the-art face model with our model achieves a lower reconstruction error. Our data, model and code are available at http://coma.is.tue.mpg.de/.

Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, Michael J. Black
Hierarchical Relational Networks for Group Activity Recognition and Retrieval

Modeling structured relationships between people in a scene is an important step toward visual understanding. We present a Hierarchical Relational Network that computes relational representations of people, given graph structures describing potential interactions. Each relational layer is fed individual person representations and a potential relationship graph. Relational representations of each person are created based on their connections in this particular graph. We demonstrate the efficacy of this model by applying it in both supervised and unsupervised learning paradigms. First, given a video sequence of people doing a collective activity, the relational scene representation is utilized for multi-person activity recognition. Second, we propose a Relational Autoencoder model for unsupervised learning of features for action and scene retrieval. Finally, a Denoising Autoencoder variant is presented to infer missing people in the scene from their context. Empirical results demonstrate that this approach learns relational feature representations that can effectively discriminate person and group activity classes.

Mostafa S. Ibrahim, Greg Mori
Neural Procedural Reconstruction for Residential Buildings

This paper proposes a novel 3D reconstruction approach, dubbed Neural Procedural Reconstruction (NPR). NPR infers a sequence of shape grammar rule applications and reconstructs CAD-quality models with procedural structure from 3D points. While most existing methods rely on low-level geometry analysis to extract primitive structures, our approach conducts global analysis of entire building structures by deep neural networks (DNNs), enabling the reconstruction even from incomplete and sparse input data. We demonstrate the proposed system for residential buildings with aerial LiDAR as the input. Our 3D models boast compact geometry and semantically segmented architectural components. Qualitative and quantitative evaluations on hundreds of houses demonstrate that the proposed approach makes significant improvements over the existing state-of-the-art.

Huayi Zeng, Jiaye Wu, Yasutaka Furukawa
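
A toy sketch of the general idea of predicting shape-grammar rules with a network; it is not the paper's architecture, and the names NextRuleNet and decode_rule_sequence are invented for illustration.

    import torch
    import torch.nn as nn

    class NextRuleNet(nn.Module):
        """Predict the next grammar rule (index 0 = stop) from a top-down
        point-density image of the building."""
        def __init__(self, num_rules, size=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * (size // 4) ** 2, num_rules + 1))

        def forward(self, density):            # density: (N, 1, size, size)
            return self.net(density)

    def decode_rule_sequence(model, density, max_steps=8):
        # Greedy decoding of a rule sequence. A full system would update the
        # input with the partially reconstructed model after each step; that
        # feedback loop is omitted here for brevity.
        rules = []
        for _ in range(max_steps):
            rule = model(density).argmax(dim=1).item()
            if rule == 0:
                break
            rules.append(rule)
        return rules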
Simultaneous 3D Reconstruction for Water Surface and Underwater Scene

This paper presents the first approach for simultaneously recovering the 3D shape of both the wavy water surface and the moving underwater scene. A portable camera array system is constructed, which captures the scene from multiple viewpoints above the water. The correspondences across these cameras are estimated using an optical flow method and are used to infer the shape of the water surface and the underwater scene. We assume that there is only one refraction occurring at the water interface. Under this assumption, two estimates of the water surface normals should agree: one from Snell’s law of light refraction and another from local surface structure. The experimental results using both synthetic and real data demonstrate the effectiveness of the presented approach.

Yiming Qian, Yinqiang Zheng, Minglun Gong, Yee-Hong Yang
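
A small numerical sketch, under the single-refraction assumption stated in the abstract, of how a candidate water-surface normal can be checked against Snell's law; the function names and the residual used here are illustrative, not the paper's exact formulation. NumPy is assumed, with the normal pointing up into the air and all directions given as unit vectors.

    import numpy as np

    def refract(d, n, eta):
        """Vector form of Snell's law: refract incident unit ray d at a surface
        with unit normal n, for refractive-index ratio eta = n_air / n_water."""
        cos_i = -np.dot(n, d)
        sin_t2 = eta ** 2 * (1.0 - cos_i ** 2)
        cos_t = np.sqrt(1.0 - sin_t2)          # assumes no total internal reflection
        return eta * d + (eta * cos_i - cos_t) * n

    def normal_consistency(d_air, d_water, n_local, eta=1.0 / 1.33):
        """Residual between the refracted ray predicted by a candidate surface
        normal and the observed underwater ray; a small residual means the
        candidate normal is consistent with Snell's law."""
        t = refract(d_air, n_local, eta)
        return 1.0 - np.dot(t / np.linalg.norm(t), d_water / np.linalg.norm(d_water))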
Women Also Snowboard: Overcoming Bias in Captioning Models

Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are harmful. Specifically, image captioning models tend to exaggerate biases present in training data (e.g., if a word is present in 60% of training sentences, it might be predicted in 70% of sentences at test time). This can lead to incorrect captions in domains where unbiased captions are desired, or required, due to over-reliance on the learned prior and image context. In this work we investigate the generation of gender-specific caption words (e.g. man, woman) based on the person’s appearance or the image context. We introduce a new Equalizer model that encourages equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present. The resulting model is forced to look at a person rather than use contextual cues to make a gender-specific prediction. The losses that comprise our model, the Appearance Confusion Loss and the Confident Loss, are general and can be added to any description model to mitigate the impact of unwanted bias in a description dataset. Our proposed model has lower error than prior work when describing images with people and mentioning their gender, and it more closely matches the ground truth ratio of sentences including women to sentences including men. Finally, we show that our model more often looks at people when predicting their gender ( https://people.eecs.berkeley.edu/~lisa_anne/snowboard.html ).

Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, Anna Rohrbach
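
A simplified sketch of the two losses named in the abstract; the tensors below are assumed to be per-example probabilities of woman- and man-words produced by the captioner, and the paper's exact formulation differs in detail.

    import torch

    def appearance_confusion_loss(p_woman_masked, p_man_masked):
        # With the person region occluded, the captioner should have no gender
        # preference, so any gap between the gender-word probabilities is penalised.
        return (p_woman_masked - p_man_masked).abs().mean()

    def confident_loss(p_correct_gender, p_wrong_gender, eps=1e-6):
        # With gender evidence visible, the wrong gender word should receive a
        # much smaller probability than the correct one.
        return (p_wrong_gender / (p_correct_gender + eps)).mean()

Both terms can then be added, with weights, to the standard captioning cross-entropy objective.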
Joint Camera Spectral Sensitivity Selection and Hyperspectral Image Recovery

Hyperspectral image (HSI) recovery from a single RGB image has attracted much attention, and its performance has recently been shown to be sensitive to the camera spectral sensitivity (CSS). In this paper, we present an efficient convolutional neural network (CNN) based method that can jointly select the optimal CSS from a candidate dataset and learn a mapping to recover HSI from a single RGB image captured with this algorithmically selected camera. Given a specific CSS, we first present an HSI recovery network, which accounts for the underlying characteristics of the HSI, including spectral nonlinear mapping and spatial similarity. Then, we append a CSS selection layer onto the recovery network, and the optimal CSS can thus be automatically determined from the network weights under a nonnegative sparse constraint. Experimental results show that our HSI recovery network outperforms state-of-the-art methods in terms of both quantitative metrics and perceptual quality, and the selection layer always returns a CSS consistent with the best one determined by exhaustive search.

Ying Fu, Tao Zhang, Yinqiang Zheng, Debing Zhang, Hua Huang
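
A minimal sketch of what a nonnegative, sparse CSS-selection layer could look like; the class name, tensor shapes, and the specific way the constraint is enforced are assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class CSSSelectionLayer(nn.Module):
        """Learn a nonnegative, sparse weighting over candidate camera spectral
        sensitivities; after training, the dominant weight marks the selected CSS."""
        def __init__(self, css_bank):
            # css_bank: (C, 3, B) tensor of C candidate CSS curves over B bands
            super().__init__()
            self.register_buffer("css_bank", css_bank)
            self.logits = nn.Parameter(torch.zeros(css_bank.shape[0]))

        def forward(self, hsi):
            # hsi: (N, B, H, W) hyperspectral image
            w = torch.relu(self.logits)                         # nonnegativity
            css = torch.einsum("c,ckb->kb", w, self.css_bank)   # blended 3 x B sensitivity
            rgb = torch.einsum("kb,nbhw->nkhw", css, hsi)       # simulated RGB capture
            return rgb, w.sum()                                 # image + L1 sparsity term

The returned sparsity term is added to the recovery loss so that, over training, the weights concentrate on a single candidate CSS.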
Disentangling Factors of Variation with Cycle-Consistent Variational Auto-encoders

Generative models that learn disentangled representations for different factors of variation in an image can be very useful for targeted data augmentation. By sampling from the disentangled latent subspace of interest, we can efficiently generate new data necessary for a particular task. Learning disentangled representations is a challenging problem, especially when certain factors of variation are difficult to label. In this paper, we introduce a novel architecture that disentangles the latent space into two complementary subspaces by using only weak supervision in the form of pairwise similarity labels. Inspired by the recent success of cycle-consistent adversarial architectures, we use cycle-consistency in a variational auto-encoder framework. Our non-adversarial approach is in contrast to recent works that combine adversarial training with auto-encoders to disentangle representations. We show compelling results of disentangled latent subspaces on three datasets and compare with recent works that leverage adversarial training.

Ananya Harsh Jha, Saket Anand, Maneesh Singh, VSR Veeravasarapu
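
A compact sketch of the specified/unspecified latent split and the forward swap that the cycle-consistency builds on; the variational sampling and the reverse cycle are omitted, PyTorch is assumed, and the class and function names are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SplitEncoder(nn.Module):
        """Toy encoder splitting the latent code into a specified part s
        (the weakly labelled factor) and an unspecified part u."""
        def __init__(self, dim_in=784, dim_s=16, dim_u=16):
            super().__init__()
            self.fc = nn.Linear(dim_in, dim_s + dim_u)
            self.dim_s = dim_s

        def forward(self, x):
            z = self.fc(x)
            return z[:, :self.dim_s], z[:, self.dim_s:]

    def forward_swap_loss(enc, dec, x1, x2):
        # x1 and x2 share the specified factor (pairwise similarity label), so
        # swapping their specified codes should still reconstruct each input.
        s1, u1 = enc(x1)
        s2, u2 = enc(x2)
        loss1 = F.mse_loss(dec(torch.cat([s2, u1], dim=1)), x1)
        loss2 = F.mse_loss(dec(torch.cat([s1, u2], dim=1)), x2)
        return loss1 + loss2

With, for example, enc = SplitEncoder() and dec = nn.Linear(32, 784), this term can be added to the usual VAE objective; the reverse cycle, which constrains the unspecified codes of re-encoded reconstructions, is described in the paper.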
Object-Centered Image Stitching

Image stitching is typically decomposed into three phases: registration, which aligns the source images with a common target image; seam finding, which determines for each target pixel the source image it should come from; and blending, which smooths transitions over the seams. As described in [1], the seam finding phase attempts to place seams between pixels where the transition between source images is not noticeable. Here, we observe that the most problematic failures of this approach occur when objects are cropped, omitted, or duplicated. We therefore take an object-centered approach to the problem, leveraging recent advances in object detection [2–4]. We penalize candidate solutions that exhibit this class of error by modifying the energy function used in the seam finding stage. This produces substantially more realistic stitching results on challenging imagery. In addition, these methods can be used to determine when there is non-recoverable occlusion in the input data, and they suggest a simple metric for evaluating the output of stitching algorithms.

Charles Herrmann, Chen Wang, Richard Strong Bowen, Emil Keyder, Ramin Zabih
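
One simple way, as a sketch rather than the paper's actual energy terms, to realise an object-based penalty in seam finding: inflate the transition cost inside detected object masks so the optimiser routes seams around objects. NumPy is assumed and the function name is illustrative.

    import numpy as np

    def object_aware_seam_cost(pairwise_cost, object_masks, penalty=1e3):
        # pairwise_cost: (H, W) base cost of placing a seam at each pixel
        # object_masks:  iterable of boolean (H, W) masks, one per detection
        cost = pairwise_cost.copy()
        for mask in object_masks:
            cost[mask] += penalty      # discourage seams that cut through objects
        return cost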
Backmatter
Metadata
Title: Computer Vision – ECCV 2018
Editors: Vittorio Ferrari, Prof. Martial Hebert, Cristian Sminchisescu, Yair Weiss
Copyright Year: 2018
Electronic ISBN: 978-3-030-01219-9
Print ISBN: 978-3-030-01218-2
DOI: https://doi.org/10.1007/978-3-030-01219-9
