
2021 | Book

Computer Vision – ACCV 2020

15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part II

Edited by: Prof. Hiroshi Ishikawa, Prof. Cheng-Lin Liu, Prof. Tomas Pajdla, Prof. Jianbo Shi

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The six-volume set LNCS 12622-12627 constitutes the proceedings of the 15th Asian Conference on Computer Vision, ACCV 2020, held in Kyoto, Japan, in November/December 2020.*

A total of 254 contributions were carefully reviewed and selected from 768 submissions during two rounds of reviewing and improvement. The papers focus on the following topics:

Part I: 3D computer vision; segmentation and grouping

Part II: low-level vision, image processing; motion and tracking

Part III: recognition and detection; optimization, statistical methods, and learning; robot vision

Part IV: deep learning for computer vision; generative models for computer vision

Part V: face, pose, action, and gesture; video analysis and event recognition; biomedical image analysis

Part VI: applications of computer vision; vision for X; datasets and performance analysis

*The conference was held virtually.

Table of Contents

Frontmatter

Low-Level Vision, Image Processing

Frontmatter
Image Inpainting with Onion Convolutions

Recently, deep learning methods have achieved great success in the image inpainting problem. However, reconstructing the continuities of complex structures with non-stationary textures remains a challenging task for computer vision. In this paper, a novel approach to the image inpainting problem is presented, which adapts exemplar-based methods for deep convolutional neural networks. The concept of onion convolution is introduced with the purpose of preserving feature continuities and semantic coherence. Similar to recent approaches, our onion convolution is able to capture long-range spatial correlations. In general, the implementation of modules with such ability in low-level features leads to impractically high latency and complexity. To address these limitations, the onion convolution admits an efficient implementation. As qualitative and quantitative comparisons show, our method with onion convolutions outperforms state-of-the-art methods by producing more realistic, visually plausible and semantically coherent results.

Shant Navasardyan, Marianna Ohanyan
Accurate and Efficient Single Image Super-Resolution with Matrix Channel Attention Network

In recent years, deep learning methods have achieved impressive results, with higher peak signal-to-noise ratio, in Single Image Super-Resolution (SISR) tasks. However, these methods are usually computationally expensive, which constrains their application in mobile scenarios. In addition, most existing methods rarely take full advantage of the intermediate features that are helpful for restoration. To address these issues, we propose a moderate-size SISR network named the matrix channel attention network (MCAN), constructed as a matrix ensemble of multi-connected channel attention blocks (MCAB). Several models of different sizes are released to meet various practical requirements. Extensive benchmark experiments show that the proposed models achieve better performance with far fewer multiply-adds and parameters (source code is available at https://github.com/macn3388/MCAN ).

Hailong Ma, Xiangxiang Chu, Bo Zhang
Second-Order Camera-Aware Color Transformation for Cross-Domain Person Re-identification

In recent years, supervised person re-identification (person ReID) has achieved great performance on public datasets; however, cross-domain person ReID remains a challenging task. The performance of a ReID model trained on a labeled (source) dataset is often inferior on a new unlabeled (target) dataset, due to large variations in color, resolution, and scene across datasets. Therefore, unsupervised person ReID has gained a lot of attention due to its potential to solve the domain adaptation problem. Many methods focus on minimizing the distribution discrepancy in the feature domain but neglect the differences among input distributions. This motivates us to handle the variation between the input distributions of the source and target datasets directly. We propose a Second-order Camera-aware Color Transformation (SCCT) that operates at the image level and aligns the second-order statistics of all camera views of both source and target domain data with the original ImageNet data statistics. This new input normalization method, as shown in our experiments, is much more efficient than simply using ImageNet statistics. We test our method on Market1501, DukeMTMC, and MSMT17 and achieve leading performance in unsupervised person ReID.

Wangmeng Xiang, Hongwei Yong, Jianqiang Huang, Xian-Sheng Hua, Lei Zhang
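The core of SCCT is aligning the second-order color statistics of input images with target statistics. The NumPy sketch below shows one way such an alignment can be done with a whitening-and-coloring transform; the function name, the regularization constant, and the assumption that a 3x3 target covariance and mean (e.g., estimated from ImageNet images) are available are ours, not the authors' implementation.

```python
import numpy as np

def align_second_order(img, target_mean, target_cov):
    """Match the per-channel mean and covariance of an RGB image (H, W, 3),
    float values in [0, 1], to the given target statistics."""
    x = img.reshape(-1, 3).astype(np.float64)          # pixels as (N, 3) samples
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(3)   # regularized source covariance

    # Whitening from the source covariance, coloring from the target covariance.
    es, Us = np.linalg.eigh(cov)
    et, Ut = np.linalg.eigh(target_cov)
    whiten = Us @ np.diag(es ** -0.5) @ Us.T
    color = Ut @ np.diag(np.maximum(et, 0.0) ** 0.5) @ Ut.T

    transform = color @ whiten
    y = (x - mu) @ transform.T + target_mean
    return np.clip(y, 0.0, 1.0).reshape(img.shape)
```

Applying such a transform per camera view, as the abstract suggests, would use statistics gathered separately for each camera.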
CS-MCNet: A Video Compressive Sensing Reconstruction Network with Interpretable Motion Compensation

In this paper, a deep neural network with interpretable motion compensation, called CS-MCNet, is proposed to realize high-quality and real-time decoding of video compressive sensing. First, explicit multi-hypothesis motion compensation is applied in our network to extract correlation information from adjacent frames, which improves recovery performance. Then, a residual module further narrows the gap between the reconstruction result and the original signal. The overall architecture is interpretable through algorithm unrolling, which makes it possible to transfer prior knowledge from conventional algorithms. As a result, a PSNR of 22 dB can be achieved at a 64x compression ratio, which is about 4% to 9% better than state-of-the-art methods. In addition, due to the feed-forward architecture, the reconstruction can be processed by our network in real time, up to three orders of magnitude faster than traditional iterative methods.

Bowen Huang, Jinjia Zhou, Xiao Yan, Ming’e Jing, Rentao Wan, Yibo Fan
MCGKT-Net: Multi-level Context Gating Knowledge Transfer Network for Single Image Deraining

Rain streak removal from a single image is a very challenging task due to its inherently ill-posed nature. Recently, end-to-end learning techniques with deep convolutional neural networks (DCNN) have made great progress in this task. However, conventional DCNN-based deraining methods have tended to exploit ever deeper and more complex network architectures in pursuit of better performance. This study proposes a novel MCGKT-Net for boosting deraining performance, a naturally multi-scale learning framework capable of exploring the multi-scale attributes of rain streaks and the different semantic structures of clear images. To obtain highly representative features inside MCGKT-Net, we explore an internal knowledge transfer module using a ConvLSTM unit for interaction learning between different layers, and investigate an external knowledge transfer module for leveraging knowledge already learned in other task domains. Furthermore, to dynamically select useful features during learning, we propose a multi-scale context gating module in MCGKT-Net using squeeze-and-excitation blocks. Experiments on three benchmark datasets, Rain100H, Rain100L, and Rain800, show impressive performance compared with state-of-the-art methods.

Kohei Yamamichi, Xian-Hua Han
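The multi-scale context gating mentioned above is built on squeeze-and-excitation blocks. Below is a minimal PyTorch sketch of such a gating block; the class name and reduction ratio are illustrative assumptions, not the paper's code.

```python
import torch.nn as nn

class ContextGate(nn.Module):
    """Squeeze-and-excitation style gating: per-channel weights from global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: global average pooling
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # excitation: rescale feature channels
```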
Degradation Model Learning for Real-World Single Image Super-Resolution

It is well known that single image super-resolution (SISR) models trained on synthetic datasets, where a low-resolution (LR) image is generated by applying a simple degradation operator (e.g., bicubic downsampling) to its high-resolution (HR) counterpart, have limited generalization capability on real-world LR images, whose degradation process is much more complex. Several real-world SISR datasets have been constructed to reduce this gap; however, their scale is relatively small due to the laborious and costly data collection process. To remedy this issue, we propose to learn a realistic degradation model from the existing real-world datasets and use the learned degradation model to synthesize realistic HR-LR image pairs. Specifically, we learn a group of basis degradation kernels and simultaneously learn a weight prediction network to predict the pixel-wise, spatially variant degradation kernel as a weighted combination of the basis kernels. With the learned degradation model, a large number of realistic HR-LR pairs can easily be generated to train a more robust SISR model. Extensive experiments quantitatively and qualitatively validate the proposed degradation learning method and its effectiveness in improving the generalization performance of SISR models in practical scenarios.

Jin Xiao, Hongwei Yong, Lei Zhang
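The degradation model above expresses each pixel's blur kernel as a weighted combination of learned basis kernels. The following is a hedged sketch of how such a spatially variant degradation could be applied, assuming the basis kernels and a per-pixel weight map (e.g., the weight network's softmax output) are already given; downsampling and noise, which a full LR synthesis pipeline would add, are omitted.

```python
import torch
import torch.nn.functional as F

def spatially_variant_blur(hr, basis, weights):
    """hr: (B, C, H, W) image, basis: (K, k, k) kernels, weights: (B, K, H, W)."""
    B, C, H, W = hr.shape
    K, k, _ = basis.shape
    pad = k // 2
    # Convolve the image with every basis kernel (depthwise, shared across channels).
    kernels = basis.view(K, 1, 1, k, k).expand(K, C, 1, k, k).reshape(K * C, 1, k, k)
    x = hr.repeat(1, K, 1, 1)                                    # (B, K*C, H, W)
    responses = F.conv2d(x, kernels, padding=pad, groups=K * C)
    responses = responses.view(B, K, C, H, W)
    # Per-pixel weighted combination of the K filtered images.
    return (responses * weights.unsqueeze(2)).sum(dim=1)         # (B, C, H, W)
```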
Chromatic Aberration Correction Using Cross-Channel Prior in Shearlet Domain

In recent years, instead of relying on more expensive and complex optics, much research has focused on high-quality photography using lightweight cameras, such as a single-ball lens, combined with computational image processing. Traditional methods for image enhancement do not comprehensively address the blurring artifacts caused by the strong chromatic aberrations in images produced by a simple optical system. In this paper, we propose a new method to correct both lateral and axial chromatic aberrations based on their different characteristics. To eliminate lateral chromatic aberration, a cross-channel prior in the shearlet domain is proposed to align the texture information of the red and blue channels to the green channel. We also propose a new PSF estimation method to better correct axial chromatic aberration using a wave propagation model, which requires the F-number of the optical system. Simulation results demonstrate that our method provides aberration-free images where state-of-the-art methods still leave artifacts. The PSNR of the simulation results increases by at least 2 dB, and SSIM is on average 6.29% to 41.26% better than other methods. Results on real captured images show that the proposed prior can effectively remove lateral chromatic aberration, while the proposed PSF model can further correct axial chromatic aberration.

Kunyi Li, Xin Jin
Raw-Guided Enhancing Reprocess of Low-Light Image via Deep Exposure Adjustment

Enhancement of images captured in low-light conditions remains a challenging problem even with advanced machine learning techniques. The challenges include the ambiguity of the ground truth for a low-light image and the loss of information during RAW image processing. To tackle these problems, in this paper we take a novel view and regard low-light image enhancement as an exposure time adjustment problem, and propose a corresponding explicit, mathematical definition. Based on that, we construct a RAW-Guiding exposure time adjustment Network (RGNET), which overcomes RGB images' nonlinearity and RAW images' inaccessibility. That is, RGNET is trained only with RGB images and corresponding RAW images, which helps project nonlinear RGB images into a linear domain, while not using RAW images in the testing phase. Furthermore, our network consists of three individual sub-modules for unprocessing, reconstruction and processing, respectively. To the best of our knowledge, the proposed sub-net for unprocessing is the first learning-based unprocessing method. After joint training of the three parts, each pre-trained separately with RAW image guidance, experimental results demonstrate that RGNET outperforms state-of-the-art low-light image enhancement methods.

Haofeng Huang, Wenhan Yang, Yueyu Hu, Jiaying Liu
Robust High Dynamic Range (HDR) Imaging with Complex Motion and Parallax

High dynamic range (HDR) imaging is widely used in consumer photography, computer game rendering, autonomous driving, and surveillance systems. Reconstructing ghosting-free HDR images of dynamic scenes from a set of multi-exposure images is a challenging task, especially with large object motion, disparity, and occlusions, which lead to visible artifacts with existing methods. In this paper, we propose a Pyramidal Alignment and Masked merging network (PAMnet) that learns to synthesize HDR images from input low dynamic range (LDR) images in an end-to-end manner. Instead of aligning under/overexposed images to the reference view directly in the pixel domain, we apply deformable convolutions across multiscale features for pyramidal alignment. Aligned features offer more flexibility to refine the inevitable misalignment in the subsequent merging network without reconstructing the aligned image explicitly. To make full use of the aligned features, we use dilated dense residual blocks with squeeze-and-excitation (SE) attention. Such an attention mechanism effectively helps to remove redundant information and suppress misaligned features. Additional mask-based weighting further refines the HDR reconstruction, offering better image quality and sharp local details. Experiments demonstrate that PAMnet can produce ghosting-free HDR results in the presence of large disparity and motion. We present extensive comparative studies using several popular datasets to demonstrate superior quality compared to state-of-the-art algorithms.

Zhiyuan Pu, Peiyao Guo, M. Salman Asif, Zhan Ma
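Pyramidal alignment relies on deformable convolutions whose offsets are predicted from the reference and non-reference features. Below is a minimal sketch of one such alignment step at a single pyramid level, using torchvision's DeformConv2d; the module and parameter names are assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlign(nn.Module):
    """Warp non-reference features toward the reference via deformable convolution."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.dcn = DeformConv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2)

    def forward(self, feat, ref):               # both: (B, C, H, W)
        offsets = self.offset_pred(torch.cat([feat, ref], dim=1))
        return self.dcn(feat, offsets)          # aligned non-reference features
```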
Low-Light Color Imaging via Dual Camera Acquisition

As existing low-light color imaging with a single camera suffers from unrealistic color representation or blurry texture, we are motivated to devise a dual camera system that uses a high spatial resolution (HSR) monochrome camera and a low spatial resolution (LSR) color camera to synthesize high-quality color images under low-light illumination conditions. The key problem is how to efficiently learn and fuse cross-camera information for improved results in such a heterogeneous setup with domain gaps (e.g., color vs. monochrome, HSR vs. LSR). We divide the end-to-end pipeline into three consecutive modularized sub-tasks, reference-based exposure compensation (RefEC), reference-based colorization (RefColor) and reference-based super-resolution (RefSR), to alleviate the domain gaps and capture inter-camera dynamics between the hybrid inputs. In each step, we leverage deep neural networks (DNNs) to transfer and enhance the illuminative, spectral and spatial granularity in a data-driven way. Each module is first trained separately and then jointly fine-tuned for robust and reliable performance. Experimental results show that our work provides leading performance on synthetic content from popular test datasets when compared to existing algorithms, and offers appealing color reconstruction of real captured scenes from an industrial monochrome camera and a smartphone RGB camera in low-light color imaging applications.

Peiyao Guo, Zhan Ma
Frequency Attention Network: Blind Noise Removal for Real Images

With their outstanding feature extraction capabilities, deep convolutional neural networks (CNNs) have achieved extraordinary improvements in image denoising tasks. However, because of the differing statistical characteristics of signal-dependent and signal-independent noise, it is hard to model real noise for training, and blind real image denoising remains an important and challenging problem. In this work we propose a method for blind image denoising that combines frequency domain analysis and an attention mechanism, named the frequency attention network (FAN). We adopt the wavelet transform to convert images from the spatial domain to the frequency domain, whose sparser features let us exploit spectral and structural information. For the denoising task, the objective of the neural network is to estimate the optimal solution of the wavelet coefficients of the clean image through its nonlinear characteristics, which gives FAN good interpretability. Meanwhile, spatial and channel attention mechanisms are employed to enhance feature maps at different scales for capturing contextual information. Extensive experiments on a synthetic noise dataset and two real-world noise benchmarks indicate the superiority of our method over competing methods across different noise types in blind image denoising.

Hongcheng Mo, Jianfei Jiang, Qin Wang, Dong Yin, Pengyu Dong, Jingjun Tian
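FAN operates on wavelet coefficients rather than raw pixels. The sketch below, using PyWavelets, shows the kind of forward/inverse transform that turns an image into the sparser frequency-domain representation a denoising network could then process; the single-level Haar decomposition is an illustrative choice.

```python
import numpy as np
import pywt

def to_wavelet_bands(img, wavelet="haar"):
    """Decompose a grayscale image (H, W) into one low-frequency and three
    high-frequency sub-bands, stacked as a 4-channel map at half resolution."""
    ll, (lh, hl, hh) = pywt.dwt2(img, wavelet)
    return np.stack([ll, lh, hl, hh], axis=0)

def from_wavelet_bands(bands, wavelet="haar"):
    """Inverse transform from the 4-band representation back to the image."""
    ll, lh, hl, hh = bands
    return pywt.idwt2((ll, (lh, hl, hh)), wavelet)
```

In a denoising setting, the network would predict clean sub-bands from noisy ones, and the inverse transform would reconstruct the denoised image.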
Restoring Spatially-Heterogeneous Distortions Using Mixture of Experts Network

In recent years, deep learning-based methods have been successfully applied to image distortion restoration tasks. However, scenarios that assume only a single distortion may not be suitable for many real-world applications. To deal with such cases, some studies have proposed datasets of sequentially combined distortions. Taking a different view of how distortions are combined, we introduce a spatially-heterogeneous distortion dataset in which multiple corruptions are applied to different locations of each image. In addition, we propose a mixture of experts network to effectively restore a multi-distortion image. Motivated by multi-task learning, we design our network to have multiple paths that learn both common and distortion-specific representations. Our model is effective for restoring real-world distortions, and we experimentally verify that our method outperforms other models designed to manage both single distortions and multiple distortions.

Sijin Kim, Namhyuk Ahn, Kyung-Ah Sohn
Color Enhancement Using Global Parameters and Local Features Learning

This paper proposes a neural network that learns global parameters and extracts local features for color enhancement. Firstly, a global parameter extractor subnetwork with dilated convolutions is used to estimate a global color transformation matrix; the dilated convolutions strengthen the aggregation of spatial information. Secondly, a local feature extractor subnetwork with a light dense-block structure is designed to learn a matrix of local details. Finally, an enhancement map is obtained by multiplying these two matrices. A novel combination of loss functions is formulated to make the color of the generated image more consistent with that of the target. The enhanced image is formed by adding the enhancement map to the original image, which makes it possible to adjust the enhancement intensity by multiplying the enhancement map with a weighting coefficient. We conduct experiments on the MIT-Adobe FiveK benchmark, and our algorithm achieves superior performance compared with state-of-the-art methods on images and videos, both qualitatively and quantitatively.

Enyu Liu, Songnan Li, Shan Liu
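The abstract describes combining a predicted global 3x3 color transform with a predicted local detail map into an enhancement map that is added to the input. The following is a hedged sketch of that combination, where the element-wise multiplication and the strength parameter are our reading of the description, not the authors' code.

```python
import torch

def enhance(image, global_matrix, local_map, strength=1.0):
    """image: (B, 3, H, W), global_matrix: (B, 3, 3), local_map: (B, 3, H, W)."""
    B, C, H, W = image.shape
    flat = image.view(B, C, -1)                                   # (B, 3, H*W)
    global_part = torch.bmm(global_matrix, flat).view(B, C, H, W)
    enhancement = global_part * local_map                         # enhancement map
    return image + strength * enhancement                         # adjustable intensity
```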
An Efficient Group Feature Fusion Residual Network for Image Super-Resolution

Convolutional neural networks (CNNs) have made great breakthroughs in the field of image super-resolution (SR). However, most current methods improve performance simply by increasing the depth of the network. Although this strategy yields promising results, it is inefficient in many real-world scenarios because of the high computational cost. In this paper, we propose an efficient group feature fusion residual network (GFFRN) for image super-resolution. In detail, we design a novel group feature fusion residual block (GFFRB) to group and fuse the features of the intermediate layers. In this way, GFFRB enjoys both the lightweight nature of group convolution and the efficiency of skip connections, achieving better performance than most current residual blocks. Experiments on the benchmark test sets show that our models are more efficient than most state-of-the-art methods.

Pengcheng Lei, Cong Liu
Adversarial Image Composition with Auxiliary Illumination

Dealing with the inconsistency between a foreground object and a background image is a challenging task in high-fidelity image composition. State-of-the-art methods strive to harmonize the composed image by adapting the style of the foreground objects to be compatible with the background image, whereas the potential shadows that foreground objects cast in the composed image, which are critical to composition realism, are largely neglected. In this paper, we propose an Adversarial Image Composition Net (AIC-Net) that achieves realistic image composition by considering the potential shadows that the foreground object projects in the composed image. A novel branched generation mechanism is proposed, which disentangles the generation of shadows from the transfer of foreground styles so that the two tasks can be accomplished optimally and simultaneously. A differentiable spatial transformation module is designed to bridge local harmonization and global harmonization and achieve their joint optimization effectively. Extensive experiments on pedestrian and car composition tasks show that the proposed AIC-Net achieves superior composition performance both qualitatively and quantitatively.

Fangneng Zhan, Shijian Lu, Changgong Zhang, Feiying Ma, Xuansong Xie
Overwater Image Dehazing via Cycle-Consistent Generative Adversarial Network

In contrast to images taken of land scenes, images taken over water are more prone to degradation due to haze. However, existing image dehazing methods are mainly developed for land scenes and perform poorly when applied to overwater images. To address this problem, we collect the first overwater image dehazing dataset and propose an OverWater Image Dehazing GAN (OWI-DehazeGAN). Owing to the difficulty of collecting paired hazy and clean images, the dataset is composed of unpaired hazy and clean images taken over water. The proposed OWI-DehazeGAN learns the underlying style mapping between hazy and clean images in an encoder-decoder framework, supervised by a forward-backward translation consistency loss for self-supervision and a perceptual loss for content preservation. In addition to qualitative evaluation, we design an image quality assessment network to rank the dehazed images. Experimental results on both real and synthetic test data demonstrate that the proposed method performs favorably against several state-of-the-art land dehazing methods.

Shunyuan Zheng, Jiamin Sun, Qinglin Liu, Yuankai Qi, Shengping Zhang
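Training on unpaired data hinges on the forward-backward translation consistency and perceptual losses mentioned above. Here is a minimal sketch of how such losses could be computed, assuming dehaze_net, rehaze_net, and a fixed feature extractor perceptual_net are given; all names are hypothetical placeholders, not the paper's code.

```python
import torch.nn.functional as F

def unpaired_losses(hazy, clean, dehaze_net, rehaze_net, perceptual_net):
    """Cycle (forward-backward) consistency plus perceptual content preservation."""
    dehazed = dehaze_net(hazy)      # hazy  -> clean domain
    rehazed = rehaze_net(clean)     # clean -> hazy domain

    cycle = F.l1_loss(rehaze_net(dehazed), hazy) + \
            F.l1_loss(dehaze_net(rehazed), clean)

    perceptual = F.l1_loss(perceptual_net(dehazed), perceptual_net(hazy)) + \
                 F.l1_loss(perceptual_net(rehazed), perceptual_net(clean))
    return cycle, perceptual
```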
Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning

Although convolutional network-based methods have boosted the performance of single image super-resolution (SISR), their huge computation costs restrict their practical applicability. In this paper, we develop a computation-efficient yet accurate network based on the proposed attentive auxiliary features (A²F) for SISR. Firstly, to exploit features from the bottom layers, the auxiliary features from all previous layers are projected into a common space. Then, to better utilize these projected auxiliary features and filter out redundant information, channel attention is employed to select the most important common features based on the current layer's features. We incorporate these two modules into a block and implement it as a lightweight network. Experimental results on large-scale datasets demonstrate the effectiveness of the proposed model against state-of-the-art (SOTA) SR methods. Notably, with fewer than 320K parameters, A²F outperforms SOTA methods for all scales, which proves its ability to better utilize the auxiliary features. Code is available at https://github.com/wxxxxxxh/A2F-SR .

Xuehui Wang, Qing Wang, Yuzhi Zhao, Junchi Yan, Lei Fan, Long Chen
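The two modules described above, projection of all previous layers' features into a common space and channel attention driven by the current feature, can be sketched as follows; the block structure is our hedged reading of the abstract, not the released code.

```python
import torch
import torch.nn as nn

class AttentiveAuxiliaryBlock(nn.Module):
    """Fuse auxiliary features from earlier layers under attention from the current layer."""
    def __init__(self, channels, num_prev):
        super().__init__()
        self.project = nn.Conv2d(num_prev * channels, channels, 1)   # common space
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, current, previous):         # previous: list of (B, C, H, W)
        aux = self.project(torch.cat(previous, dim=1))
        gate = self.attention(current)             # which auxiliary channels to keep
        return current + gate * aux
```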
Multi-scale Attentive Residual Dense Network for Single Image Rain Removal

Single image deraining is an urgent yet challenging task, since rain streaks severely degrade image quality and hamper practical applications. Rain removal has therefore attracted considerable attention, yet the performance of existing deraining methods is limited by over-smoothing effects, poor generalization capability, and rain intensity that varies across spatial locations and color channels. To address these issues, we propose a Multi-scale Attentive Residual Dense Network, called MARD-Net, trained in an end-to-end manner to extract the negative rain streaks from rainy images while precisely preserving image details. A modified dense network architecture is used to exploit rain streak detail representations through feature reuse and propagation. Further, the Multi-scale Attentive Residual Block (MARB) is incorporated into the dense network to guide rain streak feature extraction and improve representation capability. Since contextual information is critical for deraining, MARB first uses different convolution kernels together with fusion to extract multi-scale rain features, employs a feature attention module to identify rain streak regions and color channels, and uses skip connections to aggregate features at multiple levels and accelerate convergence. The proposed method is extensively evaluated on several frequently used synthetic and real-world datasets. The quantitative and qualitative results show that the designed framework performs better than recent state-of-the-art deraining approaches in removing rain and preserving image details under various rain streak conditions.

Xiang Chen, Yufeng Huang, Lei Xu
FAN: Feature Adaptation Network for Surveillance Face Recognition and Normalization

This paper studies face recognition (FR) and normalization in surveillance imagery. Surveillance FR is a challenging problem that has great value in law enforcement. Despite recent progress in conventional FR, less effort has been devoted to surveillance FR. To bridge this gap, we propose a Feature Adaptation Network (FAN) to jointly perform surveillance FR and normalization. Our face normalization mainly acts on image resolution and is closely related to face super-resolution. However, previous face super-resolution methods require paired training data with pixel-to-pixel correspondence, which is typically unavailable between real-world low-resolution and high-resolution faces. FAN can leverage both paired and unpaired data, as we disentangle the features into identity and non-identity components and adapt the distribution of the identity features, which breaks the limits of current face super-resolution methods. We further propose a random scale augmentation scheme to learn resolution-robust identity features, with advantages over previous fixed-scale augmentation. Extensive experiments on the LFW, WIDER FACE, QMUL-SurvFace and SCface datasets show the effectiveness of our method for surveillance FR and normalization.

Xi Yin, Ying Tai, Yuge Huang, Xiaoming Liu
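The random scale augmentation mentioned above can be approximated by downsampling a face crop by a randomly chosen factor and resizing it back, so the identity features become robust to resolution. A small sketch under that assumption; the scale set and function name are illustrative, not the authors' scheme.

```python
import random
import torch.nn.functional as F

def random_scale_augment(face, scales=(0.5, 0.25, 0.125)):
    """Simulate low-resolution surveillance faces.  face: (B, C, H, W)."""
    h, w = face.shape[-2:]
    s = random.choice(scales)
    low = F.interpolate(face, scale_factor=s, mode="bilinear", align_corners=False)
    return F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)
```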
Human Motion Deblurring Using Localized Body Prior

In recent decades, the skinned multi-person linear (SMPL) model has been widely exploited in image-based 3D body reconstruction. This model, however, depends fully on the quality of the input image. Degraded inputs, such as motion-blurred images, downgrade the quality of the reconstructed 3D body. This issue becomes severe because recent motion deblurring methods mainly focus on camera motion while ignoring blur caused by human-articulated motion. In this work, we construct a localized adversarial framework that handles both human-articulated and camera motion blur. To achieve this, we utilize the restored image in a 3D body reconstruction module and produce a localized map. The map is employed to guide the adversarial modules in learning both the human body and scene regions. Nevertheless, training these modules straight away is impractical since recent blurry datasets are not supported by the 3D body predictor module. To settle this issue, we generate a novel dataset that simulates realistic blurry human motion while maintaining the presence of camera motion. Using this dataset and the proposed framework, we show that our deblurring results are superior to state-of-the-art algorithms in both quantitative and qualitative performance.

Jonathan Samuel Lumentut, Joshua Santoso, In Kyu Park
Synergistic Saliency and Depth Prediction for RGB-D Saliency Detection

Depth information available from an RGB-D camera can be useful for segmenting salient objects when figure/ground cues from the RGB channels are weak. This has motivated the development of several RGB-D saliency datasets and algorithms that use all four channels of the RGB-D data for both training and inference. Unfortunately, existing RGB-D saliency datasets are small, which may lead to overfitting and limited generalization for diverse scenarios. Here we propose a semi-supervised system for RGB-D saliency detection that can be trained on smaller RGB-D saliency datasets without saliency ground truth, while also making effective joint use of a large RGB saliency dataset with saliency ground truth. To generalize our method to RGB-D saliency datasets, we employ a novel prediction-guided cross-refinement module, which jointly estimates saliency and depth through mutual refinement between the two tasks, together with an adversarial learning approach. Critically, our system does not require saliency ground truth for the RGB-D datasets, which saves massive human labor in hand labeling, and does not require depth data at inference, allowing the method to be used in the much broader range of applications where only RGB data are available. Evaluation on seven RGB-D datasets demonstrates that, even without saliency ground truth for RGB-D datasets and using only the RGB data of RGB-D datasets at inference, our semi-supervised system performs favorably against state-of-the-art fully-supervised RGB-D saliency detection methods that use saliency ground truth for RGB-D datasets at training and depth data at inference, on the two largest testing datasets. Our approach also achieves comparable results on other popular RGB-D saliency benchmarks.

Yue Wang, Yuke Li, James H. Elder, Runmin Wu, Huchuan Lu, Lu Zhang
Deep Snapshot HDR Imaging Using Multi-exposure Color Filter Array

In this paper, we propose a deep snapshot high dynamic range (HDR) imaging framework that can effectively reconstruct an HDR image from the RAW data captured using a multi-exposure color filter array (ME-CFA), which consists of a mosaic pattern of RGB filters with different exposure levels. To effectively learn the HDR image reconstruction network, we introduce the idea of luminance normalization that simultaneously enables effective loss computation and input data normalization by considering relative local contrasts in the “normalized-by-luminance” HDR domain. This idea makes it possible to equally handle the errors in both bright and dark areas regardless of absolute luminance levels, which significantly improves the visual image quality in a tone-mapped domain. Experimental results using two public HDR image datasets demonstrate that our framework outperforms other snapshot methods and produces high-quality HDR images with fewer visual artifacts.

Takeru Suda, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi
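Luminance normalization, as described above, divides errors by a luminance estimate so that dark and bright regions contribute comparably to the loss. Below is a simplified sketch of such a loss; the authors' exact normalization and luminance estimate may differ.

```python
import torch

def luminance_normalized_l1(pred, target, eps=1e-4):
    """L1 loss in a 'normalized-by-luminance' domain.  pred/target: linear HDR, (B, 3, H, W)."""
    lum = target.mean(dim=1, keepdim=True)        # crude per-pixel luminance estimate
    return torch.abs((pred - target) / (lum + eps)).mean()
```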
Deep Priors Inside an Unrolled and Adaptive Deconvolution Model

Image deconvolution is an essential but ill-posed problem even when the degradation kernel is known. Recently, learning-based methods have demonstrated superior image restoration quality compared to traditional methods, which are typically based on empirical statistics and parameter adjustment. Despite their outstanding performance, most plug-and-play priors are trained under a specific degradation model, leading to inferior performance in restoring high-frequency components. To address this problem, a deblurring architecture is proposed that adopts (1) adaptive deconvolution modules and (2) learning-based image prior solvers. The adaptive deconvolution module adjusts the regularization weight locally to handle both smooth and non-smooth regions well. Moreover, a cascade of image priors is learned from the mappings between intermediate results and is thus robust to arbitrary noise, aliasing, and artifacts. According to our analysis, the proposed architecture achieves a significant improvement in convergence rate and results in even better restoration performance.

Hung-Chih Ko, Je-Yuan Chang, Jian-Jiun Ding

Motion and Tracking

Frontmatter
Adaptive Spatio-Temporal Regularized Correlation Filters for UAV-Based Tracking

Visual tracking on unmanned aerial vehicles (UAVs) has enabled many new practical applications in computer vision. Meanwhile, discriminative correlation filter (DCF)-based trackers have drawn great attention and undergone remarkable progress due to their promising performance and efficiency. However, the boundary effect and filter degradation remain two challenging problems. In this work, a novel Adaptive Spatio-Temporal Regularized Correlation Filter (ASTR-CF) model is proposed to address these two problems. The ASTR-CF can optimize the spatial regularization weight and the temporal regularization weight simultaneously. Meanwhile, the proposed model can be effectively optimized based on the alternating direction method of multipliers (ADMM). Experimental results on DTB70 and UAV123@10fps benchmarks have proven the superiority of the ASTR-CF tracker compared to the state-of-the-art trackers in terms of both accuracy and computational speed.

Libin Xu, Qilei Li, Jun Jiang, Guofeng Zou, Zheng Liu, Mingliang Gao
Goal-GAN: Multimodal Trajectory Prediction Based on Goal Position Estimation

In this paper, we present Goal-GAN, an interpretable and end-to-end trainable model for human trajectory prediction. Inspired by human navigation, we model the task of trajectory prediction as an intuitive two-stage process: (i) goal estimation, which predicts the most likely target positions of the agent, followed by a (ii) routing module which estimates a set of plausible trajectories that route towards the estimated goal. We leverage information about the past trajectory and visual context of the scene to estimate a multi-modal probability distribution over the possible goal positions, which is used to sample a potential goal during the inference. The routing is governed by a recurrent neural network that reacts to physical constraints in the nearby surroundings and generates feasible paths that route towards the sampled goal. Our extensive experimental evaluation shows that our method establishes a new state-of-the-art on several benchmarks while being able to generate a realistic and diverse set of trajectories that conform to physical constraints.

Patrick Dendorfer, Aljoša Ošep, Laura Leal-Taixé
Self-supervised Sparse to Dense Motion Segmentation

Observable motion in videos can give rise to the definition of objects moving with respect to the scene. The task of segmenting such moving objects is referred to as motion segmentation and is usually tackled either by aggregating motion information in long, sparse point trajectories, or by directly producing per-frame dense segmentations relying on large amounts of training data. In this paper, we propose a self-supervised method to learn the densification of sparse motion segmentations from single video frames. While previous approaches to motion segmentation build upon pre-training on large surrogate datasets and use dense motion information as an essential cue for pixelwise segmentation, our model requires no pre-training and operates at test time on single frames. It can be trained in a sequence-specific way to produce high-quality dense segmentations from sparse and noisy input. We evaluate our method on the well-known motion segmentation datasets FBMS59 and DAVIS16.

Amirhossein Kardoost, Kalun Ho, Peter Ochs, Margret Keuper
Recursive Bayesian Filtering for Multiple Human Pose Tracking from Multiple Cameras

Markerless motion capture allows the extraction of multiple 3D human poses from natural scenes, without the need for a controlled but artificial studio environment or expensive hardware. In this work we present a novel tracking algorithm which utilizes recent advancements in 2D human pose estimation as well as 3D human motion anticipation. During the prediction step we utilize an RNN to forecast a set of plausible future poses while we utilize a 2D multiple human pose estimation model during the update step to incorporate observations. Casting the problem of estimating multiple persons from multiple cameras as a tracking problem rather than an association problem results in a linear relationship between runtime and the number of tracked persons. Furthermore, tracking enables our method to overcome temporary occlusions by relying on the prediction model. Our approach achieves state-of-the-art results on popular benchmarks for 3D human pose estimation and tracking.

Oh-Hun Kwon, Julian Tanke, Juergen Gall
Adversarial Refinement Network for Human Motion Prediction

Human motion prediction aims to predict future 3D skeletal sequences from a limited observed human motion as input. Two popular approaches, recurrent neural networks and feed-forward deep networks, can predict the rough motion trend, but motion details such as limb movement may be lost. To predict more accurate future human motion, we propose an Adversarial Refinement Network (ARNet) that follows a simple yet effective coarse-to-fine mechanism with novel adversarial error augmentation. Specifically, we take both the historical motion sequences and a coarse prediction as input to our cascaded refinement network to predict refined human motion, and strengthen the refinement network with adversarial error augmentation. During training, we deliberately introduce the error distribution by learning through the adversarial mechanism among different subjects. At test time, our cascaded refinement network alleviates the prediction error from the coarse predictor, robustly yielding a finer prediction. This adversarial error augmentation provides rich error cases as input to our refinement network, leading to better generalization performance on the testing dataset. We conduct extensive experiments on three standard benchmark datasets and show that our proposed ARNet outperforms other state-of-the-art methods, especially on challenging aperiodic actions, in both short-term and long-term predictions.

Xianjin Chao, Yanrui Bin, Wenqing Chu, Xuan Cao, Yanhao Ge, Chengjie Wang, Jilin Li, Feiyue Huang, Howard Leung
Semantic Synthesis of Pedestrian Locomotion

We present a model for generating 3d articulated pedestrian locomotion in urban scenarios, with synthesis capabilities informed by the 3d scene semantics and geometry. We reformulate pedestrian trajectory forecasting as a structured reinforcement learning (RL) problem. This allows us to naturally combine prior knowledge on collision avoidance, 3d human motion capture and the motion of pedestrians as observed e.g. in Cityscapes, Waymo or simulation environments like Carla. Our proposed RL-based model allows pedestrians to accelerate and slow down to avoid imminent danger (e.g. cars), while obeying human dynamics learnt from in-lab motion capture datasets. Specifically, we propose a hierarchical model consisting of a semantic trajectory policy network that provides a distribution over possible movements, and a human locomotion network that generates 3d human poses in each step. The RL-formulation allows the model to learn even from states that are seldom exhibited in the dataset, utilizing all of the available prior and scene information. Extensive evaluations using both real and simulated data illustrate that the proposed model is on par with recent models such as S-GAN, ST-GAT and S-STGCNN in pedestrian forecasting, while outperforming these in collision avoidance. We also show that our model can be used to plan goal reaching trajectories in urban scenes with dynamic actors.

Maria Priisalu, Ciprian Paduraru, Aleksis Pirinen, Cristian Sminchisescu
Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation

The objective of this paper is to design a computational architecture that discovers camouflaged objects in videos, specifically by exploiting motion information to perform object segmentation. We make the following three contributions: (i) We propose a novel architecture that consists of two essential components for breaking camouflage, namely, a differentiable registration module to align consecutive frames based on the background, which effectively emphasises the object boundary in the difference image, and a motion segmentation module with memory that discovers the moving objects, while maintaining the object permanence even when motion is absent at some point. (ii) We collect the first large-scale Moving Camouflaged Animals (MoCA) video dataset, which consists of over 140 clips across a diverse range of animals (67 categories). (iii) We demonstrate the effectiveness of the proposed model on MoCA, and achieve competitive performance on the unsupervised segmentation protocol on DAVIS2016 by only relying on motion.

Hala Lamdouar, Charig Yang, Weidi Xie, Andrew Zisserman
Visual Tracking by TridentAlign and Context Embedding

Recent advances in Siamese network-based visual tracking methods have enabled high performance on numerous tracking benchmarks. However, extensive scale variations of the target object and distractor objects of similar categories have consistently posed challenges in visual tracking. To address these persisting issues, we propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking. The TridentAlign module facilitates adaptability to extensive scale variations and large deformations of the target by pooling the feature representation of the target object into multiple spatial dimensions to form a feature pyramid, which is then utilized in the region proposal stage. Meanwhile, the context embedding module aims to discriminate the target from distractor objects by accounting for the global context information among objects. It extracts and embeds the global context information of a given frame into a local feature representation such that the information can be utilized in the final classification stage. Experimental results obtained on multiple benchmark datasets show that the proposed tracker performs comparably to state-of-the-art trackers while running at real-time speed. (Code is available at https://github.com/JanghoonChoi/TACT .)

Janghoon Choi, Junseok Kwon, Kyoung Mu Lee
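The TridentAlign idea of pooling the target representation into multiple spatial dimensions can be illustrated with adaptive pooling to a few output sizes; the specific sizes below are assumptions, not the paper's configuration.

```python
import torch.nn.functional as F

def multi_scale_target_pool(target_feat, sizes=(3, 5, 7)):
    """Pool a target feature map (B, C, H, W) into a small feature pyramid:
    a list of (B, C, s, s) maps covering different target scales."""
    return [F.adaptive_avg_pool2d(target_feat, s) for s in sizes]
```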
Leveraging Tacit Information Embedded in CNN Layers for Visual Tracking

Different layers in CNNs not only provide different levels of abstraction for describing the objects in the input but also encode various implicit information about them. The activation patterns of different features contain valuable information about the stream of incoming images: spatial relations, temporal patterns, and the co-occurrence of spatial and spatiotemporal (ST) features. Studies in the visual tracking literature have so far utilized only one of the CNN layers, a pre-fixed combination of them, or an ensemble of trackers built upon individual layers. In this study, we employ an adaptive combination of several CNN layers in a single DCF tracker to address variations of the target appearance, and propose the use of style statistics on both spatial and temporal properties of the target, directly extracted from CNN layers, for visual tracking. Experiments demonstrate that using this additional implicit data of CNNs significantly improves the performance of the tracker. Results demonstrate the effectiveness of using style similarity and activation consistency regularization in improving localization and scale accuracy.

Kourosh Meshgi, Maryam Sadat Mirzaei, Shigeyuki Oba
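Style statistics of a CNN layer are commonly summarized by a Gram matrix of channel co-occurrences; the compact sketch below is one plausible realization of the "style similarity" the abstract refers to, not necessarily the authors' exact statistic.

```python
import torch

def gram_matrix(feat):
    """feat: (B, C, H, W) -> (B, C, C) channel co-occurrence (style) statistics."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)
```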
A Two-Stage Minimum Cost Multicut Approach to Self-supervised Multiple Person Tracking

Multiple Object Tracking (MOT) is a long-standing task in computer vision. Current approaches based on the tracking-by-detection paradigm either require some sort of domain knowledge or supervision to associate data correctly into tracks. In this work, we present a self-supervised multiple object tracking approach based on visual features and minimum cost lifted multicuts. Our method is based on straightforward spatio-temporal cues that can be extracted from neighboring frames in an image sequence without supervision. Clustering based on these cues enables us to learn the required appearance invariances for the tracking task at hand and to train an autoencoder to generate suitable latent representations. The resulting latent representations can thus serve as robust appearance cues for tracking, even over large temporal distances where no reliable spatio-temporal features can be extracted. We show that, despite being trained without the provided annotations, our model provides competitive results on the challenging MOT Benchmark for pedestrian tracking.

Kalun Ho, Amirhossein Kardoost, Franz-Josef Pfreundt, Janis Keuper, Margret Keuper
Learning Local Feature Descriptors for Multiple Object Tracking

The present study aims at learning class-agnostic embeddings suitable for Multiple Object Tracking (MOT). We demonstrate that learning local feature descriptors can provide a sufficient level of generalization. The proposed embedding function exhibits on-par performance with dedicated person re-identification counterparts in their target domain and outperforms them in others. Through its utilization, our solutions achieve state-of-the-art performance in a number of MOT benchmarks, including the CVPR'19 Tracking Challenge.

Dmytro Mykheievskyi, Dmytro Borysenko, Viktor Porokhonskyy
VAN: Versatile Affinity Network for End-to-End Online Multi-object Tracking

In recent years, tracking-by-detection has become the most popular multi-object tracking (MOT) method, and deep convolutional neural networks (CNNs)-based appearance features have been successfully applied to enhance the performance of candidate association. Several MOT methods adopt single-object tracking (SOT) and handcrafted rules to deal with incomplete detection, resulting in numerous false positives (FPs) and false negatives (FNs). However, a separately trained SOT network is not directly adaptable because domains can differ, and handcrafted rules contain a considerable number of hyperparameters, thus making it difficult to optimize the MOT method. To address this issue, we propose a versatile affinity network (VAN) that can perform the entire MOT process in a single network including target specific SOT to handle incomplete detection issues, affinity computation between target and candidates, and decision of tracking termination. We train the VAN in an end-to-end manner by using event-aware learning that is designed to reduce the potential error caused by FNs, FPs, and identity switching. The proposed VAN significantly reduces the number of hyperparameters and handcrafted rules required for the MOT framework and successfully improves the MOT performance. We implement the VAN using two baselines with different candidate refinement methods to demonstrate the effects of the proposed VAN. We also conduct extensive experiments including ablation studies on three public benchmark datasets: 2D MOT2015, MOT2016, and MOT2017. The results indicate that the proposed method successfully improves the object tracking performance compared with that of baseline methods, and outperforms recent state-of-the-art MOT methods in terms of several tracking metrics including MOT accuracy (MOTA), identity F1 score (IDF1), percentage of mostly tracked targets (MT), and FP.

Hyemin Lee, Inhan Kim, Daijin Kim
COMET: Context-Aware IoU-Guided Network for Small Object Tracking

We consider the problem of tracking an unknown small target in aerial videos taken from medium to high altitudes. This is a challenging problem, which is even more pronounced in unavoidable scenarios of drastic camera motion and high density. To address this problem, we introduce a context-aware IoU-guided tracker (COMET) that exploits a multitask two-stream network and an offline reference proposal generation strategy. The proposed network fully exploits target-related information through multi-scale feature learning and attention modules. The proposed strategy introduces an efficient sampling scheme to generalize the network to the target and its parts without imposing extra computational complexity during online tracking. These strategies contribute considerably to handling significant occlusions and viewpoint changes. Empirically, COMET outperforms the state of the art on a range of aerial-view datasets that focus on tracking small objects. Specifically, COMET outperforms the celebrated ATOM tracker by an average margin of 6.2% (and 7%) in precision (and success) score on the challenging UAVDT, VisDrone-2019, and Small-90 benchmarks.

Seyed Mojtaba Marvasti-Zadeh, Javad Khaghani, Hossein Ghanei-Yakhdan, Shohreh Kasaei, Li Cheng
Adversarial Semi-supervised Multi-domain Tracking

Neural networks for multi-domain learning empower an effective combination of information from different domains by sharing and co-learning parameters. In visual tracking, the features emerging in the shared layers of a multi-domain tracker, trained on various sequences, are crucial for tracking in unseen videos. Yet, in a fully shared architecture, some of the emerging features are useful only in a specific domain, reducing the generalization of the learned feature representation. We propose a semi-supervised learning scheme that separates domain-invariant and domain-specific features using adversarial learning, encourages mutual exclusion between them, and leverages self-supervised learning to enhance the shared features using an unlabeled reservoir. By employing these features and training dedicated layers for each sequence, we build a tracker that performs exceptionally well on different types of videos.

Kourosh Meshgi, Maryam Sadat Mirzaei
Tracking-by-Trackers with a Distilled and Reinforced Model

Visual object tracking has generally been tackled by reasoning independently on fast processing algorithms, accurate online adaptation methods, and the fusion of trackers. In this paper, we unify these goals by proposing a novel tracking methodology that takes advantage of other visual trackers, offline and online. A compact student model is trained via the marriage of knowledge distillation and reinforcement learning. The first allows transferring and compressing the tracking knowledge of other trackers. The second enables the learning of evaluation measures that are then exploited online. After learning, the student can ultimately be used to build (i) a very fast single-shot tracker, (ii) a tracker with a simple and effective online adaptation mechanism, and (iii) a tracker that performs the fusion of other trackers. Extensive validation shows that the proposed algorithms compete with real-time state-of-the-art trackers.

Matteo Dunnhofer, Niki Martinel, Christian Micheloni
Motion Prediction Using Temporal Inception Module

Human motion prediction is a necessary component for many applications in robotics and autonomous driving. Recent methods propose sequence-to-sequence deep learning models to tackle this problem. However, they do not focus on exploiting different temporal scales for different input lengths. We argue that diverse temporal scales are important, as they allow us to look at past frames with different receptive fields, which can lead to better predictions. In this paper, we propose a Temporal Inception Module (TIM) to encode human motion. Making use of TIM, our framework produces input embeddings using convolutional layers, applying different kernel sizes to different input lengths. Experimental results on the standard motion prediction benchmarks, Human3.6M and the CMU motion capture dataset, show that our approach consistently outperforms state-of-the-art methods.

Tim Lebailly, Sena Kiciroglu, Mathieu Salzmann, Pascal Fua, Wei Wang
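An inception-style temporal encoder applies convolutions with several kernel sizes so that past frames are seen with different receptive fields. The sketch below captures that idea in a simplified form (parallel branches on a single input); in the paper, the kernel size is tied to the length of each input subsequence, so this is an illustration rather than the authors' module.

```python
import torch
import torch.nn as nn

class TemporalInception(nn.Module):
    """Parallel 1D convolutions over a pose sequence (B, D, T), concatenated on channels."""
    def __init__(self, dim, out_per_branch=32, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, out_per_branch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```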
A Sparse Gaussian Approach to Region-Based 6DoF Object Tracking

We propose a novel, highly efficient sparse approach to region-based 6DoF object tracking that requires only a monocular RGB camera and the 3D object model. The key contribution of our work is a probabilistic model that considers image information sparsely along correspondence lines. For the implementation, we provide a highly efficient discrete scale-space formulation. In addition, we derive a novel mathematical proof that shows that our proposed likelihood function follows a Gaussian distribution. Based on this information, we develop robust approximations for the derivatives of the log-likelihood that are used in a regularized Newton optimization. In multiple experiments, we show that our approach outperforms state-of-the-art region-based methods in terms of tracking success while being about one order of magnitude faster. The source code of our tracker is publicly available ( https://github.com/DLR-RM/RBGT ).

Manuel Stoiber, Martin Pfanne, Klaus H. Strobl, Rudolph Triebel, Alin Albu-Schäffer
Modeling Cross-Modal Interaction in a Multi-detector, Multi-modal Tracking Framework

Different modalities have their own advantages and disadvantages. In a tracking-by-detection framework, fusing data from multiple modalities should ideally improve tracking performance over using a single modality, but this remains a challenge. This study builds upon previous research in this area. We propose a deep-learning based tracking-by-detection pipeline that uses multiple detectors and multiple sensors. For the input, we associate object proposals from 2D and 3D detectors. Through a cross-modal attention module, we optimize the interaction between the 2D RGB and 3D point cloud features of each proposal. This helps to generate 2D features with irrelevant information suppressed, boosting performance. Through experiments on a published benchmark, we demonstrate the value and ability of our design in introducing a multi-modal tracking solution to current research on Multi-Object Tracking (MOT).

Yiqi Zhong, Suya You, Ulrich Neumann
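One simple way to let 3D point-cloud features gate the 2D appearance features of each proposal, in the spirit of the cross-modal attention module described above, is a learned per-proposal attention weight. The sketch below is an assumption-laden illustration (names, dimensions, and the sigmoid gating are ours), not the paper's module.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Re-weight per-proposal 2D features using attention computed against 3D features."""
    def __init__(self, dim_2d, dim_3d, dim_attn=128):
        super().__init__()
        self.q = nn.Linear(dim_3d, dim_attn)
        self.k = nn.Linear(dim_2d, dim_attn)

    def forward(self, feat_2d, feat_3d):          # (B, N, dim_2d), (B, N, dim_3d)
        score = (self.q(feat_3d) * self.k(feat_2d)).sum(-1, keepdim=True)
        attn = torch.sigmoid(score / self.k.out_features ** 0.5)
        return feat_2d * attn                      # suppress irrelevant 2D information
```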
Dense Pixel-Wise Micro-motion Estimation of Object Surface by Using Low Dimensional Embedding of Laser Speckle Pattern

This paper proposes a method for estimating the micro-motion of an object at each pixel, motion too small to detect under a common camera-and-illumination setup. The method introduces an active-lighting approach to make the motion visually detectable. The approach is based on the speckle pattern produced by the mutual interference of laser light on the object's surface, which continuously changes its appearance according to the out-of-plane motion of the surface. In addition, the speckle pattern becomes uncorrelated under large motion. To handle both micro- and large motion, the method estimates the motion parameters up to scale at each pixel by nonlinearly embedding the speckle pattern into a low-dimensional space. The out-of-plane motion is calculated by making the motion parameters spatially consistent across the image. In the experiments, the proposed method is compared with other measuring devices to prove its effectiveness.

Ryusuke Sagawa, Yusuke Higuchi, Hiroshi Kawasaki, Ryo Furukawa, Takahiro Ito
Backmatter
Metadata
Title
Computer Vision – ACCV 2020
Edited by
Prof. Hiroshi Ishikawa
Prof. Cheng-Lin Liu
Prof. Tomas Pajdla
Prof. Jianbo Shi
Copyright Year
2021
Electronic ISBN
978-3-030-69532-3
Print ISBN
978-3-030-69531-6
DOI
https://doi.org/10.1007/978-3-030-69532-3