2020 | Book

Computer Vision – ECCV 2020 Workshops

Glasgow, UK, August 23–28, 2020, Proceedings, Part II

About this book

The six-volume set, comprising the LNCS volumes 12535 through 12540, constitutes the refereed proceedings of 28 out of the 45 workshops held at the 16th European Conference on Computer Vision, ECCV 2020. The conference was planned to take place in Glasgow, UK, during August 23-28, 2020, but changed to a virtual format due to the COVID-19 pandemic.

The 249 full papers, 18 short papers, and 21 further contributions included in the workshop proceedings were carefully reviewed and selected from a total of 467 submissions. The papers deal with diverse computer vision topics.

Part II focuses on commands for autonomous vehicles; computer vision for art analysis; sign language recognition, translation and production; visual inductive priors for data-efficient deep learning; the 3D poses in the wild challenge; map-based localization for autonomous driving; recovering 6D object pose; and shape recovery from partial textured 3D scans.

Table of Contents

Frontmatter

W12 - Commands 4 Autonomous Vehicles

Frontmatter
Commands 4 Autonomous Vehicles (C4AV) Workshop Summary

The task of visual grounding requires locating the most relevant region or object in an image, given a natural language query. So far, progress on this task was mostly measured on curated datasets, which are not always representative of human spoken language. In this work, we deviate from recent, popular task settings and consider the problem under an autonomous vehicle scenario. In particular, we consider a situation where passengers can give free-form natural language commands to a vehicle which can be associated with an object in the street scene. To stimulate research on this topic, we have organized the Commands for Autonomous Vehicles (C4AV) challenge based on the recent Talk2Car dataset. This paper presents the results of the challenge. First, we compare the used benchmark against existing datasets for visual grounding. Second, we identify the aspects that render top-performing models successful, and relate them to existing state-of-the-art models for visual grounding, in addition to detecting potential failure cases by evaluating on carefully selected subsets. Finally, we discuss several possibilities for future work.

Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Yu Liu, Luc Van Gool, Matthew Blaschko, Tinne Tuytelaars, Marie-Francine Moens
Commands for Autonomous Vehicles by Progressively Stacking Visual-Linguistic Representations

In this work, we focus on the object referral problem in the autonomous driving setting. We use a stacked visual-linguistic BERT model to learn a generic visual-linguistic representation. Each element of the input is either a word or a region of interest from the input image. To train the deep model efficiently, we use a stacking algorithm to transfer knowledge from a shallow BERT model to a deep BERT model.

Hang Dai, Shujie Luo, Yong Ding, Ling Shao
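
The stacking idea above (growing a shallow BERT into a deeper one by reusing its trained layers) can be pictured with the minimal sketch below; the toy layers and the simple duplicate-and-append rule are assumptions for illustration, not the authors' exact training schedule.

```python
# Progressive stacking sketch: initialize a 2N-layer encoder by duplicating the
# trained N layers of a shallow encoder (toy layers stand in for transformer blocks).
import copy
import torch.nn as nn

def stack_layers(shallow_layers: nn.ModuleList) -> nn.ModuleList:
    """Return a ModuleList twice as deep, initialized by copying the shallow layers."""
    doubled = [copy.deepcopy(l) for l in shallow_layers] + \
              [copy.deepcopy(l) for l in shallow_layers]
    return nn.ModuleList(doubled)

# Example with toy "layers": 3 trained layers become a 6-layer initialization.
shallow = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)])
deep = stack_layers(shallow)
```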
C4AV: Learning Cross-Modal Representations from Transformers

In this paper, we focus on the object referral problem in the autonomous driving setting. We propose a novel framework to learn cross-modal representations from transformers. In order to extract the linguistic feature, we feed the input command to the transformer encoder. Meanwhile, we use a resnet as the backbone for the image feature learning. The image features are flattened and used as the query inputs to the transformer decoder. The image feature and the linguistic feature are aggregated in the transformer decoder. A region-of-interest (RoI) alignment is applied to the feature map output from the transformer decoder to crop the RoI features for region proposals. Finally, a multi-layer classifier is used for object referral from the features of proposal regions.

Shujie Luo, Hang Dai, Ling Shao, Yong Ding
Cosine Meets Softmax: A Tough-to-beat Baseline for Visual Grounding

In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms state-of-the-art methods while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks for obtaining the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By showing the promise of simpler alternatives, our investigation suggests reconsidering approaches that rely on sophisticated attention mechanisms, multi-stage reasoning, or complex metric-learning loss functions.

Nivedita Rufus, Unni Krishnan R Nair, K. Madhava Krishna, Vineet Gandhi
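
The core objective described above — a softmax cross-entropy over cosine similarities between candidate region features and the sentence embedding — is simple enough to sketch directly; the feature extractors, dimensionality, and softmax temperature below are assumptions, not the authors' implementation.

```python
# Minimal sketch of a cosine-similarity + softmax grounding loss over N candidate ROIs.
import torch
import torch.nn.functional as F

def grounding_loss(roi_feats, text_emb, gt_index, temperature=0.1):
    """roi_feats: (N, D) region features; text_emb: (D,) sentence embedding;
    gt_index: index of the ROI that matches the command."""
    sims = F.cosine_similarity(roi_feats, text_emb.unsqueeze(0), dim=-1)  # (N,)
    logits = sims / temperature                 # softmax over candidate regions
    target = torch.tensor([gt_index])
    return F.cross_entropy(logits.unsqueeze(0), target)

# Example: 8 candidate regions with 256-d features, region 3 is the referred object.
loss = grounding_loss(torch.randn(8, 256), torch.randn(256), gt_index=3)
```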
Attention Enhanced Single Stage Multimodal Reasoner

In this paper, we propose an Attention Enhanced Single Stage Multimodal Reasoner (ASSMR) to tackle the object referral task in the self-driving car scenario. We extract features from each modality and establish attention mechanisms to jointly process them. The Key Words Extractor (KWE) is used to extract the attribute and position/scale information of the target in the command, which are used to score the corresponding features through the Position/Scale Attention Module (P/SAM) and the Object Attention Module (OAM). Based on the attention mechanism, the effective parts of the position/scale feature, the object attribute feature, and the semantic feature of the command are enhanced. Finally, we map the different features to a common embedding space to predict the final result. Our method is based on the simplified version of the Talk2Car dataset and scored 66.4 AP50 on the test set while using the official region proposals.

Jie Ou, Xinying Zhang
AttnGrounder: Talking to Cars with Attention

We propose the Attention Grounder (AttnGrounder), a single-stage end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query with every region in the corresponding image, constructing a region dependent text representation. Furthermore, to improve the localization ability of our model, we use a visual-text attention module that generates an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated with the provided ground-truth coordinates. We evaluate the AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over the existing methods. The code is available at https://github.com/i-m-vivek/AttnGrounder .

Vivek Mittal
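
The auxiliary supervision described above — a rectangular mask rendered from the ground-truth box to train the attention map — can be illustrated with a minimal helper; the image size and box format below are placeholder assumptions, not the paper's exact setup.

```python
# Render a rectangular binary target mask from a ground-truth box (x1, y1, x2, y2).
import torch

def box_to_mask(box_xyxy, height, width):
    """Return an (H, W) binary mask that is 1 inside the box and 0 elsewhere."""
    x1, y1, x2, y2 = [int(round(v)) for v in box_xyxy]
    mask = torch.zeros(height, width)
    mask[y1:y2, x1:x2] = 1.0
    return mask

# Example: supervise an attention map of size 288x512 with a box around the object.
target = box_to_mask((40, 60, 200, 180), height=288, width=512)
```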

W13 - Computer VISion for ART Analysis

Frontmatter
Detecting Faces, Visual Medium Types, and Gender in Historical Advertisements, 1950–1995

Libraries, museums, and other heritage institutions are digitizing large parts of their archives. Computer vision techniques enable scholars to query, analyze, and enrich the visual sources in these archives. However, it remains unclear how well algorithms trained on modern photographs perform on historical material. This study evaluates and adapts existing algorithms. We show that we can detect faces, visual media types, and gender with high accuracy in historical advertisements. It remains difficult to detect gender when faces are either of low quality or relatively small or large. Further optimization of scaling might solve the latter issue, while the former might be ameliorated using upscaling. We show how computer vision can produce meta-data information, which can enrich historical collections. This information can be used for further analysis of the historical representation of gender.

Melvin Wevers, Thomas Smits
A Dataset and Baselines for Visual Question Answering on Art

Answering questions related to art pieces (paintings) is a difficult task, as it implies the understanding of not only the visual information that is shown in the picture, but also the contextual knowledge that is acquired through the study of the history of art. In this work, we introduce our first attempt towards building a new dataset, coined AQUA (Art QUestion Answering). The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset. The QA pairs are cleansed by crowdsourcing workers with respect to their grammatical correctness, answerability, and answers’ correctness. Our dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions. We also present a two-branch model as baseline, where the visual and knowledge questions are handled independently. We extensively compare our baseline model against the state-of-the-art models for question answering, and we provide a comprehensive study about the challenges and potential future directions for visual question answering on art.

Noa Garcia, Chentao Ye, Zihua Liu, Qingtao Hu, Mayu Otani, Chenhui Chu, Yuta Nakashima, Teruko Mitamura
Understanding Compositional Structures in Art Historical Images Using Pose and Gaze Priors
Towards Scene Understanding in Digital Art History

Image composition as a tool for the analysis of artworks is of extreme significance for art historians. These compositions are useful in analyzing the interactions in an image to study artists and their artworks. Max Imdahl, in his work called Ikonik, along with other prominent art historians of the 20th century, underlined the aesthetic and semantic importance of the structural composition of an image. Understanding the underlying compositional structures within images is a challenging and time-consuming task. Generating these structures automatically using computer vision techniques (1) can help art historians in their sophisticated analysis by saving a lot of time and by providing an overview of and access to huge image repositories, and (2) provides an important step towards the understanding of man-made imagery by machines. In this work, we attempt to automate this process using existing state-of-the-art machine learning techniques, without involving any form of training. Our approach, inspired by Max Imdahl's pioneering work, focuses on two central themes of image composition: (a) detection of action regions and action lines of the artwork; and (b) pose-based segmentation of foreground and background. Currently, our approach works for artworks comprising protagonists (persons) in an image. In order to validate our approach qualitatively and quantitatively, we conduct a user study involving experts and non-experts. The outcome of the study highly correlates with our approach and also demonstrates its domain-agnostic capability. We have open-sourced the code: https://github.com/image-compostion-canvas-group/image-compostion-canvas

Prathmesh Madhu, Tilman Marquart, Ronak Kosti, Peter Bell, Andreas Maier, Vincent Christlein
Demographic Influences on Contemporary Art with Unsupervised Style Embeddings

Computational art analysis has, through its reliance on classification tasks, prioritised historical datasets in which the artworks are already well sorted with the necessary annotations. Art produced today, on the other hand, is numerous and easily accessible, through the internet and social networks that are used by professional and amateur artists alike to display their work. Although this art—yet unsorted in terms of style and genre—is less suited for supervised analysis, the data sources come with novel information that may help frame the visual content in equally novel ways. As a first step in this direction, we present contempArt, a multi-modal dataset of exclusively contemporary artworks. contempArt is a collection of paintings and drawings, a detailed graph network based on social connections on Instagram and additional socio-demographic information; all attached to 442 artists at the beginning of their career. We evaluate three methods suited for generating unsupervised style embeddings of images and correlate them with the remaining data. We find no connections between visual style on the one hand and social proximity, gender, and nationality on the other.

Nikolai Huckle, Noa Garcia, Yuta Nakashima
Geolocating Time: Digitisation and Reverse Engineering of a Roman Sundial

The sundial of Euporus was discovered in 1878 within the ancient Roman city of Aquileia (Italy), in a quite unusual location at the centre of the city’s horse race track. Studies have tried to demonstrate that the sundial had been made for a more southern location than the one it was found at, although no specific alternative positions have been suggested. This paper showcases both the workflow designed to fully digitise it in 3D and analyses on the use of the artefact undertaken from it. The final 3D reconstruction achieves accuracies of a few millimetres, thus offering the opportunity to analyse small details of its surface and to perform non-trivial measurements. We also propose a mathematical approach to compute the object’s optimal working latitude as well as the gnomon position and orientation. The algorithm is designed as an optimization problem where the sundial’s inscriptions and the Sun positions during daytime are considered to obtain the optimal configuration. The complete 3D model of the object is used to get all the geometrical information needed to validate the results of computations.

Mara Pistellato, Arianna Traviglia, Filippo Bergamasco
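
As a toy illustration of the optimization idea (not the paper's full 3D pipeline), the snippet below recovers the latitude at which a vertical gnomon would reproduce a set of noon shadow lengths; the simplified shadow model and the synthetic "measurements" are assumptions.

```python
# Recover the working latitude of a sundial by least-squares fitting noon shadow lengths.
import numpy as np
from scipy.optimize import minimize_scalar

def noon_shadow_length(lat_deg, decl_deg, h=0.1):
    """Shadow length of a vertical gnomon of height h at local solar noon."""
    alt = np.radians(90.0 - np.abs(lat_deg - decl_deg))   # solar altitude at noon
    return h / np.tan(alt)

declinations = np.array([-23.44, 0.0, 23.44])             # solstices and equinox
measured = noon_shadow_length(41.5, declinations)          # synthetic "inscription" data

res = minimize_scalar(
    lambda lat: np.sum((noon_shadow_length(lat, declinations) - measured) ** 2),
    bounds=(20.0, 60.0), method="bounded")
print(res.x)   # recovers ~41.5 degrees for this synthetic example
```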
Object Retrieval and Localization in Large Art Collections Using Deep Multi-style Feature Fusion and Iterative Voting

The search for specific objects or motifs is essential to art history as both assist in decoding the meaning of artworks. Digitization has produced large art collections, but manual methods prove to be insufficient to analyze them. In the following, we introduce an algorithm that allows users to search for image regions containing specific motifs or objects and find similar regions in an extensive dataset, helping art historians to analyze large digitized art collections. Computer vision has presented efficient methods for visual instance retrieval across photographs. However, applied to art collections, they reveal severe deficiencies because of diverse motifs and massive domain shifts induced by differences in techniques, materials, and styles. In this paper, we present a multi-style feature fusion approach that successfully reduces the domain gap and improves retrieval results without labelled data or curated image collections. Our region-based voting with GPU-accelerated approximate nearest-neighbour search [29] allows us to find and localize even small motifs within an extensive dataset in a few seconds. We obtain state-of-the-art results on the Brueghel dataset [2, 52] and demonstrate its generalization to inhomogeneous collections with a large number of distractors.

Nikolai Ufer, Sabine Lang, Björn Ommer
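
The GPU-accelerated approximate nearest-neighbour search mentioned above can be illustrated with a generic FAISS retrieval step; the database size, feature dimensionality, and index parameters below are illustrative assumptions running on CPU, not the paper's configuration.

```python
# Approximate nearest-neighbour retrieval of region descriptors with an IVF index.
import numpy as np
import faiss

d = 128
db = np.random.rand(100_000, d).astype("float32")      # database region descriptors
queries = np.random.rand(5, d).astype("float32")       # query region descriptors
faiss.normalize_L2(db)                                  # cosine similarity via inner product
faiss.normalize_L2(queries)

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
index.train(db)
index.add(db)
index.nprobe = 8                                        # accuracy/speed trade-off
scores, ids = index.search(queries, k=10)               # top-10 matches per query
```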

W15 - Sign Language Recognition, Translation and Production

Frontmatter
SLRTP 2020: The Sign Language Recognition, Translation & Production Workshop

The objective of the “Sign Language Recognition, Translation & Production” (SLRTP 2020) Workshop was to bring together researchers who focus on the various aspects of sign language understanding using tools from computer vision and linguistics. The workshop sought to promote a greater linguistic and historical understanding of sign languages within the computer vision community, to foster new collaborations and to identify the most pressing challenges for the field going forwards. The workshop was held in conjunction with the European Conference on Computer Vision (ECCV), 2020.

Necati Cihan Camgöz, Gül Varol, Samuel Albanie, Neil Fox, Richard Bowden, Andrew Zisserman, Kearsy Cormier
Automatic Segmentation of Sign Language into Subtitle-Units

We present baseline results for a new task of automatic segmentation of Sign Language video into sentence-like units. We use a corpus of natural Sign Language video with accurately aligned subtitles to train a spatio-temporal graph convolutional network with a BiLSTM on 2D skeleton data to automatically detect the temporal boundaries of subtitles. In doing so, we segment Sign Language video into subtitle-units that can be translated into phrases in a written language. We achieve a ROC-AUC statistic of 0.87 at the frame level and 92% label accuracy within a time margin of 0.6s of the true labels.

Hannah Bull, Michèle Gouiffès, Annelies Braffort
Phonologically-Meaningful Subunits for Deep Learning-Based Sign Language Recognition

The large majority of sign language recognition systems based on deep learning adopt a word model approach. Here we present a system that works with subunits, rather than word models. We propose a pipelined approach to deep learning that uses a factorisation algorithm to derive hand motion features, embedded within a low-rank trajectory space. Recurrent neural networks are then trained on these embedded features for subunit recognition, followed by a second-stage neural network for sign recognition. Our evaluation shows that our proposed solution compares well in accuracy against the state of the art, providing added benefits of better interpretability and phonologically-meaningful subunits that can operate across different signers and sign languages.

Mark Borg, Kenneth P. Camilleri
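
One way to picture the low-rank trajectory space is a truncated-SVD embedding of flattened hand trajectories, sketched below; the actual factorisation algorithm and features used in the paper may differ.

```python
# Embed hand-motion trajectories in a rank-k subspace via truncated SVD.
import numpy as np

def low_rank_trajectory_embedding(trajectories, rank=4):
    """trajectories: (num_samples, T * d) flattened hand trajectories."""
    X = trajectories - trajectories.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:rank].T            # coordinates in the rank-k trajectory space

emb = low_rank_trajectory_embedding(np.random.randn(50, 30 * 2), rank=4)
```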
Recognition of Affective and Grammatical Facial Expressions: A Study for Brazilian Sign Language

Individuals with hearing impairment typically face difficulties in communicating with hearing individuals and during the acquisition of reading and writing skills. Widely adopted by the deaf, Sign Language (SL) has a grammatical structure where facial expressions assume grammatical and affective functions, differentiate lexical items, participate in syntactic construction, and contribute to intensification processes. Automatic Sign Language Recognition (ASLR) technology supports the communication between deaf and hearing individuals, translating sign language gestures into written or spoken sentences of a target language. The recognition of facial expressions can improve ASLR accuracy rates. There are cases where the absence of a facial expression can lead to wrong translations, making facial expressions necessary for the understanding of sign language. This paper presents an approach to facial expression recognition for sign language, with Brazilian Sign Language (Libras) used as a case study. In our approach, we code Libras' facial expressions using the Facial Action Coding System (FACS). We evaluate two convolutional neural networks, a standard CNN and a hybrid CNN+LSTM, for AU recognition on a challenging real-world video dataset of facial expressions in Libras. The results achieve an average F1-score of 0.87 and indicate the potential of the system to recognize Libras' facial expressions.

Emely Pujólli da Silva, Paula Dornhofer Paro Costa, Kate Mamhy Oliveira Kumada, José Mario De Martino, Gabriela Araújo Florentino
Real-Time Sign Language Detection Using Human Pose Estimation

We propose a lightweight real-time sign language detection model, as we identify the need for such a case in videoconferencing. We extract optical flow features based on human pose estimation and, using a linear classifier, show these features are meaningful with an accuracy of 80%, evaluated on the Public DGS Corpus. Using a recurrent model directly on the input, we see improvements of up to 91% accuracy, while still working under 4 ms. We describe a demo application to sign language detection in the browser in order to demonstrate its usage possibility in videoconferencing applications.

Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, Srini Narayanan
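
A minimal sketch of the described pipeline — frame-to-frame pose motion as the feature and a linear classifier for signing/not-signing — is given below; the keypoint source, feature definition, and toy data are assumptions, not the Public DGS Corpus setup.

```python
# Pose-motion features + logistic regression for frame-level sign-language detection.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pose_motion_features(keypoints):
    """keypoints: (T, K, 2) body keypoints per frame -> (T-1, K) per-joint speeds."""
    flow = np.diff(keypoints, axis=0)          # frame-to-frame displacement
    return np.linalg.norm(flow, axis=-1)       # speed magnitude per joint

# Toy example: 100 frames, 25 keypoints, random signing/not-signing labels.
kp = np.random.randn(100, 25, 2)
X = pose_motion_features(kp)
y = np.random.randint(0, 2, size=len(X))
clf = LogisticRegression(max_iter=1000).fit(X, y)
```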
Exploiting 3D Hand Pose Estimation in Deep Learning-Based Sign Language Recognition from RGB Videos

In this paper, we investigate the benefit of 3D hand skeletal information to the task of sign language (SL) recognition from RGB videos, within a state-of-the-art, multiple-stream, deep-learning recognition system. As most SL datasets are available in traditional RGB-only video lacking depth information, we propose to infer 3D coordinates of the hand joints from RGB data via a powerful architecture that has been primarily introduced in the literature for the task of 3D human pose estimation. We then fuse these estimates with additional SL informative streams, namely 2D skeletal data, as well as convolutional neural network-based hand- and mouth-region representations, and employ an attention-based encoder-decoder for recognition. We evaluate our proposed approach on a corpus of isolated signs of Greek SL and a dataset of continuous finger-spelling in American SL, reporting significant gains by the inclusion of 3D hand pose information, while also outperforming the state-of-the-art on both databases. Further, we evaluate the 3D hand pose estimation technique as standalone.

Maria Parelli, Katerina Papadimitriou, Gerasimos Potamianos, Georgios Pavlakos, Petros Maragos
A Plan for Developing an Auslan Communication Technologies Pipeline

AI techniques for mainstream spoken languages have seen a great deal of progress in recent years, with technologies for transcription, translation and text processing becoming commercially available. However, no such technologies have been developed for sign languages, which, as visual-gestural languages, require multimodal processing approaches. This paper presents a plan to develop an Auslan Communication Technologies Pipeline (Auslan CTP), a prototype AI system enabling Auslan-in, Auslan-out interactions, to demonstrate the feasibility of Auslan-based machine interaction and language processing. Such a system has a range of applications, including gestural human-machine interfaces, educational tools, and translation.

Jessica Korte, Axel Bender, Guy Gallasch, Janet Wiles, Andrew Back
A Multi-modal Machine Learning Approach and Toolkit to Automate Recognition of Early Stages of Dementia Among British Sign Language Users

The ageing population trend is correlated with an increased prevalence of acquired cognitive impairments such as dementia. Although there is no cure for dementia, a timely diagnosis helps in obtaining necessary support and appropriate medication. Researchers are working urgently to develop effective technological tools that can help doctors undertake early identification of cognitive disorders. In particular, screening for dementia in ageing Deaf signers of British Sign Language (BSL) poses additional challenges, as the diagnostic process is bound up with conditions such as the quality and availability of interpreters, as well as appropriate questionnaires and cognitive tests. On the other hand, deep learning based approaches for image and video analysis and understanding are promising, particularly the adoption of Convolutional Neural Networks (CNNs), which require large amounts of training data. In this paper, we demonstrate novelty in the following ways: a) a multi-modal machine learning based automatic recognition toolkit for early stages of dementia among BSL users, in which features from several parts of the body contributing to the sign envelope, e.g., hand-arm movements and facial expressions, are combined; b) universality, in that it is possible to apply our technique to users of any sign language, since it is language independent; c) given the trade-off between complexity and accuracy of machine learning (ML) prediction models, as well as the limited amount of training and testing data available, we show that our approach is not over-fitted and has the potential to scale up.

Xing Liang, Anastassia Angelopoulou, Epaminondas Kapetanios, Bencie Woll, Reda Al Batat, Tyron Woolfe
Score-Level Multi Cue Fusion for Sign Language Recognition

Sign Languages are expressed through hand and upper body gestures as well as facial expressions. Therefore, Sign Language Recognition (SLR) needs to focus on all such cues. Previous work uses hand-crafted mechanisms or network aggregation to extract the different cue features to increase SLR performance. This is slow and involves complicated architectures. We propose a more straightforward approach that focuses on training separate cue models specializing on the dominant hand, hands, face, and upper body regions. We compare the performance of 3D Convolutional Neural Network (CNN) models specializing in these regions, combine them through score-level fusion, and also evaluate a weighted fusion alternative. Our experimental results show the effectiveness of mixed convolutional models. Their fusion yields up to 19% accuracy improvement over the baseline using the full upper body. Furthermore, we include a discussion of fusion settings, which can help future work on Sign Language Translation (SLT).

Çağrı Gökçe, Oğulcan Özdemir, Ahmet Alp Kındıroğlu, Lale Akarun
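
Score-level fusion itself is simple enough to sketch directly; the cue scores and fusion weights below are illustrative placeholders, not the trained 3D CNN outputs from the paper.

```python
# Weighted score-level fusion of cue-specific classifiers (hand, face, upper body, ...).
import numpy as np

def fuse_scores(score_list, weights=None):
    """score_list: list of (num_classes,) softmax score vectors, one per cue model."""
    scores = np.stack(score_list)                      # (num_cues, num_classes)
    if weights is None:
        weights = np.ones(len(score_list)) / len(score_list)
    fused = (weights[:, None] * scores).sum(axis=0)    # weighted average of scores
    return fused.argmax(), fused

# Example with 3 cue models over 5 sign classes.
rng = np.random.default_rng(0)
cues = [rng.dirichlet(np.ones(5)) for _ in range(3)]
pred, fused = fuse_scores(cues, weights=np.array([0.5, 0.3, 0.2]))
```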
Unsupervised Discovery of Sign Terms by K-Nearest Neighbours Approach

In order to utilize the large amount of unlabeled sign language resources, unsupervised learning methods are needed. Motivated by the successful results of unsupervised term discovery (UTD) in spoken languages, here we explore how to apply similar methods for sign terms discovery. Our goal is to find the repeating terms from continuous sign videos without any supervision. Using visual features extracted from RGB videos, we show that a k-nearest neighbours based discovery algorithm designed for speech can also discover sign terms. We also run experiments using a baseline UTD algorithm and comment on their differences.

Korhan Polat, Murat Saraçlar
Improving Keyword Search Performance in Sign Language with Hand Shape Features

Handshapes and human pose estimation are among the most used pretrained features in sign language recognition. In this study, we develop a handshape based keyword search (KWS) system for sign language and compare different pose based and handshape based encoders for the task of large vocabulary sign retrieval. We improved KWS performance in sign language by 3.5% mAP score for gloss search and 1.6% for cross-lingual KWS by combining pose and handshape based KWS models in a late fusion approach.

Nazif Can Tamer, Murat Saraçlar

W16 - Visual Inductive Priors for Data-Efficient Deep Learning

Frontmatter
Lightweight Action Recognition in Compressed Videos

Most existing action recognition models are large convolutional neural networks that work only with raw RGB frames as input. However, practical applications require lightweight models that directly process compressed videos. In this work, for the first time, such a model is developed, which is lightweight enough to run in real-time on embedded AI devices without sacrifices in recognition accuracy. A new Aligned Temporal Trilinear Pooling (ATTP) module is formulated to fuse three modalities in a compressed video. To remedy the weaker motion vectors (compared to optical flow computed from raw RGB streams) for representing dynamic content, we introduce a temporal fusion method to explicitly induce the temporal context, as well as knowledge distillation from a model trained with optical flows via feature alignment. Compared to existing compressed video action recognition models, it is much more compact and faster thanks to adopting a lightweight CNN backbone.

Yuqi Huo, Xiaoli Xu, Yao Lu, Yulei Niu, Mingyu Ding, Zhiwu Lu, Tao Xiang, Ji-rong Wen
On Sparse Connectivity, Adversarial Robustness, and a Novel Model of the Artificial Neuron

In this paper, we propose two closely connected methods to improve computational efficiency and stability against adversarial perturbations on contour recognition tasks: (a) a novel model of an artificial neuron, a "strong neuron," with inherent robustness against adversarial perturbations, and (b) a novel constructive training algorithm that generates sparse networks with O(1) connections per neuron. We achieved an impressive 10x reduction in operations count compared with other sparsification approaches (100x when compared with dense networks). State-of-the-art stability against adversarial perturbations was achieved without any counter-adversarial measures, relying on the robustness of strong neurons alone. Our network extensively uses unsupervised feature detection, with more than 95% of operations being performed in its unsupervised parts. Fewer than 10,000 supervised FLOPs per class are required to recognize a contour (digit or traffic sign), which leads us to conclude that contour recognition is much simpler than was previously thought.

Sergey Bochkanov
Injecting Prior Knowledge into Image Caption Generation

Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them. The state-of-the-art methods in image captioning struggle to approach human-level performance, especially when data is limited. In this paper, we propose to improve the performance of state-of-the-art image captioning models by incorporating two sources of prior knowledge: (i) a conditional latent topic attention, which uses a set of latent variables (topics) as an anchor to generate highly probable words, and (ii) a regularization technique that exploits the inductive biases in the syntactic and semantic structure of captions and improves the generalization of image captioning models. Our experiments validate that our method produces more human-interpretable captions and also leads to significant improvements on the MSCOCO dataset in both the full and low data regimes.

Arushi Goel, Basura Fernando, Thanh-Son Nguyen, Hakan Bilen
Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition

Deep-learning-based video recognition has shown promising improvements along with the development of large-scale datasets and spatiotemporal network architectures. In image recognition, learning spatially invariant features is a key factor in improving recognition performance and robustness. Data augmentation based on visual inductive priors, such as cropping, flipping, rotating, or photometric jittering, is a representative approach to achieve these features. Recent state-of-the-art recognition solutions have relied on modern data augmentation strategies that exploit a mixture of augmentation operations. In this study, we extend these strategies to the temporal dimension for videos to learn temporally invariant or temporally localizable features, covering temporal perturbations or complex actions in videos. Based on our novel temporal data augmentation algorithms, video recognition performance is improved using only a limited amount of training data compared to spatial-only data augmentation algorithms, including on the 1st Visual Inductive Priors (VIPriors) for data-efficient action recognition challenge. Furthermore, the learned features are temporally localizable, which cannot be achieved using spatial augmentation algorithms. Our source code is available at https://github.com/taeoh-kim/temporal_data_augmentation .

Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho, Sangyoun Lee
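
As one concrete example of extending augmentation to the temporal axis, a "temporal CutMix" splices a time segment from one clip into another; this is an assumption-level sketch of the general idea, not necessarily one of the paper's exact operations.

```python
# Temporal CutMix: replace a random time segment of clip_a with the same segment of clip_b.
import torch

def temporal_cutmix(clip_a, clip_b, ratio=0.3):
    """clip_*: (T, C, H, W) video tensors with the same shape."""
    T = clip_a.shape[0]
    seg = max(1, int(T * ratio))
    start = torch.randint(0, T - seg + 1, (1,)).item()
    mixed = clip_a.clone()
    mixed[start:start + seg] = clip_b[start:start + seg]   # splice along time
    lam = 1 - seg / T                                       # label mixing weight
    return mixed, lam

mixed, lam = temporal_cutmix(torch.rand(16, 3, 112, 112), torch.rand(16, 3, 112, 112))
```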
Unsupervised Learning of Video Representations via Dense Trajectory Clustering

This paper addresses the task of unsupervised learning of representations for action recognition in videos. Previous works proposed to utilize future prediction, or other domain-specific objectives, to train a network, but achieved only limited success. In contrast, in the relevant field of image representation learning, simpler, discrimination-based methods have recently bridged the gap to fully-supervised performance. We first propose to adapt two top-performing objectives in this class - instance recognition and local aggregation - to the video domain. In particular, the latter approach iterates between clustering the videos in the feature space of a network and updating it to respect the clusters with a non-parametric classification loss. We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns, grouping the videos based on appearance. To mitigate this issue, we turn to the heuristic-based IDT descriptors, which were manually designed to encode motion patterns in videos. We form the clusters in the IDT space, using these descriptors as an unsupervised prior in the iterative local aggregation algorithm. Our experiments demonstrate that this approach outperforms prior work on the UCF101 and HMDB51 action recognition benchmarks. We also qualitatively analyze the learned representations and show that they successfully capture video dynamics.

Pavel Tokmakov, Martial Hebert, Cordelia Schmid
Distilling Visual Priors from Self-Supervised Learning

Convolutional Neural Networks (CNNs) are prone to overfit small training datasets. We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting. The first phase is to learn a teacher model which possesses rich and generalizable visual representations via self-supervised learning, and the second phase is to distill the representations into a student model in a self-distillation manner, and meanwhile fine-tune the student model for the image classification task. We also propose a novel margin loss for the self-supervised contrastive learning proxy task to better learn the representation under the data-deficient scenario. Together with other tricks, we achieve competitive performance in the VIPriors image classification challenge.

Bingchen Zhao, Xin Wen
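
The margin idea for the contrastive proxy task can be sketched as an InfoNCE-style loss whose positive similarity is reduced by a margin before the softmax; the temperature and margin values below are assumptions, not the paper's settings.

```python
# Margin-modified InfoNCE: the positive pair must beat negatives by at least a margin.
import torch
import torch.nn.functional as F

def margin_info_nce(q, k_pos, k_negs, margin=0.2, temperature=0.07):
    """q, k_pos: (B, D) L2-normalized embeddings; k_negs: (K, D) negatives."""
    pos = (q * k_pos).sum(dim=1, keepdim=True) - margin    # (B, 1)
    neg = q @ k_negs.t()                                    # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)       # positive is index 0
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(4, 128), dim=1)
loss = margin_info_nce(q, F.normalize(torch.randn(4, 128), dim=1),
                       F.normalize(torch.randn(1024, 128), dim=1))
```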
Unsupervised Image Classification for Deep Representation Learning

Deep clustering against self-supervised learning (SSL) is a very important and promising direction for unsupervised visual representation learning, since it requires little domain knowledge to design pretext tasks. However, the key component, embedding clustering, limits its extension to extremely large-scale datasets due to its prerequisite to save the global latent embedding of the entire dataset. In this work, we aim to make this framework simpler and more elegant without performance decline. We propose an unsupervised image classification framework without embedding clustering, which is very similar to the standard supervised training manner. For detailed interpretation, we further analyze its relation to deep clustering and contrastive learning. Extensive experiments on the ImageNet dataset have been conducted to prove the effectiveness of our method. Furthermore, experiments on transfer learning benchmarks have verified its generalization to other downstream tasks, including multi-label image classification, object detection, semantic segmentation and few-shot image classification.

Weijie Chen, Shiliang Pu, Di Xie, Shicai Yang, Yilu Guo, Luojun Lin
TDMPNet: Prototype Network with Recurrent Top-Down Modulation for Robust Object Classification Under Partial Occlusion

Despite deep convolutional neural networks’ great success in object classification, recent work has shown that they suffer from a severe generalization performance drop under occlusion conditions that do not appear in the training data. Due to the large variability of occluders in terms of shape and appearance, training data can hardly cover all possible occlusion conditions. However, in practice we expect models to reliably generalize to various novel occlusion conditions, rather than being limited to the training conditions. In this work, we integrate inductive priors including prototypes, partial matching and top-down modulation into deep neural networks to realize robust object classification under novel occlusion conditions, with limited occlusion in training data. We first introduce prototype learning as its regularization encourages compact data clusters for better generalization ability. Then, a visibility map at the intermediate layer based on feature dictionary and activation scale is estimated for partial matching, whose prior sifts irrelevant information out when comparing features with prototypes. Further, inspired by the important role of feedback connection in neuroscience for object recognition under occlusion, a structural prior, i.e. top-down modulation, is introduced into convolution layers, purposefully reducing the contamination by occlusion during feature extraction. Experiment results on partially occluded MNIST, vehicles from the PASCAL3D+ dataset, and vehicles from the cropped COCO dataset demonstrate the improvement under both simulated and real-world novel occlusion conditions, as well as under the transfer of datasets.

Mingqing Xiao, Adam Kortylewski, Ruihai Wu, Siyuan Qiao, Wei Shen, Alan Yuille
What Leads to Generalization of Object Proposals?

Object proposal generation is often the first step in many detection models. It is lucrative to train a good proposal model that generalizes to unseen classes. Motivated by this, we study how a detection model trained on a small set of source classes can provide proposals that generalize to unseen classes. We systematically study the properties of the dataset – visual diversity and label space granularity – required for good generalization. We show the trade-off between using fine-grained labels and coarse labels. We introduce the idea of prototypical classes: a set of sufficient and necessary classes required to train a detection model to obtain generalized proposals in a more data-efficient way. On the Open Images V4 dataset, we show that only 25% of the classes can be selected to form such a prototypical set. The resulting proposals from a model trained with these classes are only 4.3% worse than using all the classes, in terms of average recall (AR). We also demonstrate that the Faster R-CNN model leads to better generalization of proposals compared to a single-stage network like RetinaNet.

Rui Wang, Dhruv Mahajan, Vignesh Ramanathan
A Self-supervised Framework for Human Instance Segmentation

Existing approaches for human-centered tasks such as human instance segmentation are focused on improving the architectures of models, leveraging weak supervision or transforming supervision among related tasks. Nonetheless, the structures are highly specific and the weak supervision is limited by available priors or number of related tasks. In this paper, we present a novel self-supervised framework for human instance segmentation. The framework includes one module which iteratively conducts mutual refinement between segmentation and optical flow estimation, and the other module which iteratively refines pose estimations by exploring the prior knowledge about the consistency in human graph structures from consecutive frames. The results of the proposed framework are employed for fine-tuning segmentation networks in a feedback fashion. Experimental results on the OCHuman and COCOPersons datasets demonstrate that the self-supervised framework achieves current state-of-the-art performance against existing models on the challenging datasets without requiring additional labels. Unlabeled video data is utilized together with prior knowledge to significantly improve performance and reduce the reliance on annotations. Code released at: https://github.com/AllenYLJiang/SSINS.

Yalong Jiang, Wenrui Ding, Hongguang Li, Hua Yang, Xu Wang
Multiple Interaction Learning with Question-Type Prior Knowledge for Constraining Answer Search Space in Visual Question Answering

Different approaches have been proposed for Visual Question Answering (VQA). However, few works examine how different joint-modality methods behave with respect to question-type prior knowledge extracted from the data, which constrains the answer search space and provides a reliable cue for reasoning about the answers to questions asked about input images. In this paper, we propose a novel VQA model that utilizes question-type prior information to improve VQA by leveraging the multiple interactions between different joint-modality methods, based on their behaviors when answering questions of different types. Solid experiments on two benchmark datasets, i.e., VQA 2.0 and TDIUC, indicate that the proposed method yields the best performance among the most competitive approaches.

Tuong Do, Binh X. Nguyen, Huy Tran, Erman Tjiputra, Quang D. Tran, Thanh-Toan Do
A Visual Inductive Priors Framework for Data-Efficient Image Classification

State-of-the-art classifiers rely heavily on large-scale datasets, such as ImageNet, JFT-300M, MSCOCO, Open Images, etc. Moreover, performance may decrease significantly because of insufficient learning from a handful of samples. We present the Visual Inductive Priors Framework (VIPF), a framework that can learn classifiers from scratch. VIPF can maximize the effectiveness of limited data. In this work, we propose a novel neural network architecture, DSK-net, which is very effective when training from small datasets. With the more discriminative features extracted by DSK-net, overfitting of the network is alleviated. Furthermore, a loss function based on the positive class, as well as an induced hierarchy, is also applied to further improve VIPF's capability of learning from scratch. Finally, we won first place in the VIPriors image classification competition.

Pengfei Sun, Xuan Jin, Wei Su, Yuan He, Hui Xue, Quan Lu

W18 - 3D Poses In the Wild Challenge

Frontmatter
Predicting Camera Viewpoint Improves Cross-Dataset Generalization for 3D Human Pose Estimation

Monocular estimation of 3D human pose has attracted increased attention with the availability of large ground-truth motion capture datasets. However, the diversity of training data available is limited and it is not clear to what extent methods generalize outside the specific datasets they are trained on. In this work we carry out a systematic study of the diversity and biases present in specific datasets and their effect on cross-dataset generalization across a compendium of 5 pose datasets. We specifically focus on systematic differences in the distribution of camera viewpoints relative to a body-centered coordinate frame. Based on this observation, we propose an auxiliary task of predicting the camera viewpoint in addition to pose. We find that models trained to jointly predict viewpoint and pose systematically show significantly improved cross-dataset generalization.

Zhe Wang, Daeyun Shin, Charless C. Fowlkes
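
The auxiliary-task setup — one shared backbone with a pose head and a viewpoint head trained with a weighted sum of losses — can be sketched as follows; the toy lifting backbone, head sizes, and loss weight are illustrative assumptions, not the paper's architecture.

```python
# Shared backbone with two heads: 3D pose regression plus camera-viewpoint prediction.
import torch
import torch.nn as nn

class PoseWithViewpoint(nn.Module):
    def __init__(self, feat_dim=2048, n_joints=17):
        super().__init__()
        # Toy lifting network standing in for the image backbone.
        self.backbone = nn.Sequential(nn.Linear(n_joints * 2, feat_dim), nn.ReLU())
        self.pose_head = nn.Linear(feat_dim, n_joints * 3)   # 3D joints
        self.view_head = nn.Linear(feat_dim, 3)               # camera viewpoint (e.g. angles)

    def forward(self, joints2d):
        f = self.backbone(joints2d.flatten(1))
        return self.pose_head(f), self.view_head(f)

def multitask_loss(pose_pred, pose_gt, view_pred, view_gt, w_view=0.1):
    return nn.functional.mse_loss(pose_pred, pose_gt) + \
           w_view * nn.functional.mse_loss(view_pred, view_gt)

model = PoseWithViewpoint()
pose, view = model(torch.randn(8, 17, 2))
```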
Beyond Weak Perspective for Monocular 3D Human Pose Estimation

We consider the task of predicting 3D joint locations and orientations from a monocular video with the skinned multi-person linear (SMPL) model. We first infer 2D joint locations with an off-the-shelf pose estimation algorithm. We use the SPIN algorithm to estimate initial predictions of body pose, shape and camera parameters from a deep regression neural network. We then adhere to the SMPLify algorithm, which receives those initial parameters and optimizes them so that the 3D joints inferred from the SMPL model fit the 2D joint locations. This algorithm involves a projection step of 3D joints to the 2D image plane. The conventional approach is to follow weak perspective assumptions, which use an ad-hoc focal length. Through experimentation on the 3D Poses in the Wild (3DPW) dataset, we show that using a full perspective projection, with the correct camera center and an approximated focal length, provides favorable results. Our algorithm resulted in a winning entry for the 3DPW Challenge, reaching first place in joints orientation accuracy.

Imry Kissos, Lior Fritz, Matan Goldman, Omer Meir, Eduard Oks, Mark Kliger
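
The contrast between weak-perspective and full-perspective projection of 3D joints can be written down directly; the focal length, camera center, and joint depths below are placeholder assumptions, not values from the paper.

```python
# Weak-perspective vs. full-perspective (pinhole) projection of 3D joints to the image.
import numpy as np

def weak_perspective(joints3d, scale, trans2d):
    """Ignore per-joint depth: uniform scale plus 2D translation."""
    return scale * joints3d[:, :2] + trans2d

def full_perspective(joints3d, focal, center, cam_t):
    """Pinhole projection with an approximated focal length and the true image center."""
    pts = joints3d + cam_t                       # move into camera coordinates
    return focal * pts[:, :2] / pts[:, 2:3] + center

joints = np.random.randn(24, 3) + np.array([0, 0, 5.0])   # 24 SMPL joints ~5 m away
uv = full_perspective(joints, focal=1500.0, center=np.array([960.0, 540.0]),
                      cam_t=np.zeros(3))
```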

W20 - Map-based Localization for Autonomous Driving

Frontmatter
Geographically Local Representation Learning with a Spatial Prior for Visual Localization

We revisit end-to-end representation learning for cross-view self-localization, the task of retrieving, for a query camera image, the closest satellite image in a database by matching them in a shared image representation space. Previous work tackles this task as a global localization problem, i.e. assuming no prior knowledge of the location, so the learned image representation must distinguish far-apart areas of the map. However, in many practical applications such as self-driving vehicles, it is already possible to discard distant locations through well-known localization techniques using temporal filters and GNSS/GPS sensors. We argue that learned features should therefore be optimized to be discriminative within the geographic local neighborhood, instead of globally. We propose a simple but effective adaptation to the common triplet loss used in previous work that considers a prior localization estimate already in the training phase. We evaluate our approach on the existing CVACT dataset, and on a novel localization benchmark based on the Oxford RobotCar dataset which tests generalization across multiple traversals and days in the same area. For the Oxford benchmarks we collected corresponding satellite images. With a localization prior, our approach improves recall@1 by 9 percentage points on CVACT, and reduces the median localization error by 2.45 m on the Oxford benchmark, compared to a state-of-the-art baseline approach. Qualitative results underscore that with our approach the network indeed captures different aspects of the local surroundings compared to the global baseline.

Zimin Xia, Olaf Booij, Marco Manfredi, Julian F. P. Kooij
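
A minimal sketch of the adaptation described above — restricting triplet-loss negatives to the geographic neighborhood given by a localization prior — is shown below; the embedding dimensions, distance threshold, and margin are assumptions, not the paper's exact loss.

```python
# Triplet loss whose negatives are limited to satellite images near the prior location.
import torch
import torch.nn.functional as F

def local_triplet_loss(q, pos, negs, neg_locations, query_location,
                       prior_radius=100.0, margin=0.3):
    """q, pos: (D,) query / matching satellite embeddings; negs: (N, D) candidate
    negatives; neg_locations: (N, 2) metric coordinates; query_location: (2,)."""
    dists = torch.norm(neg_locations - query_location, dim=1)
    local = negs[dists < prior_radius]           # only nearby negatives matter
    if local.numel() == 0:
        return q.new_zeros(())
    d_pos = 1 - F.cosine_similarity(q, pos, dim=0)
    d_neg = 1 - F.cosine_similarity(q.unsqueeze(0), local, dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

loss = local_triplet_loss(torch.randn(128), torch.randn(128), torch.randn(32, 128),
                          torch.randn(32, 2) * 500, torch.zeros(2))
```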

W22 - Recovering 6D Object Pose

Frontmatter
BOP Challenge 2020 on 6D Object Localization

This paper presents the evaluation methodology, datasets, and results of the BOP Challenge 2020, the third in a series of public competitions organized with the goal of capturing the status quo in the field of 6D object pose estimation from an RGB-D image. In 2020, to reduce the domain gap between synthetic training and real test RGB images, the participants were provided with 350K photorealistic training images generated by BlenderProc4BOP, a new open-source and lightweight physically-based renderer (PBR) and procedural data generator. Methods based on deep neural networks have finally caught up with methods based on point pair features, which were dominating previous editions of the challenge. Although the top-performing methods rely on RGB-D image channels, strong results were achieved when only RGB channels were used at both training and test time – out of the 26 evaluated methods, the third was trained on RGB channels of PBR and real images, while the fifth was trained on RGB channels of PBR images only. Strong data augmentation was identified as a key component of the top-performing CosyPose method, and the photorealism of PBR images was demonstrated effective despite the augmentation. The online evaluation system stays open and is available on the project website: bop.felk.cvut.cz .

Tomáš Hodaň, Martin Sundermeyer, Bertram Drost, Yann Labbé, Eric Brachmann, Frank Michel, Carsten Rother, Jiří Matas
StructureFromGAN: Single Image 3D Model Reconstruction and Photorealistic Texturing

We present a generative adversarial model for single-photo 3D reconstruction and high-resolution texturing. Our framework leverages a neural renderer and a 3D Morphable Model of an object. We train our generator on the semantic labelling-to-image translation task. This allows our model to learn rich priors about object appearance and perform all-around texture and shape reconstruction from a single image. Our new generator architecture leverages the power of the StyleGAN2 model for image-to-image translation with fine texture detail at 1024 × 1024 resolution. We evaluate our framework quantitatively and qualitatively on the Florence Face and Apollo Cars datasets on the tasks of car 3D reconstruction and texturing. Extensive experiments demonstrate that our framework achieves and surpasses the state of the art in single-photo 3D object reconstruction and texturing using 3D morphable models. We made our code publicly available ( http://www.zefirus.org/StructureFromGAN ).

Vladimir V. Kniaz, Vladimir A. Knyaz, Vladimir Mizginov, Mark Kozyrev, Petr Moshkantsev
6 DoF Pose Estimation of Textureless Objects from Multiple RGB Frames

This paper addresses the problems of object detection and 6 DoF pose estimation from a sequence of RGB images. Our deep learning-based approach uses only synthetic non-textured 3D CAD models for training and has no access to images from the target domain. The image sequence is used to obtain a sparse 3D reconstruction of the scene via Structure from Motion. The domain gap is closed by relying on the intuition that geometric edges are the only prominent features that can be extracted from both the 3D models and the sparse reconstructions. Based on this assumption, we have developed a domain-invariant data preparation scheme and 3DKeypointNet, a neural network for detecting 3D keypoints in sparse and noisy point clouds. The final pose is estimated with RANSAC and a scale-aware point cloud alignment method. The proposed method has been tested on the T-LESS dataset and compared to methods also trained on synthetic data. The results indicate the potential of our method, despite the fact that the entire pipeline is trained solely on synthetic data.

Roman Kaskman, Ivan Shugurov, Sergey Zakharov, Slobodan Ilic
Semi-supervised Viewpoint Estimation with Geometry-Aware Conditional Generation

There is a growing interest in developing computer vision methods that can learn from limited supervision. In this paper, we consider the problem of learning to predict camera viewpoints, where obtaining ground-truth annotations is expensive and requires special equipment, from a limited number of labeled images. We propose a semi-supervised viewpoint estimation method that can learn to infer viewpoint information from unlabeled image pairs, where two images differ by a viewpoint change. In particular, our method learns to synthesize the second image by combining the appearance from the first one and the viewpoint from the second one. We demonstrate that our method significantly improves over supervised techniques, especially in the low-label regime, and outperforms the state-of-the-art semi-supervised methods.

Octave Mariotti, Hakan Bilen
Physical Plausibility of 6D Pose Estimates in Scenes of Static Rigid Objects

To enable robots to reason about manipulation of objects and AR applications to present augmented scenes to human users, accurate scene explanations based on objects and their 6D pose are required. With the pose-error functions commonly used to evaluate 6D object pose estimation approaches, the accuracy of estimates is measured by surface alignment of a target object under the estimated and true pose. However, an object floating above the ground may yield the same error as an object translated on the ground by the same magnitude. We argue that, to be intelligible for human observers, pose estimates additionally need to adhere to physical principles. To this end, we provide a definition of physical plausibility in scenes of static rigid objects, derive novel pose-error functions and compare them to existing evaluation approaches in 6D object pose estimation. Code to compute the presented pose-error functions is publicly available at github.com/dornik/plausible-poses .

Dominik Bauer, Timothy Patten, Markus Vincze
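
One simple plausibility criterion of the kind discussed above can be illustrated as a support check: under the estimated pose, the object should neither float above nor penetrate the supporting plane. The plane model and the signed-gap formulation below are illustrative assumptions, not the paper's pose-error functions.

```python
# Signed gap between the lowest transformed model point and a horizontal ground plane.
import numpy as np

def support_violation(model_points, R, t, ground_z=0.0):
    """model_points: (N, 3) object model; R (3, 3), t (3,): estimated pose.
    Returns the signed gap to the ground plane (0 = resting)."""
    transformed = model_points @ R.T + t
    return transformed[:, 2].min() - ground_z

gap = support_violation(np.random.rand(500, 3), np.eye(3), np.array([0, 0, 0.05]))
# gap > 0: the object floats; gap < 0: it penetrates the ground.
```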
DronePose: Photorealistic UAV-Assistant Dataset Synthesis for 3D Pose Estimation via a Smooth Silhouette Loss

In this work we consider UAVs as cooperative agents supporting human users in their operations. In this context, the 3D localisation of the UAV assistant is an important task that can facilitate the exchange of spatial information between the user and the UAV. To address this in a data-driven manner, we design a data synthesis pipeline to create a realistic multimodal dataset that includes both the exocentric user view, and the egocentric UAV view. We then exploit the joint availability of photorealistic and synthesized inputs to train a single-shot monocular pose estimation model. During training we leverage differentiable rendering to supplement a state-of-the-art direct regression objective with a novel smooth silhouette loss. Our results demonstrate its qualitative and quantitative performance gains over traditional silhouette objectives. Our data and code are available at https://vcl3d.github.io/DronePose .

Georgios Albanis, Nikolaos Zioulis, Anastasios Dimou, Dimitrios Zarpalas, Petros Daras
How to Track Your Dragon: A Multi-attentional Framework for Real-Time RGB-D 6-DOF Object Pose Tracking

We present a novel multi-attentional convolutional architecture to tackle the problem of real-time RGB-D 6D object pose tracking of single, known objects. Such a problem poses multiple challenges originating both from the objects' nature and their interaction with their environment, which previous approaches have failed to fully address. The proposed framework encapsulates methods for background clutter and occlusion handling by integrating multiple parallel soft spatial attention modules into a multitask Convolutional Neural Network (CNN) architecture. Moreover, we consider the special geometrical properties of both the object's 3D model and the pose space, and we use a more sophisticated approach for data augmentation during training. The provided experimental results confirm the effectiveness of the proposed multi-attentional architecture, as it improves the state-of-the-art (SoA) tracking performance by an average score of 34.03% for translation and 40.01% for rotation, when tested on the most complete dataset designed to date for the problem of RGB-D object tracking. Code will be available at: https://github.com/ismarou/How_to_track_your_Dragon .

Isidoros Marougkas, Petros Koutras, Nikos Kardaris, Georgios Retsinas, Georgia Chalvatzaki, Petros Maragos
A Hybrid Approach for 6DoF Pose Estimation

We propose a method for 6DoF pose estimation of rigid objects that uses a state-of-the-art deep learning based instance detector to segment object instances in an RGB image, followed by a point-pair based voting method to recover the object’s pose. We additionally use an automatic method selection that chooses the instance detector and the training set as that with the highest performance on the validation set. This hybrid approach leverages the best of learning and classic approaches, using CNNs to filter highly unstructured data and cut through the clutter, and a local geometric approach with proven convergence for robust pose estimation. The method is evaluated on the BOP core datasets where it significantly exceeds the baseline method and is the best fast method in the BOP 2020 Challenge.

Rebecca König, Bertram Drost
Leaping from 2D Detection to Efficient 6DoF Object Pose Estimation

Estimating 6DoF object poses from single RGB images is very challenging due to severe occlusions and the large search space of camera poses. Keypoint voting based methods have demonstrated their effectiveness and superiority in predicting object poses. However, those approaches are often affected by inaccurate semantic segmentation when computing the keypoint locations. To enable our model to focus on local regions without being distracted by backgrounds, we first localize object regions with a 2D object detector. In doing so, we not only reduce the search space of keypoints but also improve the robustness of the pose estimation. Moreover, since symmetric objects may suffer ambiguity along the symmetric dimension, we propose to select keypoints at geometrically symmetric locations to resolve the ambiguity. The extensive experimental results on seven different datasets of the BOP challenge benchmark demonstrate that our method outperforms the state of the art and achieved third place in the BOP challenge.

Jinhui Liu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, Errui Ding, Feng Xu, Xin Yu

W23 - SHApe Recovery from Partial Textured 3D Scans

Frontmatter
Implicit Feature Networks for Texture Completion from Partial 3D Data

Prior works that infer 3D texture use either texture atlases, which require uv-mappings and hence have discontinuities, or colored voxels, which are memory-inefficient and limited in resolution. Recent work predicts RGB color at every XYZ coordinate, forming a texture field, but focuses on completing texture given a single 2D image. Instead, we focus on 3D texture and geometry completion from partial and incomplete 3D scans. IF-Nets [2] have recently achieved state-of-the-art results on 3D geometry completion using a multi-scale deep feature encoding, but the outputs lack texture. In this work, we generalize IF-Nets to texture completion from partial textured scans of humans and arbitrary objects. Our key insight is that 3D texture completion benefits from incorporating local and global deep features extracted from both the 3D partial texture and the completed geometry. Specifically, given the partial 3D texture and the 3D geometry completed with IF-Nets, our model successfully in-paints the missing texture parts consistently with the completed geometry. Our model won the SHARP ECCV'20 challenge, achieving the highest performance on all challenges.

Julian Chibane, Gerard Pons-Moll
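
The texture-field idea — predicting an RGB value at a continuous 3D query point from conditioning features — can be sketched as a small MLP; the network sizes and the per-point features below are assumptions (IF-Nets instead extract multi-scale grid features from the input).

```python
# Texture field sketch: an MLP maps (xyz, feature) pairs to RGB values in [0, 1].
import torch
import torch.nn as nn

class TextureField(nn.Module):
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())      # RGB in [0, 1]

    def forward(self, xyz, feat):
        """xyz: (B, N, 3) query points; feat: (B, N, feat_dim) conditioning features."""
        return self.mlp(torch.cat([xyz, feat], dim=-1))

# Query 1024 surface points with toy conditioning features.
rgb = TextureField()(torch.rand(1, 1024, 3), torch.rand(1, 1024, 256))
```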
3DBooSTeR: 3D Body Shape and Texture Recovery

We propose 3DBooSTeR, a novel method to recover a textured 3D body mesh from a textured partial 3D scan. With the advent of virtual and augmented reality, there is a demand for creating realistic and high-fidelity digital 3D human representations. However, 3D scanning systems can only capture the 3D human body shape up to some level of defects due to its complexity, including occlusion between body parts, varying levels of details, shape deformations and the articulated skeleton. Textured 3D mesh completion is thus important to enhance 3D acquisitions. The proposed approach decouples the shape and texture completion into two sequential tasks. The shape is recovered by an encoder-decoder network deforming a template body mesh. The texture is subsequently obtained by projecting the partial texture onto the template mesh before inpainting the corresponding texture map with a novel approach. The approach is validated on the 3DBodyTex.v2 dataset.

Alexandre Saint, Anis Kacem, Kseniya Cherenkova, Djamila Aouada
SHARP 2020: The 1st Shape Recovery from Partial Textured 3D Scans Challenge Results

The SHApe Recovery from Partial textured 3D scans challenge, SHARP 2020, is the first edition of a challenge fostering and benchmarking methods for recovering complete textured 3D scans from raw incomplete data. SHARP 2020 is organised as a workshop in conjunction with ECCV 2020. There are two complementary challenges, the first one on 3D human scans, and the second one on generic objects. Challenge 1 is further split into two tracks, focusing, first, on large body and clothing regions, and, second, on fine body details. A novel evaluation metric is proposed to quantify jointly the shape reconstruction, the texture reconstruction and the amount of completed data. Additionally, two unique datasets of 3D scans are proposed, to provide raw ground-truth data for the benchmarks. The datasets are released to the scientific community. Moreover, an accompanying custom library of software routines is also released to the scientific community. It allows for processing 3D scans, generating partial data and performing the evaluation. Results of the competition, analysed in comparison to baselines, show the validity of the proposed evaluation metrics, and highlight the challenging aspects of the task and of the datasets. Details on the SHARP 2020 challenge can be found at https://cvi2.uni.lu/sharp2020/ .

Alexandre Saint, Anis Kacem, Kseniya Cherenkova, Konstantinos Papadopoulos, Julian Chibane, Gerard Pons-Moll, Gleb Gusev, David Fofi, Djamila Aouada, Björn Ottersten
Backmatter
Metadata
Title
Computer Vision – ECCV 2020 Workshops
Edited by
Prof. Adrien Bartoli
Andrea Fusiello
Copyright Year
2020
Electronic ISBN
978-3-030-66096-3
Print ISBN
978-3-030-66095-6
DOI
https://doi.org/10.1007/978-3-030-66096-3