
2020 | Book

Computer Vision – ECCV 2020 Workshops

Glasgow, UK, August 23–28, 2020, Proceedings, Part I


About this book

The 5-volume set, comprising LNCS volumes 12535 to 12540, constitutes the refereed proceedings of 28 out of the 45 workshops held at the 16th European Conference on Computer Vision, ECCV 2020. The conference was planned to take place in Glasgow, UK, during August 23-28, 2020, but changed to a virtual format due to the COVID-19 pandemic.

The 249 full papers, 18 short papers, and 21 further contributions included in the workshop proceedings were carefully reviewed and selected from a total of 467 submissions. The papers deal with diverse computer vision topics.

Part I focuses on adversarial robustness in the real world; bioimage computation; egocentric perception, interaction and computing; eye gaze in VR, AR, and in the wild; the TASK-CV workshop and VisDA challenge; and bodily expressed emotion understanding.

Table of Contents

Frontmatter

W01 - Adversarial Robustness in the Real World

Frontmatter
A Deep Dive into Adversarial Robustness in Zero-Shot Learning

Machine learning (ML) systems have introduced significant advances in various fields, due to the introduction of highly complex models. Despite their success, it has been shown multiple times that machine learning models are prone to imperceptible perturbations that can severely degrade their accuracy. So far, existing studies have primarily focused on models where supervision across all classes was available. In contrast, Zero-shot Learning (ZSL) and Generalized Zero-shot Learning (GZSL) tasks inherently lack supervision across all classes. In this paper, we present a study aimed at evaluating the adversarial robustness of ZSL and GZSL models. We leverage the well-established label embedding model and subject it to a set of established adversarial attacks and defenses across multiple datasets. In addition to creating possibly the first benchmark on adversarial robustness of ZSL models, we also present analyses on important points that require attention for better interpretation of ZSL robustness results. We hope these points, along with the benchmark, will help researchers establish a better understanding of what challenges lie ahead and help guide their work.

Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Pinar Duygulu
Towards Analyzing Semantic Robustness of Deep Neural Networks

Despite the impressive performance of Deep Neural Networks (DNNs) on various vision tasks, they still exhibit erroneously high sensitivity to semantic primitives (e.g. object pose). We propose a theoretically grounded analysis for DNN robustness in the semantic space. We qualitatively analyze different DNNs’ semantic robustness by visualizing the DNN global behavior as semantic maps and observe interesting behavior of some DNNs. Since generating these semantic maps does not scale well with the dimensionality of the semantic space, we develop a bottom-up approach to detect robust regions of DNNs. To achieve this, we formalize the problem of finding robust semantic regions of the network as optimizing integral bounds and we develop expressions for update directions of the region bounds. We use our developed formulations to quantitatively evaluate the semantic robustness of different popular network architectures. We show through extensive experimentation that several networks, while trained on the same dataset and enjoying comparable accuracy, do not necessarily perform similarly in semantic robustness. For example, InceptionV3 is more accurate despite being less semantically robust than ResNet50. We hope that this tool will serve as a milestone towards understanding the semantic robustness of DNNs.

Abdullah Hamdi, Bernard Ghanem
Likelihood Landscapes: A Unifying Principle Behind Many Adversarial Defenses

Convolutional Neural Networks have been shown to be vulnerable to adversarial examples, which are known to lie in subspaces close to where normal data lies but are not naturally occurring and have low probability. In this work, we investigate the potential effect defense techniques have on the geometry of the likelihood landscape, i.e., the likelihood of the input images under the trained model. We first propose a way to visualize the likelihood landscape leveraging an energy-based model interpretation of discriminative classifiers. Then we introduce a measure to quantify the flatness of the likelihood landscape. We observe that a subset of adversarial defense techniques results in a similar effect of flattening the likelihood landscape. We further explore directly regularizing towards a flat landscape for adversarial robustness.

Fu Lin, Rohit Mittapalli, Prithvijit Chattopadhyay, Daniel Bolya, Judy Hoffman
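
A minimal sketch of the energy-based reading of a classifier that the abstract above builds on: up to an unknown normalising constant, log p(x) is taken as the logsumexp over the logits, and flatness can be probed by evaluating this proxy along a 1-D slice through input space. The model, the probing direction, and the radii are placeholders, not the authors' exact setup.

```python
# Hedged sketch (assumed PyTorch model): logsumexp of logits as an unnormalised
# log-likelihood proxy under the energy-based interpretation of a classifier.
import torch

def log_likelihood_proxy(model, x):
    """Return logsumexp over class logits, ~ unnormalised log p(x)."""
    logits = model(x)
    return torch.logsumexp(logits, dim=1)

def likelihood_profile(model, x, direction, radii):
    """Evaluate the proxy along the 1-D slice x + r * direction to visualise flatness."""
    with torch.no_grad():
        return torch.stack([log_likelihood_proxy(model, x + r * direction).mean()
                            for r in radii])
```
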
Deep k-NN Defense Against Clean-Label Data Poisoning Attacks

Targeted clean-label data poisoning is a type of adversarial attack on machine learning systems in which an adversary injects a few correctly-labeled, minimally-perturbed samples into the training data, causing a model to misclassify a particular test sample during inference. Although defenses have been proposed for general poisoning attacks, no reliable defense for clean-label attacks has been demonstrated, despite the attacks’ effectiveness and realistic applications. In this work, we propose a simple, yet highly-effective Deep k-NN defense against both feature collision and convex polytope clean-label attacks on the CIFAR-10 dataset. We demonstrate that our proposed strategy is able to detect over 99% of poisoned examples in both attacks and remove them without compromising model performance. Additionally, through ablation studies, we discover simple guidelines for selecting the value of k as well as for implementing the Deep k-NN defense on real-world datasets with class imbalance. Our proposed defense shows that current clean-label poisoning attack strategies can be annulled, and serves as a strong yet simple-to-implement baseline defense to test future clean-label poisoning attacks. Our code is available on GitHub.

Neehar Peri, Neal Gupta, W. Ronny Huang, Liam Fowl, Chen Zhu, Soheil Feizi, Tom Goldstein, John P. Dickerson
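
A minimal sketch of the deep k-NN filtering idea described in the abstract above: for every training point, inspect its k nearest neighbours in the feature space of a trained network and drop the point if its label disagrees with the neighbourhood majority. The feature extractor, dataset, and value of k are placeholders for illustration.

```python
# Hedged sketch of a deep k-NN style poison filter (not the authors' exact code).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def deep_knn_filter(features, labels, k=50):
    """features: (N, D) penultimate-layer activations; labels: (N,) int class labels.
    Returns a boolean mask of samples whose label matches the k-NN majority vote."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features)           # idx[:, 0] is the point itself
    keep = np.zeros(len(labels), dtype=bool)
    for i, neighbours in enumerate(idx[:, 1:]):
        votes = np.bincount(labels[neighbours])
        keep[i] = votes.argmax() == labels[i]  # drop points outvoted by their neighbours
    return keep

# Usage: mask = deep_knn_filter(features, labels, k=50); retrain only on the kept samples.
```
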
Ramifications of Approximate Posterior Inference for Bayesian Deep Learning in Adversarial and Out-of-Distribution Settings

Deep neural networks have been successful in diverse discriminative classification tasks, although they are often poorly calibrated, assigning high probability to misclassified predictions. This has consequences for the trustworthiness and accountability of such models when deployed in real applications, where predictions are evaluated based on their confidence scores. Existing solutions suggest the benefits attained by combining deep neural networks and Bayesian inference to quantify uncertainty over the models’ predictions for ambiguous data points. In this work we propose to validate and test the efficacy of likelihood-based models in the task of out-of-distribution (OoD) detection. Across different datasets and metrics we show that Bayesian deep learning models indeed outperform conventional neural networks, but in the event of minimal overlap between in/out distribution classes, even the best models exhibit a reduction in AUC scores in detecting OoD data. We hypothesise that the sensitivity of neural networks to unseen inputs could be a multi-factor phenomenon arising from the different architectural design choices, often amplified by the curse of dimensionality. Preliminary investigations indicate the potential inherent role of bias due to choices of initialisation, architecture or activation functions. Furthermore, we perform an analysis on the effect of adversarial noise resistance methods on in- and out-of-distribution performance when combined with Bayesian deep learners.

John Mitros, Arjun Pakrashi, Brian Mac Namee
Adversarial Shape Perturbations on 3D Point Clouds

The importance of training robust neural networks grows as 3D data is increasingly utilized in deep learning for vision tasks in robotics, drone control, and autonomous driving. One commonly used 3D data type is 3D point clouds, which describe shape information. We examine the problem of creating robust models from the perspective of the attacker, which is necessary in understanding how neural networks can be exploited. We explore two categories of attacks: distributional attacks that involve imperceptible perturbations to the distribution of points, and shape attacks that involve deforming the shape represented by a point cloud. We present three possible shape attacks on 3D point cloud classification and show that some of them remain effective even against preprocessing steps, such as the previously proposed point-removal defenses. (Source code available at https://github.com/Daniel-Liu-c0deb0t/Adversarial-point-perturbations-on-3D-objects).

Daniel Liu, Ronald Yu, Hao Su
Jacks of All Trades, Masters of None: Addressing Distributional Shift and Obtrusiveness via Transparent Patch Attacks

We focus on the development of effective adversarial patch attacks and – for the first time – jointly address the antagonistic objectives of attack success and obtrusiveness via the design of novel semi-transparent patches. This work is motivated by our pursuit of a systematic performance analysis of patch attack robustness with regard to geometric transformations. Specifically, we first elucidate a) key factors underpinning patch attack success and b) the impact of distributional shift between training and testing/deployment when cast under the Expectation over Transformation (EoT) formalism. By focusing our analysis on three principal classes of transformations (rotation, scale, and location), our findings provide quantifiable insights into the design of effective patch attacks and demonstrate that scale, among all factors, significantly impacts patch attack success. Working from these findings, we then focus on addressing how to overcome the principal limitations of scale for the deployment of attacks in real physical settings: namely the obtrusiveness of large patches. Our strategy is to turn to the novel design of irregularly-shaped, semi-transparent partial patches which we construct via a new optimization process that jointly addresses the antagonistic goals of mitigating obtrusiveness and maximizing effectiveness. Our study – we hope – will help encourage more focus in the community on the issues of obtrusiveness, scale, and success in patch attacks.

Neil Fendley, Max Lennon, I-Jeng Wang, Philippe Burlina, Nathan Drenkow
Evaluating Input Perturbation Methods for Interpreting CNNs and Saliency Map Comparison

Input perturbation methods occlude parts of an input to a function and measure the change in the function’s output. Recently, input perturbation methods have been applied to generate and evaluate saliency maps from convolutional neural networks. In practice, neutral baseline images are used for the occlusion, such that the baseline image’s impact on the classification probability is minimal. However, in this paper we show that arguably neutral baseline images still impact the generated saliency maps and their evaluation with input perturbations. We also demonstrate that many choices of hyperparameters lead to the divergence of saliency maps generated by input perturbations. We experimentally reveal inconsistencies among a selection of input perturbation methods and find that they lack robustness for generating saliency maps and for evaluating saliency maps as saliency metrics.

Lukas Brunke, Prateek Agrawal, Nikhil George
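
A minimal sketch of the occlusion-style input perturbation saliency discussed in the abstract above: slide a baseline patch over the image and record the drop in the target-class probability. The baseline value, patch size, and stride are illustrative hyper-parameters; the paper's point is precisely that such choices strongly affect the resulting map.

```python
# Hedged sketch of an occlusion-based saliency map (assumed predict_proba interface).
import numpy as np

def occlusion_saliency(predict_proba, image, target_class, patch=8, stride=4, baseline=0.0):
    """predict_proba: callable mapping an HxWxC image to a vector of class probabilities."""
    h, w = image.shape[:2]
    base_score = predict_proba(image)[target_class]
    saliency = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = baseline   # replace region with the baseline
            drop = base_score - predict_proba(occluded)[target_class]
            saliency[y:y + patch, x:x + patch] += drop
            counts[y:y + patch, x:x + patch] += 1
    return saliency / np.maximum(counts, 1)
```
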
Adversarial Robustness of Open-Set Recognition: Face Recognition and Person Re-identification

Recent studies show that DNNs are vulnerable to adversarial attacks, in which carefully chosen imperceptible modifications to the inputs lead to incorrect predictions. However, most existing attacks focus on closed-set classification, and adversarial attacks on open-set recognition have been less investigated. In this paper, we systematically investigate the adversarial robustness of widely used open-set recognition models, namely person re-identification (ReID) and face recognition (FR) models. Specifically, we compare two categories of black-box attacks: transfer-based extensions of standard closed-set attacks and several direct random-search based attacks proposed here. Extensive experiments demonstrate that ReID and FR models are also vulnerable to adversarial attacks, and highlight a potential AI trustworthiness problem for these socially important applications.

Xiao Gong, Guosheng Hu, Timothy Hospedales, Yongxin Yang
WaveTransform: Crafting Adversarial Examples via Input Decomposition

Frequency spectrum has played a significant role in learning unique and discriminating features for object recognition. Both low and high frequency information present in images have been extracted and learnt by a host of representation learning techniques, including deep learning. Inspired by this observation, we introduce a novel class of adversarial attacks, namely ‘WaveTransform’, that creates adversarial noise corresponding to low-frequency and high-frequency subbands, separately (or in combination). The frequency subbands are analyzed using wavelet decomposition; the subbands are corrupted and then used to construct an adversarial example. Experiments are performed using multiple databases and CNN models to establish the effectiveness of the proposed WaveTransform attack and analyze the importance of a particular frequency component. The robustness of the proposed attack is also evaluated through its transferability and resiliency against a recent adversarial defense algorithm. Experiments show that the proposed attack is effective against the defense algorithm and is also transferable across CNNs.

Divyam Anshumaan, Akshay Agarwal, Mayank Vatsa, Richa Singh
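
A minimal sketch of the wavelet-subband manipulation behind the WaveTransform idea described above: decompose an image, perturb the chosen low- and/or high-frequency subbands, and reconstruct. The actual attack optimises the subband noise against a classifier; here only the decompose-perturb-reconstruct plumbing is shown, with a random-sign perturbation and an assumed budget eps.

```python
# Hedged sketch using PyWavelets; wavelet, eps, and the random perturbation are assumptions.
import numpy as np
import pywt

def perturb_subbands(image, eps=0.05, attack_low=False, attack_high=True, wavelet="haar"):
    """image: 2-D float array in [0, 1] with even side lengths; returns perturbed image."""
    ll, (lh, hl, hh) = pywt.dwt2(image, wavelet)            # low-pass and detail subbands
    if attack_low:
        ll = ll + eps * np.sign(np.random.randn(*ll.shape))  # stand-in for optimised noise
    if attack_high:
        lh = lh + eps * np.sign(np.random.randn(*lh.shape))
        hl = hl + eps * np.sign(np.random.randn(*hl.shape))
        hh = hh + eps * np.sign(np.random.randn(*hh.shape))
    return np.clip(pywt.idwt2((ll, (lh, hl, hh)), wavelet), 0.0, 1.0)
```
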
Robust Super-Resolution of Real Faces Using Smooth Features

Real low-resolution (LR) face images contain degradations which are too varied and complex to be captured by known downsampling kernels and signal-independent noises. So, in order to successfully super-resolve real faces, a method needs to be robust to a wide range of noise, blur, compression artifacts, etc. Some of the recent works attempt to model these degradations from a dataset of real images using a Generative Adversarial Network (GAN). They generate synthetically degraded LR images and use them with the corresponding real high-resolution (HR) images to train a super-resolution (SR) network using a combination of a pixel-wise loss and an adversarial loss. In this paper, we propose a two-module super-resolution network where the feature extractor module extracts robust features from the LR image, and the SR module generates an HR estimate using only these robust features. We train a degradation GAN to convert bicubically downsampled clean images to real degraded images, and interpolate between the obtained degraded LR image and its clean LR counterpart. This interpolated LR image is then used along with its corresponding HR counterpart to train the super-resolution network end to end. Entropy Regularized Wasserstein Divergence is used to force the encoded features learnt from the clean and degraded images to closely resemble those extracted from the interpolated image to ensure robustness.

Saurabh Goswami, Aakanksha, A. N. Rajagopalan
Improved Robustness to Open Set Inputs via Tempered Mixup

Supervised classification methods often assume that evaluation data is drawn from the same distribution as training data and that all classes are present for training. However, real-world classifiers must handle inputs that are far from the training distribution including samples from unknown classes. Open set robustness refers to the ability to properly label samples from previously unseen categories as novel and avoid high-confidence, incorrect predictions. Existing approaches have focused on either novel inference methods, unique training architectures, or supplementing the training data with additional background samples. Here, we propose a simple regularization technique easily applied to existing convolutional neural network architectures that improves open set robustness without a background dataset. Our method achieves state-of-the-art results on open set classification baselines and easily scales to large-scale open set classification problems.

Ryne Roady, Tyler L. Hayes, Christopher Kanan
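
A minimal sketch of the generic mixup regulariser that the method above builds on: convex combinations of input pairs and of their soft labels. The paper's specific tempering of the label mixture is not reproduced here; alpha and the one-hot target layout are assumptions for illustration only.

```python
# Hedged sketch of plain mixup (PyTorch); not the authors' tempered variant.
import torch

def mixup_batch(x, y_onehot, alpha=0.3):
    """x: (B, ...) inputs; y_onehot: (B, C) targets. Returns mixed inputs and targets."""
    lam = torch.distributions.Beta(alpha, alpha).sample()   # mixing coefficient
    perm = torch.randperm(x.size(0))                        # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```
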
Defenses Against Multi-sticker Physical Domain Attacks on Classifiers

Recently, physical domain adversarial attacks have drawn significant attention from the machine learning community. One important attack proposed by Eykholt et al. can fool a classifier by placing black and white stickers on an object such as a road sign. While this attack may pose a significant threat to visual classifiers, there are currently no defenses designed to protect against this attack. In this paper, we propose new defenses that can protect against multi-sticker attacks. We present defensive strategies capable of operating when the defender has full, partial, and no prior information about the attack. By conducting extensive experiments, we show that our proposed defenses can outperform existing defenses against physical attacks when presented with a multi-sticker attack.

Xinwei Zhao, Matthew C. Stamm
Adversarial Attack on Deepfake Detection Using RL Based Texture Patches

The advancements in GANs have made creating deepfake videos a relatively easy task. Considering the threat that deepfake videos pose for manipulating political opinion, recent research has focused on ways to better detect deepfake videos. Even though researchers have had some success in detecting deepfake videos, it has been found that these detection systems can be attacked. The key contributions of this paper are (a) a deepfake dataset created using a commercial website, (b) validation of the efficacy of DeepExplainer and heart rate detection from the face for differentiating real faces from adversarial attacks, and (c) the proposal of an attack on the FaceForensics++ deepfake detection system using a state-of-the-art reinforcement learning-based texture patch attack. To the best of our knowledge, we are the first to successfully attack FaceForensics++ on our commercial deepfake dataset and DeepfakeTIMIT dataset.

Steven Lawrence Fernandes, Sumit Kumar Jha

W02 - BioImage Computation

Frontmatter
A Subpixel Residual U-Net and Feature Fusion Preprocessing for Retinal Vessel Segmentation

Retinal image analysis allows medical professionals to inspect the morphology of the retinal vessels for the diagnosis of vascular diseases. Automated extraction of the vessels is vital for computer-aided diagnostic systems to provide a speedy and precise diagnosis. This paper introduces SpruNet, a Subpixel Convolution based Residual U-Net architecture which re-purposes subpixel convolutions as a down-sampling and up-sampling method. The proposed subpixel convolution based down-sampling and up-sampling strategy efficiently minimizes the information loss during the encoding and decoding process, which in turn increases the sensitivity of the model without hurting the specificity. A feature fusion technique of combining two types of image enhancement algorithms is also introduced. The model is trained and evaluated on three mainstream public benchmark datasets, and detailed analysis and comparison of the results are provided, showing that the model achieves state-of-the-art results with less complexity. The model can make inference on a 512 × 512 pixel full image in 0.5 s.

Sohom Dey
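
A minimal sketch of using sub-pixel (pixel shuffle) operations for down- and up-sampling instead of pooling and transposed convolutions, as the abstract above describes. Channel counts are illustrative assumptions; this is not the SpruNet architecture itself.

```python
# Hedged sketch (PyTorch): space-to-depth down-sampling and depth-to-space up-sampling.
import torch.nn as nn

# Down-sample: PixelUnshuffle halves H and W while multiplying channels by 4
# (here assuming 32 input channels), so no activations are discarded as in max pooling.
down = nn.Sequential(nn.PixelUnshuffle(2),
                     nn.Conv2d(4 * 32, 64, kernel_size=3, padding=1))

# Up-sample: a convolution expands channels, then PixelShuffle rearranges them
# into a feature map with twice the spatial resolution.
up = nn.Sequential(nn.Conv2d(64, 4 * 32, kernel_size=3, padding=1),
                   nn.PixelShuffle(2))
```
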
Attention Deeplabv3+: Multi-level Context Attention Mechanism for Skin Lesion Segmentation

Skin lesion segmentation is a challenging task due to the large variation of anatomy across different cases. In the last few years, deep learning frameworks have shown high performance in image segmentation. In this paper, we propose Attention Deeplabv3+, an extended version of Deeplabv3+ for skin lesion segmentation by employing the idea of attention mechanism in two stages. We first capture the relationship between the channels of a set of feature maps by assigning a weight for each channel (i.e., channel attention). Channel attention allows the network to emphasize more on the informative and meaningful channels by a context gating mechanism. We also exploit the second-level attention strategy to integrate different layers of the atrous convolution. It helps the network to focus on the field of view more relevant to the target. The proposed model is evaluated on three datasets: ISIC 2017, ISIC 2018, and PH2, achieving state-of-the-art performance.

Reza Azad, Maryam Asadi-Aghbolaghi, Mahmood Fathy, Sergio Escalera
Automated Assessment of the Curliness of Collagen Fiber in Breast Cancer

The growth and spread of breast cancer are influenced by the composition and structural properties of collagen in the extracellular matrix of tumors. Straight alignment of collagen has been attributed to tumor cell migration, which is correlated with tumor progression and metastasis in breast cancer. Thus, there is a need to characterize collagen alignment to study its value as a prognostic biomarker. We present a framework to characterize the curliness of collagen fibers in breast cancer images from DUET (DUal-mode Emission and Transmission) studies on hematoxylin and eosin (H&E) stained tissue samples. Our novel approach highlights the characteristic fiber gradients using a standard ridge detection method before feeding into the convolutional neural network. Experiments were performed on patches of breast cancer images containing straight or curly collagen. The proposed approach outperforms transfer learning methods trained directly on the original patches in terms of area under the curve. We also explore a feature fusion strategy to combine feature representations of both the original patches and their ridge filter responses.

David Paredes, Prateek Prasanna, Christina Preece, Rajarsi Gupta, Farzad Fereidouni, Dimitris Samaras, Tahsin Kurc, Richard M. Levenson, Patricia Thompson-Carino, Joel Saltz, Chao Chen
Bionic Tracking: Using Eye Tracking to Track Biological Cells in Virtual Reality

We present Bionic Tracking, a novel method for solving biological cell tracking problems with eye tracking in virtual reality using commodity hardware. Using gaze data, and especially smooth pursuit eye movements, we are able to track cells in time series of 3D volumetric datasets. The problem of tracking cells is ubiquitous in developmental biology, where large volumetric microscopy datasets are acquired on a daily basis, often comprising hundreds or thousands of time points that span hours or days. The image data, however, is only a means to an end, and scientists are often interested in the reconstruction of cell trajectories and cell lineage trees. Reliably tracking cells in crowded three-dimensional space over many time points remains an open problem, and many current approaches rely on tedious manual annotation or curation. In the Bionic Tracking approach, we substitute the usual 2D point-and-click interface for annotation or curation with eye tracking in a virtual reality headset, where users follow cells with their eyes in 3D space in order to track them. We detail the interaction design of our approach and explain the graph-based algorithm used to connect different time points, also taking occlusion and user distraction into account. We demonstrate Bionic Tracking using examples from two different biological datasets. Finally, we report on a user study with seven cell tracking experts, highlighting the benefits and limitations of Bionic Tracking compared to point-and-click interfaces.

Ulrik Günther, Kyle I. S. Harrington, Raimund Dachselt, Ivo F. Sbalzarini
Cardiac MR Image Sequence Segmentation with Temporal Motion Encoding

The segmentation of cardiac magnetic resonance (MR) images is a critical step for the accurate assessment of cardiac function and the diagnosis of cardiovascular diseases. In this work, we propose a novel segmentation method that is able to effectively leverage the temporal information in cardiac MR image sequences. Specifically, we construct a Temporal Aggregation Module (TAM) to incorporate the temporal image-based features into a backbone spatial segmentation network (such as a 2D U-Net) with negligible extra computation cost. In addition, we also introduce a novel Motion Encoding Module (MEM) to explicitly encode the motion features of the heart. Experimental results demonstrate that each of the two modules enables clear improvements upon the base spatial network, and their combination leads to further enhanced performance. The proposed method outperforms the previous methods significantly, demonstrating the effectiveness of our design.

Pengxiang Wu, Qiaoying Huang, Jingru Yi, Hui Qu, Meng Ye, Leon Axel, Dimitris Metaxas
Classifying Nuclei Shape Heterogeneity in Breast Tumors with Skeletons

In this study, we demonstrate the efficacy of scoring statistics derived from a medial axis transform, for differentiating tumor and non-tumor nuclei, in malignant breast tumor histopathology images. Characterizing nuclei shape is a crucial part of diagnosing breast tumors for human doctors, and these scoring metrics may be integrated into machine perception algorithms which aggregate nuclei information across a region to label whole breast lesions. In particular, we present a low-dimensional representation capturing characteristics of a skeleton extracted from nuclei. We show that this representation outperforms both prior morphological features, as well as CNN features, for classification of tumors. Nuclei and region scoring algorithms such as the one presented here can aid pathologists in the diagnosis of breast tumors.

Brian Falkenstein, Adriana Kovashka, Seong Jae Hwang, S. Chakra Chennubhotla
DenoiSeg: Joint Denoising and Segmentation

Microscopy image analysis often requires the segmentation of objects, but training data for this task is typically scarce and hard to obtain. Here we propose DenoiSeg, a new method that can be trained end-to-end on only a few annotated ground truth segmentations. We achieve this by extending Noise2Void, a self-supervised denoising scheme that can be trained on noisy images alone, to also predict dense 3-class segmentations. The reason for the success of our method is that segmentation can profit from denoising, especially when performed jointly within the same network. The network becomes a denoising expert by seeing all available raw data, while co-learning to segment, even if only a few segmentation labels are available. This hypothesis is additionally fueled by our observation that the best segmentation results on high quality (very low noise) raw data are obtained when moderate amounts of synthetic noise are added. This renders the denoising-task non-trivial and unleashes the desired co-learning effect. We believe that DenoiSeg offers a viable way to circumvent the tremendous hunger for high quality training data and effectively enables learning of dense segmentations when only very limited amounts of segmentation labels are available.

Tim-Oliver Buchholz, Mangal Prakash, Deborah Schmidt, Alexander Krull, Florian Jug
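
A minimal sketch of the kind of joint objective the DenoiSeg abstract describes: one output head is trained with a self-supervised (Noise2Void-style) blind-spot denoising loss on all images, another with a 3-class segmentation loss on the few annotated ones, with a weight balancing the two terms. The loss shapes, masking, and the weight alpha are placeholders, not the published training scheme.

```python
# Hedged sketch of a joint denoising + segmentation loss (PyTorch); details are assumed.
import torch
import torch.nn.functional as F

def denoiseg_loss(denoise_pred, masked_target, mask, seg_logits, seg_labels, has_labels,
                  alpha=0.5):
    """mask selects the blind-spot pixels; has_labels marks images that carry annotations."""
    denoise_loss = ((denoise_pred - masked_target) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    if has_labels.any():
        seg_loss = F.cross_entropy(seg_logits[has_labels], seg_labels[has_labels])
    else:
        seg_loss = denoise_pred.new_zeros(())       # no annotated images in this batch
    return alpha * denoise_loss + (1 - alpha) * seg_loss
```
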
DoubleU-Net: Colorectal Cancer Diagnosis and Gland Instance Segmentation with Text-Guided Feature Control

With the rapid therapeutic advancement in personalized medicine, the role of pathologists for colorectal cancer has greatly expanded from morphologists to clinical consultants. In addition to cancer diagnosis, pathologists are responsible for multiple assessments based on glandular morphology statistics, like selecting appropriate tissue sections for mutation analysis [6]. Therefore, we propose DoubleU-Net that determines the initial gland segmentation and diagnoses the histologic grades simultaneously, and then incorporates the diagnosis text data to produce more accurate final segmentation. Our DoubleU-Net shows three advantages: (1) Besides the initial segmentation, it offers histologic grade diagnosis and enhanced segmentation for full-scale assistance. (2) The textual features extracted from diagnosis data provide high-level guidance related to gland morphology, and boost the performance of challenging cases with seriously deformed glands. (3) It can be extended to segmentation tasks with text data like key clinical phrases or pathology descriptions. The model is evaluated on two public colon gland datasets and achieves state-of-the-art performance.

Pei Wang, Albert C. S. Chung
Dynamic Image for 3D MRI Image Alzheimer’s Disease Classification

We propose to apply a 2D CNN architecture to 3D MRI image Alzheimer’s disease classification. Training a 3D convolutional neural network (CNN) is time-consuming and computationally expensive. We make use of approximate rank pooling to transform the 3D MRI image volume into a 2D image to use as input to a 2D CNN. We show our proposed CNN model achieves 9.5% better Alzheimer’s disease classification accuracy than the baseline 3D models. We also show that our method allows for efficient training, requiring only 20% of the training time compared to 3D CNN models. The code is available online: https://github.com/UkyVision/alzheimer-project.

Xin Xing, Gongbo Liang, Hunter Blanton, Muhammad Usman Rafique, Chris Wang, Ai-Ling Lin, Nathan Jacobs
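
A minimal sketch of approximate rank pooling, which the abstract above uses to collapse a 3-D MRI volume into a single 2-D "dynamic image" for a 2-D CNN. The closed-form slice weights follow the simplified approximate rank pooling formulation (alpha_t = 2t - T - 1); treating MRI slices as the pooled "frames" is the assumption being illustrated, not necessarily the authors' exact preprocessing.

```python
# Hedged sketch of approximate rank pooling over a slice stack.
import numpy as np

def approximate_rank_pooling(volume):
    """volume: (T, H, W) stack of slices -> (H, W) dynamic image."""
    T = volume.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0                      # closed-form rank-pooling weights
    return np.tensordot(alpha, volume, axes=(0, 0))  # weighted sum over the slice axis
```
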
Feedback Attention for Cell Image Segmentation

In this paper, we address the cell image segmentation task with a Feedback Attention mechanism inspired by feedback processing. Unlike conventional neural network models based on feedforward processing, we focus on the feedback processing in the human brain and assume that the network learns like a human by connecting feature maps from deep layers to shallow layers. We propose several Feedback Attention modules that imitate the human brain by feeding the feature maps of the output layer back to layers close to the input. U-Net with Feedback Attention showed better results than conventional methods using only feedforward processing.

Hiroki Tsuda, Eisuke Shibuya, Kazuhiro Hotta
Improving Blind Spot Denoising for Microscopy

Many microscopy applications are limited by the total amount of usable light and are consequently challenged by the resulting levels of noise in the acquired images. This problem is often addressed via (supervised) deep learning based denoising. Recently, by making assumptions about the noise statistics, self-supervised methods have emerged. Such methods are trained directly on the images that are to be denoised and do not require additional paired training data. While achieving remarkable results, self-supervised methods can produce high-frequency artifacts and achieve inferior results compared to supervised approaches. Here we present a novel way to improve the quality of self-supervised denoising. Considering that light microscopy images are usually diffraction-limited, we propose to include this knowledge in the denoising process. We assume the clean image to be the result of a convolution with a point spread function (PSF) and explicitly include this operation at the end of our neural network. As a consequence, we are able to eliminate high-frequency artifacts and achieve self-supervised results that are very close to the ones achieved with traditional supervised methods.

Anna S. Goncharova, Alf Honigmann, Florian Jug, Alexander Krull
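
A minimal sketch of the key idea in the abstract above: the network predicts the underlying sharp signal, which is then convolved with the microscope PSF before the self-supervised loss is computed, so high-frequency artifacts in the estimate cannot help the fit. The PSF, network, and blind-spot masking details are placeholders.

```python
# Hedged sketch (PyTorch): constrain the network output through a fixed PSF convolution.
import torch
import torch.nn.functional as F

def psf_constrained_output(network, noisy, psf):
    """noisy: (B, 1, H, W); psf: (1, 1, kH, kW) normalised kernel with odd side lengths."""
    estimate = network(noisy)                                  # predicted clean (sharp) signal
    pad = (psf.shape[-1] // 2, psf.shape[-1] // 2,
           psf.shape[-2] // 2, psf.shape[-2] // 2)              # keep spatial size after conv
    blurred = F.conv2d(F.pad(estimate, pad, mode="reflect"), psf)
    return estimate, blurred                                   # the loss is taken on `blurred`
```
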
Learning to Restore ssTEM Images from Deformation and Corruption

Serial section transmission electron microscopy (ssTEM) plays an important role in biological research. Due to the imperfect sample preparation, however, ssTEM images suffer from inevitable artifacts that pose huge challenges for the subsequent analysis and visualization. In this paper, we propose a novel strategy for modeling the main type of degradation, i.e., Support Film Folds (SFF), by characterizing this degradation process as a combination of content deformation and corruption. Relying on that, we then synthesize a sufficient amount of paired samples (degraded/groundtruth), which enables the training of a tailored deep restoration network. To the best of our knowledge, this is the first learning-based framework for ssTEM image restoration. Experiments on both synthetic and real test data demonstrate the superior performance of our proposed method over existing solutions, in terms of both image restoration quality and neuron segmentation accuracy.

Wei Huang, Chang Chen, Zhiwei Xiong, Yueyi Zhang, Dong Liu, Feng Wu
Learning to Segment Microscopy Images with Lazy Labels

The need for labour-intensive pixel-wise annotation is a major limitation of many fully supervised learning methods for segmenting bioimages that can contain numerous object instances with thin separations. In this paper, we introduce a deep convolutional neural network for microscopy image segmentation. Annotation issues are circumvented by making the network trainable on coarse labels combined with only a very small number of images with pixel-wise annotations. We call this new labelling strategy ‘lazy’ labels. Image segmentation is stratified into three connected tasks: rough inner region detection, object separation and pixel-wise segmentation. These tasks are learned in an end-to-end multi-task learning framework. The method is demonstrated on two microscopy datasets, where we show that the model gives accurate segmentation results even if exact boundary labels are missing for a majority of the annotated data. It brings more flexibility and efficiency for training deep neural networks that are data hungry and is applicable to biomedical images with poor contrast at the object boundaries or with diverse textures and repeated patterns.

Rihuan Ke, Aurélie Bugeau, Nicolas Papadakis, Peter Schuetz, Carola-Bibiane Schönlieb
Multi-CryoGAN: Reconstruction of Continuous Conformations in Cryo-EM Using Generative Adversarial Networks

We propose a deep-learning-based reconstruction method for cryo-electron microscopy (Cryo-EM) that can model multiple conformations of a nonrigid biomolecule in a standalone manner. Cryo-EM produces many noisy projections from separate instances of the same but randomly oriented biomolecule. Current methods rely on pose and conformation estimation, which is inefficient for the reconstruction of continuous conformations that carry valuable information. We introduce Multi-CryoGAN, which sidesteps this additional processing by casting volume reconstruction as a distribution matching problem. By introducing a manifold mapping module, Multi-CryoGAN can learn continuous structural heterogeneity without pose estimation or clustering. We also give a theoretical guarantee of recovery of the true conformations. Our method can successfully reconstruct 3D protein complexes on synthetic 2D Cryo-EM datasets for both continuous and discrete structural variability scenarios. Multi-CryoGAN is the first model that can reconstruct continuous conformations of a biomolecule from Cryo-EM images in a fully unsupervised and end-to-end manner.

Harshit Gupta, Thong H. Phan, Jaejun Yoo, Michael Unser
Probabilistic Deep Learning for Instance Segmentation

Probabilistic convolutional neural networks, which predict distributions of predictions instead of point estimates, led to recent advances in many areas of computer vision, from image reconstruction to semantic segmentation. Besides state-of-the-art benchmark results, these networks made it possible to quantify local uncertainties in the predictions. These were used in active learning frameworks to target the labeling efforts of specialist annotators or to assess the quality of a prediction in a safety-critical environment. However, these methods have so far been used only infrequently for instance segmentation problems. We seek to close this gap by proposing a generic method to obtain model-inherent uncertainty estimates within proposal-free instance segmentation models. Furthermore, we analyze the quality of the uncertainty estimates with a metric adapted from semantic segmentation. We evaluate our method on the BBBC010 C. elegans dataset, where it yields competitive performance while also predicting uncertainty estimates that carry information about object-level inaccuracies like false splits and false merges. We perform a simulation to show the potential use of such uncertainty estimates in guided proofreading.

Josef Lorenz Rumberger, Lisa Mais, Dagmar Kainmueller
Registration of Multi-modal Volumetric Images by Establishing Cell Correspondence

Early development of an animal from an egg involves a rapid increase in cell number and several cell fate specification events accompanied by dynamic morphogenetic changes. In order to correlate the morphological changes with the genetic events, one typically needs to monitor the living system with several imaging modalities offering different spatial and temporal resolution. Live imaging allows monitoring the embryo at a high temporal resolution and observing the morphological changes. On the other hand, confocal images of specimens fixed and stained for the expression of certain genes enable observing the transcription states of an embryo at specific time points during development with high spatial resolution. The two imaging modalities cannot, by definition, be applied to the same specimen and thus, separately obtained images of different specimens need to be registered. Biologically, the most meaningful way to register the images is by identifying cellular correspondences between these two imaging modalities. In this way, one can bring the two sources of information into a single domain and combine dynamic information on morphogenesis with static gene expression data. Here we propose a new computational pipeline for identifying cell-to-cell correspondences between images from multiple modalities and for using these correspondences to register 3D images within and across imaging modalities. We demonstrate this pipeline by combining four-dimensional recording of embryogenesis of Spiralian annelid ragworm Platynereis dumerilii with three-dimensional scans of fixed Platynereis dumerilii embryos stained for the expression of a variety of important developmental genes. We compare our approach with methods for aligning point clouds and show that we match the accuracy of these state-of-the-art registration pipelines on synthetic data. We show that our approach outperforms these methods on real biological imaging datasets. Importantly, our approach uniquely provides, in addition to the registration, also the non-redundant matching of corresponding, biologically meaningful entities within the registered specimen which is the prerequisite for generating biological insights from the combined datasets. The complete pipeline is available for public use through a Fiji plugin.

Manan Lalit, Mette Handberg-Thorsager, Yu-Wen Hsieh, Florian Jug, Pavel Tomancak
W2S: Microscopy Data with Joint Denoising and Super-Resolution for Widefield to SIM Mapping

In fluorescence microscopy live-cell imaging, there is a critical trade-off between the signal-to-noise ratio and spatial resolution on one side, and the integrity of the biological sample on the other side. To obtain clean high-resolution (HR) images, one can either use microscopy techniques, such as structured-illumination microscopy (SIM), or apply denoising and super-resolution (SR) algorithms. However, the former option requires multiple shots that can damage the samples, and although efficient deep learning based algorithms exist for the latter option, no benchmark exists to evaluate these algorithms on the joint denoising and SR (JDSR) tasks. To study JDSR on microscopy data, we propose such a novel JDSR dataset, Widefield2SIM (W2S), acquired using a conventional fluorescence widefield and SIM imaging. W2S includes 144,000 real fluorescence microscopy images, resulting in a total of 360 sets of images. A set is comprised of noisy low-resolution (LR) widefield images with different noise levels, a noise-free LR image, and a corresponding high-quality HR SIM image. W2S allows us to benchmark the combinations of 6 denoising methods and 6 SR methods. We show that state-of-the-art SR networks perform very poorly on noisy inputs. Our evaluation also reveals that applying the best denoiser in terms of reconstruction error followed by the best SR method does not necessarily yield the best final result. Both quantitative and qualitative results show that SR networks are sensitive to noise and the sequential application of denoising and SR algorithms is sub-optimal. Lastly, we demonstrate that SR networks retrained end-to-end for JDSR outperform any combination of state-of-the-art deep denoising and SR networks (Code and data available at https://github.com/IVRL/w2s).

Ruofan Zhou, Majed El Helou, Daniel Sage, Thierry Laroche, Arne Seitz, Sabine Süsstrunk

W03 - Egocentric Perception, Interaction, and Computing

Frontmatter
An Investigation of Deep Visual Architectures Based on Preprocess Using the Retinal Transform

This work investigates the utility of a biologically motivated software retina model to pre-process and compress visual information prior to training and classification by means of deep convolutional neural networks (CNNs) in the context of object recognition in robotics and egocentric perception. We captured a dataset of video clips in a standard office environment by means of a hand-held high-resolution digital camera using uncontrolled illumination. Individual video sequences for each of 20 objects were captured over the observable view hemisphere for each object and several sequences were captured per object to serve training and validation within an object recognition task. A key objective of this project is to investigate appropriate network architectures for processing retina transformed input images and in particular to determine the utility of spatio-temporal CNNs versus simple feed-forward CNNs. A number of different CNN architectures were devised and compared in their classification performance accordingly. The project demonstrated that the image classification task could be conducted with an accuracy exceeding 98% under varying lighting conditions when the object was viewed from distances similar to those used during training.

Álvaro Mendes Samagaio, Jan Paul Siebert
Data Augmentation Techniques for the Video Question Answering Task

Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, and the interaction between them in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos, because of the importance of such a task, which can have an impact on many different fields, such as social assistance and industrial training. Recently, an Egocentric VideoQA dataset, called EgoVQA, has been released. Given its small size, models tend to overfit quickly. To alleviate this problem, we propose several augmentation techniques which give us a +5.5% improvement on the final accuracy over the considered baseline.

Alex Falcon, Oswald Lanz, Giuseppe Serra

W05 - Eye Gaze in VR, AR, and in the Wild

Frontmatter
Efficiency in Real-Time Webcam Tracking

Efficiency and ease of use are essential for practical applications of camera based eye/gaze-tracking. Gaze tracking involves estimating where a person is looking on a screen based on face images from a computer-facing camera. In this paper we investigate two complementary forms of efficiency in gaze tracking: 1. The computational efficiency of the system which is dominated by the inference speed of a CNN predicting gaze-vectors; 2. The usability efficiency which is determined by the tediousness of the mandatory calibration of the gaze-vector to a computer screen. To do so, we evaluate the computational speed/accuracy trade-off for the CNN and the calibration effort/accuracy trade-off for screen calibration. For the CNN, we evaluate the full face, two-eyes, and single eye input. For screen calibration, we measure the number of calibration points needed and evaluate three types of calibration: 1. pure geometry, 2. pure machine learning, and 3. hybrid geometric regression. Results suggest that a single eye input and geometric regression calibration achieve the best trade-off.

Amogh Gudi, Xin Li, Jan van Gemert
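
A minimal sketch of the machine-learning flavour of screen calibration discussed in the abstract above: fit a small regressor from CNN-predicted gaze features to 2-D screen coordinates using a handful of calibration clicks. The feature layout and the choice of ridge regression are illustrative assumptions, not the paper's exact calibration scheme.

```python
# Hedged sketch of regression-based screen calibration with scikit-learn.
import numpy as np
from sklearn.linear_model import Ridge

def fit_screen_calibration(gaze_features, screen_xy, alpha=1.0):
    """gaze_features: (N, D) features per calibration point (e.g. predicted gaze vector
    plus head position); screen_xy: (N, 2) on-screen target coordinates."""
    return Ridge(alpha=alpha).fit(gaze_features, screen_xy)

# Usage: collect a few calibration points, then model.predict(new_features) -> (x, y) on screen.
```
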
Hierarchical HMM for Eye Movement Classification

In this work, we tackle the problem of ternary eye movement classification, which aims to separate fixations, saccades and smooth pursuits from the raw eye positional data. The efficient classification of these different types of eye movements helps to better analyze and utilize the eye tracking data. Different from the existing methods that detect eye movement by several pre-defined threshold values, we propose a hierarchical Hidden Markov Model (HMM) statistical algorithm for detecting fixations, saccades and smooth pursuits. The proposed algorithm leverages different features from the recorded raw eye tracking data with a hierarchical classification strategy, separating one type of eye movement each time. Experimental results demonstrate the effectiveness and robustness of the proposed method by achieving competitive or better performance compared to the state-of-the-art methods.

Ye Zhu, Yan Yan, Oleg Komogortsev
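
A minimal sketch of one stage of a hierarchical HMM classifier like the one described above: a two-state Gaussian HMM over a 1-D feature (e.g. gaze velocity) separates "slow" samples (fixations and smooth pursuits) from "fast" ones (saccades); a second HMM on another feature would then split the slow class. The feature choice and two-stage ordering are assumptions for illustration.

```python
# Hedged sketch using hmmlearn; feature engineering and state semantics are assumed.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def split_by_hmm(feature_1d, n_states=2, random_state=0):
    """feature_1d: (T,) sequence; returns per-sample state labels, 0 = lowest-mean state."""
    x = feature_1d.reshape(-1, 1)
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                      n_iter=100, random_state=random_state).fit(x)
    states = hmm.predict(x)
    order = np.argsort(hmm.means_.ravel())   # relabel states so that 0 is the slowest
    return np.argsort(order)[states]
```
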
Domain Adaptation for Eye Segmentation

Domain adaptation (DA) has been widely investigated as a framework to alleviate the laborious task of data annotation for image segmentation. Most DA investigations operate under the unsupervised domain adaptation (UDA) setting, where the modeler has access to a large cohort of source domain labeled data and target domain data with no annotations. UDA techniques exhibit poor performance when the domain gap, i.e., the distribution overlap between the data in source and target domain is large. We hypothesize that the DA performance gap can be improved with the availability of a small subset of labeled target domain data. In this paper, we systematically investigate the impact of varying amounts of labeled target domain data on the performance gap for DA. We specifically focus on the problem of segmenting eye-regions from eye images collected using two different head mounted display systems. Source domain is comprised of 12,759 eye images with annotations and target domain is comprised of 4,629 images with varying amounts of annotations. Experiments are performed to compare the impact on DA performance gap under three schemes: unsupervised (UDA), supervised (SDA) and semi-supervised (SSDA) domain adaptation. We evaluate these schemes by measuring the mean intersection-over-union (mIoU) metric. Using only 200 samples of labeled target data under SDA and SSDA schemes, we show an improvement in mIoU of 5.4% and 6.6% respectively, over mIoU of 81.7% under UDA. By using all available labeled target data, models trained under SSDA achieve a competitive mIoU score of 89.8%. Overall, we conclude that availability of a small subset of target domain data with annotations can substantially improve DA performance.

Yiru Shen, Oleg Komogortsev, Sachin S. Talathi
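
A minimal sketch of the mean intersection-over-union (mIoU) metric used in the abstract above to compare the UDA, SDA and SSDA models: per-class IoU averaged over the classes present in the ground truth. The class count and the handling of absent classes are illustrative choices.

```python
# Hedged sketch of mIoU over integer label maps.
import numpy as np

def mean_iou(pred, target, num_classes):
    """pred, target: integer label maps of equal shape; returns mean IoU over present classes."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```
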
EyeSeg: Fast and Efficient Few-Shot Semantic Segmentation

Semantic segmentation is a key component in eye- and gaze- tracking for virtual reality (VR) and augmented reality (AR) applications. While it is a well-studied computer vision problem, most state-of-the-art models require large amounts of labeled data, which is limited in this specific domain. An additional consideration in eye tracking is the capacity for real-time predictions, necessary for responsive AR/VR interfaces. In this work, we propose EyeSeg, an encoder-decoder architecture designed for accurate pixel-wise few-shot semantic segmentation with limited annotated data. We report results from the OpenEDS2020 Challenge, yielding a 94.5% mean Intersection Over Union (mIOU) score, which is a 10.5% score increase over the baseline approach. The experimental results demonstrate state-of-the-art performance while preserving a low latency framework. Source code is available: http://www.cs.utsa.edu/~fernandez/segmentation.html.

Jonathan Perry, Amanda S. Fernandez

W10 - TASK-CV Workshop and VisDA Challenge

Frontmatter
Class-Imbalanced Domain Adaptation: An Empirical Odyssey

Unsupervised domain adaptation is a promising way to generalize deep models to novel domains. However, the current literature assumes that the label distribution is domain-invariant and only aligns the feature distributions, or vice versa. In this work, we explore the more realistic task of Class-imbalanced Domain Adaptation: How to align feature distributions across domains while the label distributions of the two domains are also different? Taking a practical step towards this problem, we constructed its first benchmark with 22 cross-domain tasks from 6 real-image datasets. We conducted comprehensive experiments on 10 recent domain adaptation methods and found most of them are very fragile in the face of coexisting feature and label distribution shift. Towards a better solution, we further proposed a feature and label distribution CO-ALignment (COAL) model with a novel combination of existing ideas. COAL is empirically shown to outperform most recent domain adaptation methods on our benchmarks. We believe the provided benchmarks, empirical analysis results, and the COAL baseline could stimulate and facilitate future research towards this important problem.

Shuhan Tan, Xingchao Peng, Kate Saenko
Sequential Learning for Domain Generalization

In this paper we propose a sequential learning framework for Domain Generalization (DG), the problem of training a model that is robust to domain shift by design. Various DG approaches have been proposed with different motivating intuitions, but they typically optimize for a single step of domain generalization – training on one set of domains and generalizing to one other. Our sequential learning is inspired by the idea of lifelong learning, where accumulated experience means that learning the n-th thing becomes easier than the first. In DG this means encountering a sequence of domains and at each step training to maximise performance on the next domain. The performance at domain n then depends on the previous n-1 learning problems. Thus backpropagating through the sequence means optimizing performance not just for the next domain, but all following domains. Training on all such sequences of domains provides dramatically more ‘practice’ for a base DG learner compared to existing approaches, thus improving performance on a true testing domain. This strategy can be instantiated for different base DG algorithms, but we focus on its application to the recently proposed Meta-Learning Domain generalization (MLDG). We show that for MLDG it leads to a simple-to-implement and fast algorithm that provides consistent performance improvement on a variety of DG benchmarks.

Da Li, Yongxin Yang, Yi-Zhe Song, Timothy Hospedales
Generating Visual and Semantic Explanations with Multi-task Network

Explaining deep models is desirable, especially for improving user trust and experience. Much progress has been made recently towards visually and semantically explaining deep models. However, establishing the most effective explanation is often human-dependent, which suffers from the bias of the annotators. To address this issue, we propose a multitask learning network (MTL-Net) that generates saliency-based visual explanations as well as attribute-based semantic explanations. Via an integrated evaluation mechanism, our model quantitatively evaluates the quality of the generated explanations. First, we introduce attributes to the image classification process and rank the attribute contribution with gradient weighted mapping, then generate semantic explanations with those attributes. Second, we propose a fusion classification mechanism (FCM) to evaluate three recent saliency-based visual explanation methods by their influence on the classification. Third, we conduct user studies, quantitative and qualitative evaluations. According to our results on three benchmark datasets with varying size and granularity, our attribute-based semantic explanations are not only helpful to the user but also improve the classification accuracy of the model, and our ranking framework detects the best performing visual explanation method in agreement with the users.

Wenjia Xu, Jiuniu Wang, Yang Wang, Yirong Wu, Zeynep Akata
SpotPatch: Parameter-Efficient Transfer Learning for Mobile Object Detection

As mobile hardware technology advances, on-device computation is becoming more and more affordable.

Keren Ye, Adriana Kovashka, Mark Sandler, Menglong Zhu, Andrew Howard, Marco Fornoni
Using Sentences as Semantic Representations in Large Scale Zero-Shot Learning

Zero-shot learning (ZSL) aims to recognize instances of unseen classes, for which no visual instance is available during training, by learning multimodal relations between samples from seen classes and corresponding class semantic representations. These class representations usually consist of either attributes, which do not scale well to large datasets, or word embeddings, which lead to poorer performance. A good trade-off could be to employ short sentences in natural language as class descriptions. We explore different solutions to use such short descriptions in a ZSL setting and show that while simple methods cannot achieve very good results with sentences alone, a combination of usual word embeddings and sentences can significantly outperform current state-of-the-art.

Yannick Le Cacheux, Hervé Le Borgne, Michel Crucianu
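
A minimal sketch of how a class described by a sentence embedding can be scored in a ZSL setting like the one discussed above: project visual features into the text-embedding space and pick the class whose pre-computed sentence embedding is most similar. The projection matrix W is assumed to have been learned on seen classes; this is a generic compatibility-scoring illustration, not the paper's specific method.

```python
# Hedged sketch of ZSL prediction with sentence embeddings as class prototypes.
import numpy as np

def zsl_predict(visual_feat, W, class_sentence_embeddings):
    """visual_feat: (D,); W: (E, D) learned projection; class_sentence_embeddings: (C, E),
    assumed L2-normalised. Returns the index of the best-matching (unseen) class."""
    z = W @ visual_feat
    z = z / (np.linalg.norm(z) + 1e-12)                     # normalise the projected feature
    return int(np.argmax(class_sentence_embeddings @ z))    # cosine similarity argmax
```
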
Adversarial Transfer of Pose Estimation Regression

We address the problem of camera pose estimation in visual localization. Current regression-based methods for pose estimation are trained and evaluated scene-wise. They depend on the coordinate frame of the training dataset and show a low generalization across scenes and datasets. We identify dataset shift as an important barrier to generalization and consider transfer learning as an alternative way towards a better reuse of pose estimation models. We revise domain adaptation techniques for classification and extend them to camera pose estimation, which is a multi-regression task. We develop a deep adaptation network for learning scene-invariant image representations and use adversarial learning to generate such representations for model transfer. We enrich the network with self-supervised learning and use the adaptability theory to validate the existence of scene-invariant representation of images in two given scenes. We evaluate our network on two public datasets, Cambridge Landmarks and 7Scene, demonstrate its superiority over several baselines and compare to state-of-the-art methods.

Boris Chidlovskii, Assem Sadek
Disentangled Image Generation for Unsupervised Domain Adaptation

We explore the use of generative modeling in unsupervised domain adaptation (UDA), where annotated real images are only available in the source domain, and pseudo images are generated in a manner that allows independent control of class (content) and nuisance variability (style). The proposed method differs from existing generative UDA models in that we explicitly disentangle the content and nuisance features at different layers of the generator network. We demonstrate the effectiveness of (pseudo)-conditional generation by showing that it improves upon baseline methods. Moreover, we outperform the previous state-of-the-art with significant margins in recently introduced multi-source domain adaptation (MSDA) tasks, achieving significant error reduction rates of 50.27%, 89.54%, 75.35%, 27.46% and 94.3% in all 5 tasks.

Safa Cicek, Ning Xu, Zhaowen Wang, Hailin Jin, Stefano Soatto
Domain Generalization Using Shape Representation

CNN-based representations have greatly advanced the state of the art in visual recognition, but the community has primarily focused on the setting where training and test set belong to the same dataset/distribution

Narges Honarvar Nazari, Adriana Kovashka
Bi-Dimensional Feature Alignment for Cross-Domain Object Detection

Recently the problem of cross-domain object detection has started drawing attention in the computer vision community. In this paper, we propose a novel unsupervised cross-domain detection model that exploits the annotated data in a source domain to train an object detector for a different target domain. The proposed model mitigates the cross-domain representation divergence for object detection by performing cross-domain feature alignment in two dimensions, the depth dimension and the spatial dimension. In the depth dimension of channel layers, it uses inter-channel information to bridge the domain divergence with respect to image style alignment. In the dimension of spatial layers, it deploys spatial attention modules to enhance detection relevant regions and suppress irrelevant regions with respect to cross-domain feature alignment. Experiments are conducted on a number of benchmark cross-domain detection datasets. The empirical results show the proposed method outperforms the state-of-the-art comparison methods.

Zhen Zhao, Yuhong Guo, Jieping Ye
Bayesian Zero-Shot Learning

Object classes that surround us have a natural tendency to emerge at varying levels of abstraction. We propose a Bayesian approach to zero-shot learning (ZSL) that introduces the notion of meta-classes and implements a Bayesian hierarchy around these classes to effectively blend the data likelihood with local and global priors. Local priors driven by data from seen classes, i.e., classes available at training time, become instrumental in recovering unseen classes, i.e., classes that are missing at training time, in a generalized ZSL (GZSL) setting. Hyperparameters of the Bayesian model offer a convenient way to optimize the trade-off between seen- and unseen-class accuracy. We conduct experiments on seven benchmark datasets, including the large-scale ImageNet, and show that our model produces promising results in the challenging GZSL setting.
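A generic two-level meta-class hierarchy of this flavor can be written as follows; the Gaussian forms and hyperparameters are purely illustrative and not the paper's exact specification.

```latex
% Illustrative meta-class hierarchy (not the paper's exact model): each meta-class m
% carries a global prior, each class k within it a local prior, and unseen classes
% borrow statistical strength from their meta-class.
\begin{align*}
\boldsymbol{\mu}_m &\sim \mathcal{N}(\boldsymbol{\mu}_0,\; \sigma_0^2 I)
  && \text{global prior over meta-class means}\\
\boldsymbol{\mu}_k \mid \boldsymbol{\mu}_m &\sim \mathcal{N}(\boldsymbol{\mu}_m,\; \sigma_m^2 I)
  && \text{local prior: class $k$ belongs to meta-class $m$}\\
\mathbf{x}_i \mid y_i = k &\sim \mathcal{N}(\boldsymbol{\mu}_k,\; \Sigma)
  && \text{data likelihood for seen and unseen classes}
\end{align*}
```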

Sarkhan Badirli, Zeynep Akata, Murat Dundar
Self-Supervision for 3D Real-World Challenges

We consider several possible scenarios involving synthetic and real-world point clouds where supervised learning fails due to data scarcity and large domain gaps. We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that can solve a 3D puzzle while learning the main task of shape classification or part segmentation.
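A minimal sketch of such a multi-task setup, assuming a toy point-cloud encoder and a 3x3x3 puzzle grid, could look as follows; all names and sizes are assumptions.

```python
# Sketch (PyTorch): a shared point-cloud encoder trained jointly on the main task
# (shape classification here) and on a self-supervised 3D-puzzle task in which
# shuffled point-cloud pieces must be assigned back to their original grid cells.
import torch
import torch.nn as nn

class MultiTaskPointNet(nn.Module):
    def __init__(self, n_classes=40, n_puzzle_cells=27):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256))
        self.cls_head = nn.Linear(256, n_classes)               # main task
        self.puzzle_head = nn.Linear(256, n_puzzle_cells)       # per-point cell prediction

    def forward(self, pts):                                     # pts: (B, N, 3)
        f_point = self.encoder(pts)                             # per-point features
        f_global = f_point.max(dim=1).values                    # global shape feature
        return self.cls_head(f_global), self.puzzle_head(f_point)

def joint_loss(model, pts_shuffled, cls_labels, cell_labels, w_ss=0.5):
    cls_logits, puzzle_logits = model(pts_shuffled)
    main = nn.functional.cross_entropy(cls_logits, cls_labels)
    # Self-supervision: predict the original grid cell of every (displaced) point.
    ss = nn.functional.cross_entropy(puzzle_logits.transpose(1, 2), cell_labels)
    return main + w_ss * ss
```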

Antonio Alliegro, Davide Boscaini, Tatiana Tommasi
Diversified Mutual Learning for Deep Metric Learning

Mutual learning is an ensemble training strategy that improves generalization by transferring individual knowledge between models while training them simultaneously. In this work, we propose an effective mutual learning method for deep metric learning, called Diversified Mutual Metric Learning, which enhances embedding models with diversified mutual learning. We transfer relational knowledge for deep metric learning by leveraging three kinds of diversity in mutual learning: (1) model diversity from different initializations, (2) temporal diversity from different frequencies of parameter update, and (3) view diversity from different augmentations of the inputs. Our method is particularly well suited to inductive transfer learning in the absence of large-scale data, where the embedding model is initialized with a pretrained model and then fine-tuned on a target dataset. Extensive experiments show that our method significantly improves individual models as well as their ensemble. Finally, the proposed method with a conventional triplet loss achieves state-of-the-art Recall@1 on standard datasets: 69.9 on CUB-200-2011 and 89.1 on CARS-196.
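A simplified sketch of mutual metric learning between two diversified peers, assuming a triplet loss and distance-matrix (relational) transfer, is given below; it is only indicative of the general recipe, and the diversity mechanisms (initializations, update frequencies, augmentations) are handled outside the snippet.

```python
# Sketch (PyTorch): each peer is trained with its own triplet loss plus a term that
# matches its batch distance matrix to the other peer's (relational transfer).
import torch
import torch.nn as nn
import torch.nn.functional as F

def pairwise_dist(emb):
    return torch.cdist(emb, emb, p=2)

def relational_transfer(student_emb, teacher_emb):
    # Match the student's distance matrix to the (detached) peer's.
    return F.smooth_l1_loss(pairwise_dist(student_emb), pairwise_dist(teacher_emb).detach())

def mutual_step(model_a, model_b, x_a, x_b, anchors, positives, negatives, w_mutual=1.0):
    # x_a / x_b are differently augmented views of the same batch (view diversity).
    emb_a, emb_b = model_a(x_a), model_b(x_b)
    triplet = nn.TripletMarginLoss(margin=0.2)
    loss_a = triplet(emb_a[anchors], emb_a[positives], emb_a[negatives]) \
             + w_mutual * relational_transfer(emb_a, emb_b)
    loss_b = triplet(emb_b[anchors], emb_b[positives], emb_b[negatives]) \
             + w_mutual * relational_transfer(emb_b, emb_a)
    return loss_a, loss_b
```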

Wonpyo Park, Wonjae Kim, Kihyun You, Minsu Cho
Domain Generalization vs Data Augmentation: An Unbiased Perspective

In domain generalization, the target domain is not known at training time. We show that a style-transfer-based data augmentation strategy can be implemented easily and outperforms the current state-of-the-art domain generalization methods. Moreover, we observe that those methods, even when combined with the described data augmentation, do not take advantage of it, indicating the need for new generalization solutions.
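A minimal sketch of such a style-transfer augmentation step is shown below; the `stylize` function stands in for any off-the-shelf arbitrary style transfer model (e.g. an AdaIN-style network), and its interface is an assumption.

```python
# Sketch: each source image is randomly re-stylized with the style of another image
# from the batch, so the classifier sees many synthetic "domains" during training.
import random
import torch

def style_augment_batch(images, stylize, p=0.5):
    """images: (B, 3, H, W); stylize(content, style) -> stylized image (placeholder)."""
    out = []
    for img in images:
        if random.random() < p:
            style = images[random.randrange(len(images))]   # borrow a style from the batch
            out.append(stylize(img.unsqueeze(0), style.unsqueeze(0)).squeeze(0))
        else:
            out.append(img)
    return torch.stack(out)
```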

Francesco Cappio Borlino, Antonio D’Innocente, Tatiana Tommasi

W11 - Bodily Expressed Emotion Understanding

Frontmatter
Panel: Bodily Expressed Emotion Understanding Research: A Multidisciplinary Perspective

Developing computational methods for bodily expressed emotion understanding can benefit from the knowledge and approaches of multiple fields, including computer vision, robotics, psychology/psychiatry, graphics, data mining, machine learning, and movement analysis. The panel, consisting of active researchers in some closely related fields, attempts to open a discussion on the future of this new and exciting research area. This paper documents the opinions expressed by the individual panelists.

James Z. Wang, Norman Badler, Nadia Berthouze, Rick O. Gilmore, Kerri L. Johnson, Agata Lapedriza, Xin Lu, Nikolaus Troje
Emotion Understanding in Videos Through Body, Context, and Visual-Semantic Embedding Loss

We present our winning submission to the First International Workshop on Bodily Expressed Emotion Understanding (BEEU) challenge. Motivated by recent literature on the effect of context/environment on emotion, as well as on visual representations that carry semantic meaning through word embeddings, we extend the Temporal Segment Networks framework to accommodate both. Our method is verified on the validation set of the Body Language Dataset (BoLD) and achieves an Emotion Recognition Score of 0.26235 on the test set, surpassing the previous best result of 0.2530.
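One plausible form of a visual-semantic embedding loss of the kind mentioned above is sketched below, assuming fixed word vectors for the emotion category names; the dimensions and loss form are assumptions, not the authors' exact setup.

```python
# Sketch (PyTorch): video features are projected into a word-embedding space and
# pulled toward the word vectors of the annotated emotion categories.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticEmbeddingLoss(nn.Module):
    def __init__(self, feat_dim=2048, word_dim=300, emotion_word_vecs=None):
        super().__init__()
        self.project = nn.Linear(feat_dim, word_dim)
        # Rows = word embeddings of the emotion category names (fixed, not trained).
        self.register_buffer("word_vecs", emotion_word_vecs
                             if emotion_word_vecs is not None else torch.randn(26, word_dim))

    def forward(self, video_feats, labels):
        v = F.normalize(self.project(video_feats), dim=1)
        w = F.normalize(self.word_vecs, dim=1)
        # Cosine distance between each projected clip and its ground-truth emotion word.
        return (1 - (v * w[labels]).sum(dim=1)).mean()
```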

Panagiotis Paraskevas Filntisis, Niki Efthymiou, Gerasimos Potamianos, Petros Maragos
Noisy Student Training Using Body Language Dataset Improves Facial Expression Recognition

Facial expression recognition from videos in the wild is a challenging task due to the lack of abundant labelled training data. Large deep neural network (DNN) architectures and ensemble methods have resulted in better performance, but soon reach saturation due to data inadequacy. In this paper, we use a self-training method that utilizes a combination of a labelled dataset and an unlabelled dataset (the Body Language Dataset, BoLD). Experimental analysis shows that iteratively training a noisy student network helps achieve significantly better results. Additionally, our model isolates different regions of the face and processes them independently using a multi-level attention mechanism, which further boosts performance. Our results show that the proposed method achieves state-of-the-art performance on the benchmark datasets CK+ and AFEW 8.0 when compared to single models.
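A schematic of one noisy-student round, under assumed data-loader interfaces, thresholds, and function names, could look like this.

```python
# Sketch: a teacher trained on labelled data pseudo-labels the unlabelled BoLD clips,
# and a student is trained on both sets with noise (augmentation, dropout); the
# student then becomes the teacher for the next round.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, conf_thresh=0.7):
    teacher.eval()
    pseudo = []
    for clips in unlabeled_loader:
        probs = F.softmax(teacher(clips), dim=1)
        conf, labels = probs.max(dim=1)
        keep = conf > conf_thresh                  # keep only confident predictions
        pseudo.append((clips[keep], labels[keep]))
    return pseudo

def noisy_student_round(teacher, student, labeled_loader, unlabeled_loader,
                        optimizer, strong_augment):
    pseudo = pseudo_label(teacher, unlabeled_loader)
    student.train()                                # dropout active = model noise
    for (x_l, y_l), (x_u, y_u) in zip(labeled_loader, pseudo):
        x = strong_augment(torch.cat([x_l, x_u]))  # input noise
        y = torch.cat([y_l, y_u])
        optimizer.zero_grad()
        F.cross_entropy(student(x), y).backward()
        optimizer.step()
    return copy.deepcopy(student)                  # becomes teacher for the next round
```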

Vikas Kumar, Shivansh Rao, Li Yu
Emotion Embedded Pose Generation

Body poses are a rich source of information for sentiment analysis and complement existing facial emotion recognition tasks by adding significant value, especially when faces are not readily available. CCTV recordings and other non-human interfaces are applications where capturing facial expressions is challenging, and these interfaces can benefit from body-pose emotion recognition for context analysis. Another roadblock in this direction is the limited availability of collected and curated datasets of body poses with emotion labels. Addressing these issues, we propose two end-to-end pipelines to generate human poses conditioned on specific emotion labels. An auxiliary conditional GAN is presented for both the pose-image and pose-skeleton pipelines. The generated images improved emotion classification accuracy by an average of 5.40% across four different networks compared to traditionally augmented images. Additionally, through image and skeletal augmentation, we achieve state-of-the-art emotion classification results on the BEAST dataset.
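For reference, an auxiliary-classifier conditional GAN objective of the general kind described above can be sketched as follows; the discriminator and generator interfaces, shapes, and number of emotion classes are assumptions, not the authors' implementation.

```python
# Sketch (PyTorch): the discriminator outputs a real/fake score and an emotion-class
# prediction; the generator is conditioned on the target emotion label.
import torch
import torch.nn.functional as F

def d_step(D, G, real_imgs, real_emotions, n_classes, z_dim=128):
    z = torch.randn(len(real_imgs), z_dim)
    fake_emotions = torch.randint(0, n_classes, (len(real_imgs),))
    fake_imgs = G(z, fake_emotions).detach()
    adv_real, cls_real = D(real_imgs)              # (B,1) real/fake logit, (B,C) class logits
    adv_fake, _ = D(fake_imgs)
    adv = F.binary_cross_entropy_with_logits(adv_real, torch.ones_like(adv_real)) \
        + F.binary_cross_entropy_with_logits(adv_fake, torch.zeros_like(adv_fake))
    aux = F.cross_entropy(cls_real, real_emotions)  # auxiliary emotion classifier
    return adv + aux

def g_step(D, G, batch_size, n_classes, z_dim=128):
    z = torch.randn(batch_size, z_dim)
    emotions = torch.randint(0, n_classes, (batch_size,))
    adv, cls = D(G(z, emotions))
    # Fool the discriminator AND make the sample classifiable as the target emotion.
    return F.binary_cross_entropy_with_logits(adv, torch.ones_like(adv)) \
         + F.cross_entropy(cls, emotions)
```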

Amogh Subbakrishna Adishesha, Tianxiang Zhao
Understanding Political Communication Styles in Televised Debates via Body Movements

Televised political debates have received much attention from scholars in political communication and social psychology who study nonverbal cues in interpersonal communication and their impact on candidate evaluations. The abundance of political multimedia and new platforms has required leaders to develop an effective and unique communication "style", which may rely on nonverbal devices such as the face and body. Emotions conveyed by expressive gestures of candidates during debates have been shown to elicit stronger reactions from the public than rhetorical statements alone. Candidates, for example, may exploit assertive and aggressive gestures to communicate their confidence and attract supporters. Existing studies, however, are based largely on manual coding of human gestures, which may not be scalable or reproducible. The main objectives of our paper are to investigate the role of candidates' body movements using a systematic and automated approach, as well as to understand the context and effects of gestures. For this analysis, we collected a dataset of political debate videos from the 2020 Democratic presidential primaries and analyzed the facial expressions and gestures of candidates. Our preliminary analysis demonstrates that candidates employ gestures to varying extents, and that the amount of body movement is correlated with the emotions conveyed in the candidates' facial expressions. We discuss our dataset, preliminary results, and future directions in the following sections.

Zhiqi Kang, Christina Indudhara, Kaushik Mahorker, Erik P. Bucy, Jungseock Joo
Backmatter
Metadata
Title
Computer Vision – ECCV 2020 Workshops
Edited by
Prof. Adrien Bartoli
Andrea Fusiello
Copyright Year
2020
Electronic ISBN
978-3-030-66415-2
Print ISBN
978-3-030-66414-5
DOI
https://doi.org/10.1007/978-3-030-66415-2
