2019 | Book

Computer Vision – ECCV 2018 Workshops

Munich, Germany, September 8-14, 2018, Proceedings, Part VI

About this book

The six-volume set comprising the LNCS volumes 11129-11134 constitutes the refereed proceedings of the workshops that took place in conjunction with the 15th European Conference on Computer Vision, ECCV 2018, held in Munich, Germany, in September 2018. 43 workshops were selected from 74 workshop proposals for inclusion in the proceedings. The workshop topics present a good mix of new trends and traditional issues, build bridges into neighboring fields, and discuss fundamental technologies and novel applications.

Table of Contents

Frontmatter

W31 – 6th International Workshop on Assistive Computer Vision and Robotics

Frontmatter
Deep Learning for Assistive Computer Vision

This paper reviews the main advances in assistive computer vision recently fostered by deep learning. To this aim, we first discuss how the application of deep learning in computer vision has contributed to the development of assistive technologies, then analyze the recent advances in assistive technologies achieved in five main areas, namely object classification and localization, scene understanding, human pose estimation and tracking, action/event recognition, and anticipation. The paper concludes with a discussion and insights for future directions.

Marco Leo, Antonino Furnari, Gerard G. Medioni, Mohan Trivedi, Giovanni M. Farinella
Recovering 6D Object Pose: A Review and Multi-modal Analysis

A large number of studies analyse object detection and pose estimation at the visual level in 2D, discussing the effects of challenges such as occlusion, clutter, and texture on the performance of methods that work in the RGB modality. Going further by also interpreting depth data, the study in this paper presents thorough multi-modal analyses. It discusses the above-mentioned challenges for full 6D object pose estimation in RGB-D images, comparing the performance of several 6D detectors in order to answer the following questions: What is the current position of the computer vision community for maintaining “automation” in robotic manipulation? What next steps should the community take for improving “autonomy” in robotics while handling objects? Our findings include: (i) reasonably accurate results are obtained on textured objects at varying viewpoints with cluttered backgrounds; (ii) heavy occlusion and clutter severely affect the detectors, and similar-looking distractors are the biggest challenge in recovering instances’ 6D pose; (iii) template-based methods and random forest-based learning algorithms underlie object detection and 6D pose estimation, while the recent paradigm is to learn deep discriminative feature representations and to adopt CNNs taking RGB images as input; (iv) depending on the availability of large-scale 6D annotated depth datasets, feature representations can be learnt on these datasets and then customized for the 6D problem.

Caner Sahin, Tae-Kyun Kim
Computer Vision for Medical Infant Motion Analysis: State of the Art and RGB-D Data Set

Assessment of spontaneous movements of infants lets trained experts predict neurodevelopmental disorders like cerebral palsy at a very young age, allowing early intervention for affected infants. An automated motion analysis system needs to accurately capture body movements, ideally without markers or attached sensors so as not to affect the movements of infants. The vast majority of recent approaches for human pose estimation focus on adults, leading to a degradation of accuracy when applied to infants. Hence, multiple systems for infant pose estimation have been developed. Due to the lack of publicly available benchmark data sets, a standardized evaluation, let alone a comparison of different approaches, is impossible. We fill this gap by releasing the Moving INfants In RGB-D (MINI-RGBD) data set (available for research purposes at http://s.fhg.de/mini-rgbd ), created using the recently introduced Skinned Multi-Infant Linear body model (SMIL). We map real infant movements to the SMIL model with realistic shapes and textures, and generate RGB and depth images with precise ground truth 2D and 3D joint positions. We evaluate our data set with state-of-the-art methods for 2D pose estimation in RGB images and for 3D pose estimation in depth images. Evaluation of 2D pose estimation results in PCKh rates of 88.1% and 94.5% (depending on the correctness threshold), and 3D pose estimation achieves PCKh rates of 64.2% and 90.4%, respectively. We hope to foster research in medical infant motion analysis to get closer to an automated system for early detection of neurodevelopmental disorders.
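For readers unfamiliar with the PCKh rates quoted above, here is a minimal sketch of the metric (the alpha threshold and array shapes are assumptions; the paper's exact evaluation protocol may differ):

```python
import numpy as np

def pckh(pred, gt, head_lengths, alpha=0.5):
    """PCKh: fraction of predicted joints whose distance to the ground truth
    is below alpha times the head segment length of the corresponding frame."""
    # pred, gt: (num_frames, num_joints, dims); head_lengths: (num_frames,)
    dists = np.linalg.norm(pred - gt, axis=-1)      # per-joint errors
    thresh = alpha * head_lengths[:, None]          # per-frame threshold
    return float(np.mean(dists < thresh))
```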

Nikolas Hesse, Christoph Bodensteiner, Michael Arens, Ulrich G. Hofmann, Raphael Weinberger, A. Sebastian Schroeder
Vision Augmented Robot Feeding

Researchers have over time developed robotic feeding assistants to help at meals so that people with disabilities can live more autonomous lives. Current commercial feeding assistant robots acquire food without feedback on acquisition success and move to a preprogrammed location to deliver the food. In this work, we evaluate how vision can be used to improve both food acquisition and delivery. We show that using visual feedback on whether food was captured increases food acquisition efficiency. We also show how Discriminative Optimization (DO) can be used in tracking so that the food can be effectively brought all the way to the user’s mouth, rather than to a preprogrammed feeding location.

Alexandre Candeias, Travers Rhodes, Manuel Marques, João P. Costeira, Manuela Veloso
Human-Computer Interaction Approaches for the Assessment and the Practice of the Cognitive Capabilities of Elderly People

The cognitive assessment of elderly people is usually performed by means of paper-and-pencil tests, which may not provide an exhaustive evaluation of the cognitive abilities of the subject. Here, we analyze two solutions based on interaction in virtual environments. In particular, we consider a non-immersive exergame based on a standard tablet, and an immersive VR environment based on a head-mounted display. We show the potential of such tools by comparing a set of computed metrics with the results of standard clinical tests, and we discuss their potential use for performing more complex evaluations. In particular, the use of immersive environments, which could be implemented either with head-mounted displays or with configurations of stereoscopic displays, allows us to track the patient’s pose and to analyze his/her movements and posture when performing Activities of Daily Living, with the aim of having a further way to assess cognitive capabilities.

Manuela Chessa, Chiara Bassano, Elisa Gusai, Alice E. Martis, Fabio Solari
Analysis of the Effect of Sensors for End-to-End Machine Learning Odometry

Accurate position and orientation estimation is essential for navigation in autonomous robots. Although this is a well-studied problem, existing solutions rely on statistical filters, which usually require good parameter initialization or calibration and are computationally expensive. This paper addresses the problem using an end-to-end machine learning approach. This work explores the incorporation of multiple sources of data (monocular RGB images and inertial data) to overcome the weaknesses of each source independently. Three different odometry approaches are proposed using CNNs and LSTMs, evaluated on the KITTI dataset and compared with other existing approaches. The obtained results show that the performance of the proposed approaches is similar to that of the state-of-the-art ones, outperforming some of them at a lower computational cost, which allows execution on resource-constrained devices.
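As a rough, hedged illustration of such a visual-inertial architecture (a toy sketch, not the paper's networks; layer sizes and the 6-DoF output parameterization are assumptions):

```python
import torch
import torch.nn as nn

class VisualInertialOdometry(nn.Module):
    """Toy sketch: a small CNN encodes each RGB frame, IMU readings are
    appended to the visual feature, and an LSTM regresses the 6-DoF pose
    change between consecutive frames."""
    def __init__(self, imu_dim=6, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32 + imu_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 6)   # translation + rotation parameters

    def forward(self, frames, imu):
        # frames: (B, T, 3, H, W); imu: (B, T, imu_dim)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(torch.cat([feats, imu], dim=-1))
        return self.head(out)              # (B, T, 6) relative poses
```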

Carlos Marquez Rodriguez-Peral, Dexmont Peña
RAMCIP Robot: A Personal Robotic Assistant; Demonstration of a Complete Framework

Over the last decades, personal domestic robots have come to be considered the future for tackling the societal challenge inherent in the growing elderly population. Ageing is typically associated with physical and cognitive decline, altering the way an older person moves around the house, manipulates objects and senses the home environment. This paper aims to demonstrate the RAMCIP robot, a Robotic Assistant for patients with Mild Cognitive Impairments (MCI), suitable for providing its services in domestic environments. The use cases that the robot addresses are described herein, outlining the necessary requirements that set the basis for the software and hardware architectural components. A short description of the integrated cognitive, perception, manipulation and navigation capabilities of the robot is provided. The robot’s autonomy is enabled through a specific decision making and task planning framework. The robot has been evaluated in ten real home environments of MCI users, exhibiting remarkable performance.

Ioannis Kostavelis, Dimitrios Giakoumis, Georgia Peleka, Andreas Kargakos, Evangelos Skartados, Manolis Vasileiadis, Dimitrios Tzovaras
An Empirical Study Towards Understanding How Deep Convolutional Nets Recognize Falls

Detecting unintended falls is essential for ambient intelligence and healthcare of elderly people living alone. In recent years, deep convolutional nets have been widely used in human action analysis, and a number of fall detection methods have been proposed based on them. Despite their highly effective performance, how the convolutional nets recognize falls is still not clear. In this paper, instead of proposing a novel approach, we perform a systematic empirical study, attempting to investigate the underlying fall recognition process. We propose four tasks to investigate, which involve five types of input modalities, seven net instances and different training samples. The obtained quantitative and qualitative results reveal the patterns that the nets tend to learn, and several factors that can heavily influence the performance on fall recognition. We expect that our conclusions will be helpful for designing better deep learning solutions for fall detection systems.

Yan Zhang, Heiko Neumann
ASSIST: Personalized Indoor Navigation via Multimodal Sensors and High-Level Semantic Information

Blind & visually impaired (BVI) individuals and those with Autism Spectrum Disorder (ASD) each face unique challenges in navigating unfamiliar indoor environments. In this paper, we propose an indoor positioning and navigation system that guides a user from point A to point B indoors with high accuracy while augmenting their situational awareness. This system has three major components: location recognition (a hybrid indoor localization app that uses Bluetooth Low Energy beacons and Google Tango to provide high accuracy), object recognition (a body-mounted camera to provide the user momentary situational awareness of objects and people), and semantic recognition (map-based annotations to alert the user of static environmental characteristics). This system also features personalized interfaces built upon the unique experiences that both BVI and ASD individuals have in indoor wayfinding and tailors its multimodal feedback to their needs. Here, the technical approach and implementation of this system are discussed, and the results of human subject tests with both BVI and ASD individuals are presented. In addition, we discuss and show the system’s user-centric interface and present points for future work and expansion.

Vishnu Nair, Manjekar Budhai, Greg Olmschenk, William H. Seiple, Zhigang Zhu
Comparing Methods for Assessment of Facial Dynamics in Patients with Major Neurocognitive Disorders

Assessing facial dynamics in patients with major neurocognitive disorders and specifically with Alzheimer’s disease (AD) has proven to be highly challenging. Classically such assessment is performed by clinical staff, evaluating verbal and non-verbal language of AD-patients, since they have lost a substantial amount of their cognitive capacity, and hence communication ability. In addition, patients need to communicate important messages, such as discomfort or pain. Automated methods would support the current healthcare system by allowing for telemedicine, i.e., examinations that are less costly and logistically less burdensome. In this work we compare methods for assessing facial dynamics such as talking, singing, neutral and smiling in AD-patients, captured during music mnemotherapy sessions. Specifically, we compare 3D ConvNets, Very Deep Neural Network based Two-Stream ConvNets, as well as Improved Dense Trajectories. We have adapted these methods from prominent action recognition methods and our promising results suggest that they generalize well to the context of facial dynamics. The Two-Stream ConvNets in combination with ResNet-152 obtain the best performance on our dataset, capturing even minor facial dynamics well, and have thus sparked high interest in the medical community.

Yaohui Wang, Antitza Dantcheva, Jean-Claude Broutart, Philippe Robert, Francois Bremond, Piotr Bilinski
Deep Execution Monitor for Robot Assistive Tasks

We consider a novel approach to high-level robot task execution for a robot assistive task. In this work we explore the problem of learning to predict the next subtask by introducing a deep model for both sequencing goals and visually evaluating the state of a task. We show that deep learning for monitoring robot task execution supports very well the interconnection between task-level planning and robot operations. These solutions can also cope with the natural non-determinism of the execution monitor. We show that a deep execution monitor improves robot performance. We measure the improvement on robot helping tasks performed at a warehouse.

Lorenzo Mauro, Edoardo Alati, Marta Sanzari, Valsamis Ntouskos, Gianluca Massimiani, Fiora Pirri
Chasing Feet in the Wild: A Proposed Egocentric Motion-Aware Gait Assessment Tool

Despite advances in gait analysis tools, including optical motion capture and wireless electrophysiology, our understanding of human mobility is largely limited to controlled conditions in a clinic and/or laboratory. In order to examine human mobility under natural conditions, or the ‘wild’, this paper presents a novel markerless model to obtain gait patterns by localizing feet in the egocentric video data. Based on a belt-mounted camera feed, the proposed hybrid FootChaser model consists of: (1) the FootRegionProposer, a ConvNet that proposes regions with high probability of containing feet in RGB frames (global appearance of feet), and (2) LocomoNet, which is sensitive to the periodic gait patterns, and further examines the temporal content in the stacks of optical flow corresponding to the proposed region. The LocomoNet significantly boosted the overall model’s result by filtering out the false positives proposed by the FootRegionProposer. This work advances our long-term objective to develop novel markerless models to extract spatiotemporal gait parameters, particularly step width, to complement existing inertial measurement unit (IMU) based methods.

Mina Nouredanesh, Aaron W. Li, Alan Godfrey, Jesse Hoey, James Tung
Inferring Human Knowledgeability from Eye Gaze in Mobile Learning Environments

What people look at during a visual task reflects an interplay between ocular motor functions and cognitive processes. In this paper, we study the links between eye gaze and cognitive states to investigate whether eye gaze reveals information about an individual’s knowledgeability. We focus on a mobile learning scenario where a user and a virtual agent play a quiz game using a hand-held mobile device. To the best of our knowledge, this is the first attempt to predict a user’s knowledgeability from eye gaze using a noninvasive eye tracking method on mobile devices: we perform gaze estimation using the front-facing camera of mobile devices, in contrast to using specialised eye tracking devices. First, we define a set of eye movement features that are discriminative for inferring the user’s knowledgeability. Next, we train a model to predict users’ knowledgeability in the course of responding to a question. Using eye movement features only, we obtain a classification performance of 59.1%, which is comparable to human performance. This has implications for (1) adapting the behaviour of the virtual agent to the user’s needs (e.g., the virtual agent can give hints); and (2) personalising quiz questions to the user’s perceived knowledgeability.

Oya Celiktutan, Yiannis Demiris

W32 – 4th International Workshop on Observing and Understanding Hands in Action

Frontmatter
Hand-Tremor Frequency Estimation in Videos

We focus on the problem of estimating human hand-tremor frequency from input RGB video data. Estimating tremors from video is important for non-invasive monitoring, analyzing and diagnosing patients suffering from motor-disorders such as Parkinson’s disease. We consider two approaches for hand-tremor frequency estimation: (a) a Lagrangian approach where we detect the hand at every frame in the video, and estimate the tremor frequency along the trajectory; and (b) an Eulerian approach where we first localize the hand, we subsequently remove the large motion along the movement trajectory of the hand, and we use the video information over time encoded as intensity values or phase information to estimate the tremor frequency. We estimate hand tremors on a new human tremor dataset, TIM-Tremor, containing static tasks as well as a multitude of more dynamic tasks, involving larger motion of the hands. The dataset has 55 tremor patient recordings together with: associated ground truth accelerometer data from the most affected hand, RGB video data, and aligned depth data.
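A minimal sketch of the Lagrangian variant under stated assumptions (a tracked 2D hand trajectory sampled at the video frame rate; Welch's method for the spectrum is our illustrative choice, not necessarily the paper's):

```python
import numpy as np
from scipy.signal import welch, detrend

def tremor_frequency(hand_positions, fps):
    """Estimate the dominant tremor frequency (Hz) from a tracked hand
    trajectory, e.g. the x/y centre of the detected hand per frame.

    hand_positions: array of shape (num_frames, 2)."""
    trajectory = detrend(hand_positions, axis=0)      # remove slow drift of the hand
    freqs, power = welch(trajectory, fs=fps, axis=0,
                         nperseg=min(256, len(trajectory)))
    total = power.sum(axis=1)                         # combine x and y power
    total[freqs < 0.5] = 0.0                          # ignore near-DC residual drift
    return float(freqs[np.argmax(total)])
```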

Silvia L. Pintea, Jian Zheng, Xilin Li, Paulina J. M. Bank, Jacobus J. van Hilten, Jan C. van Gemert
DrawInAir: A Lightweight Gestural Interface Based on Fingertip Regression

Hand gestures form a natural way of interaction on Head-Mounted Devices (HMDs) and smartphones. HMDs such as the Microsoft HoloLens and ARCore/ARKit-enabled smartphones are expensive and are equipped with powerful processors and sensors such as multiple cameras, depth and IR sensors to process hand gestures. To enable mass market reach via inexpensive Augmented Reality (AR) headsets without built-in depth or IR sensors, we propose a real-time, in-air gestural framework, termed DrawInAir, that works on monocular RGB input. DrawInAir uses the fingertip for writing in air, analogous to a pen on paper. The major challenge in training egocentric gesture recognition models is in obtaining sufficient labeled data for end-to-end learning. Thus, we design a cascade of networks, consisting of a CNN with a differentiable spatial to numerical transform (DSNT) layer for fingertip regression, followed by a Bidirectional Long Short-Term Memory (Bi-LSTM) for real-time pointing hand gesture classification. We highlight how a model that is separately trained to regress the fingertip, in conjunction with a classifier trained on limited classification data, performs better than end-to-end models. We also propose a dataset of 10 egocentric pointing gestures designed for AR applications for testing our model. We show that the framework takes 1.73 s to run end-to-end and has a low memory footprint of 14 MB, while achieving an accuracy of 88.0% on our egocentric video dataset.
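A hedged sketch of the DSNT idea used for fingertip regression (softmax normalisation of the heatmap is one common choice; coordinates are taken as expectations over a normalised grid):

```python
import torch

def dsnt(heatmap):
    """Differentiable spatial-to-numerical transform: turn an unnormalised
    fingertip heatmap of shape (B, H, W) into expected (x, y) coordinates
    in [-1, 1]."""
    b, h, w = heatmap.shape
    probs = torch.softmax(heatmap.view(b, -1), dim=-1).view(b, h, w)
    ys = torch.linspace(-1.0, 1.0, h, device=heatmap.device)
    xs = torch.linspace(-1.0, 1.0, w, device=heatmap.device)
    x = (probs.sum(dim=1) * xs).sum(dim=-1)   # marginal over rows, then E[x]
    y = (probs.sum(dim=2) * ys).sum(dim=-1)   # marginal over cols, then E[y]
    return torch.stack([x, y], dim=-1)        # (B, 2)
```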

Gaurav Garg, Srinidhi Hegde, Ramakrishna Perla, Varun Jain, Lovekesh Vig, Ramya Hebbalaguppe
Adapting Egocentric Visual Hand Pose Estimation Towards a Robot-Controlled Exoskeleton

The basic idea behind a wearable robotic grasp assistance system is to support people who suffer from severe motor impairments in daily activities. Such a system needs to act mostly autonomously and according to the user’s intent. Vision-based hand pose estimation could be an integral part of a larger control and assistance framework. In this paper we evaluate the performance of egocentric monocular hand pose estimation for a robot-controlled hand exoskeleton in a simulation. For hand pose estimation we adopt a Convolutional Neural Network (CNN). We train and evaluate this network with computer graphics created by our own data generator. In order to guide further design decisions, we focus in our experiments on two egocentric camera viewpoints tested on synthetic data with the help of a 3D-scanned hand model, with and without an exoskeleton attached to it. We observe that hand pose estimation with a wrist-mounted camera performs more accurately than with a head-mounted camera in the context of our simulation. Further, a grasp assistance system attached to the hand alters visual appearance and can improve hand pose estimation. Our experiment provides useful insights for the integration of sensors into a context-sensitive analysis framework for intelligent assistance.

Gerald Baulig, Thomas Gulde, Cristóbal Curio
Estimating 2D Multi-hand Poses from Single Depth Images

We present a novel framework based on Pictorial Structure (PS) models to estimate 2D multi-hand poses from depth images. Most existing single-hand pose estimation algorithms are either subject to strong assumptions or depend on a weak detector to detect the human hand. We utilize Mask R-CNN to avoid both aforementioned constraints. The proposed framework allows detection of multi-hand instances and localization of hand joints simultaneously. Our experiments show that our method is superior to existing methods.

Le Duan, Minmin Shen, Song Cui, Zhexiao Guo, Oliver Deussen
Spatial-Temporal Attention Res-TCN for Skeleton-Based Dynamic Hand Gesture Recognition

Dynamic hand gesture recognition is a crucial yet challenging task in computer vision. The key to this task lies in the effective extraction of discriminative spatial and temporal features to model the evolution of different gestures. In this paper, we propose an end-to-end Spatial-Temporal Attention Residual Temporal Convolutional Network (STA-Res-TCN) for skeleton-based dynamic hand gesture recognition, which learns different levels of attention and assigns them to each spatial-temporal feature extracted by the convolution filters at each time step. The proposed attention branch assists the networks to adaptively focus on the informative time frames and features while excluding the irrelevant ones that often bring in unnecessary noise. Moreover, our proposed STA-Res-TCN is a lightweight model that can be trained and tested in an extremely short time. Experiments on the DHG-14/28 dataset and the SHREC’17 Track dataset show that STA-Res-TCN outperforms state-of-the-art methods on both the 14-gesture setting and the more complicated 28-gesture setting.
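A hedged sketch of what such an attention branch can look like (the 1x1 convolution, sigmoid gating and the residual-style re-weighting are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class TemporalAttentionBranch(nn.Module):
    """Sketch of a spatial-temporal attention branch: a 1-D convolution over
    the feature sequence produces per-timestep, per-channel weights that
    re-scale the features of the main temporal-convolution branch."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, features):
        # features: (B, C, T) output of a temporal convolution block
        weights = self.attn(features)         # (B, C, T) values in [0, 1]
        return features * (1.0 + weights)     # residual-style re-weighting
```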

Jingxuan Hou, Guijin Wang, Xinghao Chen, Jing-Hao Xue, Rui Zhu, Huazhong Yang
Task-Oriented Hand Motion Retargeting for Dexterous Manipulation Imitation

Human hand actions are quite complex, especially when they involve object manipulation, mainly due to the high dimensionality of the hand and the vast action space that entails. Imitating those actions with dexterous hand models involves different important and challenging steps: acquiring human hand information, retargeting it to a hand model, and learning a policy from acquired data. In this work, we capture the hand information by using a state-of-the-art hand pose estimator. We tackle the retargeting problem from the hand pose to a 29 DoF hand model by combining inverse kinematics and PSO with a task objective optimisation. This objective encourages the virtual hand to accomplish the manipulation task, relieving the effect of the estimator’s noise and the domain gap. Our approach leads to a better success rate in the grasping task compared to our inverse kinematics baseline, allowing us to record successful human demonstrations. Furthermore, we used these demonstrations to learn a policy network using generative adversarial imitation learning (GAIL) that is able to autonomously grasp an object in the virtual space.

Dafni Antotsiou, Guillermo Garcia-Hernando, Tae-Kyun Kim
HANDS18: Methods, Techniques and Applications for Hand Observation

This report outlines the proceedings of the Fourth International Workshop on Observing and Understanding Hands in Action (HANDS 2018). The fourth instantiation of this workshop attracted significant interest from both academia and the industry. The program of the workshop included regular papers that are published as the workshop’s proceedings, extended abstracts, invited posters, and invited talks. Topics of the submitted works and invited talks and posters included novel methods for hand pose estimation from RGB, depth, or skeletal data, datasets for special cases and real-world applications, and techniques for hand motion re-targeting and hand gesture recognition. The invited speakers are leaders in their respective areas of specialization, coming from both industry and academia. The main conclusions that can be drawn are the turn of the community towards RGB data and the maturation of some methods and techniques, which in turn has led to increasing interest for real-world applications.

Iason Oikonomidis, Guillermo Garcia-Hernando, Angela Yao, Antonis Argyros, Vincent Lepetit, Tae-Kyun Kim

W33 – Bioimage Computing

Frontmatter
Automatic Classification of Low-Resolution Chromosomal Images

Chromosome karyotyping is a two-staged process consisting of segmentation followed by pairing and ordering of the 23 pairs of human chromosomes obtained from cell spread images during the metaphase stage of cell division. It is carried out by cytogeneticists in clinical labs on the basis of length, centromere position, and banding pattern of chromosomes for the diagnosis of various health and genetic disorders. The entire process demands high domain expertise and a considerable amount of manual effort. This motivates us to automate or partially automate the karyotyping process, which would aid doctors in the analysis of chromosome images. However, the non-availability of the high-resolution chromosome images required for classification creates a hindrance to achieving high classification accuracy. To address this issue, we propose a Super-Xception network which takes low-resolution chromosome images as input and classifies them into one of the 24 chromosome class labels after conversion into high-resolution images. In this network, we integrate super-resolution deep models with standard classification networks, e.g., the Xception network in our case. The network is trained in an end-to-end manner in which the super-resolution layers help convert low-resolution images to high-resolution images, which are subsequently passed through deep classification layers for label assignment. We evaluate our proposed network’s efficacy on a publicly available online Bioimage chromosome classification dataset of healthy chromosomes and benchmark it against baseline models created using a traditional deep convolutional neural network, ResNet-50, and the Xception network.

Swati Swati, Monika Sharma, Lovekesh Vig
Feature2Mass: Visual Feature Processing in Latent Space for Realistic Labeled Mass Generation

This paper deals with a method for generating realistic labeled masses. Recently, there have been many attempts to apply deep learning to various bio-image computing fields, including computer-aided detection and diagnosis. In order for a deep network model to behave well in bio-image computing fields, a lot of labeled data is required. However, in many bioimaging fields, large labeled datasets are scarcely available. Although a few studies have been dedicated to solving this problem through generative models, some problems remain: (1) the generated bio-image does not seem realistic; (2) the variation of generated bio-images is limited; and (3) an additional label annotation task is needed. In this study, we propose a realistic labeled bio-image generation method based on visual feature processing in latent space. Experimental results show that mass images generated by the proposed method are realistic and have a wide expression range of targeted mass characteristics.

Jae-Hyeok Lee, Seong Tae Kim, Hakmin Lee, Yong Man Ro
Ordinal Regression with Neuron Stick-Breaking for Medical Diagnosis

The classification for medical diagnosis usually involves inherently ordered labels corresponding to the level of health risk. Previous multi-task classifiers on ordinal data often use several binary classification branches to compute a series of cumulative probabilities. However, these cumulative probabilities are not guaranteed to be monotonically decreasing. This approach also introduces a large number of hyper-parameters to be fine-tuned manually. This paper aims to eliminate, or at least largely reduce, the effects of those problems. We propose a simple yet efficient way to rephrase the output layer of the conventional deep neural network. We show that our method leads to state-of-the-art accuracy on the Diabetic Retinopathy dataset and the Ultrasound Breast dataset with very little additional cost.
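One standard way to rephrase the output layer so that the implied cumulative probabilities are monotone by construction is a stick-breaking parameterization; a minimal sketch (the paper's exact formulation may differ):

```python
import torch

def stick_breaking_probs(logits):
    """Map K-1 unconstrained logits to probabilities over K ordered classes.
    The stick-breaking construction guarantees that the implied cumulative
    probabilities P(y > k) decrease monotonically in k."""
    v = torch.sigmoid(logits)                        # (B, K-1) fractions broken off
    remaining = torch.cumprod(1.0 - v, dim=-1)       # stick left after each break
    first = v[:, :1]
    middle = v[:, 1:] * remaining[:, :-1]
    last = remaining[:, -1:]
    return torch.cat([first, middle, last], dim=-1)  # (B, K), rows sum to 1
```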

Xiaofeng Liu, Yang Zou, Yuhang Song, Chao Yang, Jane You, B. V. K. Vijaya Kumar
Multi-level Activation for Segmentation of Hierarchically-Nested Classes

For many biological image segmentation tasks, incorporating topological knowledge, such as the nesting of classes, can greatly improve results. However, most ‘out-of-the-box’ CNN models are still blind to such prior information. In this paper, we propose a novel approach to encode this information, through a multi-level activation layer and three compatible losses. We benchmark all of them on nuclei segmentation in bright-field microscopy cell images from the 2018 Data Science Bowl challenge, offering an exemplary segmentation task with cells and nested subcellular structures. Our scheme greatly speeds up learning, and outperforms standard multi-class classification with soft-max activation and a previously proposed method stemming from it, improving the Dice score significantly (p-values < 0.007). Our approach is conceptually simple, easy to implement and can be integrated in any CNN architecture. It can be generalized to a higher number of classes, with or without further relations of containment.
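A hedged sketch of how a multi-level activation for nested classes can be realised (a single scalar score per pixel with ordered biases; the shifted-sigmoid construction is an assumption, not necessarily the paper's exact layer):

```python
import torch

def multi_level_activation(scores, biases):
    """Hedged sketch: one scalar score per pixel plus ordered biases encode
    nested classes (e.g. nuclei nested inside cells). sigmoid(s - b_k) is the
    probability of lying inside level k or deeper; class probabilities are
    differences of adjacent levels and sum to one.

    scores: (B, H, W) network output; biases: increasing 1-D tensor, length K-1."""
    levels = torch.sigmoid(scores.unsqueeze(-1) - biases)   # (B, H, W, K-1)
    ones = torch.ones_like(levels[..., :1])
    zeros = torch.zeros_like(ones)
    padded = torch.cat([ones, levels, zeros], dim=-1)       # (B, H, W, K+1)
    return padded[..., :-1] - padded[..., 1:]               # (B, H, W, K) class probs
```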

Marie Piraud, Anjany Sekuboyina, Björn H. Menze
Detecting Synapse Location and Connectivity by Signed Proximity Estimation and Pruning with Deep Nets

Synaptic connectivity detection is a critical task for neural reconstruction from Electron Microscopy (EM) data. Most of the existing algorithms for synapse detection do not identify the cleft location and direction of connectivity simultaneously. The few methods that compute direction along with contact location have only been demonstrated to work on either dyadic (most common in the vertebrate brain) or polyadic (found in the fruit fly brain) synapses, but not on both types. In this paper, we present an algorithm to automatically predict the location as well as the direction of both dyadic and polyadic synapses. The proposed algorithm first generates candidate synaptic connections from voxelwise predictions of signed proximity generated by a 3D U-net. A second 3D CNN then prunes the set of candidates to produce the final detection of cleft and connectivity orientation. Experimental results demonstrate that the proposed method outperforms existing methods for determining synapses in both rodent and fruit fly brains. (Code at: https://github.com/paragt/EMSynConn ).

Toufiq Parag, Daniel Berger, Lee Kamentsky, Benedikt Staffler, Donglai Wei, Moritz Helmstaedter, Jeff W. Lichtman, Hanspeter Pfister
2D and 3D Vascular Structures Enhancement via Multiscale Fractional Anisotropy Tensor

The detection of vascular structures from noisy images is a fundamental process for extracting meaningful information in many applications. Most well-known vascular enhancement techniques rely on Hessian-based filters. This paper investigates the feasibility and deficiencies of detecting curve-like structures using the Hessian matrix. The main contribution is a novel enhancement function, which overcomes the deficiencies of established methods. Our approach has been evaluated quantitatively and qualitatively using synthetic examples and a wide range of real 2D and 3D biomedical images. Compared with other existing approaches, the experimental results show that our proposed approach achieves high-quality curvilinear structure enhancement.

Haifa F. Alhasson, Shuaa S. Alharbi, Boguslaw Obara
Improved Dictionary Learning with Enriched Information for Biomedical Images

With dictionary learning using k-means or k-means++, the optimal value of k is traditionally determined empirically using a validation set. The optimal k, which should depend on the particular problem, is often chosen using values determined in prior work. We argue that there is rich information in clustering with a number of values of k. We propose a novel method to extract information from clustering with all reasonable values of k at the same time. It is shown that our method improves the performance of dictionary learning for the popular bag-of-features model in image classification on images with simple patterns such as cells, as found in biomedical images. Our experiments demonstrate that our proposed dictionary learning method outperforms popular methods on two well-known datasets, by 12.5% and 8.5% compared to k-means/k-means++ dictionary learning, and by 8.9% and 6.1% compared to sparse coding.
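A minimal sketch of the multi-k idea under stated assumptions (concatenating bag-of-features histograms from several k-means codebooks; the particular values of k and the normalisation are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans

def multi_k_bag_of_features(train_patches, image_patches, ks=(32, 64, 128)):
    """Build one k-means codebook per value of k and concatenate the resulting
    bag-of-features histograms into a single image descriptor."""
    codebooks = [KMeans(n_clusters=k, n_init=10).fit(train_patches) for k in ks]
    histograms = []
    for km in codebooks:
        words = km.predict(image_patches)                 # visual-word assignments
        hist = np.bincount(words, minlength=km.n_clusters).astype(float)
        histograms.append(hist / max(hist.sum(), 1.0))    # L1-normalised histogram
    return np.concatenate(histograms)                     # descriptor of length sum(ks)
```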

Shengda Luo, Alex Po Leung
Visual and Quantitative Comparison of Real and Simulated Biomedical Image Data

Simulations in biomedical image analysis provide a solution when real image data are difficult to annotate or available only in small quantities. Progress in simulation has grown rapidly in recent years. Nevertheless, comparative techniques for assessing the plausibility of generated data are still unsatisfactory or missing. This paper aims to point out the problem of insufficient comparison of real and synthetic data, which in many cases is done only by visual inspection or based on subjective measurements. Selected texture features are first compared in a univariate manner by quantile-quantile plots and the Kolmogorov-Smirnov test. The evaluation is then extended into a multivariate assessment using PCA for visualization and the Jaccard index as a quantitative measure of similarity. Two different image datasets were used to show the results and the importance of validating simulated data in many respects.
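A hedged sketch of the described comparison workflow (feature matrices of shape (samples, features) are assumed; binning the PCA projection for the Jaccard index is our illustrative choice):

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def compare_real_vs_simulated(real_feats, sim_feats, bins=20):
    # Univariate comparison: one Kolmogorov-Smirnov test per texture feature.
    ks_pvalues = [ks_2samp(real_feats[:, i], sim_feats[:, i]).pvalue
                  for i in range(real_feats.shape[1])]

    # Multivariate comparison: joint PCA projection, then Jaccard index of the
    # bins occupied by the two 2-D projections.
    pca = PCA(n_components=2).fit(np.vstack([real_feats, sim_feats]))
    proj_real, proj_sim = pca.transform(real_feats), pca.transform(sim_feats)
    both = np.vstack([proj_real, proj_sim])
    edges = [np.linspace(both[:, d].min(), both[:, d].max(), bins + 1) for d in range(2)]

    def occupancy(proj):
        hist, _, _ = np.histogram2d(proj[:, 0], proj[:, 1], bins=edges)
        return hist > 0

    occ_real, occ_sim = occupancy(proj_real), occupancy(proj_sim)
    jaccard = (occ_real & occ_sim).sum() / max((occ_real | occ_sim).sum(), 1)
    return ks_pvalues, float(jaccard)
```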

Tereza Nečasová, David Svoboda
Instance Segmentation of Neural Cells

Instance segmentation of neural cells plays an important role in brain study. However, this task is challenging due to the special shapes and behaviors of neural cells. Existing methods are not precise enough to capture their tiny structures, e.g., filopodia and lamellipodia, which are critical to the understanding of cell interaction and behavior. To this end, we propose a novel deep multi-task learning model to jointly detect and segment neural cells instance-wise. Our method is built upon SSD, with ResNet101 as the backbone to achieve both high detection accuracy and fast speed. Furthermore, unlike existing works which tend to produce wavy and inaccurate boundaries, we embed a deconvolution module into SSD to better capture details. Experiments on a dataset of neural cell microscopic images show that our method is able to achieve better performance in terms of accuracy and efficiency, comparing favorably with current state-of-the-art methods.

Jingru Yi, Pengxiang Wu, Menglin Jiang, Daniel J. Hoeppner, Dimitris N. Metaxas
Densely Connected Stacked U-network for Filament Segmentation in Microscopy Images

Segmenting filamentous structures in confocal microscopy images is important for analyzing and quantifying related biological processes. However, thin structures, especially in noisy imagery, are difficult to accurately segment. In this paper, we introduce a novel deep network architecture for filament segmentation in confocal microscopy images that improves upon the state-of-the-art U-net and SOAX methods. We also propose a strategy for data annotation, and create datasets for microtubule and actin filaments. Our experiments show that our proposed network outperforms state-of-the-art approaches and that our segmentation results are not only better in terms of accuracy, but also more suitable for biological analysis and understanding by reducing the number of falsely disconnected filaments in segmentation.

Yi Liu, Wayne Treible, Abhishek Kolagunda, Alex Nedo, Philip Saponaro, Jeffrey Caplan, Chandra Kambhamettu
Deep Convolutional Neural Networks Based Framework for Estimation of Stomata Density and Structure from Microscopic Images

Analysis of stomata density and its configuration based on scanning electron microscope (SEM) images of a leaf surface is an effective way to characterize a plant’s behaviour under various environmental stresses (drought, salinity, etc.). Existing methods for phenotyping these stomatal traits are often based on manual or semi-automatic labeling and segmentation of SEM images. This is a low-throughput process when a large number of SEM images is investigated for statistical analysis. To overcome this limitation, we propose a novel automated pipeline leveraging deep convolutional neural networks for stomata detection and quantification. The proposed framework shows superior performance in contrast to existing stomata detection methods in terms of precision and recall, 0.91 and 0.89 respectively. Furthermore, the morphological traits (i.e. length & width) obtained at the stomata quantification step show correlations of 0.95 and 0.91 with manually computed traits, resulting in an efficient and high-throughput solution for stomata phenotyping.

Swati Bhugra, Deepak Mishra, Anupama Anupama, Santanu Chaudhury, Brejesh Lall, Archana Chugh, Viswanathan Chinnusamy
A Fast and Scalable Pipeline for Stain Normalization of Whole-Slide Images in Histopathology

Stain normalization is one of the main tasks in the processing pipeline of computer-aided diagnosis systems in modern digital pathology. Some of the challenges in this task are the memory and runtime bottlenecks associated with large image datasets. In this work, we present a scalable and fast pipeline for stain normalization using a state-of-the-art unsupervised method based on stain-vector estimation. The proposed system supports single-node and distributed implementations. Based on a highly-optimized engine, our architecture enables high-speed and large-scale processing of high-magnification whole-slide images (WSI). We demonstrate the performance of the system using measurements from different datasets. Moreover, by using a novel pixel-sampling optimization, the single-node implementation achieves a lower processing time per image than the scanning time of ultrafast WSI scanners, and the 4-node distributed pipeline yields an additional average speed-up of 3.44.

Milos Stanisavljevic, Andreea Anghel, Nikolaos Papandreou, Sonali Andani, Pushpak Pati, Jan Hendrik Rüschoff, Peter Wild, Maria Gabrani, Haralampos Pozidis
A Benchmark for Epithelial Cell Tracking

Segmentation and tracking of epithelial cells in light microscopy (LM) movies of developing tissue is an abundant task in cell- and developmental biology. Epithelial cells are densely packed cells that form a honeycomb-like grid. This dense packing distinguishes membrane-stained epithelial cells from the types of objects recent cell tracking benchmarks have focused on, like cell nuclei and freely moving individual cells. While semi-automated tools for segmentation and tracking of epithelial cells are available to biologists, common tools rely on classical watershed-based segmentation and engineered tracking heuristics, and entail a tedious phase of manual curation. However, a different kind of densely packed cell imagery has become a focus of recent computer vision research, namely electron microscopy (EM) images of neurons. In this work we explore the benefits of two recent neuron EM segmentation methods for epithelial cell tracking in light microscopy. In particular we adapt two different deep learning approaches for neuron segmentation, namely Flood Filling Networks and MALA, to epithelial cell tracking. We benchmark these on a dataset of eight movies with up to 200 frames. We compare to Moral Lineage Tracing, a combinatorial optimization approach that recently claimed state-of-the-art results for epithelial cell tracking. Furthermore, we compare to Tissue Analyzer, an off-the-shelf tool used by biologists that serves as our baseline.

Jan Funke, Lisa Mais, Andrew Champion, Natalie Dye, Dagmar Kainmueller
Automatic Fusion of Segmentation and Tracking Labels

Labeled training images of high quality are required for developing well-working analysis pipelines. This is, of course, also true for biological image data, where such labels are usually hard to get. We distinguish human labels (gold corpora) and labels generated by computer algorithms (silver corpora). A naturally arising problem is to merge multiple corpora into larger bodies of labeled training datasets. While fusion of labels in static images is already an established field, dealing with labels in time-lapse image data remains to be explored. Obtaining a gold corpus for segmentation is usually very time-consuming and hence expensive. For this reason, gold corpora for object tracking often use object detection markers instead of dense segmentations. If dense segmentations of tracked objects are desired later on, an automatic merge of the detection-based gold corpus with (silver) corpora of the individual time points for segmentation will be necessary. Here we present such an automatic merging system and demonstrate its utility on corpora from the Cell Tracking Challenge. We additionally release all label fusion algorithms as freely available and open plugins for Fiji ( https://github.com/xulman/CTC-FijiPlugins ).

Cem Emre Akbaş, Vladimír Ulman, Martin Maška, Florian Jug, Michal Kozubek
Identification of C. elegans Strains Using a Fully Convolutional Neural Network on Behavioural Dynamics

The nematode C. elegans is a promising model organism to understand the genetic basis of behaviour due to its anatomical simplicity. In this work, we present a deep learning model capable of discerning genetically diverse strains based only on their recorded spontaneous activity, and explore how its performance changes as different embeddings are used as input. The model outperforms hand-crafted features on strain classification when trained directly on time series of worm postures.

Avelino Javer, André E. X. Brown, Iasonas Kokkinos, Jens Rittscher
Towards Automated Multiscale Imaging and Analysis in TEM: Glomerulus Detection by Fusion of CNN and LBP Maps

Glomerular structures in kidney tissue have to be analysed at the nanometer scale for several medical diagnoses. They are therefore commonly imaged using Transmission Electron Microscopy. The high resolution produces large amounts of data and requires long acquisition times, which makes automated imaging and glomerulus detection a desired option. This paper presents a deep learning approach for glomerulus detection, using two architectures, VGG16 (with batch normalization) and ResNet50. To enhance the performance over training based only on intensity images, multiple approaches to fusing the input with texture information encoded in local binary patterns of different scales have been evaluated. The results show a consistent improvement in glomerulus detection when fusing texture-based trained networks with intensity-based ones at a late classification stage.
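A hedged sketch of such late fusion (the two classifiers and their predict_proba interface are hypothetical placeholders; only the LBP computation comes from a real library, scikit-image):

```python
import numpy as np
from skimage.feature import local_binary_pattern

def late_fusion_predict(image, intensity_model, lbp_model, radius=1):
    """Average the class probabilities of an intensity-trained classifier and
    a texture-trained classifier fed with the local-binary-pattern map."""
    lbp_map = local_binary_pattern(image, P=8 * radius, R=radius, method="uniform")
    p_intensity = intensity_model.predict_proba(image[None])[0]
    p_texture = lbp_model.predict_proba(lbp_map[None])[0]
    return 0.5 * (p_intensity + p_texture)   # fused glomerulus / background scores
```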

Elisabeth Wetzer, Joakim Lindblad, Ida-Maria Sintorn, Kjell Hultenby, Nataša Sladoje
Pre-training on Grayscale ImageNet Improves Medical Image Classification

Deep learning is quickly becoming the de facto standard approach for solving a range of medical image analysis tasks. However, large medical image datasets appropriate for training deep neural network models from scratch are difficult to assemble due to privacy restrictions and expert ground truth requirements, with typical open source datasets ranging from hundreds to thousands of images. A standard approach to counteract limited-size medical datasets is to pre-train models on large datasets in other domains, such as ImageNet for classification of natural images, before fine-tuning on the specific medical task of interest. However, ImageNet contains color images, which introduces artefacts and inefficiencies into models that are intended for single-channel medical images. To address this issue, we pre-trained an Inception-V3 model on ImageNet after converting the images to grayscale through a common transformation. Surprisingly, these models do not show a significant degradation in performance on the original ImageNet classification task, suggesting that color is not a critical feature of natural image classification. Furthermore, models pre-trained on grayscale ImageNet outperformed color ImageNet models in terms of both speed and accuracy when refined on disease classification from chest X-ray images.
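A hedged sketch of the preprocessing and the single-channel adaptation (torchvision's Inception-V3 layer names are assumed; this is not the authors' exact training recipe):

```python
import torch.nn as nn
from torchvision import models, transforms

# Grayscale conversion applied when building the grayscale copy of ImageNet.
to_gray = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
])

# Adapt an Inception-V3 to single-channel input by replacing its first
# convolution with a 1-channel one (layer name per torchvision's implementation).
model = models.inception_v3()
old = model.Conv2d_1a_3x3.conv
model.Conv2d_1a_3x3.conv = nn.Conv2d(1, old.out_channels,
                                     kernel_size=old.kernel_size,
                                     stride=old.stride, bias=False)
```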

Yiting Xie, David Richmond

W34 – 1st Workshop on Interactive and Adaptive Learning in an Open World

Frontmatter
Workshop on Interactive and Adaptive Learning in an Open World

Next generation machine learning requires stepping away from classical batch learning towards interactive and adaptive learning. This is essential to cope with the demanding machine learning applications we already have today. Our workshop at ECCV 2018 in Munich therefore served as a discussion forum for experts in this field, and in the following we give a brief overview. Please note that this discussion paper has not been peer-reviewed and only contains the subjective summary of the workshop organizers.

Alexander Freytag, Vittorio Ferrari, Mario Fritz, Uwe Franke, Terrence Boult, Juergen Gall, Walter Scheirer, Angela Yao, Erik Rodner

W35 – 1st Multimodal Learning and Applications Workshop

Frontmatter
Boosting LiDAR-Based Semantic Labeling by Cross-modal Training Data Generation

Mobile robots and autonomous vehicles rely on multi-modal sensor setups to perceive and understand their surroundings. Aside from cameras, LiDAR sensors represent a central component of state-of-the-art perception systems. In addition to accurate spatial perception, a comprehensive semantic understanding of the environment is essential for efficient and safe operation. In this paper we present a novel deep neural network architecture called LiLaNet for point-wise, multi-class semantic labeling of semi-dense LiDAR data. The network utilizes virtual image projections of the 3D point clouds for efficient inference. Further, we propose an automated process for large-scale cross-modal training data generation called Autolabeling, in order to boost semantic labeling performance while keeping the manual annotation effort low. The effectiveness of the proposed network architecture as well as the automated data generation process is demonstrated on a manually annotated ground truth dataset. LiLaNet is shown to significantly outperform current state-of-the-art CNN architectures for LiDAR data. Applying our automatically generated large-scale training data yields a boost of up to 14 percentage points compared to networks trained on manually annotated data only.
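A hedged sketch of a virtual image projection for LiDAR point clouds (the field-of-view values, image size and single range channel are assumptions; LiLaNet's actual projection may differ):

```python
import numpy as np

def spherical_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Map each 3-D point to a pixel of an (h, w) range image indexed by
    azimuth and elevation, so a 2-D CNN can process the LiDAR scan.

    points: (N, 3) array of x, y, z coordinates in the sensor frame."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    azimuth = np.arctan2(y, x)                       # [-pi, pi]
    elevation = np.arcsin(z / r)                     # radians
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)

    u = ((np.pi - azimuth) / (2 * np.pi) * w).astype(int) % w
    v = np.clip(((fov_up - elevation) / (fov_up - fov_down) * h).astype(int), 0, h - 1)

    image = np.zeros((h, w), dtype=np.float32)       # range channel only
    image[v, u] = r
    return image
```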

Florian Piewak, Peter Pinggera, Manuel Schäfer, David Peter, Beate Schwarz, Nick Schneider, Markus Enzweiler, David Pfeiffer, Marius Zöllner
Learning to Learn from Web Data Through Deep Semantic Embeddings

In this paper we propose to learn a multimodal image and text embedding from Web and Social Media data, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval. We demonstrate that the pipeline can learn from images with associated text without supervision, and perform a thorough analysis of five different text embeddings on three different benchmarks. We show that the embeddings learnt with Web and Social Media data achieve performance competitive with supervised methods in the text-based image retrieval task, and we clearly outperform the state of the art on the MIRFlickr dataset when training on the target data. Further, we demonstrate how semantic multimodal image retrieval can be performed using the learnt embeddings, going beyond classical instance-level retrieval problems. Finally, we present a new dataset, InstaCities1M, composed of Instagram images and their associated texts, that can be used for fair comparison of image-text embeddings.

Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas
Learning from #Barcelona Instagram Data What Locals and Tourists Post About Its Neighbourhoods

Massive tourism is becoming a big problem for some cities, such as Barcelona, due to its concentration in certain neighborhoods. In this work we gather Instagram data related to Barcelona, consisting of image-caption pairs, and, using the text as a supervisory signal, we learn relations between images, words and neighborhoods. Our goal is to learn which visual elements appear in photos when people post about each neighborhood. We perform a language-separate treatment of the data and show that it can be extrapolated to a separate analysis of tourists and locals, and that tourism is reflected in Social Media at a neighborhood level. The presented pipeline allows analyzing the differences between the images that tourists and locals associate with the different neighborhoods. The proposed method, which can be extended to other cities or subjects, proves that Instagram data can be used to train multi-modal (image and text) machine learning models that are useful for analyzing publications about a city at a neighborhood level. We publish the collected dataset, InstaBarcelona, and the code used in the analysis.

Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas
A Structured Listwise Approach to Learning to Rank for Image Tagging

With the growing quantity and diversity of publicly available image data, computer vision plays a crucial role in understanding and organizing visual information today. Image tagging models are very often used to make this data accessible and useful. Generating image labels and ranking them by their relevance to the visual content is still an open problem. In this work, we use a bilinear compatibility function inspired from zero-shot learning that allows us to rank tags according to their relevance to the image content. We propose a novel listwise structured loss formulation to learn it from data. We leverage captioned image data and propose different “tags from captions” schemes meant to capture user attention and intra-user agreement in a simple and effective manner. We evaluate our method on the COCO-Captions, PASCAL-sentences and MIRFlickr-25k datasets showing promising results.
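A hedged sketch of a bilinear compatibility scorer with a listwise loss (the target distribution derived from caption-based relevance counts is an assumption of this illustration, not the paper's exact formulation):

```python
import torch
import torch.nn as nn

class BilinearTagRanker(nn.Module):
    """Score each candidate tag embedding against the image feature with a
    bilinear compatibility f(x, t) = x^T W t."""
    def __init__(self, img_dim, tag_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(img_dim, tag_dim) * 0.01)

    def forward(self, img_feat, tag_embs):
        # img_feat: (B, img_dim); tag_embs: (B, num_tags, tag_dim)
        return torch.einsum('bi,ij,btj->bt', img_feat, self.W, tag_embs)

def listwise_loss(scores, relevance):
    """Listwise loss: cross-entropy between the softmax over the candidate
    list and a target distribution derived from tag relevance counts."""
    log_p = torch.log_softmax(scores, dim=-1)
    target = relevance / relevance.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return -(target * log_p).sum(dim=-1).mean()
```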

Jorge Sánchez, Franco Luque, Leandro Lichtensztein
Visually Indicated Sound Generation by Perceptually Optimized Classification

Visually indicated sound generation aims to predict visually consistent sound from the video content. Previous methods addressed this problem by creating a single generative model that ignores the distinctive characteristics of various sound categories. Nowadays, state-of-the-art sound classification networks are available to capture semantic-level information in audio modality, which can also serve for the purpose of visually indicated sound generation. In this paper, we explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve the audio generation quality. We propose a novel Perceptually Optimized Classification based Audio generation Network (POCAN), which generates sound conditioned on the sound class predicted from visual information. Additionally, a perceptual loss is calculated via a pre-trained sound classification network to align the semantic information between the generated sound and its ground truth during training. Experiments show that POCAN achieves significantly better results in visually indicated sound generation task on two datasets.

Kan Chen, Chuanxi Zhang, Chen Fang, Zhaowen Wang, Trung Bui, Ram Nevatia
CentralNet: A Multilayer Approach for Multimodal Fusion

This paper proposes a novel multimodal fusion approach, aiming to produce the best possible decisions by integrating information coming from multiple media. While most past multimodal approaches either work by projecting the features of different modalities into the same space, or by coordinating the representations of each modality through the use of constraints, our approach borrows from both visions. More specifically, assuming each modality can be processed by a separate deep convolutional network, allowing decisions to be taken independently from each modality, we introduce a central network linking the modality-specific networks. This central network not only provides a common feature embedding but also regularizes the modality-specific networks through the use of multi-task learning. The proposed approach is validated on 4 different computer vision tasks on which it consistently improves the accuracy of existing multimodal fusion approaches.
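A hedged sketch of one fusion step in the spirit of the described central network (scalar trainable weights and a single linear central layer are simplifying assumptions):

```python
import torch
import torch.nn as nn

class CentralNetLayer(nn.Module):
    """One fusion step: the central representation is a learned weighted sum
    of the previous central state and the current hidden states of the
    modality-specific networks, followed by the central layer itself."""
    def __init__(self, dim, num_modalities):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_modalities + 1))
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, central_prev, modality_hiddens):
        # central_prev: (B, dim); modality_hiddens: list of (B, dim) tensors
        mixed = self.alpha[0] * central_prev
        for a, h in zip(self.alpha[1:], modality_hiddens):
            mixed = mixed + a * h
        return self.layer(mixed)
```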

Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, Frédéric Jurie
Where and What Am I Eating? Image-Based Food Menu Recognition

Food has become a very important aspect of our social activities. Since social networks and websites like Yelp appeared, their users have started uploading photos of their meals to the Internet. This phenomenon opens a whole world of possibilities for developing models for food analysis and recognition on huge amounts of real-world data. A clear application consists in performing image-based food recognition using restaurant menus. Our model, based on Convolutional Neural Networks and Recurrent Neural Networks, is able to learn a language model that generalizes to never-seen dish names without the need to re-train it. According to the Ranking Loss metric, the results obtained by the model improve the baseline by 15%.

Marc Bolaños, Marc Valdivia, Petia Radeva
ThermalGAN: Multimodal Color-to-Thermal Image Translation for Person Re-identification in Multispectral Dataset

We propose a ThermalGAN framework for cross-modality color-thermal person re-identification (ReID). We use a stack of generative adversarial networks (GAN) to translate a single color probe image to a multimodal thermal probe set. We use thermal histograms and feature descriptors as a thermal signature. We collected a large-scale multispectral ThermalWorld dataset for extensive training of our GAN model. In total the dataset includes 20216 color-thermal image pairs, 516 person IDs, and ground truth pixel-level object annotations. We have made the dataset freely available ( http://www.zefirus.org/ThermalGAN/ ). We evaluate our framework on the ThermalWorld dataset to show that it delivers robust matching that competes with and surpasses the state of the art in cross-modality color-thermal ReID.

Vladimir V. Kniaz, Vladimir A. Knyaz, Jiří Hladůvka, Walter G. Kropatsch, Vladimir Mizginov
Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach

Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval of images and sentences. In this setting, data coming from different modalities can be projected in a common embedding space, in which distances can be used to infer the similarity between pairs of images and sentences. While this approach has shown impressive performances on fully supervised settings, its application to semi-supervised scenarios has been rarely investigated. In this paper we propose a domain adaptation model for cross-modal retrieval, in which the knowledge learned from a supervised dataset can be transferred on a target dataset in which the pairing between images and sentences is not known, or not useful for training due to the limited size of the set. Experiments are performed on two target unsupervised scenarios, respectively related to the fashion and cultural heritage domain. Results show that our model is able to effectively transfer the knowledge learned on ordinary visual-semantic datasets, achieving promising results. As an additional contribution, we collect and release the dataset used for the cultural heritage domain.

Angelo Carraggi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Generalized Bayesian Canonical Correlation Analysis with Missing Modalities

Multi-modal learning aims to build models that can relate information from multiple modalities. One challenge of multi-modal learning is the prediction of a target modality based on a set of multiple modalities. There are, however, two difficulties associated with this goal: first, collecting a large, complete dataset containing all required modalities is difficult, and some of the modalities may be missing; second, the features of the modalities are likely to be high-dimensional and noisy. To deal with these challenges, we propose a method called Generalized Bayesian Canonical Correlation Analysis with Missing Modalities. This method can utilize incomplete sets of modalities: by including them in the likelihood function during training, it can accurately estimate the relationships among the non-missing modalities and their feature spaces. In addition, the method works well on high-dimensional and noisy modality features, because its probabilistic model, built on prior knowledge, is robust to outliers and reduces the amount of data needed for learning even when the features are high-dimensional. Experiments with artificial and real data demonstrate that our method outperforms conventional methods.
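For orientation, here is a minimal sketch of the generative model that Bayesian CCA variants are typically built on (standard notation; the paper's generalized formulation may differ): each modality $x_m$ is generated from a shared latent variable $z$,

$$ z \sim \mathcal{N}(0, I), \qquad x_m \mid z \sim \mathcal{N}(W_m z + \mu_m, \Psi_m), \quad m = 1, \dots, M. $$

When a modality is missing for a given sample, its factor is simply dropped, so the sample still contributes $\prod_{m \in \mathcal{O}} p(x_m \mid z)$ to the likelihood, where $\mathcal{O}$ is the set of observed modalities for that sample.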

Toshihiko Matsuura, Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada
Unpaired Thermal to Visible Spectrum Transfer Using Adversarial Training

Thermal Infrared (TIR) cameras are gaining popularity in many computer vision applications due to their ability to operate under low-light conditions. Images produced by TIR cameras are usually difficult for humans to perceive visually, which limits their usability. Several methods have been proposed in the literature to address this problem by transforming TIR images into realistic visible spectrum (VIS) images. However, existing TIR-VIS datasets suffer from imperfect alignment between TIR-VIS image pairs, which degrades the performance of supervised methods. We tackle this problem by learning the transformation with an unsupervised Generative Adversarial Network (GAN) trained on unpaired TIR and VIS images. When trained and evaluated on the KAIST-MS dataset, our proposed method produces significantly more realistic and sharper VIS images than existing state-of-the-art supervised methods. In addition, it generalizes very well when evaluated on a new dataset of previously unseen environments.
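As a hedged sketch of unpaired adversarial training (we assume a CycleGAN-style formulation here, which is not necessarily the authors' exact model), the code below computes the two generator losses: adversarial terms that make translated images look realistic to domain discriminators, and a cycle-consistency term that ties the unpaired TIR and VIS domains together.

# Minimal sketch; the 1x1-conv "generators"/"discriminators" are placeholders
# used only to show the shapes and the loss structure.
import torch
import torch.nn.functional as F

def unpaired_losses(G_t2v, G_v2t, D_vis, D_tir, tir, vis, lam=10.0):
    fake_vis = G_t2v(tir)
    fake_tir = G_v2t(vis)
    # adversarial terms (least-squares GAN form) for the two generators
    adv = F.mse_loss(D_vis(fake_vis), torch.ones_like(D_vis(fake_vis))) + \
          F.mse_loss(D_tir(fake_tir), torch.ones_like(D_tir(fake_tir)))
    # cycle consistency: translating there and back should recover the input
    cyc = F.l1_loss(G_v2t(fake_vis), tir) + F.l1_loss(G_t2v(fake_tir), vis)
    return adv + lam * cyc

G_t2v = torch.nn.Conv2d(1, 3, 1); G_v2t = torch.nn.Conv2d(3, 1, 1)
D_vis = torch.nn.Conv2d(3, 1, 1); D_tir = torch.nn.Conv2d(1, 1, 1)
tir = torch.randn(2, 1, 64, 64); vis = torch.randn(2, 3, 64, 64)
loss = unpaired_losses(G_t2v, G_v2t, D_vis, D_tir, tir, vis)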

Adam Nyberg, Abdelrahman Eldesokey, David Bergström, David Gustafsson

W36 – What Is Optical Flow for?

Frontmatter
Devon: Deformable Volume Network for Learning Optical Flow

We propose a new neural network module, Deformable Cost Volume, for learning large displacement optical flow. The module does not distort the original images or their feature maps and therefore avoids the artifacts associated with warping. Based on this module, a new neural network model is proposed. The full version of this paper can be found online ( https://arxiv.org/abs/1802.07351 ).
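For readers unfamiliar with cost volumes, the sketch below shows a standard (non-deformable) correlation cost volume: for every location in the first feature map, correlation scores are computed against a local neighbourhood in the second feature map. The paper's Deformable Cost Volume changes where those neighbourhood samples are taken so that the feature maps never need to be warped; this code is only a baseline illustration, not the proposed module.

# Minimal sketch of a plain correlation cost volume over a (2r+1)x(2r+1) window.
import torch
import torch.nn.functional as F

def cost_volume(f1, f2, radius=3):
    # f1, f2: (B, C, H, W) feature maps of two consecutive frames
    B, C, H, W = f1.shape
    f2_pad = F.pad(f2, [radius] * 4)
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = f2_pad[:, :, dy:dy + H, dx:dx + W]
            costs.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)              # (B, (2r+1)^2, H, W)

vol = cost_volume(torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64))
print(vol.shape)                                 # torch.Size([1, 49, 48, 64])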

Yao Lu, Jack Valmadre, Heng Wang, Juho Kannala, Mehrtash Harandi, Philip H. S. Torr
Using Phase Instead of Optical Flow for Action Recognition

Currently, the most common motion representation for action recognition is optical flow. Optical flow is based on particle tracking, which adheres to a Lagrangian perspective on dynamics. In contrast to the Lagrangian perspective, the Eulerian model of dynamics does not track, but describes local changes. For video, an Eulerian phase-based motion representation, using complex steerable filters, has recently been successfully employed for motion magnification and video frame interpolation. Inspired by these previous works, we propose learning Eulerian motion representations in a deep architecture for action recognition. We learn filters in the complex domain in an end-to-end manner. We design these complex filters to resemble complex Gabor filters, typically employed for phase-information extraction. We propose a phase-information extraction module, based on these complex filters, that can be used in any network architecture for extracting Eulerian representations. We experimentally analyze the added value of Eulerian motion representations, as extracted by our proposed phase extraction module, and compare with existing motion representations based on optical flow, on the UCF101 dataset.
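As an illustrative sketch (hand-crafted filters, not the learned complex filters from the paper), a quadrature pair of Gabor filters yields real and imaginary responses whose angle is the local phase; tracking this phase over time is the Eulerian alternative to tracking pixels with optical flow.

# Minimal sketch; filter size, wavelength and sigma are illustrative choices.
import math
import torch
import torch.nn.functional as F

def gabor_pair(size=15, wavelength=6.0, sigma=3.0):
    r = torch.arange(size, dtype=torch.float32) - size // 2
    y, x = torch.meshgrid(r, r, indexing="ij")
    envelope = torch.exp(-(x**2 + y**2) / (2 * sigma**2))
    real = envelope * torch.cos(2 * math.pi * x / wavelength)
    imag = envelope * torch.sin(2 * math.pi * x / wavelength)
    return real, imag

def local_phase(frame, real_k, imag_k):
    # frame: (1, 1, H, W) grayscale image
    pad = real_k.shape[-1] // 2
    re = F.conv2d(frame, real_k[None, None], padding=pad)
    im = F.conv2d(frame, imag_k[None, None], padding=pad)
    return torch.atan2(im, re)                   # local phase in [-pi, pi]

real_k, imag_k = gabor_pair()
phase_t0 = local_phase(torch.randn(1, 1, 64, 64), real_k, imag_k)
phase_t1 = local_phase(torch.randn(1, 1, 64, 64), real_k, imag_k)
phase_diff = torch.atan2(torch.sin(phase_t1 - phase_t0),
                         torch.cos(phase_t1 - phase_t0))  # wrapped temporal change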

Omar Hommos, Silvia L. Pintea, Pascal S. M. Mettes, Jan C. van Gemert
Event Extraction Using Transportation of Temporal Optical Flow Fields

In this paper, we develop a method to transform a sequence of images into a sequence of events. Optical flow, the vector field of pointwise motion computed from monocular image sequences, describes motion in an environment. The method extracts the global smoothness and continuity of the motion fields and detects collapses of this smoothness in long image sequences using transportation of the temporal optical flow field.

Itaru Gotoh, Hiroki Hiraoka, Atsushi Imiya
A Simple and Effective Fusion Approach for Multi-frame Optical Flow Estimation

To date, top-performing optical flow estimation methods only take pairs of consecutive frames into account. While elegant and appealing, the idea of using more than two frames has not yet produced state-of-the-art results. We present a simple, yet effective fusion approach for multi-frame optical flow that benefits from longer-term temporal cues. Our method first warps the optical flow from previous frames to the current, thereby yielding multiple plausible estimates. It then fuses the complementary information carried by these estimates into a new optical flow field. At the time of writing, our method ranks first among published results in the MPI Sintel and KITTI 2015 benchmarks.
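A minimal sketch of the propagation step (our own simplification, not the paper's full fusion network; variable names are hypothetical): the previous flow field is backward-warped into the current frame using the current backward flow, giving a second plausible estimate under a constant-velocity assumption; a fusion step would then choose, per pixel, between this propagated flow and the two-frame estimate.

# Minimal sketch of backward-warping a flow field with grid_sample.
import torch
import torch.nn.functional as F

def warp_with_flow(x, flow):
    # x: (B, C, H, W) tensor to warp, flow: (B, 2, H, W) in pixels (dx, dy)
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    grid_x = (xs[None] + flow[:, 0]) / (W - 1) * 2 - 1    # normalize to [-1, 1]
    grid_y = (ys[None] + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (B, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

prev_flow = torch.randn(1, 2, 48, 64)        # flow from frame t-1 to t
flow_back = torch.randn(1, 2, 48, 64)        # flow from frame t to t-1
propagated = warp_with_flow(prev_flow, flow_back)   # candidate flow for t -> t+1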

Zhile Ren, Orazio Gallo, Deqing Sun, Ming-Hsuan Yang, Erik B. Sudderth, Jan Kautz
Unsupervised Event-Based Optical Flow Using Motion Compensation

In this work, we propose a novel framework for unsupervised learning with event cameras that predicts optical flow from the event stream alone. In particular, we propose an input representation of the events in the form of a discretized 3D volume, which we pass through a neural network to predict the optical flow for each event. This optical flow is then used to compensate for motion blur in the event image. We propose a loss function, applied to the motion-compensated event image, that measures the remaining motion blur. We evaluate this network on the Multi Vehicle Stereo Event Camera dataset (MVSEC) and present qualitative results from a variety of different scenes.
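For illustration, the sketch below builds a simplified discretized event volume (nearest temporal bin rather than any interpolated assignment the paper may use): each event (x, y, t, polarity) adds its polarity into one of B temporal slices, yielding a volume that a network can consume like a multi-channel image.

# Minimal sketch with synthetic events; a real pipeline would read sensor data.
import numpy as np

def event_volume(events, H, W, num_bins=9):
    # events: (N, 4) array with columns x, y, t, polarity (+1 / -1)
    x, y = events[:, 0].astype(int), events[:, 1].astype(int)
    t, p = events[:, 2], events[:, 3]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)     # map times to [0, 1]
    b = np.clip((t_norm * num_bins).astype(int), 0, num_bins - 1)
    vol = np.zeros((num_bins, H, W), dtype=np.float32)
    np.add.at(vol, (b, y, x), p)                              # scatter-add polarities
    return vol

events = np.column_stack([np.random.randint(0, 64, 1000),       # x
                          np.random.randint(0, 48, 1000),       # y
                          np.sort(np.random.rand(1000)),        # t
                          np.random.choice([-1.0, 1.0], 1000)]) # polarity
vol = event_volume(events, H=48, W=64)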

Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, Kostas Daniilidis
MoA-Net: Self-supervised Motion Segmentation

Most recent approaches to motion segmentation use optical flow to segment an image into the static environment and independently moving objects. Neural-network-based approaches usually require large amounts of labeled training data to achieve state-of-the-art performance. In this work we propose a new approach to train a motion segmentation network in a self-supervised manner. Inspired by visual ecology, the human visual system, and prior approaches to motion modeling, we break down the problem of motion segmentation into two smaller subproblems: (1) modifying the flow field to remove the observer's rotation and (2) segmenting the rotation-compensated flow into the static environment and independently moving objects. Compensating for rotation leads to essential simplifications that allow us to describe an independently moving object with just a few criteria, which can be learned by our new motion segmentation network, the Motion Angle Network (MoA-Net). We compare our network with two other motion segmentation networks and show state-of-the-art performance on Sintel.
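As a hedged sketch of the rotation-compensation idea (standard motion-field equations under one common sign convention; the paper's exact preprocessing may differ), the purely rotational part of the flow is known in closed form for a calibrated camera and can be subtracted, leaving a residual that depends only on translation and independently moving objects.

# Minimal sketch; omega and the focal length are illustrative values.
import numpy as np

def rotational_flow(H, W, omega, f=1.0):
    # omega = (wx, wy, wz) camera rotation rates; x, y are normalized coordinates
    wx, wy, wz = omega
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    x = (xs - W / 2) / f
    y = (ys - H / 2) / f
    u = x * y * wx - (1 + x**2) * wy + y * wz
    v = (1 + y**2) * wx - x * y * wy - x * wz
    return np.stack([u, v], axis=0) * f          # back to pixel units

flow = np.random.randn(2, 48, 64)                # measured flow (pixels/frame)
flow_rot = rotational_flow(48, 64, omega=(0.01, -0.02, 0.005), f=50.0)
flow_trans = flow - flow_rot                     # rotation-compensated flow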

Pia Bideau, Rakesh R. Menon, Erik Learned-Miller
“What Is Optical Flow For?”: Workshop Results and Summary

Traditionally, computer vision problems have been classified into three levels: low (image to image), middle (image to features), and high (features to analysis) [11]. Some typical low-level vision problems include optical flow [7], stereo [10], and intrinsic image decomposition [1]. The solutions to these problems would then be combined to solve higher-level problems, such as action recognition and visual question answering.

Fatma Güney, Laura Sevilla-Lara, Deqing Sun, Jonas Wulff
Backmatter
Metadata
Title
Computer Vision – ECCV 2018 Workshops
Editors
Laura Leal-Taixé
Stefan Roth
Copyright Year
2019
Electronic ISBN
978-3-030-11024-6
Print ISBN
978-3-030-11023-9
DOI
https://doi.org/10.1007/978-3-030-11024-6
