2019 | Book

Computer Vision – ACCV 2018 Workshops

14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers

About this book

This LNCS workshop proceedings volume of ACCV 2018 contains carefully reviewed and selected papers from 11 workshops, each with its own type of program: Scene Understanding and Modelling (SUMO) Challenge, Learning and Inference Methods for High Performance Imaging (LIMHPI), Attention/Intention Understanding (AIU), Museum Exhibit Identification Challenge (Open MIC) for Domain Adaptation and Few-Shot Learning, RGB-D - Sensing and Understanding via Combined Colour and Depth, Dense 3D Reconstruction for Dynamic Scenes, AI Aesthetics in Art and Media (AIAM), Robust Reading (IWRR), Artificial Intelligence for Retinal Image Analysis (AIRIA), Combining Vision and Language, and Advanced Machine Vision for Real-life and Industrially Relevant Applications (AMV).

Table of Contents

Frontmatter
Correction to: PCA-RECT: An Energy-Efficient Object Detection Approach for Event Cameras

In the version of this chapter that was originally published, the funding information given at the bottom of the first page was not correct. This has been updated so that the new version now reads: “Supported by Temasek Research Fellowship.”

Bharath Ramesh, Andrés Ussa, Luca Della Vedova, Hong Yang, Garrick Orchard

Learning and Inference Methods for High-Performance Imaging (LIMHPI)

Frontmatter
Anti-occlusion Light-Field Optical Flow Estimation Using Light-Field Super-Pixels

Optical flow estimation is one of the most important problems in the community. However, current methods still cannot provide reliable results in occlusion boundary areas. Light field cameras provide hundreds of views in a single shot, so the occlusion ambiguity can be better analysed using other views. In this paper, we present a novel method for anti-occlusion optical flow estimation in a dynamic light field. We first model the light field superpixel (LFSP) as a slanted plane in 3D. The motion of occluded pixels in the central view slice can then be optimized using the un-occluded pixels in other views, so that the optical flow in occlusion boundary areas can be computed reliably. Experimental results on both synthetic and real light fields demonstrate the advantages over state-of-the-art methods as well as the performance on 4D optical flow computation.

Hao Zhu, Xiaoming Sun, Qi Zhang, Qing Wang, Antonio Robles-Kelly, Hongdong Li

Attention/Intention Understanding (AIU)

Frontmatter
Localizing the Gaze Target of a Crowd of People

What target is focused on by many people? Analysis of the target is a crucial task, especially in a cinema, a stadium, and so on. However, it is very difficult to estimate the gaze of each person in a crowd accurately and simultaneously with existing image-based eye tracking methods, since the image resolution of each person becomes low when we capture the whole crowd with a distant camera. Therefore, we introduce a new approach for localizing the gaze target focused on by a crowd of people. The proposed framework aggregates the individually estimated results of each person’s gaze. It enables us to localize the target being focused on by them even though each person’s gaze localization from a low-resolution image is inaccurate. We analyze the effects of an aggregation method on the localization accuracy using images capturing a crowd of people in a tennis stadium under the assumption that all of the people are focusing on the same target, and also investigate the effect of the number of people involved in the aggregation on the localization accuracy. As a result, the proposed method showed the ability to improve the localization accuracy as it is applied to a larger crowd of people.

Yuki Kodama, Yasutomo Kawanishi, Takatsugu Hirayama, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Hidehisa Nagano, Kunio Kashino
A Thumb Tip Wearable Device Consisting of Multiple Cameras to Measure Thumb Posture

Today, cameras have become smaller and cheaper and can be utilized in various scenes. We took advantage of this to develop a thumb tip wearable device that estimates the joint angles of a thumb, since measuring human finger postures is important for human-computer interfaces and for analyzing human behavior. The device we developed consists of three small cameras attached at different angles so that the cameras can capture the four fingers. We assumed that the appearance of the four fingers would change depending on the joint angles of the thumb. We trained a convolutional neural network to learn a regression relationship between the joint angles of the thumb and the images taken by the cameras. In this paper, we captured the keypoint positions of the thumb with a USB sensor device and calculated the joint angles to construct a dataset. The root mean squared errors on the test data were 6.23° and 4.75°.
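As a rough illustration of the regression setup described in this abstract, the sketch below maps three stacked camera views to two thumb joint angles with a small convolutional network and an MSE loss; the layer sizes, channel counts and input resolution are assumptions, not the authors' architecture.

```python
# Minimal sketch (PyTorch): regress two thumb joint angles from three
# concatenated camera views. Channel counts and image size are assumptions.
import torch
import torch.nn as nn

class ThumbAngleRegressor(nn.Module):
    def __init__(self, views=3, img_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3 * views, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (img_size // 4) ** 2, 64), nn.ReLU(),
            nn.Linear(64, 2),              # two joint angles (degrees)
        )

    def forward(self, x):                  # x: (B, 3*views, H, W)
        return self.head(self.features(x))

model = ThumbAngleRegressor()
images = torch.randn(8, 9, 64, 64)         # batch of stacked three-view images
angles = torch.randn(8, 2)                 # ground-truth joint angles
loss = nn.MSELoss()(model(images), angles)
loss.backward()
```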

Naoto Ienaga, Wataru Kawai, Koji Fujita, Natsuki Miyata, Yuta Sugiura, Hideo Saito
Summarizing Videos with Attention

In this work we propose a novel method for supervised, keyshot-based video summarization by applying a conceptually simple and computationally efficient soft self-attention mechanism. Current state-of-the-art methods leverage bi-directional recurrent networks such as BiLSTM combined with attention. These networks are complex to implement and computationally demanding compared to fully connected networks. To that end we propose a simple, self-attention based network for video summarization which performs the entire sequence-to-sequence transformation in a single feed-forward pass and a single backward pass during training. Our method sets new state-of-the-art results on two benchmarks commonly used in this domain, TvSum and SumMe.
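A minimal sketch of a soft self-attention scorer over per-frame features follows; the feature dimension, the single linear scoring head and the sigmoid output are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch): soft self-attention over per-frame features,
# followed by a frame-importance regressor. Dimensions are assumptions.
import torch
import torch.nn as nn

class AttentionSummarizer(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.score = nn.Linear(dim, 1)            # per-frame importance

    def forward(self, feats):                      # feats: (T, dim) frame features
        attn = torch.softmax(self.q(feats) @ self.k(feats).T / feats.shape[1] ** 0.5, dim=-1)
        context = attn @ self.v(feats)              # one feed-forward pass over the whole sequence
        return torch.sigmoid(self.score(context)).squeeze(-1)   # (T,) keyshot scores

frames = torch.randn(120, 1024)                     # e.g. CNN features for 120 frames
importance = AttentionSummarizer()(frames)
```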

Jiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, Paolo Remagnino
Gait-Based Age Estimation Using a DenseNet

Human age is an important attribute for many potential applications such as digital signage and customer analysis, and gait-based age estimation is particularly promising for surveillance scenarios since gait can be captured at a distance from a camera. We therefore propose a method of gait-based age estimation using a deep learning framework to advance the state-of-the-art accuracy. Specifically, we employ DenseNet, one of the state-of-the-art network architectures. While the previous deep learning method for gait-based age estimation was evaluated only on a small-scale gait database, we evaluate the proposed method on OULP-Age, the world's largest gait database, comprising more than 60,000 subjects with an age range from 2 to 90 years. Consequently, we demonstrate that the proposed method outperforms existing methods based on both conventional machine learning frameworks for gait-based age estimation and a deep learning framework for gait recognition.
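One plausible way to set this up (an assumption, not the authors' exact configuration) is to take a torchvision DenseNet and replace its classifier with a single regression output trained on gait energy images:

```python
# Minimal sketch (PyTorch/torchvision): DenseNet-121 with a one-unit regression
# head for age estimation from gait energy images (GEIs). The specific DenseNet
# variant, input size and loss are assumptions.
import torch
import torch.nn as nn
from torchvision import models

net = models.densenet121()                                    # untrained backbone
net.classifier = nn.Linear(net.classifier.in_features, 1)     # predict age directly

gei = torch.randn(4, 3, 224, 224)      # GEIs replicated to 3 channels (assumption)
ages = torch.tensor([[25.0], [63.0], [8.0], [41.0]])
loss = nn.L1Loss()(net(gei), ages)      # mean absolute error is a common age metric
loss.backward()
```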

Atsuya Sakata, Yasushi Makihara, Noriko Takemura, Daigo Muramatsu, Yasushi Yagi
Human Action Recognition via Body Part Region Segmented Dense Trajectories

We propose a novel action recognition framework based on trajectory features with human-aware spatial segmentation. Our insight is that the critical features for recognition appear in partial regions of the human body, so we segment a video frame into spatial regions based on human body parts to enhance the feature representation. We utilize an object detector and a pose estimator to segment four regions, namely full body, left/right arm, and upper body. From these regions, we extract dense trajectory features and feed them into a shallow RNN to effectively consider long-term relationships. The evaluation results show that our framework outperforms previous approaches on two standard benchmarks, i.e., J-HMDB and MPII Cooking Activities.

Kaho Yamada, Seiya Ito, Naoshi Kaneko, Kazuhiko Sumi

AI Aesthetics in Art and Media (AIAM)

Frontmatter
Let AI Clothe You: Diversified Fashion Generation

In this paper, we demonstrate the automated generation of fashion assortments that appeal widely to consumer tastes given context in terms of attributes. We show how we trained generative adversarial networks to automatically generate an assortment given a fashion category (such as dresses, tops, etc.) and its context (neck type, shape, color, etc.), and describe the practical challenges we faced in increasing assortment diversity. We explore different GAN architectures for context-based fashion generation and show that providing context yields better-quality images. We give examples of a design taxonomy for a given fashion article and finally automate the generation of new designs that span the created taxonomy. We also show a designer-in-the-loop process for taking a generated image to production-level design templates (tech-packs), where designers bring their own creativity by adding elements, suggested by the generated image, to accentuate the overall aesthetics of the final design.

Rajdeep H. Banerjee, Anoop Rajagopal, Nilpa Jha, Arun Patro, Aruna Rajan
Word-Conditioned Image Style Transfer

In recent years, deep learning has attracted attention not only as a method for image recognition but also as a technique for image generation and transformation. In particular, a method called Style Transfer, which integrates two photos into a single image combining the content of one with the style of the other, has drawn much attention. Although many extensions, including Fast Style Transfer, have been proposed, all of them, like the original, require a style image to modify the style of an input image. In this paper, we propose to use words expressing photo styles instead of style images for neural image style transfer. In our method, the style used for transfer is decided by taking into account both the given word and the content of the input image to be stylized. We implemented the proposed method by modifying the network for arbitrary neural artistic stylization. Our experiments show that the proposed method is able to change the style of an input image while taking account of both the given word and the image content.

Yu Sugiyama, Keiji Yanai
Font Style Transfer Using Neural Style Transfer and Unsupervised Cross-domain Transfer

In this paper, we study font generation and conversion. Previous methods treated characters as being composed of strokes. In contrast, we use deep learning to extract features equivalent to strokes from font images and from texture or pattern images, and transform the design pattern of font images. We expect that original fonts, such as handwritten characters, can be generated automatically by the proposed approach. In the experiments, we created unique datasets such as a ketchup character image dataset and improved the image generation quality and character readability by combining neural style transfer with unsupervised cross-domain learning.

Atsushi Narusawa, Wataru Shimoda, Keiji Yanai
Paying Attention to Style: Recognizing Photo Styles with Convolutional Attentional Units

The notion of style in photographs is one that is highly subjective, and often difficult to characterize computationally. Recent advances in learning techniques for visual recognition have encouraged new possibilities for computing aesthetics and other related concepts in images. In this paper, we design an approach for recognizing styles in photographs by introducing adapted deep convolutional neural networks that are attentive towards strong neural activations. The proposed convolutional attentional units act as a filtering mechanism that conserves activations in convolutional blocks in order to contribute more meaningfully towards the visual style classes. State-of-the-art results were achieved on two large image style datasets, demonstrating the effectiveness of our method.

John See, Lai-Kuan Wong, Magzhan Kairanbay

Third International Workshop on Robust Reading (IWRR)

Frontmatter
E2E-MLT - An Unconstrained End-to-End Method for Multi-language Scene Text

An end-to-end trainable (fully differentiable) method for multi-language scene text localization and recognition is proposed. The approach is based on a single fully convolutional network (FCN) with shared layers for both tasks. E2E-MLT is the first published multi-language OCR for scene text. While trained in a multi-language setup, E2E-MLT demonstrates competitive performance compared to other methods trained for English scene text alone. The experiments show that obtaining accurate multi-language multi-script annotations is a challenging problem. Code and trained models are released publicly at https://github.com/MichalBusta/E2E-MLT .

Michal Bušta, Yash Patel, Jiri Matas
An Invoice Reading System Using a Graph Convolutional Network

In this paper, we present a model-free system for reading digitized invoice images, which highlights the most useful billing entities and does not require any particular parameterization. The power of the system lies in the fact that it generalizes to both seen and unseen invoice layouts. The system first breaks the invoice down into various sets of entities to extract and then learns structural and semantic information for each entity via a graph structure, which is later generalized to the whole invoice structure. This local neighborhood exploitation is accomplished via a Graph Convolutional Network (GCN). The system digs deep to extract table information and provides complete invoice reading of up to 27 entities of interest without any template information or configuration, with an excellent overall F-measure of 0.93.
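The abstract does not give the layer definitions, but a minimal graph-convolution step of the kind it refers to can be sketched as below; treating invoice text boxes as nodes with embedded features, the neighbour graph, and all dimensions are assumptions.

```python
# Minimal sketch (NumPy): one graph-convolution layer over invoice entities.
# Nodes are text boxes, edges connect spatial neighbours; all sizes are toy values.
import numpy as np

def gcn_layer(adj, feats, weight):
    """A_hat @ X @ W with symmetric normalization (Kipf & Welling style)."""
    adj_hat = adj + np.eye(adj.shape[0])                # add self-loops
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(adj_hat.sum(axis=1)))
    norm_adj = deg_inv_sqrt @ adj_hat @ deg_inv_sqrt
    return np.maximum(norm_adj @ feats @ weight, 0.0)   # ReLU

n_boxes, in_dim, out_dim = 6, 32, 16
adj = np.random.randint(0, 2, (n_boxes, n_boxes))       # toy neighbour graph
adj = np.maximum(adj, adj.T)                            # make it symmetric
feats = np.random.randn(n_boxes, in_dim)                # per-box text/position embeddings
hidden = gcn_layer(adj, feats, np.random.randn(in_dim, out_dim))
```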

D. Lohani, A. Belaïd, Y. Belaïd
Reading Industrial Inspection Sheets by Inferring Visual Relations

The traditional mode of recording faults in heavy factory equipment has been via handmarked inspection sheets, wherein a machine engineer manually marks the faulty machine regions on a paper outline of the machine. Over the years, millions of such inspection sheets have been recorded and the data within these sheets has remained inaccessible. However, with industries going digital and waking up to the potential value of fault data for machine health monitoring, there is an increased impetus towards digitization of these handmarked inspection records. To target this digitization, we propose a novel visual pipeline combining state of the art deep learning models, with domain knowledge and low level vision techniques, followed by inference of visual relationships. Our framework is robust to the presence of both static and non-static background in the document, variability in the machine template diagrams, unstructured shape of graphical objects to be identified and variability in the strokes of handwritten text. The proposed pipeline incorporates a capsule and spatial transformer network based classifier for accurate text reading, and a customized CTPN [15] network for text detection in addition to hybrid techniques for arrow detection and dialogue cloud removal. We have tested our approach on a real world dataset of 50 inspection sheets for large containers and boilers. The results are visually appealing and the pipeline achieved an accuracy of 87.1% for text detection and 94.6% for text reading.

Rohit Rahul, Arindam Chowdhury, Animesh, Samarth Mittal, Lovekesh Vig
Learning to Clean: A GAN Perspective

In the big data era, the impetus to digitize the vast reservoirs of data trapped in unstructured scanned documents such as invoices, bank documents, courier receipts and contracts has gained fresh momentum. The scanning process often introduces artifacts such as salt-and-pepper/background noise, blur due to camera motion or shake, watermarks, coffee stains, wrinkles, or faded text. These artifacts pose many readability challenges to current text recognition algorithms and significantly degrade their performance. Existing learning-based denoising techniques require a dataset comprising noisy documents paired with cleaned versions of the same documents. In such scenarios, a model can be trained to generate clean documents from noisy versions. However, very often in the real world such a paired dataset is not available, and all we have for training a denoising model are unpaired sets of noisy and clean images. This paper explores the use of Generative Adversarial Networks (GANs) to generate denoised versions of noisy documents. Where paired information is available, we formulate the problem as an image-to-image translation task, i.e., translating a document from the noisy domain (background noise, blur, fading, watermarks) to a clean target document using a conditional GAN. In the absence of paired images for training, we employ CycleGAN, which learns a mapping between the distributions of noisy and denoised images from unpaired data, to achieve image-to-image translation for cleaning noisy documents. We compare the performance of CycleGAN for document cleaning using unpaired images with a conditional GAN trained on paired data from the same dataset. Experiments were performed on a public document dataset on which different types of noise were artificially induced; the results demonstrate that CycleGAN learns a more robust mapping from the space of noisy to clean documents.
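When no paired data are available, the training signal the abstract refers to is the cycle-consistency term; a compressed sketch of that term is shown below, with the generators deliberately left as trivial placeholders (any image-to-image networks would do).

```python
# Minimal sketch (PyTorch): the cycle-consistency part of a CycleGAN-style
# objective for noisy->clean document translation. G, F and the images are stand-ins.
import torch
import torch.nn as nn

G = nn.Conv2d(1, 1, 3, padding=1)    # noisy -> clean (placeholder generator)
F = nn.Conv2d(1, 1, 3, padding=1)    # clean -> noisy (placeholder generator)
l1 = nn.L1Loss()

noisy = torch.rand(2, 1, 128, 128)    # unpaired scanned pages
clean = torch.rand(2, 1, 128, 128)

cycle_loss = l1(F(G(noisy)), noisy) + l1(G(F(clean)), clean)
# Full objective = adversarial losses for both domains + lambda * cycle_loss.
cycle_loss.backward()
```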

Monika Sharma, Abhishek Verma, Lovekesh Vig
Deep Reader: Information Extraction from Document Images via Relation Extraction and Natural Language

Recent advancements in Computer Vision with state-of-the-art neural networks have boosted Optical Character Recognition (OCR) accuracies. However, extracting characters/text alone is often insufficient for relevant information extraction, as documents also have a visual structure that is not captured by OCR. Extracting information from tables, charts, footnotes, boxes and headings, and retrieving the corresponding structured representation of the document, remains a challenge and finds application in a large number of real-world use cases. In this paper, we propose a novel enterprise-oriented end-to-end framework called DeepReader which facilitates information extraction from document images via identification of visual entities and population of a meta relational model across the different entities in the document image. The model schema allows for an easy-to-understand abstraction of the entities detected by the deep vision models and the relationships between them. DeepReader has a suite of state-of-the-art vision algorithms which are applied to recognize handwritten and printed text, eliminate noisy effects, identify the type of document, and detect visual entities like tables, lines and boxes. DeepReader maps the extracted entities into a rich relational schema so as to capture all the relevant relationships between entities (words, textboxes, lines, etc.) detected in the document. Relevant information and fields can then be extracted from the document by writing SQL queries on top of the relationship tables. A natural language interface is added on top of the relational schema so that a non-technical user, specifying queries in natural language, can fetch the information with minimal effort. In this paper, we also demonstrate many other capabilities of DeepReader and report results on a real-world use case.

D. Vishwanath, Rohit Rahul, Gunjan Sehgal, Swati, Arindam Chowdhury, Monika Sharma, Lovekesh Vig, Gautam Shroff, Ashwin Srinivasan
Simultaneous Recognition of Horizontal and Vertical Text in Natural Images

Recent state-of-the-art scene text recognition methods have primarily focused on horizontal text in images. However, in several Asian countries, including China, large amounts of text in signs, books, and TV commercials are vertically directed. Because the horizontal and vertical texts exhibit different characteristics, developing an algorithm that can simultaneously recognize both types of text in real environments is necessary. To address this problem, we adopted the direction encoding mask (DEM) and selective attention network (SAN) methods based on supervised learning. DEM contains directional information to compensate in cases that lack text direction; therefore, our network is trained using this information to handle the vertical text. The SAN method is designed to work individually for both types of text. To train the network to recognize both types of text and to evaluate the effectiveness of the designed model, we prepared a new synthetic vertical text dataset and collected an actual vertical text dataset (VTD142) from the Web. Using these datasets, we proved that our proposed model can accurately recognize both vertical and horizontal text and can achieve state-of-the-art results in experiments using benchmark datasets, including the street view test (SVT), IIIT-5k, and ICDAR. Although our model is relatively simple as compared to its predecessors, it maintains the accuracy and is trained in an end-to-end manner.

Chankyu Choi, Youngmin Yoon, Junsu Lee, Junseok Kim

Artificial Intelligence for Retinal Image Analysis (AIRIA)

Frontmatter
Automatic Retinal and Choroidal Boundary Segmentation in OCT Images Using Patch-Based Supervised Machine Learning Methods

The assessment of retinal and choroidal thickness derived from spectral domain optical coherence tomography (SD-OCT) images is an important clinical and research task. Current OCT instruments allow the capture of densely sampled, high-resolution cross-sectional images of ocular tissues. The extensive nature of such datasets makes the manual delineation of tissue boundaries time-consuming and impractical, especially for large datasets of images. Therefore, the development of reliable and accurate methods to automatically segment tissue boundaries in OCT images is fundamental. In this work, two different deep learning methods, convolutional neural networks (CNNs) and recurrent neural networks (RNNs), are evaluated to calculate the probability that the retinal and choroidal boundaries of interest are located at a specific position within the SD-OCT images. The method is initially trained using small image patches centred on the three boundaries of interest. After that, the method can be used to provide a per-layer probability map that marks the most likely location of the boundaries. To convert each layer probability map into a boundary position, the map is subsequently traced using a graph-search method. The effect of the network architecture (CNN vs RNN), patch size, and image intensity compensation on the performance and subsequent boundary segmentation is presented. The results are compared with manual boundary segmentation as well as a previously proposed method based on standard image analysis techniques.

David Alonso-Caneiro, Jason Kugelman, Jared Hamwood, Scott A. Read, Stephen J. Vincent, Fred K. Chen, Michael J. Collins
Discrimination Ability of Glaucoma via DCNNs Models from Ultra-Wide Angle Fundus Images Comparing Either Full or Confined to the Optic Disc

We examined the difference in the ability to discriminate glaucoma between artificial intelligence models trained on a partial area surrounding the optic disc (Cropped) and on the whole area of an ultra-wide angle ocular fundus image (Full). 1677 normal fundus images and 950 glaucomatous fundus images captured with the Optos 200Tx (Optos PLC, Dunfermline, United Kingdom) in the Tsukazaki Hospital ophthalmology database were included in the study. A k-fold method (k = 5) and a convolutional neural network (VGG16) were used. For the full data set, the area under the curve (AUC) was 0.987 (95% CI 0.983–0.991), sensitivity was 0.957 (95% CI 0.942–0.969), and specificity was 0.947 (95% CI 0.935–0.957). For the cropped data set, AUC was 0.937 (95% CI 0.927–0.949), sensitivity was 0.868 (95% CI 0.845–0.889), and specificity was 0.894 (95% CI 0.878–0.908). The values of AUC, sensitivity, and specificity for the cropped data set were lower than those for the full data set. Our results show that the whole ultra-wide angle fundus provides more appropriate information to a neural network for the discrimination of glaucoma than the range limited to the periphery of the optic disc alone.

Hitoshi Tabuchi, Hiroki Masumoto, Shunsuke Nakakura, Asuka Noguchi, Hirotaka Tanabe
Synthesizing New Retinal Symptom Images by Multiple Generative Models

Age-Related Macular Degeneration (AMD) is an asymptomatic retinal disease which may result in loss of vision. There is limited access to high-quality relevant retinal images and poor understanding of the features defining sub-classes of this disease. Motivated by recent advances in machine learning, we explore the potential of generative modeling, using Generative Adversarial Networks (GANs) and style transfer, to facilitate clinical diagnosis and disease understanding through feature extraction. We design an analytic pipeline which first generates synthetic retinal images from clinical images; a subsequent verification step is then applied. In the synthesizing step we merge GANs (DCGAN and WGAN architectures) and style transfer for image generation, whereas the verification step controls the accuracy of the generated images. We find that the generated images contain sufficient pathological details to facilitate ophthalmologists' task of disease classification and the discovery of disease-relevant features. In particular, our system predicts the drusen and geographic atrophy sub-classes of AMD. Furthermore, classification performance using the GAN-generated CFP images outperforms classification based on the original clinical dataset alone. Our results are evaluated using an existing classifier of retinal diseases and class activation maps, supporting the predictive power of the synthetic images and their utility for feature extraction. Our code examples are available online ( https://github.com/huckiyang/EyeNet-GANs ).

Yi-Chieh Liu, Hao-Hsiang Yang, C.-H. Huck Yang, Jia-Hong Huang, Meng Tian, Hiromasa Morikawa, Yi-Chang James Tsai, Jesper Tegnèr
Retinal Detachment Screening with Ensembles of Neural Network Models

Rhegmatogenous retinal detachment is an important condition that should be diagnosed early. A previous study showed that normal eyes and eyes with rhegmatogenous retinal detachment could be distinguished using pseudo-color ocular fundus images obtained with the Optos camera. However, no study has used pseudo-color fundus images to distinguish eyes without retinal detachment (not necessarily normal) from those with rhegmatogenous retinal detachment. Furthermore, the previous study used a single neural network with only three layers. In the current study, we trained and validated an ensemble of deep neural networks on ultra-wide-field pseudocolor images to distinguish non-retinal-detachment eyes (not necessarily normal) from rhegmatogenous retinal detachment eyes. The study included 600 non-retinal-detachment, 693 bullous rhegmatogenous retinal detachment, and 125 non-bullous rhegmatogenous retinal detachment images. The sensitivity and specificity of the ensemble model (five models) were 97.3% and 91.5%, respectively. In sum, this study demonstrated promising results for a screening system for rhegmatogenous retinal detachment with high sensitivity and relatively high specificity.
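The "ensemble model (five models)" can be read as simple probability averaging over the member networks; a toy sketch under that assumption, with placeholder models and an assumed decision threshold:

```python
# Minimal sketch: average the detachment probabilities of five trained models
# and threshold the mean. The models and the 0.5 threshold are placeholders.
import numpy as np

def ensemble_predict(models, image, threshold=0.5):
    probs = [m(image) for m in models]         # each model returns P(retinal detachment)
    mean_prob = float(np.mean(probs))
    return int(mean_prob >= threshold), mean_prob

models = [lambda img, b=b: b for b in (0.9, 0.8, 0.95, 0.7, 0.85)]   # stand-in models
label, mean_prob = ensemble_predict(models, image=None)
print(label, mean_prob)    # e.g. 1 0.84
```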

Hiroki Masumoto, Hitoshi Tabuchi, Shoto Adachi, Shunsuke Nakakura, Hideharu Ohsugi, Daisuke Nagasato
Recent Developments of Retinal Image Analysis in Alzheimer’s Disease and Potential AI Applications

Alzheimer's disease (AD) is the most common progressive neurodegenerative illness and cause of dementia in the elderly. The critical barriers to primary prevention in AD are the lack of rapid, non-invasive, sensitive and low-cost biomarkers. As the eye and brain share essential structural and pathogenic pathways, non-invasive eye biomarkers could be identified to obtain new insights into the onset and progression of AD and its complications in the eye. In this short review, recent developments in retinal image analysis in AD and potential artificial intelligence (AI) applications are presented. Some approaches are still very much novel research techniques; others are more established and transitioning into the clinical diagnostic arena. Together they provide us with the capability to move AD detection research forward by using novel peripheral biomarkers.

Delia Cabrera DeBuc, Edmund Arthur
Intermediate Goals in Deep Learning for Retinal Image Analysis

End-to-end deep learning has been demonstrated to exhibit human-level performance in many retinal image analysis tasks. However, such models’ generalizability to data from new sources may be less than optimal. We highlight some benefits of introducing intermediate goals in deep learning-based models.

Gilbert Lim, Wynne Hsu, Mong Li Lee
Enhanced Detection of Referable Diabetic Retinopathy via DCNNs and Transfer Learning

A clinically acceptable deep learning system (DLS) has been developed for the detection of diabetic retinopathy by the Singapore Eye Research Institute. For its utility in a national screening programme, further enhancement was needed. With newer deep convolutional neural networks (DCNNs) being introduced and techniques such as transfer learning gaining recognition for better performance, this paper compares the performance of the DCNN used in the original DLS, VGGNet, with newer DCNNs, ResNet and an ensemble, combined with transfer learning. The DLS performance improved, with higher AUC, sensitivity and specificity, upon adoption of the newer DCNNs and transfer learning.

Michelle Yuen Ting Yip, Zhan Wei Lim, Gilbert Lim, Nguyen Duc Quang, Haslina Hamzah, Jinyi Ho, Valentina Bellemo, Yuchen Xie, Xin Qi Lee, Mong Li Lee, Wynne Hsu, Tien Yin Wong, Daniel Shu Wei Ting
Generative Adversarial Networks (GANs) for Retinal Fundus Image Synthesis

The lack of access to large annotated datasets and legal concerns regarding patient privacy are limiting factors for many applications of deep learning in the retinal image analysis domain. Therefore the idea of generating synthetic retinal images, indiscernible from real data, has gained more interest. Generative adversarial networks (GANs) have proven to be a valuable framework for producing synthetic databases of anatomically consistent retinal fundus images, and they have attracted particular interest in ophthalmology. We discuss here the potential advantages and the limitations that need to be addressed before GANs can be widely adopted for retinal imaging.

Valentina Bellemo, Philippe Burlina, Liu Yong, Tien Yin Wong, Daniel Shu Wei Ting
AI-based AMD Analysis: A Review of Recent Progress

Since 2016 much progress has been made in the automatic analysis of age-related macular degeneration (AMD). Much of it was dedicated to the classification of referable vs. non-referable AMD, fine-grained AMD severity classification, and assessing the five-year risk of progression to the severe form of AMD. Here we review these developments, the main tasks that were addressed, and the main methods that were employed.

P. Burlina, N. Joshi, N. M. Bressler
Artificial Intelligence Using Deep Learning in Classifying Side of the Eyes and Width of Field for Retinal Fundus Photographs

As the application of deep learning (DL) advances in the healthcare sector, the need for simultaneous, multi-annotated databases of medical images for evaluating novel DL systems grows. This study looked at DL algorithms that distinguish retinal images by the side of the eye (left or right) as well as by field positioning (macula-centred or optic-disc-centred) and evaluated these algorithms against a large dataset comprising 7,953 images from multi-ethnic populations. For these convolutional neural networks, the L/R model and the Mac/OD model, a high AUC (0.978, 0.990), sensitivity (95.9%, 97.6%), specificity (95.5%, 96.7%) and accuracy (95.7%, 97.2%) were found, respectively, on the primary validation sets. The models also showed high performance on the external validation database.

Valentina Bellemo, Michelle Yuen Ting Yip, Yuchen Xie, Xin Qi Lee, Quang Duc Nguyen, Haslina Hamzah, Jinyi Ho, Gilbert Lim, Dejiang Xu, Mong Li Lee, Wynne Hsu, Renata Garcia-Franco, Geeta Menon, Ecosse Lamoureux, Ching-Yu Cheng, Tien Yin Wong, Daniel Shu Wei Ting
OCT Segmentation via Deep Learning: A Review of Recent Work

Optical coherence tomography (OCT) is an important retinal imaging method since it is a non-invasive, high-resolution imaging technique and is able to reveal the fine structure within the human retina. It has applications for retinal as well as neurological disease characterization and diagnostics. The use of machine learning techniques for analyzing the retinal layers and lesions seen in OCT can greatly facilitate such diagnostics tasks. The use of deep learning (DL) methods principally using fully convolutional networks has recently resulted in significant progress in automated segmentation of optical coherence tomography. Recent work in that area is reviewed herein.

M. Pekala, N. Joshi, T. Y. Alvin Liu, N. M. Bressler, D. Cabrera DeBuc, P. Burlina
Auto-classification of Retinal Diseases in the Limit of Sparse Data Using a Two-Streams Machine Learning Model

Automatic clinical diagnosis of retinal diseases has emerged as a promising approach to facilitate discovery in areas with limited access to specialists. Based on the fact that fundus structure and vascular disorders are the main characteristics of retinal diseases, we propose a novel visual-assisted diagnosis hybrid model mixing a support vector machine (SVM) and deep neural networks (DNNs). Furthermore, we present a new collection of clinical retina labels, sorted by a professional ophthalmologist from the educational project Retina Image Bank and called EyeNet, incorporating 52 retinal disease classes. Using EyeNet, our model achieves 90.40% diagnosis accuracy, and its performance is comparable to that of professional ophthalmologists ( https://github.com/huckiyang/EyeNet2 ).

C.-H. Huck Yang, Fangyu Liu, Jia-Hong Huang, Meng Tian, M. D. I-Hung Lin, Yi Chieh Liu, Hiromasa Morikawa, Hao-Hsiang Yang, Jesper Tegnèr

First International Workshop on Advanced Machine Vision for Real-Life and Industrially Relevant Applications (AMV)

Frontmatter
LoANs: Weakly Supervised Object Detection with Localizer Assessor Networks

Recently, deep neural networks have achieved remarkable performance on the task of object detection and recognition. The reason for this success is mainly grounded in the availability of large scale, fully annotated datasets, but the creation of such a dataset is a complicated and costly task. In this paper, we propose a novel method for weakly supervised object detection that simplifies the process of gathering data for training an object detector. We train an ensemble of two models that work together in a student-teacher fashion. Our student (localizer) is a model that learns to localize an object, the teacher (assessor) assesses the quality of the localization and provides feedback to the student. The student uses this feedback to learn how to localize objects and is thus entirely supervised by the teacher, as we are using no labels for training the localizer. In our experiments, we show that our model is very robust to noise and reaches competitive performance compared to a state-of-the-art fully supervised approach. We also show the simplicity of creating a new dataset, based on a few videos (e.g. downloaded from YouTube) and artificially generated data.

Christian Bartz, Haojin Yang, Joseph Bethge, Christoph Meinel
Reaching Behind Specular Highlights by Registration of Two Images of Broiler Viscera

The manual postmortem inspection of broilers and their viscera is becoming a bottleneck as the slaughter rate increases. Computer vision can assist veterinarians during the inspection, but specular highlights can hide crucial details when inspecting for diseases on the viscera set. This study aims to restore details behind these specular highlights by capturing two images of the same viscera using shifting light positions. The dataset consists of images captured in-line at a poultry processing plant. The method achieves an average SSIM score of 0.96 over a test set of 100 image sets. The result is visually pleasing images with correct textural information instead of specular highlights.

Anders Jørgensen, Malte Pedersen, Rikke Gade, Jens Fagertun, Thomas B. Moeslund
Anomaly Detection Using GANs for Visual Inspection in Noisy Training Data

The detection and quantification of anomalies in image data are critical tasks in industrial scenes, such as detecting micro scratches on products. In recent years, due to the difficulty of defining anomalies and the limits of correcting their labels, research on unsupervised anomaly detection using generative models has attracted attention. Generally, in those studies, only normal images are used for training to model the distribution of normal images. The model measures the anomalies in the target images by reproducing the most similar images and scoring image patches according to their fit to the learned distribution. This approach is based on a strong presumption: the trained model should not be able to generate abnormal images. However, in reality, the model can generate abnormal images, mainly due to noisy normal data which include small abnormal pixels, and such noise severely affects the accuracy of the model. Therefore, we propose a novel anomaly detection method that distorts the distribution of the model with existing abnormal images. The proposed method detects pixel-level micro anomalies with high accuracy from 1024×1024 high-resolution images which are actually used in an industrial scene. In this paper, we share experimental results on open datasets, due to the confidentiality of the data.

Masanari Kimura, Takashi Yanagihara
Integration of Driver Behavior into Emotion Recognition Systems: A Preliminary Study on Steering Wheel and Vehicle Acceleration

The current development of emotion recognition systems in cars is mostly focused on camera-based solutions which consider the face as the main input data source. Modeling the behavior of the driver in the automotive domain is also a challenging topic which has a great impact on developing intelligent and autonomous vehicles. In order to study the correlation between driving behavior and the emotional status of the driver, we propose a multimodal system based on facial expressions and driver-specific behavior, including steering wheel usage and changes in vehicle acceleration. The aim of this work is to investigate the impact of integrating driver behavior into emotion recognition systems and to build a structure which continuously classifies emotions in an efficient and non-intrusive manner. We consider driver behavior to be the typical range of interactions with the vehicle which represents the responses to certain stimuli. To recognize facial emotions, we extract histogram values from key facial regions and combine them into a single vector which is then used to train an SVM classifier. Following that, using machine learning techniques and statistical methods, two modules are built: an abrupt car maneuver counter, based on steering wheel rotation, and an aggressive driver predictor, based on variations in acceleration. In the end, all three modules are combined into one final emotion classifier which is capable of predicting the emotional group of the driver with 94% accuracy in sub-samples. For the evaluation we used a real car simulator with 8 different participants as drivers.
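The facial-expression branch described above (region histograms concatenated into one vector, then an SVM) could look roughly like the sketch below; the choice of regions, bin count and emotion labels are assumptions for illustration only.

```python
# Minimal sketch (scikit-learn): concatenate histograms from key facial regions
# into a single feature vector and train an SVM emotion classifier.
import numpy as np
from sklearn.svm import SVC

def face_feature(regions, bins=32):
    """regions: list of grayscale patches (eyes, mouth, ...); returns one vector."""
    hists = [np.histogram(r, bins=bins, range=(0, 255))[0] for r in regions]
    return np.concatenate(hists).astype(float)

rng = np.random.default_rng(0)
X = np.stack([face_feature([rng.integers(0, 256, (24, 24)) for _ in range(3)])
              for _ in range(40)])
y = rng.integers(0, 3, 40)             # e.g. 0=neutral, 1=happy, 2=angry (toy labels)
clf = SVC(kernel="rbf").fit(X, y)
```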

Sina Shafaei, Tahir Hacizade, Alois Knoll
Prediction Based Deep Autoencoding Model for Anomaly Detection

Latent variables and the reconstruction error generated by an autoencoder are common means of anomaly detection for high-dimensional signals. Both are typical representations of the original input, and plenty of methods utilizing them for anomaly detection have achieved good results. In this paper, we propose a new method that combines these two features to generate proper scores for anomaly detection. As both features contain useful information contributing to anomaly detection, good results can be expected from their fusion. The architecture proposed in this paper comprises two networks, and we only use normal data for training. To compress and rebuild an input, a deep autoencoder (AE) is utilized, from which low-dimensional latent variables and the reconstruction error can be obtained, and a compactness loss is introduced on the latent variables to keep the intra-variance low. Meanwhile, a multi-layer perceptron (MLP) network which takes the generated latent variables as input is established to predict the corresponding reconstruction error. By introducing the MLP network, anomalies sharing a similar reconstruction error yet a different distribution of latent variables to normal data, or vice versa, can be separated. These two networks, AE and MLP, are trained jointly in our model, and the prediction error from the MLP network is used as the final score for anomaly detection. Experiments on several benchmarks including image and multivariate datasets demonstrate the effectiveness and practicability of this new approach in comparison with several up-to-date algorithms.
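A compressed sketch of the two-network arrangement described above, an autoencoder producing latent variables and reconstruction error plus an MLP trained to predict that error from the latents, with all dimensions and loss weights chosen arbitrarily:

```python
# Minimal sketch (PyTorch): autoencoder + MLP that predicts the per-sample
# reconstruction error from the latent code; the MLP's prediction error
# serves as the anomaly score. Sizes and loss weights are assumptions.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(100, 16), nn.ReLU(), nn.Linear(16, 4))
dec = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 100))
mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

x = torch.randn(32, 100)                                     # normal training samples
z = enc(x)
recon_err = ((dec(z) - x) ** 2).mean(dim=1, keepdim=True)    # per-sample error
compactness = z.var(dim=0).mean()                            # keep latent intra-variance low
pred_err = mlp(z)
loss = recon_err.mean() + compactness + ((pred_err - recon_err.detach()) ** 2).mean()
loss.backward()
# At test time, |mlp(enc(x)) - recon_err(x)| is used as the anomaly score.
```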

Zhanzhong Pang, Xiaoyi Yu, Jun Sun, Inakoshi Hiroya
Multimodal Sensor Fusion in Single Thermal Image Super-Resolution

With the fast growth in the visual surveillance and security sectors, thermal infrared images have become increasingly necessary in a large variety of industrial applications, even though IR sensors are still more expensive than RGB counterparts of the same resolution. In this paper, we propose a deep learning solution to enhance thermal image resolution. The following results are given: (I) Introduction of a multimodal, visual-thermal fusion model that addresses thermal image super-resolution by integrating high-frequency information from the visual image. (II) Investigation of different network architecture schemes in the literature, their up-sampling methods, learning procedures, and their optimization functions, showing their beneficial contribution to the super-resolution problem. (III) A benchmark ULB17-VT dataset that contains thermal images and their visual image counterparts. (IV) A qualitative evaluation on a large test set with 58 samples and 22 raters which shows that our proposed model performs better than state-of-the-art methods.

Feras Almasri, Olivier Debeir
PCA-RECT: An Energy-Efficient Object Detection Approach for Event Cameras

We present the first purely event-based, energy-efficient approach for object detection and categorization using an event camera. Compared to traditional frame-based cameras, choosing event cameras results in high temporal resolution (order of microseconds), low power consumption (few hundred mW) and wide dynamic range (120 dB) as attractive properties. However, event-based object recognition systems are far behind their frame-based counterparts in terms of accuracy. To this end, this paper presents an event-based feature extraction method devised by accumulating local activity across the image frame and then applying principal component analysis (PCA) to the normalized neighborhood region. Subsequently, we propose a backtracking-free k-d tree mechanism for efficient feature matching by taking advantage of the low-dimensionality of the feature representation. Additionally, the proposed k-d tree mechanism allows for feature selection to obtain a lower-dimensional dictionary representation when hardware resources are limited to implement dimensionality reduction. Consequently, the proposed system can be realized on a field-programmable gate array (FPGA) device leading to high performance over resource ratio. The proposed system is tested on real-world event-based datasets for object categorization, showing superior classification performance and relevance to state-of-the-art algorithms. Additionally, we verified the object detection method and real-time FPGA performance in lab settings under non-controlled illumination conditions with limited training data and ground truth annotations.
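As a rough illustration of the descriptor-matching stage only (not the authors' FPGA pipeline or their backtracking-free k-d tree), PCA over normalized event-count neighbourhoods followed by nearest-neighbour lookup in a k-d tree can be sketched as follows; patch size, PCA dimension and dictionary size are assumptions.

```python
# Minimal sketch (NumPy/scikit-learn/SciPy): PCA on normalized event-count
# neighbourhoods, then k-d tree matching against a learned dictionary.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
patches = rng.poisson(2.0, size=(500, 7 * 7)).astype(float)      # local event counts
patches /= np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8  # normalize neighbourhoods

pca = PCA(n_components=8).fit(patches)
descriptors = pca.transform(patches)

dictionary = descriptors[:64]                  # stand-in codebook (e.g. k-means centres)
tree = cKDTree(dictionary)                     # the paper uses a custom backtracking-free
_, nearest_codeword = tree.query(descriptors)  # k-d tree; cKDTree is only a stand-in
```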

Bharath Ramesh, Andrés Ussa, Luca Della Vedova, Hong Yang, Garrick Orchard
Unconstrained Iris Segmentation Using Convolutional Neural Networks

The extraction of consistent and identifiable features from an image of the human iris is known as iris recognition. Identifying which pixels belong to the iris, known as segmentation, is the first stage of iris recognition, and errors in segmentation propagate to later stages. Current segmentation approaches are tuned to specific environments. We propose using a convolutional neural network for iris segmentation. Our algorithm is accurate when trained on a single environment and tested on multiple environments. Our network builds on the Mask R-CNN framework (He et al. ICCV 2017) and segments faster than previous approaches, including the Mask R-CNN network. It remains accurate when trained on a single environment and tested with a different sensor (either visible light or near-infrared), although its accuracy degrades when trained with a visible light sensor and tested with a near-infrared sensor (and vice versa). A small amount of retraining of the visible light model (using a few samples from a near-infrared dataset) yields a tuned network accurate in both settings. For training and testing, this work uses the Casia v4 Interval, Notre Dame 0405, Ubiris v2, and IITD datasets.

Sohaib Ahmad, Benjamin Fuller
Simultaneous Multi-view Relative Pose Estimation and 3D Reconstruction from Planar Regions

In this paper, we propose a novel solution for multi-view reconstruction, relative pose and homography estimation using planar regions. The proposed method doesn't require point matches: it directly uses a pair of planar image regions and simultaneously reconstructs the normal and distance of the corresponding 3D planar surface patch, the relative pose of the cameras, and the aligning homography between the image regions. When more than two cameras are available, a special region-based bundle adjustment is proposed, which provides robust estimates in a multi-view camera system by constructing and solving a non-linear system of equations. The method is quantitatively evaluated on a large synthetic dataset as well as on the KITTI vision benchmark dataset.

Robert Frohlich, Zoltan Kato
WNet: Joint Multiple Head Detection and Head Pose Estimation from a Spectator Crowd Image

Crowd image analysis has various application areas such as surveillance, crowd management and augmented reality. Existing techniques can detect multiple faces in a single crowd image, but small head/face sizes and additional non-facial regions in the head bounding box make head detection (HD) challenging. Additionally, in existing head pose estimation (HPE) for multiple heads in an image, each individually cropped head image is passed through a network one by one, instead of estimating the poses of multiple heads at the same time. The proposed WNet performs both HD and HPE jointly on multiple heads in a single crowd image, in a single pass. Experiments are demonstrated on the spectator crowd S-HOCK dataset and results are compared with the HPE benchmarks. WNet uses fewer training images than the number of cropped images used by the benchmarks, and does not utilize weights transferred from other networks. WNet not only performs HPE but performs joint HD and HPE efficiently, i.e., it achieves higher accuracy for a larger number of heads while depending on fewer testing images than the benchmarks.

Yasir Jan, Ferdous Sohel, Mohd Fairuz Shiratuddin, Kok Wai Wong
Markerless Augmented Advertising for Sports Videos

Markerless augmented reality can be a challenging computer vision task, especially in live broadcast settings and in the absence of information related to the video capture such as the intrinsic camera parameters. This typically requires the assistance of a skilled artist, along with the use of advanced video editing tools in a post-production environment. We present an automated video augmentation pipeline that identifies textures of interest and overlays an advertisement onto these regions. We constrain the advertisement to be placed in a way that is aesthetic and natural. The aim is to augment the scene such that there is no longer a need for commercial breaks. In order to achieve seamless integration of the advertisement with the original video we build a 3D representation of the scene, place the advertisement in 3D, and then project it back onto the image plane. After successful placement in a single frame, we use homography-based, shape-preserving tracking such that the advertisement appears perspective correct for the duration of a video clip. The tracker is designed to handle smooth camera motion and shot boundaries.

Hallee E. Wong, Osman Akar, Emmanuel Antonio Cuevas, Iuliana Tabian, Divyaa Ravichandran, Iris Fu, Cambron Carter
Visual Siamese Clustering for Cosmetic Product Recommendation

We investigate the problem of a visual similarity-based recommender system, where cosmetic products are recommended based on the preferences of people who share similarity of visual features. In this work we train a Siamese convolutional neural network, using our own dataset of cropped eye regions from images of 91 female subjects, such that it learns to output feature vectors that place images of the same subject close together in high-dimensional space. We evaluate the trained network based on its ability to correctly identify existing subjects from unseen images, and then assess its capability to find visually similar matches amongst the existing subjects when an image of a new subject is input.
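The Siamese objective described here (images of the same subject pulled together in the embedding space) is commonly trained with a contrastive loss; the sketch below is written under that assumption, with the backbone, embedding size and margin all chosen arbitrarily.

```python
# Minimal sketch (PyTorch): Siamese embedding of eye-region crops with a
# contrastive loss. Backbone, embedding size and margin are assumptions.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                      nn.Linear(128, 64))              # shared-weight branch

def contrastive_loss(a, b, same, margin=1.0):
    d = torch.norm(embed(a) - embed(b), dim=1)          # pairwise embedding distance
    return (same * d ** 2 + (1 - same) * torch.clamp(margin - d, min=0) ** 2).mean()

eye_a = torch.rand(16, 3, 32, 32)
eye_b = torch.rand(16, 3, 32, 32)
same_subject = torch.randint(0, 2, (16,)).float()       # 1 if both crops show one subject
contrastive_loss(eye_a, eye_b, same_subject).backward()
```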

Christopher J. Holder, Boguslaw Obara, Stephen Ricketts
Multimodal Deep Neural Networks Based Ensemble Learning for X-Ray Object Recognition

X-ray object recognition is essential to reduce the workload of human inspectors in X-ray baggage screening and to improve the throughput of X-ray screening. Traditionally, researchers focused on single-view or multi-view object recognition from only one type of pseudo X-ray image generated from X-ray energy data (e.g., a dual-energy or mono-energy X-ray image). It is known that different types of X-ray images represent different object characteristics (e.g., material or density). Thus, effectively using different types of X-ray images as multiple modalities is promising for achieving more reliable recognition performance. In this paper, we explore ensemble approaches at different stages for X-ray object recognition and propose an approach that exploits a classifier ensemble by using the multimodality information of X-ray images of a single-view object. We use a deep neural network to learn a good representation for each modality, which is used to train the base classifiers. To ensure high overall classification performance, the reliabilities of the base classifiers are estimated by taking the inherent features (e.g., color and shape) of an object in an X-ray image into consideration. We conducted experiments to evaluate the competitive performance of our method using a 15-class dataset.

Quan Kong, Naoto Akira, Bin Tong, Yuki Watanabe, Daisuke Matsubara, Tomokazu Murakami
Backmatter
Metadata
Title
Computer Vision – ACCV 2018 Workshops
Edited by
Prof. Gustavo Carneiro
Dr. Shaodi You
Copyright Year
2019
Electronic ISBN
978-3-030-21074-8
Print ISBN
978-3-030-21073-1
DOI
https://doi.org/10.1007/978-3-030-21074-8
