
## About this book

This 8-volume set constitutes the refereed proceedings of the 25th International Conference on Pattern Recognition Workshops, ICPR 2020, held virtually in Milan, Italy, and rescheduled to January 10-11, 2021, due to the Covid-19 pandemic. The 416 full papers presented in these 8 volumes were carefully reviewed and selected from about 700 submissions. The 46 workshops cover a wide range of areas including machine learning, pattern analysis, healthcare, human behavior, environment, surveillance, forensics and biometrics, robotics and egovision, cultural heritage and document analysis, retrieval, and women at ICPR2020.

## Table of contents

### Fine-Tuning for One-Look Regression Vehicle Counting in Low-Shot Aerial Datasets

We investigate the task of entity counting in overhead imagery from the perspective of re-purposing representations learned from ground imagery, e.g., ImageNet, via feature adaptation. We explore two directions of feature adaptation and analyze their performance using two popular aerial datasets for vehicle counting: PUCPR+ and CARPK. First, we explore proxy self-supervision tasks such as RotNet, jigsaw, and image inpainting to refine the pretrained representation. Second, we insert additional network layers to adaptively select suitable features (e.g., squeeze and excitation blocks) or impose desired properties (e.g., using active rotating filters for rotation invariance). Our experimental results show that different adaptations produce different amounts of performance improvement depending on data characteristics. Overall, we achieve a mean absolute error (MAE) of 3.71 and 5.93 on the PUCPR+ and CARPK datasets, respectively, outperforming the previous state of the art: MAEs of 5.24 for PUCPR+ and 7.48 for CARPK.

Aneesh Rangnekar, Yi Yao, Matthew Hoffman, Ajay Divakaran
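
The mean absolute error (MAE) used to compare counting methods in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the function name and the per-image counts are made-up values chosen only to show the arithmetic.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical per-image vehicle counts (not taken from PUCPR+ or CARPK).
predicted_counts = [12, 30, 7, 45]
true_counts = [10, 33, 7, 41]

mae = mean_absolute_error(predicted_counts, true_counts)
print(f"MAE = {mae:.2f}")
```

A lower MAE means the predicted counts are, on average, closer to the annotated counts per image.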

### Generative Data Augmentation for Vehicle Detection in Aerial Images

Scarcity of training data is one of the prominent problems for deep networks, which require large amounts of data. Data augmentation is a widely used method to increase the number of training samples and their variations. In this paper, we focus on improving vehicle detection performance in aerial images and propose a generative augmentation method which does not need any supervision beyond the bounding box annotations of the vehicle objects in the training dataset. The proposed method increases the performance of vehicle detection by allowing detectors to be trained with a higher number of instances, especially when there is a limited number of training instances. The proposed method is generic in the sense that it can be integrated with different generators. The experiments show that the method increases the Average Precision by up to 25.2% and 25.7% when integrated with Pluralistic and DeepFill respectively.

Hilmi Kumdakcı, Cihan Öngün, Alptekin Temizel

### City-Scale Point Cloud Stitching Using 2D/3D Registration for Large Geographical Coverage

3D city-scale point cloud stitching is a critical component for large data collection and environment change detection, in which massive amounts of 3D data are captured at different times and under different conditions. This paper proposes a novel point cloud stitching approach that automatically and accurately stitches multiple city-scale point clouds, which only share relatively small overlapping areas, into one single model for a larger geographical coverage. The proposed method first employs 2D image mosaicking techniques to estimate 3D overlapping areas among multiple point clouds, then applies 3D point cloud registration techniques to estimate the most accurate transformation matrix for 3D stitching. The proposed method is quantitatively evaluated on a city-scale reconstructed point cloud dataset and a real-world city LiDAR dataset, on which our method outperforms competing methods by significant margins and achieves the highest precision score, recall score, and F-score. Our method makes an important step towards automatic and accurate city-scale point cloud data stitching, which could be used in a variety of applications.

Shizeng Yao, Hadi AliAkbarpour, Guna Seetharaman, Kannappan Palaniappan

### An Efficient and Reasonably Simple Solution to the Perspective-Three-Point Problem

In this work, we propose an efficient and simple method for solving the perspective-three-point (P3P) problem. This algorithm leans substantially on linear algebra, in which the rotation matrix and translation vector are parameterized as linear combinations of known vectors with particular coefficients. We also show how to avoid degeneracy when performing this algorithm. Moreover, we present an approach to roughly remove invalid solutions based on the orthogonal property of the rotation matrix. The proposed method is simple to implement and easy to understand, with improved results demonstrating that it is competitive with the leading methods in accuracy, but with reduced computational requirements.

Qida Yu, Guili Xu, Jiachen Shi

### Modeling and Simulation Framework for Airborne Camera Systems

Full Motion Video (FMV) and Wide Area Motion Imagery (WAMI) systems have been used extensively in recent years to collect overhead imagery to produce value-added geospatial products. In order to better understand the limitations of these systems and to define the best flight conditions under which they can be operated, an integrated modeling and simulation (M&S) framework named Sensair was developed. This paper presents how this tool can be leveraged to simulate data collections with airborne camera systems using different parameters under varied conditions. Sensair can simulate realistic large-scale environments with moving vehicles and generate representative FMV and WAMI high-resolution imagery. The built-in interactive analysis tools enable the study of the different factors impacting the application of computer vision algorithms to aerial images. This paper presents Sensair’s M&S capabilities and describes use cases where the framework is used to perform tasks that include 3D reconstruction and the generation of an image dataset to support the evaluation of vehicle detection approaches.

Marc-Antoine Drouin, Jonathan Fournier, Jonathan Boisvert, Louis Borgeat

### On the Development of a Classification Based Automated Motion Imagery Interpretability Prediction

Motion imagery interpretability is commonly represented by the Video National Imagery Interpretability Rating Scale (VNIIRS), which is a subjective metric based on human analysts’ visual assessment. Therefore, VNIIRS rating is a very time-consuming task. This paper presents the development of a fully automated motion imagery interpretability prediction, called AMIIP. AMIIP employs a three-dimensional convolutional neural network (3D-CNN) that accepts as inputs many video blocks (small image sequences) extracted from motion imagery, and outputs the label classification for each video block. The result is a histogram of the labels/categories that is then used to estimate the interpretability of the motion imagery. Each training video clip is labeled based on its subjectively rated VNIIRS level; thus, the required human annotation of imagery for training data is minimized. Using a collection of 76 high-definition aerial video clips, three preliminary experiments indicate that the estimation error is within 0.5 on the VNIIRS rating scale.

Hua-mei Chen, Genshe Chen, Erik Blasch

### Remote Liveness and Heart Rate Detection from Video

The remote detection of liveness is critical for senior and baby care, disaster response, the military, and law enforcement. Existing solutions are mostly based on special sensor hardware or the spectral signature of living skin. This paper uses commercial electro-optical and infrared (EO/IR) sensors to capture a very short video for low-cost and fast liveness detection. The key components of our system include: tiny human body and face detection from long-range and low-resolution video, and remote liveness detection based on micro-motion from a short human body and face video. These micro-motions are caused by breathing and heartbeat. A deep learning architecture is designed for remote body and face detection. A novel algorithm is proposed for adaptive sensor and background noise cancellation. An air platform motion compensation algorithm is tested on video data collected on a drone. The key advantages are: low cost, requires very short video, works with many parts of a human body even when skin is not visible, works on any motion caused by eyes, mouth, heartbeat, breathing, or body parts, and works in all lighting conditions. To the author’s best knowledge, this is the first work on video micro-motion based liveness detection on a moving platform and from a long standoff range of 100 m. Once a subject is deemed alive, video-based remote heart rate detection is applied to assess the physiological and psychological state of the subject. This is also the first work on outdoor remote heart rate detection from a long standoff range of 100 m. In an evaluation on the publicly available indoor COHFACE dataset, our heart rate estimation algorithm outperforms all published work on the same dataset.

Yunbin Deng

### RADARSAT-2 Synthetic-Aperture Radar Land Cover Segmentation Using Deep Convolutional Neural Networks

Synthetic Aperture Radar (SAR) imagery captures the physical properties of the Earth by transmitting microwave signals to its surface and analyzing the backscattered signal. It does not depend on sunlight and therefore can be obtained in any condition, such as nighttime and cloudy weather. However, SAR images are noisier than light images, and it is so far unclear what level of performance a modern recognition system could achieve. This work presents an analysis of the performance of deep learning models for the task of land segmentation using SAR images. We present segmentation results on the task of classifying four different land categories (urban, water, vegetation and farm) on six Canadian sites (Montreal, Ottawa, Quebec, Saskatoon, Toronto and Vancouver), with three state-of-the-art deep learning segmentation models. Results show that when enough data and variety in land appearance are available, deep learning models can achieve excellent performance despite the high input noise.

### Deep Learning Based Domain Adaptation with Data Fusion for Aerial Image Data Analysis

Current Artificial Intelligence (AI) machine learning approaches perform well with similar sensors for data collection, training, and testing. The ability to learn and analyze data from multiple sources would enhance the capabilities of AI systems. This paper presents a deep learning-based multi-source self-correcting approach to fuse data with different modalities. The data-level fusion approach maximizes the capability to detect unanticipated events/targets augmented with machine learning methods. The proposed Domain Adaptation for Efficient Learning Fusion (DAELF) deep neural network adapts to changes of the input distribution, allowing for self-correction of multiple-source classification and fusion. When supported by a distributed computing hierarchy, the proposed DAELF scales up in neural network size and out in geographical span. The design of DAELF includes various types of data fusion, including decision-level and feature-level data fusion. The results of DAELF highlight that feature-level fusion outperforms other approaches in terms of classification accuracy for the digit data and the Aerial Image Data analysis.

Jingyang Lu, Chenggang Yu, Erik Blasch, Roman Ilin, Hua-mei Chen, Dan Shen, Nichole Sullivan, Genshe Chen, Robert Kozma

### SqueezeFacePoseNet: Lightweight Face Verification Across Different Poses for Mobile Platforms

Ubiquitous and real-time person authentication has become critical after the breakthrough of all kinds of services provided via mobile devices. In this context, face technologies can provide reliable and robust user authentication, given the availability of cameras in these devices, as well as their widespread use in everyday applications. The rapid development of deep Convolutional Neural Networks (CNNs) has resulted in many accurate face verification architectures. However, their typical size (hundreds of megabytes) makes them infeasible to incorporate in downloadable mobile applications where the entire file typically may not exceed 100 Mb. Accordingly, we address the challenge of developing a lightweight face recognition network of just a few megabytes that can operate with sufficient accuracy in comparison to much larger models. The network should also be able to operate under different poses, given the variability naturally observed in uncontrolled environments where mobile devices are typically used. In this paper, we adapt the lightweight SqueezeNet model, of just 4.4 MB, to effectively provide cross-pose face recognition. After training on the MS-Celeb-1M and VGGFace2 databases, our model achieves an EER of 1.23% on the difficult frontal vs. profile comparison, and 0.54% on profile vs. profile images. Under less extreme variations involving frontal images in any of the enrolment/query image pairs, EER is pushed down to <0.3%, and the FRR at FAR = 0.1% to less than 1%. This makes our light model suitable for face recognition where at least acquisition of the enrolment image can be controlled. At the cost of a slight degradation in performance, we also test an even lighter model (of just 2.5 MB) where regular convolutions are replaced with depth-wise separable convolutions.

Fernando Alonso-Fernandez, Javier Barrachina, Kevin Hernandez-Diaz, Josef Bigun
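
To see why replacing regular convolutions with depth-wise separable ones shrinks a model, as in the abstract above, the weight counts of the two layer types can be compared directly. The sketch below uses the standard formulas and ignores biases; the layer shape (3x3 kernel, 64 input channels, 128 output channels) is an arbitrary illustration and not a layer from SqueezeFacePoseNet.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a regular k x k convolution (bias terms ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
regular = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(regular, separable, round(regular / separable, 2))
```

For this example shape the separable layer needs roughly one eighth of the weights, which is the kind of saving that makes a lighter variant possible at some cost in accuracy.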

### Deep Learning-Based Semantic Segmentation for Touchless Fingerprint Recognition

Fingerprint recognition is one of the most popular biometric technologies. Touchless fingerprint systems do not require contact of the finger with the surface of a capture device. For this reason, they provide an increased level of hygiene, usability, and user acceptance compared to touch-based capturing technologies. Most processing steps of the recognition workflow of touchless recognition systems differ from those of touch-based biometric techniques. In particular, the segmentation of the fingerprint areas in a 2D capturing process is a crucial and more challenging task. In this work, a proposal for fingertip segmentation using deep learning techniques is presented. The proposed system allows the segmented fingertip areas from a finger image to be submitted directly to the processing pipeline. To this end, we adapt the deep learning model DeepLabv3+ to the requirements of fingertip segmentation and train it on the database for hand gesture recognition (HGR) by extending it with a fingertip ground truth. Our system is benchmarked against a well-established color-based baseline approach and shows more accurate hand segmentation results, especially on challenging images. Further, the segmentation performance on fingertips is evaluated in detail. The gestures provided in the database are separated into three categories by their relevance for the use case of touchless fingerprint recognition. The segmentation performance in terms of Intersection over Union (IoU) of up to 68.03% on the fingertips (overall: 86.13%) in the most relevant category confirms the soundness of the presented approach.

Jannis Priesnitz, Christian Rathgeb, Nicolas Buchmann, Christoph Busch

### FaceHop: A Light-Weight Low-Resolution Face Gender Classification Method

A light-weight low-resolution face gender classification method, called FaceHop, is proposed in this research. We have witnessed rapid progress in face gender classification accuracy due to the adoption of deep learning (DL) technology. Yet, DL-based systems are not suitable for resource-constrained environments with limited networking and computing. FaceHop offers an interpretable non-parametric machine learning solution. It has desired characteristics such as a small model size, a small training data amount, low training complexity, and low-resolution input images. FaceHop is developed with the successive subspace learning (SSL) principle and built upon the foundation of PixelHop++. The effectiveness of the FaceHop method is demonstrated by experiments. For gray-scale face images of resolution $$32 \times 32$$ in the LFW and the CMU Multi-PIE datasets, FaceHop achieves correct gender classification rates of 94.63% and 95.12% with model sizes of 16.9K and 17.6K parameters, respectively. It outperforms LeNet-5 in classification accuracy while LeNet-5 has a model size of 75.8K parameters.

Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, C. -C. Jay Kuo

### Advanced Temporal Dilated Convolutional Neural Network for a Robust Car Driver Identification

The latest generation of cars is often equipped with advanced driver assistance systems, usually known as ADAS. These systems are able to assist the car driver by leveraging several levels of automation. Therefore, it is essential to adapt the ADAS technology to the car driver’s identity to personalize the provided assistance services. For these reasons, car driver profiling algorithms have been developed by the scientific community. The algorithm proposed herein is able to recognize the driver’s identity with an accuracy close to 99% thanks to an ad-hoc analysis of the driver’s PhotoPlethysmoGraphic (PPG) signal. In order to correctly identify the driver profile, the proposed approach uses a 1D Dilated Temporal Convolutional Neural Network architecture to learn the features of the collected driver’s PPG signal. The proposed deep architecture is able to correlate the specific PPG features with subject identity, enabling the car ADAS services associated with the recognized identity. Extensive validation and testing of the developed pipeline confirmed its reliability and effectiveness.

Francesco Rundo, Francesca Trenta, Roberto Leotta, Concetto Spampinato, Vincenzo Piuri, Sabrina Conoci, Ruggero Donida Labati, Fabio Scotti, Sebastiano Battiato

### VISOB 2.0 - The Second International Competition on Mobile Ocular Biometric Recognition

Following the success of the VISOB 1.0 visible light ocular biometrics competition at IEEE ICIP 2016, we organized the VISOB 2.0 competition at IEEE WCCI 2020. The aim of the VISOB 2.0 competition was to evaluate and compare the performance of ocular biometrics recognition approaches in visible light using (a) stacks of five images captured in burst mode and (b) subject-independent evaluation, where subjects do not overlap between training and testing sets. We received three submissions in which the authors developed various deep learning based and texture-analysis based methods. The best results were obtained by a team from Federal University of Parana (Curitiba, Brazil), achieving an Equal Error Rate (EER) of $$5.25\%$$ in a subject-independent evaluation setting.

Hoang (Mark) Nguyen, Narsi Reddy, Ajita Rattani, Reza Derakhshani

### Adapting to Movement Patterns for Face Recognition on Mobile Devices

Facial recognition is becoming an increasingly popular way to authenticate users, helped by the increased use of biometric technology within mobile devices, such as smartphones and tablets. Biometric systems use thresholds to identify whether a user is genuine or an impostor. Traditional biometric systems are static (such as eGates at airports), which allows the operators and developers to create an environment most suited for the successful operation of the biometric technology by using a fixed threshold value to determine the authenticity of the user. However, with a mobile device and scenario, the operational conditions are beyond the control of the developers and operators. In this paper, we propose a novel approach to mobile biometric authentication within a mobile scenario, by offering an adaptive threshold to authenticate users based on the environment, situations and conditions in which they are operating the device. Utilising smartphone sensors, we demonstrate the creation of a successful scenario classification. Using this, we propose our idea of an extendable framework to allow multiple scenario thresholds. Furthermore, we test the concept with data collected from a smartphone device. Results show that using an adaptive scenario threshold approach can improve the biometric performance, and hence could allow manufacturers to produce algorithms that perform consistently in multiple scenarios without compromising security, allowing an increase in public trust towards the use of the technology.

Matthew Boakes, Richard Guest, Farzin Deravi
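
The idea of a scenario-dependent decision threshold from the abstract above can be sketched as a lookup followed by a comparison. This is only an assumed illustration: the scenario names, the threshold values, the fallback rule, and the function `is_genuine` are hypothetical and are not taken from the paper.

```python
# Hypothetical scenario labels and thresholds, for illustration only.
SCENARIO_THRESHOLDS = {
    "stationary": 0.80,
    "walking": 0.70,
    "in_vehicle": 0.65,
}
# Assumption: unknown scenarios fall back to the strictest threshold.
DEFAULT_THRESHOLD = max(SCENARIO_THRESHOLDS.values())


def is_genuine(match_score, scenario):
    """Accept the user if the match score reaches the scenario's threshold."""
    threshold = SCENARIO_THRESHOLDS.get(scenario, DEFAULT_THRESHOLD)
    return match_score >= threshold


# The same score is accepted while walking but rejected when stationary.
print(is_genuine(0.72, "walking"), is_genuine(0.72, "stationary"))
```

In a full system the scenario label would come from a classifier over smartphone sensor data; here it is passed in directly to keep the sketch self-contained.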

### Probing Fairness of Mobile Ocular Biometrics Methods Across Gender on VISOB 2.0 Dataset

Recent research has questioned the fairness of face-based recognition and attribute classification methods (such as gender and race) for dark-skinned people and women. Ocular biometrics in the visible spectrum is an alternative to face biometrics, thanks to its accuracy, security, robustness against facial expression, and ease of use in mobile devices. With the recent COVID-19 crisis, ocular biometrics has a further advantage over face biometrics in the presence of a mask. However, the fairness of ocular biometrics has not been studied until now. This first study aims to explore the fairness of ocular-based authentication and gender classification methods across males and females. To this end, the VISOB 2.0 dataset, along with its gender annotations, is used for the fairness analysis of ocular biometrics methods based on ResNet-50, MobileNet-V2 and lightCNN-29 models. Experimental results suggest equivalent performance for males and females in ocular-based mobile user authentication in terms of genuine match rate (GMR) at lower false match rates (FMRs) and overall Area Under Curve (AUC). For instance, an AUC of 0.96 for females and 0.95 for males was obtained for lightCNN-29 on average. However, males significantly outperformed females in deep learning based gender classification models based on the ocular region.

Anoop Krishnan, Ali Almadan, Ajita Rattani

### Biometric Recognition of PPG Cardiac Signals Using Transformed Spectrogram Images

Nowadays, the number of mobile, wearable, and embedded devices integrating sensors for acquiring cardiac signals is constantly increasing. In particular, photoplethysmographic (PPG) sensors are widespread thanks to their small form factor and limited cost. For example, PPG sensors are used for monitoring cardiac activities in automotive applications and in wearable devices such as smartwatches, activity trackers, and wristbands. Recent studies focused on using PPG signals to secure mobile devices by performing biometric recognition. Although their results are promising, all of these methods process PPG acquisitions as one-dimensional signals. In the literature, feature extraction techniques based on transformations of the spectrogram have been successfully used to increase the accuracy of signal processing techniques designed for other application scenarios. This paper presents a preliminary study on a biometric recognition approach that extracts features from different transformations of the spectrogram of PPG signals and classifies the obtained feature representations using machine learning techniques. To the best of our knowledge, this is the first study in the literature on biometric systems that extracts features from the spectrogram of PPG signals. Furthermore, with respect to most of the state-of-the-art biometric recognition techniques, the proposed approach presents the advantage of not requiring the search for fiducial points, thus reducing the computational complexity and increasing the robustness of the signal preprocessing step. We performed tests using a dataset of samples collected from 42 individuals, obtaining an average classification accuracy of $$99.16\%$$ for identity verification (FMR of 0.56% at FNMR of 13.50%), and a rank-1 identification error of $$7.24\%$$ for identification. The results obtained for the considered dataset are better than or comparable to those of the best-performing methods in the literature.

Ruggero Donida Labati, Vincenzo Piuri, Francesco Rundo, Fabio Scotti, Concetto Spampinato

### The EndoTect 2020 Challenge: Evaluation and Comparison of Classification, Segmentation and Inference Time for Endoscopy

The EndoTect challenge at the International Conference on Pattern Recognition 2020 aims to motivate the development of algorithms that aid medical experts in finding anomalies that commonly occur in the gastrointestinal tract. Using HyperKvasir, a large dataset containing images taken from several endoscopies, the participants competed in three tasks. Each task focuses on a specific requirement for making it useful in a real-world medical scenario. The tasks are (i) high classification performance in terms of prediction accuracy, (ii) efficient classification measured by the number of images classified per second, and (iii) pixel-level segmentation of specific anomalies. Hopefully, this can motivate different computer science researchers to help benchmark a crucial component of a future computer-aided diagnosis system, which in turn, could potentially save human lives.

Steven A. Hicks, Debesh Jha, Vajira Thambawita, Pål Halvorsen, Hugo L. Hammer, Michael A. Riegler

### A Hierarchical Multi-task Approach to Gastrointestinal Image Analysis

A large number of different lesions and pathologies can affect the human digestive system, resulting in life-threatening situations. Early detection plays a relevant role in the successful treatment and the increase of current survival rates for, e.g., colorectal cancer. The standard procedure enabling detection, endoscopic video analysis, generates large quantities of visual data that need to be carefully analyzed by a specialist. Due to the wide range of color, shape, and general visual appearance of pathologies, as well as highly varying image quality, this process is greatly dependent on the human operator's experience and skill. In this work, we detail our solution to the task of multi-category classification of images from the human gastrointestinal (GI) tract within the 2020 EndoTect Challenge. Our approach is based on a Convolutional Neural Network minimizing a hierarchical error function that takes into account not only the finding category, but also its location within the GI tract (lower/upper tract), and the type of finding (pathological finding/therapeutic intervention/anatomical landmark/mucosal views’ quality). We also describe in this paper our solution for the challenge task of polyp segmentation in colonoscopies, which was addressed with a pretrained double encoder-decoder network. Our internal cross-validation results show an average performance of 91.25 Matthews Correlation Coefficient (MCC) and 91.82 Micro-F1 score for the classification task, and a 92.30 F1 score for the polyp segmentation task. The organization provided feedback on the performance in a hidden test set for both tasks, which resulted in 85.61 MCC and 86.96 F1 score for classification, and 91.97 F1 score for polyp segmentation. At the time of writing no public ranking for this challenge had been released.

Adrian Galdran, Gustavo Carneiro, Miguel A. González Ballester

### Delving into High Quality Endoscopic Diagnoses

This paper introduces the solution to the Detection Task and Segmentation Task of the ICPR 2020 EndoTect Challenge [7] from the DeepBlueAI Team. The Detection Task is essentially a classification problem whose target is to distinguish between 23 types of digestive system diseases. For this task, we try different data augmentation methods and feature representation networks. Ensemble learning is also adopted to improve classification performance. For the Segmentation Task, we implement both a semantic segmentation approach and an instance segmentation approach. In comparison, semantic segmentation gets a relatively better result.

Zhipeng Luo, Lixuan Che, Jianye He

### Medical Diagnostic by Data Bagging for Various Instances of Neural Network

Computer-aided diagnostics helps medical experts reach fast diagnoses using machine learning and representation learning techniques. Various types of diagnostics, including endoscopy, use the assistance of machine learning approaches. In this paper, a transfer learning based bagging approach is investigated for endoscopy image analysis. Bagging is used to fine-tune several instances of the deep learning model with 70% of the data in each bag. All of these deep learning models are combined to generate a single prediction using majority voting and a neural-network-based decision approach. The best approach resulted in an F1-score of 0.60 on the EndoTect 2020 dataset, which has 23 abnormalities in the GI tract.
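
The two ingredients named in the abstract above, bags holding 70% of the data and a majority vote over the resulting models, can be sketched as follows. This is a simplified illustration under stated assumptions: sampling within a bag is done without replacement (the abstract does not say which variant is used), the helper names are hypothetical, and the fine-tuned networks are replaced by hard-coded label predictions.

```python
import random
from collections import Counter


def make_bags(samples, n_bags, fraction=0.7, seed=0):
    """Draw n_bags random subsets, each holding `fraction` of the samples.

    Assumption: each bag is sampled without replacement.
    """
    rng = random.Random(seed)
    bag_size = max(1, round(len(samples) * fraction))
    return [rng.sample(samples, bag_size) for _ in range(n_bags)]


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical data: 10 sample indices split into 3 bags of 7 samples each.
bags = make_bags(list(range(10)), n_bags=3)

# Stand-in predictions from three fine-tuned models for one image.
votes = ["polyp", "polyp", "normal"]
print(majority_vote(votes))
```

Each bag would be used to fine-tune one copy of the pretrained network; at inference time the per-model labels are combined as in `majority_vote`.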

### Hybrid Loss with Network Trimming for Disease Recognition in Gastrointestinal Endoscopy

EndoTect Challenge 2020, which aims at the detection of gastrointestinal diseases and abnormalities, consists of three tasks: Detection, Efficient Detection, and Segmentation in endoscopic images. Although pathologies belonging to different classes can be manually separated by experienced experts, existing classification models struggle to discriminate them due to low inter-class variability. As a result, the models’ convergence deteriorates. To this end, we propose a hybrid loss function to stabilise model training. For the detection and efficient detection tasks, we utilise ResNet-152 and MobileNetV3 architectures, respectively, along with the hybrid loss function. For the segmentation task, Cascade Mask R-CNN is investigated. In this paper, we report the architecture of our detection and segmentation models and the performance of our methods on the HyperKvasir and EndoTect test datasets.

Qi He, Sophia Bano, Danail Stoyanov, Siyang Zuo

### DDANet: Dual Decoder Attention Network for Automatic Polyp Segmentation

Colonoscopy is the gold standard for examination and detection of colorectal polyps. Localization and delineation of polyps can play a vital role in treatment (e.g., surgical planning) and prognostic decision making. Polyp segmentation can provide detailed boundary information for clinical analysis. Convolutional neural networks have improved the performance in colonoscopy. However, polyps pose various challenges, such as intra- and inter-class variation and noise. While manual labeling for polyp assessment requires time from experts and is prone to human error (e.g., missed lesions), an automated, accurate, and fast segmentation can improve the quality of delineated lesion boundaries and reduce the miss rate. The EndoTect challenge provides an opportunity to benchmark computer vision methods by training on the publicly available HyperKvasir and testing on a separate unseen dataset. In this paper, we propose a novel architecture called “DDANet” based on a dual decoder attention network. Our experiments demonstrate that the model trained on the Kvasir-SEG dataset and tested on an unseen dataset achieves a dice coefficient of 0.7874, mIoU of 0.7010, recall of 0.7987, and a precision of 0.8577, demonstrating the generalization ability of our model.

Nikhil Kumar Tomar, Debesh Jha, Sharib Ali, Håvard D. Johansen, Dag Johansen, Michael A. Riegler, Pål Halvorsen
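
The Dice coefficient and IoU reported for segmentation in the abstract above are both overlap measures between a predicted and a ground-truth mask. The sketch below computes them for flat binary masks; the function name, the convention of returning 1.0 when both masks are empty, and the toy masks are assumptions made for illustration and are unrelated to the paper's data.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two equally sized flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy 6-pixel masks (1 = polyp pixel, 0 = background).
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]

dice, iou = dice_and_iou(pred_mask, true_mask)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

Dice is never smaller than IoU for the same pair of masks, which is why the two figures in a results table differ even though they measure the same overlap.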

### Efficient Detection of Lesions During Endoscopy

Endoscopy is a very important procedure in the medical field. It is used to detect almost any disease associated with the gastrointestinal (GI) tract. Hence, the current work attempts to use machine learning methods so that such medical procedures can be automated and used in real time to ensure the proper diagnosis of patients. The current work implements the Tiny Darknet model in an attempt to efficiently classify the various medical conditions specified in the dataset used. Ultimately, the Tiny Darknet model achieves a high classification speed, with a maximum of about 60 fps.

Amartya Dutta, Rajat Kanti Bhattacharjee, Ferdous Ahmed Barbhuiya

### The 106-Point Lightweight Facial Landmark Localization Grand Challenge

Facial landmark localization has been applied to numerous face-related applications, such as face recognition and face image synthesis. It is a crucial step for achieving high performance in these applications. We host the 2nd 106-point lightweight facial landmark localization grand challenge in conjunction with ICPR 2020. The purpose is to work towards benchmarking lightweight facial landmark localization, which enables efficient system deployment. Compared with the 1st grand challenge ( https://facial-landmarks-localization-challenge.github.io/ ), the JD-landmark-v2 dataset contains more than 24,000 images with larger variations in identity, pose, expression and occlusion. In addition, strict limits on model size (≤ 20M) and computational complexity (≤ 1G FLOPs) are imposed for computational efficiency. The challenge has attracted attention from academia and industrial practitioners. More than 70 teams participated in the competition, and nine of them took part in the final evaluation. We give a detailed introduction to the competition and the winners' solutions in this paper.

Yinglu Liu, Peipei Li, Xin Tong, Hailin Shi, Xiangyu Zhu, Zhenan Sun, Zhen Xu, Huaibo Liu, Xuefeng Su, Wei Chen, Han Huang, Duomin Wang, Xunqiang Tao, Yandong Guo, Ziye Tong, Shenqi Lai, Zhenhua Chai

### ICPR2020 Competition on Text Detection and Recognition in Arabic News Video Frames

After the success of the first two editions of the “Arabic Text in Videos Competition (AcTiVComp)”, we propose to organize a new edition in conjunction with the 25th International Conference on Pattern Recognition (ICPR’20). The main objective is to contribute to the research field of text detection and recognition in multimedia documents, with a focus on Arabic text in video frames. The former editions were held in the framework of the ICPR’16 and ICDAR’17 conferences. The results obtained on the AcTiV dataset have shown that there is still room for improvement in both text detection and recognition tasks. Four groups with five systems are participating in this edition of AcTiVComp (three for the detection task and two for the recognition task). All the submitted systems follow a CRNN-based architecture, which is now the de facto choice for text detection and OCR problems. The achieved results are very interesting, showing a significant improvement over state-of-the-art performance in this field of research.

Oussama Zayene, Rolf Ingold, Najoua Essoukri BenAmara, Jean Hennebert

### ICPR 2020 - Competition on Harvesting Raw Tables from Infographics

Kenny Davila, Chris Tensmeyer, Sumit Shekhar, Hrituraj Singh, Srirangaraj Setlur, Venu Govindaraju

### Visual and Textual Information Fusion Method for Chart Recognition

In this report, we present our method for the ICPR 2020 Competition on Harvesting Raw Tables from Infographics, which is composed of Chart Classification, Text Detection/Recognition, Text Role Classification, Axis Analysis, Legend Analysis, Plot Element Detection/Classification and CSV Extraction. ResNet image classification models are adopted in Chart Classification. We adopt a two-stage pipeline for end-to-end recognition, treating detection and recognition as two modules in Text Detection/Recognition. An ensemble of LayoutLM and an object detection model is adopted in Text Role Classification. A two-stage pipeline with two detection models is adopted in Legend Analysis. The final results are discussed.

Chen Wang, Kaixu Cui, Suya Zhang, Changliang Xu

### A Benchmark for Analyzing Chart Images

Charts are a compact method of displaying and comparing data. Automatically extracting data from charts is a key step in understanding the intent behind a chart, which could lead to a better understanding of the document itself. To promote the development of methods that automatically decompose and understand these visualizations, the CHART-Infographics organizers hold the Competition on Harvesting Raw Tables from Infographics. In this paper, drawing on machine learning, image recognition, object detection, keypoint estimation, OCR, and other techniques, we explore and propose methods for almost all tasks and achieve relatively good performance.

Zhipeng Luo, Zhiguang Zhang, Ge Li, Lixuan Che, Jianye He, Zhenyu Xu

### ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset

We present a competition on text block segmentation within the framework of the International Conference on Pattern Recognition (ICPR) 2020. The main goal of this competition is to automatically analyse the structure of historical newspaper pages, with a subsequent evaluation of the performance of the participants’ algorithms. In contrast to many existing segmentation methods, which work on pixels, the present study focuses on clustering baselines/text lines into text blocks. We therefore introduce a new measure based on a baseline detection evaluation scheme. Common pixel-based approaches could nevertheless participate without restrictions. Working at the baseline level directly addresses the application scenario in which the text contained in a given image should be extracted in blocks for further investigation. We present the results of three submissions. The experiments have shown that text blocks can be reliably detected both on pages with a simple layout and on pages with a complex layout.

Johannes Michael, Max Weidemann, Bastian Laasch, Roger Labahn

### Top-1 CORSMAL Challenge 2020 Submission: Filling Mass Estimation Using Multi-modal Observations of Human-Robot Handovers

Human-robot object handover is a key skill for the future of human-robot collaboration. The CORSMAL 2020 Challenge focuses on the perception part of this problem: the robot needs to estimate the filling mass of a container held by a human. Although there are powerful methods in image processing and audio processing individually, answering such a problem requires processing data from multiple sensors together. The appearance of the container, the sound of the filling, and the depth data provide essential information. We propose a multi-modal method to predict three key indicators of the filling mass: filling type, filling level, and container capacity. These indicators are then combined to estimate the filling mass of a container. Our method obtained Top-1 overall performance among all submissions to the CORSMAL 2020 Challenge on both public and private subsets while showing no evidence of overfitting. Our source code is publicly available: github.com/v-iashin/CORSMAL .

Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola
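The abstract above says that filling type, filling level, and container capacity are combined into a filling mass. One plausible way to combine them, sketched below, is to multiply the filled volume by the density of the content. This is an illustration, not the authors' implementation: the function name `filling_mass`, the fractional representation of the level, and the density values in `DENSITY_G_PER_ML` are assumptions and are not taken from the abstract.

```python
# Illustrative densities in grams per millilitre.
# Assumption: these values are placeholders, not figures from the abstract.
DENSITY_G_PER_ML = {
    "empty": 0.0,
    "water": 1.0,
    "rice": 0.85,
    "pasta": 0.41,
}


def filling_mass(filling_type: str, filling_level: float, capacity_ml: float) -> float:
    """Estimate the filling mass in grams from three predicted indicators.

    filling_type  -- predicted content class, a key of DENSITY_G_PER_ML
    filling_level -- predicted fill fraction in [0, 1]
    capacity_ml   -- predicted container capacity in millilitres
    """
    if filling_type not in DENSITY_G_PER_ML:
        raise ValueError(f"unknown filling type: {filling_type!r}")
    if not 0.0 <= filling_level <= 1.0:
        raise ValueError("filling_level must be a fraction between 0 and 1")
    if capacity_ml < 0:
        raise ValueError("capacity_ml must be non-negative")

    # Filled volume (mL) times density (g/mL) gives mass (g).
    return filling_level * capacity_ml * DENSITY_G_PER_ML[filling_type]


# A container with 400 mL capacity, predicted to be half full of rice.
mass = filling_mass("rice", 0.5, 400.0)
```

Because the three indicators are multiplied, an error in any one of them propagates proportionally to the mass estimate, which is one reason to predict each indicator with the modality best suited to it.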

### Audio-Visual Hybrid Approach for Filling Mass Estimation

Object handover is a fundamental and essential capability for robots interacting with humans in many applications such as household chores. In this challenge, we estimate the physical properties of a variety of containers with different fillings, such as the container capacity and the type and percentage of the content, to achieve collaborative physical handover between humans and robots. We introduce multi-modal prediction models using audio-visual datasets, distributed by CORSMAL, of people interacting with containers.

Reina Ishikawa, Yuichi Nagao, Ryo Hachiuma, Hideo Saito

### VA2Mass: Towards the Fluid Filling Mass Estimation via Integration of Vision and Audio Learning

Robotic perception of filling mass via multiple sensors and deep learning approaches is still an open problem due to diverse pouring durations, the small pixel ratio of target objects, and complex pouring scenarios. In this paper, we propose a practical solution to tackle this challenging task by estimating filling level, filling type and container capacity simultaneously. The proposed method is inspired by how humans observe and understand the pouring process through the cooperation of multiple modalities, i.e., vision and audio. In a nutshell, our proposed method is divided into three stages to help the agent form a rich understanding of the pouring procedure. First, the agent obtains a prior on container categories (i.e., cup, glass or box) through an object detection framework. Second, we integrate the audio features with the prior so that the agent learns a multi-modal feature space. Finally, the agent infers the distribution of both the container capacity and the fluid properties. The experimental results show the effectiveness of the proposed method, which ranked as 2nd runner-up in the CORSMAL Challenge of Multi-modal Fusion and Learning For Robotics at ICPR 2020.

Qi Liu, Fan Feng, Chuanlin Lan, Rosa H. M. Chan

### Pollen Grain Classification Challenge 2020

Challenge Report

This report summarises the Pollen Grain Classification Challenge 2020 and the related findings. It serves as an introduction to the technical reports on the Pollen Grain Classification Challenge that were submitted to the competition section of the 25th International Conference on Pattern Recognition (ICPR 2020). The challenge is meant to foster the development of automatic pollen grain classification systems by leveraging the first large-scale annotated dataset of microscope pollen grain images.

Sebastiano Battiato, Francesco Guarnera, Alessandro Ortis, Francesca Trenta, Lorenzo Ascari, Consolata Siniscalco, Tommaso De Gregorio, Eloy Suárez

### The Fusion of Neural Architecture Search and Destruction and Construction Learning

First Classified

Object classification is a classic problem in the field of pattern recognition. Traditional deep neural networks have achieved good results on some classification problems; however, many difficulties remain in fine-grained identification tasks, where performance is still held back by practical problems. In this paper, we introduce neural architecture search (NAS) to find an appropriate network for the specific data set, which avoids additional engineering work to tune parameters for optimized performance. We further combine the Destruction and Construction Learning (DCL) network and the NAS-based network for pollen recognition. To this end, we use a fusion algorithm to combine the different networks, and we won the pollen recognition competition held at the International Conference on Pattern Recognition (ICPR) 2020.

Chao Fang, Yutao Hu, Baochang Zhang, David Doermann

### Improved Data Augmentation of Deep Convolutional Neural Network for Pollen Grains Classification

Third Classified

Traditionally, pollen grain classification has been time-consuming work for experts. With the popularity of deep Convolutional Neural Networks (CNNs) in computer vision, many automatic pollen grain classification methods based on CNNs have been proposed in recent years. However, the CNNs they use often focus on the most prominent area in the center of the pollen grains and neglect the less discriminative local features in the surrounding regions. To alleviate this, we propose two data augmentation operations. Our experimental results on Pollen13K achieve a weighted F1 score of 97.26% and an accuracy of 97.29%.

Penghui Gui, Ruowei Wang, Zhengbang Zhu, Feiyu Zhu, Qijun Zhao
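The abstract above reports a weighted F1 score alongside accuracy. As a minimal sketch of what "weighted" usually means in this context, the code below averages per-class F1 scores with weights proportional to each class's share of the true labels. The function names and the toy labels `"a"` and `"b"` are hypothetical, and the challenge's own evaluation script is not described here, so this should be read as the common definition rather than the organizers' exact procedure.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average per-class F1, weighting each class by its number of true samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp  # true samples of this class that were predicted as something else
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if not y_true:
        return 0.0
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy example with an imbalanced two-class problem (labels are hypothetical).
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
f1 = weighted_f1(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

On imbalanced data such as the toy example, the weighted F1 and the accuracy can differ, which is why reports on imbalanced datasets often quote both.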

### Backmatter
