2021 | Book

Computer Analysis of Images and Patterns

19th International Conference, CAIP 2021, Virtual Event, September 28–30, 2021, Proceedings, Part I

Editors: Dr. Nicolas Tsapatsoulis, Dr. Andreas Panayides, Dr. Theo Theocharides, Dr. Andreas Lanitis, Prof. Dr. Constantinos Pattichis, Mario Vento

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

The two volume set LNCS 13052 and 13053 constitutes the refereed proceedings of the 19th International Conference on Computer Analysis of Images and Patterns, CAIP 2021, held virtually, in September 2021.

The 87 papers presented were carefully reviewed and selected from 129 submissions. The papers are organized in the following topical sections across the two volumes: 3D vision; biomedical image and pattern analysis; machine learning; feature extraction; object recognition; face and gesture; the Guess the Age contest; biometrics, cryptography and security; and segmentation and image restoration.

Table of Contents

Frontmatter

3D Vision

Frontmatter
Simultaneous Bi-directional Structured Light Encoding for Practical Uncalibrated Profilometry

Profilometry based on structured light is one of the most popular methods for 3D reconstruction. It is widely used when high-precision, dense models are required for a variety of different objects. User-friendly procedures encode the scene in horizontal and vertical directions, which allows a unique description of points in the scene. The resulting encoding can be used to auto-calibrate the devices used. Thus, any consumer or industrial camera or projector can be supported, and the procedure is not limited to pre-calibrated setups. This approach is extremely flexible, but requires a large number of camera acquisitions of the scene with multiple projected patterns. This paper presents a new approach that encodes the scene simultaneously in horizontal and vertical directions using sinusoidal fringe patterns. This almost halves the number of recorded images, making the approach attractive again for many practical applications with timing constraints.
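As a rough illustration of frequency-multiplexed fringe encoding (a standard construction, not necessarily the authors' exact pattern design), one projected frame can superimpose horizontal and vertical sinusoids with distinct carrier frequencies $f_u \neq f_v$:

$$I_k(x,y) = A + B\cos(2\pi f_u x + \theta_k) + B\cos(2\pi f_v y + \theta_k), \qquad \theta_k = \frac{2\pi k}{N},$$

where $A$ is the ambient offset and $B$ the fringe amplitude. Each shifted frame then carries both codes at once, which is what allows the number of acquisitions to be roughly halved compared with projecting the two directions separately.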

Torben Fetzer, Gerd Reis, Didier Stricker
Joint Global ICP for Improved Automatic Alignment of Full Turn Object Scans

Point cloud registration is an important task in computer vision, computer graphics, robotics, odometry and many other disciplines. The problem has been studied for a long time and many different approaches have been established. When a rough initialization exists, the ICP approach is still widely used as the state of the art, yet often only the pairwise problem is treated. In many applications, especially in 3D reconstruction, closed sequences of partial reconstructions covering a full turn of an object have to be registered. We show that there are considerable advantages if ICP iterations are performed jointly instead of in the usual pairwise manner (Pulli's approach). Without increased computational effort, lower alignment errors are achieved, drift is avoided and calibration errors are distributed uniformly over all scans. The joint approach is further extended into a global version, which not only considers one-sided adjacent scans but updates symmetrically in both directions. The result is an approach with much smoother and more stable convergence, which moreover enables a stable stopping criterion. This makes the procedure fully automatic and therefore superior to most other methods, which often oscillate close to the optimum and have to be terminated manually. We present a complete procedure that also addresses automatic outlier detection in order to solve the investigated problem independently of the data, without any user interaction.
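To make the joint-versus-pairwise distinction concrete, here is a minimal sketch (not the authors' implementation; all names are illustrative) in which every scan of a closed ring is re-aligned against both of its neighbours and all pose updates are applied together at the end of each iteration, so the residual error is spread symmetrically instead of accumulating as drift:

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_fit(src, dst):
    """Least-squares rotation/translation aligning src to dst (Kabsch/SVD)."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs

def joint_icp(scans, iters=20):
    """Hypothetical joint ICP over a closed ring of (N_i, 3) scans."""
    scans = [s.copy() for s in scans]
    n = len(scans)
    for _ in range(iters):
        poses = []
        for i in range(n):
            # match each scan against the union of BOTH ring neighbours
            nbrs = np.vstack([scans[(i - 1) % n], scans[(i + 1) % n]])
            _, idx = cKDTree(nbrs).query(scans[i])   # closest-point matches
            poses.append(rigid_fit(scans[i], nbrs[idx]))
        for s, (R, t) in zip(scans, poses):          # apply all updates jointly
            s[:] = s @ R.T + t
    return scans
```

A pairwise (chained) variant would instead update scan i against scan i-1 immediately and move on, which is exactly how alignment error drifts around the loop.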

Torben Fetzer, Gerd Reis, Didier Stricker
Fast Projector-Driven Structured Light Matching in Sub-pixel Accuracy Using Bilinear Interpolation Assumption

In practical applications where high-precision reconstructions are required, whether for quality control or damage assessment, structured light reconstruction is often the method of choice. It achieves dense point correspondences over the entire scene independently of any object texture. The optimal matches between images with respect to an encoded surface point usually lie not on pixel but on sub-pixel level. Common matching techniques that look for pixel-to-pixel correspondences between camera and projector often lead to noisy results that must subsequently be smoothed. The method presented here finds optimal sub-pixel positions for each projector pixel in a single pass and thus requires minimal computational effort. For this purpose, the quadrilateral regions containing the sub-pixels are extracted. The convexity of these quads and their consistency in terms of topological properties can be guaranteed at runtime. Subsequently, an explicit formulation of the optimal sub-pixel position within each quad is derived using bilinear interpolation, and the permanent existence of a valid solution is proven. The result is an easy-to-use procedure that matches any number of cameras in a structured light setup with high accuracy and low complexity. Due to the ensured topological properties, exceptionally smooth, highly precise, uniformly sampled matches with almost no outliers are achieved. The obtained point correspondences not only have an enormously positive effect on the accuracy of reconstructed point clouds and resulting meshes, but are also extremely valuable for auto-calibrations calculated from them.
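For reference, the bilinear model inside one quad takes the textbook form (this is the standard interpolation the abstract refers to, not the paper's full derivation): writing the code values at the four quad corners as $f_{00}, f_{10}, f_{01}, f_{11}$ and the local coordinates as $(u,v) \in [0,1]^2$,

$$f(u,v) = (1-u)(1-v)\,f_{00} + u(1-v)\,f_{10} + (1-u)\,v\,f_{01} + u\,v\,f_{11},$$

so locating the sub-pixel match amounts to solving $f(u,v) = c$ for the observed projector code $c$; the convexity maintained at runtime is what underpins the existence proof mentioned above.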

Torben Fetzer, Gerd Reis, Didier Stricker
Pyramidal Layered Scene Inference with Image Outpainting for Monocular View Synthesis

Generating novel views from a single input is a challenging task that requires the prediction of occluded and non-visible content. Nevertheless, it is an interesting and active area of research due to its several applications, such as entertainment. In this work, we propose an end-to-end architecture for monocular view synthesis based on the layered scene inference (LSI) method. LSI uses layered depth images (LDIs), which can represent complex scenes with a reduced number of layers. To improve the LSI predictions, we develop two new strategies: (i) a pyramidal architecture that learns LDI predictions for different resolutions of the input and (ii) an image outpainting for filling the missing information at the LDI borders. We evaluate our method on the KITTI dataset and show that the proposed versions outperform the baseline.

Marcos R. Souza, Jhonatas S. Conceição, Jose L. Flores-Campana, Luis G. L. Decker, Diogo C. Luvizon, Gustavo Sutter P. Carvalho, Helena A. Maia, Helio Pedrini
Out of the Box: Embodied Navigation in the Real World

The research field of Embodied AI has witnessed substantial progress in visual navigation and exploration thanks to powerful simulating platforms and the availability of 3D data of indoor and photorealistic environments. These two factors have opened the doors to a new generation of intelligent agents capable of achieving nearly perfect PointGoal Navigation. However, such architectures are commonly trained with millions, if not billions, of frames and tested in simulation. Together with great enthusiasm, these results yield a question: how many researchers will effectively benefit from these advances? In this work, we detail how to transfer the knowledge acquired in simulation into the real world. To that end, we describe the architectural discrepancies that damage the Sim2Real adaptation ability of models trained on the Habitat simulator and propose a novel solution tailored towards deployment in real-world scenarios. We then deploy our models on a LoCoBot, a Low-Cost Robot equipped with a single Intel RealSense camera. Unlike previous work, our testing scene is unavailable to the agent in simulation. The environment is also inaccessible to the agent beforehand, so it cannot count on scene-specific semantic priors. In this way, we reproduce a setting in which a research group (potentially from other fields) needs to employ the agent's visual navigation capabilities as a service. Our experiments indicate that it is possible to achieve satisfying results when deploying the obtained model in the real world. Our code and models are available at https://github.com/aimagelab/LoCoNav .

Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
Toward a Novel LSB-based Collusion-Secure Fingerprinting Schema for 3D Video

Securing multimedia content and preventing it from being maliciously manipulated has developed at a rapid pace, and researchers have been studying traitor tracing as an appropriate solution. This approach consists in tracing back the actors who contributed to the construction of an illegal release of a multimedia product. It includes two major steps, the fingerprinting step and the tracing one. The fingerprinting step relies on the watermarking technique, whereas the efficiency of the tracing scheme depends on several requirements: the robustness of the watermarking technique, the type of the media content, and even the computational complexity. In this paper, we propose a new collusion-secure fingerprinting scheme for 3D videos with an essentially twofold purpose. In the first step, we embed the watermark in the video copy by applying a standard Least Significant Bit (LSB) substitution to all the frames of both the 2D video and the depth map components, in order to ensure simultaneously and independently the protection of these two parts. In the second step, we apply the tracing process, whose target is the identification of potential colluders by extracting the hidden identifier from the suspicious video and analysing it. Experimental assessments show that the proposed scheme provides interesting results in terms of speed and tracing accuracy.

Karama Abdelhedi, Faten Chaabane, William Puech, Chokri Ben Amar
A Combinatorial Coordinate System for the Vertices in the Octagonal Grid

The octagonal $C_4C_8$ grid is a tessellation of the plane into regular octagons and squares. It is one of the eight semiregular grids, which have been receiving an increasing amount of research attention as a viable alternative to the traditional square grid. We present an integer-valued combinatorial coordinate system for the vertices in the $C_4C_8(R)$ grid. We review the existing coordinate systems for this grid proposed in the literature and provide formulas for the conversion between this coordinate system and the existing ones, including Cartesian coordinates. The adjacency relation between vertices can be easily obtained from their coordinates through simple integer arithmetic.

Lidija Čomić
Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data

Speech recognition is very challenging in student learning environments that are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system in generating a complex audio dataset that contains significant cross-talk and background noise approximating real-life classroom recordings. We then test our proposed system with real-life recordings. In terms of the distance of the speakers from the microphone, our interactive video analysis system obtained a better average error rate of 10.83% compared to 33.12% for a baseline approach. Our proposed system gave an accuracy of 27.92%, which is 1.5% better than Google Speech-to-text on the same dataset. For 9 important keywords, our approach gave an average sensitivity of 38% compared to 24% for Google Speech-to-text, while both methods maintained high average specificity of 90% and 92%, respectively.

Luis Sanchez Tapia, Antonio Gomez, Mario Esparza, Venkatesh Jatla, Marios Pattichis, Sylvia Celedón-Pattichis, Carlos LópezLeiva
Cost-Efficient Color Correction Approach on Uncontrolled Lighting Conditions

The misuse and overuse of antibiotics lead to antibiotic resistance, a serious problem and a threat to world health. Bacteria developing resistance results in more dangerous infections and more difficult treatment. To monitor the antibiotic pollution of environmental waters, different detection methods have been developed; however, these are normally complex, costly and time-consuming. In a previous work, we developed a method based on digital colorimetry, using smartphone cameras to acquire sample images and color correction to ensure color constancy between images. A reference chart with 24 colors with known ground-truth values is included in the photographs in order to color-correct the images using least squares minimization. Then, the color of the sample is detected and correlated to antibiotic concentration. Although achieving promising results, the method was too sensitive to contrasting illumination conditions, with high standard deviations in these cases. Here, we test different methods for improving the stability and precision of the previous algorithm. By using only the 13 patches closest to the color of the targets and more parameters for the least squares minimization, better results were achieved, with an improvement of up to 83.33% relative to the baseline. By improving the color constancy, a more precise estimation of sulfonamides, less influenced by extreme conditions, is possible using a practical and cost-efficient method.
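As a rough sketch of the least-squares colour-correction idea described above (an affine RGB model; the paper's richer parameterisations would add further terms, and all names here are hypothetical):

```python
import numpy as np

def fit_color_correction(observed, reference):
    """Fit a least-squares colour-correction matrix.
    observed, reference: (N, 3) RGB values of the N reference-chart patches
    (e.g. the 13 patches closest in colour to the target). Needs N >= 4."""
    X = np.hstack([observed, np.ones((len(observed), 1))])  # affine bias term
    M, *_ = np.linalg.lstsq(X, reference, rcond=None)       # (4, 3) matrix
    return M

def apply_correction(image_rgb, M):
    """Apply the fitted affine correction to an (H, W, 3) 8-bit image."""
    flat = image_rgb.reshape(-1, 3).astype(float)
    flat = np.hstack([flat, np.ones((len(flat), 1))]) @ M
    return np.clip(flat, 0, 255).reshape(image_rgb.shape).astype(np.uint8)
```

The corrected sample colour can then be compared against a calibration curve to estimate antibiotic concentration.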

Pedro H. Carvalho, Inês Rocha, Fábio Azevedo, Patrícia S. Peixoto, Marcela A. Segundo, Hélder P. Oliveira
HPA-Net: Hierarchical and Parallel Aggregation Network for Context Learning in Stereo Matching

Accurate disparity estimation on rectified stereo image pairs is essential for many computer vision tasks. Current deep learning-based stereo networks generally construct a single-scale cost volume to regularize and regress the disparity. However, these methods do not take advantage of multi-scale context information, leading to limited performance of disparity prediction in ill-posed regions. In this paper, we propose a novel stereo network named HPA-Net, which provides an efficient representation of context information and lower error rates in ill-posed regions. First, we propose a hierarchical aggregation module to fuse context information from multi-scale cost volumes into an integrated cost volume. Then, we apply the integrated cost volume to the proposed parallel aggregation module, which uses several 3D dilated convolutions simultaneously to capture global and local clues of context information for disparity regression. Experimental results show that our proposed HPA-Net achieves state-of-the-art stereo matching performance on the KITTI datasets.

Wei Chen, Jun Peng, Ziyu Zhu, Yong Zhao
MTStereo 2.0: Accurate Stereo Depth Estimation via Max-Tree Matching

Efficient yet accurate extraction of depth from stereo image pairs is required by systems with low power resources, such as robotics and embedded systems. State-of-the-art stereo matching methods based on convolutional neural networks require intensive computations on GPUs and are difficult to deploy on embedded systems. In this paper, we propose MTStereo 2.0, an improved version of the MTStereo matching method, which includes a more robust context-driven cost function, better detection of incorrect matches and the computation of disparity at pixel level. MTStereo provides accurate sparse and semi-dense depth estimation and does not require intensive GPU computations. We tested it on several benchmarks, namely the KITTI 2015, Driving, FlyingThings3D, Middlebury 2014, Monkaa and TrimBot2020 garden data sets, and achieved competitive accuracy. The code is available at https://github.com/rbrandt1/MaxTreeS .

Rafaël Brandt, Nicola Strisciuglio, Nicolai Petkov

Biomedical Image and Pattern Analysis

Frontmatter
H-OCS: A Hybrid Optic Cup Segmentation of Retinal Images

Glaucoma is the second leading cause of irreversible vision loss. Early diagnosis and treatment can, however, slow the progression of the disease. Specialists making this diagnosis rely on several tests and examinations, such as visual field tests and examinations of retinal images and optical coherence tomography images. One of the regions examined by specialists when checking for retinal conditions is the optic nerve head region, which is the brightest region in retinal images. Within this region, the ratio between the cup and the disc can be used when diagnosing glaucoma. Calculating the cup-disc ratio requires the segmentation of both the disc and the cup from retinal images. In a previous paper, a method for segmenting the disc was proposed. Here, another deep learning model, H-OCS, is proposed for segmenting the cup from retinal images. A customized InceptionV3 model with transfer learning and image augmentation is used. Additionally, the output of H-OCS is refined and enhanced using a series of post-processing steps. H-OCS is tested on six publicly available datasets (RimOneV3, Drishti, Messidor, Refuge, Riga, and Magrebia), and several ablation studies are conducted to evaluate the effectiveness of the proposed approach. Additionally, the performance of H-OCS is compared with other studies. An overall average accuracy of 97.86%, Dice coefficient of 88.37%, sensitivity of 89.09% and IoU of 79.66% were achieved.

Abdullah Sarhan, Jone Rokne, Reda Alhajj
Retinal Vessel Segmentation Using Blending-Based Conditional Generative Adversarial Networks

With a critical need for faster and more accurate diagnosis in medical image analysis, artificial intelligence plays a critical role. Precise artery segmentation and faster diagnosis in retinal blood vessel segmentation can be beneficial for the early detection of acute diseases such as diabetic retinopathy and glaucoma. Recent advancements in deep learning have led to some exciting improvements in the field of medical image segmentation. However, one common problem faced by such methods is the limited availability of labelled data to train a suitable deep learning model: publicly available datasets for retinal vessel segmentation contain fewer than 50 images, while deep learning is a data-hungry process. We propose a method to generate synthetic images to augment the training needs of the deep learning model. Specifically, we propose a blending and enhancement-based strategy to learn a conditional generative adversarial model. The network synthesizes high-quality fundus images used along with the real images to learn a convolutional neural network-based segmentation model. Experimental evaluation shows that the proposed synthetic generation method improves segmentation performance on the real test images of the Digital Retinal Images for Vessel Extraction (DRIVE) dataset, achieving 97.01% segmentation accuracy.

Suraj Saxena, Kanhaiya Lal, Sharad Joshi
U-Shaped Densely Connected Convolutions for Left Ventricle Segmentation from CMR Images

Segmentation of cardiac magnetic resonance images (cMRI) remains a challenging task in scientific research due to its significance in the medical assessment of cardiovascular diseases. Accurate segmentation of the heart structures, mainly the left ventricle (LV) cavity, serves to extract important information and has a major impact on the quantitative analysis of heart function, which helps doctors reach a proper diagnosis. The present paper introduces a simple and efficient U-shaped convolutional neural network aiming to accurately segment the LV from cMR images. We applied our architecture to LV segmentation on cMRI from the Automated Cardiac Diagnosis Challenge (ACDC). The obtained results are promising. This simple CNN-based model has significantly fewer parameters, rendering it less computationally demanding, yet it provides accurate segmentation. The tested method achieved LV Dice scores of 0.958 at end-systolic time (ES) and 0.979 at end-diastolic time (ED), which yields a mean Dice score of 0.968 on the ACDC dataset.
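For reference, the Dice score reported here (and in several later abstracts) compares a predicted mask $A$ against the ground-truth mask $B$:

$$\mathrm{DSC}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|},$$

equal to 1 for perfect overlap and 0 for disjoint masks.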

Khouloud Boukhris, Ramzi Mahmoudi, Asma Ben Abdallah, Mabrouk AbdelAli, Badii Hmida, Mohamed Hédi Bedoui
Deep Learning Approaches for Head and Operculum Segmentation in Zebrafish Microscopy Images

In this paper, we propose variants of deep learning methods to segment the head and operculum of zebrafish larvae in microscopy images. In the first approach, we used a three-class model to jointly segment the head and operculum area of zebrafish larvae from the background. In the second, two-step, approach, we first trained a binary segmentation model to segment the head area from the background, followed by another binary model to segment the operculum area within the cropped head area, thereby minimizing the class imbalance problem. Both of our approaches use a modified, simpler U-Net architecture, and we also evaluate different loss functions to tackle the class imbalance problem. We systematically compare all these variants using various performance metrics. Data and open-source code are available at https://uliege.cytomine.org .

Navdeep Kumar, Alessio Carletti, Paulo J. Gavaia, Marc Muller, M. Leonor Cancela, Pierre Geurts, Raphaël Marée
Shape Analysis Approach Towards Assessment of Cleft Lip Repair Outcome

Current methods of assessing the quality of a surgically repaired cleft lip rely on humans scoring photographs. This is only practical for research purposes, due to the resources necessary, and is not used in routine audit. It has poor validity due to human subjectivity and thus low inter-rater reliability. An automatic method for aesthetic outcome assessment of cleft lip repair is required. The appearance and shape of the lips constitute the region of interest for analysis. The mouth borderline and corner points are detected using a bilateral semantic network for real-time segmentation. The bisector of the line linking the mouth corners is estimated as the vertical symmetry axis. The mouth blob is split into two parts, which are analyzed for similarity, and a numeric score ranging from 1 to 5 is then generated. The Pearson correlation coefficient between automatically generated scores and human-assigned ones serves as a validation metric. A correlation of about 40% indicates a good agreement between human and computer-based assessments. However, a better correlation of 95.9% exists between the automatically detected mouth regions and those manually drawn by human experts, the third ground truth set in scenario two. Our method has the potential to automate an outcome estimation of the aesthetics of cleft lip repair with reduced human bias, easy implementation and computational efficiency.
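As a toy illustration of the symmetry analysis described above (the paper's actual similarity measure and its mapping to the 1-5 scale are not specified here; all names are hypothetical):

```python
import numpy as np

def symmetry_score(mask, axis_col):
    """Mirror the left half of a binary mouth mask about the estimated
    vertical axis (column index axis_col) and measure its overlap with
    the right half; higher overlap means a more symmetric repair."""
    left = mask[:, :axis_col][:, ::-1]   # mirrored left half
    right = mask[:, axis_col:]
    w = min(left.shape[1], right.shape[1])
    left, right = left[:, :w], right[:, :w]
    inter = np.logical_and(left, right).sum()
    union = np.logical_or(left, right).sum()
    iou = inter / union if union else 1.0
    return 1 + 4 * iou                   # map [0, 1] overlap to a 1-5 scale
```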

Paul Bakaki, Bruce Richard, Ella Pereira, Aristides Tagalakis, Andy Ness, Yonghuai Liu
MMEC: Multi-Modal Ensemble Classifier for Protein Secondary Structure Prediction

Protein secondary structure prediction is an important task with many applications, such as local folding analysis, tertiary structure prediction, and function classification. Driven by the recent success of multi-modal classifiers, new studies have applied this type of method in other domains, for instance, biology and health care. In this work, we investigate the ensemble of three different classifiers for protein secondary structure prediction. Each classifier of our method deals with a transformation of the original data into a specific domain, namely image classification, natural language processing, and time series tasks. Each classifier achieved competitive results compared to the literature, and the ensemble of the three different classifiers obtained Q8 accuracies of 77.9% and 73.3% on the CB6133 and CB513 datasets, surpassing state-of-the-art approaches in both scenarios.

Gabriel Bianchin de Oliveira, Helio Pedrini, Zanoni Dias
Patch-Level Nuclear Pleomorphism Scoring Using Convolutional Neural Networks

In an effort to ease the job of pathologists while examining Hematoxylin and Eosin stained breast tissue, this study presents a deep learning-based classifier of nuclear pleomorphism according to the Nottingham grading scale. We show that high classification accuracy is attainable without pre-segmenting the cell nuclei. The data used in the experiments were acquired from our partner teaching hospital and consist of image patches extracted from whole slide images. Using the labeled data, we compared the performance of three state-of-the-art convolutional neural networks and tested the trained model on the unseen testing data. Our experiments revealed that the densely connected architecture (DenseNet) outperforms the residual network (ResNet) and the dual path networks (DPN) in terms of accuracy and F1 score. Specifically, we reached an overall validation accuracy and F1 score of over 0.96 and 0.94, respectively.

Leonardo O. Iheme, Gizem Solmaz, Fatma Tokat, Sercan Çayir, Engin Bozaba, Çisem Yazici, Gülşah Özsoy, Samet Ayalti, Cavit Kerem Kayhan, Ümit İnce
Automatic Myelofibrosis Grading from Silver-Stained Images

Histopathology studies tissues to provide evidence of a disease, its type and its grade. Usually, the interpretation of these tissue specimens is performed under a microscope by human experts, but since the advent of digital pathology, the slides are digitised, shared and viewed remotely, facilitating diagnosis, prognosis and treatment planning. Furthermore, digital slides can be analysed automatically with computer vision methods to provide diagnostic support, reduce subjectivity and improve efficiency. This field has attracted many researchers in recent years, who have mainly focused on the analysis of cell morphology in Hematoxylin & Eosin stained samples. In this work, instead, we focus on the analysis of reticulin fibres in silver-stained images. This task has rarely been addressed in the literature, mainly due to the total absence of public data sets, but it is beneficial for assessing the presence of fibrotic degeneration. One such condition is myelofibrosis, characterised by an excess of fibrous tissue. Here we propose an automated method to grade myelofibrosis from image patches. We evaluated different convolutional neural networks for this purpose, and the obtained results demonstrate that myelofibrosis can be identified and graded automatically.

Lorenzo Putzu, Maxim Untesco, Giorgio Fumera
A Deep Learning-Based Pipeline for Celiac Disease Diagnosis Using Histopathological Images

With increasing numbers of celiac disease diagnoses and misdiagnoses, automated approaches are valuable to aid pathologists in efficiently diagnosing this disease. Histopathological analysis of intestinal biopsy is considered the gold standard for diagnosis. Convolutional neural networks have achieved promising results for various image processing tasks. A common challenge in medical imaging analysis is obtaining a large number of samples, impeding the full potential of deep learning. In this paper, we propose a classification pipeline to train deep convolutional neural networks to accurately diagnose celiac disease using models trained with a small number of samples. To show the utility of this approach, we compared it to a typical classification pipeline. The results indicate the superiority of our classification pipeline in distinguishing celiac disease from normal tissue, with precision, recall, and accuracy of 0.941, 0.889, and 0.893, respectively. Although we showed the utility of the proposed pipeline for celiac disease diagnosis, it can also be used for other applications utilizing histopathological imaging.

Farhad Maleki, Kevin Cote, Keyhan Najafian, Katie Ovens, Yan Miao, Rita Zakarian, Caroline Reinhold, Reza Forghani, Peter Savadjiev, Zu-hua Gao
HEp-2 Cell Image Recognition with Transferable Cross-Dataset Synthetic Samples

The paper examines the possibilities of using synthetic HEp-2 cell images as a means of data augmentation. A common problem of biomedical datasets is the shortage of annotated samples required for the training of deep learning techniques. Traditional approaches based on image rotation and mirroring have their limitations, and alternative techniques based on generative adversarial networks (GANs) are currently being explored. Instead of looking solely at a single dataset or creating a recognition model applicable to multiple datasets, this study focuses on the transferability of synthetic HEp-2 samples among publicly available datasets. The paper offers a workflow in which the quality of synthetic samples is confirmed via an independent fine-tuned neural network. The subsequent combination of synthetic samples with original images outperforms traditional augmentation approaches and leads to state-of-the-art performance on both publicly available HEp-2 cell image datasets employed in this study.

Tomáš Majtner
Clinically Guided Trainable Soft Attention for Early Detection of Oral Cancer

Oral cancer disproportionately affects low- and middle-income countries, where a lack of access to appropriate medical care contributes towards late disease presentation. Using artificial intelligence to facilitate the automated identification of high-risk oral lesions can improve patient survival rates. In image classification using oral cavity images and other forms of medical images, the information to be classified can often be extremely localized. To address this problem, we propose the use of convolutional neural networks with trainable soft attention. Further to this, we incorporate a localization loss to penalize the difference between attention maps and clinically annotated masks. This effectively allows clinicians to help guide the soft attention. Improvements to the baseline were made, with an accuracy of 0.8333 and a ROC AUC of 0.8632, which equate to increases of 0.0245 and 0.0394, respectively. This accuracy corresponds to a sensitivity of 0.8469 and a specificity of 0.8208. Perhaps more important is a model that demonstrates better capability at paying attention to the lesions in its decision making. Furthermore, visualizing the resulting attention maps can help to strengthen clinical confidence in AI decision making.
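A minimal sketch of such a clinically guided objective, assuming (this is our illustration, not the paper's exact definition) a classification loss plus a penalty on the discrepancy between the soft-attention map and the clinician's mask:

```python
import torch
import torch.nn.functional as F

def attention_guided_loss(logits, labels, attn_map, clin_mask, lam=1.0):
    """logits: (B, C) class scores; labels: (B,) targets;
    attn_map: (B, 1, h, w) soft attention; clin_mask: (B, 1, H, W) expert mask.
    lam weights the localization term against the classification term."""
    cls_loss = F.cross_entropy(logits, labels)
    # resize attention to the mask resolution before comparing
    attn = F.interpolate(attn_map, size=clin_mask.shape[-2:],
                         mode="bilinear", align_corners=False)
    loc_loss = F.mse_loss(attn, clin_mask)   # attention-vs-mask discrepancy
    return cls_loss + lam * loc_loss
```

Minimizing the second term is what "allows clinicians to help guide soft attention" toward the annotated lesions.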

Roshan Alex Welikala, Paolo Remagnino, Jian Han Lim, Chee Seng Chan, Senthilmani Rajendran, Thomas George Kallarakkal, Rosnah Binti Zain, Ruwan Duminda Jayasinghe, Jyotsna Rimal, Alexander Ross Kerr, Rahmi Amtha, Karthikeya Patil, Wanninayake Mudiyanselage Tilakaratne, Sok Ching Cheong, Sarah Ann Barman
Small and Large Bile Ducts Intrahepatic Cholangiocarcinoma Classification: A Preliminary Feature-Based Study

Cholangiocarcinoma (CCA) is the second most common liver malignancy, and the incidence and mortality rates of this disease are increasing worldwide. This paper deals with the problem of Intrahepatic Cholangiocarcinoma (IH-CCA) classification using Computed Tomography (CT) images. Specifically, a radiomics-based approach is proposed that exploits abdominal volumetric CT data in order to differentiate large bile duct from small bile duct IH-CCA. The developed method relies on the investigation of intrinsic discriminative properties of CT scans through feature selection methods. The effectiveness of the proposed method is demonstrated on a study cohort of 26 patients, including 16 patients with large bile duct and 10 with small bile duct disease. The conducted tests have shown that our approach provides a baseline for an efficient classification process with low computational cost, facilitating clinical decision-making procedures.

Chiara Losquadro, Silvia Conforto, Maurizio Schmid, Gaetano Giunta, Marco Rengo, Vincenzo Cardinale, Guido Carpino, Andrea Laghi, Ana Lleo, Riccardo Muglia, Ezio Lanza, Guido Torzilli
A Review on Breast Cancer Brain Metastasis: Automated MRI Image Analysis for the Prediction of Primary Cancer Using Radiomics

Breast cancer brain metastasis (BCBM) remains a major clinical challenge. Current systemic treatments are often inadequate, while diagnosis involves a time-consuming series of neuro-imaging acquisitions and dangerous invasive biopsies. Automated image analysis systems for the identification, prediction and follow-up of BCBM are therefore required. This review discusses the advancements in automated MRI brain metastasis (BM) image analysis using radiomic-feature-based classification. Seven BM segmentation studies and three BCBM identification studies were considered eligible. The latter studies were based on either manual or semi-automated segmentation methods. Almost every fully automated BM segmentation method presented in the literature reported a maximum Dice similarity coefficient (DSC) of 84%, but resulted in poor BM segmentation for brain areas smaller than 5 mm (0.06 ml). The multi-class prediction of BCBM, which is more representative of clinical applicability, is based on imaging features and resulted in an area under the curve (AUC) of 60%. Therefore, the need still exists for the development of automated image analysis methods for the identification, follow-up and prediction of BCBM. The potential clinical usage of the above methods entails further multi-center studies with comprehensive clinical data and multi-class modeling with vast and varying primary and metastatic brain tumors.

Vangelis Tzardis, Efthyvoulos Kyriacou, Christos P. Loizou, Anastasia Constantinidou
An Adaptive Semi-automated Integrated System for Multiple Sclerosis Lesion Segmentation in Longitudinal MRI Scans Based on a Convolutional Neural Network

This work proposes and evaluates a semi-automated integrated segmentation system for multiple sclerosis (MS) lesions in fluid-attenuated inversion recovery (FLAIR) brain magnetic resonance images (MRI). The proposed system uses an adaptive two-dimensional (2D) fully convolutional neural network (CNN) and is applied to each MRI brain slice separately. The system is based on a U-Net architecture and allows manual error corrections by the user. This produces continuing improvements to the accuracy of the segmentation system, which can be adapted and reconfigured interactively based on the data entered by the user. The system was evaluated on the ISBI dataset, on 20 MRI brain images acquired from 5 MS subjects who repeated their examinations at four consecutive time points (TP1-TP4). Manual lesion delineations were provided by two different experts. A Dice Similarity Coefficient (DSC) of 0.76 was achieved using the proposed system, matching the highest value achieved by any other system. A higher DSC of 0.82 was achieved when the proposed system was evaluated on TP4 images only. A larger dataset will be analyzed in the future, and new measurement metrics will be suggested.

Andreas Georgiou, Christos P. Loizou, Andria Nicolaou, Marios Pantzaris, Constantinos S. Pattichis
A Three-Dimensional Reconstruction Integrated System for Brain Multiple Sclerosis Lesions

During the acquisition of a human brain scan with a magnetic resonance imager (MRI), two-dimensional (2D) slices of the brain are captured. These have to be aligned and reconstructed into a three-dimensional (3D) volume, which better assists the doctor in following up the development of the disease. In this study, a 3D reconstruction integrated system for MRI brain multiple sclerosis (MS) lesion visualization is proposed. Brain MRI images from 5 MS subjects were acquired at four different consecutive time points (TP1-TP4) with an interval of 6–12 months. MS lesions were segmented manually by an expert neurologist and semi-automatically by a system, and reconstructed in a brain volume. The proposed system assists the doctor in following up the MS disease progression and provides support to better manage the disease. It includes a 5-stage pipeline (pre-processing, lesion segmentation, 3D reconstruction, volume estimation and method evaluation), as well as a module for the quantitative evaluation of the method. Twenty MRI images of the brain were used to evaluate the proposed system. Results show that the 3D reconstruction method proposed in this work can be used to differentiate brain tissues and recognize MS lesions by providing improved 3D visualization. These preliminary results provide evidence that the proposed system could be applied in clinical practice in the future, given that it is further evaluated on more subjects.

Charalambos Gregoriou, Christos P. Loizou, Andreas Georgiou, Marios Pantzaris, Constantinos S. Pattichis
Rule Extraction in the Assessment of Brain MRI Lesions in Multiple Sclerosis: Preliminary Findings

Various artificial intelligence (AI) algorithms have been proposed in the literature that are used as medical assistants in clinical diagnostic tasks. Explainability methods shed light on the black-box nature of these algorithms. The objective of this study was the extraction of rules for the assessment of brain magnetic resonance imaging (MRI) lesions in Multiple Sclerosis (MS) subjects based on texture features. Rule extraction from lesion features was used to explain and provide information on the disease diagnosis and progression. A dataset of 38 subjects diagnosed with a clinically isolated syndrome (CIS) of MS and MRI-detectable brain lesions was scanned twice with an interval of 6–12 months. MS lesions were manually segmented by an experienced neurologist. Features were extracted from the segmented MS lesions and were correlated with the expanded disability status scale (EDSS) ten years after the initial diagnosis in order to quantify future disability progression. The subjects were separated into two groups, G1: EDSS ≤ 3.5 and G2: EDSS > 3.5. Classification models were implemented on the KNIME analytics platform using decision trees (DT) to train models with high accuracy and extract the best rules. The results of this study show the effectiveness of rule extraction, as it can differentiate MS subjects with a benign course of the disease (G1: EDSS ≤ 3.5) from subjects with advanced accumulating disability (G2: EDSS > 3.5) using texture features. Further work is in progress to incorporate argumentation modeling to enable rule combination as well as better explainability. The proposed methodology should also be evaluated on more subjects.

Andria Nicolaou, Christos P. Loizou, Marios Pantzaris, Antonis Kakas, Constantinos S. Pattichis
Invariant Moments, Textural and Deep Features for Diagnostic MR and CT Image Retrieval

Image analysis in the medical field aims to offer tools for the diagnosis and detection of life-threatening illness. This study proposes a novel content-based image retrieval system oriented to medical diagnosis. In particular, we exploit several classic and deep image descriptors together with different similarity measures on three different data sets containing computed tomography and magnetic resonance images. Experiments show that feature selection can bring benefits when applied to deep and texture features, contrary to what was observed for invariant moments. Moreover, the cityblock distance emerged as quite suitable overall in this domain, although some other distances also exhibited satisfactory robustness.

Lorenzo Putzu, Andrea Loddo, Cecilia Di Ruberto
Toward Multiwavelet Haar-Schauder Entropy for Biomedical Signal Reconstruction

In this paper, a wavelet/multiwavelet approach is proposed for biosignal reconstruction based on a correlation between the two well-known Haar and Schauder wavelets, called the Haar-Schauder multiwavelet. A multiwavelet entropy is then proposed in order to optimize and evaluate the order/disorder of the reconstructed signals. Finally, an experimentation step is carried out on ECG and EMG biosignals to validate the proposed approaches.

Malika Jallouli, Wafa Belhadj Khalifa, Anouar Ben Mabrouk, Mohamed Ali Mahjoub

Machine Learning

Frontmatter
Handling Missing Observations with an RNN-based Prediction-Update Cycle

In tasks such as tracking, time-series data inevitably carry missing observations. While traditional tracking approaches can handle missing observations, recurrent neural networks (RNNs) are designed to receive input data in every step. Furthermore, current solutions for RNNs, like omitting the missing data or data imputation, are not sufficient to account for the resulting increased uncertainty. Towards this end, this paper introduces an RNN-based approach that provides a full temporal filtering cycle for motion state estimation. The Kalman-filter-inspired approach can deal with missing observations and outliers. To provide a full temporal filtering cycle, a basic RNN is extended to take observations and the associated belief about their accuracy into account when updating the current state. An RNN prediction model, which generates a parametrized distribution to capture the predicted states, is combined with an RNN update model, which relies on the prediction model output and the current observation. By providing the model with masking information (binary-encoded missing events), it can overcome the limitations of standard techniques for dealing with missing input values. The model's abilities are demonstrated on synthetic data reflecting prototypical pedestrian tracking scenarios.
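A minimal sketch of such a prediction-update cycle (module names and the fusion rule are our illustrative assumptions, not the paper's architecture): the update is applied only where the mask marks an observation as present, so missing steps fall back on the prediction alone.

```python
import torch
import torch.nn as nn

class RNNFilter(nn.Module):
    """Kalman-style prediction-update cycle built from a GRU cell."""
    def __init__(self, obs_dim=2, hid=32):
        super().__init__()
        self.predict = nn.GRUCell(obs_dim, hid)       # prediction model
        self.update = nn.Linear(hid + obs_dim, hid)   # update model
        self.readout = nn.Linear(hid, 2 * obs_dim)    # mean + log-variance

    def forward(self, obs, mask):
        # obs: (T, B, obs_dim); mask: (T, B, 1), 1 = observation present
        h = obs.new_zeros(obs.size(1), self.predict.hidden_size)
        outputs = []
        for o, m in zip(obs, mask):
            h = self.predict(o * m, h)                     # prediction step
            upd = torch.tanh(self.update(torch.cat([h, o], -1)))
            h = m * upd + (1 - m) * h                      # update only if observed
            outputs.append(self.readout(h))                # parametrized distribution
        return torch.stack(outputs)
```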

Stefan Becker, Ronny Hug, Wolfgang Huebner, Michael Arens, Brendan T. Morris
eGAN: Unsupervised Approach to Class Imbalance Using Transfer Learning

Class imbalance is an inherent problem in many machine learning classification tasks. This often leads to learned models that are unusable for any practical purpose. In this study, we explore an unsupervised approach to address class imbalance by leveraging transfer learning from pre-trained image classification models. To this end, an encoder-based Generative Adversarial Network (eGAN) is proposed, which modifies the generator of a GAN by introducing an encoder module and adapts the GAN loss function to directly classify the majority and minority classes. To the best of our knowledge, this is the first work to tackle this problem using a GAN-based loss function rather than augmenting the dataset with synthesized fake images. Our approach eliminates the epistemic uncertainty in the model predictions, as $P(\text{minority})$ and $P(\text{majority})$ need not sum to 1. The impact of transfer learning and combinations of different pre-trained image classification models at the generator and the discriminator level is also explored. A best F1-score of 0.69 was obtained on the CIFAR-10 classification task with an enforced imbalance ratio of 1:2500. Our implementation code is available at https://github.com/demolakstate/eGAN_addressing_class_imbalance_with_transfer_learning_on_GAN.git .

Ademola Okerinde, William Hsu, Tom Theis, Nasik Nafi, Lior Shamir
Progressive Contextual Excitation for Smart Farming Application

This paper addresses a smart farming application that targets discriminating distinct cocoa bean categories. In smart farming, one critical issue is how to distinguish the small differences among categories. Our proposed scheme is designed to construct a more robust representation to better leverage textural information. The key concept is to adaptively accumulate contextual representations to obtain contextual channel attention. Specifically, we introduce a contextual memory cell to progressively select the contextual channel-wise statistics. The accumulated contextual statistics are then used to explore the channel-wise relationships, which implicitly correlate contextual channel states. Accordingly, we propose the progressive contextual excitation (PCE) module, employing a channel-attention-based architecture to simultaneously correlate the contextual channel-wise relationships. The progressive accumulation via the contextual memory cell efficiently guides the high-level representation by keeping more detailed information, which helps discriminate the small variations involved in the smart farming task. We evaluate our model on the cocoa beans dataset, which comprises fine-grained cocoa bean categories. The experiments show a significant boost compared with existing approaches.

Chia-Hung Bai, Setya Widyawan Prakosa, He-Yen Hsieh, Jenq-Shiou Leu, Wen-Hsien Fang
Fine-Grained Image Classification for Pollen Grain Microscope Images

Pollen classification is an important task in many fields, including allergology, archaeobotany and biodiversity conservation. However, the visual classification of pollen grains is a major challenge due to the difficulty in identifying the subtle variations between sub-categories of objects. The pollen image analysis process is often time-consuming and requires expert evaluation. Even simple tasks, such as image classification or segmentation, require significant effort from experts in aerobiology. Hence, there is a strong need to develop automatic solutions for microscopy image analysis. These considerations underline the effort to study and develop new efficient algorithms. With the growing interest in Deep Learning (DL), much research effort has been devoted to the development of several approaches to accomplish this task. Hence, this study covers the application of effective Deep Learning methods in combination with Fine-Grained Visual Classification (FGVC) approaches, comparing them with other Deep Learning-based methods from the state of the art. All experiments were conducted using the Pollen13K dataset, composed of more than 13,000 pollen objects subdivided into 4 classes. The results of the experiments confirmed the effectiveness of our proposed pipeline, which reached over 97% in terms of accuracy and F1-score.

Francesca Trenta, Alessandro Ortis, Sebastiano Battiato
Adaptive Style Transfer Using SISR

Style transfer is the process that aims to recreate a given image (target image) with the style of another image (style image). In this work, a new style transfer scheme is proposed that uses a single-image super resolution (SISR) network to increase the resolution of the given target image as well as the style image, and performs the transformation process using the pre-trained VGG19 model. A combination of perceptual loss and total variation loss is used, which results in more photo-realistic output. With a change in the content weight, the output image contains different semantic information and the precise structure of the target image, resulting in visually distinguishable results. The generated outputs can be altered by the user from artistic style to photo-realistic style by changing the weights. Detailed experimentation is done with different target and style image pairs, and the subjective quality of the stylised images is measured. Experimental results show that the quality of the generated image is better than existing state-of-the-art schemes. The proposed scheme preserves more information from the target image and creates less distortion for all combinations of different types of images. For a more effective comparison, the contours of the stylised images are extracted and their similarity is measured. This experiment shows that the resulting images have contours closer to the target images and the highest measured similarity, indicating more preservation of semantic information than other existing schemes.
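A minimal sketch of the combined objective (the weights and function names are placeholders, not the paper's values): a perceptual loss over VGG19 features of the generated and target images plus a total-variation penalty that smooths the output.

```python
import torch
import torch.nn.functional as F

def total_variation(img):
    """TV term on a (B, C, H, W) image: mean absolute neighbour differences."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def style_transfer_loss(gen_feats, tgt_feats, gen_img, content_w=1.0, tv_w=1e-4):
    """gen_feats / tgt_feats: lists of VGG19 feature maps for the generated
    and target images; increasing content_w preserves more target structure."""
    perceptual = sum(F.mse_loss(g, t) for g, t in zip(gen_feats, tgt_feats))
    return content_w * perceptual + tv_w * total_variation(gen_img)
```

Varying content_w against the style terms is what moves the output between photo-realistic and artistic renderings, as the abstract describes.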

Anindita Das, Prithwish Sen, Nilkanta Sahu
Object-Centric Anomaly Detection Using Memory Augmentation

Video anomaly detection is attracting increased interest as surveillance becomes more widespread. We propose an object-centric method with memory augmentation (ObjMemAE) for video anomaly detection. Object-centric approaches have recently been at the top of the leaderboards; we take the novel approach of combining an object-centric approach with memory augmentation using a long-term memory bank storing prototypical objects. The memory module also allows the use of additional object-centric features. The proposed method is shown to outperform the baseline by 4.5%, achieving a state-of-the-art AUC score of 98.3% on the UCSD-Ped2 dataset.

Jacob Velling Dueholm, Kamal Nasrollahi, Thomas Baltzer Moeslund
Document Language Classification: Hierarchical Model with Deep Learning Approach

Optical character recognition (OCR) refers to the task of recognizing the characters or text from digital document images. OCR has been widely researched for many years due to its applications in various fields. It helps in the natural language processing of documents, converting document text to speech, semantic analysis of text, searching in documents, etc. Multilingual OCR works with documents having more than one language. Different OCR models have been created and optimized for particular languages. However, when dealing with multiple languages or the translation of documents, one needs to detect the language of the document first and then give it as input to a model specific to that language. Most research in this area focuses on identifying scripts, but considering that a Convolutional Neural Network (CNN) can learn appropriate features, our work focuses on language detection using learned features. We use a hierarchical method in which a binary classification followed by a multiclass classification is used to improve detection accuracy; a sketch of this routing is given below. Largely, current approaches do not use a hierarchy and hence fail to identify the language correctly. The proposed hierarchical approach is used to detect six Indian languages, namely Tamil, Telugu, Kannada, Hindi, Marathi and Gujarati, using a CNN on printed documents based on the text content of a page. Experiments are performed on scanned government documents, and the results indicate that the proposed approach performs better than other similar methods. An advantage of our approach is that it is based on features extracted from the entire page rather than words or characters, and it can also be applied to handwritten documents.
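A sketch of the two-stage routing (the binary split into script families and all model names are our assumptions for illustration, not taken from the paper):

```python
def classify_language(page_image, binary_model, south_model, north_model):
    """Route a page through a binary classifier first, then a group-specific
    multiclass classifier; each *_model is a trained CNN returning a label."""
    group = binary_model(page_image)       # e.g. Dravidian vs. Indo-Aryan scripts
    if group == 0:
        return south_model(page_image)     # Tamil / Telugu / Kannada
    return north_model(page_image)         # Hindi / Marathi / Gujarati
```

The point of the hierarchy is that each multiclass model only has to separate visually similar languages within one group, rather than all six at once.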

Sarathi Shah, M. V. Joshi
Parsing Digitized Vietnamese Paper Documents

In recent years, the need to exploit digitized document data has been increasing. In this paper, we address the problem of parsing digitized Vietnamese paper documents. Digitized Vietnamese documents are mainly in the form of scanned images with diverse layouts and special characters, introducing many challenges. To this end, we first collect the UIT-DODV dataset, a novel Vietnamese document image dataset that includes scientific papers in Vietnamese derived from different scientific conferences. We compile images that were converted from PDF as well as images scanned by a smartphone and a physical scanner, which pose many new challenges. Additionally, we leverage a state-of-the-art object detector along with a fused loss function to efficiently parse Vietnamese paper documents. Extensive experiments conducted on the UIT-DODV dataset provide a comprehensive evaluation and insightful analysis.

Linh Truong Dieu, Thuan Trong Nguyen, Nguyen D. Vo, Tam V. Nguyen, Khang Nguyen
EnGraf-Net: Multiple Granularity Branch Network with Fine-Coarse Graft Grained for Classification Task

Fine-grained classification models can expressly focus on the relevant details useful to distinguish highly similar classes, typically when the intra-class variance is high and the inter-class variance is low in a given dataset. Most of these models use part annotations such as bounding boxes, part locations or text attributes to enhance classification performance, while other models use sophisticated techniques to extract an attention map automatically. We argue that part-based approaches, such as automatic cropping methods, suffer from a missing representation of local features, which are fundamental to distinguish similar objects. While fine-grained classification endeavours to recognize the leaf of a graph, humans recognize an object while also trying to make a semantic association. In this paper, we use semantic associations structured as a hierarchy (taxonomy) as supervision signals in an end-to-end deep neural network model termed EnGraf-Net. Extensive experiments on three well-known datasets, CIFAR-100, CUB-200-2011 and FGVC-Aircraft, prove the superiority of EnGraf-Net over many fine-grained models, and it is competitive with the most recent best models without using any cropping technique or manual annotations.

Riccardo La Grassa, Ignazio Gallo, Nicola Landro
When Deep Learners Change Their Mind: Learning Dynamics for Active Learning

Active learning aims to select for annotation the samples that yield the largest performance improvement for the learning algorithm. Many methods approach this problem by measuring the informativeness of samples, based on the certainty of the network predictions for those samples. However, it is well known that neural networks are overly confident about their predictions and are therefore an untrustworthy source for assessing sample informativeness. In this paper, we propose a new informativeness-based active learning method. Our measure is derived from the learning dynamics of a neural network. More precisely, we track the label assignment of the unlabeled data pool during the training of the algorithm. We capture the learning dynamics with a metric called label-dispersion, which is low when the network consistently assigns the same label to a sample during training and high when the assigned label changes frequently. We show that label-dispersion is a promising predictor of the uncertainty of the network, and show on two benchmark datasets that an active learning algorithm based on label-dispersion obtains excellent results.
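A minimal sketch of one plausible form of the metric (the paper's exact normalization may differ): dispersion is 0 when the network always assigns the same label across training checkpoints and approaches 1 when the assignment keeps changing.

```python
import numpy as np

def label_dispersion(pred_history):
    """pred_history: labels assigned to one unlabeled sample at T checkpoints.
    Returns 1 minus the fraction of the most frequent label."""
    _, counts = np.unique(pred_history, return_counts=True)
    return 1.0 - counts.max() / len(pred_history)

# Samples with the highest dispersion are treated as the most uncertain
# and would be queried for annotation first.
history = [3, 3, 5, 3, 7, 3]        # labels assigned over six checkpoints
print(label_dispersion(history))    # prints 0.333...
```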

Javad Zolfaghari Bengar, Bogdan Raducanu, Joost van de Weijer
Learning to Navigate in the Gaussian Mixture Surface

In recent years, deep learning models have achieved remarkable generalization capability on computer vision tasks, obtaining excellent results in fine-grained classification problems. Sophisticated approaches based on discriminative feature learning via patches have been proposed in the literature, boosting model performance and achieving the state of the art on well-known datasets. The Cross-Entropy (CE) loss function is commonly used to enhance the discriminative power of the deep learned features, encouraging separability between the classes. However, observing the activation maps generated by these models in the hidden layers, we realize that many image regions with low discriminative content have a high activation response, and this can lead to misclassifications. To address this problem, we propose a loss function called Gaussian Mixture Centers (GMC) loss, leveraging the idea that data follow multiple unimodal distributions. We aim to reduce variances by considering many centers per class, using the information from the hidden layers of a deep model, and decreasing the high responses from the unnecessary image regions detected in the baselines. Using CE and GMC loss jointly, we improve the generalization of the learned model, overcoming the performance of the baselines in several use cases. We show the effectiveness of our approach by carrying out experiments on the CUB-200-2011, FGVC-Aircraft and Stanford-Dogs benchmarks, considering the most recent Convolutional Neural Networks (CNNs).
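A minimal sketch under the assumption (ours, not the paper's exact formulation) that the GMC idea generalizes center loss: each class owns K learnable centers and features are pulled toward the nearest one, reducing intra-class variance around multiple modes.

```python
import torch
import torch.nn as nn

class GMCLoss(nn.Module):
    """Multiple learnable centers per class; pull each feature to its
    nearest same-class center (illustrative, not the published definition)."""
    def __init__(self, num_classes, feat_dim, k=3):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, k, feat_dim))

    def forward(self, feats, labels):
        c = self.centers[labels]                      # (B, K, D) class centers
        d = ((feats.unsqueeze(1) - c) ** 2).sum(-1)   # squared distances (B, K)
        return d.min(dim=1).values.mean()             # pull to nearest center

# Used jointly with cross-entropy, e.g.:
#   loss = ce_loss(logits, labels) + lam * gmc(feats, labels)
```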

Riccardo La Grassa, Ignazio Gallo, Calogero Vetro, Nicola Landro
A Deep Hybrid Approach for Hate Speech Analysis

Hate speech consists of insults or stereotypes directed at a person or a group of people based on characteristics such as origin, race, gender and religion. Hate speech can be classified using machine learning and deep learning methods that distinguish one class from another. Moreover, vast amounts of data accumulate from social media every day, and a single deep learning model cannot provide sufficiently diversified features for text classification due to the data characteristics. Therefore, this paper proposes two methods for hate speech classification. First, a majority voting classifier with three hybrid deep learning models is presented. Second, a multi-channel convolutional neural network with a bi-directional gated recurrent unit and a capsule network is introduced. The proposed approach helps improve classification accuracy and ground truth information by reducing ambiguity. The proposed models are verified using six different data sets. The experimental outcomes demonstrate that the proposed methods achieve adequate results for hate speech classification.
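A minimal sketch of the majority-voting combiner over the three hybrid models (the tie-breaking rule is our assumption for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted class label per model; with three models,
    a tie (all labels different) falls back to the first model's output."""
    label, n = Counter(predictions).most_common(1)[0]
    return label if n > 1 else predictions[0]

print(majority_vote(["hate", "hate", "neutral"]))  # -> "hate"
```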

Vipul Shah, Sandeep S. Udmale, Vijay Sambhe, Amey Bhole
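The majority-voting step itself is simple; a minimal sketch follows, assuming each of the three hybrid models (whose architectures are not shown here) outputs hard per-sample class labels as non-negative integers:

```python
import numpy as np

def majority_vote(pred_a: np.ndarray, pred_b: np.ndarray, pred_c: np.ndarray) -> np.ndarray:
    """Hard majority voting over three classifiers.
    Each pred_* is an array of integer class labels, one per sample."""
    stacked = np.stack([pred_a, pred_b, pred_c])  # (3, num_samples)
    # for each sample, pick the label predicted most often across models
    return np.array([np.bincount(col).argmax() for col in stacked.T])
```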
On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images may be affected by different sources of perturbation (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks, and it has therefore attracted wide interest within the computer vision community. We propose a transformation step that enhances the generalization ability of CNN models in the presence of noise unseen during training. Concretely, the delineation maps of given images are computed using the CORF push-pull inhibition operator. This operation transforms an input image into a representation that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, while consistently achieving significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.

Guru Swaroop Bennabhaktula, Joey Antonisse, George Azzopardi
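The CORF operator itself combines oriented filter responses in a way not reproduced here; the following is only an illustrative stand-in that conveys the push-pull inhibition idea with simple difference-of-Gaussians filters, where an excitatory ("push") edge response is suppressed by an inhibitory ("pull") response at a coarser scale so that noise-driven responses are dampened:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def push_pull_edges(img: np.ndarray, sigma: float = 1.0, k: float = 0.8) -> np.ndarray:
    """Toy push-pull edge map; `sigma` and the inhibition weight `k`
    are illustrative parameters, not values from the paper."""
    img = img.astype(np.float32)
    push = gaussian_filter(img, sigma) - gaussian_filter(img, 2 * sigma)   # DoG edge response
    # inhibitory response: local high-frequency energy at a coarser scale
    pull = gaussian_filter(np.abs(img - gaussian_filter(img, sigma)), 4 * sigma)
    return np.maximum(np.abs(push) - k * pull, 0.0)  # half-wave rectified map
```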
Fast Hand Detection in Collaborative Learning Environments

Long-term object detection requires the integration of frame-based results over several seconds. For non-deformable objects, long-term detection is often addressed using object detection followed by video tracking. Unfortunately, tracking is inapplicable to objects that undergo dramatic changes in appearance from frame to frame. As a representative example, we study hand detection over long video recordings in collaborative learning environments. More specifically, we develop long-term hand detection methods that can deal with partial occlusions and dramatic changes in appearance. Our approach integrates object detection followed by time projections, clustering, and small region removal to provide effective hand detection over long videos. The hand detector achieved an average precision (AP) of 72% at 0.5 intersection over union (IoU). The detection results were improved to 81% AP by using our optimized approach for data augmentation; this method runs at 4.7× real-time. Our method reduced the number of false-positive hand detections by 80% by improving IoU ratios from 0.2 to 0.5. The overall hand detection system runs at 4× real-time.

Sravani Teeparthi, Venkatesh Jatla, Marios S. Pattichis, Sylvia Celedón-Pattichis, Carlos LópezLeiva
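A hedged sketch of the time-projection and small-region-removal steps described above: per-frame detection boxes are accumulated into a density map over a video segment, the map is thresholded, and small connected components are discarded. The threshold and minimum-area values are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def long_term_regions(boxes_per_frame, frame_shape, min_area=500, thresh=0.2):
    """boxes_per_frame: list (one entry per frame) of lists of integer
    (x1, y1, x2, y2) boxes. Returns a binary map of stable regions."""
    density = np.zeros(frame_shape, dtype=np.float32)
    for boxes in boxes_per_frame:
        for x1, y1, x2, y2 in boxes:
            density[y1:y2, x1:x2] += 1.0
    density /= max(len(boxes_per_frame), 1)        # fraction of frames covered
    mask = density > thresh                        # keep consistently detected areas
    labeled, n = ndimage.label(mask)               # connected-component clustering
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))
    keep_ids = np.flatnonzero(sizes >= min_area) + 1
    return np.isin(labeled, keep_ids)              # small regions removed
```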
Assessing the Role of Boundary-Level Objectives in Indoor Semantic Segmentation

Providing fine-grained and accurate segmentation maps of indoor scenes is a challenging task with relevant applications in the fields of augmented reality, image retrieval, and personalized robotics. While most of the recent literature on semantic segmentation has focused on outdoor scenarios, the generation of accurate indoor segmentation maps has been comparatively under-investigated. With the goal of increasing the accuracy of semantic segmentation in indoor scenarios, we focus on the analysis of boundary-level objectives, which foster the generation of fine-grained boundaries between different semantic classes and which have never been explored for indoor segmentation. In particular, we test and devise variants of both the Boundary and Active Boundary losses, two recent proposals that deal with the prediction of semantic boundaries. Through experiments on the NYUDv2 dataset, we quantify the role of such losses in terms of accuracy and quality of boundary prediction, and demonstrate the accuracy gain of the proposed variants.

Roberto Amoroso, Lorenzo Baraldi, Rita Cucchiara
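For intuition, here is a minimal sketch of a boundary-level objective of the kind tested above: boundary bands are extracted from predicted and ground-truth masks with the common max-pooling trick, and an F1 score between the bands is turned into a loss. This is a generic formulation, not the exact variants devised in the paper:

```python
import torch
import torch.nn.functional as F

def boundary_map(mask: torch.Tensor, width: int = 3) -> torch.Tensor:
    """Extract a soft boundary band from a (B, 1, H, W) mask in [0, 1]:
    the mask minus its morphological erosion (via max-pooling)."""
    pad = width // 2
    eroded = 1 - F.max_pool2d(1 - mask, width, stride=1, padding=pad)
    return mask - eroded   # ~1 on a thin band along the object contour

def boundary_f1_loss(pred, target, width=3, eps=1e-7):
    """1 - F1 between predicted and ground-truth boundary bands."""
    pb, tb = boundary_map(pred, width), boundary_map(target, width)
    precision = (pb * tb).sum() / (pb.sum() + eps)
    recall = (pb * tb).sum() / (tb.sum() + eps)
    return 1 - 2 * precision * recall / (precision + recall + eps)
```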
Skin Lesion Classification Using Convolutional Neural Networks Based on Multi-Features Extraction

In recent years, deep learning has become a crucial technique for the detection of various forms of skin lesions, and Convolutional Neural Networks (CNNs) have become the state-of-the-art choice for feature extraction. In this paper, we investigate the efficiency of three state-of-the-art pre-trained CNN architectures as feature extractors, combined with four machine learning classifiers, to perform the classification of skin lesions on the PH2 dataset. We find that DenseNet201 combined with a Cubic SVM achieves the best accuracy: 99% and 95% for two- and three-class classification, respectively. The results also show that the suggested method is competitive with other approaches on the PH2 dataset.

Samia Benyahia, Boudjelal Meftah, Olivier Lézoray
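The best-performing combination reported above can be sketched as follows, assuming a frozen ImageNet-pretrained DenseNet201 as feature extractor and a third-degree polynomial-kernel SVM as the "Cubic SVM"; data loading and preprocessing are omitted:

```python
import torch
from torchvision import models
from sklearn.svm import SVC

# Pre-trained DenseNet201 with its classification head replaced by the
# identity, so the forward pass returns the 1920-d pooled features.
backbone = models.densenet201(weights="IMAGENET1K_V1")
backbone.classifier = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract(batch: torch.Tensor):   # batch: (N, 3, 224, 224), ImageNet-normalized
    return backbone(batch).numpy()

clf = SVC(kernel="poly", degree=3)  # cubic polynomial kernel
# clf.fit(extract(train_imgs), train_labels)
# accuracy = clf.score(extract(test_imgs), test_labels)
```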
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing

Within the field of instance segmentation, most state-of-the-art deep learning networks nowadays rely on cascade architectures [1], where multiple object detectors are trained sequentially, re-sampling the ground truth at each step. This offers a solution to the problem of exponentially vanishing positive samples, but it also increases network complexity in terms of the number of parameters. To address this issue, we propose Recursively Refined R-CNN (R³-CNN), which avoids duplicate detectors by introducing a loop mechanism instead. At the same time, it achieves a quality boost through a recursive re-sampling technique, where a specific IoU quality is targeted in each recursion so as to eventually cover the positive spectrum evenly. Our experiments show that the loop mechanism is encoded in the network weights, so it must also be used at inference time. The R³-CNN architecture surpasses the recently proposed HTC [4] model while significantly reducing the number of parameters. Experiments on the COCO minival 2017 dataset show a performance boost independent of the baseline model used. The code is available online at https://github.com/IMPLabUniPr/mmdetection/tree/r3_cnn .

Leonardo Rossi, Akbar Karimi, Andrea Prati
Layer-Wise Relevance Propagation Based Sample Condensation for Kernel Machines

Kernel machines are a powerful class of methods for classification and regression. Making kernel machines fast and scalable to large data, however, is still a challenging problem due to the need to store and operate on the Gram matrix. In this paper we propose a novel approach to sample condensation for kernel machines that aims not to impair the classification performance. To the best of our knowledge, no previous work with the same goal has been reported in the literature. For this purpose we make use of the neural network interpretation of kernel machines. Explainable AI techniques, in particular the Layer-wise Relevance Propagation (LRP) method, are used to measure the relevance (importance) of training samples. Given this relevance measure, a decremental strategy is proposed for sample condensation. Experimental results on three data sets show that our approach achieves a substantial reduction of the number of training samples.

Daniel Winter, Ang Bian, Xiaoyi Jiang
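A hedged sketch of a decremental condensation loop of the kind described above: the model is retrained repeatedly, training samples are scored by relevance, and the least relevant fraction is dropped while held-out accuracy holds. Here the magnitudes of the SVM dual coefficients stand in as a crude relevance proxy, a deliberate simplification of the paper's LRP-based relevance computation:

```python
import numpy as np
from sklearn.svm import SVC

def condense(X, y, X_val, y_val, drop_frac=0.1, min_keep=50, tol=0.01):
    """Return indices of a condensed training set. `drop_frac`, `min_keep`
    and `tol` are illustrative parameters, not values from the paper."""
    keep = np.arange(len(X))
    prev_keep, best = keep.copy(), 0.0
    while len(keep) > min_keep:
        clf = SVC(kernel="rbf").fit(X[keep], y[keep])
        acc = clf.score(X_val, y_val)
        if acc < best - tol:               # accuracy degraded: undo the last drop
            return prev_keep
        best, prev_keep = max(best, acc), keep.copy()
        rel = np.zeros(len(keep))          # relevance proxy per kept sample
        rel[clf.support_] = np.abs(clf.dual_coef_).sum(axis=0)
        order = np.argsort(rel)            # least relevant samples first
        keep = keep[order[int(drop_frac * len(keep)):]]
    return keep
```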
Backmatter
Metadata
Title: Computer Analysis of Images and Patterns
Editors: Dr. Nicolas Tsapatsoulis, Dr. Andreas Panayides, Dr. Theo Theocharides, Dr. Andreas Lanitis, Prof. Dr. Constantinos Pattichis, Mario Vento
Copyright Year: 2021
Electronic ISBN: 978-3-030-89128-2
Print ISBN: 978-3-030-89127-5
DOI: https://doi.org/10.1007/978-3-030-89128-2
