
This 8-volume set constitutes the refereed proceedings of the 25th International Conference on Pattern Recognition Workshops, ICPR 2020, held virtually in Milan, Italy, and rescheduled to January 10–11, 2021, due to the Covid-19 pandemic. The 416 full papers presented in these 8 volumes were carefully reviewed and selected from about 700 submissions. The 46 workshops cover a wide range of areas, including machine learning, pattern analysis, healthcare, human behavior, environment, surveillance, forensics and biometrics, robotics and egovision, cultural heritage and document analysis, retrieval, and women at ICPR2020.

### Adaptive Future Frame Prediction with Ensemble Network

Future frame prediction in videos is a challenging problem because videos include complicated movements and large appearance changes. Learning-based future frame prediction approaches have been proposed in the literature. A common limitation of existing learning-based approaches is a mismatch between training data and test data. In the future frame prediction task, however, the ground truth can be obtained simply by waiting a few frames, which means the prediction model can be updated online during the test phase. We therefore propose an adaptive update framework for future frame prediction. The proposed framework consists of a pre-trained prediction network, a continuously updated prediction network, and a weight estimation network. We also show that our pre-trained prediction model achieves performance comparable to existing state-of-the-art approaches, and we demonstrate that our approach outperforms existing methods, especially for dynamically changing scenes.

Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi, Yoko Sasaki
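To illustrate the adaptive idea in the framework above, the sketch below blends the outputs of a frozen pre-trained predictor and an online-updated predictor using a single weight. The paper estimates this weight with a dedicated network. Here, as a simplifying assumption, the weight is a softmax over the two predictors' errors on the most recent frame, which becomes available once the true frame arrives. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def blend_weight(err_pre: float, err_online: float, temperature: float = 1.0) -> float:
    """Weight for the online predictor: softmax over negative recent errors.

    Hypothetical stand-in for the learned weight estimation network.
    """
    logits = np.array([-err_pre, -err_online]) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[1])

def blend(pred_pre: np.ndarray, pred_online: np.ndarray, w_online: float) -> np.ndarray:
    """Convex combination of the two predicted frames."""
    return (1.0 - w_online) * pred_pre + w_online * pred_online

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

# Toy example: the scene has drifted, so the online model tracked the last frame better.
truth_prev = np.ones((4, 4))
pre_prev, online_prev = np.zeros((4, 4)), np.full((4, 4), 0.9)
w = blend_weight(mse(pre_prev, truth_prev), mse(online_prev, truth_prev))
frame = blend(np.zeros((4, 4)), np.full((4, 4), 0.9), w)
```

Because the online model's recent error is lower, it receives the larger weight. If the two errors were equal, the two predictions would simply be averaged.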

### Rain-Code Fusion: Code-to-Code ConvLSTM Forecasting Spatiotemporal Precipitation

Recently, flood damage has become a social problem owing to unprecedented weather conditions arising from climate change. An immediate response to heavy rain is important for mitigating economic losses and for rapid recovery. Spatiotemporal precipitation forecasts may improve the accuracy of dam inflow prediction more than 6 h ahead, which is useful for flood damage mitigation. However, in real-world precipitation forecasting, the ordinary ConvLSTM cannot reliably predict beyond 3 timesteps, owing to the irreducible bias between the predicted and ground-truth values. This paper proposes a rain-code approach for spatiotemporal code-to-code precipitation forecasting. We propose a novel rainy feature that represents the temporal rainy process, using multi-frame fusion to reduce the number of timesteps. We perform rain-code studies over various term ranges based on the standard ConvLSTM. We applied the approach to a dam region using hourly precipitation data from the Japanese rainy season, from May to October of every year between 2006 and 2019, approximately 127 thousand hours in total. We use hourly radar analysis data covering a broader central region with an area of 136 × 148 km². Finally, we provide sensitivity studies of the relationship between rain-code size and hourly accuracy across several forecasting ranges.

Takato Yasuno, Akira Ishii, Masazumi Amakata

### Using Graph Neural Networks to Reconstruct Ancient Documents

In recent years, machine learning and deep learning approaches such as artificial neural networks have gained popularity for solving automatic puzzle problems. Indeed, these methods are able to extract high-level representations from images and can then be trained to separate matching image pieces from non-matching ones. These applications have many similarities to the problem of reconstructing ancient documents from partially recovered fragments. In this work, we present a solution based on a Graph Neural Network that uses pairwise patch information to assign labels to edges representing the spatial relationships between pairs. The network classifies the relationship between a source and a target patch as one of Up, Down, Left, Right, or None. By doing so for all edges, our model outputs a new graph representing a reconstruction proposal. Finally, we show that our model not only provides correct classifications at the edge level but can also generate partial or full reconstruction graphs from a set of patches.

Cecilia Ostertag, Marie Beurton-Aimar
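To make the edge-labelling task above concrete, the sketch below builds ground-truth labels for every ordered pair of patches cut from a regular grid. It assumes that each patch is identified by its (row, col) position and that the label describes where the target sits relative to the source. The paper's exact label convention and graph construction may differ, and the function names are hypothetical.

```python
from itertools import permutations

def relation(src: tuple[int, int], tgt: tuple[int, int]) -> str:
    """Label the target patch's position relative to the source patch."""
    (r1, c1), (r2, c2) = src, tgt
    if (r2, c2) == (r1 - 1, c1):
        return "Up"
    if (r2, c2) == (r1 + 1, c1):
        return "Down"
    if (r2, c2) == (r1, c1 - 1):
        return "Left"
    if (r2, c2) == (r1, c1 + 1):
        return "Right"
    return "None"  # the two patches are not adjacent

def edge_labels(rows: int, cols: int) -> dict[tuple, str]:
    """Fully connected directed graph over grid patches, with one label per edge."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    return {(s, t): relation(s, t) for s, t in permutations(nodes, 2)}

labels = edge_labels(2, 2)
```

In this framing, a reconstruction proposal corresponds to keeping only the edges whose predicted label is not `None`.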

### AnCoins: Image-Based Automated Identification of Ancient Coins Through Transfer Learning Approaches

The identification of ancient coins is a time-consuming and complex task that demands considerable expertise. The analysis of numismatic evidence through pattern detection performed by Machine Learning methods is beginning to be recognized as an approach that can provide archaeologists with a wide range of tools, which, especially in numismatics, can be used to ascertain distribution, continuity, changes in engraving style, and imitation. In this paper, we introduce the Ancient Coins (AnCoins-12) dataset, a set of images of 12 different classes of ancient Greek coins from the area of ancient Thrace, aimed at the automatic identification of their issuing authority. In this context, we describe the methodology of data acquisition and dataset organization, emphasizing the small number of images available in this field. In addition, we apply deep learning approaches based on popular CNN architectures to classify the images of the newly introduced dataset. Pre-trained CNNs, through transfer learning approaches, achieved a top-1 validation accuracy of 98.32% and a top-5 validation accuracy of 99.99%. To better disseminate the results in the archaeological community, we introduce a responsive web-based application with an extension focusing on the identification of common characteristics in different coin types. We conclude the paper by stressing some of the most important key elements of the proposed approaches and by highlighting some future challenges.

Chairi Kiourt, Vasilis Evangelidis
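The top-1 and top-5 accuracies reported for AnCoins-12 count a prediction as correct when the true class is among the k highest-scoring classes. A minimal pure-Python sketch of that metric follows. The function name and the toy scores are illustrative only.

```python
def top_k_accuracy(scores: list[list[float]], labels: list[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores for this sample
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

scores = [
    [0.1, 0.7, 0.2],  # true class 1 is ranked first
    [0.5, 0.3, 0.2],  # true class 2 is ranked last
    [0.2, 0.3, 0.5],  # true class 1 is ranked second
]
labels = [1, 2, 1]
```

With these toy scores, top-1 accuracy is 1/3, top-2 accuracy is 2/3, and top-3 accuracy is 1.0. This shows why top-5 accuracy on a 12-class problem is always at least as high as top-1.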

### Subjective Assessments of Legibility in Ancient Manuscript Images - The SALAMI Dataset

The research field concerned with the digital restoration of degraded written heritage lacks a quantitative metric for evaluating its results, which prevents the comparison of relevant methods on large datasets. We therefore introduce a novel dataset of Subjective Assessments of Legibility in Ancient Manuscript Images (SALAMI) to serve as a ground truth for the development of quantitative evaluation metrics in the field of digital text restoration. This dataset consists of 250 images of 50 manuscript regions with corresponding spatial maps of mean legibility and uncertainty, which are based on a study conducted with 20 experts in philology and paleography. As this study is the first of its kind, the validity and reliability of its design and of the results obtained are justified statistically. We report high intra- and inter-rater agreement and show that the bulk of the variation in the scores is introduced by the image regions observed, not by controlled or uncontrolled properties of participants and test environments. We thus conclude that the measured legibility scores are valid attributes of the underlying images.

Simon Brenner, Robert Sablatnig
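The SALAMI maps of mean legibility and uncertainty can be thought of as per-pixel statistics over the experts' ratings. The sketch below assumes that each rater provides a score map of the same shape, and it uses the sample standard deviation across raters as the uncertainty. The dataset's actual definition of uncertainty may differ, and `legibility_maps` is a hypothetical name.

```python
import numpy as np

def legibility_maps(ratings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """ratings: array of shape (n_raters, H, W). Returns per-pixel mean and spread."""
    mean_map = ratings.mean(axis=0)
    std_map = ratings.std(axis=0, ddof=1)  # sample standard deviation across raters
    return mean_map, std_map

ratings = np.array([
    [[1.0, 4.0]],  # rater 1, 1x2 image
    [[3.0, 4.0]],  # rater 2
])
mean_map, std_map = legibility_maps(ratings)
```

In this toy case, the raters disagree on the first pixel (spread √2) and agree on the second (spread 0).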

### Can OpenPose Be Used as a 3D Registration Method for 3D Scans of Cultural Heritage Artifacts

3D scanning of artifacts is an important tool for the study and preservation of cultural heritage. Systems for 3D reconstruction are constantly developing, but due to the shape and size of artifacts it is usually necessary to perform 3D scanning from several different positions in space. This raises the problem of 3D registration, which is the process of aligning different point clouds. Software-based 3D registration methods typically require identifying a sufficient number of point correspondence pairs between different point clouds. These correspondences are frequently found manually and/or by introducing specially designed objects into the scene. In this work, we explore whether OpenPose, a well-known deep learning model, can be used to find corresponding point pairs between different views and eventually ensure a successful 3D registration. OpenPose is trained to find patterns and keypoints in images containing people. We observe that many artifacts do have human-like postures, and we test our ideas on finding correspondences using OpenPose. Furthermore, for artifacts with no human-like appearance, we demonstrate a method that introduces a simple human-like image into the 3D scene, in turn allowing OpenPose to facilitate 3D registration between 3D scans from different views. The proposed 3D registration pipeline is easily applicable to many existing 3D scanning solutions for artifacts.

Tomislav Pribanić, David Bojanić, Kristijan Bartol, Tomislav Petković

### Survey on Deep Learning-Based Kuzushiji Recognition

Since deep learning demonstrated overwhelming accuracy at the 2012 image classification competition, it has been successfully applied to a variety of other tasks. The high-precision detection and recognition of Kuzushiji, a Japanese cursive script used for transcribing historical documents, has been made possible through the use of deep learning. In recent years, competitions on Kuzushiji recognition have been held, and many researchers have proposed various recognition methods. This study examines recent research trends, current problems, and future prospects in Kuzushiji recognition using deep learning.

Kazuya Ueki, Tomoka Kojima

### Stylistic Classification of Historical Violins: A Deep Learning Approach

The stylistic study of artworks is a well-known problem in the Cultural Heritage field. Traditional artworks, such as statues and paintings, have been extensively studied by art experts, producing standard methodologies to analyze and recognize the style of an artist. In this context, the case of historical violins is peculiar. Even though the main stylistic features of a violin are known, only a few experts are capable of attributing a violin to its maker with a high degree of certainty. This paper presents a study on the use of deep learning to discriminate violin styles. First, we collected images of 17th–18th century violins held at, or on temporary loan to, the “Museo del Violino” of Cremona (Italy) to be used as a reference dataset. Then, we tested the performance of three state-of-the-art CNNs (VGG16, ResNet50, and InceptionV3) on a binary classification task (Stradivari vs. NotStradivari). The best performing model achieved 77.27% accuracy and an F1 score of 0.72. This is a promising result, considering the limited amount of data and the complexity of the task, even for human experts. Finally, we compared the regions of interest identified by the network with those identified in a previous eye-tracking study conducted on expert luthiers, to highlight similarities and differences between the two behaviors.

Piercarlo Dondi, Luca Lombardi, Marco Malagodi, Maurizio Licchelli

### Text Line Extraction Using Fully Convolutional Network and Energy Minimization

Text lines are important components of handwritten document images and are easier for downstream applications to analyze. Despite recent progress in text line detection, text line extraction from handwritten documents remains an unsolved task. This paper proposes using a fully convolutional network for text line detection and energy minimization for text line extraction. Detected text lines are represented by blob lines that strike through the text lines. These blob lines guide an energy function for text line extraction. The detection stage can locate arbitrarily oriented text lines. Furthermore, the extraction stage is capable of finding the pixels of text lines with various heights and interline proximity, independent of their orientations. It can also finely split touching and overlapping text lines without an orientation assumption. We evaluate the proposed method on the VML-AHTE, VML-MOC, and Diva-HisDB datasets. The VML-AHTE dataset contains overlapping, touching, and close text lines with rich diacritics. The VML-MOC dataset is very challenging because of its multi-oriented and skewed text lines. The Diva-HisDB dataset exhibits distinct text line heights and touching text lines. The results demonstrate the effectiveness of the method despite these various types of challenges, while using the same parameters in all experiments.

### Handwriting Classification of Byzantine Codices via Geometric Transformations Induced by Curvature Deformations

In the present paper, we propose a methodology of general applicability for matching, comparing, and grouping planar shapes under a unified framework. This is achieved by interpreting the grouping of shapes as a result of the hypothesis that shapes of the same class come from the same implicit family of curves. To render the analysis independent of the functional form of the implicit function of the curve families, we have formalized the comparison of shapes in terms of the implicit curvature function. The implementation of the methodology targets automatic writer identification, and the corresponding information system has been applied to identifying the writers of Byzantine codices that preserve the Iliad. The shapes in question are the alphabet symbols appearing in the images of the documents to be classified. The realizations of each alphabet symbol are compared pairwise, modulo affine transformations. The statistical compatibility of these comparisons within the same document and between different documents determines the likelihood of attributing different documents to the same hand. By maximizing the joint likelihood over all alphabet symbols common to all documents, we determine the most probable classification of the given documents into writing hands. Applying the methodology to 25 images of Byzantine codices' pages indicated that these pages were written by 4 hands, in full accordance with experts' opinion and knowledge.

Dimitris Arabadjis, Constantin Papaodysseus, Athanasios Rafail Mamatsis, Eirini Mamatsi
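Comparing two realizations of a symbol "modulo affine transformations" can be illustrated by fitting the best affine map between corresponding contour points and measuring what remains. This is a simplified stand-in. The paper works with implicit curvature functions, whereas this sketch assumes the point correspondences are already known. `affine_residual` is a hypothetical name.

```python
import numpy as np

def affine_residual(src: np.ndarray, dst: np.ndarray) -> float:
    """Least-squares fit dst ≈ src @ A.T + t and return the RMS point residual.

    src, dst: (N, 2) arrays of corresponding points (correspondence assumed known).
    """
    n = src.shape[0]
    design = np.hstack([src, np.ones((n, 1))])             # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)  # (3, 2): A.T stacked on t
    residual = dst - design @ params
    return float(np.sqrt(np.mean(np.sum(residual ** 2, axis=1))))

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
A = np.array([[2.0, 0.5], [0.0, 1.5]])
sheared = square @ A.T + np.array([3.0, -1.0])  # an exact affine image
bent = sheared.copy()
bent[2] += [0.5, 0.5]                           # not reachable by any affine map
```

A residual near zero means the two shapes are affinely equivalent. A larger residual measures the non-affine difference that such a comparison would score.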

### Visual Programming-Based Interactive Analysis of Ancient Documents: The Case of Magical Signs in Jewish Manuscripts

This paper presents an interactive system for experimental ancient document analysis, applied to the specific use case of the computational analysis of magical signs, called Brillenbuchstaben or charaktêres, in digitized Jewish manuscripts. It draws upon known theories as well as methods from open-source toolboxes and embeds them in a visual programming-based user interface. Its general design is aimed particularly at the needs of Humanities scholars and thus enables them to analyze these signs computationally without any prior programming experience. To this end, the web-based system has been designed to be, e.g., interoperable, modular, flexible, transparent, and readily accessible to tech-unsavvy users regardless of their background. The paper further discusses a paradigmatic user study conducted with domain experts to evaluate this system and presents initial results from those evaluations.

Parth Sarthi Pandey, Vinodh Rajan, H. Siegfried Stiehl, Michael Kohs

### Quaternion Generative Adversarial Networks for Inscription Detection in Byzantine Monuments

In this work, we introduce and discuss Quaternion Generative Adversarial Networks, a variant of generative adversarial networks that uses quaternion-valued inputs, weights, and intermediate network representations. Quaternionic representation has the advantage of treating cross-channel information carried by multichannel signals (e.g. color images) holistically, while quaternionic convolution has been shown to be less resource-demanding. Standard convolutional and deconvolutional layers are replaced by their quaternionic variants in both the generator and discriminator nets, while activations and loss functions are adapted accordingly. We have successfully tested the model on the task of detecting Byzantine inscriptions in the wild, where the proposed model is on par with a vanilla conditional generative adversarial network but is significantly less expensive in terms of model size (it requires 4× fewer parameters). Code is available at https://github.com/sfikas/quaternion-gan .

Giorgos Sfikas, Angelos P. Giotis, George Retsinas, Christophoros Nikou
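The 4× parameter saving mentioned above comes from the Hamilton product. A quaternion weight has four real components, and it is reused across all four input components through a structured real matrix. The NumPy sketch below is independent of the linked repository. It builds that 4×4 block for a single quaternion weight and compares the parameter counts of a real and a quaternion dense layer. The helper names are hypothetical.

```python
import numpy as np

def hamilton_matrix(r: float, i: float, j: float, k: float) -> np.ndarray:
    """Real 4x4 matrix M such that M @ [a, b, c, d] is the Hamilton product w * q,
    where w = r + iI + jJ + kK and q = a + bI + cJ + dK."""
    return np.array([
        [r, -i, -j, -k],
        [i,  r, -k,  j],
        [j,  k,  r, -i],
        [k, -j,  i,  r],
    ])

def dense_param_counts(in_q: int, out_q: int) -> tuple[int, int]:
    """Weight counts (no bias) for a layer mapping in_q -> out_q quaternions,
    i.e. 4*in_q -> 4*out_q real channels."""
    real = (4 * in_q) * (4 * out_q)  # unconstrained real weight matrix
    quat = 4 * in_q * out_q          # one quaternion (4 reals) per connection
    return real, quat

real, quat = dense_param_counts(8, 16)
```

The real layer needs 2048 weights and the quaternion layer needs 512, a 4× reduction. The same ratio holds per kernel position in quaternionic convolutions. As a sanity check, `hamilton_matrix(0, 1, 0, 0) @ [0, 0, 1, 0]` gives `[0, 0, 0, 1]`, i.e. I·J = K.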

### Transfer Learning Methods for Extracting, Classifying and Searching Large Collections of Historical Images and Their Captions

This paper describes the creation of an interactive software tool and dataset for exploring the unindexed 11-volume set Pompei: Pitture e Mosaici (PPM), a valuable resource containing over 20,000 annotated historical images of the archaeological site of Pompeii, Italy. The tool includes functionalities such as word search and image and caption similarity search. Similarity searches are conducted using transfer learning on the data retrieved from the scanned version of PPM. Image processing, convolutional neural networks, and natural language processing were also used to extract, classify, and archive the text and image data from the digitized version of the books.

Cindy Roullet, David Fredrick, John Gauch, Rhodora G. Vennarucci, William Loder

### Deep Learning Spatial-Spectral Processing of Hyperspectral Images for Pigment Mapping of Cultural Heritage Artifacts

In 2015, the Gough Map, one of the earliest surviving maps of Britain, was imaged using a hyperspectral imaging system while in the collection of the Bodleian Library, University of Oxford. Hyperspectral image (HSI) classification has been widely used to identify materials in remotely sensed images, and hyperspectral imaging has recently been applied to studies of historical artifacts. The HSI data of the Gough Map were collected for pigment mapping of towns and writing with different spatial patterns and spectral (color) features. We developed a spatial-spectral deep learning framework called 3D-SE-ResNet to automatically classify pigments in large HSIs of cultural heritage artifacts with limited reference (labelled) data, and we have applied it to the Gough Map. Requiring much less effort and offering much higher efficiency, this approach is a breakthrough in object identification and classification in cultural heritage studies: it leverages the spectral and spatial information contained in this imagery, providing codicological information to cartographic historians.

Di Bai, David W. Messinger, David Howell

### Abstracting Stone Walls for Visualization and Analysis

An innovative abstraction technique to represent, both mathematically and visually, some geometric properties of the facing stones in a wall is presented. The technique has been developed within the W.A.L.(L) Project, an interdisciplinary effort to apply Machine Learning techniques to support and integrate archaeological research. More precisely, the paper introduces an original way to “abstract” the complex and irregular 3D shapes of stones in a wall with suitable ellipsoids. A wall is first digitized into a single 3D point cloud, which is then segmented into the sub-meshes of its stones. Each stone mesh is then “summarized” by the inertial ellipsoid of the point cloud of its vertices. In this way, a wall is turned into a “population” of ellipsoid shapes whose statistical properties may be processed with Machine Learning algorithms to identify typologies of the walls under study. The paper also reports two simple case studies to assess the effectiveness of the proposed approach.

Giovanni Gallo, Francesca Buscemi, Michele Ferro, Marianna Figuera, Paolo Marco Riela
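The inertial-ellipsoid summary of a stone can be sketched with a principal-axis decomposition of its vertex cloud. This rests on an assumption: the ellipsoid axes are taken to be the eigenvectors of the vertices' covariance matrix, with semi-axis lengths proportional to the square roots of the eigenvalues. The proportionality constant depends on the convention used, so it is left as the `scale` parameter. `inertial_ellipsoid` is a hypothetical name.

```python
import numpy as np

def inertial_ellipsoid(points: np.ndarray, scale: float = 1.0):
    """Summarize an (N, 3) point cloud by centre, principal axes and semi-axis lengths.

    Axes (columns of the returned matrix) are sorted from longest to shortest;
    `scale` multiplies sqrt(eigenvalue) and is a convention-dependent constant.
    """
    centre = points.mean(axis=0)
    centred = points - centre
    cov = centred.T @ centred / len(points)  # population covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = eigvals.argsort()[::-1]          # longest axis first
    semi_axes = scale * np.sqrt(eigvals[order])
    return centre, eigvecs[:, order], semi_axes

pts = np.array([[3, 0, 0], [-3, 0, 0], [0, 2, 0], [0, -2, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
centre, axes, semi = inertial_ellipsoid(pts)
```

Shape descriptors derived from the result, such as the axis ratios `semi[1] / semi[0]` and `semi[2] / semi[0]`, are the kind of per-stone features that a Machine Learning stage could then process.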

### PapyRow: A Dataset of Row Images from Ancient Greek Papyri for Writers Identification

Papyrology is the discipline that studies texts written on ancient papyri. An important problem faced by papyrologists, and by paleographers in general, is identifying the writers, also known as scribes, who contributed to the drawing up of a manuscript. Traditionally, paleographers perform qualitative evaluations to distinguish the writers, and in recent years these techniques have been combined with computer-based tools that automatically measure quantities such as the height and width of letters, distances between characters, inclination angles, and the number and types of abbreviations. Recently emerged approaches in digital paleography combine powerful machine learning algorithms with high-quality digital images. Some of these approaches have been used for feature extraction, others to classify writers with machine learning algorithms or deep learning systems. However, traditional techniques require a preliminary feature engineering step that involves an expert in the field. For this reason, publishing a well-labeled dataset is always both a challenge and a stimulus for the academic world, since researchers can test their methods and compare their results from the same starting point. In this paper, we propose a new dataset of handwriting on papyri for the task of writer identification. This dataset is derived directly from the GRK-Papyri dataset, and the samples are obtained through several image enhancement operations. This paper presents not only the details of the dataset but also the resizing, rotation, background smoothing, and row segmentation operations used to overcome the difficulties posed by image degradation in this dataset. The dataset has been prepared and made freely available for non-commercial research, along with its confirmed ground-truth information for the task of writer identification.

Nicole Dalia Cilia, Claudio De Stefano, Francesco Fontanella, Isabelle Marthot-Santaniello, Alessandra Scotto di Freca

### Stone-by-Stone Segmentation for Monitoring Large Historical Monuments Using Deep Neural Networks

Monitoring and restoration of cultural heritage buildings require the definition of an accurate health record. A critical step is labeling all the constitutive elements of the building, and stone-by-stone segmentation is a major part of this. Traditionally, it is done by visual inspection and manual drawing on a 2D orthomosaic, which is an increasingly complex, time-consuming, and resource-intensive task. In this paper, algorithms to perform stone-by-stone segmentation automatically on large cultural heritage buildings are presented. Two advanced convolutional neural networks are tested and compared with conventional edge detection and thresholding methods on an image dataset from the Loire Valley’s châteaux: the Château de Chambord and the Château de Chaumont-sur-Loire, two castles of Renaissance style. The results show the applicability of the methods to historical buildings of the Renaissance style.

Koubouratou Idjaton, Xavier Desquesnes, Sylvie Treuillet, Xavier Brunetaud

### A Convolutional Recurrent Neural Network for the Handwritten Text Recognition of Historical Greek Manuscripts

In this paper, a Convolutional Recurrent Neural Network architecture for offline handwriting recognition is proposed. Specifically, a Convolutional Neural Network is used as an encoder for the input, which is a text line image, while a Bidirectional Long Short-Term Memory (BLSTM) network followed by a fully connected neural network acts as the decoder that predicts a sequence of characters. This work was motivated by the need to transcribe historical Greek manuscripts, which entail several challenges that have been extensively analysed. The proposed architecture has been tested on standard datasets, namely IAM and RIMES, as well as on a newly created dataset, EPARCHOS, which contains historical Greek manuscripts and has been made publicly available for research purposes. Our experimental work relies on a detailed ablation study, which shows that the proposed architecture outperforms state-of-the-art approaches.

K. Markou, L. Tsochatzidis, K. Zagoris, A. Papazoglou, X. Karagiannis, S. Symeonidis, I. Pratikakis

### MCCNet: Multi-Color Cascade Network with Weight Transfer for Single Image Depth Prediction on Outdoor Relief Images

Single image depth prediction is considerably difficult, since depth cannot be estimated from pixel correspondences. Thus, prior knowledge, such as registered pixel and depth information from the user, is required. Another problem arises when targeting a specific domain, as the number of freely available training datasets is limited. Because of color problems in relief images, we present a new outdoor Registered Relief Depth (RRD) Prambanan dataset, consisting of outdoor images of Prambanan temple reliefs with registered depth information supervised by archaeologists and computer scientists. To solve the problem, we also propose a new depth predictor, called Multi-Color Cascade Network (MCCNet), with weight transfer. Applied to the new RRD Prambanan dataset, our method performs better than the baseline across different materials, with an RMSE of 2.53 mm. On the NYU Depth V2 dataset, our method performs better than the baselines and in line with other state-of-the-art works.

Aufaclav Zatu Kusuma Frisky, Andi Putranto, Sebastian Zambanini, Robert Sablatnig
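The 2.53 mm figure for MCCNet is a root-mean-square error between predicted and reference depth. Below is a minimal sketch of that metric. It assumes that both depth maps are in millimetres and that invalid pixels (NaN in the reference) are excluded. Excluding them is a common convention rather than a detail stated in the abstract, and `depth_rmse` is a hypothetical name.

```python
import numpy as np

def depth_rmse(pred_mm: np.ndarray, ref_mm: np.ndarray) -> float:
    """RMSE in millimetres over pixels with a valid (non-NaN) reference depth."""
    valid = ~np.isnan(ref_mm)
    diff = pred_mm[valid] - ref_mm[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

pred = np.array([[10.0, 12.0], [8.0, 5.0]])
ref = np.array([[11.0, 10.0], [8.0, np.nan]])  # last pixel has no reference depth
```

Here the three valid errors are -1, 2, and 0 mm, giving an RMSE of √(5/3) ≈ 1.29 mm.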

### Simultaneous Detection of Regular Patterns in Ancient Manuscripts Using GAN-Based Deep Unsupervised Segmentation

Document Information Retrieval has attracted researchers’ attention as a means of uncovering the secrets behind ancient manuscripts. To understand such documents, analyzing their layouts and segmenting their relevant features are fundamental tasks. Recent efforts have focused on unsupervised document segmentation, and its importance for ancient manuscripts has provided a unique opportunity to study this problem. This paper proposes a novel collaborative deep learning architecture that works in an unsupervised mode and can generate synthetic data to avoid uncertainties regarding document degradation. Moreover, this approach uses the generated distribution to assign labels associated with superpixels. The model trained without supervision is used to segment the page, ornaments, and characters simultaneously. Promising accuracies were obtained in the segmentation task. Experiments with data from degraded documents show that the proposed method can synthesize noise-free documents and enhance associations better than state-of-the-art methods. We also investigate the use of the generated samples overall and their effectiveness on different tasks involving unlabelled historical documents.

### A Two-Stage Unsupervised Deep Learning Framework for Degradation Removal in Ancient Documents

Processing historical documents is a complicated task in computer vision due to the presence of degradation, which decreases the performance of Machine Learning models. Recently, Deep Learning (DL) models have achieved state-of-the-art results in processing historical documents. However, this performance does not match the results obtained in other computer vision tasks, because such models require large datasets to perform well. In the case of historical documents, only small datasets are available, making it hard for DL models to capture the degradation. In this paper, we propose a framework that overcomes these issues with a two-stage approach. Stage-I is devoted to data augmentation: a Generative Adversarial Network (GAN), trained on degraded documents, synthesizes new training document images. In stage-II, the document images generated in stage-I are improved using an inverse problem model with a deep neural network structure. Our approach enhances the quality of the generated document images and removes degradation. Our results show that the proposed framework is well suited to binarization tasks. Our model was trained on the 2014 and 2016 DIBCO datasets and tested on the 2018 DIBCO dataset. The results obtained are promising and competitive with the state of the art.

Milad Omrani Tamrin, Mohammed El-Amine Ech-Cherif, Mohamed Cheriet

### Recommender System for Digital Storytelling: A Novel Approach to Enhance Cultural Heritage

Italy is characterized by a significant artistic and cultural heritage. In this scenario, it is both possible and necessary to implement strategies for enhancing cultural heritage. This vision has become increasingly feasible due to the introduction of systems based on new technologies, particularly those based on intensive use of Information and Communication Technologies (ICT). This approach has allowed the development of applications that build adaptive cultural paths usable by different types of users. In addition to the technological dimension, a fundamental role is played by methodologies that convey information content innovatively and effectively. The objective of this paper is to propose a Recommender approach aimed at enhancing Italian Cultural Heritage. In this way, everyone could have the opportunity to discover, or deepen their knowledge of, the cultural sites that best suit them, presented through the narrative technique of Digital Storytelling. The proposed Recommender methodology has been tested through an experimental campaign, with promising results.

Mario Casillo, Dajana Conte, Marco Lombardi, Domenico Santaniello, Carmine Valentino

### A Contextual Approach for Coastal Tourism and Cultural Heritage Enhancing

In the panorama of Italian coastal tourism, there are many unique and unexplored places. These places, which suffer from a lack of government investment, need to be promoted through low-consumption, widely used systems: distributed applications. The present work aims to develop innovative solutions that support citizens and tourists by offering advanced, highly customizable services that enable, through new technologies, a more engaging, stimulating, and attractive use of information than current forms. The developed system is based on graph-based formalisms such as the Context Dimension Tree and Bayesian Networks, which represent the context through its main components and react to it by anticipating users’ needs. Through the development of a mobile app, a case study in the area of the Amalfi Coast (Italy) was analyzed. Finally, an experimental campaign was conducted, with promising results.

Fabio Clarizia, Massimo De Santo, Marco Lombardi, Rosalba Mosca, Domenico Santaniello

### A Comparison of Character-Based Neural Machine Translations Techniques Applied to Spelling Normalization

The lack of spelling conventions and the natural evolution of human language create a linguistic barrier inherent in historical documents. This barrier has always been a concern for scholars in the humanities. To tackle this problem, spelling normalization aims to adapt a document’s orthography to modern standards. In this work, we evaluate several character-based neural machine translation approaches to normalization, using modern documents to enrich the neural models. We evaluated these approaches on several datasets from different languages and time periods and concluded that each approach is better suited to a different set of documents.

Miguel Domingo, Francisco Casacuberta

### Weakly Supervised Bounding Box Extraction for Unlabeled Data in Table Detection

The organization and presentation of data in tabular format became an essential strategy of scientific communication and remains fundamental to the transmission of knowledge today. The use of automated detection to identify typographical elements such as tables and diagrams in digitized historical print offers a promising approach for future research. Most table detection work uses existing off-the-shelf methods for the detection algorithm. However, the datasets used for evaluation are not challenging enough, owing to their limited size and diversity. To enable a better comparison between proposed methods, we introduce the NAS dataset of historical digitized images in this paper. Tables in historic scientific documents vary widely in their characteristics. They also appear alongside visually similar items, such as maps, diagrams, and illustrations. We address these challenges with a multi-phase procedure, outlined in this article and evaluated on two datasets, ECCO ( https://www.gale.com/primary-sources/eighteenth-century-collections-online ) and NAS ( https://beta.synchromedia.ca/vok-visibility-of-knowledge ). In our approach, we used the Gabor filter [1] to prepare our dataset for algorithmic detection with Faster-RCNN [2]. This method detects tables against all categories of visual information. Because labeled data are limited, particularly for object detection, we developed a new method, weakly supervised bounding box extraction, to extract bounding boxes for our training set automatically. A pseudo-labeling technique is then used to create a more general model, via a three-step process of bounding box extraction and labeling.

Arash Samari, Andrew Piper, Alison Hedley, Mohamed Cheriet

### Underground Archaeology: Photogrammetry and Terrestrial Laser Scanning of the Hypogeum of Crispia Salvia (Marsala, Italy)

The convergence of safety, lighting, and physical accessibility issues with problems of archaeological conservation makes underground contexts particularly difficult to study, preserve, and make accessible to the public. The Hypogeum of Crispia Salvia at Marsala (Italy) is a particularly apt case study: the frescoed burial site, unique in all of Sicily, now lies beneath an apartment complex and can only be visited through scheduled tours. In partnership with the local archaeological authorities, the authors harnessed machine learning to create a digital model of this underground burial space using terrestrial laser scanning and digital photogrammetry. This is part of a larger ongoing effort to re-document important subterranean heritage sites of Sicily and make them accessible to both researchers and the public, a goal that is increasingly important at a time when even local mobility is limited by a global pandemic.

Davide Tanasi, Stephan Hassam, Kaitlyn Kingsland

### Automatic MEP Component Detection with Deep Learning

Scan-to-BIM systems convert image and point cloud data into accurate 3D models of buildings. Research on Scan-to-BIM has largely focused on the automated identification of structural components. However, design and maintenance projects require information on a range of other assets including mechanical, electrical, and plumbing (MEP) components. This paper presents a deep learning solution that locates and labels MEP components in 360° images and phone images, specifically sockets, switches and radiators. The classification and location data generated by this solution could add useful context to BIM models. The system developed for this project uses transfer learning to retrain a Faster Region-based Convolutional Neural Network (Faster R-CNN) for the MEP use case. The performance of the neural network across image formats is investigated. A dataset of 249 360° images and 326 phone images was built to train the deep learning model. The Faster R-CNN achieved high precision and comparatively low recall across all image formats.

John Kufuor, Dibya D. Mohanty, Enrique Valero, Frédéric Bosché

### Mixed Reality-Based Dataset Generation for Learning-Based Scan-to-BIM

Generating as-is 3D models is an active research topic for various construction management applications. The industry has depended on either manual or semi-automated workflows for the Scan-to-BIM process, which are laborious and time-consuming. Machine learning has recently opened avenues for recognizing geometric elements in point clouds, but it has seen limited use because labeled datasets are insufficient. This study aims to set up a semi-automated workflow for creating labeled datasets that can be used to train ML algorithms for element identification. The study proposes an interactive user interface built with a gaming engine in a mixed reality environment. A workflow for fusing as-is spatial information with AR/VR-based information is presented in Unity 3D. A user-friendly UI is then developed and integrated with the VR environment to help the user choose the category of each component by visualization. This results in an accurate as-is 3D model without much computation or time. The intention is to provide a smooth workflow for generating datasets for learning-based methodologies in a streamlined Scan-to-BIM process. However, this process requires user domain knowledge and input. The dataset can be continuously expanded and improved to obtain automated results later.

Parth Bhadaniya, Varun Kumar Reja, Koshy Varghese

### An Augmented Reality-Based Remote Collaboration Platform for Worker Assistance

Remote working and collaboration help workplaces become more flexible and productive. The significance of remote working was also highlighted during the COVID-19 pandemic, when mobility restrictions were enforced. This paper presents the development of an augmented reality platform that assists workers in remote collaboration and training. The platform consists of two communicating apps, one for a remote supervisor (located, e.g., at home) and one for an on-site worker, and uses intuitive digital annotations that enrich the physical environment of the workplace, thereby facilitating the execution of on-site tasks. The proposed platform was evaluated in user trials that assessed its performance, worker satisfaction, and task completion time, demonstrating its usefulness.

Georgios Chantziaras, Andreas Triantafyllidis, Aristotelis Papaprodromou, Ioannis Chatzikonstantinou, Dimitrios Giakoumis, Athanasios Tsakiris, Konstantinos Votis, Dimitrios Tzovaras

### Demand Flexibility Estimation Based on Habitual Behaviour and Motif Detection

Demand for energy keeps rising, and as the share of power supplied from renewable energy sources (RES) grows, exacerbating the problem of load balancing, the need for smart grid management is becoming more urgent. One such management technique is demand response (DR), which allows operators to distribute energy better by reducing or shifting electricity usage, thereby improving overall grid performance while rewarding consumers, who play one of the most significant roles in DR. For DR to operate properly, it is essential to know the demand flexibility of each consumer. This paper provides a new approach to determining residential demand flexibility by identifying the daily habitual behaviour of each individual house and by detecting flexibility motifs in aggregate residential electricity consumption. The proposed method combines supervised and unsupervised machine learning, which gives it the ability to adapt to new environments. The method was tested on various datasets and was also applied experimentally in real home installations. Tests were performed both on historical data and under near-real-time conditions, with the ability to partially predict flexibility.

George Pavlidis, Apostolos C. Tsolakis, Dimosthenis Ioannidis, Dimitrios Tzovaras

### Road Tracking in Semi-structured Environments Using Spatial Distribution of Lidar Data

Future civilian and professional autonomous vehicles entering the market should perceive and interpret the road in a manner similar to human drivers. In structured urban environments, where signs, road lanes and markers are well defined and ordered, landmark-based road tracking and localisation has progressed significantly during the last decade, with many autonomous vehicles making their debut on the market. However, in semi-structured and rural environments where traffic infrastructure is deficient, autonomous driving faces significant challenges. This paper presents a Lidar-based road boundary detection method suitable for service robot operation in rural and semi-structured environments. Organised Lidar data undergo spatial distribution processing to isolate road limits within a forward-looking horizon ahead of the robot. Stereo SLAM is performed to register subsequent road limits, and RANSAC is applied to identify edges that correspond to road segments. In addition, the robot's traversable path is estimated and progressively merged with Bézier curves to create a smooth trajectory that respects vehicle kinematics. Experiments were conducted on data collected by our robot in a semi-structured urban environment, and the method was also evaluated on the KITTI dataset, where it exhibited remarkable performance.
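The abstract mentions smoothing the traversable path with Bézier curves but does not say which degree or how control points are chosen. The sketch below only shows the generic building block: evaluating a Bézier curve from given control points with de Casteljau's algorithm (repeated linear interpolation). The function names and the example control points are hypothetical.

```python
def bezier_point(control_points, t):
    """Evaluate a Bézier curve of any degree at t in [0, 1] with de Casteljau's algorithm."""
    if not control_points:
        raise ValueError("need at least one control point")
    pts = [tuple(float(c) for c in p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until a single point remains.
    while len(pts) > 1:
        pts = [tuple((1.0 - t) * a + t * b for a, b in zip(p, q)) for p, q in zip(pts, pts[1:])]
    return pts[0]


def sample_bezier(control_points, n):
    """Return n curve points evenly spaced in the parameter t, endpoints included."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return [bezier_point(control_points, i / (n - 1)) for i in range(n)]


# Hypothetical cubic segment between two path waypoints (x, y in metres).
segment = [(0, 0), (1, 2), (3, 2), (4, 0)]
trajectory = sample_bezier(segment, 5)
```

A Bézier curve passes through its first and last control points and leaves them tangent to the first and last control-polygon edges, so consecutive segments can be joined without a heading jump by placing the control points on either side of a junction on one line.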

Kosmas Tsiakas, Ioannis Kostavelis, Dimitrios Giakoumis, Dimitrios Tzovaras

### Image Segmentation of Bricks in Masonry Wall Using a Fusion of Machine Learning Algorithms

Autonomous mortar raking requires a computer vision system that can provide accurate segmentation masks of close-range images of brick walls. The goal is to detect and ultimately remove the mortar while leaving the bricks intact, thus automating this construction task. This paper proposes such a vision system based on a combination of machine learning algorithms. The proposed system fuses the individual segmentation outputs of eight classifiers through a weighted voting scheme and then applies a threshold to generate the final binary segmentation. A novel feature of this approach is the fusion of several segmentations using a low-cost commercial off-the-shelf hardware setup. The system's close-range brick wall segmentation capabilities are demonstrated on a total of about 9 million data points.
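The fusion step described above (weighted voting over several classifiers' segmentations, then a threshold) can be sketched as follows. This is a minimal illustration assuming each classifier outputs a binary mask of the same size; the classifier weights and the threshold used in the paper are not given in the abstract, so the function name, weights and default threshold here are hypothetical.

```python
def fuse_by_weighted_vote(masks, weights, threshold=0.5):
    """Fuse per-classifier binary masks (lists of rows of 0/1) into one binary mask.

    Each pixel's score is the weight-normalised sum of the votes it received;
    pixels whose score reaches `threshold` are labelled 1 (foreground), others 0.
    """
    if not masks or len(masks) != len(weights):
        raise ValueError("need one weight per mask")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    height, width = len(masks[0]), len(masks[0][0])
    fused = []
    for i in range(height):
        row = []
        for j in range(width):
            score = sum(w * m[i][j] for m, w in zip(masks, weights)) / total
            row.append(1 if score >= threshold else 0)
        fused.append(row)
    return fused


# Three hypothetical classifiers voting on a 1 x 3 strip of pixels.
masks = [[[1, 0, 0]], [[1, 1, 0]], [[0, 1, 1]]]
fused = fuse_by_weighted_vote(masks, [3, 2, 1])  # scores 5/6, 3/6, 1/6
```

Raising the threshold trades recall for precision: with `threshold=0.6` the middle pixel (score 0.5) would no longer be labelled foreground.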

Roland Kajatin, Lazaros Nalpantidis

### Sentinel-2 and SPOT-7 Images in Machine Learning Frameworks for Super-Resolution

Monitoring construction sites from space using high-resolution (HR) imagery enables remote tracking instead of physically traveling to a site. This saves valuable resources while making it feasible to record construction site progress at any time and anywhere in the world. In the present work, Sentinel-2 (S2) images at 10 m resolution are spatially super-resolved by a factor of 4 using deep learning. First, the very-deep super-resolution (VDSR) network is trained with matching pairs of S2 and SPOT-7 images at a target resolution of 2.5 m. The trained VDSR network, named SPOT7-VDSR, can then increase the resolution of S2 images that it has never seen. The original VDSR technique and bicubic interpolation are additionally applied to increase the resolution of S2 images. Numerical and visual comparisons are carried out on the area of interest of Karditsa, Greece. The present study of super-resolving S2 images is novel in the literature and can prove very useful in applications where only S2 images are available and the corresponding higher-resolution SPOT-7 images are not. In the super-resolution (SR) experiments, the proposed SPOT7-VDSR network outperforms the VDSR network by up to 8.24 dB in peak signal-to-noise ratio (PSNR) and bicubic interpolation by up to 16.9% in structural similarity index (SSIM).
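Since the results above are reported in PSNR, the following sketch shows the standard definition, PSNR = 10 log10(MAX² / MSE), computed on plain nested lists. The default `max_value=255` assumes 8-bit images; the abstract does not state how the S2 and SPOT-7 data were scaled, so the peak value used in the paper may differ.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized grayscale images (lists of rows)."""
    if len(reference) != len(estimate) or any(len(a) != len(b) for a, b in zip(reference, estimate)):
        raise ValueError("images must have the same shape")
    squared_errors = [(r - e) ** 2 for row_r, row_e in zip(reference, estimate) for r, e in zip(row_r, row_e)]
    if not squared_errors:
        raise ValueError("images must not be empty")
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gap of 8.24 dB between two methods, measured against the same reference with the same peak value, corresponds to an MSE ratio of 10^(8.24/10), roughly 6.7.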

Antigoni Panagiotopoulou, Lazaros Grammatikopoulos, Georgia Kalousi, Eleni Charou

### Salient Object Detection with Pretrained Deeplab and k-Means: Application to UAV-Captured Building Imagery

We present a simple technique that can convert a pretrained segmentation neural network into a salient object detector. We show that the pretrained network can be agnostic to the semantic class of the object of interest, and no further training is required. Experiments were run on UAV-captured aerial imagery of the “smart home” structure located on the premises of the CERTH research center. Further experiments were also run on natural scenes. Our tests validate the usefulness of the proposed technique.
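The abstract pairs a pretrained Deeplab network with k-means but does not say what is clustered or how the salient cluster is chosen. One plausible reading, stated here purely as an assumption, is that per-pixel feature vectors from the network are grouped into a few clusters. The sketch below shows only plain Lloyd's k-means on such vectors; the toy 2-D points stand in for hypothetical network features, and the cluster-to-saliency decision is deliberately left out.

```python
def kmeans(points, k, max_iters=100):
    """Lloyd's k-means on equal-length numeric vectors.

    Initial centroids are the first k distinct points (deterministic but naive).
    Returns (labels, centroids).
    """
    points = [tuple(float(v) for v in p) for p in points]
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    def nearest(p):
        # Index of the centroid with the smallest squared Euclidean distance to p.
        return min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

    labels = None
    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [nearest(p) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster where it was
        converged = new_labels == labels and new_centroids == centroids
        labels, centroids = new_labels, new_centroids
        if converged:
            break
    return labels, centroids


# Toy stand-ins for per-pixel feature vectors (hypothetical).
labels, centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

Lloyd's algorithm only finds a local optimum that depends on the initial centroids; library implementations typically use smarter seeding such as k-means++ and several restarts.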

Victor Megir, Giorgos Sfikas, Athanasios Mekras, Christophoros Nikou, Dimosthenis Ioannidis, Dimitrios Tzovaras

### Clutter Slices Approach for Identification-on-the-Fly of Indoor Spaces

Construction spaces are constantly evolving, dynamic environments that need continuous surveying, inspection, and assessment. Traditional manual inspection of such spaces is arduous and time-consuming. Automation using robotic agents can be an effective solution: robots with perception capabilities can autonomously classify and survey indoor construction spaces. In this paper, we present a novel identification-on-the-fly approach for coarse classification of indoor spaces using the unique signature of clutter. Using the context granted by clutter, we recognize common indoor spaces such as corridors, staircases, shared spaces, and restrooms. The proposed clutter slices pipeline achieves a maximum accuracy of 93.6% on the presented clutter slices dataset. This sensor-independent approach can be generalized to various domains to help intelligent autonomous agents better perceive their environment.

Upinder Kaur, Praveen Abbaraju, Harrison McCarty, Richard M. Voyles

### Remembering Both the Machine and the Crowd When Sampling Points: Active Learning for Semantic Segmentation of ALS Point Clouds

Supervised machine learning systems such as Convolutional Neural Networks (CNNs) are known for their great need for labeled data. However, for geospatial data, and especially for Airborne Laser Scanning (ALS) point clouds, labeled data are rather scarce, hindering the application of such systems. We therefore rely on Active Learning (AL) to significantly reduce the number of necessary labels, and we aim to gain a deeper understanding of how AL works for ALS point clouds. Since the key element of AL is sampling the most informative points, we compare different basic sampling strategies and try to improve them further for geospatial data. While AL reduces the total labeling effort, the basic issue remains that experts must perform this labor- and therefore cost-intensive task. We therefore propose outsourcing data annotation to the crowd. However, when employing crowdworkers, labeling errors are inevitable. As a remedy, we aim to select points that are easier to interpret, and we evaluate the robustness of AL to labeling errors. Applying these strategies to different classifiers, we estimate realistic segmentation results from crowdsourced data alone, which differ in Overall Accuracy by only about 3 percentage points from results based on the completely labeled dataset, as demonstrated for two different scenes.
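The abstract compares "basic sampling strategies" without naming them. Entropy-based uncertainty sampling is one widely used basic strategy and is sketched below as an illustration, not as the authors' exact method: given the current classifier's class probabilities for each unlabeled point, it picks the points whose predictions are most uncertain. The function names and the toy probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(point_probs, budget):
    """Return indices of the `budget` points with the highest predictive entropy."""
    ranked = sorted(range(len(point_probs)), key=lambda i: entropy(point_probs[i]), reverse=True)
    return ranked[:budget]


# Hypothetical per-point class probabilities from the current classifier.
probs = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
to_label = select_most_uncertain(probs, budget=2)
```

In a pool-based AL loop, the selected points would be sent for annotation (here, to the crowd), added to the training set, and the classifier retrained before the next selection round.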

Michael Kölle, Volker Walter, Stefan Schmohl, Uwe Soergel

### Towards Urban Tree Recognition in Airborne Point Clouds with Deep 3D Single-Shot Detectors

Automatic mapping of individual urban trees is increasingly important to city administration and planning. Although deep learning algorithms are now standard methodology in computer vision, their adaptation to individual tree detection in urban areas has hardly been investigated so far. In this work, we propose a deep single-shot object detection network to find urban trees in point clouds from airborne laser scanning. The network consists of a sparse 3D convolutional backbone for feature extraction and a subsequent single-shot region proposal network for the actual detection. It takes as input raw 3D voxel clouds, discretized from the point cloud during preprocessing. Outputs are cylindrical tree objects paired with their detection scores. We train and evaluate the network on the ISPRS Vaihingen 3D Benchmark dataset with custom tree object labels. The general feasibility of our approach is demonstrated: it achieves promising results compared to a traditional 2D baseline using watershed segmentation. We also conduct comparisons with state-of-the-art machine learning methods for semantic point segmentation.
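The preprocessing step described above discretizes the point cloud into voxels before it reaches the sparse 3D backbone. The abstract does not give the voxel size or the per-voxel features, so the sketch below is only a minimal illustration under those unknowns: it maps points to integer voxel indices relative to the cloud's minimum corner and stores a point count per occupied voxel in a dictionary, a sparse layout that skips empty space. The function name and values are hypothetical.

```python
import math


def voxelize(points, voxel_size):
    """Map 3D points (x, y, z) to integer voxel indices; return {(i, j, k): point count}."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    if not points:
        return {}
    # Use the minimum corner of the cloud as the grid origin so indices start at 0.
    origin = tuple(min(p[d] for p in points) for d in range(3))
    grid = {}
    for p in points:
        idx = tuple(math.floor((p[d] - origin[d]) / voxel_size) for d in range(3))
        grid[idx] = grid.get(idx, 0) + 1  # only occupied voxels are stored
    return grid


# Four hypothetical ALS points (metres) on a 1 m grid.
grid = voxelize([(0, 0, 0), (0.4, 0.4, 0.4), (1.2, 0, 0), (0, 0, 2.5)], 1.0)
```

Real pipelines usually store richer per-voxel features (for example mean height or intensity) rather than raw counts; counts are used here only to keep the example small.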

Stefan Schmohl, Michael Kölle, Rudolf Frolow, Uwe Soergel

### Shared-Space Autoencoders with Randomized Skip Connections for Building Footprint Detection with Missing Views

Recently, a vast amount of satellite data has become available, going beyond standard optical (EO) data to other forms such as synthetic aperture radar (SAR). While more robust, SAR data are often more difficult to interpret, can be of lower resolution, and require intense pre-processing compared to EO data. On the other hand, while more interpretable, EO data often fail under unfavourable lighting, weather, or cloud-cover conditions. To leverage the advantages of both domains, we present a novel autoencoder-based architecture that is able to both (i) fuse multi-spectral optical and radar data in a common shared space, and (ii) perform image segmentation for building footprint detection under the assumption that one of the data modalities is missing, a situation often encountered in real-world settings. To this end, a novel randomized skip-connection architecture that utilizes autoencoder weight-sharing is designed. We compare the proposed method to baseline approaches relying on network fine-tuning and to established architectures such as UNet. Qualitative and quantitative results show the merits of the proposed method, which outperforms all compared techniques on the task at hand.

Giannis Ashiotis, James Oldfield, Charalambos Chrysostomou, Theodoros Christoudias, Mihalis A. Nicolaou

### Assessment of CNN-Based Methods for Poverty Estimation from Satellite Images

One of the major issues in predicting poverty from satellite images is the lack of fine-grained and reliable poverty indicators. Various methodologies have recently been proposed to address this problem. Most recent approaches use a proxy (e.g., nighttime light) as additional information to mitigate the problem of sparse data. They build and train a CNN on a large set of images, which is then used as a feature extractor. Finally, pairs of extracted feature vectors and poverty labels are used to learn a regression model that predicts the poverty indicators. First, we propose a rigorous comparative study of such approaches based on a unified framework and a common set of images. We observed that the geographic displacement of the spatial coordinates of poverty observations degrades the prediction performance of all the methods. We therefore present a new methodology, combining grid-cell selection and ensembling, that improves poverty prediction under coordinate displacement.

Robin Jarry, Marc Chaumont, Laure Berti-Équille, Gérard Subsol

### Using a Binary Diffractive Optical Element to Increase the Imaging System Depth of Field in UAV Remote Sensing Tasks

Using a real-world dataset, we show that the accuracy of an image detector based on a YOLOv3 neural network does not deteriorate when only one non-blurred color channel is used. A binary diffractive optical element was calculated that increases the imaging system's depth of field several times. This is achieved by using different color channels for different defocus values. A comparison of the MTF curves of the original and apodized imaging systems for a given minimum acceptable image contrast is presented. This approach allows us to create novel remote sensing imaging systems with increased depth of field.

Pavel G. Serafimovich, Alexey P. Dzyuba, Artem V. Nikonorov, Nikolay L. Kazanskiy

### Self-supervised Pre-training Enhances Change Detection in Sentinel-2 Imagery

While annotated images for change detection using satellite imagery are scarce and costly to obtain, a wealth of unlabeled images is generated every day. To leverage these data to learn an image representation better suited to change detection, we explore methods that exploit the temporal consistency of Sentinel-2 time series to obtain a usable self-supervised learning signal. For this, we build and make publicly available (https://zenodo.org/record/4280482) the Sentinel-2 Multitemporal Cities Pairs (S2MTCP) dataset, containing multitemporal image pairs from 1520 urban areas worldwide. We test multiple self-supervised learning methods for pre-training change detection models and apply them to a public change detection dataset of Sentinel-2 image pairs (OSCD).

Marrit Leenstra, Diego Marcos, Francesca Bovolo, Devis Tuia

### Early and Late Fusion of Multiple Modalities in Sentinel Imagery and Social Media Retrieval

Discovering potential concepts and events by analyzing Earth Observation (EO) data may be supported by fusing other distributed data sources such as non-EO data, for instance, in-situ citizen observations from social media. The retrieval of relevant information based on a target query or event is critical for operational purposes, for example, monitoring flood events in urban areas or monitoring crops in food security scenarios. To that end, we propose early-fusion (low-level features) and late-fusion (high-level concepts) mechanisms that combine the results of two EU-funded projects for information retrieval from Sentinel imagery and social media data sources. In the early fusion part, the model is based on active learning that effectively merges Sentinel-1 and Sentinel-2 bands and assists users in extracting patterns. The late fusion mechanism, in turn, exploits the context of other geo-referenced data, such as social media retrieval results, to further enrich the list of retrieved Sentinel image patches. Quantitative and qualitative results show the effectiveness of our proposed approach.

Wei Yao, Anastasia Moumtzidou, Corneliu Octavian Dumitru, Stelios Andreadis, Ilias Gialampoukidis, Stefanos Vrochidis, Mihai Datcu, Ioannis Kompatsiaris

### SURVANT: An Innovative Semantics-Based Surveillance Video Archives Investigation Assistant

SURVANT is an innovative video archive investigation system that aims to drastically reduce the time required to examine large amounts of video content. It can seamlessly collect the videos relevant to a specific case from heterogeneous repositories. SURVANT employs Deep Learning technologies to extract inter/intra-camera video analytics, including object recognition, inter/intra-camera tracking, and activity detection. The identified entities are semantically indexed, enabling search and retrieval by visual characteristics. Semantic reasoning and inference mechanisms based on visual concepts and spatio-temporal metadata allow users to identify hidden correlations and discard outliers. SURVANT offers the user a unified GIS-based search interface to unearth the required information using natural language query expressions and a plethora of filtering options. An intuitive interface with a gentle learning curve helps the user create specific queries and receive accurate results using advanced visual analytics tools. GDPR-compliant management of personal data collected from surveillance videos is integrated into the system design.

Giuseppe Vella, Anastasios Dimou, David Gutierrez-Perez, Daniele Toti, Tommaso Nicoletti, Ernesto La Mattina, Francesco Grassi, Andrea Ciapetti, Michael McElligott, Nauman Shahid, Petros Daras

### Automatic Fake News Detection with Pre-trained Transformer Models

The automatic detection of disinformation and misinformation has gained attention in recent years, since fake news has a critical impact on democracy, society, journalism, and digital literacy. In this paper, we present a binary content-based classification approach for detecting fake news automatically with several recently published pre-trained language models based on the Transformer architecture. The experiments were conducted on the FakeNewsNet dataset with XLNet, BERT, RoBERTa, DistilBERT, and ALBERT and various combinations of hyperparameters. Different preprocessing steps were carried out, using only the body text, only the titles, or a concatenation of both. We conclude that Transformers are a promising approach to fake news detection, since they achieve notable results even without a large dataset. Our main contribution is the enhancement of fake news detection accuracy through different models and parametrizations, with a reproducible examination of the results of the conducted experiments. The evaluation shows that even short texts suffice to attain 85% accuracy on the test set. Using the body text or the concatenation of both reaches up to 87% accuracy. Lastly, we show that various preprocessing steps, such as removing outliers, do not have a significant impact on the models' prediction output.

Mina Schütz, Alexander Schindler, Melanie Siegel, Kawa Nazemi

### A Serverless Architecture for a Wearable Face Recognition Application

This article presents a face recognition application that takes its video input streamed from smart glasses over an internet connection. The application resides in the cloud and uses serverless technologies. The system was tested on a database of 1119 individuals, and the results show that the classical serverful architecture is 1.9 to 8.65 times slower than the serverless approach. The advantages of the application lie in its high adaptability, parallelisation and scalability.

Oliviu Matei, Rudolf Erdei, Alexandru Moga, Robert Heb

### RGB-D Railway Platform Monitoring and Scene Understanding for Enhanced Passenger Safety

Automated monitoring and analysis of passenger movement in safety-critical parts of transport infrastructures represent a relevant visual surveillance task. Recent breakthroughs in visual representation learning and spatial sensing opened up new possibilities for detecting and tracking humans and objects within a 3D spatial context. This paper proposes a flexible analysis scheme and a thorough evaluation of various processing pipelines to detect and track humans on a ground plane, calibrated automatically via stereo depth and pedestrian detection. We consider multiple combinations within a set of RGB- and depth-based detection and tracking modalities. We exploit the modular concepts of Meshroom [2] and demonstrate its use as a generic vision processing pipeline and scalable evaluation framework. Furthermore, we introduce a novel open RGB-D railway platform dataset with annotations to support research activities in automated RGB-D surveillance. We present quantitative results for multiple object detection and tracking for various algorithmic combinations on our dataset. Results indicate that the combined use of depth-based spatial information and learned representations yields substantially enhanced detection and tracking accuracies. As demonstrated, these enhancements are especially pronounced in adverse situations when occlusions and objects not captured by learned representations are present.

Marco Wallner, Daniel Steininger, Verena Widhalm, Matthias Schörghuber, Csaba Beleznai

### A Survey About the Cyberbullying Problem on Social Media by Using Machine Learning Approaches

The exponential growth of connected devices (e.g., laptops, smartphones or tablets) has radically changed the means of communication, making it faster and more impersonal through Online Social Networks and instant messaging apps. In this paper we discuss the cyberbullying problem, focusing on the analysis of state-of-the-art approaches, which can be classified into four different tasks (Binary Classification, Role Identification, Severity Score Computation and Incident Prediction). In particular, the first task aims to predict whether a particular action is aggressive based on the analysis of different features. In turn, the second and third tasks investigate the cyberbullying problem by identifying users’ roles in the exchanged messages or by assigning a severity score to a given user or session, respectively. Nevertheless, information heterogeneity, due to different multimedia contents (e.g., text, emojis, stickers or GIFs), and the use of datasets that are typically unlabeled or manually labelled create continuous challenges in addressing the cyberbullying problem.

Carlo Sansone, Giancarlo Sperlí