
2024 | Book

Document Analysis and Recognition – ICDAR 2024 Workshops

Athens, Greece, August 30–31, 2024, Proceedings, Part II


About this book

This two-volume set LNCS 14935-14936 constitutes the proceedings of International Workshops co-located with the 18th International Conference on Document Analysis and Recognition, ICDAR 2024, held in Athens, Greece, during August 30–31, 2024.
The 30 regular papers presented in these proceedings were carefully selected from 46 submissions.
Part I contains 16 regular papers that stem from the following workshops:

ICDAR 2024 Workshop on Automatically Domain-Adapted and Personalized Document Analysis (ADAPDA);

ICDAR 2024 Workshop on Advanced Analysis and Recognition of Parliamentary Corpora (ARPC);

ICDAR 2024 Workshop on coMics ANalysis, Processing and Understanding (MANPU).
Part II contains 14 regular papers that stem from the following workshops:

ICDAR 2024 Workshop on Computational Paleography (IWCP);

ICDAR 2024 Workshop on Machine Vision and NLP for Document Analysis (VINALDO).

Table of Contents

Frontmatter

IWCP

Frontmatter
An Interpretable Deep Learning Approach for Morphological Script Type Analysis
Abstract
Defining script types and establishing classification criteria for medieval handwriting is a central aspect of palaeographical analysis. However, existing typologies often encounter methodological challenges, such as descriptive limitations and subjective criteria. We propose an interpretable deep learning-based approach to morphological script type analysis, which enables systematic and objective analysis and contributes to bridging the gap between qualitative observations and quantitative measurements. More precisely, we adapt a deep instance segmentation method to learn comparable character prototypes, representative of letter morphology, and provide qualitative and quantitative tools for their comparison and analysis. We demonstrate our approach by applying it to the Textualis Formata script type and its two subtypes formalized by A. Derolez: Northern and Southern Textualis.
Malamatenia Vlachou-Efstathiou, Ioannis Siglidis, Dominique Stutzmann, Mathieu Aubry
Detecting and Deciphering Damaged Medieval Armenian Inscriptions Using YOLO and Vision Transformers
Abstract
This paper investigates the development and assessment of a methodology for the automatic detection and interpretation of damaged medieval Armenian inscriptions and graffiti. The research utilizes a newly compiled dataset of 150 images that include a variety of inscriptions, mosaics, and graffiti. These images are sourced from general archaeological site views and vary in quality and type, including drone and archival photos, to replicate real-world database challenges. The results highlight the efficiency of a two-step detection and classification pipeline. The detection phase employs a YOLO v8 model to identify the location and content of inscriptions, achieving an average Precision and Recall of 0.91 and 0.88, respectively. The classification phase uses a Vision Transformer (ViT) to identify similar characters, which outperforms classic CNN-based Siamese networks in handling such complexity and variation. This approach demonstrates potential for analyzing under-resourced and damaged corpora, thus facilitating the study of deteriorated inscriptions in a variety of contexts.
Chahan Vidal-Gorène, Aliénor Decours-Perez
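The detection Precision and Recall reported in the abstract above are typically computed by matching predicted boxes to ground-truth boxes via intersection-over-union (IoU). The paper's exact evaluation protocol is not given here; the following is a minimal sketch under assumed conventions (corner-format boxes, a 0.5 IoU threshold):

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_recall(pred, gold, thresh=0.5):
    # Greedily match each prediction to its best unmatched gold box.
    matched, tp = set(), 0
    for p in pred:
        best = max(range(len(gold)),
                   key=lambda i: iou(p, gold[i]), default=None)
        if best is not None and best not in matched and iou(p, gold[best]) >= thresh:
            matched.add(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Averaging these two values over the test images yields figures comparable to the 0.91 / 0.88 reported, assuming a similar matching rule.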
Optimizing HTR and Reading Order Strategies for Chinese Imperial Editions with Few-Shot Learning
Abstract
In this study, we tackle key challenges in layout analysis, reading order, and text recognition of historical Chinese texts. As part of the CHI-KNOW-PO Corpus project, which aims to digitize and publish an online edition of 60,000 xylographed documents, we have developed and released a specialized small dataset to address these common issues in HTR of historical documents in Chinese. Our approach combines a CNN-based instance segmentation model with a local algorithmic model for reading order, achieving a mean precision of 95.0% and a recall of 93.0% in region detection, and a 97.81% accuracy in reading order. Text recognition is conducted using a CRNN model enhanced with GAN-augmented data, effectively addressing few-shot learning challenges with an average accuracy of 98.45%, demonstrating the effectiveness of a small and targeted dataset over a large-scale approach. This research not only advances the digitization and analytical processing of Chinese historical documents but also sets a new benchmark for subsequent digital humanities efforts.
Marie Bizais-Lillig, Chahan Vidal-Gorène, Boris Dupin
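The abstract above does not spell out the local reading-order algorithm, but the underlying convention is well known: classical xylographed Chinese pages are read in right-to-left columns, top-to-bottom within each column. A simple geometric sort approximates this; the box format and the `col_tol` column-grouping tolerance below are illustrative assumptions, not the authors' implementation:

```python
def reading_order(regions, col_tol=20):
    """Order text regions for a page read in right-to-left columns,
    top-to-bottom within each column.
    regions: list of (x, y, w, h) bounding boxes."""
    # Group boxes into columns by x-position, rightmost column first.
    cols = []
    for box in sorted(regions, key=lambda b: -(b[0] + b[2])):
        for col in cols:
            if abs(col[0][0] - box[0]) <= col_tol:
                col.append(box)
                break
        else:
            cols.append([box])
    # Within each column, read top to bottom.
    return [b for col in cols for b in sorted(col, key=lambda b: b[1])]
```

A production system would additionally handle interlinear commentary and split columns, which is presumably where the learned segmentation model contributes.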
Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription
Abstract
Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest they are not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.
Jaydeep Borkar, David A. Smith
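The log-probability signal described above can be sketched without any model: given per-token probabilities emitted by an OCR decoder for each transcribed line, a line whose mean log-probability falls below a threshold is flagged for inspection. The threshold value and data layout here are illustrative assumptions, not the paper's settings:

```python
import math

def mean_logprob(token_probs):
    # Average log probability of the predicted transcription tokens.
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def flag_suspect_lines(lines, threshold=-1.0):
    """lines: dict mapping line id -> per-token probabilities.
    Returns ids whose mean log-prob falls below the threshold,
    i.e. lines likely containing lacunae or mistranscriptions."""
    return [lid for lid, probs in lines.items()
            if mean_logprob(probs) < threshold]
```

The appeal of this signal, as the abstract notes, is that it requires no inspection of the image itself, only the decoder's own confidence.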
NeuroPapyri: A Deep Attention Embedding Network for Handwritten Papyri Retrieval
Abstract
The intersection of computer vision and machine learning has emerged as a promising avenue for advancing historical research, facilitating a more profound exploration of our past. However, the application of machine learning approaches in historical palaeography is often met with criticism due to their perceived “black box” nature. In response to this challenge, we introduce NeuroPapyri, an innovative deep learning-based model specifically designed for the analysis of images containing ancient Greek papyri. To address concerns related to transparency and interpretability, the model incorporates an attention mechanism. This attention mechanism not only enhances the model’s performance but also provides a visual representation of the image regions that significantly contribute to the decision-making process. Specifically calibrated for processing images of papyrus documents with lines of handwritten text, the model utilizes individual attention maps to inform the presence or absence of specific characters in the input image. This paper presents the NeuroPapyri model, including its architecture and training methodology. Results from the evaluation demonstrate NeuroPapyri’s efficacy in document retrieval, showcasing its potential to advance the analysis of historical manuscripts.
Giuseppe De Gregorio, Simon Perrin, Rodrigo C. G. Pena, Isabelle Marthot-Santaniello, Harold Mouchère
MONSTERMASH: Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting
Abstract
Most current models for handwritten text recognition transcribe individual lines and thus depend on accurate line extraction from page images. This line extraction task is particularly challenging for Arabic-script manuscripts, which exhibit a high proportion of curved lines, word baselines that vary within the line, and varying line orientation on the page. We present a new corpus for studying Arabic-script line extraction in the presence of these phenomena and evaluate different model architectures using several pixel-level, object-level, and extrinsic recognition metrics. Training all models on the same data, we find that the CNN-based Kraken model slightly outperforms the transformer-based TESTR model on downstream character recognition accuracy and some object-level metrics, even though it lags behind on pixel-level metrics.
Danlu Chen, Jacob Murel, Taimoor Shahid, Xiang Zhang, Jonathan Parkes Allen, Taylor Berg-Kirkpatrick, David A. Smith
A New Framework for Error Analysis in Computational Paleographic Dating of Greek Papyri
Abstract
The study of Greek papyri from ancient Egypt is fundamental for understanding Graeco-Roman Antiquity, offering insights into various aspects of ancient culture and textual production. Palaeography, traditionally used for dating these manuscripts, relies on identifying chronologically relevant features in handwriting styles yet lacks a unified methodology, resulting in subjective interpretations and inconsistencies among experts. Recent advances in digital palaeography, which leverage artificial intelligence (AI) algorithms, have introduced new avenues for dating ancient documents. This paper presents a comparative analysis between an AI-based computational dating model and human expert palaeographers, using a novel dataset named Hell-Date comprising Greek papyri from the Hellenistic period with secure, fine-grained dates. The methodology involves training a convolutional neural network on visual inputs from Hell-Date to predict precise dates of papyri. In addition, experts provide palaeographic dating for comparison. To compare, we developed a new framework for error analysis that reflects the inherent imprecision of the palaeographic dating method. The results indicate that the computational model achieves performance comparable to that of human experts. These elements will help place the assessment of future computational algorithms for dating Greek papyri on a more solid basis.
Giuseppe De Gregorio, Lavinia Ferretti, Rodrigo C. G. Pena, Isabelle Marthot-Santaniello, Maria Konstantinidou, John Pavlopoulos
Automated Dating of Medieval Manuscripts with a New Dataset
Abstract
Automated manuscript dating is a long-awaited, valuable tool for scholars in their research of historical documents. This study presents a new dataset of medieval Hebrew manuscripts annotated with dates. Our initial experiments focus on documents written in the Ashkenazi square script, allowing us to refine our methodologies in a manageable setting before addressing more complex script types. Also, to accurately reflect the script’s historical evolution, we adopt a novel classification approach for time periods of varying lengths, which acknowledges the uneven development of the script over time. We perform extensive experimentation with a variety of deep-learning models and show that the regression approach is more appropriate for estimating the date of the manuscript compared to categorical classification.
Boraq Madi, Nour Atamni, Vasily Tsitrinovich, Daria Vasyutinsky-Shapira, Jihad El-Sana, Irina Rabaev
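The contrast the abstract draws between categorical classification over uneven periods and regression can be sketched as follows. The period boundaries and date range below are invented for illustration; they are not the dataset's actual periodization:

```python
# Hypothetical uneven period boundaries (years CE): earlier periods
# are longer because the script evolved slowly, later ones shorter.
PERIODS = [(1000, 1200), (1200, 1300), (1300, 1350), (1350, 1400)]

def period_class(year):
    # Categorical target: index of the variable-length period.
    for i, (lo, hi) in enumerate(PERIODS):
        if lo <= year < hi:
            return i
    raise ValueError("year outside covered range")

def regression_target(year, lo=1000, hi=1400):
    # Regression target: date normalized to [0, 1], so the model's
    # error is measured in years rather than in period boundaries.
    return (year - lo) / (hi - lo)
```

The regression formulation avoids the hard boundary problem: a manuscript from 1299 and one from 1301 are nearly identical targets under regression but fall in different classes under the categorical scheme.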
Image-to-Image Translation Approach for Page Layout Analysis and Artificial Generation of Historical Manuscripts
Abstract
Document layout analysis is essential in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR), especially for historical and low-resource scripts. This study explores a novel data augmentation technique using Generative Adversarial Networks (GANs) to generate realistic document layouts from semantic masks, enhancing layout analysis without increasing human annotation effort.
Our lightweight pipeline, tested on historical manuscripts (Latin, Arabic, Armenian, Hebrew), newspapers, and complex document layouts, shows that GAN-generated layouts are convincing and difficult to distinguish from real ones, even for paleographers. This method significantly boosts data augmentation, yielding a 3 percentage-point improvement in layout analysis metrics (precision, recall, mAP) and a 12-point increase in precision and recall for damaged documents. Additionally, masks with character information enhance image quality, boosting text recognition performance.
Chahan Vidal-Gorène, Jean-Baptiste Camps

VINALDO

Frontmatter
A Multimodal Framework for Structuring Legal Documents
Abstract
Document structuring plays a crucial role in various natural language processing (NLP) tasks, such as information retrieval and document understanding. It also helps readers effectively navigate a structured document with a large amount of textual data. In the legal domain, document structuring is particularly important for creating inter- and intra-document links. In this paper, we present a practical implementation of a multimodal workflow to structure legal documents across various formats. We create a format-agnostic representation of each document (PDF and HTML) that includes layout and textual information. We introduce a multimodal and sequential algorithm to detect titles in each document, and then establish hierarchical relationships among paragraphs using a deterministic algorithm. Our contribution extends to the publication of an open-source dataset, facilitating further exploration in this domain of study, which has received comparatively less attention.
Thibaud Real, Pauline Chavallard
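The deterministic paragraph-hierarchy step described above can be illustrated with a stack-based pass over detected titles. The block representation and level convention here are assumptions for the sketch, not the authors' implementation:

```python
def build_hierarchy(blocks):
    """blocks: list of (text, level) where level is None for body
    paragraphs and 1, 2, 3, ... for detected titles (1 = outermost).
    Returns a list of (text, parent_index) pairs; -1 means root."""
    out, stack = [], []  # stack holds (level, index) of open titles
    for i, (text, level) in enumerate(blocks):
        if level is None:
            # Paragraph: attach to the innermost open title.
            parent = stack[-1][1] if stack else -1
        else:
            # Title: close any titles at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            parent = stack[-1][1] if stack else -1
            stack.append((level, i))
        out.append((text, parent))
    return out
```

Once titles are detected by the multimodal model, a pass like this yields the inter-paragraph links deterministically, with no further learning involved.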
Reformulating Key-Information Extraction as Next Sentence Prediction for Hierarchical Data
Abstract
We present a reformulation of the Key-Information Extraction (KIE) problem from document images as a Next-Sentence Prediction (NSP) task for identifying information in hierarchically structured data. KIE implemented as a key-value extraction task is limited to one-to-one (single key mapping to single value) information extraction and thus does not apply to hierarchical information, e.g. information present in complex semi-structured or unstructured tables. The Visual Question Answering (VQA) approach tries to solve information extraction from such semi-structured formats, but uses visual information extraction backbone architectures along with heavy language models. In the proposed work, we use only a backbone language feature extractor for semantic entity extraction. Unlike the four entity types in FUNSD (‘question’, ‘answer’, ‘header’ and ‘other’), for semi-structured tabular information we define additional classes for hierarchical elements, such as column-header, table-footer, cell, merged-cell, and table-summary. For these additional entities, we define hierarchical relations, such as a tuple of entities {table-header entity, column-header entity, row-header entity} that points to a unique entity referred to as the value-entity. We treat the tuple-entity and value-entity as two sentences and formulate the task as estimating how likely the value-entity is to follow the tuple-entity. Empirically, we show that the proposed method, called Tuple-Value Identification (TVI), can exhaustively identify all the information in hierarchical structures. TVI also opens up potential uses in Table Structure Recognition (TSR) for scanned documents such as bank statements or medical bills, where narration columns span multiple lines and are challenging for existing TSR systems.
Ashish Kubade, Prathyusha Akundi, Bilal Arif Syed Mohd
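The tuple-to-value linking idea can be sketched as scoring every (tuple-entity, value-entity) pair and keeping the best-scoring value per tuple. A real system would use the NSP head of a language model for the score; the lexical-overlap stand-in below is purely a placeholder to make the sketch self-contained and runnable:

```python
def nsp_score(tuple_text, value_text):
    # Stand-in for a language model's next-sentence-prediction head;
    # here a crude Jaccard word overlap just to make the sketch run.
    a = set(tuple_text.lower().split())
    b = set(value_text.lower().split())
    return len(a & b) / max(len(a | b), 1)

def link_values(tuples, values):
    """For each hierarchical tuple-entity (e.g. a concatenation of
    table-header, column-header and row-header text), pick the
    candidate value-entity most likely to 'follow' it."""
    return {t: max(values, key=lambda v: nsp_score(t, v)) for t in tuples}
```

The pair-scoring structure is the point: every cell in a hierarchical table becomes a candidate "next sentence" for the tuple that addresses it, which is what lets the method cover one-to-many hierarchical layouts that plain key-value KIE cannot.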
HPSegNet: A Method for Handwritten and Printed Text Separation in Document Images
Abstract
The separation of handwritten and printed text in document images is an important task in the optical character recognition (OCR) research field. It remains challenging to separate overlapping handwritten and printed text lines in images of complex documents, including examination papers, legal documents, etc. In this paper, handwritten and printed text separation is formulated as a pixel-level document image segmentation task. Firstly, a modified Transformer-based model is designed for pixel-level document image segmentation. Secondly, a residual feature bypass is incorporated into the model to further exploit high-resolution features. Finally, a loss function combining focal loss and dice loss is designed to tackle the problem of imbalanced distributions of different classes. Experimental results on both a public English document image dataset and a self-built Chinese document image dataset have demonstrated the effectiveness of the proposed method.
Yu Chao, Changsong Liu, Liangrui Peng, Yanwei Wang
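The combined focal-and-dice objective mentioned above has a standard form: focal loss down-weights easy pixels so rare classes are not swamped, while dice loss directly optimizes region overlap. A minimal per-pixel sketch for the binary case follows; the weighting factor `alpha` and the flattened-list interface are assumptions, not the paper's configuration:

```python
import math

def focal_loss(probs, labels, gamma=2.0, eps=1e-7):
    # Binary focal loss over flattened per-pixel probabilities.
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        pt = p if y == 1 else 1 - p          # prob of the true class
        total += -((1 - pt) ** gamma) * math.log(pt)
    return total / len(probs)

def dice_loss(probs, labels, eps=1e-7):
    # 1 - soft Dice coefficient: penalizes poor region overlap.
    inter = sum(p * y for p, y in zip(probs, labels))
    return 1 - (2 * inter + eps) / (sum(probs) + sum(labels) + eps)

def combined_loss(probs, labels, alpha=0.5):
    # Weighted sum of the two terms, as in focal+dice objectives.
    return alpha * focal_loss(probs, labels) + (1 - alpha) * dice_loss(probs, labels)
```

The two terms are complementary under class imbalance: dice is insensitive to the large background class, while focal keeps per-pixel gradients informative.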
Ablation Study of a Multimodal GAT Network on Perfect Synthetic and Real-World Data to Investigate the Influence of Language Models in Invoice Recognition
Abstract
Document analysis and invoice recognition have been significantly advanced in recent years by grid-based, graph-based and transformer architectures. However, it is not only the model architecture that influences an approach’s results, but also the quality of training and test data. In this paper, we perform an ablation study on an existing state-of-the-art pre-trained multimodal GAT network. Therein we investigate two kinds of modifications to understand the sensitivity of the results by (1) exchanging the language module and (2) applying both the original and modified network on a perfect synthetic and an imperfect real-world dataset. The results of the study show the importance of language modules for semantic embeddings in multimodal invoice recognition and illustrate the impact of data annotation quality. We further contribute an adapted GAT model for German invoices.
Lukas-Walter Thiée
Backmatter
Metadata
Title
Document Analysis and Recognition – ICDAR 2024 Workshops
Editors
Harold Mouchère
Anna Zhu
Copyright Year
2024
Electronic ISBN
978-3-031-70642-4
Print ISBN
978-3-031-70641-7
DOI
https://doi.org/10.1007/978-3-031-70642-4
