
2024 | Book

Document Analysis and Recognition – ICDAR 2024 Workshops

Athens, Greece, August 30–31, 2024, Proceedings, Part I


About this book

This two-volume set LNCS 14935-14936 constitutes the proceedings of International Workshops co-located with the 18th International Conference on Document Analysis and Recognition, ICDAR 2024, held in Athens, Greece, during August 30–31, 2024.
The 30 regular papers presented in these proceedings were carefully selected from 46 submissions.
Part I contains 16 regular papers that stem from the following workshops:

ICDAR 2024 Workshop on Automatically Domain-Adapted and Personalized Document Analysis (ADAPDA);

ICDAR 2024 Workshop on Advanced Analysis and Recognition of Parliamentary Corpora (ARPC);

ICDAR 2024 Workshop on coMics ANalysis, Processing and Understanding (MANPU).
Part II contains 14 regular papers that stem from the following workshops:

ICDAR 2024 Workshop on Computational Paleography (IWCP);

ICDAR 2024 Workshop on Machine Vision and NLP for Document Analysis (VINALDO).

Table of Contents

Frontmatter

ADAPDA

Frontmatter
Domain Adaptation for Handwriting Trajectory Reconstruction from IMU Sensors
Abstract
Digital pens are commonly used to write on digital devices, providing the handwriting trace and enhancing human-computer interaction. This study focuses on a digital pen equipped with kinematic sensors, allowing users to write on any surface while simultaneously preserving a digital trajectory of the handwriting. This technology holds significant potential as a valuable educational tool, particularly in classrooms, where it can facilitate the process of learning to write. A major issue stems from the difference in captured signals between adults and children: for a similar handwriting trace, the sensor signals differ widely because children write with different speed and confidence. To address this, we investigate a domain adaptation approach that builds a unified intermediate feature representation aimed at facilitating trajectory reconstruction. We demonstrate the value of domain adaptation methods in leveraging existing knowledge for application in different contexts. Specifically, we compare our domain adaptation approach with two other methods: training the model from scratch and fine-tuning the model.
Florent Imbert, Romain Tavenard, Yann Soullard, Eric Anquetil
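Below is a minimal, hypothetical PyTorch sketch contrasting the three training regimes the abstract compares (from scratch, fine-tuning, and domain adaptation); the GRU architecture, tensor shapes, and the mean-feature alignment penalty are our assumptions, not the authors' actual model.

```python
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """Maps a window of IMU samples to per-step 2D pen displacements."""
    def __init__(self, in_channels=6, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(in_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)      # (dx, dy) per time step

    def features(self, x):                    # x: (batch, time, channels)
        h, _ = self.encoder(x)
        return h

    def forward(self, x):
        return self.head(self.features(x))

def supervised_step(model, x, y, opt):
    """One step of 'from scratch' training; fine-tuning differs only in
    starting from pretrained adult-domain weights with a smaller LR."""
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()

def adaptation_step(model, x_child, y_child, x_adult, opt, lam=0.1):
    """Supervised loss on child data plus a crude feature-alignment penalty
    pulling child and adult encoder features together; the paper's actual
    alignment objective may differ."""
    opt.zero_grad()
    task = nn.functional.mse_loss(model(x_child), y_child)
    gap = (model.features(x_child).mean(dim=(0, 1))
           - model.features(x_adult).mean(dim=(0, 1)))
    loss = task + lam * gap.pow(2).sum()
    loss.backward()
    opt.step()
    return loss.item()
```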
TrOCR Meets Language Models: An End-to-End Post-correction Approach
Abstract
This study aims to enhance handwritten text recognition (HTR) performance and domain adaptability by combining an optical character recognition (OCR) model with a language model (LM) that serves as a corrector. This integration addresses three principal challenges: over-correction, which compromises text authenticity; poor domain adaptation; and the scarcity of annotated images. We explore the synergy between TrOCR, a state-of-the-art OCR model, and CharBERT, a BERT-based LM. A novel aspect of our research involves introducing common errors made by the recogniser into the LM, enabling it to consider these errors during correction, thereby improving overall performance. Our findings reveal that the hybrid TrOCR-CharBERT model effectively balances visual and linguistic information, preserving the authenticity of the original texts. Furthermore, the model is able to adapt to historical data even when the recogniser is trained solely on contemporary data, mitigating the need for a large number of annotated historical handwritten images.
Yung-Hsin Chen, Phillip B. Ströbel
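As a rough sketch of the recognise-then-correct pipeline: the TrOCR calls below use the real Hugging Face transformers API, but CharBERT is not bundled with that library, so the corrector is represented only by the error-injection idea described in the abstract, with confusion pairs invented for illustration.

```python
import random
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
recogniser = VisionEncoderDecoderModel.from_pretrained(
    "microsoft/trocr-base-handwritten")

def transcribe(image_path):
    """Raw HTR output, before any language-model correction."""
    pixels = processor(images=Image.open(image_path).convert("RGB"),
                       return_tensors="pt").pixel_values
    ids = recogniser.generate(pixels)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

# Hypothetical recogniser confusions (clean -> noisy). Corrupting clean text
# with them yields (noisy, clean) pairs for training a corrector that knows
# the recogniser's typical mistakes.
CONFUSIONS = [("m", "rn"), ("o", "0"), ("e", "c")]

def inject_errors(clean, p=0.2):
    noisy = clean
    for src, tgt in CONFUSIONS:
        if src in noisy and random.random() < p:
            noisy = noisy.replace(src, tgt, 1)
    return noisy
```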
LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach
Abstract
The rapid evolution of intelligent document processing systems demands robust solutions that adapt to diverse domains without extensive retraining. Traditional methods often falter with variable document types, leading to poor performance. To overcome these limitations, this paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration (DIR) systems. We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content. This hierarchical DIR framework dynamically adjusts to the characteristics of the input document, facilitating effective domain adaptation. We evaluated our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study. Initially trained on a synthetically generated dataset, our model demonstrates strong generalization capabilities for the DIR task, offering a promising solution for handling variability in real-world data. Our code is available at https://github.com/mpilligua/LayeredDoc.
Maria Pilligua, Nil Biescas, Javier Vazquez-Corral, Josep Lladós, Ernest Valveny, Sanket Biswas
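A schematic of the two-layer idea, assuming a simple convolutional stage per layer; the real LayeredDoc architecture is in the linked repository, and everything below (channel widths, concatenation-based conditioning) is an illustrative guess.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """A small image-to-image block standing in for one restoration layer."""
    def __init__(self, cin=3, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 3, 3, padding=1))

    def forward(self, x):
        return self.body(x)

class LayeredRestorer(nn.Module):
    """Stage 1 targets coarse graphic components; stage 2 refines printed
    text conditioned on both the degraded input and the stage-1 output."""
    def __init__(self):
        super().__init__()
        self.graphic_stage = Stage(cin=3)
        self.text_stage = Stage(cin=6)        # input image + graphic layer

    def forward(self, x):                     # x: (batch, 3, H, W)
        graphic = self.graphic_stage(x)
        return self.text_stage(torch.cat([x, graphic], dim=1))
```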
Normalized vs Diplomatic Annotation: A Case Study of Automatic Information Extraction from Handwritten Uruguayan Birth Certificates
Abstract
This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates, handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but annotated with different methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which cannot be standardized.
Natalia Bottaioli, Solène Tarride, Jérémy Anger, Seginus Mowlavi, Marina Gardella, Antoine Tadros, Gabriele Facciolo, Rafael Grompone von Gioi, Christopher Kermorvant, Jean-Michel Morel, Javier Preciozzi
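To make the two annotation styles concrete, here is a toy contrast for a single field: the diplomatic transcription keeps the writing verbatim, while normalization maps it to a canonical form. The rules below are illustrative, not the guidelines used for the Uruguayan corpus.

```python
import unicodedata

MONTHS = {"enero": "01", "mayo": "05", "octubre": "10"}  # illustrative excerpt

def normalize_place(diplomatic: str) -> str:
    # Strip accents and casing so 'Montevidéo' and 'MONTEVIDEO' collapse.
    t = unicodedata.normalize("NFD", diplomatic.lower())
    t = "".join(c for c in t if unicodedata.category(c) != "Mn")
    return t.strip().title()

def normalize_date(day: str, month_word: str, year: str) -> str:
    # Map a spelled-out date to an ISO form.
    return f"{year}-{MONTHS[month_word.lower()]}-{int(day):02d}"

assert normalize_place("Montevidéo") == "Montevideo"
assert normalize_date("3", "Mayo", "1912") == "1912-05-03"
```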

ARPC

Frontmatter
Diminutives in Political Discourse – The Case of Serbian and Slovenian
Abstract
Diminutives are widely recognized for their contribution to the expressiveness of language, often adding layers of nuance and attitude to discourse. This paper draws comparisons between the use of diminutives in the ParlaMint-RS 4.0 (Serbian parliament) and ParlaMint-SI 4.0 (Slovenian parliament) corpora [1]. Our findings reveal a distinctive pattern within political discussions: the employment of diminutives, particularly when referring to entities other than the speaker (i.e., not in the first person), is almost invariably associated with a negative connotation or the intention to convey irony. This paper aims to underscore the significance of such linguistic nuances, highlighting how diminutives can subtly influence the tone and perceived intent of political discourse. Through the examination of verbal diminutives, we contribute to a deeper understanding of the use of language in political contexts.
Milena Oparnica
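A crude illustration of how one might surface diminutive candidates in a ParlaMint-style token stream; the suffix inventory is a toy example and no substitute for the lemmatisation and morphological filtering such a study requires.

```python
from collections import Counter

# Illustrative Serbian/Slovenian diminutive suffixes only.
DIM_SUFFIXES = ("ić", "čić", "ica", "ček", "čka")

def diminutive_candidates(tokens):
    hits = (t.lower() for t in tokens if t.lower().endswith(DIM_SUFFIXES))
    return Counter(hits)

print(diminutive_candidates(["stolić", "vlada", "knjižica"]))
```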
Loghi: An End-to-End Framework for Making Historical Documents Machine-Readable
Abstract
Loghi is a novel framework and suite of tools for the layout analysis and text recognition of historical documents. Scans are processed in a modular pipeline, with the option to use alternative tools in most stages. Layout analysis and text recognition can be trained on example images with PageXML ground truth. The framework is intended to convert scanned documents to machine-readable PageXML. Additional tooling is provided for the creation of synthetic ground truth. A visualiser for troubleshooting the text recognition training is also made available. The result is a framework for end-to-end text recognition that starts from layout analysis on the scanned documents and includes text line detection, text recognition, reading order detection and language detection.
The Loghi pipeline has been used successfully in several projects. We achieve good results on the layout analysis and text recognition of both the handwritten and printed archives of the Dutch States General, on resolutions spanning the 17th and 18th centuries. The CER on handwritten 17th-century material is below 3%. Loghi is open source and free to use.
Rutger van Koert, Stefan Klut, Tim Koornstra, Martijn Maas, Luke Peters
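Loghi's output is PageXML; the sketch below shows how downstream code might consume it, extracting the recognised lines with the standard library. The namespace is the common 2013-07-15 PAGE schema; adjust it to whatever schema version your pipeline actually emits.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def read_lines(pagexml_path):
    """Collect recognised text lines, region by region, in document order."""
    root = ET.parse(pagexml_path).getroot()
    lines = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
            if unicode_el is not None and unicode_el.text:
                lines.append(unicode_el.text)
    return lines
```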
Open Parliamentary Data as a Tool for Linguistic Research: Exploring the ‘Greek Language Question’ in the Journal of Parliamentary Debates
Abstract
Parliamentary libraries currently face the challenge of shifting from being gatekeepers of a parliament's archival and contemporary “treasures” to functioning as dynamic information and knowledge hubs, through the production, management and availability of open diachronic and synchronic data. In this light, this paper presents the Hellenic Parliament Library experience, focusing on a subcategory of historical parliamentary data included in the Hellenic Parliament Library Repository, which is under construction. The first, more technically oriented part outlines the text detection and recognition process used to render the materials in question machine-readable; the second part highlights the potential for linguistic research through a case study exploring aspects of the ‘Greek language question’ as discussed in the parliamentary context, within the wider framework of language policy making.
Maria Kamilaki
Digitization of Written Parliamentary Questions from the Historical Archive (1974–1977) of the Hellenic Parliament
Abstract
This article outlines the digitization process and methodology applied to the archive of parliamentary questions from the 1st Parliamentary Term (1974–1977) of the Hellenic Parliament. A collaborative pilot project involving parliament, academia, and a research center facilitated the conversion of printed material to open data. The main tasks of the project include capturing digital images, a custom Optical Character Recognition (OCR) software solution employing machine learning, and rigorous accuracy validation of a fragmented, variable-quality polytonic corpus written in Katharevousa, a variety of Modern Greek. The article discusses the approach and challenges as well as the initial results of the digitization effort, emphasizing ongoing research steps. Overall, 1,674 images were digitally processed, corresponding to 1,338 questions. Following algorithmic training, character recognition accuracy is over 98.5%. Successful implementation streamlines further similar digitization operations in the vast parliamentary archives, while enabling in-depth studies on parliamentary control in the turbulent period of the immediate post-junta era in Greece. A preliminary comparative analysis with a corpus of newer parliamentary questions (2009–2019) provides insights and incentives for the further study of the characteristics and evolution of the Greek language.
Fotios Fitsilis, Basilis Gatos, Konstantinos Palaiologos, Panagiotis Kaddas, Charalambis Kyrkos, Maria-Eleni Georgoulea, Yiannis Armenakis, Christina Tasouli, George Mikros, Olivier Rozenberg, Eleni Kiousi
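For readers unfamiliar with the metric, character recognition accuracy of the kind quoted above is conventionally 1 − CER, where CER is the edit distance divided by the reference length; the following is a generic implementation, not the project's own evaluation code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 - CER, clipped implicitly by the reference length."""
    return 1.0 - edit_distance(reference, hypothesis) / max(1, len(reference))
```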

MANPU

Frontmatter
Retrieving and Analyzing Translations of American Newspaper Comics with Visual Evidence
Abstract
Research on image classification and on text translation for comics has largely developed independently. Machine translation tools focus on comics’ text, largely ignoring their heavily visual dimension, while image classification applications for comics focus primarily on genre and artist attribution. This paper bridges the gap between these areas by investigating how accurately image classification models can identify translations of American newspaper comic strips. How might machine learning algorithms leverage comics’ distinguishing visual features in order to identify pre-existing translations? To what extent do textual differences affect classification model accuracy in identifying otherwise identical comics? Using a dataset of ~18,000 English and Spanish comics, we generate embeddings from three CNNs and a Vision Transformer. We generate additional embeddings from binarized images and from images whose text has been redacted using an OCR model. We compute the cosine distance between given pairs of comics and evaluate its accuracy at retrieving translations. The best models rank the true translation first for 97% of queries, falling to 94% when the language is not known.
Jacob Murel, David A. Smith
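A hedged sketch of embedding-based retrieval as described: embed each strip with a frozen backbone, then rank candidates by cosine distance to the query. The torchvision ResNet-50 below is a stand-in; the paper evaluates three CNNs and a Vision Transformer.

```python
import torch
import torch.nn.functional as F
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()            # keep the 2048-d pooled features
backbone.eval()
prep = weights.transforms()                  # the weights' preprocessing

@torch.no_grad()
def embed(path):
    x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    return F.normalize(backbone(x), dim=-1).squeeze(0)

def rank_candidates(query_path, candidate_paths):
    """Smallest cosine distance first; the top hit is the retrieved translation."""
    q = embed(query_path)
    scored = [(1.0 - float(q @ embed(p)), p) for p in candidate_paths]
    return sorted(scored)
```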
Investigating Neural Networks and Transformer Models for Enhanced Comic Decoding
Abstract
Comic books, merging art with narrative, continue to captivate readers, cinema producers, and collectors, maintaining their allure as a cherished form of visual storytelling across decades. Comic image segmentation is a pivotal aspect of the digital transformation of comics. Leveraging heuristic approaches, a neural network-based model (YOLO), and innovative transformer-based architectures (GroundingDINO, SAM), our research aims to autonomously segment comic pages into fundamental components: panels, comic characters, and text areas. To this end, we further trained YOLOv5 and YOLOv8 models to identify these components, while the transformer-based models employed prompts to retrieve them. By comparing their outputs across three well-known datasets (eBDtheque, DCM772, Manga109) and using different metrics (Precision, Recall, Average Precision), we conclude that pre-trained self-supervised transformer models can competently outperform state-of-the-art approaches, which often require further fine-tuning to achieve comparable results.
Eleanna Kouletou, Vassilis Papavassiliou, Vassilis Katsouros
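For the YOLO side of such a comparison, inference with the ultralytics API looks like the following; the weight file name is a placeholder for a model fine-tuned on panel, character, and text classes as described in the abstract.

```python
from ultralytics import YOLO

model = YOLO("comics_yolov8.pt")        # hypothetical fine-tuned weights
results = model("comic_page.png")       # one Results object per input image

for box in results[0].boxes:
    cls_name = results[0].names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{cls_name}: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}) "
          f"conf={float(box.conf):.2f}")
```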
Comics Datasets Framework: Mix of Comics Datasets for Detection Benchmarking
Abstract
Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. For the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges such as small datasets, inconsistent annotations, inaccessible model weights, and results that cannot be directly compared due to varying train/test splits and metrics. To address these issues, we aim to standardize annotations across datasets, introduce a variety of comic styles into the datasets, and establish benchmark results with clear, replicable settings. Our proposed Comics Datasets Framework standardizes dataset annotations into a common format and addresses the overrepresentation of manga by introducing Comics100, a curated collection of 100 books from the Digital Comics Museum, annotated for detection in our uniform format. We have benchmarked a variety of detection architectures using the Comics Datasets Framework. All related code, model weights, and detailed evaluation processes are available at https://github.com/emanuelevivoli/cdf, ensuring transparency and facilitating replication. This initiative is a significant advancement towards improving object detection in comics, laying the groundwork for more complex computational tasks dependent on precise object recognition.
Emanuele Vivoli, Irene Campaioli, Mariateresa Nardoni, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas
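The unification step might resemble the following conversion of heterogeneous annotations into one COCO-style structure; the field names and the three-class inventory are assumptions, and the framework's actual schema is in the linked repository.

```python
def to_coco(records):
    """records: [{'file': str, 'width': int, 'height': int,
                  'boxes': [(class_name, x, y, w, h), ...]}, ...]"""
    cat_ids = {"panel": 1, "character": 2, "text": 3}   # assumed classes
    coco = {"images": [], "annotations": [],
            "categories": [{"id": i, "name": n} for n, i in cat_ids.items()]}
    ann_id = 1
    for img_id, rec in enumerate(records, 1):
        coco["images"].append({"id": img_id, "file_name": rec["file"],
                               "width": rec["width"], "height": rec["height"]})
        for name, x, y, w, h in rec["boxes"]:
            coco["annotations"].append({
                "id": ann_id, "image_id": img_id,
                "category_id": cat_ids[name],
                "bbox": [x, y, w, h], "area": w * h, "iscrowd": 0})
            ann_id += 1
    return coco
```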
A Comprehensive Gold Standard and Benchmark for Comics Text Detection and Recognition
Abstract
This study focuses on improving the optical character recognition (OCR) data for panels in COMICS [18], the largest dataset containing text and images from comic books. To do this, we developed a pipeline for OCR processing and labeling of comic books and created the first text detection and recognition datasets for Western comics, called “COMICS Text+: Detection” and “COMICS Text+: Recognition”. We evaluated the performance of fine-tuned state-of-the-art text detection and recognition models on these datasets and found significant improvement in word accuracy and normalized edit distance compared to the text in COMICS. We also created a new dataset called “COMICS Text+”, which contains the extracted text from the textboxes in COMICS. Using the improved text data of COMICS Text+ in the comics processing model from COMICS resulted in state-of-the-art performance on cloze-style tasks without changing the model architecture. COMICS Text+ can be a valuable resource for researchers working on tasks including text detection, recognition, and high-level processing of comics, such as narrative understanding, character relations, and story generation. All data, models, and instructions can be accessed online (https://github.com/gsoykan/comics_text_plus).
Gürkan Soykan, Deniz Yuret, Tevfik Metin Sezgin
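On the detection side, benchmarks of this kind typically match predicted text boxes to ground truth by IoU; the greedy matcher below is a generic sketch (threshold and matching protocol assumed), not the authors' evaluation script.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def precision_recall(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes."""
    matched, used = 0, set()
    for p in preds:
        score, idx = max(((iou(p, g), i) for i, g in enumerate(gts)
                          if i not in used), default=(0.0, -1))
        if score >= thr:
            matched += 1
            used.add(idx)
    return matched / max(1, len(preds)), matched / max(1, len(gts))
```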
Toward Accessible Comics for Blind and Low Vision Readers
Abstract
This work explores how to fine-tune large language models using prompt engineering techniques with contextual information to generate an accurate text description of the full story, ready to be forwarded to off-the-shelf speech synthesis tools. We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content, such as panels, characters, text, reading order and the association of bubbles and characters. We then infer character identification and generate a comic book script with context-aware panel descriptions, including characters’ appearance, posture, mood, dialogues, etc. We believe that such enriched content description can easily be used to produce audiobooks and eBooks with various voices for characters and captions, and to play sound effects.
Christophe Rigaud, Jean-Christophe Burie, Samuel Petit
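The grounded context the abstract describes could be serialized into a prompt along these lines; the field names, template, and sample data are invented for illustration.

```python
def panel_prompt(panel):
    """Serialize one panel's grounded elements into prompt text."""
    lines = [f"Panel {panel['index']}: {panel['setting']}"]
    for ch in panel["characters"]:
        lines.append(f"- {ch['name']} ({ch['posture']}, {ch['mood']})")
    for bubble in panel["dialogue"]:
        lines.append(f'{bubble["speaker"]} says: "{bubble["text"]}"')
    return "\n".join(lines)

# Toy grounded context, as produced upstream by detection/OCR/association.
panels = [{"index": 1, "setting": "a rainy street",
           "characters": [{"name": "Ana", "posture": "running",
                           "mood": "worried"}],
           "dialogue": [{"speaker": "Ana", "text": "We're late!"}]}]

prompt = ("Describe the following comic strip as a continuous story for a "
          "blind reader, keeping speakers explicit:\n\n"
          + "\n\n".join(panel_prompt(p) for p in panels))
```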
Quantitative Evaluation Based on CLIP for Methods Inhibiting Imitation of Painting Styles
Abstract
Image generation AIs that can produce a variety of high-quality images from text and images have been gaining attention; at the same time, they create a serious problem in which a third party generates AI art that imitates an artist’s style by fine-tuning on that specific artist’s work. Against this background, methods have been proposed that add small noise perturbations to an artwork to inhibit imitation of its painting style. Currently, two such methods exist: Glaze, which applies perturbations to artworks so that a false style is learned, and Mist, which makes it difficult to extract features from the work. Glaze and Mist differ from each other in their evaluation manners, and neither evaluates the real problem described above, i.e., the inhibition of imitating the style of a particular artist. Therefore, the purpose of this study is to realize a quantitative evaluation based on an index of artist-likeness. In this paper, we discuss three points: whether the proposed artist-likeness index and correct prediction rate are suitable for the quantitative evaluation of artist-likeness, how much the methods quantitatively inhibit the imitation of painting style, and whether Glaze or Mist is superior. Our experimental results confirm that the proposed metrics reflect artist-likeness and quantitatively evaluate how much Glaze and Mist inhibit the imitation of painting styles, and we find that Mist is superior to Glaze according to our approach.
Motoi Iwata, Keito Okamoto, Koichi Kise
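A minimal sketch of a CLIP-based artist-likeness probe, using the real Hugging Face CLIP API: score an image against per-artist style prompts and normalize over artists. The prompt template and the softmax scoring are assumptions, not the paper's exact index.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

artists = ["Claude Monet", "Katsushika Hokusai", "Gustav Klimt"]
prompts = [f"a painting in the style of {a}" for a in artists]

@torch.no_grad()
def artist_likeness(image_path):
    """Probability-like scores over candidate artists for one image."""
    inputs = processor(text=prompts, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image      # shape (1, num_artists)
    return dict(zip(artists, logits.softmax(dim=-1).squeeze(0).tolist()))
```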
Spatially Augmented Speech Bubble to Character Association via Comic Multi-task Learning
Abstract
Accurately associating speech bubbles with corresponding characters is a challenging yet crucial task in comic book processing. This problem is gaining increased attention as it enhances the accessibility and analyzability of this rapidly growing medium. Current methods often struggle with the complex spatial relationships within comic panels, which leads to inconsistent associations. To address these shortcomings, we developed a robust machine learning framework that leverages novel negative sampling methods, optimized pair-pool processes (the process of selecting speech bubble-character pairs during training) based on intra-panel spatial relationships, and an innovative masking strategy specifically designed for the relation branch of our model. Our approach builds upon and significantly enhances the COMIC MTL framework, improving its efficiency and accuracy in handling the unique challenges of comic book analysis. Finally, we conducted extensive experiments that demonstrate our model achieves state-of-the-art performance in linking characters to their speech bubbles. Moreover, through meticulous optimization of each component, from data preprocessing to neural network architecture, our method shows notable improvements in character face and body detection, as well as speech bubble segmentation.
Gürkan Soykan, Deniz Yuret, Tevfik Metin Sezgin
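The kind of intra-panel spatial signal such a pair pool can exploit is easy to illustrate; the features and normalization below are guesses, not the paper's actual model inputs.

```python
import math

def center(box):                         # box = (x1, y1, x2, y2)
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def pair_features(bubble, character, panel):
    """Spatial cues for ranking one bubble-character candidate pair."""
    bx, by = center(bubble)
    cx, cy = center(character)
    diag = math.hypot(panel[2] - panel[0], panel[3] - panel[1])
    panel_mid_x = (panel[0] + panel[2]) / 2
    return {
        "distance": math.hypot(bx - cx, by - cy) / diag,  # panel-normalized
        "bubble_above": float(by < cy),                   # tails point down
        "same_half": float((bx < panel_mid_x) == (cx < panel_mid_x)),
    }
```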
ComicBERT: A Transformer Model and Pre-training Strategy for Contextual Understanding in Comics
Abstract
Despite the growing interest in digital comic processing, foundation models tailored to this medium remain largely unexplored. Existing methods employ multimodal sequential models with cloze-style tasks, but they fall short of achieving human-like understanding. Addressing this gap, we introduce a novel transformer-based architecture, Comicsformer, and a comprehensive framework, ComicBERT, designed to process and understand the complex interplay of visual and textual elements in comics. Our approach utilizes a self-supervised objective, Masked Comic Modeling, inspired by BERT’s [6] masked language modeling objective, to train the foundation model. To fine-tune and validate our models, we adopt existing cloze-style tasks and propose new tasks, such as scene-cloze, that better capture the narrative and contextual intricacies unique to comics. Preliminary experiments indicate that these tasks enhance the model’s predictive accuracy and may provide new tools for comic creators, aiding in character dialogue generation and panel sequencing. Ultimately, ComicBERT aims to serve as a universal comic processor.
Gürkan Soykan, Deniz Yuret, Tevfik Metin Sezgin
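Masked comic modeling can be pictured as BERT-style masking over a sequence of panel/text embeddings; the masking ratio and the zero-vector replacement below are assumptions standing in for the paper's objective.

```python
import torch

def mask_sequence(embeddings, mask_ratio=0.15):
    """embeddings: (batch, seq_len, dim) of panel/text features.
    Returns the corrupted sequence and the boolean mask of hidden positions."""
    b, t, _ = embeddings.shape
    mask = torch.rand(b, t) < mask_ratio      # positions to reconstruct
    corrupted = embeddings.clone()
    corrupted[mask] = 0.0                     # stand-in for a [MASK] vector
    return corrupted, mask

# Training target: reconstruct only the masked positions, e.g.
# loss = torch.nn.functional.mse_loss(model(corrupted)[mask], embeddings[mask])
```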
Backmatter
Metadata
Title
Document Analysis and Recognition – ICDAR 2024 Workshops
Editors
Harold Mouchère
Anna Zhu
Copyright Year
2024
Electronic ISBN
978-3-031-70645-5
Print ISBN
978-3-031-70644-8
DOI
https://doi.org/10.1007/978-3-031-70645-5
