2024 | Book

Document Analysis and Recognition - ICDAR 2024

18th International Conference, Athens, Greece, August 30–September 4, 2024, Proceedings, Part VI

About this book

This six-volume set LNCS 14804-14809 constitutes the proceedings of the 18th International Conference on Document Analysis and Recognition, ICDAR 2024, held in Athens, Greece, during August 30–September 4, 2024.
The 144 full papers presented in these proceedings were carefully selected from 263 submissions.
The papers reflect topics such as: document image processing; physical and logical layout analysis; text and symbol recognition; handwriting recognition; document analysis systems; document classification; indexing and retrieval of documents; document synthesis; extracting document semantics; NLP for document understanding; office automation; graphics recognition; human document interaction; document representation modeling and much more.

Table of Contents

Frontmatter

Music Recognition

Frontmatter
Source-Free Domain Adaptation for Optical Music Recognition
Abstract
This work addresses the problem of Domain Adaptation (DA) in the context of staff-level end-to-end Optical Music Recognition. Specifically, we consider a source-free DA approach to adapt a given trained model to a new collection—an extremely useful scenario for preserving musical heritage. The method involves re-training the pre-trained model to align the statistics stored from the original data in normalization layers with those of the new collection, while also including a regularization mechanism to prevent the model from converging to undesirable solutions. Unlike conventional DA techniques, this approach is very efficient and practical, as it only requires the pre-trained model and unlabeled data from the new collection, without relying on data from the original training collections (i.e., source-free). Evaluation on diverse music collections in Mensural notation and on a synthetic-to-real scenario of common Western modern notation demonstrates consistent improvements over the baseline (no DA), often with remarkable relative gains.
Adrián Roselló, Eliseo Fuentes-Martínez, María Alfaro-Contreras, David Rizo, Jorge Calvo-Zaragoza
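As a rough illustration of the adaptation idea summarized above, the sketch below freezes a pretrained recognizer, lets its BatchNorm layers re-estimate statistics on unlabeled target images, and uses an entropy term as a stand-in for the regularization mechanism. The loader, step count, and regularizer are assumptions, not the authors' exact recipe.

```python
# Hedged sketch: source-free adaptation of normalization layers on an unlabeled
# target collection. The entropy regulariser and hyperparameters are illustrative.
import itertools
import torch
import torch.nn as nn

def configure_for_adaptation(model: nn.Module):
    """Freeze all weights except normalization affine parameters; BN re-estimates stats."""
    model.train()                                  # BatchNorm uses batch statistics
    for p in model.parameters():
        p.requires_grad_(False)
    adapt_params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.reset_running_stats()                # discard source-collection statistics
            if m.affine:
                m.weight.requires_grad_(True)
                m.bias.requires_grad_(True)
                adapt_params += [m.weight, m.bias]
    return adapt_params

def entropy_regulariser(logits: torch.Tensor) -> torch.Tensor:
    """Keeps predictions confident on unlabeled target batches (illustrative regulariser)."""
    probs = logits.softmax(dim=-1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

def adapt(model, target_loader, steps=100, lr=1e-4):
    params = configure_for_adaptation(model)
    optimizer = torch.optim.Adam(params, lr=lr)
    for images in itertools.islice(itertools.cycle(target_loader), steps):
        logits = model(images)                     # e.g. (batch, time, vocab) for a CTC OMR model
        loss = entropy_regulariser(logits)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return model
```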
Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription
Abstract
State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily been carried out using monophonic transcription techniques to handle complex score layouts, such as polyphony, often by resorting to simplifications or specific adaptations. Despite their efficacy, these approaches introduce scalability challenges and other limitations. This paper presents the Sheet Music Transformer (SMT), the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental outcomes not only indicate the competence of the model, but also show that it outperforms state-of-the-art methods, thus contributing to advancements in end-to-end OMR transcription.
Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet
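A minimal sketch of the image-to-sequence pattern the abstract describes: a convolutional backbone turns the score image into a feature sequence that a Transformer decoder converts into music-encoding tokens. Layer sizes, vocabulary, and the omitted positional encodings are placeholders, not the SMT architecture.

```python
# Hedged sketch: generic image-to-sequence Transformer for score transcription.
import torch
import torch.nn as nn

class Img2Seq(nn.Module):
    def __init__(self, vocab_size=512, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(              # tiny CNN stand-in for the image encoder
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU())
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image, tokens):
        feats = self.backbone(image)                          # (B, d, H', W')
        memory = feats.flatten(2).transpose(1, 2)             # (B, H'*W', d) image sequence
        tgt = self.embed(tokens)                              # positional encodings omitted for brevity
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float('-inf'), device=tokens.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)        # autoregressive decoding
        return self.head(out)                                 # logits over encoding tokens

model = Img2Seq()
logits = model(torch.randn(2, 1, 64, 256), torch.randint(0, 512, (2, 20)))
```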
The KuiSCIMA Dataset for Optical Music Recognition of Ancient Chinese Suzipu Notation
Abstract
In recent years, the development of Optical Music Recognition (OMR) has progressed significantly. However, music cultures with smaller communities have only recently been considered in this process. This results in a lack of adequate ground truth datasets needed for the development and benchmarking of OMR systems. In this work, the KuiSCIMA (Jiang Kui Score Images for Musicological Analysis) dataset is introduced. KuiSCIMA is the first machine-readable dataset of the suzipu notations in Jiang Kui’s collection Baishidaoren Gequ from 1202. Collected from five different woodblock print editions, the dataset contains 21,797 manually annotated instances on 153 pages in total, of which 14,500 are text character annotations and 7,297 are suzipu notation symbols. The dataset comes with an open-source tool which allows editing, visualizing, and exporting the contents of the dataset files. Overall, this contribution promotes the preservation and understanding of cultural heritage through digitization.
Tristan Repolusk, Eduardo Veas
Practical End-to-End Optical Music Recognition for Pianoform Music
Abstract
The majority of recent progress in Optical Music Recognition (OMR) has been achieved with Deep Learning methods, especially models following the end-to-end paradigm that read input images and produce a linear sequence of tokens. Unfortunately, many music scores, especially piano music, cannot be easily converted to a linear sequence. This has led OMR researchers to use custom linearized encodings, instead of broadly accepted structured formats for music notation. Their diversity makes it difficult to compare the performance of OMR systems directly. To bring recent OMR model progress closer to useful results: (a) We define a sequential format called Linearized MusicXML, allowing us to train an end-to-end model directly and maintain close cohesion and compatibility with the industry-standard MusicXML format. (b) We create a dev and test set for benchmarking typeset OMR based on the OpenScore Lieder corpus, containing 1,438 and 1,493 pianoform systems, each with an image from IMSLP and MusicXML ground truth. (c) We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model. We also test our model against the recently published synthetic pianoform dataset GrandStaff and surpass the state-of-the-art results.
Jiří Mayer, Milan Straka, Jan Hajič, Pavel Pecina

Visual Question Answering and Comics

Frontmatter
UniVIE: A Unified Label Space Approach to Visual Information Extraction from Form-Like Documents
Abstract
Existing methods for Visual Information Extraction (VIE) from form-like documents typically fragment the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. However, these approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address these limitations, we present a new perspective, reframing VIE as a relation prediction problem and unifying labels of different tasks into a single label space. This unified approach allows for the definition of various relation types and effectively tackles hierarchical relationships in form-like documents. In line with this perspective, we present UniVIE, a unified model that addresses the VIE problem comprehensively. UniVIE functions using a coarse-to-fine strategy. It initially generates tree proposals through a Tree Proposal Network, which are subsequently refined into hierarchical trees by a Relation Decoder module. To enhance the relation prediction capabilities of UniVIE, we incorporate two novel tree constraints into the Relation Decoder: a Tree Attention Mask and a Tree Level Embedding. Extensive experimental evaluations on both our in-house dataset HierForms and the publicly available dataset SIBR substantiate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our unified approach in advancing the field of VIE.
Kai Hu, Jiawei Wang, Weihong Lin, Zhuoyao Zhong, Lei Sun, Qiang Huo
Extractive Question Answering with Contrastive Puzzles and Reweighted Clues
Abstract
The task of Extractive Question Answering (EQA) involves identifying correct answer spans in response to provided questions and passages. The emergence of Pretrained Language Models (PLMs) has sparked increased interest in leveraging these models for EQA tasks, yielding promising results. Nonetheless, current approaches frequently neglect the issue of label noise, which arises from incomplete labeling and inconsistent annotations, thereby reducing model performance. To address this issue, we propose the Contrastive Puzzles and Reweighted Clues (CPRC) method, designed to mitigate the adverse effects of label noise. Our approach involves categorizing training data into Puzzle and Clue samples based on their loss and text similarity to the golden answer during model training. Subsequently, CPRC incorporates a hybrid intra- and inter-contrastive learning approach for Puzzle samples and dynamically adjusts the weights of Clue samples, respectively. Experimental results on three benchmark datasets demonstrate the superior performance of the proposed CPRC compared to conventional approaches, highlighting its efficacy in mitigating label noise and achieving enhanced EQA performance.
Chao Liu, Jie Yang, Wanqing Li
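For intuition only, the following sketch shows one way a batch could be split into "Puzzle" and "Clue" samples from per-example loss and textual similarity to the gold answer, with a simple dynamic weight for the clues. The thresholds, similarity measure, and weighting rule are illustrative, not the CPRC formulation.

```python
# Hedged sketch: partitioning EQA training samples by loss and answer similarity,
# then down-weighting noisy-looking clue samples. All thresholds are assumptions.
import difflib

def similarity(pred_span: str, gold_span: str) -> float:
    return difflib.SequenceMatcher(None, pred_span, gold_span).ratio()

def partition_batch(samples, loss_threshold=2.0, sim_threshold=0.6):
    puzzles, clues = [], []
    for s in samples:               # each sample: per-example loss plus predicted/gold spans
        if s["loss"] > loss_threshold and similarity(s["pred"], s["gold"]) < sim_threshold:
            puzzles.append(s)       # hard / likely mislabeled, handled contrastively in the paper
        else:
            clues.append(s)         # trustworthy, kept but reweighted
    return puzzles, clues

def clue_weight(s, max_loss=2.0):
    """Simple dynamic weight: confident, low-loss clues count more."""
    return max(0.1, 1.0 - s["loss"] / max_loss)

batch = [{"loss": 0.4, "pred": "in 1998", "gold": "1998"},
         {"loss": 3.1, "pred": "the museum", "gold": "Monday"}]
puzzles, clues = partition_batch(batch)
weights = [clue_weight(s) for s in clues]
```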
CHIC: Corporate Document for Visual Question Answering
Abstract
The massive use of digital documents driven by the trend toward paperless operations confronts some companies with the need to process thousands of documents per day automatically. To achieve this, they use automatic information retrieval (IR), allowing them to extract useful information from large datasets quickly. Effective IR methods first require an adequate dataset. Although companies have enough data to cover their own needs, a public dataset is also needed to compare contributions between state-of-the-art methods. Public document datasets such as DocVQA and XFUND already exist, but they do not fully satisfy the needs of companies. First, XFUND contains only form documents, whereas companies handle several types of documents (i.e. structured documents like forms, but also semi-structured ones such as invoices, and unstructured ones such as emails). DocVQA, for its part, covers several types of documents, but only 4.5% of them are business documents (i.e. invoices, purchase orders, etc.), and even these do not reflect the diversity of documents that companies may encounter in their daily document flow. To address these limitations, we propose in this paper the CHIC dataset, a public visual question answering dataset that contains different types of business documents and whose extracted information meets the expectations of companies.
Ibrahim Souleiman Mahamoud, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain d’Andecy, Jean-Marc Ogier
Multimodal Transformer for Comics Text-Cloze
Abstract
This work explores a closure task in comics, a medium where visual and textual elements are intricately intertwined. Specifically, text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighbouring panels. Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations. We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for text-cloze, achieving a 10% improvement over existing state-of-the-art models in both its easy and hard variants. Central to our approach is a Domain-Adapted ResNet-50 based visual encoder, fine-tuned to the comics domain in a self-supervised manner using SimCLR. This encoder delivers comparable results to more complex models with just one-fifth of the parameters. Additionally, we release new OCR annotations for this dataset, enhancing model input quality and resulting in another 1% improvement. Finally, we extend the task to a generative format, establishing new baselines and expanding the research possibilities in the field of comics analysis. Code is available at https://github.com/joanlafuente/ComicVT5.
Emanuele Vivoli, Joan Lafuente Baeza, Ernest Valveny Llobet, Dimosthenis Karatzas
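The self-supervised adaptation mentioned above relies on SimCLR; a minimal sketch of its NT-Xent objective applied to two augmented views through a ResNet-50 with a projection head is given below. Augmentations, projection sizes, and the temperature are assumptions, not the paper's settings.

```python
# Hedged sketch: SimCLR-style contrastive pretraining of a ResNet-50 encoder.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss over a batch of two augmented views (z1[i] and z2[i] are positives)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d), unit norm
    sim = z @ z.t() / temperature                             # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))                # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Encoder plus projection head (the head is discarded after pretraining)
encoder = resnet50(weights=None)
encoder.fc = torch.nn.Sequential(
    torch.nn.Linear(2048, 512), torch.nn.ReLU(), torch.nn.Linear(512, 128))

view1 = torch.randn(8, 3, 224, 224)    # two augmented views of the same comic panels
view2 = torch.randn(8, 3, 224, 224)
loss = nt_xent(encoder(view1), encoder(view2))
loss.backward()
```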
Federated Document Visual Question Answering: A Pilot Study
Abstract
An important handicap of document analysis research is that documents tend to be copyrighted or contain private information, which prohibits their open publication and the creation of centralised, large-scale document datasets. Instead, documents are scattered in private data silos, making extensive training over heterogeneous data a tedious task. In this work, we explore the use of a federated learning (FL) scheme as a way to train a shared model on decentralised private document data. We focus on the problem of Document VQA, a task particularly suited to this approach, as the type of reasoning capabilities required from the model can be quite different in diverse domains. Enabling training over heterogeneous document datasets can thus substantially enrich DocVQA models. We assemble existing DocVQA datasets from diverse domains to reflect the data heterogeneity in real-world applications. We explore the self-pretraining technique in this multi-modal setting, where the same data is used for both pretraining and finetuning, making it relevant for privacy preservation. We further propose combining self-pretraining with a Federated DocVQA training method using centralized adaptive optimization that outperforms the FedAvg baseline. With extensive experiments (the code is available at https://github.com/khanhnguyen21006/fldocvqa), we also present a multi-faceted analysis of training DocVQA models with FL, which provides insights for future research on this task. We show that our pretraining strategies can effectively learn and scale up under federated training with diverse DocVQA datasets, and that tuning hyperparameters is essential for practical document tasks under federation.
Khanh Nguyen, Dimosthenis Karatzas
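The "centralized adaptive optimization" contrasted with FedAvg can be pictured as a FedAdam-style server update: average the clients' parameter deltas, then apply an Adam-like step on the server. The sketch below is a generic version of that idea, not the paper's exact training loop; with plain FedAvg the server would simply load the averaged client weights.

```python
# Hedged sketch: one server round of FedAdam-style aggregation. Client training,
# sampling and hyperparameters are placeholders.
from copy import deepcopy
import torch
import torch.nn as nn

def fedadam_round(global_model, client_states, server_state,
                  lr=1e-3, betas=(0.9, 0.999), tau=1e-3):
    """Average client parameter deltas, then apply an Adam-like server step."""
    g = global_model.state_dict()
    m = server_state.setdefault('m', {})
    v = server_state.setdefault('v', {})
    new_state = {}
    for k in g:
        if not g[k].is_floating_point():                # skip integer buffers (e.g. BN counters)
            new_state[k] = g[k]
            continue
        delta = torch.stack([c[k] - g[k] for c in client_states]).mean(0)   # pseudo-gradient
        m[k] = betas[0] * m.get(k, torch.zeros_like(delta)) + (1 - betas[0]) * delta
        v[k] = betas[1] * v.get(k, torch.zeros_like(delta)) + (1 - betas[1]) * delta ** 2
        new_state[k] = g[k] + lr * m[k] / (v[k].sqrt() + tau)
    global_model.load_state_dict(new_state)
    return global_model

# Toy usage: two "clients" holding locally perturbed copies of a small model
global_model = nn.Linear(8, 2)
clients = []
for _ in range(2):
    local = deepcopy(global_model)
    with torch.no_grad():                               # stand-in for local DocVQA training
        for p in local.parameters():
            p.add_(0.01 * torch.randn_like(p))
    clients.append(local.state_dict())
fedadam_round(global_model, clients, server_state={})
```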

Visual Question Answering and LLMs

Frontmatter
Revisiting N-Gram Models: Their Impact in Modern Neural Networks for Handwritten Text Recognition
Abstract
In recent advances in automatic text recognition (ATR), deep neural networks have demonstrated the ability to implicitly capture language statistics, potentially reducing the need for traditional language models. This study directly addresses whether explicit language models, specifically n-gram models, still contribute to the performance of state-of-the-art deep learning architectures in the field of handwriting recognition. We evaluate two prominent neural network architectures, PyLaia [23] and DAN [8], with and without the integration of explicit n-gram language models. Our experiments on three datasets - IAM [19], RIMES [11], and NorHand v2 [2] - at both line and page level, investigate optimal parameters for n-gram models, including their order, weight, smoothing methods and tokenization level. The results show that incorporating character or subword n-gram models significantly improves the performance of ATR models on all datasets, challenging the notion that deep learning models alone are sufficient for optimal performance. In particular, the combination of DAN with a character language model outperforms current benchmarks, confirming the value of hybrid approaches in modern document analysis systems.
Solène Tarride, Christopher Kermorvant
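As a concrete reference point for the explicit language models discussed above, here is a minimal character n-gram with additive smoothing used to rescore an ATR model's hypotheses through a weighted log-linear combination. The order, smoothing constant, and LM weight are illustrative, not the values tuned in the paper.

```python
# Hedged sketch: character n-gram LM with add-k smoothing plus n-best rescoring.
import math
from collections import defaultdict

class CharNGram:
    def __init__(self, order=4, k=0.1):
        self.order, self.k = order, k
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, lines):
        for line in lines:
            chars = ['<s>'] * (self.order - 1) + list(line) + ['</s>']
            self.vocab.update(chars)
            for i in range(self.order - 1, len(chars)):
                ctx = tuple(chars[i - self.order + 1:i])
                self.counts[ctx][chars[i]] += 1

    def logprob(self, line):
        chars = ['<s>'] * (self.order - 1) + list(line) + ['</s>']
        lp = 0.0
        for i in range(self.order - 1, len(chars)):
            ctx = tuple(chars[i - self.order + 1:i])
            num = self.counts[ctx][chars[i]] + self.k            # add-k smoothing
            den = sum(self.counts[ctx].values()) + self.k * len(self.vocab)
            lp += math.log(num / den)
        return lp

def rescore(hypotheses, lm, lm_weight=0.5):
    """hypotheses: list of (text, atr_log_score); return the best after log-linear fusion."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm.logprob(h[0]))

lm = CharNGram(order=4)
lm.train(["the quick brown fox", "handwritten text recognition"])
best = rescore([("handwrtten text", -4.2), ("handwritten text", -4.5)], lm)
```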
ConClue: Conditional Clue Extraction for Multiple Choice Question Answering
Abstract
The task of Multiple Choice Question Answering (MCQA) aims to identify the correct answer from a set of candidates, given a background passage and an associated question. Considerable research efforts have been dedicated to addressing this task, leveraging a diversity of semantic matching techniques to estimate the alignment among the answer, passage, and question. However, key challenges arise as not all sentences from the passage contribute to the question answering, while only a few supporting sentences (clues) are useful. Existing clue extraction methods suffer from inefficiencies in identifying supporting sentences, relying on resource-intensive algorithms, pseudo labels, or overlooking the semantic coherence of the original passage. Addressing this gap, this paper introduces a novel extraction approach, termed Conditional Clue extractor (ConClue), for MCQA. ConClue leverages the principles of Conditional Optimal Transport to effectively identify clues by transporting the semantic meaning of one or several words (from the original passage) to selected words (within identified clues), under the prior condition of the question and answer. Empirical studies on several competitive benchmarks consistently demonstrate the superiority of our proposed method over different traditional approaches, with a substantial average improvement of 1.1–2.5 absolute percentage points in answering accuracy.
Wangli Yang, Jie Yang, Wanqing Li, Yi Guo
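The transport machinery underlying ConClue can be illustrated with plain entropic optimal transport solved by Sinkhorn iterations between passage-word and clue-word embeddings; the conditional variant (conditioning on the question and answer) is the paper's contribution and is not reproduced here. The embeddings below are random stand-ins.

```python
# Hedged sketch: entropic OT (Sinkhorn) between passage words and a candidate clue.
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Return the transport plan between histograms a and b for a given cost matrix."""
    K = np.exp(-cost / eps)                     # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]          # transport plan P

# Toy example: cosine-distance cost between (hypothetical) word embeddings
rng = np.random.default_rng(0)
passage_emb = rng.normal(size=(12, 64))         # 12 passage words
clue_emb = rng.normal(size=(5, 64))             # 5 words of a candidate clue sentence
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
cost = 1.0 - norm(passage_emb) @ norm(clue_emb).T
a = np.full(12, 1 / 12)
b = np.full(5, 1 / 5)
P = sinkhorn(cost, a, b)
transport_cost = (P * cost).sum()               # lower cost suggests a better-aligned clue
```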
Privacy-Aware Document Visual Question Answering
Abstract
Document Visual Question Answering (DocVQA) has quickly grown into a central task of document understanding. But despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees.
In this work, we explore privacy in the domain of DocVQA for the first time, highlighting privacy issues in state-of-the-art multi-modal LLMs used for DocVQA, and exploring possible solutions.
Specifically, we focus on invoice processing as a realistic document understanding scenario, and propose a large scale DocVQA dataset comprising invoice documents and associated questions and answers. We employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the data of the invoice provider is the sensitive information to be protected.
We demonstrate that non-private models tend to memorise, a behaviour that can lead to exposing private information. We then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through either or both of the two input modalities: vision (document image) or language (OCR tokens).
Finally, we design attacks exploiting the memorisation effect of the model, and demonstrate their effectiveness in probing a representative DocVQA model.
Rubèn Tito, Khanh Nguyen, Marlon Tobaben, Raouf Kerkouche, Mohamed Ali Souibgui, Kangsoo Jung, Joonas Jälkö, Vincent Poulain D’Andecy, Aurelie Joseph, Lei Kang, Ernest Valveny, Antti Honkela, Mario Fritz, Dimosthenis Karatzas
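One building block of the differential-privacy baselines mentioned above is DP-SGD: clip each example's gradient to bound sensitivity, then add Gaussian noise to the aggregate. The sketch below shows a single step on a toy model; the clip norm and noise multiplier are illustrative, and practical training would typically use a library such as Opacus.

```python
# Hedged sketch: one DP-SGD step with per-example clipping and Gaussian noise.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.05

def dp_sgd_step(batch_x, batch_y):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):                  # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)   # clip to bound sensitivity
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p -= lr * (s + noise) / len(batch_x)        # noisy averaged update

dp_sgd_step(torch.randn(8, 16), torch.randint(0, 2, (8,)))
```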
Multi-page Document Visual Question Answering Using Self-attention Scoring Mechanism
Abstract
Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle. They have to concatenate all pages into one large page for processing, demanding substantial GPU resources, even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct. Our approach utilizes a self-attention scoring mechanism to generate relevance scores for each document page, enabling the retrieval of pertinent pages. This adaptation allows us to extend single-page Document VQA models to multi-page scenarios without constraints on the number of pages during evaluation, all with minimal demand for GPU resources. Our extensive experiments demonstrate not only achieving state-of-the-art performance without the need for Optical Character Recognition (OCR), but also sustained performance in scenarios extending to documents of nearly 800 pages compared to a maximum of 20 pages in the MP-DocVQA dataset. Our code is publicly available at https://​github.​com/​leitro/​SelfAttnScoring-MPDocVQA.
Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas
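A minimal sketch of the page-scoring idea: project a question embedding and per-page embeddings into query/key spaces, compute scaled dot-product scores, and keep the most relevant page for answering. The embedding sizes and the way the features are obtained are assumptions; the paper derives them from the Pix2Struct encoder.

```python
# Hedged sketch: attention-based relevance scoring over document pages.
import torch
import torch.nn as nn

class PageScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, question_emb, page_embs):
        # question_emb: (dim,), page_embs: (num_pages, dim)
        q = self.q_proj(question_emb)                    # query from the question
        k = self.k_proj(page_embs)                       # keys from each page
        scores = (k @ q) / (q.shape[-1] ** 0.5)          # scaled dot-product scores
        return scores.softmax(dim=0)                     # per-page relevance distribution

scorer = PageScorer()
question = torch.randn(256)
pages = torch.randn(12, 256)                             # e.g. a 12-page document
relevance = scorer(question, pages)
best_page = int(relevance.argmax())                      # page passed on for answering
```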
Improving Retrieval-Based Dialogue Systems: Fine-Grained Post-training Prompt Adaptation and Pairwise Optimization Fine-Tuning Strategy
Abstract
Pre-trained models have demonstrated robust performance in natural language processing tasks. In retrieval-based dialogue systems, the majority of existing studies have reduced the multi-round dialogue response selection problem to a classification problem. While such approaches have proven effective in retrieval-based dialogue systems, they have not fully exploited the rich contextual understanding of pre-trained models and have been unable to effectively deal with complex contexts and semantic relations in multi-turn dialogues, which may result in potential information loss and performance bottlenecks. This paper proposes a fine-grained post-training prompt adaptation method and pairwise optimization fine-tuning strategy (FPPP). During training, the model’s contextual understanding and logical reasoning ability are enhanced through the use of a fine-grained post-training prompt adaptation method. In the prompt-tuning phase, a pairwise optimization fine-tuning strategy is employed to improve the model’s ability to effectively discriminate between positive and negative samples. On all three datasets, FPPP outperforms the baseline model, improving the R10@1 metric by 0.1%, 1.4%, and 3.6%, respectively. The experimental results not only confirm the effectiveness of our method, but also provide a new approach for retrieval-based dialogue systems.
Tianqing Zhang, Alimjan Aysa, Li Zhao, Kurban Ubul, Enguang Zuo
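The pairwise optimization idea can be pictured as a margin ranking objective that pushes the score of the true response above a sampled negative for the same context; the scorer and margin below are placeholders, not the FPPP formulation.

```python
# Hedged sketch: pairwise (margin ranking) fine-tuning for response selection.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
margin_loss = nn.MarginRankingLoss(margin=0.5)

# (hypothetical) encoder outputs for context-response pairs
pos_feats = torch.randn(16, 768)                 # context paired with the true response
neg_feats = torch.randn(16, 768)                 # context paired with a sampled negative
pos_scores = scorer(pos_feats).squeeze(-1)
neg_scores = scorer(neg_feats).squeeze(-1)
target = torch.ones_like(pos_scores)             # "positive should rank above negative"
loss = margin_loss(pos_scores, neg_scores, target)
loss.backward()
```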
Information Extraction from Visually Rich Documents Using Directed Weighted Graph Neural Network
Abstract
This paper presents a novel approach to information extraction (IE) from visually rich documents (VRD) by employing a directed weighted graph representation to capture relationships among various VRD components. In contrast to conventional methods relying on spatial proximity through Euclidean distance, our approach aims to enhance performance by introducing a novel representation of relationships using directed weighted graphs. The information extraction task from VRD is treated as a node classification problem, leveraging graph convolutional networks that process the VRD graphs. We conducted evaluations on five real-world datasets, obtaining notable results that are in line with established benchmarks.
Hamza Gbada, Karim Kalti, Mohamed Ali Mahjoub
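For intuition, the sketch below builds a directed, weighted adjacency over document components with a toy distance-based rule and classifies nodes with simple graph-convolution layers; the paper defines its own directed edge weights, which differ from this heuristic.

```python
# Hedged sketch: node classification over a directed weighted graph of VRD components.
import math
import torch
import torch.nn as nn

def build_adjacency(centres, sigma=100.0):
    """Directed edge weights from each component towards components below or to its right
    (a toy reading-order heuristic), decayed by distance."""
    n = len(centres)
    A = torch.zeros(n, n)
    for i, (xi, yi) in enumerate(centres):
        for j, (xj, yj) in enumerate(centres):
            if i != j and (yj >= yi or xj >= xi):
                A[i, j] = math.exp(-math.hypot(xi - xj, yi - yj) / sigma)
    return A + torch.eye(n)                               # self-loops

class GCNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, A, X):
        deg = A.sum(dim=1, keepdim=True).clamp_min(1e-6)   # row-normalised propagation
        return torch.relu(self.lin((A / deg) @ X))

centres = [(10, 10), (200, 12), (15, 60), (210, 65)]       # component centre points
features = torch.randn(4, 32)                              # e.g. textual + layout features
A = build_adjacency(centres)
gcn1, gcn2 = GCNLayer(32, 64), GCNLayer(64, 64)
classifier = nn.Linear(64, 5)                              # 5 field types to classify
logits = classifier(gcn2(A, gcn1(A, features)))            # per-node class logits
```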
IndicBART Alongside Visual Element: Multimodal Summarization in Diverse Indian Languages
Abstract
In the age of information overload, the demand for advanced summarization techniques has surged, especially in linguistically diverse regions such as India. This paper introduces an innovative approach to multimodal multilingual summarization that seamlessly unites textual and visual elements. Our research focuses on four prominent Indian languages: Hindi, Bangla, Gujarati, and Marathi, employing abstractive summarization methods to craft coherent and concise summaries. For text summarization, we leverage the capabilities of the pre-trained IndicBART model, known for its exceptional proficiency in comprehending and generating text in Indian languages. We integrate an image summarization component based on the Image Pointer model to tackle multimodal challenges. This component identifies images from the input that enhance and complement the generated summaries, contributing to the overall comprehensiveness of our multimodal summaries. Our proposed methodology attains excellent results, surpassing other text summarization approaches tailored for the specified Indian languages. Furthermore, we enhance the significance of our work by incorporating a user satisfaction evaluation method, thereby providing a robust framework for assessing the quality of summaries. This holistic approach contributes to the advancement of summarization techniques, particularly in diverse Indian languages.
Raghvendra Kumar, Deepak Prakash, Sriparna Saha, Shubham Sharma
Exploring the Capabilities of Large Multimodal Models on Dense Text
Abstract
While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.
Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai

Competitions

Frontmatter
ICDAR 2024 Competition on Artistic Text Recognition
Abstract
Artistic text is widely used in advertisements, slogans, exhibitions, decorations, magazines, and books. However, artistic text recognition is an overlooked and extremely challenging task, despite its importance and practical value in various applications. Artistic text recognition poses several challenges, such as the varied appearances produced by specially designed fonts and effects, the complex connections and overlaps between characters, and the severe interference from background patterns. Therefore, we organized the ICDAR 2024 Competition on Artistic Text Recognition to invite participants to solve these challenges. We propose the WordArt-V1.5 dataset to advance the field by incorporating a broader range of artistic text images sourced from diverse scenes. This enhanced artistic text recognition dataset contains a total of 12,000 images with 6,000 for training and 6,000 for testing. The competition attracted 33 participants and received 126 submissions, with a best accuracy of 91.07%. In this paper, we provide an overview of the competition, detailing the proposed dataset, task, evaluation protocol, and result summaries.
Xudong Xie, Linger Deng, Zhifei Zhang, Zhaowen Wang, Yuliang Liu
ICDAR 2024 Competition on Few-Shot and Many-Shot Layout Segmentation of Ancient Manuscripts (SAM)
Abstract
Layout analysis is a critical aspect of Document Image Analysis, particularly when it comes to ancient manuscripts. It serves as a foundational step in streamlining subsequent tasks such as optical character recognition and automated transcription. However, one key challenge in this context is the lack of available ground truth, which is extremely time-consuming to produce. Nevertheless, numerous approaches addressing this challenge heavily lean towards a fully supervised learning paradigm, which represents a rare scenario in a real-world setting. For this reason, with this competition, we propose the challenge of addressing this task with a few-shot learning approach, involving the use of only three images for training. The competition dataset, called U-DIADS-Bib, comprises four distinct ancient manuscripts, presenting heterogeneous layout structures, levels of degradation, and languages used. This diversity adds intrigue and complexity to the challenge. In addition, we also allowed participation in the competition with traditional many-shot learning approaches, for which the whole training set of U-DIADS-Bib was made available.
Silvia Zottin, Axel De Nardin, Gian Luca Foresti, Emanuela Colombi, Claudio Piciarelli
ICDAR 2024 Competition on Handwriting Recognition of Historical Ciphers
Abstract
Handwritten Text Recognition (HTR) in low-resource scenarios (i.e. when the amount of labeled data is scarce) is a challenging problem. This is particularly true for historical encrypted manuscripts, commonly known as ciphers, which contain secret messages and were typically used in military or diplomatic correspondence, records of secret societies, or private letters. To hide their contents, the sender and receiver created their own secret method of writing. The cipher alphabets often include digits, Latin or Greek letters, Zodiac and alchemical signs, combined with various diacritics, as well as invented ones. The first step in the decryption process is the transcription of these manuscripts, which is difficult due to the great variation in handwriting styles and cipher alphabets with a limited number of pages. Although different strategies can be considered to deal with the insufficient amount of training data (e.g., few-shot learning, self-supervised learning), the performance of available HTR models is not yet satisfactory. Thus, the proposed competition, which includes ciphers with a large number of symbol sets and scribes, aims to boost research in HTR in low-resource scenarios.
Alicia Fornés, Jialuo Chen, Pau Torras, Carles Badal, Beáta Megyesi, Michelle Waldispühl, Nils Kopal, George Lasry
ICDAR 2024 Competition on Handwritten Text Recognition in Brazilian Essays – BRESSAY
Abstract
This paper describes the “Handwritten Text Recognition in Brazilian Essays – BRESSAY” competition, held at the 18th International Conference on Document Analysis and Recognition (ICDAR 2024). The competition aimed to advance Handwritten Text Recognition (HTR) by addressing challenges specific to Brazilian Portuguese academic essays, such as diverse handwriting styles and document irregularities like smudges and erasures. Participants were encouraged to develop robust algorithms capable of accurately transcribing handwritten texts at line, paragraph, and page levels using the new BRESSAY dataset. The competition attracted 14 participants from different countries, with 4 research groups submitting a total of 11 proposals in the three challenges by the end of the competition. These proposals achieved impressive recognition rates and demonstrated advancements over traditional baseline models by using key strategies such as preprocessing techniques, synthetic data approaches, and advanced deep learning models. The evaluation metrics used were Character Error Rate (CER) and Word Error Rate (WER), with error rates as low as 2.88% CER and 9.39% WER for line-level recognition, 3.75% CER and 10.48% WER for paragraph-level recognition, and 3.77% CER and 10.08% WER for page-level recognition. The competition highlights the potential for continued improvements in HTR and underscores the BRESSAY dataset as a resource for future research. The dataset is available in the repository (https://github.com/arthurflor23/handwritten-text-recognition).
Arthur F. S. Neto, Byron L. D. Bezerra, Sávio S. Araújo, Wiliane M. A. S. Souza, Kléberson F. Alves, Macileide F. Oliveira, Samara V. S. Lins, Hugo J. F. Hazin, Pedro H. V. Rocha, Alejandro H. Toselli
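For reference, CER and WER are typically computed as the edit distance between prediction and reference divided by the reference length, in characters and words respectively; a minimal sketch follows (the competition's exact tokenisation and normalisation may differ).

```python
# Hedged sketch: standard Levenshtein-based CER and WER computation.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(reference: str, prediction: str) -> float:
    return edit_distance(list(reference), list(prediction)) / max(len(reference), 1)

def wer(reference: str, prediction: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / max(len(ref_words), 1)

print(cer("escola", "escoIa"))                       # 1 substitution over 6 characters
print(wer("a escola publica", "a escola particular"))  # 1 of 3 words differs
```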
ICDAR 2024 Competition on Historical Map Text Detection, Recognition, and Linking
Abstract
Text on digitized historical maps contains valuable information, e.g., providing georeferenced political and cultural context. The goal of the ICDAR 2024 MapText Competition is to benchmark methods that automatically extract textual content on historical maps (e.g., place names) and connect words to form location phrases. The competition features two primary tasks—text detection and end-to-end text recognition—each with a secondary task of linking words into phrase blocks. Submissions are evaluated on two data sets: 1) David Rumsey Historical Map Collection which contains 936 map images covering 80 regions and 183 distinct publication years (from 1623 to 2012); 2) French Land Registers (created during the 19th century) which contains 145 map images of 50 French cities and towns. The competition received 44 submissions among all tasks. This report presents the motivation for the competition, the tasks, the evaluation metrics, and the submission analysis.
Zekun Li, Yijun Lin, Yao-Yi Chiang, Jerod Weinman, Solenn Tual, Joseph Chazalon, Julien Perret, Bertrand Duménieu, Nathalie Abadie
ICDAR 2024 Competition on Multi Font Group Recognition and OCR
Abstract
This competition investigates the performance of several methods for two types of analyses of early modern prints: (1) optical character recognition, and (2) font recognition at the character level. We have created and published a novel dataset that contains the ground truth for both tasks. The dataset has been carefully curated and annotated by an expert with several years of expertise in transcribing early modern prints. Both tasks involved two distinct tracks, differing in ground truth management: one that only allows the participants to use the provided data for model training and a second that removes this restriction. Out of the five participating teams, four participated in the first track, and three in the second one. The best team reached a text Character Error Rate (CER) of 0.82% and a font CER of 2.96% for the first track. In the second track, these numbers could be slightly improved to 0.81% text CER and 2.78% font CER.
Janne van der Loop, Florian Kordon, Martin Mayr, Vincent Christlein, Fei Wu, Dalia Rodríguez-Salas, Nikolaus Weichselbaumer, Mathias Seuret
ICDAR 2024 Competition on Recognition of Chemical Structures
Abstract
The recognition of chemical molecular structures is crucial in fields such as education and biochemistry. Due to the significant challenges in data acquisition and annotation, current methods mainly focus on recognizing printed structures with clean backgrounds, with limited research devoted to handwritten chemical molecular structure recognition. This disparity arises because researchers can readily obtain extensive, clean-printed structure datasets through rendering tools, whereas authentic handwritten chemical structure data and corresponding benchmarks are scarce. To prompt research in this field, we have organized a new competition aimed at the recognition of genuine handwritten chemical structures. The competition ran from January 10 to April 25, 2024, receiving five valid submissions. In this report, we provide detailed information about the data, tasks, methods of the participating teams, and a summary and discussion of the submitted results.
Mingjun Chen, Hao Wu, Qikai Chang, Hanbo Cheng, Jiefeng Ma, Pengfei Hu, Zhenrong Zhang, Chenyu Liu, Changpeng Pi, Jinshui Hu, Baocai Yin, Bing Yin, Cong Liu, Jun Du
ICDAR 2024 Competition on Reading Documents Through Aria Glasses
Abstract
This paper presents the competition report on Reading Documents through Aria Glasses (ICDAR 2024 RDTAG) held at the 18th International Conference on Document Analysis and Recognition (ICDAR 2024). From a mixed reality perspective, understanding the text in the world is of paramount importance. However, all-day, always-on machine perception devices like Aria Glasses pose a primary challenge of lower image resolution due to their power and sensor constraints. Moreover, diverse everyday conditions, such as variations in lighting and reading position, further complicate reading tasks. To address this, we propose a new dataset and a challenge. Specifically, we propose three novel tasks: Isolated Word Recognition in Low Resolution (Task A), Prediction of Reading Order (Task B), and Page Level Recognition and Reading (Task C). We provide new training and test sets consisting of document images captured by Aria Glasses while reading diverse documents in English under various everyday scenarios. Our aim is to engage researchers with prior experience in English language OCR, and to establish benchmarks contributing to the academic literature in this field. A total of thirty-three different teams from around the world registered for this competition, and twelve teams submitted their results along with algorithm details. The winning team, SRCB, achieved a 97.23% Character Recognition Rate (CRR) and a 90.45% Word Recognition Rate (WRR) for Task A: Isolated Word Recognition in Low Resolution. Team Gang-of-N won Task B: Prediction of Reading Order with a BLEU score of 0.0939. Team SRCB also won Task C: Page Level Recognition and Reading with a 77.44% average Page Level Character Recognition Rate (PCRR) and a 50.55% average Page Level Word Recognition Rate (PWRR).
Soumya Shamarao Jahagirdar, Ajoy Mondal, Yuheng (Carl) Ren, Omkar M. Parkhi, C. V. Jawahar
ICDAR 2024 Competition on Recognition and VQA on Handwritten Documents
Abstract
This paper presents a competition report on the Recognition and Visual Question Answering on Handwritten Documents competition, towards a deeper understanding of handwritten multilingual documents (ICDAR 2024-HWD), held at the 18th International Conference on Document Analysis and Recognition (ICDAR 2024). Documents are in English or Indian languages. Earlier editions related to recognition of Indian handwriting were held in conjunction with ICFHR 2022 and ICDAR 2023. A related DocVQA task was held at DAS 2020. This edition proposes three main tasks: Isolated Word Recognition (Task A), Page Level Recognition and Reading (Task B), and Visual Question Answering on Handwritten Documents (Task C). While Task A was already part of our previous competitions, we bring in new data as part of this edition. Task B and Task C are novel additions for this year. By attracting researchers with experience in printed and handwritten documents, we aim to establish benchmarks that significantly contribute to the academic literature in this field. A total of thirty-two teams from around the world registered for this competition. Among them, only ten teams submitted their results along with algorithm details. The winning team, TSNUK, achieved an average 98.00% Character Recognition Rate (CRR) and 94.26% Word Recognition Rate (WRR) across four languages for Task A: Isolated Word Recognition. IndependentOCR excelled in Task B: Page Level Recognition and Reading, with a 76.32% average Page Level Character Recognition Rate (PCRR) and a 62.57% average Page Level Word Recognition Rate (PWRR). The team PA_VCG won Task C: Visual Question Answering on Handwritten Documents with a 0.643 ANLS score.
Ajoy Mondal, Vijay Mahadevan, R. Manmatha, C. V. Jawahar
Backmatter
Metadata
Title
Document Analysis and Recognition - ICDAR 2024
Editors
Elisa H. Barney Smith
Marcus Liwicki
Liangrui Peng
Copyright Year
2024
Electronic ISBN
978-3-031-70552-6
Print ISBN
978-3-031-70551-9
DOI
https://doi.org/10.1007/978-3-031-70552-6
