
2024 | Book

Document Analysis and Recognition - ICDAR 2024

18th International Conference, Athens, Greece, August 30–September 4, 2024, Proceedings, Part IV


About this book

This six-volume set LNCS 14804-14809 constitutes the proceedings of the 18th International Conference on Document Analysis and Recognition, ICDAR 2024, held in Athens, Greece, during August 30–September 4, 2024.
The 144 full papers presented in these proceedings were carefully selected from 263 submissions.
The papers reflect topics such as: document image processing; physical and logical layout analysis; text and symbol recognition; handwriting recognition; document analysis systems; document classification; indexing and retrieval of documents; document synthesis; extracting document semantics; NLP for document understanding; office automation; graphics recognition; human document interaction; document representation modeling and much more.

Table of Contents

Frontmatter

Layout Analysis and Document Classification

Frontmatter
CREPE: Coordinate-Aware End-to-End Document Parser
Abstract
In this study, we formulate an OCR-free sequence generation model for visual document understanding (VDU). Our model not only parses text from document images but also extracts the spatial coordinates of the text based on a multi-head architecture. Named the Coordinate-aware End-to-end Document Parser (CREPE), our method uniquely integrates these capabilities by introducing a special token for OCR text and token-triggered coordinate decoding. We also propose a weakly supervised framework for cost-efficient training, requiring only parsing annotations without high-cost coordinate annotations. Our experimental evaluations demonstrate CREPE’s state-of-the-art performance on document parsing tasks. Beyond that, CREPE’s adaptability is further highlighted by its successful use in other document understanding tasks such as layout analysis, document visual question answering, and so on. CREPE’s combined OCR and semantic parsing abilities not only mitigate the error propagation issues of existing OCR-dependent methods but also significantly enhance the functionality of sequence generation models, ushering in a new era for document understanding studies.
Yamato Okamoto, Youngmin Baek, Geewook Kim, Ryota Nakao, DongHyun Kim, Moon Bin Yim, Seunghyun Park, Bado Lee
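To make the token-triggered decoding idea concrete, here is a minimal PyTorch sketch (hypothetical layer names and token id, not the authors' implementation): a text head emits tokens as usual, and a parallel coordinate head is read out only at the steps where the special OCR token appears.

```python
import torch
import torch.nn as nn

OCR_TOKEN_ID = 5  # hypothetical id of the special OCR-text token

class TokenTriggeredDecoder(nn.Module):
    """Multi-head output sketch: a text head predicts tokens; whenever the
    special OCR token is emitted, a coordinate head is additionally read
    out on the same decoder hidden state."""

    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        self.text_head = nn.Linear(d_model, vocab_size)
        self.coord_head = nn.Linear(d_model, 4)  # (x1, y1, x2, y2)

    def forward(self, hidden):                      # hidden: (batch, seq, d_model)
        tokens = self.text_head(hidden).argmax(-1)  # greedy token choice
        coords = self.coord_head(hidden).sigmoid()  # normalized boxes
        trigger = tokens == OCR_TOKEN_ID            # coordinate decoding trigger
        return tokens, coords, trigger

tokens, coords, trigger = TokenTriggeredDecoder()(torch.randn(1, 12, 256))
print(coords[trigger])  # boxes only for the steps that emitted the OCR token
```

Since coordinates are read out only where the trigger fires, parsing-only annotations can leave the coordinate head unsupervised, which is in the spirit of the weakly supervised setup the abstract describes.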
A Hybrid Approach for Document Layout Analysis in Document Images
Abstract
Document layout analysis involves understanding the arrangement of elements within a document. This paper navigates the complexities of understanding various elements within document images, such as text, images, tables, and headings. The approach employs an advanced Transformer-based object detection network as an innovative graphical page object detector for identifying tables, figures, and displayed elements. We introduce a query encoding mechanism to provide high-quality object queries for contrastive learning, enhancing efficiency in the decoder phase. We also present a hybrid matching scheme that integrates the decoder’s original one-to-one matching strategy with the one-to-many matching strategy during the training phase. This approach aims to improve the model’s accuracy and versatility in detecting various graphical elements on a page. Our experiments on PubLayNet, DocLayNet, and PubTables benchmarks show that our approach outperforms current state-of-the-art methods. It achieves an average precision of 97.3% on PubLayNet, 81.6% on DocLayNet, and 98.6% on PubTables, demonstrating its superior performance in layout analysis. These advancements not only enhance the conversion of document images into editable and accessible formats but also streamline information retrieval and data extraction processes.
Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal
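The hybrid matching scheme can be pictured with a small sketch (a loose simplification, not the paper's exact formulation): the decoder's one-to-one Hungarian assignment is kept for the main branch, while an auxiliary one-to-many branch assigns the k cheapest queries to every ground-truth object during training.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hybrid_match(cost, k=3):
    """cost: (num_queries, num_targets) matching-cost matrix.
    Returns the one-to-one Hungarian assignment plus, for an auxiliary
    one-to-many branch, the k cheapest queries for every target."""
    q_idx, t_idx = linear_sum_assignment(cost)    # one-to-one matching
    one_to_many = {t: np.argsort(cost[:, t])[:k]  # k best queries per target
                   for t in range(cost.shape[1])}
    return list(zip(q_idx, t_idx)), one_to_many

cost = np.random.rand(10, 4)  # 10 object queries, 4 page objects
o2o, o2m = hybrid_match(cost)
print(o2o, o2m[0])
```

In hybrid-matching DETR variants, typically only the one-to-one branch is used at inference, so the extra one-to-many supervision costs nothing at test time.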
DLAFormer: An End-to-End Transformer For Document Layout Analysis
Abstract
Document layout analysis (DLA) is crucial for understanding the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. However, previous studies have typically used separate models to address individual sub-tasks within DLA, including table/figure detection, text region detection, logical role classification, and reading order prediction. In this work, we propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer, which integrates all these sub-tasks into a single model. To achieve this, we treat various DLA sub-tasks (such as text region detection, logical role classification, and reading order prediction) as relation prediction problems and consolidate these relation prediction labels into a unified label space, allowing a unified relation prediction module to handle multiple tasks concurrently. Additionally, we introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR. Moreover, we adopt a coarse-to-fine strategy to accurately identify graphical page objects. Experimental results demonstrate that our proposed DLAFormer outperforms previous approaches that employ multi-branch or multi-stage architectures for multiple tasks on two document layout analysis benchmarks, DocLayNet and Comp-HRDoc.
Jiawei Wang, Kai Hu, Qiang Huo
A Region-Based Approach for Layout Analysis of Music Score Images in Scarce Data Scenarios
Abstract
This work presents a novel region-based layout analysis (LA) method for Optical Music Recognition (OMR) systems, aimed at overcoming the data scarcity challenge. Contemporary OMR techniques, grounded in machine learning principles, have a critical requirement: a labeled dataset for training. This presents a practical challenge due to the extensive manual effort required, coupled with the fact that the availability of suitable data for creating training sets is not always guaranteed. Unlike other approaches, our method focuses on adapting the training and sample extraction processes within an existing neural network framework. Our approach incorporates a labeled data-driven oversampling technique, a masking layer to enable training with partial labeling, and an adaptive scaling process to improve results for varying score sizes. Through comprehensive experimentation, we established the minimal labeled data necessary for an effective model and demonstrated that our method could achieve a performance comparable with the state-of-the-art with just 8 to 32 labeled samples. The implications of our research extend beyond improving LA, providing a scalable and practical solution for digitizing and preserving musical documents.
Francisco J. Castellanos, Juan P. Martinez-Esteso, Alejandro Galán-Cuenca, Antonio Javier Gallego
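The masking idea behind training with partial labeling can be sketched in a few lines (illustrative, using the common PyTorch ignore-index convention; not the authors' code): pixels without annotation simply contribute nothing to the segmentation loss.

```python
import torch
import torch.nn.functional as F

UNLABELED = -1  # hypothetical marker for pixels without annotation

def partial_label_loss(logits, target):
    """Per-pixel cross-entropy for score-image layout segmentation where
    only part of the page is annotated: unlabeled pixels are masked out,
    so the network can be trained from partial ground truth."""
    return F.cross_entropy(logits, target, ignore_index=UNLABELED)

logits = torch.randn(2, 4, 64, 64)           # (batch, classes, H, W)
target = torch.randint(-1, 4, (2, 64, 64))   # -1 wherever no label exists
print(partial_label_loss(logits, target))
```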
Doc-DINO: A Transformer Model for Complex Logical Document Layout Analysis
Abstract
Document layout analysis is an indispensable part of document information processing. It can be applied to various tasks such as document retrieval, machine translation, document information retrieval, and structured data extraction from documents. However, most publicly available datasets in the field of layout analysis primarily consist of documents with a single layout type, are in the English language, and are limited to PDF documents. In this paper, we propose the Doc-DINO model for analyzing complex logical document layouts using a dataset that includes multiple formats, types, and a wider range of categories. First, aiming to learn more abstract, higher-level representations by fusing multi-scale features, we propose the Cross-Scale Convolution Fusion (CSCF) module as the neck of the model. Second, we present the Fully Convolutional Multi-Core Self-Attention (FCMS) Encoder, which includes convolutional attention and convolutional feedforward networks to better capture relationships between inputs and enhance the model’s expressive power. The model achieves a mean Average Precision (mAP) of 65.7 on the complex document layout analysis dataset M6Doc and 64.2 on SCUT-CAB, setting a new state-of-the-art performance for these datasets.
Qilin Deng, Mayire Ibrayim, Askar Hamdulla, Hailong Luo, Chunhu Zhang
Machine Unlearning for Document Classification
Abstract
Document understanding models have recently demonstrated remarkable performance by leveraging extensive collections of user documents. However, since documents often contain large amounts of personal data, their usage can pose a threat to user privacy and weaken the bonds of trust between humans and AI services. In response to these concerns, legislation advocating “the right to be forgotten” has recently been proposed, allowing users to request the removal of private information from computer systems and neural network models. A novel approach, known as machine unlearning, has emerged to make AI models forget about a particular class of data. In our research, we explore machine unlearning for document classification problems, representing, to the best of our knowledge, the first investigation into this area. Specifically, we consider a realistic scenario where a remote server houses a well-trained model and possesses only a small portion of training data. This setup is designed for efficient forgetting manipulation. This work represents a pioneering step towards the development of machine unlearning methods aimed at addressing privacy concerns in document analysis applications. Our code is publicly available at https://github.com/leitro/MachineUnlearning-DocClassification.
Lei Kang, Mohamed Ali Souibgui, Fei Yang, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas
DocXplain: A Novel Model-Agnostic Explainability Method for Document Image Classification
Abstract
Deep learning (DL) has revolutionized the field of document image analysis, showcasing superhuman performance across a diverse set of tasks. However, the inherent black-box nature of deep learning models still presents a significant challenge to their safe and robust deployment in industry. Regrettably, while a plethora of research has been dedicated in recent years to the development of DL-powered document analysis systems, research addressing their transparency aspects has been relatively scarce. In this paper, we aim to bridge this research gap by introducing DocXplain, a novel model-agnostic explainability method specifically designed for generating highly interpretable feature attribution maps for the task of document image classification. In particular, our approach involves independently segmenting the foreground and background features of the documents into different document elements and then ablating these elements to assign feature importance. We extensively evaluate our proposed approach in the context of document image classification, utilizing 4 different evaluation metrics, 2 widely recognized document benchmark datasets, and 10 state-of-the-art document image classification models. By conducting a thorough quantitative and qualitative analysis against 9 existing state-of-the-art attribution methods, we demonstrate the superiority of our approach in terms of both faithfulness and interpretability. To the best of the authors’ knowledge, this work presents the first model-agnostic attribution-based explainability method specifically tailored for document images. We anticipate that our work will significantly contribute to advancing research on transparency, fairness, and robustness of document image classification models.
Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
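The ablation scheme lends itself to a compact model-agnostic sketch (hypothetical interfaces; the actual method segments foreground and background elements more carefully than this): each document element is painted out in turn, and the drop in the classifier's score becomes that element's attribution.

```python
import numpy as np

def ablation_attribution(score_fn, image, segments, baseline=255):
    """Model-agnostic attribution sketch: ablate one document element at a
    time (e.g., paint it white) and record how much the class score drops.
    `segments` is a list of boolean masks, one per document element."""
    base = score_fn(image)
    attribution = np.zeros(image.shape[:2])
    for mask in segments:
        ablated = image.copy()
        ablated[mask] = baseline
        attribution[mask] = base - score_fn(ablated)  # big drop = important
    return attribution

# Toy demo with a stand-in classifier score.
score_fn = lambda img: 1.0 - img.mean() / 255.0
image = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
header = np.zeros((32, 32), dtype=bool)
header[:8] = True                      # pretend the top rows are a header
print(ablation_attribution(score_fn, image, [header]).max())
```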
CICA: Content-Injected Contrastive Alignment for Zero-Shot Document Image Classification
Abstract
Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced “ki·ka”), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel ‘content module’ designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP’s text and image features using a novel ‘coupled-contrastive’ loss. Our module improves CLIP’s ZSL top-1 accuracy by 6.7% and GZSL harmonic mean by 24% on the RVL-CDIP dataset. Our module is lightweight and adds only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification.
Sankalp Sinha, Muhammad Saif Ullah Khan, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal
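A rough stand-in for the coupled-contrastive alignment (a plain symmetric InfoNCE between two embedding sets; the paper's loss couples three feature streams) might look like this:

```python
import torch
import torch.nn.functional as F

def contrastive_align(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings, e.g.
    content-module features vs. CLIP text features for the same documents."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature      # pairwise cosine similarities
    labels = torch.arange(a.size(0))      # i-th row matches i-th column
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

content = torch.randn(8, 512)   # hypothetical content-module output
clip_txt = torch.randn(8, 512)  # CLIP text features for the same batch
print(contrastive_align(content, clip_txt))
```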
LAPDoc: Layout-Aware Prompting for Documents
Abstract
Recent advances in training large language models (LLMs) on massive amounts of solely textual data lead to strong generalization across many domains and tasks, including document-specific ones. On the other hand, there is a trend to train multi-modal transformer architectures tailored for document understanding that are designed specifically to fuse textual inputs with the corresponding document layout. This involves a separate fine-tuning step for which additional training data is required. At present, no document transformers with comparable generalization to LLMs are available. This raises the question of which type of model is preferable for document understanding tasks. In this paper we investigate the possibility of using purely text-based LLMs for document-specific tasks by means of layout enrichment. We explore drop-in modifications and rule-based methods to enrich purely textual LLM prompts with layout information. In our experiments we investigate the effects on the commercial ChatGPT model and the open-source LLM Solar. We demonstrate that with our approach both LLMs show improved performance on various standard document benchmarks. In addition, we study the impact of noisy OCR and layout errors, as well as the limitations of LLMs when it comes to utilizing document layout. Our results indicate that layout enrichment can improve the performance of purely text-based LLMs for document understanding by up to 15%, and by 6% on average, compared to just using plain document text. In conclusion, this approach should be considered when choosing between text-based LLMs and multi-modal document transformers.
Marcel Lamott, Yves-Noel Weweler, Adrian Ulges, Faisal Shafait, Dirk Krechel, Darko Obradovic
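One common rule-based way to verbalize layout for a text-only LLM, in the spirit of the methods studied here (this particular grid scheme is an illustration, not necessarily the paper's rule set), is to place OCR words on a coarse character grid by their page coordinates:

```python
def render_layout(words, page_w, page_h, cols=60, rows=30):
    """Place OCR words on a character grid so that a plain-text prompt
    preserves the rough 2-D arrangement of the page.
    `words` is a list of (text, x, y) tuples in page-pixel coordinates."""
    grid = [[" "] * cols for _ in range(rows)]
    for text, x, y in words:
        r = min(int(y / page_h * rows), rows - 1)
        c = min(int(x / page_w * cols), cols - 1)
        for i, ch in enumerate(text[: cols - c]):
            grid[r][c + i] = ch
    return "\n".join("".join(row).rstrip() for row in grid)

words = [("Energy", 50, 40), ("250 kJ", 400, 40),
         ("Fat", 50, 90), ("3 g", 400, 90)]
print(render_layout(words, page_w=500, page_h=200))
```

Values that belong together then line up visually in the prompt, which is exactly the layout signal a plain text dump destroys.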
A LayoutLMv3-Based Model for Enhanced Relation Extraction in Visually-Rich Documents
Abstract
Document Understanding is an evolving field in Natural Language Processing (NLP). In particular, visual and spatial features are essential in addition to the raw text itself, and hence several multimodal models have been developed in the field of Visual Document Understanding (VDU). However, while research is mainly focused on Key Information Extraction (KIE), Relation Extraction (RE) between identified entities remains under-studied. For instance, RE is crucial to regroup entities or obtain a comprehensive hierarchy of data in a document. In this paper, we present a model that, initialized from LayoutLMv3, can match or outperform the current state-of-the-art results in RE applied to Visually-Rich Documents (VRD) on the FUNSD and CORD datasets, without any specific pre-training and with fewer parameters. We also report an extensive ablation study performed on FUNSD, highlighting the great impact of certain features and modeling choices on performance.
Wiam Adnan, Joel Tang, Yassine Bel Khayat Zouggari, Seif Edinne Laatiri, Laurent Lam, Fabien Caspani
Are Layout Analysis and OCR Still Useful for Document Information Extraction Using Foundation Models?
Abstract
With the advent of end-to-end models and the remarkable performance of foundation models, the question arises regarding the relevance of preliminary steps, such as layout analysis and optical character recognition (OCR), for information extraction from document images. We attempt to provide some answers through experiments conducted on a new database of food labels. The goal is to extract nutritional values from cellphone pictures taken in grocery stores. We compare the results of OCR-free models that take the raw images as input (Donut and GPT-4-Vision) with two-stage systems that first perform OCR and then extract information using large language models (LLMs) from the recognized text (Mistral, GPT-3, and GPT-4). To assess the impact of layout analysis, we applied the same systems to three different views of the image: the original full image, a large manual crop containing the entire food label, and a small crop focusing on the relevant nutrition information. Comparative experiments are also conducted on the CORD database of receipts. Our results demonstrate that although OCR-free models achieve a remarkable performance, they still require some guidance regarding the layout, and two-stage systems achieve better results overall.
Anna Scius-Bertrand, Atefeh Fakhari, Lars Vögtlin, Daniel Ribeiro Cabral, Andreas Fischer

Machine Learning Methods

Frontmatter
DistilDoc: Knowledge Distillation for Visually-Rich Document Applications
Abstract
This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research depends on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology† for leaner, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness. († Code available at: https://github.com/Jordy-VL/DistilDoc_ICDAR24)
Jordy Van Landeghem, Subhajit Maity, Ayan Banerjee, Matthew Blaschko, Marie-Francine Moens, Josep Lladós, Sanket Biswas
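For reference, the "tuned vanilla KD" baseline named above is the standard Hinton-style response distillation; a minimal sketch (hyperparameters illustrative):

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: temperature-softened KL against the teacher's
    logits, mixed with the usual cross-entropy on the hard labels."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T  # T^2 keeps gradients scaled
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

print(vanilla_kd_loss(torch.randn(4, 16), torch.randn(4, 16),
                      torch.randint(0, 16, (4,))))
```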
Self-supervised Pre-training of Text Recognizers
Abstract
In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but annotating them is costly. Therefore, methods utilizing unlabeled data are being researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches – Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent a model collapse in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is among the first studies exploring self-supervised pre-training in document text recognition, and we believe that it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at https://github.com/DCGM/pero-pretraining.
Martin Kišš, Michal Hradiš
Deep Learning-Driven Innovative Model for Generating Functional Knowledge Units
Abstract
Design science research shows that existing knowledge is the basis for product design. The functional knowledge unit is the most basic unit for describing functional design knowledge. Nowadays, the acquisition of functional units is mainly manual, which is time-consuming and labor-intensive. Functional knowledge integration is an effective way to achieve innovative design, yet an insufficient supply of functional units cannot effectively support the integration. To address this issue, this paper proposes a named-entity recognition (NER) model called Boundary Perception NER (BP-NER). From product manuals, BP-NER can automatically extract the information necessary to describe functional units. The model leverages entity boundary information to predict entity classification labels and incorporates semantically rich character-level feature information. BP-NER also introduces the focal loss function to address label imbalance. Experiments on the functional unit dataset demonstrate the effectiveness of the proposed model. Compared with the baseline model BERT-BiLSTM-CRF, BP-NER increases the overall label prediction accuracy by 5.05%, and the average F1-score improvement is 32.8% for the entity types CIN, COT, DIN, DOT, and ENY.
Qiangang Pan, Hu Yahong, Xie Youbai, Meng Xianghui, Zhang Yilun
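The focal loss used by BP-NER to counter label imbalance has a standard form: cross-entropy down-weighted by (1 − p_t)^γ, so easy, frequent labels contribute little gradient. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t)**gamma, where p_t is
    the predicted probability of the true class."""
    ce = F.cross_entropy(logits, target, reduction="none")
    p_t = torch.exp(-ce)                 # probability of the true class
    return ((1 - p_t) ** gamma * ce).mean()

print(focal_loss(torch.randn(8, 5), torch.randint(0, 5, (8,))))
```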
Global-SEG: Text Semantic Segmentation Based on Global Semantic Pair Relations
Abstract
Text semantic segmentation is a crucial task in language understanding, as subsequent natural language processing tasks often require cohesive semantic blocks. This paper introduces a new perspective on this task by utilizing global semantic pair relations from both token- and sentence-level language models. This approach addresses the limitations of prior work, which concentrated solely on individual semantic units like sentences. Our model processes both local and global levels of sentence semantics via encoders and then combines the semantics obtained at each stage into a semantic embedding matrix. This matrix is then fed through a convolutional neural network, whose output is finally passed through another encoder. This process enables the identification of semantic segmentation boundaries by describing the relationships of global semantic pairs. Furthermore, we utilize semantic embeddings from large language models and consider the positional information of text within the document to assess their efficacy in augmenting semantics. We test our model on both contemporary and historical corpora, and the results demonstrate that our approach outperforms benchmarks on each dataset.
Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet
Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting
Abstract
This work addresses the need for a balanced approach between performance and efficiency in scalable production environments for visually-rich document understanding (VDU) tasks. Currently, there is a reliance on large document foundation models that offer advanced capabilities but come with a heavy computational burden. In this paper, we propose a multimodal early exit (EE) model design that incorporates various training strategies, exit layer types, and placements (code is available at https://github.com/Jordy-VL/multi-modal-early-exit). Our goal is to achieve a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification. Through a comprehensive set of experiments, we compare our approach with traditional exit policies and showcase an improved performance-efficiency trade-off. Our multimodal EE design preserves the model’s predictive capabilities, enhancing both speed and latency. This is achieved through a reduction of over 20% in latency, while fully retaining the baseline accuracy. This research represents the first exploration of multimodal EE design within the VDU community, highlighting as well the effectiveness of calibration in improving confidence scores for exiting at different layers. Overall, our findings contribute to practical VDU applications by enhancing both performance and efficiency.
Omar Hamed, Souhail Bakkali, Matthew Blaschko, Sien Moens, Jordy Van Landeghem
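The anytime early-exit mechanism can be pictured with a toy classifier (layer sizes, threshold, and unimodal input are illustrative; the paper's design is multimodal): an exit head after every block, and inference stops at the first head that is confident enough.

```python
import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    """Toy anytime early-exit classifier: one exit head per block;
    inference returns as soon as an exit's confidence clears a threshold."""

    def __init__(self, dim=128, classes=16, blocks=4, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(blocks)])
        self.exits = nn.ModuleList([nn.Linear(dim, classes) for _ in range(blocks)])
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):                       # x: (1, dim), batch of one
        for depth, (block, head) in enumerate(zip(self.blocks, self.exits)):
            x = torch.relu(block(x))
            conf, pred = head(x).softmax(-1).max(-1)
            if conf.item() >= self.threshold:   # confident enough: exit now
                break
        return pred, depth

pred, depth = EarlyExitStack()(torch.randn(1, 128))
print(f"exited after block {depth}")
```

Calibration matters here because the exit decision is driven entirely by the softmax confidence, which is the point the abstract makes about improving confidence scores at different layers.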
Integrating Dependency Type and Directionality into Adapted Graph Attention Networks to Enhance Relation Extraction
Abstract
Relation extraction is fundamental in knowledge graph construction. Recent studies have indicated the efficacy of Graph Convolutional Networks in enhancing performance in relation extraction tasks by leveraging dependency trees. However, noise in automatically generated dependency trees poses a challenge to using syntactic dependency information effectively. In this paper, we propose an Adaptive Graph Attention Network model based on Dependency Type and Direction information, which effectively reduces noise through direction-aware dependency relations, thereby enhancing extraction performance. Specifically, we propose an adaptive graph attention mechanism to construct direction-aware adjacency matrices for the precise aggregation of dependency pairs centred around entities as head words. This mechanism effectively filters out noise interference from entities acting as dependent words. Moreover, our model dynamically allocates weights to different dependency types based on their directions. This adaptive allocation enhances the learning capability of entity representations, optimizing the encoding and extraction of entity information. Experimental results on two English benchmark datasets demonstrate that introducing direction information significantly enhances the model’s performance. These findings validate the efficacy of incorporating directionality in encoding to reduce dependency noise and improve relation extraction task performance.
Yiran Zhao, Di Wu, Shuqi Dai, Tong Li
ViT-ED: Transformer Network for Image Similarity Measurement
Abstract
Measuring image similarity correctly and reliably is a critical requirement with profound implications across various applications, including puzzle reconstruction and historical document retrieval. In this work, we introduce a new deep neural network called ViT-ED, which stands for Vision Transformer with Encoder and Decoder network, to solve this task of image similarity estimation. By utilizing the attention and cross-attention mechanisms of the Transformer architecture, ViT-ED is capable of incorporating both global and local dependencies between patches from the two input images to produce better similarity measurements. Experimental results on benchmark datasets for the two related problems, puzzle reconstruction and image retrieval, show that our ViT-ED model significantly outperforms state-of-the-art approaches on these tasks; e.g., ViT-ED achieves a 19% improvement in mean Average Precision over the state-of-the-art approach on the HisFragIR20 benchmark dataset. These results suggest that ViT-ED is a strong candidate for solving image similarity related problems.
Manh Tu Vu, Marie Beurton-Aimar
Counting the Corner Cases: Revisiting Robust Reading Challenge Data Sets, Evaluation Protocols, and Metrics
Abstract
For two decades, robust reading challenges (RRCs) have driven and measured progress of text recognition systems in new and difficult domains. Such standardized benchmarks benefit the field by allowing participants and observers to systematically track steady performance improvements as interest in the problem continues to grow. To better understand their impacts and create opportunities for further improvements, this work empirically analyzes three important aspects of several challenges from the last decade: data sets, evaluation protocols, and competition metrics. First, we explore implications of certain annotation protocols. Second, we identify limitations in existing evaluation protocols that cause even the ground truth annotations to receive less than perfect scores. To remedy this, we propose evaluation protocol updates that boost both recall and precision. Accounting for these corner cases causes almost no changes to current rankings; however, such cases may become more prominent and important to consider as challenges focus on increasingly complex reading tasks. Finally, inspired by the recent HierText challenge’s use of Panoptic Quality (PQ), we explore the impact of this simple, parameter-free tightness-aware metric on six prior challenges, and we propose a new variant—Panoptic Character Quality (PCQ)—for simultaneously measuring character-level accuracy and word detection tightness. We find PQ-based metrics have a greater re-ranking impact on detection-only tasks, but predict end-to-end rankings slightly better than F-score. In sum, our empirical analysis and associated code should allow future challenge designers to make better-informed choices.
Jerod Weinman, Amelia Gómez Grabowska, Dimosthenis Karatzas
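For orientation, Panoptic Quality in its standard parameter-free form multiplies a recognition term by the mean tightness (IoU) of the matches; a minimal sketch (detection-style, ignoring the character-level extension of PCQ):

```python
def panoptic_quality(match_ious, num_pred, num_gt):
    """PQ = sum of matched IoUs / (TP + 0.5*FP + 0.5*FN).
    `match_ious` holds the IoU of every candidate prediction/ground-truth
    pair; pairs with IoU > 0.5 count as true-positive matches, which the
    0.5 threshold makes unique by construction."""
    matched = [iou for iou in match_ious if iou > 0.5]
    tp = len(matched)
    fp, fn = num_pred - tp, num_gt - tp
    return sum(matched) / (tp + 0.5 * fp + 0.5 * fn) if tp else 0.0

# 3 matches among 5 predictions and 4 ground-truth words:
print(panoptic_quality([0.9, 0.8, 0.55], num_pred=5, num_gt=4))
```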
Coarse-to-Fine Document Image Registration for Dewarping
Abstract
Document dewarping has made great progress in recent years; however, it usually requires huge numbers of document pairs with pixel-level annotation to learn a mapping function. Although photographed document images are easy to obtain, pixel-level annotation between warped and flat images is time-consuming and almost impossible for large-scale datasets. To overcome this issue, we propose to register photographed documents with their corresponding flat counterparts, obtaining automatic pixel-level mapping labels. Due to the severe deformation in real photographed documents, we introduce a coarse-to-fine registration pipeline to learn global-scale transformation and local detail alignment, respectively. In addition, the lack of registration labels motivates us to tailor a teacher-student dual branch under semi-supervised training, where the model is initialized on synthetic documents with labels. Furthermore, we contribute a large-scale dataset containing 12,500 triplets of synthetic-real-flat documents. Extensive experiments demonstrate the effectiveness of our proposed registration method. Specifically, trained on our registered pixel-level documents, the dewarping model obtains performance comparable to SOTAs trained on almost 100× the number of samples, showing the high quality of our registration results. Our dataset and code are available at https://github.com/hanquansanren/DIRD.
Weiguang Zhang, Qiufeng Wang, Kaizhu Huang, Xiaomeng Gu, Fengjun Guo
An Ultra-lightweight Approach for Machine Readable Zone Detection via Semantic Segmentation and Fast Hough Transform
Abstract
The Machine Readable Zone (MRZ) detection task is crucial for automated document processing systems, particularly in identity verification and authentication tasks. Due to the limited memory capacity of embedded devices, modern deep learning models for MRZ localization should not only show solid quality on challenging images but also have a small size. In this paper, we present HED-MRZ (Hough Encoder for Detection), an ultra-lightweight deep learning model for MRZ detection based on semantic segmentation with direct and transposed Fast Hough Transform (FHT) layers. The FHT layers provide a global receptive field in the first layers of the network, which helps to reduce not only the depth of the network but also the number of trainable parameters. Compared to the regression-based state-of-the-art YOLO-MRZ approach, HED-MRZ decreases the number of undetected MRZs on MIDV-LAIT by 50% and outperforms it on other challenging datasets such as MIDV-2019 and MIDV-2020. At the same time, it has an order of magnitude fewer trainable parameters and weighs only 120 KB, making it an ideal solution for use on embedded devices.
Daria Ershova, Alexander Gayer, Alexander Sheshkus, Vladimir V. Arlazarov
Synergistic Diverse Perspective for Topic Evolution Analysis on Weibo
Abstract
Nowadays, Weibo has become one of the most popular social media platforms for information dissemination, social interaction, and public opinion influence. Grasping the dynamic evolution of public opinion on Weibo is therefore an urgent research problem. However, most existing studies take little account of the evolution of local topics, merely discussing the overall distribution of topics along the timeline, and suffer from low-quality generated topics due to data sparsity. In this paper, we propose an innovative method to extract topic vectors from Weibo by fusing multiple semantic vector representations. The extracted vectors generate new topic representations from both global and local perspectives to enhance the semantic depth of the topic descriptions. Extensive experimental results demonstrate the effectiveness of our proposed method, which can comprehensively mine the potential evolutionary information on Weibo.
Tong Zhang, Jianing Zhang, Rong Yan
Cross-Domain Image Conversion by CycleDM
Abstract
The purpose of this paper is to enable the conversion between machine-printed character images (i.e., font images) and handwritten character images through machine learning. For this purpose, we propose a novel unpaired image-to-image domain conversion method, CycleDM, which incorporates the concept of CycleGAN into the diffusion model. Specifically, CycleDM has two internal conversion models that bridge the denoising processes of the two image domains. These conversion models are efficiently trained without explicit correspondence between the domains. By assigning machine-printed and handwritten character images to the two domains, CycleDM realizes the conversion between them. Quantitative and qualitative evaluations of the converted images show that our method performs better than other comparable approaches.
Sho Shimotsumagari, Shumpei Takezaki, Daichi Haraguchi, Seiichi Uchida
YOLO Assisted A* Algorithm for Robust Line Segmentation of Degraded Document Images
Abstract
Although OCR from images of good-quality documents can be considered a solved problem, the same is not true when document quality is degraded, for example due to old age. At the same time, OCR of old documents is highly important for the preservation of cultural heritage, indexing, retrieval, etc. The task of degraded document OCR is often difficult for a number of reasons, including the high resemblance between noisy background and faded foreground pixels, asymmetric skews of different lines, etc. The study presented in this article was conducted on a dataset of recently collected sample images of old, severely degraded document pages, in addition to a few others; the task is very difficult due to the high degradation level of the samples and the lack of training ground truths. Here, we propose a hybrid approach combining a learning-based method and a rule-based method for line segmentation of such degraded documents. The proposed method uses the well-known object detection system YOLO, trained on a publicly available dataset of handwritten samples, to predict the starting point (left extreme point) of each line divider; the remaining part of the segmenting line is obtained using a modified version of the A* path-finding graph traversal approach. Thus, a segmenting path suitably dividing two consecutive text lines, starting from the predicted left end point and terminating at the right end point, can be obtained. The proposed approach overcomes various existing challenges of line segmentation of old, degraded documents and improves results on several publicly available datasets. Performance comparisons with three existing strategies on five datasets of different languages and varying degradation levels, covering both printed and handwritten texts, are presented in this article.
Ahana Kundu, Ujjwal Bhattacharya
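The A* stage can be sketched as a search over pixel columns (simplified: a plain cost model with unit step cost plus an ink penalty, standing in for the paper's modified A*): starting from the YOLO-predicted left end point, the path moves one column right at a time, preferring rows that avoid foreground ink.

```python
import heapq
import numpy as np

def astar_line_path(ink, start_row, ink_penalty=10.0):
    """Find a left-to-right separator path that crosses as little ink as
    possible. `ink` is a binary image (1 = foreground); returns one row
    index per column, i.e. the segmenting line between two text lines."""
    h, w = ink.shape
    frontier = [(0.0, 0.0, start_row, [start_row])]  # (f, g, row, path)
    best = {}
    while frontier:
        _, g, r, path = heapq.heappop(frontier)
        col = len(path) - 1
        if col == w - 1:
            return path
        for nr in (r - 1, r, r + 1):                 # step one column right
            if 0 <= nr < h:
                ng = g + 1 + ink_penalty * ink[nr, col + 1]
                if ng < best.get((col + 1, nr), float("inf")):
                    best[(col + 1, nr)] = ng
                    f = ng + (w - col - 2)           # admissible heuristic
                    heapq.heappush(frontier, (f, ng, nr, path + [nr]))
    return None

ink = np.zeros((20, 30), dtype=np.uint8)
ink[9, 5:25] = 1                     # a faded stroke the path must dodge
print(astar_line_path(ink, start_row=9))
```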
Enhancing CRNN HTR Architectures with Transformer Blocks
Abstract
Handwritten Text Recognition (HTR) is a challenging problem that plays an essential role in digitizing and interpreting diverse handwritten documents. While traditional approaches primarily utilize CNN-RNN (CRNN) architectures, recent advancements based on Transformer architectures have demonstrated impressive results in HTR. However, these Transformer-based systems often involve high-parameter configurations and rely extensively on synthetic data. Moreover, they lack focus on efficiently integrating the ability of Transformer modules to grasp contextual relationships within the data. In this paper, we explore a lightweight integration of Transformer modules into existing CRNN frameworks to address the complexities of HTR, aiming to better capture the sequential context of the task. We present a hybrid CNN image encoder with intermediate MobileViT blocks that effectively combines the different components in a resource-efficient manner. Through extensive experiments and ablation studies, we refine the integration of these modules and demonstrate that our proposed model enhances HTR performance. Our results on the line-level IAM and RIMES datasets suggest that our proposed method achieves competitive performance with significantly fewer parameters than existing systems, and without relying on synthetic data.
George Retsinas, Konstantina Nikolaidou, Giorgos Sfikas
Dynamic Reasoning with Language Model and Knowledge Graph for Question Answering
Abstract
Question answering (QA) involves reasoning about the context and latent knowledge of complex textual descriptions. Current research focuses on how to effectively utilize knowledge graphs (KGs) to enhance language models (LMs) with external knowledge. In previous works, the interactions between the QA context and the KG were limited, and the KG input to the model contained noisy nodes, greatly restricting the model’s reasoning ability. We propose a dynamic reasoning model, DLM-KG, based on an LM and a KG. It resolves the above challenges through dynamic hierarchical interaction between the QA context and the KG, joint reasoning between the LM and the KG, and dynamic pruning of the KG. Specifically, DLM-KG extracts hierarchical features from KG representations and performs inter-layer and intra-layer interactions in each iteration. The features from these interactions enter a joint reasoning module, where each QA context feature and KG feature mutually attend to each other. The representations of the two modalities are fused and updated through multi-step interactions. Finally, using the information provided by the interaction layer, irrelevant nodes in the KG are removed. Experiments on the commonsense datasets CommonsenseQA and OpenbookQA and the medical question answering dataset MedQA-USMLE show that performance on MedQA-USMLE is superior to baseline models, while on the other datasets performance is close to the baselines, demonstrating competitiveness in terms of reasoning ability.
Yujie Lu, Dean Wu, Yuhong Zhang
Backmatter
Metadata
Title
Document Analysis and Recognition - ICDAR 2024
Editors
Elisa H. Barney Smith
Marcus Liwicki
Liangrui Peng
Copyright Year
2024
Electronic ISBN
978-3-031-70546-5
Print ISBN
978-3-031-70545-8
DOI
https://doi.org/10.1007/978-3-031-70546-5
