
2024 | Book

Document Analysis Systems

16th IAPR International Workshop, DAS 2024, Athens, Greece, August 30–31, 2024, Proceedings


About this book

This book constitutes the refereed proceedings of the 16th IAPR International Workshop on Document Analysis Systems, DAS 2024, held in Athens, Greece, during August 30-31, 2024.
The 27 full papers presented were carefully reviewed and selected from 43 submissions. They address topics such as document analysis and understanding; retrieval and VQA; layout analysis; document classification; OCR correction and NLP; recognition systems; and historical documents.

Table of Contents

Frontmatter

Document Analysis and Understanding

Frontmatter
Two Experiments for Automatic Scoring of Handwritten Descriptive Answers
Abstract
This paper presents our motivation, design, and two experiments for automatic scoring of handwritten descriptive answers. The first experiment is on scoring handwritten short descriptive answers in Japanese language exams. We used a deep neural network (DNN)-based handwriting recognizer and a transformer-based automatic scorer without correcting misrecognized characters or adding rubric annotations for scoring. We achieved acceptable agreement between the automatic scoring and the human scoring while using only 1.7% of the human-scored answers for training. The second experiment scores descriptive answers written on electronic paper for Japanese, English, and math drills. We used DNN-based online and offline handwriting recognizers for each subject and applied simple exact matching of the recognized candidates against the correct answers. The experiment shows that the false negative rate is reduced by combining the online and offline recognizers, and the false positive rate is reduced by rejecting low recognition scores. Even with the current system, human scorers only need to manually score less than 30% of the answers, with false positive (risky) scores of about 2% or less for the three subjects.
Masaki Nakagawa, Hung Tuan Nguyen, Nghia Thanh Truong, Nam Tuan Ly, Cuong Tuan Nguyen, Haruki Oka, Tsunenori Ishioka, Tomo Asakura, Hiroshi Miyazawa, Takahiro Yamamoto, Toshihiko Horie, Fumiko Yasuno
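The decision rule in the second experiment lends itself to a compact illustration. Below is a minimal, hypothetical rendering of that logic, assuming each recognizer returns a (text, confidence) pair; the function name, threshold, and API are illustrative, not from the paper.

```python
# Hypothetical sketch of the auto-marking rule described above: an answer is
# auto-marked correct only when a sufficiently confident recognizer candidate
# exactly matches one of the correct answers; everything else goes to a human.

def auto_mark(online_result, offline_result, correct_answers, min_score=0.9):
    """online_result / offline_result: (recognized_text, confidence) pairs."""
    for text, score in (online_result, offline_result):
        if score < min_score:
            continue            # rejecting low scores lowers false positives
        if text in correct_answers:
            return "correct"    # combining recognizers lowers false negatives
    return "needs_human_review"  # no confident exact match

print(auto_mark(("triangle", 0.95), ("triangel", 0.62), {"triangle"}))
```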
Transformer-Based Architecture for Judgment Prediction and Explanation in Legal Proceedings
Abstract
Advancements in language understanding have helped researchers develop a verdict prediction system that can assist a court judge in verdict formulation. This technological intervention can help streamline and standardize the decision-making process across all levels of courts. One key benefit of developing such a system is that junior judges can benefit from the collective knowledge stored in the knowledge base, improving their ability to make consistent and well-informed decisions. For any such system to be practically useful, predictions should be explainable too. This research proposes a hierarchical pipeline that leverages domain-specific variants of BERT to enhance the process of informed decision-making. The research is mainly divided into two modules: ‘Legal Judgment Prediction (LJP)’ and ‘Legal Judgment Explanation Extraction (LJEE)’. The LJP task pertains to predicting the outcome of legal decisions concerning the appellant. In contrast, LJEE refers to extracting the phrases/clauses that led to the final decision. To promote research in developing such a system for Pakistani legal documents, this paper also introduces the VerdictVaultPK dataset. The dataset comprises 11,943 rental-property case proceedings, each annotated with the court decision indicating whether the appeal was allowed or dismissed. This research highlights how the use of domain-specific transformer models enriches semantic embeddings, contributing to a substantial accuracy improvement of 3–4%.
Arooba Maqsood, Adnan Ul-Hasan, Faisal Shafait
Enhanced Bank Check Security: Introducing a Novel Dataset and Transformer-Based Approach for Detection and Verification
Abstract
Automated signature verification on bank checks is critical for fraud prevention and ensuring transaction authenticity. This task is challenging due to the coexistence of signatures with other textual and graphical elements on real-world documents. Verification systems must first detect the signature and then validate its authenticity, a dual challenge often overlooked by current datasets and methodologies focusing only on verification. To address this gap, we introduce a novel dataset specifically designed for signature verification on bank checks. This dataset includes a variety of signature styles embedded within typical check elements, providing a realistic testing ground for advanced detection methods. Moreover, we propose a novel approach for writer-independent signature verification using an object detection network. Our detection-based verification method treats genuine and forged signatures as distinct classes within an object detection framework, effectively handling both detection and verification. We employ a DINO-based network augmented with a dilation module to detect and verify signatures on check images simultaneously. Our approach achieves an AP of 99.2 for genuine and 99.4 for forged signatures, a significant improvement over the DINO baseline, which scored 93.1 and 89.3 for genuine and forged signatures, respectively. This improvement highlights our dilation module’s effectiveness in reducing both false positives and negatives. Our results demonstrate substantial advancements in detection-based signature verification technology, offering enhanced security and efficiency in financial document processing.
Muhammad Saif Ullah Khan, Tahira Shehzadi, Rabeya Noor, Didier Stricker, Muhammad Zeshan Afzal

Retrieval and VQA

Frontmatter
Multi-page Document VQA with Recurrent Memory Transformer
Abstract
Multi-page document Visual Question Answering (VQA) poses realistic challenges in the realm of document understanding due to its complexity and volume of information distributed across multiple pages. Current state-of-the-art methods often struggle to process lengthy documents, because they either exceed the model’s input token limits when treated as single-page document VQA problems, or compress pages into vectors that may omit crucial information. To our knowledge, our proposed method is the first to integrate recurrent memory mechanisms with the transformer architecture specialized for multi-page document VQA. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance while maintaining a manageable model size.
Qi Dong, Lei Kang, Dimosthenis Karatzas
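As a rough illustration of the recurrent-memory idea, the sketch below prepends a fixed budget of memory tokens to each page's token sequence and carries the updated memory across pages. It is a minimal reconstruction of the mechanism described above, not the authors' model, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class PageRecurrentEncoder(nn.Module):
    """Process a multi-page document page by page, carrying memory tokens."""

    def __init__(self, dim=256, n_memory=16, n_layers=2):
        super().__init__()
        self.memory0 = nn.Parameter(torch.zeros(1, n_memory, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.n_memory = n_memory

    def forward(self, pages):                 # pages: list of (1, seq, dim)
        memory = self.memory0
        for page_tokens in pages:             # recur over pages instead of
            x = torch.cat([memory, page_tokens], dim=1)  # one huge input
            x = self.encoder(x)
            memory = x[:, : self.n_memory]    # updated memory flows onward
        return memory                         # document-level summary tokens

doc = [torch.randn(1, 50, 256) for _ in range(3)]   # three dummy pages
print(PageRecurrentEncoder()(doc).shape)            # torch.Size([1, 16, 256])
```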
Instruction Makes a Difference
Abstract
We introduce the Instruction Document Visual Question Answering (iDocVQA) dataset and the Large Language Document (LLaDoc) model, for training Language-Vision (LV) models for document analysis and for making predictions on document images, respectively. Usually, deep neural networks for the DocVQA task are trained on datasets lacking instructions. We show that using instruction-following datasets improves performance. We compare performance across document-related datasets using the recent state-of-the-art (SotA) Large Language and Vision Assistant (LLaVA) 1.5 as the base model. We also evaluate the performance of the derived models for object hallucination using the Polling-based Object Probing Evaluation (POPE) dataset. The results show that instruction tuning yields 11x to 32x the zero-shot performance, and improvements of 0.1% to 4.2% over non-instruction (traditional task) finetuning. Despite the gains, these results still fall short of human performance (94.36%), implying there is much room for improvement.
Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney
Image-Text Matching for Large-Scale Book Collections
Abstract
We address the problem of detecting and mapping all books in a collection of images to entries in a given book catalogue. Instead of performing independent retrieval for each book detected, we treat the image-text mapping problem as a many-to-many matching process, looking for the best overall match between the two sets. We use a state-of-the-art segmentation method (SAM) to detect book spines and a commercial OCR to extract book information. We then propose a two-stage approach for text-image matching, where CLIP embeddings are used first for fast matching, followed by a second, slower stage that refines the matching, employing either the Hungarian Algorithm or a BERT-based model trained to cope with noisy OCR input and partial text matches. To evaluate our approach, we publish a new dataset of annotated bookshelf images that covers the whole book collection of a public library in Spain. In addition, we provide two target lists of book metadata: a closed set of 15k book titles that corresponds to the known library inventory, and an open set of 2.3M book titles to simulate an open-world scenario. We report results in two settings: a matching-only task, where the book segments and OCR output are given and the objective is to perform many-to-many matching against the target lists, and a combined detection and matching task, where books must first be detected and recognised before they are matched to the target list entries. We show that both the Hungarian Matching and the proposed BERT-based model outperform a fuzzy string matching baseline, and we highlight inherent limitations of the matching algorithms as the target list grows in size and when either of the two sets (detected books or target book list) is incomplete. The dataset and code are available at https://github.com/llabres/library-dataset.
Artemis Llabrés, Arka Ujjal Dey, Dimosthenis Karatzas, Ernest Valveny
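The Hungarian-Algorithm stage can be illustrated directly: given embeddings for detected spines and catalogue titles, solve a global one-to-one assignment that maximizes total similarity. The sketch below uses random vectors as stand-ins for the CLIP features.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
spine_emb = rng.normal(size=(5, 512))     # 5 detected book spines
title_emb = rng.normal(size=(8, 512))     # 8 catalogue titles

# cosine similarity between every spine and every title
spine_emb /= np.linalg.norm(spine_emb, axis=1, keepdims=True)
title_emb /= np.linalg.norm(title_emb, axis=1, keepdims=True)
similarity = spine_emb @ title_emb.T

# the Hungarian algorithm minimizes cost, so negate the similarities
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(f"spine {r} -> title {c} (similarity {similarity[r, c]:+.3f})")
```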

Layout Analysis

Frontmatter
RCAM-Transformer: A Novel Approach to Table Reconstruction Using Row-Column Attention Mechanism
Abstract
Table reconstruction, a critical task in the field of table structure recognition (TSR), plays a vital role in various domains, such as data mining, machine learning, and information retrieval. While many existing TSR methods employ transformer-based models with generally impressive performance, a gap remains in transformer models specifically designed to handle the distinct attributes of table rows and columns. Moreover, there is a lack of robust table reconstruction strategies based on object detection models. To address these issues, we introduce the Row-Column Attention Mechanism (RCAM). When combined with a transformer model and integrated with partial global attention, it forms the RCAM-Transformer. This model is tailored to effectively process the unique properties of tabular data. In addition, we have developed a novel table reconstruction strategy that leverages object detection models, which improves the recognition and treatment of tabular data. Our experiments, conducted using the PubTables-1M and FinTabNet datasets, along with our self-constructed Annual Report TableSet, not only validated the effectiveness of the RCAM but also demonstrated the improved accuracy of table reconstruction with the use of our RCAM-Transformer. Such outcomes highlight the potential of the RCAM-Transformer to advance table extraction in various fields.
Zezhong Guo, Yongjian Zhang, Shibo Chen, Chiching Wei
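Although the paper's RCAM implementation is its own, the core constraint, letting each cell token attend only within its row and its column, can be sketched as an attention mask over grid positions. The example below is our illustrative reconstruction, not the authors' code.

```python
import torch

cells = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]   # (row, col) of each token
n = len(cells)
allowed = torch.zeros(n, n, dtype=torch.bool)
for i, (ri, ci) in enumerate(cells):
    for j, (rj, cj) in enumerate(cells):
        allowed[i, j] = (ri == rj) or (ci == cj)   # same row or same column

# PyTorch attention masks mark *blocked* positions with True, so invert:
attn_mask = ~allowed     # usable as attn_mask in nn.MultiheadAttention
print(attn_mask.int())
```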
LD-DOC: Light-Weight Domain-Adaptive Document Layout Analysis
Abstract
We propose LD-DOC, a lightweight Document Layout Analysis (DLA) model specifically designed to address the challenge of accurately partitioning document regions under limited data conditions. The LD-DOC model effectively utilizes information from visual features at various scales, enhancing its adaptability to feature distributions in scenarios with limited data and thereby improving the accuracy of document region partitioning. Specifically, our model incorporates a feature fusion module comprising a Shallow Feature Enhancement Path (SFEP) and a Cross-Fusion Path (CFP). The SFEP employs a 2D Discrete Wavelet Transform (2D-DWT) to capture edge features at different scales, which enhances the model’s ability to perceive subtle variations and structural information in visual features. This enhancement is crucial for adapting to the nuanced requirements of limited data environments. The CFP, in turn, uses a Local-Fusion Attention (LFA) mechanism to adaptively capture discrepancy information among different scales. This approach reduces the model’s sensitivity to scale variations and significantly improves its generalization capabilities across diverse document layouts. Furthermore, we introduce ISCAS-CLAD, a specialized small-scale Chinese Document Layout Analysis Dataset, to demonstrate the effectiveness of our model. Through rigorous testing on the ISCAS-CLAD and PubLayNet datasets, LD-DOC has shown a notable improvement in mean Average Precision (mAP), outperforming baseline models by 2.2% and 1.5%, respectively. These results highlight LD-DOC’s state-of-the-art performance, particularly in challenging data-limited environments, and underscore its potential for practical applications in DLA.
Zhangchi Gao, Shoubin Li, Yangyang Liu, Mingyang Li, Kai Huang, Yi Ren
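The 2D-DWT step in the SFEP can be reproduced with standard tooling: one Haar decomposition yields horizontal, vertical, and diagonal detail bands that respond strongly to edges at that scale. A minimal sketch using PyWavelets, with a random stand-in image:

```python
import numpy as np
import pywt

image = np.random.rand(64, 64).astype(np.float32)   # stand-in document crop
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")         # one decomposition level

edge_features = np.stack([cH, cV, cD])   # detail bands carry edge structure
print(edge_features.shape)               # (3, 32, 32)
```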
UnSupDLA: Towards Unsupervised Document Layout Analysis
Abstract
Document layout analysis is a key area in document research, involving techniques like text mining and visual analysis. Despite the various methods developed for layout analysis, a critical but frequently overlooked problem is the scarcity of labeled data needed for such analyses. With the rise of internet use, an overwhelming number of documents are now available online, making the process of accurately labeling them for research purposes increasingly challenging and labor-intensive. Moreover, the diversity of documents online presents a unique set of challenges in maintaining the quality and consistency of these labels, further complicating document layout analysis in the digital era. To address this, we employ a vision-based approach for analyzing document layouts that trains a network without labels. We focus on pre-training: we initially generate simple object masks from the unlabeled document images, and these masks are then used to train a detector, enhancing object detection and segmentation performance. The model’s effectiveness is further amplified through several unsupervised training iterations that continuously refine its performance. This approach significantly advances document layout analysis, particularly in precision and efficiency, without requiring labels.
Talha Uddin Sheikh, Tahira Shehzadi, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal

Document Classification

Frontmatter
What Text Design Characterizes Book Genres?
Abstract
This study analyzes the relationship between non-verbal information (e.g., genres) and text design (e.g., font style, character color, etc.) through the classification of book genres using the text design on book covers. Text images carry both semantic information about the word itself and non-semantic information, or visual design, such as font style and character color. When we read a printed word, we receive impressions and other information from both the word itself and its visual design. In other words, we understand verbal information from the semantic information, i.e., the words themselves, while text design helps convey additional non-verbal information, such as impressions and genre. To investigate the effect of text design, we analyze the text design of words printed on book covers and their genres in two scenarios. First, we attempt to understand the importance of visual design for determining the genre (i.e., non-verbal information) of books by analyzing the differences in the relationship between semantic information/visual design and genres. In this experiment, we found that semantic information is sufficient to determine the genre; however, text design adds more discriminative features for book genres. Second, we investigated the effect of each text design element on book genres. As a result, we found that each text design element characterizes certain book genres. For example, font style adds discriminative features for the genres “Mystery, Thriller & Suspense” and “Christian Books & Bibles”.
Daichi Haraguchi, Brian Kenji Iwana, Seiichi Uchida
Leveraging Semantic Segmentation Masks with Embeddings for Fine-Grained Form Classification
Abstract
Efficient categorization of historical documents is crucial for fields such as genealogy, legal research, and historical scholarship, where manual classification is impractical for large collections due to its labor-intensive and error-prone nature. To address this, we propose a representational learning strategy that integrates semantic segmentation and deep learning models such as ResNet, CLIP, Document Image Transformer (DiT), and masked auto-encoders (MAE), to generate embeddings that capture document features without predefined labels. To the best of our knowledge, we are the first to evaluate embeddings on fine-grained, unsupervised form classification. To improve these embeddings, we propose to first employ semantic segmentation as a preprocessing step. We contribute two novel datasets—the French 19th-century and U.S. 1950 Census records—to demonstrate our approach. Our results show the effectiveness of these various embedding techniques in distinguishing similar document types and indicate that applying semantic segmentation can greatly improve clustering and classification results. The census datasets are available at https://github.com/tahlor/census_forms.
Taylor Archibald, Tony Martinez
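As an illustration of the embedding-then-cluster strategy, the sketch below uses an ImageNet ResNet-50 as one possible backbone (the paper also evaluates CLIP, DiT, and MAE embeddings) and clusters the resulting vectors without labels. The images here are random placeholders.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.cluster import KMeans

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier -> 2048-d embeddings
backbone.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

@torch.no_grad()
def embed(images):                  # images: list of PIL images
    batch = torch.stack([preprocess(im) for im in images])
    return backbone(batch).numpy()

forms = [Image.fromarray((np.random.rand(300, 200, 3) * 255).astype("uint8"))
         for _ in range(4)]         # placeholder form scans
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embed(forms))
print(labels)
```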
DocLightDetect: A New Algorithm for Occlusion Classification in Identification Documents
Abstract
In the current digital era, organizations primarily interact with their clients and users online. However, accurately identifying these digital users in the physical realm raises significant challenges. Several entities, including financial institutions, insurance companies, and government services, require photos of documents sent through mobile applications to associate the physical and digital personas. This procedure entails significant computational challenges, mainly due to the lack of adequate user guidance when capturing images and the variability of devices. User dependence often results in occlusions in images caused by various factors such as human fingers, shadows, and the spotlight effect. The latter is particularly common and complex due to the use of the device’s flash. While previous research has focused on automatically identifying occlusions caused by human fingers, the present work focuses on occlusions caused by the spotlight effect. We propose a new algorithm, DocLightDetect, which uses image segmentation as a preprocessing step to improve the accuracy of classifying occlusions caused by the spotlight effect in identification documents. The effectiveness of DocLightDetect is demonstrated on the new SpotBID Set dataset. The proposed algorithm improves performance compared to state-of-the-art document occlusion classification techniques. It is also optimized for low computational cost, making it suitable for applications in mobile devices, robotics, and the Internet of Things (IoT).
Ricardo Batista das Neves Junior, Byron Leite Dantas Bezerra, Cleber Zanchettin
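While DocLightDetect itself uses learned segmentation, the character of the spotlight effect, near-saturated and washed-out pixels, can be illustrated with a simple hand-crafted segmentation. The thresholds below are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

def spotlight_ratio(bgr_image, v_thresh=240, s_thresh=40):
    """Fraction of pixels that look like flash glare: very bright, low saturation."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    _, s, v = cv2.split(hsv)
    glare = (v >= v_thresh) & (s <= s_thresh)
    return glare.mean()

img = np.full((200, 300, 3), 180, dtype=np.uint8)   # synthetic document
img[50:120, 100:200] = 255                          # synthetic glare patch
print(f"{spotlight_ratio(img):.2%} of the image looks like glare")
```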

OCR Correction and NLP

Frontmatter
Confidence-Aware Document OCR Error Detection
Abstract
Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.
Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier
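One plausible way to realize the confidence-aware embedding described above is to project each token's OCR confidence into the hidden space and add it to the word embedding before encoding. The sketch below is our reconstruction of that idea, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ConfidenceAwareBert(nn.Module):
    """BERT token classifier whose inputs carry per-token OCR confidences."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        self.conf_proj = nn.Linear(1, hidden)    # confidence -> hidden space
        self.classifier = nn.Linear(hidden, 2)   # token erroneous / correct

    def forward(self, input_ids, attention_mask, confidences):
        emb = self.bert.embeddings.word_embeddings(input_ids)
        emb = emb + self.conf_proj(confidences.unsqueeze(-1))
        out = self.bert(inputs_embeds=emb, attention_mask=attention_mask)
        return self.classifier(out.last_hidden_state)   # per-token logits
```

Passing `inputs_embeds` lets the BERT embedding layer still add its positional and segment embeddings on top of the confidence-augmented word embeddings.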
Error Correction of Japanese Character-Recognition in Answers to Writing-Type Questions Using T5
Abstract
This paper proposes a method for correcting character-recognition errors in Japanese handwritten answers to writing-type questions from exercise books. We created a model to correct character-recognition errors by fine-tuning the text-to-text transfer transformer (T5) using pairs of automatically recognized data from handwritten answers and their manual corrections. The data comprised handwritten Japanese answers from 185 junior high school students to writing-type questions in a Japanese language task. In addition, we augmented the training data using the five best results of the character-recognition model, with confidence scores, to learn additional patterns of recognition errors. The experimental results revealed that the answers corrected by the proposed method were closer to the actual answers than those before correction, and that data augmentation was effective for the correction model.
Rina Suzuki, Hisao Usui, Hiroaki Ozaki, Hung Tuan Nguyen, Kanako Komiya, Tsunenori Ishioka, Masaki Nakagawa
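The correction model is a standard sequence-to-sequence fine-tune: pairs of (recognized text, manual correction) train T5 to rewrite noisy recognition output. A minimal sketch with Hugging Face Transformers, using a multilingual checkpoint as a stand-in for the Japanese T5 in the paper and an English toy pair for readability:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

noisy = "I like t0 read b00ks"                 # stand-in recognizer output
inputs = tokenizer(noisy, return_tensors="pt")
labels = tokenizer("I like to read books", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss     # fine-tuning objective
corrected = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(corrected[0], skip_special_tokens=True))
```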
How Does Changing the Optical Character Recognition System Impact the Layout-Aware Named Entity Recognition Models?
Abstract
Merging information from physical and digital documents is essential in an era when information is becoming ever more relevant. Different strategies have been used to combine knowledge from these two data sources. One state-of-the-art data extraction approach for this problem is the Named Entity Recognition (NER) strategy. However, even for such advanced models, performance is still highly dependent on the Optical Character Recognition (OCR) system used to read the text from the physical documents. This paper investigates this dependence and how altering OCR systems between the training and inference phases influences NER performance. We verified that changing the OCR system negatively impacts the performance of data extraction models. Furthermore, we show that models trained on less accurate OCR output are more robust to OCR changes in the inference phase, whereas in scenarios where the same OCR system is used in both training and inference, the most accurate OCR should be preferred. We also propose a solution to mitigate this problem by mixing OCRs during the training phase. This approach enhances the model’s robustness while simultaneously preserving a high F1-score.
João Macedo, Byron Bezerra, Cleber Zanchettin
RUATS: Abstractive Text Summarization for Roman Urdu
Abstract
Recent advances in text summarization primarily target high-resource languages. However, their performance on low-resource and unstructured languages like Roman Urdu (RU) has not yet been evaluated. This research evaluates abstractive summarization of Roman Urdu text of the kind commonly used when communicating via social media in Urdu-speaking communities. Due to the scarcity of relevant datasets, a corpus of Roman Urdu text is generated by transliterating samples collected from two benchmark Urdu abstractive text summarization datasets. Baseline summaries are then generated using two state-of-the-art (SOTA) transformer-based models, Bidirectional Encoder Representations from Transformers (BERT) and the Text-To-Text Transfer Transformer (T5). The summaries generated by both models are evaluated using different intrinsic and extrinsic methods. Results of the experiments show that T5 outperforms BERT in generating abstractive summaries of Roman Urdu text. Nonetheless, more research is required in this direction.
Laraib Kaleem, Arif Ur Rahman, Momina Moetesum

Recognition Systems

Frontmatter
Speed-Up Pre-trained Vision Encoder–Decoder Transformers by Leveraging Lightweight Mixer Layers for Text Recognition
Abstract
Text Recognition (TR) technology leverages a range of deep learning techniques to analyze and identify characters and words embedded in images. Its scope encompasses handwritten, printed, and scene text recognition. In this paper, we take a holistic approach, treating these categories as a unified challenge in order to delve comprehensively into the complexities associated with TR. The state-of-the-art models predominantly rely on vision encoder–decoder (VED) transformer architectures. However, these models tend to be bulky, housing a multitude of parameters, which not only engenders significant memory consumption but also leads to sluggish inference times due to their autoregressive nature. It is essential to note that these issues primarily stem from the decoder component. Consequently, our study introduces an efficient workflow that substitutes the language modeling capabilities of the decoder with lightweight Mixer layers trained using Connectionist Temporal Classification. By following this approach, we unveil three decoder-free architectures that reduce the number of parameters by 74.3%, trim down the necessary training memory by 53.8%, and enhance inference times with an average speedup factor of 20 when compared to their VED counterparts. In terms of results, our workflow yields models that are on par with or better than the state of the art across six databases encompassing historical and modern handwritten, printed, and scene text recognition. Source code is publicly available at https://github.com/dparres/Mixer-ViT-Text-Recognition.
Daniel Parres, Dan Anitei, Roberto Paredes, Joan Andreu Sánchez, José Miguel Benedí
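The decoder-free recipe can be sketched compactly: keep the vision encoder, replace the autoregressive decoder with a Mixer-style block, and train with CTC. The block below is a minimal single-layer illustration with placeholder dimensions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TokenMix(nn.Module):
    """Linear mixing across sequence positions (the 'token-mixing' MLP)."""
    def __init__(self, seq_len):
        super().__init__()
        self.fc = nn.Linear(seq_len, seq_len)
    def forward(self, x):                      # x: (batch, seq, dim)
        return self.fc(x.transpose(1, 2)).transpose(1, 2)

class MixerCTCHead(nn.Module):
    def __init__(self, seq_len=64, dim=256, n_classes=100):
        super().__init__()
        self.token_mix = nn.Sequential(nn.LayerNorm(dim), TokenMix(seq_len))
        self.channel_mix = nn.Sequential(nn.LayerNorm(dim),
                                         nn.Linear(dim, dim), nn.GELU())
        self.out = nn.Linear(dim, n_classes + 1)   # +1 for the CTC blank

    def forward(self, x):                      # x: encoder features
        x = x + self.token_mix(x)              # residual Mixer sublayers
        x = x + self.channel_mix(x)
        return self.out(x).log_softmax(-1)

head = MixerCTCHead()
feats = torch.randn(2, 64, 256)                # from the vision encoder
log_probs = head(feats).transpose(0, 1)        # CTC wants (T, batch, C)
targets = torch.randint(1, 100, (2, 10))
loss = nn.CTCLoss(blank=100)(log_probs, targets,
                             torch.full((2,), 64, dtype=torch.long),
                             torch.full((2,), 10, dtype=torch.long))
print(loss.item())
```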
Maximizing Data Efficiency of HTR Models by Synthetic Text
Abstract
The usability of synthetic handwritten text for improving machine learning models is assessed for the domain of HTR. Synthetic handwritten text is generated using an existing GAN-based model, and its output is then used to train a state-of-the-art HTR model, which is then applied to recognize real datasets. While training on synthetic data alone results in a CER of 28.3% and a WER of 65.5% for line images of the IAM dataset (more than three times higher than the state-of-the-art result), our experiments show that the amount of real data in a mixed training set can be reduced significantly (by 70–80%) while achieving CER and WER rates comparable to those obtained with real data alone. Using only 10% of the training data (113 images) from the CVL dataset results in a CER of 54.5% and a WER of 88.8%, whereas pre-training the model with synthetic data reduces these to a CER of 14.6% and a WER of 43.4%.
Markus Muth, Marco Peer, Florian Kleber, Robert Sablatnig
Contrastive Self-Supervised Learning for Optical Music Recognition
Abstract
Optical Music Recognition (OMR) is the research area focused on transcribing images of musical scores. In recent years, this field has seen great development thanks to the emergence of Deep Learning. However, these types of solutions require large volumes of labeled data. To alleviate this problem, Contrastive Self-Supervised Learning (SSL) has emerged as a paradigm that leverages large amounts of unlabeled data to train neural networks, yielding meaningful and robust representations. In this work, we explore its first application to the field of OMR. By utilizing three datasets that represent the heterogeneity of musical scores in notations and graphic styles, and through multiple evaluation protocols, we demonstrate that contrastive SSL delivers promising results, significantly reducing data scarcity challenges in OMR. To the best of our knowledge, this is the first study that integrates these two fields. We hope this research serves as a baseline and stimulates further exploration.
Carlos Penarrubia, Jose J. Valero-Mas, Jorge Calvo-Zaragoza
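For concreteness, the kind of contrastive objective used in such studies (a SimCLR-style NT-Xent loss) fits in a few lines: two augmented views of the same score crop should embed close together while all other pairs are pushed apart. This generic sketch reflects the paradigm, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss over two batches of embeddings of paired views."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2N, D)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                  # no self-similarity
    n = z1.size(0)                                     # positives sit N apart
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)      # embeddings of 2 views
print(nt_xent(z1, z2).item())
```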
Full-Page Music Symbols Recognition: State-of-the-Art Deep Model Comparison for Handwritten and Printed Music Scores
Abstract
The localization and classification of musical symbols in scanned or digital music scores pose significant challenges for Optical Music Recognition: scores contain visually similar symbol classes and large numbers of overlapping, tiny symbols within high-resolution pages. Recently, deep learning-based techniques have shown promising results in addressing these challenges by leveraging object detection models. However, unclear training and evaluation practices, such as inconsistent use of full-page versus cropped images, differing ways of handling full-page scores at high resolution, results reported on only specific object classes, and the absence of comprehensive comparisons with recent state-of-the-art object detection methods, have led to a lack of benchmarking and of analysis of the impact of proposed methods in music object recognition. To address these issues, we perform an intensive analysis with recent object detection models, exploring effective ways of handling high-resolution images on existing benchmarks. Our goal is to bridge the gap between object detection models, designed for common objects and for images relatively small compared to music scores, and the unique challenges of music score recognition in terms of object size and resolution. We achieve state-of-the-art results in mAP and weighted mAP on two challenging datasets, DeepScoresV2 and MUSCIMA++, demonstrating the effectiveness of this approach on both printed and handwritten music scores.
Ali Yesilkanat, Yann Soullard, Bertrand Coüasnon, Nathalie Girard
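A standard way to reconcile detectors built for small images with high-resolution score pages, and one of the handling strategies a study like this would compare, is overlapping tiling. The helper below is an illustrative assumption, not the paper's exact procedure.

```python
def make_tiles(width, height, tile=1024, overlap=256):
    """Yield (x0, y0, x1, y1) windows covering the page with overlap, so
    symbols cut by one tile boundary appear whole in a neighbouring tile."""
    step = tile - overlap
    for y0 in range(0, max(height - overlap, 1), step):
        for x0 in range(0, max(width - overlap, 1), step):
            yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))

tiles = list(make_tiles(3500, 2400))
print(len(tiles), tiles[0], tiles[-1])
# Per-tile detections are shifted by (x0, y0) and merged with NMS afterwards.
```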

Historical Documents

Frontmatter
Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval
Abstract
This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the 17th century, serving both as a training resource and as an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. The benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by a wide historical spectrum.
Adrià Molina, Oriol Ramos Terrades, Josep Lladós
From Detection to Modelling: An End-to-End Paleographic System for Analysing Historical Handwriting Styles
Abstract
Handwriting analysis in historical documents, crucial for paleography, includes both macroscopic and microscopic examinations. It ranges from assessing the general visual patterns of pages to detailed studies of individual letters. Such analysis not only aids in dating documents and identifying scribes but also provides insights into the evolution of handwriting styles and offers a comprehensive view of historical writing practices. This work proposes a novel end-to-end paleographic system designed for automated handwriting analysis, consisting of three main components: automatic detection and recognition of letters, contrastive hierarchical clustering of the detected letters, and the creation of representative models for the samples in each cluster. These models provide a visual representation of the style variations for each letter. This system processes full manuscript page images to model the diverse shapes of each letter, capturing the inherent variability in historical handwriting styles. The performance of each of the three main system components has been evaluated on a dataset of Greek letters on papyri, achieving results comparable to the state of the art across all three components.
Hussein Mohammed, Mahdi Jampour
fang: Fast Annotation of Glyphs in Historical Printed Documents
Abstract
The extraction and analysis of large numbers of glyphs, and the associated opportunity to construct a corpus of glyphs from fifteenth-century types, offer significant research potential for scholars in book science. Such a corpus could be used in many ways, not least in assisting in the identification of fragments, charting the movements of type, and examining the impact of wear on type. Recognising this potential, we have developed fang (code available at https://github.com/Werck-der-buecher/FAnG), a software tool that efficiently extracts and categorises glyphs from historical printed documents. Our approach involves several stages: (1) using Optical Character Recognition to extract glyphs in bulk, (2) employing a joint energy-based model for character classification and out-of-distribution pruning, and (3) providing a comprehensive toolset for manual review and editing, including deletions/reassignments and sorting by similarity. A significant strength of this design is the utilisation of existing text transcriptions and the context-awareness of trained language models, eliminating the need for explicit glyph location ground truth or glyph templates. By parallelising the extraction, we can quickly process entire digitised books with hundreds of pages, setting our system apart from existing glyph annotation tools. In experiments on digital reproductions of the Catholicon and the 36-line Bible, the method demonstrates good spatial coverage of the detected glyphs and high character classification accuracy, and yields a low number of outliers. Our system represents a significant advancement in historical document analysis, providing researchers with an efficient tool for glyph extraction and categorisation.
Florian Kordon, Nikolaus Weichselbaumer, Randall Herz, Janne van der Loop, Stephen Mossman, Edward Potten, Mathias Seuret, Martin Mayr, Fei Wu, Vincent Christlein
Bessarion: Medieval Greek Inscriptions on a Challenging Dataset for Vision and NLP Tasks
Abstract
We present a text and imaging dataset of Byzantine-era Medieval Greek inscriptions, suitable as a challenging testbed for Computer Vision and Natural Language Processing tasks. The lack of sizable related training sets, as well as difficulties related to the historical character and content of the inscriptions (natural wear of characters, systematic misspellings, etc.), make for a context where modern resource-hungry techniques are not straightforward to apply. We describe the dataset contents – images, geometric and text annotation, metadata – and discuss baselines for two Computer Vision tasks (Inscription Detection, Text Recognition) and one Natural Language Processing task (Word Classification). The dataset is publicly available at https://github.com/Archaeocomputers/Bessarion.
Giorgos Sfikas, Panagiotis Dimitrakopoulos, George Retsinas, Christophoros Nikou, Pinelopi Kitsiou
Automatic Lemmatization of Old Church Slavonic Language Using A Novel Dictionary-Based Approach
Abstract
Old Church Slavonic (OCS) is an ancient language that poses unique challenges and hurdles for natural language processing. Currently, there is a lack of Python libraries devised for the analysis of OCS texts. This research not only fills a crucial gap in the computational treatment of the OCS language but also produces valuable resources for scholars in historical linguistics, cultural studies, and the humanities, supporting further research in the field of ancient language processing. The main contribution of this work is the development of an algorithm for the lemmatization of OCS texts based on a learned dictionary. The approach can deal with ancient languages without requiring prior linguistic knowledge. Using a prepared dataset of more than 330K OCS words and their corresponding lemmas, the approach integrates the algorithm and the dictionary efficiently to achieve accurate lemmatization on test data.
Usman Nawaz, Liliana Lo Presti, Marianna Napolitano, Marco La Cascia
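The core of a dictionary-based lemmatizer is an exact lookup in a learned word-to-lemma table; a fallback for unseen inflections is one natural extension. The tiny table and the suffix-stripping fallback below are purely illustrative, not the paper's learned dictionary or algorithm.

```python
LEMMA_DICT = {                 # learned from aligned (word, lemma) pairs
    "словесе": "слово",
    "словесемъ": "слово",
    "богомъ": "богъ",
}

def lemmatize(word):
    if word in LEMMA_DICT:     # exact lookup covers most attested forms
        return LEMMA_DICT[word]
    for cut in range(1, len(word)):          # fallback: strip the ending
        stem = word[:-cut]                   # until a known form remains
        if stem in LEMMA_DICT:
            return LEMMA_DICT[stem]
    return word                # unknown word: return surface form as-is

print(lemmatize("словесемъ"))  # -> слово
```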
Automatic Transcription of Ottoman Documents Using Deep Learning
Abstract
With the accelerated pace of digitization, a vast collection of Ottoman documents has become accessible to researchers and the general public. However, most users interested in these documents are unable to read them, as the text is Turkish written in the Arabic-Persian script. Manual transcription of such a massive amount of documents is also beyond the capacity of human experts. With the advancements in deep learning, we have been able to provide a solution to the long-standing problem of automatic transcription of printed Ottoman documents. We evaluated three decoding strategies, including Word Beam Search, which allows a recognition lexicon and n-gram statistics to be used during the decoding phase. Furthermore, the effects of lexicon size and coverage, and of language modelling via character or word n-grams, are also evaluated. Using a general-purpose large lexicon of the Ottoman era (260K words and 86% test coverage), the performance is measured as a 6.59% character error rate and a 28.46% word error rate on a test set of 6,828 text lines.
Esma F. Bilgin Tasdemir, Zeynep Tandoğan, S. Doğan Akansu, Fırat Kızılırmak, M. Umut Sen, Aysu Akcan, Mehmet Kuru, Berrin Yanikoglu
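As a much simpler stand-in for Word Beam Search, the sketch below snaps each word of a raw decode to its closest lexicon entry, illustrating how a recognition lexicon can repair character-level errors. The toy lexicon and cutoff are assumptions, and this is not the decoder evaluated in the paper.

```python
import difflib

LEXICON = ["devlet", "osmanlı", "tarih", "vesika"]   # toy recognition lexicon

def lexicon_correct(line, cutoff=0.6):
    corrected = []
    for word in line.split():
        match = difflib.get_close_matches(word, LEXICON, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)  # keep OOV words as-is
    return " ".join(corrected)

print(lexicon_correct("osmanli tarih vesikka"))   # -> osmanlı tarih vesika
```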
Backmatter
Metadata
Title
Document Analysis Systems
Editors
Giorgos Sfikas
George Retsinas
Copyright Year
2024
Electronic ISBN
978-3-031-70442-0
Print ISBN
978-3-031-70441-3
DOI
https://doi.org/10.1007/978-3-031-70442-0
