2021 | Book

Document Analysis and Recognition – ICDAR 2021

16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II

About this book

This four-volume set of LNCS 12821, LNCS 12822, LNCS 12823 and LNCS 12824, constitutes the refereed proceedings of the 16th International Conference on Document Analysis and Recognition, ICDAR 2021, held in Lausanne, Switzerland in September 2021. The 182 full papers were carefully reviewed and selected from 340 submissions, and are presented with 13 competition reports.

The papers are organized into the following topical sections: document analysis for literature search, document summarization and translation, multimedia document analysis, mobile text recognition, document analysis for social good, indexing and retrieval of documents, physical and logical layout analysis, recognition of tables and formulas, and natural language processing (NLP) for document understanding.

Table of Contents

Frontmatter

Document Analysis for Literature Search

Frontmatter
Towards Document Panoptic Segmentation with Pinpoint Accuracy: Method and Evaluation

In this paper we study the task of document layout recognition for digital documents, which requires the model to detect the exact physical region of each object without missing any text or including redundant text from outside the object. This is a vital step in supporting high-quality information extraction, table understanding and knowledge base construction over documents from various vertical domains (e.g. financial, legal, and government fields). Here, we consider digital documents, in which, unlike image documents, characters and graphic elements are given with their exact texts and positions inside document pages. Towards document layout recognition with pinpoint accuracy, we treat the problem as a document panoptic segmentation task, in which each token in the document page must be assigned a class label and an instance id. Under traditional visual panoptic segmentation methods, such as Mask R-CNN, two predicted objects may intersect; document objects, however, never intersect because most document pages follow a Manhattan layout. Therefore, we propose a novel framework, named the document panoptic segmentation (DPS) model. It first splits the document page into column regions and groups tokens into line regions, then extracts textual and visual features, and finally assigns a class label and an instance id to each line region. Additionally, we propose a novel metric based on the intersection over union (IoU) between the tokens contained in the predicted and the ground-truth objects, which is more suitable than a metric based on the area IoU between the predicted and the ground-truth bounding boxes. Finally, experiments on the PubLayNet, ArXiv and Financial datasets show that the proposed DPS model obtains mAP scores of 0.8833, 0.9205 and 0.8530, respectively, a substantial improvement over the Faster R-CNN and Mask R-CNN models.

Rongyu Cao, Hongwei Li, Ganbin Zhou, Ping Luo
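
To make the token-based IoU from the abstract above concrete, the following is a minimal Python sketch. It assumes each object is represented by the set of ids of the tokens it contains; the function name `token_iou` and this representation are illustrative assumptions, not the authors' implementation.

```python
def token_iou(pred_tokens, gt_tokens):
    """IoU between the token sets of a predicted and a ground-truth object.

    Tokens are assumed to be hashable ids (hypothetical representation).
    """
    pred, gt = set(pred_tokens), set(gt_tokens)
    union = pred | gt
    if not union:
        # Two empty objects: define the overlap as 0 to avoid division by zero.
        return 0.0
    return len(pred & gt) / len(union)


# The prediction misses token 1 and wrongly includes token 4.
score = token_iou({2, 3, 4}, {1, 2, 3})
```

Unlike an area-based IoU, this score drops as soon as a single token is missed or wrongly included, regardless of how much of the page area the two boxes share.
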
A Math Formula Extraction and Evaluation Framework for PDF Documents

We present a processing pipeline for math formula extraction in PDF documents that takes advantage of character information in born-digital PDFs (e.g., created using LaTeX or Word). Our pipeline is designed for indexing math in technical document collections to support math-aware search engines capable of processing queries containing keywords and formulas. The system includes user-friendly tools for visualizing recognition results in HTML pages. Our pipeline comprises a new state-of-the-art PDF character extractor that identifies precise bounding boxes for non-Latin symbols, a novel Single Shot Detector-based formula detector, and an existing graph-based formula parser (QD-GGA) for recognizing formula structure. To simplify the analysis of structure recognition errors, we have extended the LgEval library (from the CROHME competitions) to allow viewing all instances of specific errors by clicking on HTML links. Our source code is publicly available.

Ayush Kumar Shah, Abhisek Dey, Richard Zanibbi
Toward Automatic Interpretation of 3D Plots

This paper explores the challenge of teaching a machine how to reverse-engineer the grid-marked surfaces used to represent data in 3D surface plots of two-variable functions. These are common in scientific and economic publications; and humans can often interpret them with ease, quickly gleaning general shape and curvature information from the simple collection of curves. While machines have no such visual intuition, they do have the potential to accurately extract the more detailed quantitative data that guided the surface’s construction. We approach this problem by synthesizing a new dataset of 3D grid-marked surfaces (SurfaceGrid) and training a deep neural net to estimate their shape. Our algorithm successfully recovers shape information from synthetic 3D surface plots that have had axes and shading information removed, been rendered with a variety of grid types, and viewed from a range of viewpoints.

Laura E. Brandt, William T. Freeman

Document Summarization and Translation

Frontmatter
Can Text Summarization Enhance the Headline Stance Detection Task? Benefits and Drawbacks

This paper presents an exploratory study that analyzes the benefits and drawbacks of summarization techniques for headline stance detection. Different types of summarization approaches are tested, as well as two stance detection methods (machine learning vs deep learning), on two state-of-the-art datasets (Emergent and FNC–1). Journalists’ headlines sourced from the Emergent dataset achieve very competitive results, showing that they can be considered a summary of the news article. Based on this finding, this work evaluates the effectiveness of using summaries as a substitute for the full body text to determine the stance of a headline. As for automatic summarization methods, although there is still some room for improvement, several of the techniques analyzed yield better results than using the full body text; Lead Summarizer and PLM Summarizer are among the best-performing ones. In particular, PLM Summarizer obtains the highest results among the automatic summarization methods analyzed, especially when five sentences are selected as the summary length and deep learning is used.

Marta Vicente, Robiert Sepúlveda-Torrres, Cristina Barros, Estela Saquete, Elena Lloret
The Biased Coin Flip Process for Nonparametric Topic Modeling

The Dirichlet and hierarchical Dirichlet processes are two important techniques for nonparametric Bayesian learning. These learning techniques allow unsupervised learning without specifying traditionally used input parameters. In topic modeling, this can be applied to discovering topics without specifying their number beforehand. Existing methods, such as those applied to topic modeling, usually rely on a complex sampling calculation for inference. These inference techniques for the Dirichlet and hierarchical Dirichlet processes are often based on Markov processes that can deviate from parametric topic modeling. This deviation may not be the best approach in the context of nonparametric topic modeling. Additionally, since they often rely on approximations, they can negatively affect the predictive power of such models. In this paper we introduce a new interpretation of nonparametric Bayesian learning called the biased coin flip process, devised for use in the context of Bayesian topic modeling. We prove mathematically the equivalence of the biased coin flip process to the Dirichlet process with an additional parameter representing the number of trials. A major benefit of the biased coin flip process is the similarity of its inference calculation to that of previously established parametric topic models, which we hope will lead to more widespread adoption of hierarchical Dirichlet process based topic modeling. Additionally, as we show empirically, the biased coin flip process leads to a nonparametric topic model with improved predictive performance.

Justin Wood, Wei Wang, Corey Arnold
CoMSum and SIBERT: A Dataset and Neural Model for Query-Based Multi-document Summarization

Document summarization compresses source document(s) into succinct and information-preserving text. A variant of this is query-based multi-document summarization (qMDS), which tailors summaries to specific informational needs, contextualized to the query. However, progress in this area is hindered by the limited availability of large-scale datasets. In this work, we make two contributions. First, we propose an approach for automatically generating a dataset for both extractive and abstractive summaries and release a version publicly. Second, we design a neural model, SIBERT, for extractive summarization that exploits the hierarchical nature of the input. It also infuses queries to extract query-specific summaries. We evaluate this model on the CoMSum dataset, showing a significant improvement in performance. This should provide a baseline and enable the use of CoMSum for future research on qMDS.

Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, Eugene Ie
RTNet: An End-to-End Method for Handwritten Text Image Translation

Text image recognition and translation have a wide range of applications. A straightforward solution is a two-stage approach: first perform text recognition, then translate the text into the target language. In that setting, the handwritten text recognition model and the machine translation model are trained separately, and any transcription error may degrade the translation quality. This paper proposes an end-to-end learning architecture that directly translates English handwritten text in images into Chinese. The handwriting recognition task and the translation task are combined in a unified deep learning model. First we perform visual encoding, next we bridge the semantic gap using a feature transformer, and finally we use a textual decoder to generate the target sentence. To train the model effectively, we use transfer learning to improve the generalization of the model under low-resource conditions. Experiments are carried out to compare our method with the traditional two-stage one. The results indicate that the performance of the end-to-end model improves greatly as the amount of training data increases. Furthermore, when a larger amount of training data is available, the end-to-end model is more advantageous.

Tonghua Su, Shuchen Liu, Shengjie Zhou

Multimedia Document Analysis

Frontmatter
NTable: A Dataset for Camera-Based Table Detection

Compared with raw textual data, information in tabular format is more compact and concise, and easier to compare, retrieve, and understand. Furthermore, in the era of the mobile Internet there is considerable demand for detecting and extracting tables from photos. However, most existing table detection methods are designed for scanned document images or Portable Document Format (PDF) files, and real-world tables are seldom included in the current mainstream table detection datasets. Therefore, we construct a dataset named NTable for camera-based table detection. NTable consists of a smaller-scale dataset NTable-ori, an augmented dataset NTable-cam, and a generated dataset NTable-gen. The experiments demonstrate that deep learning methods trained on NTable improve the performance of spotting tables in the real world. We will release the dataset to support the development and evaluation of more advanced methods for table detection and other applications in the future.

Ziyi Zhu, Liangcai Gao, Yibo Li, Yilun Huang, Lin Du, Ning Lu, Xianfeng Wang
Label Selection Algorithm Based on Boolean Interpolative Decomposition with Sequential Backward Selection for Multi-label Classification

In multi-label classification, an instance may be associated with multiple labels simultaneously, and thus the class labels are correlated rather than mutually exclusive. As various applications emerge, besides large instance size and high feature dimensionality, the dimensionality of the label space also grows quickly, which increases computational costs and can even degrade classification performance. To this end, dimensionality reduction is applied to the label space by exploiting label correlation information, resulting in label embedding and label selection techniques. Compared with the large body of label embedding work, less attention has been paid to label selection research due to its difficulty. Therefore, it is a challenging task to design more effective label selection techniques for multi-label classification. Boolean matrix decomposition (BMD) finds two low-rank binary matrices whose Boolean product approximates the original binary matrix. Further, the Boolean interpolative decomposition (BID) variant forces the left low-rank matrix to be a column subset of the original matrix, which amounts to choosing some informative binary labels for multi-label classification. Since BID is an NP-hard problem, an effective heuristic solution method is needed. In this paper, after executing exact BMD, which achieves an exact approximation by removing a few uninformative labels, we apply a sequential backward selection (SBS) strategy to delete less informative labels one by one until a fixed-size column subset remains. Our work builds a novel label selection algorithm based on BID with SBS. The proposed method is experimentally verified on six benchmark data sets with more than 100 labels, according to two performance metrics (precision@n and discounted gain@n, n = 1, 3 and 5) for the high-dimensional label situation.

Tianqi Ji, Jun Li, Jianhua Xu
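
The Boolean product and the column-subset constraint mentioned in the abstract above can be illustrated with a small Python sketch. The helper names `boolean_product` and `select_columns`, the toy label matrix, and the hand-picked coefficient matrix are assumptions for illustration only; the sketch does not implement the paper's selection algorithm.

```python
def boolean_product(left, right):
    """Boolean matrix product: entry (i, j) is OR over k of left[i][k] AND right[k][j]."""
    inner, cols = len(right), len(right[0])
    return [
        [int(any(row[k] and right[k][j] for k in range(inner))) for j in range(cols)]
        for row in left
    ]


def select_columns(matrix, indices):
    """Keep only the given columns, as in an interpolative (column-subset) factor."""
    return [[row[j] for j in indices] for row in matrix]


# Toy label matrix (rows: instances, columns: labels); label 2 is the OR of labels 0 and 1.
Y = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
C = select_columns(Y, [0, 1])   # left factor: a column subset of Y
B = [
    [1, 0, 1],                  # hand-picked right factor for this toy example
    [0, 1, 1],
]
reconstruction = boolean_product(C, B)
```

In this toy case the two selected labels reproduce `Y` exactly; in general the product only approximates the original matrix, and choosing which columns to keep is the hard part.
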
GSSF: A Generative Sequence Similarity Function Based on a Seq2Seq Model for Clustering Online Handwritten Mathematical Answers

Toward computer-assisted marking of descriptive math questions, this paper presents a method for clustering online handwritten mathematical expressions (OnHMEs) to help human markers mark them efficiently and reliably. We propose a generative sequence similarity function for computing a similarity score between two OnHMEs based on a sequence-to-sequence OnHME recognizer. Each OnHME is represented by a similarity-based representation (SbR) vector. The SbR matrix is fed to the k-means algorithm to cluster the OnHMEs. Experiments are conducted on an answer dataset (Dset_Mix) of 200 OnHMEs, a mix of real and synthesized patterns, for each of 10 questions, and on a real online handwritten mathematical answer dataset (NIER_CBT) of at most 122 student answers for each of 15 questions. The best clustering results reached around 0.916 and 0.915 for purity, and around 0.556 and 0.702 for the marking cost, on Dset_Mix and NIER_CBT, respectively. Our method currently outperforms the previous methods for clustering HMEs.

Huy Quang Ung, Cuong Tuan Nguyen, Hung Tuan Nguyen, Masaki Nakagawa
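
The abstract above reports clustering purity. The sketch below uses the standard definition (the fraction of items that belong to the majority class of their cluster); whether the paper computes purity in exactly this way is an assumption, and the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Standard clustering purity: majority-class count per cluster, summed, over N."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        return 0.0
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    # For each cluster, count how many items carry its most frequent true label.
    majority_total = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return majority_total / len(cluster_ids)


# Two clusters of three answers each; one answer in each cluster is in the minority.
value = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
```
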
C2VNet: A Deep Learning Framework Towards Comic Strip to Audio-Visual Scene Synthesis

Advances in technology have propelled the growth of methods and methodologies that can create the desired multimedia content. “Automatic image synthesis” is one such instance that has earned immense importance among researchers. In contrast, audio-video scene synthesis, especially from document images, remains challenging and less investigated. To bridge this gap, we propose a novel framework, Comic-to-Video Network (C2VNet), which proceeds panel by panel through a comic strip and eventually creates a full-length video (with audio) of a digitized or born-digital storybook. This step-by-step video synthesis process enables the creation of a high-resolution video. The primary contributions of the proposed work are: (1) a novel end-to-end comic strip to audio-video scene synthesis framework, (2) an improved panel and text balloon segmentation technique, and (3) a dataset of a digitized comic storybook in the English language with complete annotation and binary masks of the text balloons. Qualitative and quantitative experimental results demonstrate the effectiveness of the proposed C2VNet framework for automatic audio-visual scene synthesis.

Vaibhavi Gupta, Vinay Detani, Vivek Khokar, Chiranjoy Chattopadhyay
LSTMVAEF: Vivid Layout via LSTM-Based Variational Autoencoder Framework

The lack of training data is still a challenge in the Document Layout Analysis (DLA) task. Synthetic data is an effective way to tackle this challenge. In this paper, we propose an LSTM-based Variational Autoencoder framework (LSTMVAEF) to synthesize layouts for DLA. Compared with the previous method, our method can generate more complicated layouts and only needs training data from DLA, without extra annotation. We use LSTM models as the basic models to learn a latent representation of the class and position information of the elements within a page. It is worth mentioning that we design a weight adaptation strategy to help the model train faster. The experiments show that our model can generate more vivid layouts while needing only a few real document pages.

Jie He, Xingjiao Wu, Wenxin Hu, Jing Yang

Mobile Text Recognition

Frontmatter
HCRNN: A Novel Architecture for Fast Online Handwritten Stroke Classification

Stroke classification is an essential task for applications with free-form handwriting input. Implementation of this type of application for mobile devices places stringent requirements on different aspects of embedded machine learning models, which results in finding a trade-off between model performance and model complexity. In this work, a novel hierarchical deep neural network (HDNN) architecture with high computational efficiency is proposed. It is adopted for handwritten document processing and particularly for multi-class stroke classification. The architecture uses a stack of 1D convolutional neural networks (CNN) on the lower (point) hierarchical level and a stack of recurrent neural networks (RNN) on the upper (stroke) level. The novel fragment pooling techniques for feature transition between hierarchical levels are presented. On-device implementation of the proposed architecture establishes new state-of-the-art results in the multi-class handwritten document processing with a classification accuracy of 97.58% on the IAMonDo dataset. Our method is also more efficient in both processing time and memory consumption than the previous state-of-the-art RNN-based stroke classifier.

Andrii Grygoriev, Illya Degtyarenko, Ivan Deriuga, Serhii Polotskyi, Volodymyr Melnyk, Dmytro Zakharchuk, Olga Radyvonenko
RFDoc: Memory Efficient Local Descriptors for ID Documents Localization and Classification

The majority of recent papers in the field of image matching introduce universal descriptors for matching arbitrary keypoints. In this paper we propose a data-driven approach to building a descriptor for matching local keypoints from identity documents in the context of simultaneous ID document localization and classification on mobile devices. In the first stage, we train features robust to lighting changes. In the second stage, we select the best-performing and most discriminant features and form a set of classifiers, which we call the RFDoc descriptor. To address the problem of limited computing resources, the proposed descriptor is binary rather than real-valued. To perform experiments, we prepared a dataset of aligned patches from a subset of the identity document types presented in the MIDV datasets and made it public. In complex document detection and classification on the test part of the MIDV-500 dataset, the RFDoc descriptor showed performance similar to that of the state-of-the-art BEBLID-512 descriptor, which is more than 2.6 times less memory-efficient than RFDoc. On the more complex MIDV-2019 dataset, RFDoc showed 21% fewer classification errors.

Daniil Matalov, Elena Limonova, Natalya Skoryukina, Vladimir V. Arlazarov
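
The abstract above notes that the descriptor is binary rather than real-valued. Binary descriptors are commonly compared with the Hamming distance, which reduces to an XOR and a bit count. The sketch below illustrates that general idea in Python, storing each descriptor as an integer bit string; it is an assumption that RFDoc is matched this way, and the function names and 4-bit toy descriptors are hypothetical.

```python
def hamming_distance(a, b):
    """Number of differing bits between two descriptors stored as Python ints."""
    return bin(a ^ b).count("1")


def nearest_descriptor(query, candidates):
    """Index of the candidate with the smallest Hamming distance to the query."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return min(range(len(candidates)), key=lambda i: hamming_distance(query, candidates[i]))


# Toy 4-bit descriptors; real binary descriptors are typically hundreds of bits long.
query = 0b1011
templates = [0b0000, 0b1010, 0b0111]
best = nearest_descriptor(query, templates)
```

A binary descriptor of n bits needs n/8 bytes, whereas a real-valued descriptor of the same dimensionality stored as 32-bit floats needs 4n bytes, which is why binary descriptors suit memory-constrained mobile devices.
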
Dynamic Receptive Field Adaptation for Attention-Based Text Recognition

Existing attention-based recognition methods generally assume that the character scale and spacing within the same text instance are basically consistent. However, this hypothesis does not always hold in the context of complex scene images. In this study, we propose an innovative dynamic receptive field adaptation (DRA) mechanism for recognizing scene text robustly. Our DRA introduces different levels of receptive field features for classifying characters and designs a novel way to exploit historical attention information when calculating the attention map. In this way, our method can adaptively adjust the receptive field according to the variations of character scale and spacing in a scene text. Hence, our DRA mechanism can generate more informative features for recognition than traditional attention-based mechanisms. Notably, our DRA mechanism can be easily generalized to off-the-shelf attention-based methods in text recognition to improve their performance. Extensive experiments on various publicly available benchmarks, including the IIIT-5K, SVT, SVTP, CUTE80, and ICDAR datasets, indicate the effectiveness and robustness of our method against the state-of-the-art methods.

Haibo Qin, Chun Yang, Xiaobin Zhu, Xucheng Yin
Context-Free TextSpotter for Real-Time and Mobile End-to-End Text Detection and Recognition

In the deployment of scene-text spotting systems on mobile platforms, lightweight models with low computation are preferable. In concept, end-to-end (E2E) text spotting is suitable for such purposes because it performs text detection and recognition in a single model. However, current state-of-the-art E2E methods rely on heavy feature extractors, recurrent sequence modellings, and complex shape aligners to pursue accuracy, which means their computations are still heavy. We explore the opposite direction: How far can we go without bells and whistles in E2E text spotting? To this end, we propose a text-spotting method that consists of simple convolutions and a few post-processes, named Context-Free TextSpotter. Experiments using standard benchmarks show that Context-Free TextSpotter achieves real-time text spotting on a GPU with only three million parameters, which is the smallest and fastest among existing deep text spotters, with an acceptable transcription quality degradation compared to heavier ones. Further, we demonstrate that our text spotter can run on a smartphone with affordable latency, which is valuable for building stand-alone OCR applications.

Ryota Yoshihashi, Tomohiro Tanaka, Kenji Doi, Takumi Fujino, Naoaki Yamashita
MIDV-LAIT: A Challenging Dataset for Recognition of IDs with Perso-Arabic, Thai, and Indian Scripts

In this paper, we present a new dataset for identity document (ID) recognition called MIDV-LAIT. The main feature of the dataset is its textual fields in Perso-Arabic, Thai, and Indian scripts. Since open datasets with real IDs may not be published, we synthetically generated all the images and data. Even the faces are generated and do not belong to any particular person. Recently, some datasets have appeared for the evaluation of ID detection, type identification, and recognition, but these datasets cover only Latin-based and Cyrillic-based languages. The proposed dataset is intended to fill this gap and make it easier to evaluate and compare various methods. As a baseline, we process all the textual field images in MIDV-LAIT with Tesseract OCR. The resulting recognition accuracy shows that the dataset is challenging and is of use for further research.

Yulia Chernyshova, Ekaterina Emelianova, Alexander Sheshkus, Vladimir V. Arlazarov
Determining Optimal Frame Processing Strategies for Real-Time Document Recognition Systems

Mobile document analysis technologies have become widespread and important, and the growing reliance on the performance of critical processes, such as identity document data extraction and verification, leads to increasing speed and accuracy requirements. Camera-based document recognition on mobile devices using a video stream makes it possible to achieve higher accuracy; however, in real-time systems the actual time needed to process an individual image has to be taken into account, which it rarely is in works on this subject. In this paper, a model of a real-time document recognition system is described, and three frame processing strategies are evaluated, each consisting of a method for combining per-frame recognition results and a dynamic stopping rule. The experimental evaluation shows that full combination of all input results is preferable if the frame recognition time is comparable with the frame acquisition time. If the cost of recognizing a frame is significantly higher than that of skipping it, however, selecting the one best frame based on an input quality predictor, or combining several best frames, with the corresponding stopping rule, yields a higher mean accuracy of the recognition results.

Konstantin Bulatov, Vladimir V. Arlazarov

Document Analysis for Social Good

Frontmatter
Embedded Attributes for Cuneiform Sign Spotting

In the document analysis community, intermediate representations based on binary attributes are used to perform retrieval tasks or recognize unseen categories. These visual attributes representing high-level semantics continually achieve state-of-the-art results, especially for the task of word spotting. While spotting tasks are mainly performed on Latin or Arabic scripts, the cuneiform writing system is still a less well-known domain for the document analysis community. In contrast to the Latin alphabet, the cuneiform writing system consists of many different signs written by pressing a wedge stylus into moist clay tablets. Cuneiform signs are defined by different constellations and relative positions of wedge impressions, which can be exploited to define sign representations based on visual attributes. A promising approach to representing cuneiform signs using visual attributes is based on the so-called Gottstein-System. Here, cuneiform signs are described by counting the wedge types from a holistic perspective, without any spatial information about wedge positions within a sign. We extend this holistic representation with a spatial pyramid approach that gives a more fine-grained description of cuneiform signs. In this way, the proposed representation is capable of describing a single sign in a more detailed way and of representing a more extensive set of sign categories.

Eugen Rusakov, Turna Somel, Gerfrid G. W. Müller, Gernot A. Fink
Date Estimation in the Wild of Scanned Historical Photos: An Image Retrieval Approach

This paper presents a novel method for date estimation of historical photographs from archival sources. The main contribution is to formulate date estimation as a retrieval task in which, given a query, the retrieved images are ranked in terms of their estimated date similarity: the closer their embedded representations are, the closer their dates. In contrast to traditional models that design a neural network to learn a classifier or a regressor, we propose a learning objective based on the nDCG ranking metric. We have experimentally evaluated the performance of the method on two different tasks, date estimation and date-sensitive image retrieval, using the public DEW database, and it outperforms the baseline methods.

Adrià Molina, Pau Riba, Lluis Gomez, Oriol Ramos-Terrades, Josep Lladós
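
For readers unfamiliar with the nDCG ranking metric named in the abstract above, here is a minimal Python sketch of the standard metric with linear gains. It is not the paper's learning objective: how relevance is derived from dates, and how the metric is turned into a trainable loss, are not specified in the abstract, so the relevance values below are hypothetical.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    rels = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg(relevances, k) / ideal


# Relevance of each retrieved image, listed in the order the system ranked them.
perfect = ndcg([3, 2, 1])    # already in ideal order
reversed_ = ndcg([1, 2, 3])  # most relevant item ranked last
```
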
Two-Step Fine-Tuned Convolutional Neural Networks for Multi-label Classification of Children’s Drawings

Developmental psychologists employ several drawing-based tasks to measure the cognitive maturity of a child. Manual scoring of such tests is time-consuming and prone to scorer bias. A computerized analysis of digitized samples can provide efficiency and standardization. However, the inherent variability of hand-drawn traces and lack of sufficient training samples make it challenging for both feature engineering and feature learning. In this paper, we present a two-step fine-tuning based method to train a multi-label Convolutional Neural Network (CNN) architecture, for the scoring of a popular drawing-based test ‘Draw-A-Person’ (DAP). Our proposed two-step fine-tuned CNN architecture outperforms conventional pre-trained CNNs by achieving an accuracy of 81.1% in scoring of Gross Details, 99.2% in scoring of Attachments, and 79.3% in scoring of Head Details categories of DAP samples.

Muhammad Osama Zeeshan, Imran Siddiqi, Momina Moetesum
DCINN: Deformable Convolution and Inception Based Neural Network for Tattoo Text Detection Through Skin Region

Identifying tattoos is an integral part of forensic investigation and crime identification. Tattoo text detection is challenging because of the freestyle handwriting over the skin region and the variety of decorations. This paper introduces a Deformable Convolution and Inception based Neural Network (DCINN) for detecting tattoo text. Before tattoo text detection, the proposed approach detects skin regions in the tattoo images based on color models. This yields skin regions containing tattoo text, which reduces the background complexity of the tattoo text detection problem. For detecting tattoo text in the skin regions, we explore a DCINN that generates binary maps from the final feature maps using a differential binarization technique. Finally, polygonal bounding boxes are generated from the binary map for text of any orientation. Experiments on our Tattoo-Text dataset and on two standard datasets of natural scene text images, namely Total-Text and CTW1500, show that the proposed method is effective in detecting tattoo text as well as natural scene text in images. Furthermore, the proposed method outperforms existing text detection methods on several criteria.

Tamal Chowdhury, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Ramachandra Raghavendra, Sukalpa Chanda
Sparse Document Analysis Using Beta-Liouville Naive Bayes with Vocabulary Knowledge

Smoothing the parameters of multinomial distributions is an important concern in statistical inference tasks. In this paper, we present a new smoothing prior for the Multinomial Naive Bayes classifier. Our approach takes advantage of the Beta-Liouville distribution for the estimation of the multinomial parameters. Dealing with sparse documents, we exploit vocabulary knowledge to define two distinct priors over the “observed” and the “unseen” words. We analyze the problem of large-scale and sparse data by enhancing Multinomial Naive Bayes classifier through smoothing the estimation of words with a Beta-scale. Our approach is evaluated on two different challenging applications with sparse and large-scale documents namely: emotion intensity analysis and hate speech detection. Experiments on real-world datasets show the effectiveness of our proposed classifier compared to the related-work methods.

Fatma Najar, Nizar Bouguila
Automatic Signature-Based Writer Identification in Mixed-Script Scenarios

Automated human identification based on biometric traits has been a popular research topic over the last few decades. Among the several biometric modalities, the handwritten signature is one of the most common and prevalent. In the past, researchers have proposed various handcrafted feature-based techniques for automatic writer identification from offline signatures. Recently, there has been great interest in deep learning-based solutions to several real-life pattern recognition problems, and these have shown promising results. In this paper, we propose a lightweight CNN architecture to identify writers from offline signatures written in two popular scripts, namely Devanagari and Roman. Experiments were conducted using two different frameworks: (i) in the first, signature script separation is carried out, followed by script-wise writer identification; (ii) in the second, signatures in the two scripts are mixed together in various ratios and writer identification is performed in a script-independent manner. The outcomes of both frameworks have been analyzed for comparison. Furthermore, a comparative analysis was carried out against established CNN architectures as well as handcrafted feature-based approaches, and the proposed method shows better results. The dataset used in this paper can be freely downloaded for research purposes from: https://ieee-dataport.org/open-access/multi-script-handwritten-signature-roman-devanagari

Sk Md Obaidullah, Mridul Ghosh, Himadri Mukherjee, Kaushik Roy, Umapada Pal

Indexing and Retrieval of Documents

Frontmatter
Learning to Rank Words: Optimizing Ranking Metrics for Word Spotting

In this paper, we explore and evaluate the use of ranking-based objective functions for simultaneously learning a word string encoder and a word image encoder. We consider retrieval frameworks in which the user expects a retrieval list ranked according to a defined relevance score. In the context of a word spotting problem, the relevance score has been set according to the string edit distance from the query string. We experimentally demonstrate the competitive performance of the proposed model on query-by-string word spotting for both handwritten and real scene word images. We also provide the results for query-by-example word spotting, although it is not the main focus of this work.
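
As a rough illustration of a relevance score based on string edit distance, the sketch below ranks candidate transcriptions by their Levenshtein distance to a query string. It only shows the target ordering such a retrieval list might follow; the function names are hypothetical, and the abstract does not state how the distance is mapped to a relevance score or how the encoders are trained.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_by_relevance(query: str, candidates: list[str]) -> list[str]:
    """Order candidates from most to least relevant (smallest distance first).

    Assumption: relevance decreases monotonically with edit distance.
    Ties keep their input order because sorted() is stable.
    """
    return sorted(candidates, key=lambda word: edit_distance(query, word))


ranking = rank_by_relevance("house", ["mouse", "horse", "house", "car"])
```

Here an exact match is ranked first and near matches follow, which is the kind of ordering a ranking-based objective would be asked to reproduce.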

Pau Riba, Adrià Molina, Lluis Gomez, Oriol Ramos-Terrades, Josep Lladós
A-VLAD: An End-to-End Attention-Based Neural Network for Writer Identification in Historical Documents

This paper presents an end-to-end attention-based neural network for identifying writers in historical documents. The proposed network does not require any preprocessing stages such as binarization or segmentation. It has three main parts: a feature extractor using a Convolutional Neural Network (CNN) to extract features from an input image; an attention filter to select key points; and a generalized deep neural VLAD model to form a representative vector by aggregating the extracted key points. The whole network is trained end-to-end by a combination of cross-entropy and triplet losses. In the experiments, we evaluate the performance of our model on the HisFragIR20 dataset, which consists of about 120,000 historical fragments from many writers. The experiments demonstrate better mean average precision and accuracy at top-1 in comparison with the state-of-the-art results on the HisFragIR20 dataset. This model is a rather new approach for dealing with historical document fragments of various sizes in writer identification and image retrieval.

Trung Tan Ngo, Hung Tuan Nguyen, Masaki Nakagawa
Manga-MMTL: Multimodal Multitask Transfer Learning for Manga Character Analysis

In this paper, we introduce a new pipeline to learn manga character features from the visual and verbal information in manga image content. Combining these sets of information is crucial to go further into comic book image understanding. However, learning feature representations from multiple modalities is not straightforward. We propose a multitask multimodal approach for effectively learning the features of joint multimodal signals. To better leverage the verbal information, our method learns to memorize the content of manga albums by additionally using the album classification task. The experiments are carried out on the public Manga109 dataset, which contains annotations for characters, text blocks, frames and album metadata. We show that manga character features learnt by the proposed method are better than those of all existing single-modal methods for two manga character analysis tasks.

Nhu-Van Nguyen, Christophe Rigaud, Arnaud Revel, Jean-Christophe Burie
Probabilistic Indexing and Search for Hyphenated Words

Hyphenated words are very frequent in historical manuscripts. Reliable recognition of (the prefix and suffix fragments of) these words is problematic and has not been sufficiently studied so far. If the aim is to transcribe text images, a sufficiently accurate character-level recognition of the fragments might be an admissible transcription result. However, if the goal is to allow searching for words or “keyword spotting”, this is not acceptable at all because users need to query entire words, rather than possible fragments of these words. The situation becomes even worse if the aim is to index images for lexicon-free searching for any arbitrary text. To start with, this makes it necessary to know whether the concatenation of two word fragments may constitute a regular word, or whether each fragment is instead a word by itself. We propose a probabilistic model to deal with these complications and present a first development of this model, based only on lexicon-free probabilistic indexing of the text images. Albeit preliminary, it already allows finding both entire and hyphenated forms of arbitrary query words very accurately by using just the entire forms of the words. Experiments carried out on a representative part of a huge historical collection of the National Archives of Finland confirm the usefulness of the proposed methods.

Enrique Vidal, Alejandro H. Toselli

Physical and Logical Layout Analysis

Frontmatter
SandSlide: Automatic Slideshow Normalization

Slideshows are a popular tool for presenting information in a structured and attractive manner. There exists a wide range of different slideshow editors, often with their own proprietary encoding that is incompatible with other editors. Merging slideshows from different editors and making the slide design consistent is a nontrivial and time-intensive task. We introduce SandSlide, the first system for automatically normalizing a deck of slides from a PDF file into an editable PowerPoint file that adheres to the default slide templates, and is thus able to fully leverage the flexible layout capabilities of modern slideshow editors. SandSlide achieves this by labeling objects, using a qualitative representation to find the most similar slide layout and aligning content from the slide with this layout. To evaluate SandSlide, we collected and annotated slides from different slideshows. Our experiments show that a greedy search is able to obtain high responsiveness on supported and almost supported slides, and that a significant majority of slides fall into this category. Additionally, our annotated dataset contains fine-grained annotations on different properties of slideshows to further incentivize research on all aspects of the problem of slide normalization.

Sieben Bocklandt, Gust Verbruggen, Thomas Winters
Digital Editions as Distant Supervision for Layout Analysis of Printed Books

Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents’ semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.

Alejandro H. Toselli, Si Wu, David A. Smith
Palmira: A Deep Deformable Network for Instance Segmentation of Dense and Uneven Layouts in Handwritten Manuscripts

Handwritten documents are often characterized by dense and uneven layout. Despite advances, standard deep network based approaches for semantic layout segmentation are not robust to complex deformations seen across semantic regions. This phenomenon is especially pronounced for the low-resource Indic palm-leaf manuscript domain. To address the issue, we first introduce Indiscapes2, a new large-scale diverse dataset of Indic manuscripts with semantic layout annotations. Indiscapes2 contains documents from four different historical collections and is 150% larger than its predecessor, Indiscapes. We also propose a novel deep network, Palmira, for robust, deformation-aware instance segmentation of regions in handwritten manuscripts. We also report Hausdorff distance and its variants as a boundary-aware performance measure. Our experiments demonstrate that Palmira provides robust layouts and outperforms strong baseline approaches and ablative variants. We also include qualitative results on Arabic, South-East Asian and Hebrew historical manuscripts to showcase the generalization capability of Palmira.
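
For readers unfamiliar with the Hausdorff distance as a boundary-aware measure, a minimal sketch of the standard symmetric definition between two sets of boundary points follows. The point lists are made-up toy data, and the abstract does not specify which variants are reported or how region boundaries are sampled.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy boundaries: the prediction has one stray point 3 units from the truth.
truth = [(0.0, 0.0), (1.0, 0.0)]
predicted = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
distance = hausdorff(truth, predicted)
```

Because the measure takes a maximum over points, a single badly placed boundary point dominates the score, which is what makes it sensitive to boundary errors that area-based overlap measures can hide.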

S. P. Sharan, Sowmya Aitha, Amandeep Kumar, Abhishek Trivedi, Aaron Augustine, Ravi Kiran Sarvadevabhatla
Page Layout Analysis System for Unconstrained Historic Documents

Extraction of text regions and individual text lines from historic documents is necessary for automatic transcription. We propose extending a CNN-based text baseline detection system by adding line height and text block boundary predictions to the model output, allowing the system to extract more comprehensive layout information. We also show that pixel-wise text orientation prediction can be used for processing documents with multiple text orientations. We demonstrate that the proposed method performs well on the cBAD baseline detection dataset. Additionally, we benchmark the method on the newly introduced PERO layout dataset, which we also make public.

Oldřich Kodym, Michal Hradiš
Improved Graph Methods for Table Layout Understanding

Recently, there have been significant advances in document layout analysis and, particularly, in the recognition and understanding of tables and other structured documents in handwritten historical texts. In this work, a series of improvements over current techniques based on graph neural networks are proposed, which considerably improve state-of-the-art results. In addition, a two-pass approach is also proposed where two graph neural networks are sequentially used to provide further substantial improvements of more than 12 F-measure points in some tasks. The code developed for this work will be published to facilitate the reproduction of the results and possible improvements.

Jose Ramón Prieto, Enrique Vidal
Unsupervised Learning of Text Line Segmentation by Differentiating Coarse Patterns

Despite recent advances in the field of supervised deep learning for text line segmentation, unsupervised deep learning solutions are beginning to gain popularity. In this paper, we present an unsupervised deep learning method that embeds document image patches into a compact Euclidean space where distances correspond to a coarse text line pattern similarity. Once this space has been produced, text line segmentation can be easily implemented using standard techniques with the embedded feature vectors. To train the model, we extract random pairs of document image patches with the assumption that neighbouring patches contain a similar coarse trend of text lines, whereas if one of them is rotated, they contain different coarse trends of text lines. Doing well on this task requires the model to learn to recognize the text lines and their salient parts. The benefit of our approach is zero manual labelling effort. We evaluate the method qualitatively and quantitatively on several variants of text line segmentation datasets to demonstrate its effectiveness.

Berat Kurar Barakat, Ahmad Droby, Raid Saabni, Jihad El-Sana

Recognition of Tables and Formulas

Frontmatter
Rethinking Table Structure Recognition Using Sequence Labeling Methods

Table structure recognition is an important task in document analysis and attracts the attention of many researchers. However, due to the diversity of table types and the complexity of table structure, the performance of table structure recognition methods is still not good enough in practice. Row and column separators play a significant role in two-stage table structure recognition, and a better row and column separator segmentation result can improve the final recognition results. Therefore, in this paper, we present a novel deep learning model to detect row and column separators. This model contains a convolution encoder and two parallel row and column decoders. The encoder extracts the visual features by using convolution blocks; the decoder formulates the feature map as a sequence and uses a sequence labeling model, bidirectional long short-term memory networks (BiLSTM), to detect row and column separators. Experiments have been conducted on PubTabNet and the model is benchmarked on several available datasets, including PubTabNet, UNLV, ICDAR13, and ICDAR19. The results show that our model achieves state-of-the-art performance compared with other strong models. In addition, our model shows a better generalization ability. The code is available on this site (www.github.com/L597383845/row-col-table-recognition).

Yibo Li, Yilun Huang, Ziyi Zhu, Lemeng Pan, Yongshuai Huang, Lin Du, Zhi Tang, Liangcai Gao
TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

Information Extraction (IE) from the tables present in scientific articles is challenging due to complicated tabular representations and complex embedded text. This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. TabLeX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding LaTeX source code. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail on even simple table images. Finally, we experiment with an existing transformer-based baseline to report performance scores. In contrast to static benchmarks, we plan to augment this dataset with more complex and diverse tables at regular intervals.

Harsh Desai, Pratik Kayal, Mayank Singh
Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer

Encoder-decoder models have made great progress on handwritten mathematical expression recognition recently. However, it is still a challenge for existing methods to assign attention to image features accurately. Moreover, those encoder-decoder models usually adopt RNN-based models in their decoder part, which makes them inefficient in processing long sequences. In this paper, a transformer-based decoder is employed to replace RNN-based ones, which makes the whole model architecture very concise. Furthermore, a novel training strategy is introduced to fully exploit the potential of the transformer in bidirectional language modeling. Compared to several methods that do not use data augmentation, experiments demonstrate that our model improves the ExpRate of current state-of-the-art methods on CROHME 2014 by 2.23%. Similarly, on CROHME 2016 and CROHME 2019, we improve the ExpRate by 1.92% and 2.28% respectively.

Wenqi Zhao, Liangcai Gao, Zuoyu Yan, Shuai Peng, Lin Du, Ziyin Zhang
TabAug: Data Driven Augmentation for Enhanced Table Structure Recognition

Table Structure Recognition is an essential part of end-to-end tabular data extraction in document images. The recent success of deep learning model architectures in computer vision has yet to be reflected in table structure recognition, largely because extensive datasets for this domain are still unavailable while annotating new data is expensive and time-consuming. Traditionally, in computer vision, these challenges are addressed by standard augmentation techniques that are based on image transformations like color jittering and random cropping. As demonstrated by our experiments, these techniques are not effective for the task of table structure recognition. In this paper, we propose TabAug, a re-imagined Data Augmentation technique that produces structural changes in table images through replication and deletion of rows and columns. It also consists of a data-driven probabilistic model that allows control over the augmentation process. To demonstrate the efficacy of our approach, we perform experiments on the ICDAR 2013 dataset, where our approach shows consistent improvements in all aspects of the evaluation metrics, with cell-level correct detections improving from 92.16% to 96.11% over the baseline.
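
To make the idea of structural augmentation by replicating and deleting rows and columns concrete, here is a toy sketch that operates on a table stored as a grid of cell strings rather than on table images. The function names and the single `p_replicate` parameter are hypothetical; the abstract does not describe the data-driven probabilistic model, so a uniformly random row choice is assumed here instead.

```python
import random


def augment_rows(table, rng, p_replicate=0.5):
    """Return a copy of a non-empty table with one row replicated or deleted.

    Assumption: the row is picked uniformly at random and is replicated
    with probability p_replicate, otherwise deleted. A single-row table
    is always replicated so the result is never empty.
    """
    rows = [list(row) for row in table]
    index = rng.randrange(len(rows))
    if rng.random() < p_replicate or len(rows) == 1:
        rows.insert(index + 1, list(rows[index]))  # duplicate next to the original
    else:
        del rows[index]
    return rows


def augment_columns(table, rng, p_replicate=0.5):
    """Apply the same operation to columns by transposing the grid."""
    transposed = [list(column) for column in zip(*table)]
    changed = augment_rows(transposed, rng, p_replicate)
    return [list(row) for row in zip(*changed)]


table = [["Year", "Revenue", "Profit"],
         ["2020", "10", "2"]]
more_rows = augment_rows(table, random.Random(0), p_replicate=1.0)
fewer_rows = augment_rows(table, random.Random(0), p_replicate=0.0)
more_columns = augment_columns(table, random.Random(0), p_replicate=1.0)
```

Unlike color jittering or cropping, each call changes the number of rows or columns, so the structure labels change along with the content.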

Umar Khan, Sohaib Zahid, Muhammad Asad Ali, Adnan Ul-Hasan, Faisal Shafait
An Encoder-Decoder Approach to Handwritten Mathematical Expression Recognition with Multi-head Attention and Stacked Decoder

The encoder-decoder framework with an attention mechanism has become a mainstream solution to handwritten mathematical expression recognition (HMER) since the “watch, attend and parse” (WAP) approach was proposed in 2017, where a convolutional neural network is used as the encoder and a gated recurrent unit with attention is used in the decoder. Inspired by the recent success of the Transformer in many applications, in this paper we adopt the Transformer's design of multi-head attention and stacked decoder to improve the decoder part of the WAP framework for HMER. Experimental results on CROHME tasks show that multi-head attention can boost the expression recognition rate (ExpRate) of WAP from 54.32%/58.05% to 56.76%/59.72% and that the stacked decoder can further improve ExpRate to 57.72%/61.38% on the CROHME 2016/2019 test sets.

Haisong Ding, Kai Chen, Qiang Huo
Global Context for Improving Recognition of Online Handwritten Mathematical Expressions

This paper presents a temporal classification method for all three subtasks of symbol segmentation, symbol recognition and relation classification in online handwritten mathematical expressions (HMEs). The classification model is trained by multiple paths of symbols and spatial relations derived from the Symbol Relation Tree (SRT) representation of HMEs. The method benefits from global context of a deep bidirectional Long Short-term Memory network, which learns the temporal classification directly from online handwriting by the Connectionist Temporal Classification loss. To recognize an online HME, a symbol-level parse tree with Context-Free Grammar is constructed, where symbols and spatial relations are obtained from the temporal classification results. We show the effectiveness of the proposed method on the two latest CROHME datasets.

Cuong Tuan Nguyen, Thanh-Nghia Truong, Hung Tuan Nguyen, Masaki Nakagawa
Image-Based Relation Classification Approach for Table Structure Recognition

In recent years, the use of tabular data has become a major area of research and development. However, the number of tables structured in a machine-readable format is still limited. A major challenge that is encountered when using tabular data is converting the table information in a free-format document into a structured format. Unlike markup languages such as HTML, XML, and JSON, free-format documents such as PDF, Word, Excel, and images generally have no tags or separators. Therefore, the table structure should be recognized from the positional information of the table elements. A major approach to table structure recognition is to classify the relationship between each pair of bounding boxes of the table elements. Recent works have achieved significant improvements by applying graph convolutional networks (GCNs) to the graph structure of the bounding boxes. However, fully recognizing a complex table structure is still a major challenge, owing to the presence of spanning cells. In this study, we propose a novel, simple image-based approach to this relation classification task. Our model efficiently exploits information such as the geometry of the table elements and ruled lines through an image cropping strategy based on the pairs of bounding boxes. We evaluate our approach on two real-world table datasets by comparing it against four baselines, including two state-of-the-art GCN approaches. We observe that our approach significantly outperforms the baselines in the exact matching ratio for tables by up to 6.7%.

Koji Ichikawa
Image to LaTeX with Graph Neural Network for Mathematical Formula Recognition

Mathematical formula recognition aims to automatically convert formula images into their structured description formats. Recently, some encoder-decoder models have been presented for this task, but they seldom explicitly consider the spatial relationships among symbols. In this paper, we propose a novel encoder-decoder model with a Graph Neural Network (GNN) to translate mathematical formula images into LaTeX code. In the proposed model, the symbols segmented from the raw image are used to build graphs based on their spatial connections. The encoder consists of a Convolutional Neural Network (CNN) and a GNN. The CNN is utilized to extract the visual features from the whole formula or symbols, and the GNN is used to transmit the spatial information embedded in the built graphs. The adopted decoder is a Recurrent Neural Network (RNN) model, which implements a language model to generate the output sentences based on the encoded features with an attention mechanism. The experimental results on the IM2LATEX-100K dataset demonstrate that the proposed model obtains better performance than state-of-the-art approaches.

Shuai Peng, Liangcai Gao, Ke Yuan, Zhi Tang

NLP for Document Understanding

Frontmatter
A Novel Method for Automated Suggestion of Similar Software Incidents Using 2-Stage Filtering: Findings on Primary Data

Advancements in software technology have resulted in a sharp increase in the complexity of software with an equally increasing number of bugs reported every day. Some of these bugs have a high severity and can often lead to significant business impacts. Thus, they need to be resolved by the developers as early as possible. Many of these reports are similar to bug reports that were reported and resolved in the past. When similar incidents are suggested, the developers can refer to the troubleshooting information, thus effectively reducing the TTM (Time to Mitigate) of the software bugs. The developers also spend a significant amount of time and effort in triaging the bugs into their respective areas. Previous studies have mainly relied on unsupervised learning techniques for the detection of duplicate reports and ignored some key aspects of the bug reports. We conducted comprehensive research on real bugs reported for Microsoft Dynamics 365 Application Software. Our research presents a novel two-phase approach for suggesting similar incidents. The first phase, called Binning, involves the creation of a labelled dataset for employing a supervised learning algorithm for triaging the software incidents into multiple categories. Thus, the first phase also presents a solution for automating the process of triaging the incidents in addition to the first stage of filtering. The second phase introduces the use of error execution information and acknowledgment information for the calculation of similarity scores, which has largely been ignored in previous studies. The evaluation results show that the precision rate of our proposed approach reaches up to 95.8% while the model achieves recall rates of 67%–93.5%.

Badal Agrawal, Mohit Mishra, Varun Parashar
Research on Pseudo-label Technology for Multi-label News Classification

Multi-label news classification is of significant importance given the growing volume of news containing multiple semantics. However, most of the existing multi-label classification methods rely on large-scale labeled corpora, while publicly available resources for multi-label classification are limited. Although many studies have proposed the application of pseudo-label technology to expand the corpus, few have explored it for multi-label classification, since the number of labels is not easy to determine. To address these problems, we construct a multi-label news classification corpus for the Indonesian language and propose a new multi-label news classification framework using pseudo-label technology. The framework employs the BERT model as a pre-trained language model to obtain the sentence representation of the texts. Furthermore, the cosine similarity algorithm is utilized to match the text labels. On the basis of matching text labels with similarity algorithms, a pseudo-label technology is used to pick the classes for unlabeled data. Then, we screen a high-confidence pseudo-label corpus to train the model together with the original training data, and we also introduce loss weights, including a class weight adjustment method and a pseudo-label loss function balance coefficient, to address class imbalance and to reduce the impact of the quantity difference between labeled texts and pseudo-label texts on model training. Experimental results demonstrate that the framework proposed in this paper achieves strong performance in Indonesian multi-label news classification, and that each strategy yields some improvement.
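
The label-matching step can be pictured with a small sketch that scores a text vector against one vector per label using cosine similarity and keeps the labels above a confidence threshold. The toy two-dimensional vectors stand in for BERT sentence representations, and the threshold rule and names such as `pick_pseudo_labels` are assumptions for illustration; the abstract does not state how high-confidence pseudo-labels are screened.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_pseudo_labels(text_vector, label_vectors, threshold):
    """Keep every label whose similarity to the text reaches the threshold.

    Assumption: a fixed threshold decides how many labels a text receives,
    so the number of labels does not have to be known in advance.
    """
    return [label for label, vector in label_vectors.items()
            if cosine_similarity(text_vector, vector) >= threshold]


# Toy vectors; in practice these would be sentence embeddings.
label_vectors = {"sports": [1.0, 0.0], "politics": [0.0, 1.0], "economy": [1.0, 1.0]}
labels = pick_pseudo_labels([1.0, 0.1], label_vectors, threshold=0.7)
```

With these toy values the text is close to both "sports" and "economy", so it receives two pseudo-labels, while "politics" falls below the threshold.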

Lianxi Wang, Xiaotian Lin, Nankai Lin
Information Extraction from Invoices

The present paper focuses on information extraction from key fields of invoices using two different methods based on sequence labeling. Invoices are semi-structured documents in which data can be located based on the context. Common information extraction systems are model-driven, using heuristics and lists of trigger words curated by domain experts. Their performance is generally high on documents they have been trained for, but processing new templates often requires new manual annotations, which are tedious and time-consuming to produce. Recent works on deep learning applied to business documents have claimed a gain in terms of time and performance. While these systems do not need manual curation, they nevertheless require a large amount of data to achieve good results. In this paper, we present a series of experiments using neural network approaches to study the trade-off between data requirements and performance in the extraction of information from key fields of invoices (such as dates, document numbers, types, amounts...). The main contribution of this paper is a system that achieves competitive results using a small amount of data compared to state-of-the-art systems that need to be trained on large datasets that are costly and impractical to produce in real-world applications.

Ahmed Hamdi, Elodie Carel, Aurélie Joseph, Mickael Coustaty, Antoine Doucet
Are You Really Complaining? A Multi-task Framework for Complaint Identification, Emotion, and Sentiment Classification

In recent times, given the competitive nature of the corporate world, customer support has become the core of organizations that can strengthen their brand image. Timely and effective settlement of customers’ complaints is vital in improving customer satisfaction in different business organizations. Companies experience difficulties in automatically identifying complaints buried deep in enormous online content. Emotion detection and sentiment analysis, two closely related tasks, play very critical roles in complaint identification. We hypothesize that the association between emotion and sentiment will provide an enhanced understanding of the state of mind of the tweeter. In this paper, we propose a Bidirectional Encoder Representations from Transformers (BERT) based shared-private multi-task framework that aims to learn three closely related tasks, viz. complaint identification (primary task), emotion detection, and sentiment classification (auxiliary tasks) concurrently. Experimental results show that our proposed model obtains the highest macro-F1 score of 87.38%, outperforming the multi-task baselines as well as the state-of-the-art model by indicative margins, denoting that emotion awareness and sentiment analysis facilitate the complaint identification task when learned simultaneously.

Apoorva Singh, Sriparna Saha
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.

Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka
Data Centric Domain Adaptation for Historical Text with OCR Errors

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and OCR errors by injecting synthetic OCR errors into the source domain, thereby performing data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.

Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth, Hinrich Schütze
Temporal Ordering of Events via Deep Neural Networks

Ordering events with temporal relations in texts remains a challenge in natural language processing. In this paper, we introduce a new combined neural network architecture that is capable of classifying temporal relations between events in an Arabic sentence. Our model consists of two branches: the first one extracts the syntactic information and identifies the orientation of the relation between the two given events based on a Shortest Dependency Path (SDP) layer with Long Short-Term Memory (LSTM), and the second one encourages the model to focus on the important local information when learning sentence representations based on a Bidirectional-LSTM (BiLSTM) attention layer. The experiments suggest that our proposed model outperforms several previous state-of-the-art methods, with an F1-score equal to 86.40%.

Nafaa Haffar, Rami Ayadi, Emna Hkiri, Mounir Zrigui
Document Collection Visual Question Answering

Current tasks and methods in Document Understanding aim to process documents as single elements. However, documents are usually organized in collections (historical records, purchase invoices) that provide context useful for their interpretation. To address this problem, we introduce Document Collection Visual Question Answering (DocCVQA), a new dataset and related task, where questions are posed over a whole collection of document images and the goal is not only to provide the answer to the given question, but also to retrieve the set of documents that contain the information needed to infer the answer. Along with the dataset, we propose a new evaluation metric and baselines which provide further insights into the new dataset and task.

Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny
Dialogue Act Recognition Using Visual Information

Automatic dialogue management including dialogue act (DA) recognition is usually focused on dialogues in the audio signal. However, some dialogues are also available in a written form and their automatic analysis is also very important. The main goal of this paper thus consists in the dialogue act recognition from printed documents. For visual DA recognition, we propose a novel deep model that combines two recurrent neural networks. The approach is evaluated on a newly created dataset containing printed dialogues from the English VERBMOBIL corpus. We have shown that visual information does not have any positive impact on DA recognition using good quality images where the OCR result is excellent. We have also demonstrated that visual information can significantly improve the DA recognition score on low-quality images with erroneous OCR. To the best of our knowledge, this is the first attempt focused on DA recognition from visual data.

Jiří Martínek, Pavel Král, Ladislav Lenc
Are End-to-End Systems Really Necessary for NER on Handwritten Document Images?

Named entities (NEs) are fundamental in the extraction of information from text. The recognition and classification of these entities into predefined categories is called Named Entity Recognition (NER) and plays a major role in Natural Language Processing. However, only a few works consider this task with respect to the document image domain. The approaches are either based on a two-stage or end-to-end architecture. A two-stage approach transforms the document image into a textual representation and determines the NEs using a textual NER. The end-to-end approach, on the other hand, avoids the explicit recognition step at text level and determines the NEs directly on image level. Current approaches that try to tackle the task of NER on segmented word images use end-to-end architectures. This is motivated by the assumption that handwriting recognition is too erroneous to allow for an effective application of textual NLP methods. In this work, we present a two-stage approach and compare it against state-of-the-art end-to-end approaches. Due to the lack of datasets and evaluation protocols, such a comparison is currently difficult. Therefore, we manually annotated the known IAM and George Washington datasets with NE labels and publish them along with optimized splits and an evaluation protocol. Our experiments show, contrary to the common belief, that a two-stage model can achieve higher scores on all tested datasets.

Oliver Tüselmann, Fabian Wolf, Gernot A. Fink
Training Bi-Encoders for Word Sense Disambiguation

Modern transformer-based neural architectures yield impressive results in nearly every NLP task and Word Sense Disambiguation, the problem of discerning the correct sense of a word in a given context, is no exception. State-of-the-art approaches in WSD today leverage lexical information along with pre-trained embeddings from these models to achieve results comparable to human inter-annotator agreement on standard evaluation benchmarks. In the same vein, we experiment with several strategies to optimize bi-encoders for this specific task and propose alternative methods of presenting lexical information to our model. Through our multi-stage pre-training and fine-tuning pipeline we further the state of the art in Word Sense Disambiguation.

Harsh Kohli
DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction

We address the challenge of extracting structured information from business documents without detailed annotations. We propose Deep Conditional Probabilistic Context Free Grammars (DeepCPCFG) to parse two-dimensional complex documents and use Recursive Neural Networks to create an end-to-end system for finding the most probable parse that represents the structured information to be extracted. This system is trained end-to-end with scanned documents as input and only relational-records as labels. The relational-records are extracted from existing databases avoiding the cost of annotating documents by hand. We apply this approach to extract information from scanned invoices achieving state-of-the-art results despite using no hand-annotations.

Freddy C. Chua, Nigel P. Duffy
Consideration of the Word’s Neighborhood in GATs for Information Extraction in Semi-structured Documents

Most administrative documents take a semi-structured form (invoices, payslips, etc.). Extracting information from this type of document is still challenging because its structure varies with the layout styles of the different administrations. In this work, we address this type of variation by using a multi-layer Graph Attention Network (GAT). We propose a general structure of a semi-structured document. Based on this structure, we adopt a star sub-graph to exploit the surrounding context of words, allowing neighboring words to help locate the searched words and rank them. The GAT makes it possible to exploit this type of neighborhood and to highlight important neighboring words likely to be better identified. Each graph node contains both textual and visual features. We evaluate the multi-layer GAT on three different datasets: invoices and payslips (generated artificially), and receipts (from the SROIE ICDAR competition). For the latter dataset, we obtain an F1 score of 0.892.

Djedjiga Belhadj, Yolande Belaïd, Abdel Belaïd
Backmatter
Metadata
Title
Document Analysis and Recognition – ICDAR 2021
Editors
Josep Lladós
Prof. Daniel Lopresti
Seiichi Uchida
Copyright Year
2021
Electronic ISBN
978-3-030-86331-9
Print ISBN
978-3-030-86330-2
DOI
https://doi.org/10.1007/978-3-030-86331-9
