ABSTRACT
Since ubiquitous real-world documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks: (1) text reading, which detects and recognizes text in images, and (2) information extraction, which analyzes and extracts key elements from the previously extracted plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and, in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.