research-article

TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

Authors:
Peng Zhang, Hikvision Research Institute, Hangzhou, China
Yunlu Xu, Hikvision Research Institute, Hangzhou, China
Zhanzhan Cheng, Zhejiang University & Hikvision Research Institute, Hangzhou, China
Shiliang Pu, Hikvision Research Institute, Hangzhou, China
Jing Lu, Hikvision Research Institute, Hangzhou, China
Liang Qiao, Hikvision Research Institute, Hangzhou, China
Yi Niu, Hikvision Research Institute, Hangzhou, China
Fei Wu, Zhejiang University, Hangzhou, China

MM '20: Proceedings of the 28th ACM International Conference on Multimedia, October 2020, Pages 1413–1422
https://doi.org/10.1145/3394171.3413900

Published: 12 October 2020

ABSTRACT

Real-world documents such as invoices, tickets, resumes and leaflets carry rich information, so automatic document image understanding has become an active research topic. Most existing works decouple the problem into two separate tasks: (1) text reading, which detects and recognizes text in images, and (2) information extraction, which analyzes the previously extracted plain text and extracts key elements from it. However, they mainly focus on improving the information extraction task and neglect the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network in which the two tasks reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and, in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
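
To make the idea of fusing visual and textual features concrete, here is a minimal sketch in plain Python. It is not the network described in the paper: the weighted element-wise sum, the linear scoring step, and the names `fuse_features`, `classify_region`, `w_visual` and `class_weights` are all assumptions chosen only to illustrate how two per-region feature vectors might be combined before an entity label is predicted.

```python
from typing import Dict, List, Sequence


def fuse_features(
    visual: Sequence[float],
    textual: Sequence[float],
    w_visual: float = 0.5,
) -> List[float]:
    """Fuse one text region's visual and textual features.

    Assumption: a weighted element-wise sum stands in for whatever
    fusion the actual model uses.
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual features must have the same length")
    if not 0.0 <= w_visual <= 1.0:
        raise ValueError("w_visual must lie in [0, 1]")
    return [w_visual * v + (1.0 - w_visual) * t for v, t in zip(visual, textual)]


def classify_region(
    fused: Sequence[float],
    class_weights: Dict[str, Sequence[float]],
) -> str:
    """Return the entity label whose (hypothetical) weight vector has the
    largest dot product with the fused feature."""
    scores = {
        label: sum(f * w for f, w in zip(fused, weights))
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy 2-dimensional features for a single detected text region.
    fused = fuse_features([1.0, 0.0], [0.0, 1.0], w_visual=0.75)
    label = classify_region(fused, {"total": [1.0, 0.0], "date": [0.0, 1.0]})
    print(fused, label)
```

In an end-to-end setting, the classification loss computed from the fused features would also be propagated back into the text reading branch, which is how the semantics of information extraction could influence text reading; that training loop is omitted here.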

Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
General Chairs:
Chang Wen Chen, Chinese University of Hong Kong, Shenzhen, China
Rita Cucchiara, UNIMORE, Italy
Xian-Sheng Hua, Alibaba Group, China
Program Chairs:
Guo-Jun Qi, Futurewei Technologies, USA
Elisa Ricci, UNITN & Fondazione Bruno Kessler, Italy
Zhengyou Zhang, Tencent, China
Roger Zimmermann, National University of Singapore, Singapore
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
end-to-end
information extraction
text reading
visually rich documents