DOI: 10.1145/3394171.3413900
Research Article

TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

Published: 12 October 2020

ABSTRACT

Since ubiquitous real-world documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks: (1) text reading, for detecting and recognizing texts in images, and (2) information extraction, for analyzing and extracting key elements from the previously recognized plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, in which the two tasks reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction, and in turn the semantics of information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed to variable layout, from structured to semi-structured text), our proposed method significantly outperforms state-of-the-art methods in both efficiency and accuracy.
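
The fusion described in the abstract (visual plus textual features of text reading feeding information extraction) can be illustrated with a minimal, hypothetical PyTorch sketch. All module names, dimensions, and the concatenation-based fusion below are assumptions made for illustration; they do not reproduce the authors' actual TRIE architecture.

```python
# Illustrative sketch only: a simplified fusion of per-region visual features
# and recognized-text features for entity tagging. All names and sizes are
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class MultimodalFusionIE(nn.Module):
    def __init__(self, visual_dim=256, text_vocab=100, embed_dim=128,
                 hidden_dim=256, num_entity_classes=10):
        super().__init__()
        self.char_embed = nn.Embedding(text_vocab, embed_dim)
        self.text_encoder = nn.LSTM(embed_dim, hidden_dim // 2,
                                    batch_first=True, bidirectional=True)
        # Project RoI-pooled visual features of each detected text region.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Fuse the two modalities and predict an entity class per region.
        self.fusion = nn.Linear(hidden_dim * 2, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_entity_classes)

    def forward(self, visual_feats, char_ids):
        # visual_feats: (num_regions, visual_dim) pooled from the detector.
        # char_ids:     (num_regions, max_len) recognized character indices.
        text_feats, _ = self.text_encoder(self.char_embed(char_ids))
        text_feats = text_feats.mean(dim=1)            # (num_regions, hidden_dim)
        vis = self.visual_proj(visual_feats)           # (num_regions, hidden_dim)
        fused = torch.relu(self.fusion(torch.cat([vis, text_feats], dim=-1)))
        return self.classifier(fused)                  # entity logits per region


# Toy usage: 4 detected text regions, each with up to 16 recognized characters.
model = MultimodalFusionIE()
logits = model(torch.randn(4, 256), torch.randint(0, 100, (4, 16)))
print(logits.shape)  # torch.Size([4, 10])
```

In the full model the two tasks are trained jointly, so the information extraction supervision also back-propagates into the text reading branch; the sketch above shows only the forward fusion direction.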


Supplemental Material

3394171.3413900.mp4 (MP4, 47.2 MB)


  • Published in

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171

    Copyright © 2020 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 12 October 2020


    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate: 995 of 4,171 submissions, 24%

