research-article

Glean: structured extractions from templatic documents

Authors:
Sandeep Tata (Google)
Navneet Potti (Google)
James B. Wendt (Google)
Lauro Beltrão Costa (Google)
Marc Najork (Google)
Beliz Gunel (Stanford University)

Proceedings of the VLDB Endowment, Volume 14, Issue 6, pp. 997–1005. https://doi.org/10.14778/3447689.3447703

Published: 01 February 2021

Abstract

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in a virtually unlimited number of ways. A good solution generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones.

We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
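
The F1 score mentioned above is the harmonic mean of precision and recall. The following is a minimal sketch of how a field-level F1 could be computed for an extraction task, assuming exact-match comparison on (document, field) pairs. The abstract does not state the evaluation protocol used for Glean, so the matching rule, the `f1_score` helper, and the field names and values below are all hypothetical and for illustration only.

```python
from typing import Dict, Tuple

# Hypothetical representation: (document id, field name) -> extracted value.
Extractions = Dict[Tuple[str, str], str]


def f1_score(predicted: Extractions, truth: Extractions) -> float:
    """Exact-match F1 over (document, field) pairs (an assumed protocol)."""
    # A prediction counts as correct only if the same key exists in the
    # ground truth with an identical value.
    true_positives = sum(
        1 for key, value in predicted.items() if truth.get(key) == value
    )
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Made-up ground truth for two invoices with two target fields each.
truth = {
    ("inv-1", "invoice_date"): "2021-02-01",
    ("inv-1", "total_amount"): "120.00",
    ("inv-2", "invoice_date"): "2021-01-15",
    ("inv-2", "total_amount"): "75.50",
}
# Made-up model output: one wrong value and one missing field.
predicted = {
    ("inv-1", "invoice_date"): "2021-02-01",
    ("inv-1", "total_amount"): "120.00",
    ("inv-2", "invoice_date"): "2021-01-16",
}

score = f1_score(predicted, truth)
print(f"F1 = {score:.3f}")
```

Here precision is 2/3 and recall is 2/4, giving an F1 of 4/7. If F1 is reported on a 0 to 100 scale, a gain of 5 F1 points corresponds to a difference of 0.05 in this value.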

Published in
Proceedings of the VLDB Endowment, Volume 14, Issue 6
February 2021
261 pages
ISSN: 2150-8097
Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)
Publisher: VLDB Endowment