research-article

Glean: structured extractions from templatic documents

Published: 01 February 2021
Abstract

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in a virtually unlimited number of ways. A good solution to this problem is one that generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones.

We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

  • Published in

    Proceedings of the VLDB Endowment  Volume 14, Issue 6
    February 2021
    261 pages
    ISSN:2150-8097
    Publisher

    VLDB Endowment
