research-article

Improving information extraction from visually rich documents using visual span representations

Authors:
Ritesh Sarkhel, The Ohio State University
Arnab Nandi, The Ohio State University

Proceedings of the VLDB Endowment, Volume 14, Issue 5, pp. 822–834. https://doi.org/10.14778/3446095.3446104

Published: 01 January 2021

Abstract

Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) tasks perform poorly on these documents if these visual cues are not taken into account. In this paper, we present Artemis, a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is two-fold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. Artemis identifies the visual span(s) containing a named entity by leveraging this learned representation, followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1-score.
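
For readers less familiar with the metric, the following minimal sketch shows how an F1-score is conventionally computed from extraction counts, and what a difference "in points" means on a 0 to 100 scale. The function name and all counts are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
def f1_points(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall) on a 0-100 scale."""
    predicted = true_positives + false_positives  # entities the extractor returned
    actual = true_positives + false_negatives     # entities present in the gold labels
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical counts for two extractors evaluated on the same labeled set.
text_only_f1 = f1_points(true_positives=60, false_positives=40, false_negatives=40)
visually_aware_f1 = f1_points(true_positives=75, false_positives=25, false_negatives=25)

# A gain "in points" is the plain difference between the two scores.
gain_in_points = visually_aware_f1 - text_only_f1
print(f"text-only F1: {text_only_f1:.1f}")
print(f"visually aware F1: {visually_aware_f1:.1f}")
print(f"gain: {gain_in_points:.1f} points")
```

A gain reported in points is an absolute difference between two scores, not a relative percentage improvement.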

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published in
Proceedings of the VLDB Endowment, Volume 14, Issue 5
January 2021
142 pages
ISSN: 2150-8097
Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)
Publisher: VLDB Endowment