research-article

Improving information extraction from visually rich documents using visual span representations

Published: 01 January 2021

Abstract

Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) performs poorly on these documents when their visual cues are not taken into account. In this paper, we present Artemis, a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contributions are twofold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. Artemis identifies the visual span(s) containing a named entity by leveraging this learned representation, followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1-score.


Published in

Proceedings of the VLDB Endowment, Volume 14, Issue 5
January 2021
142 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
