Abstract
Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) tasks perform poorly on these documents if these visual cues are not taken into account. In this paper, we present Artemis - a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is two-fold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human-labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. It identifies the visual span(s) containing a named entity by leveraging this learned representation followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1-score.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).Google Scholar
- T. Breuel. 2007. The hOCR Microformat for OCR Workflow and Results. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2. 1063--1067. Google ScholarDigital Library
- Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. 2003. Vips: a vision-based page segmentation algorithm. (2003).Google Scholar
- Kuang Chen, Akshay Kannan, Yoriyasu Yano, Joseph M Hellerstein, and Tapan S Parikh. 2012. Shreddr: pipelined paper digitization for low-resource organizations. In Proceedings of the 2nd ACM Symposium on Computing for Development. 3. Google ScholarDigital Library
- Antonio Clavelli, Dimosthenis Karatzas, and Josep Llados. 2010. A framework for the assessment of text extraction algorithms on complex colour images. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. 19--26. Google ScholarDigital Library
- Quanyu Dai, Qiang Li, Jian Tang, and Dan Wang. 2018. Adversarial network embedding. In Thirty-second AAAI conference on artificial intelligence.Google ScholarCross Ref
- Brian Davis, Bryan Morse, Scott Cohen, Brian Price, and Chris Tensmeyer. 2019. Deep visual template-free form parsing. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 134--141.Google ScholarCross Ref
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).Google Scholar
- AnHai Doan, Jeffrey F Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, et al. 2009. Information extraction challenges in managing unstructured data. ACM SIGMOD Record 37, 4 (2009), 14--20. Google ScholarDigital Library
- Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (VOC) challenge. International journal of computer vision 88, 2 (2010), 303--338. Google ScholarDigital Library
- Ignazio Gallo, Alessandro Zamberletti, and Lucia Noce. 2015. Content extraction from marketing flyers. In International Conference on Computer Analysis of Images and Patterns. Springer, 325--336.Google ScholarCross Ref
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672--2680. Google ScholarDigital Library
- The Stanford NLP Group. 2020. Stanford Part-Of-Speech Tagger. Accessed: 2020-01-31.Google Scholar
- The Stanford NLP Group. 2020. Stanford Word Tokenizer. Accessed: 2020-01-31.Google Scholar
- Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. [n.d.]. Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval. In International Conference on Document Analysis and Recognition (ICDAR). Google ScholarDigital Library
- Nurse Tech Inc. 2018. NurseBrains. Accessed: 2019-01-25.Google Scholar
- Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. arXiv preprint (2017).Google Scholar
- Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2d documents. arXiv preprint arXiv:1809.08799 (2018).Google Scholar
- Keras. 2018. Keras: Deep Learning for Humans. Accessed: 2018-09-30.Google Scholar
- D Kinga and J Ba Adam. 2015. A method for stochastic optimization. In International Conference on Learning Representations (ICLR), Vol. 5.Google Scholar
- Nicholas Kushmerick. 2000. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence 118, 1-2 (2000), 15--68. Google ScholarDigital Library
- Matthew Lamm. 2020. Natural Language Processing with Deep Learning. Accessed: 2020-01-31.Google Scholar
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature 521, 7553 (2015), 436--444.Google Scholar
- David Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, D Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 665--666. Google ScholarDigital Library
- Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057 (2015).Google Scholar
- Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. 2019. Graph convolution for multimodal information extraction from visually rich documents. arXiv preprint arXiv:1903.11279 (2019).Google Scholar
- Astera LLC. 2018. ReportMiner: A Data Extraction Solution. Accessed: 2018-09-30.Google Scholar
- Colin Lockard, Xin Luna Dong, Arash Einolghozati, and Prashant Shiralkar. 2018. Ceres: Distantly supervised relation extraction from the semi-structured web. arXiv preprint arXiv:1804.04635 (2018).Google Scholar
- Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A general framework for information extraction using dynamic span graphs. arXiv preprint arXiv:1904.03296 (2019).Google Scholar
- Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bidirectional lstm-cnns-crf. arXiv preprint arXiv:1603.01354 (2016).Google Scholar
- Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, and Marc Najork. 2020. Representation learning for information extraction from form-like documents. In proceedings of the 58th annual meeting of the Association for Computational Linguistics. 6495--6504.Google ScholarCross Ref
- Tomohiro Manabe and Keishi Tajima. 2015. Extracting logical hierarchical structure of HTML documents based on headings. Proceedings of the VLDB Endowment 8, 12 (2015), 1606--1617. Google ScholarDigital Library
- Christopher Manning. 2017. Representations for language: From word embeddings to sentence meanings. Accessed: 2020-01-31.Google Scholar
- Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55--60.Google ScholarCross Ref
- Marcin Michał Mirończuk. 2018. The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction. Knowledge and Information Systems 54, 3 (2018), 711--776. Google ScholarDigital Library
- Austin F Mount-Campbell, Kevin D Evans, David D Woods, Esther M Chipps, Susan D Moffatt-Bruce, and Emily S Patterson. 2019. Value and usage of a workaround artifact: A cognitive work analysis of "brains" use by hospital nurses. Journal of Cognitive Engineering and Decision Making 13, 2 (2019), 67--80.Google ScholarCross Ref
- Bastien Moysset, Christopher Kermorvant, Christian Wolf, and Jérôme Louradour. 2015. Paragraph text segmentation into lines with recurrent neural networks. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 456--460. Google ScholarDigital Library
- NIST. 2018. NIST Special Database 6. Accessed: 2018-09-30.Google Scholar
- Feng Niu, Ce Zhang, Christopher Ré, and Jude W Shavlik. 2012. DeepDive: Webscale Knowledge-base Construction using Statistical Learning and Inference. VLDS 12 (2012), 25--28.Google Scholar
- Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2009), 1345--1359. Google ScholarDigital Library
- Santiago Pascual, Antonio Bonafonte, and Joan Serra. 2017. SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 (2017).Google Scholar
- Frédéric Patin. 2003. An introduction to digital image processing. online]: http://www.programmersheaven.com/articles/patin/ImageProc.pdf (2003).Google Scholar
- P David Pearson, Michael L Kamil, Peter B Mosenthal, Rebecca Barr, et al. 2016. Handbook of reading research. Routledge.Google Scholar
- Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large DataBases, Vol. 11. NIH Public Access, 269. Google ScholarDigital Library
- Alexander J Ratner, Stephen H Bach, Henry R Ehrenberg, and Chris Ré. 2017. Snorkel: Fast training set generation for information extraction. In Proceedings of the 2017 ACM international conference on management of data. 1683--1686. Google ScholarDigital Library
- Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems. 3567--3575. Google ScholarDigital Library
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99. Google ScholarDigital Library
- Sunita Sarawagi et al. 2008. Information extraction. Foundations and Trends® in Databases 1, 3 (2008), 261--377. Google ScholarDigital Library
- Ritesh Sarkhel, Moniba Keymanesh, Arnab Nandi, and Srinivasan Parthasarathy. 2020. Interpretable Multi-headed Attention for Abstractive Summarization at Controllable Lengths. In Proceedings of the 28th International Conference on Computational Linguistics. 6871--6882.Google Scholar
- Ritesh Sarkhel and Arnab Nandi. 2019. Deterministic routing between layout abstractions for multi-scale classification of visually rich documents. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 3360--3366. Google ScholarCross Ref
- Ritesh Sarkhel and Arnab Nandi. 2019. Visual segmentation for information extraction from heterogeneous visually rich documents. In Proceedings of the 2019 International Conference on Management of Data. ACM, 247--262. Google ScholarDigital Library
- Ritesh Sarkhel, Jacob J Socha, Austin Mount-Campbell, Susan Moffatt-Bruce, Simon Fernandez, Kashvi Patel, Arnab Nandi, and Emily S Patterson. 2018. How Nurses Identify Hospitalized Patients on Their Personal Notes: Findings From Analyzing 'Brains' Headers with Multiple Raters. In Proceedings of the International Symposium on Human Factors and Ergonomics in Health Care, Vol. 7. SAGE Publications Sage India: New Delhi, India, 205--209.Google ScholarCross Ref
- Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu. 2017. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS) 42, 3 (2017), 1--21. Google ScholarDigital Library
- Ray Smith. 2007. An overview of the Tesseract OCR engine. In Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, Vol. 2. IEEE, 629--633. Google ScholarCross Ref
- Fei Sun, Dandan Song, and Lejian Liao. 2011. Dom based content extraction via text density. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 245--254. Google ScholarDigital Library
- GFG Thoma. 2003. Ground truth data for document image analysis. In Symposium on document image understanding and technology (SDIUT). 199--205.Google Scholar
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics. 384--394. Google ScholarDigital Library
- David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, relation, and event extraction with contextualized span representations. arXiv preprint arXiv:1909.03546 (2019).Google Scholar
- Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, and Christopher Ré. 2018. Fonduer: Knowledge base construction from richly formatted data. In Proceedings of the 2018 International Conference on Management of Data. ACM, 1301--1316. Google ScholarDigital Library
- Bishan Yang and Tom Mitchell. 2016. Joint extraction of events and entities within a document context. arXiv preprint arXiv:1609.03632 (2016).Google Scholar
- Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C Lee Giles. 2017. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5315--5324.Google ScholarCross Ref
- Xin Yi, Ekta Walia, and Paul Babyn. 2019. Generative adversarial network in medical imaging: A review. Medical image analysis 58 (2019), 101552.Google ScholarCross Ref
Index Terms
- Improving information extraction from visually rich documents using visual span representations
Recommendations
Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents
SIGMOD '19: Proceedings of the 2019 International Conference on Management of DataPhysical and digital documents often contain visually rich information. With such information, there is no strict ordering or positioning in the document where the data values must appear. Along with textual cues, these documents often also rely on ...
Visual information extraction
Typographic and visual information is an integral part of textual documents. Most information extraction (IE) systems ignore most of this visual information, processing the text as a linear sequence of words. Thus, much valuable information is lost. In ...
Fusion of visual representations for multimodal information extraction from unstructured transactional documents
AbstractThe importance of automated document understanding in terms of today’s businesses’ speed, efficiency, and cost reduction is indisputable. Although structured and semi-structured business documents have been studied intensively within the ...
Comments