nach oben

International Journal on Document Analysis and Recognition (IJDAR)

Erschienen in:

22.04.2022 | Original Paper

Fusion of visual representations for multimodal information extraction from unstructured transactional documents

verfasst von: Berke Oral, Gülşen Eryiğit

Erschienen in: International Journal on Document Analysis and Recognition (IJDAR) | Ausgabe 3/2022

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

The importance of automated document understanding in terms of today’s businesses’ speed, efficiency, and cost reduction is indisputable. Although structured and semi-structured business documents have been studied intensively within the literature, information extraction from the unstructured ones remains still an open and challenging research topic due to their difficulty levels and the scarcity of available datasets. Transactional documents occupy a special place among the various types of business documents as they serve to track the financial flow and are the most studied type accordingly. The processing of unstructured transactional documents requires the extraction of complex relations (i.e., n-ary, document-level, overlapping, and nested relations). Studies focusing on unstructured transactional documents rely mostly on textual information. However, the impact of their visual compositions remains an unexplored area and may be valuable on their automatic understanding. For the first time in the literature, this article investigates the impact of using different visual representations and their fusion on information extraction from unstructured transactional documents (i.e., for complex relation extraction from money transfer order documents). It introduces and experiments with five different visual representation approaches (i.e., word bounding box, grid embedding, grid convolutional neural network, layout embedding, and layout graph convolutional neural network) and their possible fusion with five different strategies (i.e., three basic vector operations, weighted fusion, and attention-based fusion). The results show that fusion strategies provide a valuable enhancement on combining diverse visual information from which unstructured transactional document understanding obtains different benefits depending on the context. While different visual representations have little effect when added individually to a pure textual baseline, their fusion provides a relative error reduction of up to 33%.

Vorheriger Artikel CarveNet: a channel-wise attention-based network for irregular scene text recognition

Nächster Artikel Boosting modern and historical handwritten text recognition with deformable convolutions

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

There exist also semi-structured money transfer orders which are processed with table-detection algorithms, which are beyond the scope this article.

In the original study, Oral et al. [2] also use character embeddings next to pretrained textual word embeddings and report that this helps the performances at very low levels (0.2 percentage points) for NER. To alleviate this complexity, we dropped the character BiLSTM layer from textual representations to better observe the effects of the visual representations.

The term Layout embedding is also used by Xu et al. [22], but the approach that we introduce here should not be confused with it which is more similar to our Grid Embedding approach.

We also tested with an attention-based fusion approach using textual features as our attention context, but could not obtain good results.

Although there could appear semi-structured documents in this domain containing well-formed forms and tables, the authors state that these are not included in this dataset.

Graliński, F., Stanisławek, T., Wróblewska, A., Lipiński, D., Kaliska, A., Rosalska, P., Topolski, B., Biecek, P.: Kleister: A novel task for information extraction involving long documents with complex layout. arXiv preprint arXiv:2003.02356 (2020)

Oral, B., Emekligil, E., Arslan, S., Eryiǧit, G.: Information extraction from text intensive and visually rich banking documents. Inform. Process. Manag. (2020). https://doi.org/10.1016/j.ipm.2020.102361CrossRef

Cristani, M., Bertolaso, A., Scannapieco, S., Tomazzoli, C.: Future paradigms of automated processing of business documents. Int. J. Inf. Manag. 40, 67–75 (2018). https://doi.org/10.1016/j.ijinfomgt.2018.01.010CrossRef

Chalkidis, I., Androutsopoulos, I., Michos, A.: Extracting contract elements. In: Proceedings of the 16th Edition of the International Conference on Articial Intelligence and Law. ICAIL ’17, pp. 19–28. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3086512.3086515

Ilias, C., Ion, A.: A deep learning approach to contract element extraction. Frontiers in Artificial Intelligence and Applications 302 (Legal Knowledge and Information Systems), 155–164 (2017). https://doi.org/10.3233/978-1-61499-838-9-155

Göbel, M., Hassan, T., Oro, E., Orsi, G.: Icdar 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1449–1453 (2013). https://doi.org/10.1109/ICDAR.2013.292

Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: International Conference on Document Analysis and Recognition (ICDAR) (2015)

Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: Cord: A consolidated receipt dataset for post-ocr parsing. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)

Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.V.: Icdar2019 competition on scanned receipt ocr and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1516–1520 (2019). https://doi.org/10.1109/ICDAR.2019.00244

10.

Jaume, G., Kemal Ekenel, H., Thiran, J.-P.: Funsd: A dataset for form understanding in noisy scanned documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 2, pp. 1–6 (2019). https://doi.org/10.1109/ICDARW.2019.10029

11.

Palm, R.B., Winther, O., Laws, F.: Cloudscan—A configuration-free invoice analysis system using recurrent neural networks. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 406–413 (2017). https://doi.org/10.1109/ICDAR.2017.74

12.

Sage, C., Aussem, A., Elghazel, H., Eglin, V., Espinas, J.: Recurrent neural network approach for table field extraction in business documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1308–1313 (2019). https://doi.org/10.1109/ICDAR.2019.00211

13.

Sage, C., Aussem, A., Eglin, V., Elghazel, H., Espinas, J.: End-to-end extraction of structured information from business documents with pointer-generator networks. In: Proceedings of the Fourth Workshop on Structured Prediction for NLP, pp. 43–52. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.spnlp-1.6

14.

Santosh, K., Belaid, A.: Document information extraction and its evaluation based on client’s relevance. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 35–39 (2013). IEEE

15.

Santosh, K.: g-DICE: graph mining-based document information content exploitation. Int. J. Doc. Anal. Recogn. (IJDAR) 18(4), 337–355 (2015). https://doi.org/10.1007/s10032-015-0253-zCrossRef

16.

Katti, A.R., Reisswig, C., Guder, C., Brarda, S., Bickel, S., Höhne, J., Faddoul, J.B.: Chargrid: Towards understanding 2D documents. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4459–4469. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1476

17.

Denk, T.I., Reisswig, C.: BERTgrid: Contextualized embedding for 2d document representation and understanding. In: Workshop on Document Intelligence at NeurIPS 2019 (2019). https://openreview.net/forum?id=H1gsGaq9US

18.

Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse end-to-end information extraction from documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 329–336 (2019). https://doi.org/10.1109/ICDAR.2019.00060

19.

Zhao, X., Niu, E., Wu, Z., Wang, X.: Cutie: Learning to understand documents with convolutional universal text information extractor. arXiv preprint arXiv:1903.12363 (2019)

20.

Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pp. 32–39. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-2005

21.

Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-Training of Text and Layout for Document Image Understanding, pp. 1192–1200. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394486.3403172

22.

Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., Zhou, L.: LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2579–2591. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.acl-long.201

23.

Zhang, P., Xu, Y., Cheng, Z., Pu, S., Lu, J., Qiao, L., Niu, Y., Wu, F.: Trie: End-to-end text reading and information extraction for document understanding. In: Proceedings of the 28th ACM International Conference on Multimedia. MM ’20, pp. 1413–1422. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394171.3413900

24.

Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2145–2158. Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018). https://www.aclweb.org/anthology/C18-1182

25.

Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 1, 2020. https://doi.org/10.1109/TKDE.2020.2981314

26.

Weld, H., Huang, X., Long, S., Poon, J., Han, S.C.: A survey of joint intent detection and slot-filling models in natural language understanding. arXiv preprint arXiv:2101.08091 (2021)

27.

Subramani, N., Matton, A., Greaves, M., Lam, A.: A Survey of Deep Learning Approaches for OCR and Document Understanding (2021)

28.

Jiang, H., Bao, Q., Cheng, Q., Yang, D., Wang, L., Xiao, Y.: Complex relation extraction: Challenges and opportunities. arXiv preprint arXiv:2012.04821 (2020)

29.

Sahin, G.G., Emekligil, E., Arslan, S., Ağın, O., Eryiğit, G.: Relation extraction via one-shot dependency parsing on intersentential, higher-order, and nested relations. Turk. J. Electr. Eng. Comput. Sci. 26(2), 830–843 (2018)CrossRef

30.

Oral, B., Emekligil, E., Arslan, S., Eryiğit, G.: Extracting complex relations from banking documents. In: Proceedings of the Second Workshop on Economics and Natural Language Processing, pp. 1–9. Association for Computational Linguistics, Hong Kong (2019). https://doi.org/10.18653/v1/D19-5101

31.

R, A., Kuanr, A., KR, S.: Developing banking intelligence in emerging markets: systematic review and agenda. Int. J. Inf. Manag. Data Insights 1(2), 100026 (2021). https://doi.org/10.1016/j.jjimei.2021.100026

32.

Yu, W., Lu, N., Qi, X., Gong, P., Xiao, R.: PICK: Processing key information extraction from documents using improved graph learning-convolutional networks. In: 2020 25th International Conference on Pattern Recognition (ICPR) (2020)

33.

Bach, N., Badaskar, S.: A review of relation extraction. Lit. Rev. Lang. Stat. II(2), 1–15 (2007)

34.

Riedel, S., Yao, L., McCallum, A.: Modeling relations and their mentions without labeled text. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 148–163. Springer, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15939-8_10

35.

McDonald, R., Pereira, F., Kulick, S., Winters, S., Jin, Y., White, P.: Simple algorithms for complex relation extraction with applications to biomedical IE. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pp. 491–498. Association for Computational Linguistics, Ann Arbor, MI (2005). https://doi.org/10.3115/1219840.1219901

36.

Peng, N., Poon, H., Quirk, C., Toutanova, K., Yih, W.T.: Cross-sentence n-ary relation extraction with graph lstms. Trans. Assoc. Comput. Linguist. 5, 101–115 (2017)CrossRef

37.

Jia, R., Wong, C., Poon, H.: Document-level n-ary relation extraction with multiscale representation learning. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3693–3704. Association for Computational Linguistics, Minneapolis, MN (2019). https://doi.org/10.18653/v1/N19-1370

38.

Song, L., Zhang, Y., Wang, Z., Gildea, D.: N-ary relation extraction using graph-state LSTM. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2226–2235. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1246

39.

Prasojo, R.E., Kacimi, M., Nutt, W.: Stuffie: Semantic tagging of unlabeled facets using fine-grained information extraction. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18, pp. 467–476. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3269206.3271812

40.

Takanobu, R., Zhang, T., Liu, J., Huang, M.: A hierarchical framework for relation extraction with reinforcement learning. Proc. AAAI Conf. Artif. Intell. 33(01), 7072–7079 (2019). https://doi.org/10.1609/aaai.v33i01.33017072CrossRef

41.

Zeng, X., Zeng, D., He, S., Liu, K., Zhao, J.: Extracting relational facts by an end-to-end neural model with copy mechanism. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 506–514. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/P18-1047

42.

Sahu, S.K., Christopoulou, F., Miwa, M., Ananiadou, S.: Inter-sentence relation extraction with document-level graph convolutional neural network. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4309–4316. Association for Computational Linguistics, Florence, Italy (2019). https://doi.org/10.18653/v1/P19-1423

43.

Xiong, L., Hu, C., Xiong, C., Campos, D., Overwijk, A.: Open domain web keyphrase extraction beyond language modeling. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5175–5184. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1521

44.

Zhang, D., Cao, R., Wu, S.: Information fusion in visual question answering: a survey. Inform. Fusion 52, 268–280 (2019). https://doi.org/10.1016/j.inffus.2019.03.005CrossRef

45.

Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. CoRR arXiv:1802.05365 (2018)

46.

Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR arXiv:1810.04805 (2018)

47.

Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for document image classification. In: 2014 22nd International Conference on Pattern Recognition, pp. 3168–3172 (2014). https://doi.org/10.1109/ICPR.2014.546

48.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013)

49.

Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 3559–3568 (2021). https://doi.org/10.1109/WACV48630.2021.00360

50.

Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018). https://doi.org/10.1109/CVPR.2018.00745

Titel: Fusion of visual representations for multimodal information extraction from unstructured transactional documents
verfasst von: Berke Oral
Gülşen Eryiğit
Publikationsdatum: 22.04.2022
Verlag: Springer Berlin Heidelberg
Erschienen in: International Journal on Document Analysis and Recognition (IJDAR) / Ausgabe 3/2022
Print ISSN: 1433-2833
Elektronische ISSN: 1433-2825
DOI: https://doi.org/10.1007/s10032-022-00399-3

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 3/2022

Retraction Note: Offline scripting-free author identification based on speeded-up robust features

Boosting modern and historical handwritten text recognition with deformable convolutions

Radical-based extract and recognition networks for Oracle character recognition

Scene text detection via decoupled feature pyramid networks

Correction to: Radical-based extract and recognition networks for Oracle character recognition

CarveNet: a channel-wise attention-based network for irregular scene text recognition

Premium Partner