2020 | OriginalPaper | Chapter

Multimodal Deep Networks for Text and Image-Based Document Classification

Authors: Nicolas Audebert, Catherine Herold, Kuider Slimani, Cédric Vidal

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer International Publishing

Abstract

Classification of document images is a critical step for accelerating the archival of old manuscripts, online subscriptions, and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based on their visual appearance. However, the fine-grained classification required in real-world settings cannot be achieved by visual analysis alone. Often, the relevant information lies in the actual text content of the document, although this text is not available in digital form. In this work, we introduce a novel pipeline based on off-the-shelf architectures that performs document classification by taking into account both text and visual information. We design a multimodal neural network that is able to learn from both the image and word embeddings computed on noisy text extracted by OCR. We show that this approach improves single-modality classification accuracy by several points on the small Tobacco3482 and large RVL-CDIP datasets, even without clean text information. We release a post-OCR text classification dataset (https://github.com/Quicksign/ocrized-text-dataset) that complements the Tobacco3482 and RVL-CDIP datasets to encourage researchers to look into multimodal text/image classification.
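To make the idea of combining visual features with word embeddings of noisy OCR text concrete, here is a minimal, dependency-free sketch. It is not the network described in the chapter: the abstract does not say how the two modalities are fused, so the sketch assumes simple concatenation of an image feature vector with the mean embedding of the recognized words, followed by one linear layer and a softmax. All names (`mean_embedding`, `classify`), the toy vocabulary, and the hand-picked weights are hypothetical; in practice the image features would come from a convolutional network and the weights would be learned.

```python
import math

def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of tokens found in the vocabulary.

    Tokens garbled by OCR (e.g. "t0tal") are simply skipped here;
    this is an illustrative assumption, not the chapter's method.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(image_features, tokens, embeddings, weights, biases):
    """Fuse both modalities by concatenation, then apply a linear classifier."""
    dim = len(next(iter(embeddings.values())))
    text_features = mean_embedding(tokens, embeddings, dim)
    fused = image_features + text_features  # list concatenation
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Toy inputs: 2-d word vectors, a 2-d image descriptor, two classes.
embeddings = {"invoice": [1.0, 0.0], "total": [0.8, 0.2], "dear": [0.0, 1.0]}
image_features = [0.5, 0.5]                # uninformative visual cue
weights = [[0.1, 0.1, 2.0, -2.0],          # class 0 ("invoice")
           [0.1, 0.1, -2.0, 2.0]]          # class 1 ("letter")
biases = [0.0, 0.0]

ocr_tokens = ["invoice", "t0tal", "total"]  # "t0tal" is an OCR error
probs = classify(image_features, ocr_tokens, embeddings, weights, biases)
print(probs)
```

In this toy setup the image descriptor alone cannot separate the two classes, while the surviving OCR tokens push the decision toward the first class, which mirrors the abstract's claim that text can help even when it is noisy.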

Based on the Wikipedia 2014 + Gigaword 5 datasets.

https://www.industrydocuments.ucsf.edu/tobacco/.

https://github.com/tesseract-ocr/tesseract/.

The QS-OCR dataset is available at: https://github.com/Quicksign/ocrized-text-dataset.

Hyperparameters are manually tuned on a small validation set.

https://github.com/wolfgarbe/SymSpell.

https://sites.google.com/view/icdar2019-postcorrectionocr.

Title: Multimodal Deep Networks for Text and Image-Based Document Classification
Authors: Nicolas Audebert, Catherine Herold, Kuider Slimani, Cédric Vidal
Publisher: Springer International Publishing
Book: Machine Learning and Knowledge Discovery in Databases
Print ISBN: 978-3-030-43822-7

Electronic ISBN: 978-3-030-43823-4

Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-43823-4_35
