Published in: International Journal on Document Analysis and Recognition (IJDAR) 3/2021

06-08-2021 | Special Issue Paper

Asking questions on handwritten document collections

Authors: Minesh Mathew, Lluis Gomez, Dimosthenis Karatzas, C. V. Jawahar


Abstract

This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations, where the answer is a short text, we aim to locate a document snippet in which the answer lies. The proposed approach works without recognizing the text in the documents. We argue that this recognition-free approach is suitable for handwritten documents and historical collections, where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers are a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network that projects both textual words and word images into a common subspace. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate the proposed approach on two new datasets: (i) HW-SQuAD, a synthetic, handwritten document image counterpart of the SQuAD 1.0 dataset, and (ii) BenthamQA, a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach that uses text recognized from the images using OCR. The datasets presented in this work are available for download at docvqa.org.
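To make the retrieval idea concrete, the sketch below illustrates recognition-free snippet scoring in a shared embedding space. It is a minimal illustration under stated assumptions, not the authors' implementation: `embed_text` and `embed_word_image` are hypothetical stand-ins for the deep embedding network that maps query words and word-image crops into the common subspace, and the best-match aggregation over query words is just one simple scoring choice.

```python
# Illustrative sketch (not the paper's actual pipeline): rank document
# snippets against a textual question without recognizing any text, using
# a shared word-image/text embedding space. The two encoders passed in,
# embed_text(word) and embed_word_image(crop), are hypothetical stand-ins
# for a deep word-spotting embedding network.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score_snippet(query_words, snippet_word_images, embed_text, embed_word_image):
    """Score one snippet against the question.

    Each query word is matched to its most similar word image in the
    snippet; the snippet score is the mean of these best-match similarities.
    """
    q_vecs = [embed_text(w) for w in query_words]
    s_vecs = [embed_word_image(img) for img in snippet_word_images]
    if not q_vecs or not s_vecs:
        return 0.0  # degenerate snippet or empty query
    per_word_best = [max(cosine(q, s) for s in s_vecs) for q in q_vecs]
    return sum(per_word_best) / len(per_word_best)

def retrieve(question_words, snippets, embed_text, embed_word_image, top_k=5):
    """Rank snippets and return (score, index) pairs for the top-k candidates."""
    scored = [(score_snippet(question_words, s, embed_text, embed_word_image), i)
              for i, s in enumerate(snippets)]
    return sorted(scored, reverse=True)[:top_k]
```

Because both the question words and the word images live in the same subspace, the comparison never requires transcribing the handwriting; the returned snippets themselves serve as image-form answers for a human reader.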


Metadata

Publisher: Springer Berlin Heidelberg
Print ISSN: 1433-2833
Electronic ISSN: 1433-2825
DOI: https://doi.org/10.1007/s10032-021-00383-3
