Open Access 2021 | OriginalPaper | Chapter

A Semantic Search Engine for Historical Handwritten Document Images

Authors : Vuong M. Ngo, Gary Munnelly, Fabrizio Orlandi, Peter Crooks, Declan O’Sullivan, Owen Conlan

Published in: Linking Theory and Practice of Digital Libraries

Publisher: Springer International Publishing

Abstract

A very large number of historical manuscript collections are available in image formats and require extensive manual processing before they can be searched. We therefore propose and build a search engine that automatically stores, indexes and efficiently searches manuscript images. First, a handwritten text recognition technique converts the images into textual representations. Next, we apply named entity recognition and a historical knowledge graph to build a semantic search model that can understand the user’s intent in the query and the contextual meaning of concepts in documents, so that it correctly returns the transcriptions and their corresponding images to users.

1 Introduction

Every year, large collections of historical handwritten manuscripts in museums, libraries and other organisations are digitised as electronic images. Digitisation makes the manuscripts available to a wider audience and preserves cultural heritage. Recognising textual corpora and named entities from medieval and early-modern manuscript sources automatically and with high accuracy remains a challenge [2, 20, 22]. To make manuscript images accessible and searchable, they are often processed through keyword spotting or word recognition [4, 8, 14, 17, 18]. Several works build search systems for handwritten images [1, 5, 15, 16, 21, 23]. However, these systems offer only keyword search.

Unlike keyword search, semantic search improves search precision and recall by understanding the user’s intent and the contextual meaning of concepts in documents and queries [3, 12, 19, 24]. This paper proposes a semantic search engine for full-text retrieval of historical handwritten document images based on named entities (NEs), keywords (KWs) and a knowledge graph (KG). The engine not only processes, stores and indexes manuscripts automatically, but also allows users to access and retrieve them quickly and efficiently.

2 System Architecture

The Public Record Office of Ireland (PROI) was destroyed on 30 June 1922, resulting in the loss of 700 years of Irish history. The Beyond 2022 Project (https://beyond2022.ie) is combining historical research, archival discovery, and technical innovation to create a virtual reconstruction of the PROI. There are over 300 volumes of surviving and collected handwritten copies of lost documents, with some 100,000 pages containing 25 million words of text.

The architecture of our search engine is illustrated in Fig. 1. It has four separate processing modules: Handwritten Text Recognition, NE Recognition, KW-NE Indexing and the KW-NE-Based IR Model. First, the historical handwritten document images are converted to transcriptions by the Handwritten Text Recognition module. Then, the transcriptions are annotated with NEs by the NE Recognition module, which connects to the Knowledge Graph to extract the classes and identifiers of the NEs. Next, the KWs and NEs of the annotated transcriptions, together with the respective original images, are represented and indexed by the KW-NE Indexing module and stored in the KW-NE Annotated Text and Image Repository. The raw text query is likewise annotated with NEs by the NE Recognition module to become a KW-NE annotated query. Finally, the KW-NE-Based IR Model module compares the annotated query with the annotated documents and returns the ranked transcriptions and images.

3 Image Representation and Knowledge Graph

Transkribus [13] is used for training and deploying Handwritten Text Recognition (HTR) models to derive text transcriptions from image scans. Given the rate at which transcriptions can be generated, NE Recognition (NER) and Entity Linking (EL) are required to automatically annotate all instances of entities occurring in the transcription text. We used SpaCy [11] for NER, and it performed well on 18th-century English text. To provide flexibility, the NLP pipeline has been implemented as a thin layer over a number of standard NLP tools. The output of the pipeline is in the NLP Interchange Format [10], in which a NER tool has annotated the classes of entities and, where possible, an EL tool has linked the recognised entities to the KG.

The KG collects structured data from various historical sources. Part of the data is manually curated by historians through spreadsheets. Other data sources (e.g. geographical data from OSi [6]) are imported automatically as RDF for direct insertion into the KG. The schema (or ontology) used to structure the KG is mainly based on the popular CIDOC-CRM ontology [7]. A short excerpt of the KG is depicted in Fig. 2. It shows a few main entities and relationships related to a person (of type CIDOC-CRM:E21_Person) named “William Sutton”, who was a member of a few relevant offices in Ireland.
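As a rough illustration of how the NE Recognition module might consult the KG for an entity's identifier and class, the following minimal sketch keeps a handful of triples in memory and looks an entity up by its label. Only the class `E21_Person` and the name "William Sutton" come from the text above; the `ex:` identifiers, the `ex:memberOf` predicate, the placeholder office and the helper `lookup` are hypothetical, and a real deployment would query an RDF store rather than a Python list.

```python
from typing import NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical excerpt of the KG; identifiers and the office are illustrative.
TRIPLES = [
    Triple("ex:william_sutton", "rdf:type", "crm:E21_Person"),
    Triple("ex:william_sutton", "rdfs:label", "William Sutton"),
    Triple("ex:william_sutton", "ex:memberOf", "ex:some_office"),
    Triple("ex:some_office", "rdfs:label", "Some Office"),
]


def lookup(label: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (identifier, class) of the first entity carrying this label.

    The class is None when the entity has no rdf:type triple; the whole
    result is None when no entity has the label.
    """
    wanted = label.casefold()
    for triple in TRIPLES:
        if triple.predicate == "rdfs:label" and triple.obj.casefold() == wanted:
            entity_class = next(
                (t.obj for t in TRIPLES
                 if t.subject == triple.subject and t.predicate == "rdf:type"),
                None,
            )
            return triple.subject, entity_class
    return None


person = lookup("william sutton")
office = lookup("Some Office")
missing = lookup("Nobody")
```

Matching on a case-folded label is a deliberate simplification; it ignores spelling variants, which are common in historical sources and which an EL tool would have to handle.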

4 Information Retrieval Model and Demo

A search engine needs not only to return the best documents, but also to be fast. We implemented the index and search functions on top of Elasticsearch to obtain a real-time search engine [9]. We adopted the Okapi BM25 model to find and rank the handwritten manuscripts relevant to a query. In the model, documents and queries are represented by sets of concepts, which are NEs or KWs. Figure 3 presents an image of a handwritten medieval historical manuscript, its transcription and its concept set d, as used in the model. In the transcription, our NER tool distinguishes three kinds of words: (1) stop-words, namely the, to, of, we and you; (2) NEs, namely sheriff, Meath, clerk and William Sutton; and (3) KWs, namely king, &c, greeting, direct, pay, shilling and silver. The stop-words are not added to the concept set d.
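To show how BM25 ranks documents that are represented as concepts, here is a minimal standalone sketch of the classic Okapi BM25 formula with a non-negative IDF variant. It is not the Elasticsearch implementation. The parameter values `k1 = 1.2` and `b = 0.75` are assumptions (the paper does not state them), and the three tiny documents, including the identifier `coun_dublin`, are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        raise ValueError("at least one document is required")
    n_docs = len(docs)
    avg_len = sum(len(concepts) for concepts in docs.values()) / n_docs

    # Document frequency: in how many documents each concept occurs.
    doc_freq = Counter()
    for concepts in docs.values():
        doc_freq.update(set(concepts))

    scores = {}
    for doc_id, concepts in docs.items():
        term_freq = Counter(concepts)
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(concepts) / avg_len)
        score = 0.0
        for concept in set(query):
            tf = term_freq[concept]
            if tf == 0:
                continue
            df = doc_freq[concept]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores[doc_id] = score
    return scores


# Hypothetical concept lists; "d1" loosely follows the example of Fig. 3.
docs = {
    "d1": ["king", "occu_sheriff", "coun_meath", "pay", "occu_clerk",
           "william_sutton/person", "shilling", "silver"],
    "d2": ["king", "greeting", "coun_dublin", "pay"],
    "d3": ["occu_clerk", "direct", "silver"],
}
scores = bm25_scores(["coun_meath", "silver"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because an NE identifier such as `coun_meath` is treated as a single concept, a query for that entity matches only documents annotated with the same identifier, not every document containing the string "Meath".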

Figure 4 presents the interface of our search engine¹ and the concept sets of \(q_1\) and \(q_2\). There, coun_meath is the identifier of an entity named Meath and classed Country, as determined by our NER algorithm, while silver and shilling are keywords. To exploit the features of NEs for semantic search, a NE needs to be represented by its most specific meaning in the concept set d. That is, for each NE in the transcription:

If our NER can determine its identifier, the NE is represented by its identifier in d. For example, occu_sheriff, coun_meath and occu_clerk are the identifiers of the entities named sheriff, Meath and clerk, and are added to d.
If our NER can determine only its most specific class, the NE is represented by a combination of its name and class. For example, the entity named William Sutton does not exist in our historical KG, so its identifier cannot be extracted. However, the NER determines that its most specific class is Person, so william_sutton/person is added to d.
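The two rules above, together with the stop-word filter, can be sketched as a small function. This is an illustrative reading of the text rather than the authors' code: the tuple layout of an annotation, the helper names and the class label `Occupation` are assumptions, and the input is an abbreviated, hand-written annotation of the example in Fig. 3.

```python
from typing import Iterable, Optional, Set, Tuple

STOP_WORDS = {"the", "to", "of", "we", "you"}

# (surface form, most specific class or None, KG identifier or None)
Annotation = Tuple[str, Optional[str], Optional[str]]


def concept_for(surface: str, ne_class: Optional[str],
                identifier: Optional[str]) -> Optional[str]:
    """Map one annotated word or phrase to its concept, or None to drop it."""
    if identifier is not None:
        # Rule 1: the NE is linked to the KG, so use its identifier.
        return identifier
    if ne_class is not None:
        # Rule 2: only the class is known, so combine name and class.
        name = surface.lower().replace(" ", "_")
        return f"{name}/{ne_class.lower()}"
    # Not an NE: keep it as a keyword unless it is a stop-word.
    word = surface.lower()
    return None if word in STOP_WORDS else word


def build_concept_set(annotations: Iterable[Annotation]) -> Set[str]:
    """Build the concept set d from annotated transcription units."""
    concepts = (concept_for(*annotation) for annotation in annotations)
    return {concept for concept in concepts if concept is not None}


annotations = [
    ("the", None, None),
    ("king", None, None),
    ("to", None, None),
    ("sheriff", "Occupation", "occu_sheriff"),
    ("of", None, None),
    ("Meath", "Country", "coun_meath"),
    ("greeting", None, None),
    ("pay", None, None),
    ("clerk", "Occupation", "occu_clerk"),
    ("William Sutton", "Person", None),
    ("shilling", None, None),
    ("silver", None, None),
]
d = build_concept_set(annotations)
```

Applying the same function to an annotated query yields a concept set in the same vocabulary, which is what lets the retrieval model compare queries and documents directly.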

5 Conclusion

We proposed a novel semantic full-text search system for images of historical handwritten manuscripts. Unlike existing approaches, which use only KWs extracted from images, we exploited NEs, KWs and a KG to increase search performance. To this end, HTR and NER tools were built to recognise transcriptions and NEs from the manuscript images. In addition, to increase the precision of our NER tool, a historical KG was designed. We then implemented the index and search functions for transcriptions on top of Elasticsearch and Okapi BM25 to search images in real time. Finally, the semantic search engine was implemented and deployed.

Acknowledgment

Beyond 2022 is funded by the Government of Ireland, through the Department of Culture, Heritage and the Gaeltacht, under the Project Ireland 2040 framework. The project is also partially supported by the ADAPT Centre for Digital Content Technology under the SFI Research Centres Programme (Grant 13/RC/2106_P2).

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

¹ https://by2022.adaptcentre.ie/conf_demo

References

1. Aghbari, Z., Brook, S.: HAH manuscripts: a holistic paradigm for classifying and retrieving historical Arabic handwritten documents. Expert Syst. Appl. 36(8), 10942–10951 (2009)

2. Ahmed, R., Al-Khatib, W., Mahmoud, S.: A survey on handwritten documents word spotting. Int. J. Multimed. Inf. Retr. 6(1), 31–47 (2017). https://doi.org/10.1007/s13735-016-0110-y

3. Cao, T., Ngo, V.: Semantic search by latent ontological features. Int. J. New Gener. Comput. 30(1), 53–71 (2012). https://doi.org/10.1007/s00354-012-0104-0

4. Cheikhrouhou, A., Kessentini, Y., Kanoun, S.: Multi-task learning for simultaneous script identification and keyword spotting in document images. Pattern Recogn. 113, 107832 (2021)

5. Colutto, S., Kahle, P., Guenter, H., Muehlberger, G.: Transkribus. A platform for automated text recognition and searching of historical documents. In: Proceedings of the 15th International Conference on eScience (eScience), pp. 463–466 (2019)

6. Debruyne, C., et al.: Ireland's authoritative geospatial linked data. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10588, pp. 66–74. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68204-4_6

7. Doerr, M.: The CIDOC conceptual reference module: an ontological approach to semantic interoperability of metadata. AI Mag. 24(3), 75–92 (2003)

8. Frinken, V., Palakodety, S.: Handwritten keyword spotting in historical documents. In: Handwritten Historical Document Analysis, Recognition, and Retrieval - State of the Art and Future Trends, Series in MP&AI, vol. 89, pp. 81–99. World Scientific Publishing (2021)

9. Gheorghe, R., Hinman, M., Russo, R.: Elasticsearch in Action, 1st edn. Manning Publications Co., Shelter Island (2015)

10. Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Alani, H., et al. (eds.) ISWC 2013. LNCS, vol. 8219, pp. 98–113. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_7

11. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: SpaCy: industrial-strength natural language processing in Python (2020). https://doi.org/10.5281/zenodo.1212303

12. Jiang, Y.: Semantically-enhanced information retrieval using multiple knowledge sources. Clust. Comput. 23(4), 2925–2944 (2020). https://doi.org/10.1007/s10586-020-03057-7

13. Kahle, P., Colutto, S., Hackl, G., Mühlberger, G.: Transkribus - a service platform for transcription, recognition and retrieval of historical documents. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 04, pp. 19–24 (2017). https://doi.org/10.1109/ICDAR.2017.307

14. Kang, L., Riba, P., Villegas, M., Fornés, A., Rusiñol, M.: Candidate fusion: integrating language modelling into a sequence-to-sequence handwritten word recognition architecture. Pattern Recogn. 112, 107790 (2021)

15. Lang, E., Puigcerver, J., Toselli, A.H., Vidal, E.: Probabilistic indexing and search for information extraction on handwritten German parish records. In: Proceedings of 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 44–49 (2018)

16. Leydier, Y., Lebourgeois, F., Emptoz, H.: Text search for medieval manuscript images. Pattern Recogn. 40(12), 3552–3567 (2007)

17. Li, Z., Wu, Q., Xiao, Y., Jin, M., Lu, H.: Deep matching network for handwritten Chinese character recognition. Pattern Recogn. 107, 107471 (2020)

18. Martínek, J., Lenc, L., Král, P.: Building an efficient OCR system for historical documents with little training data. Neural Comput. Appl. 32(23), 17209–17227 (2020). https://doi.org/10.1007/s00521-020-04910-x

19. Ngo, V., Cao, T.: Discovering latent concepts and exploiting ontological features for semantic text search. In: Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP-2011), pp. 571–579. ACL (2011)

20. Nozza, D., Manchanda, P., Fersini, E., Palmonari, M., Messina, E.: LearningToAdapt with word embeddings: domain adaptation of named entity recognition systems. Inf. Process. Manag. 58(3), 102537 (2021)

21. Stauffer, M., Fischer, A., Riesen, K.: Filters for graph-based keyword spotting in historical handwritten documents. Pattern Recogn. Lett. 134, 125–134 (2020)

22. Toledo, J., Carbonell, M., Fornés, A., Lladós, J.: Information extraction from historical handwritten document images with a context-aware neural model. Pattern Recogn. 86, 27–36 (2019)

23. Vidal, E., et al.: The carabela project and manuscript collection: large-scale probabilistic indexing and content-based classification. In: The 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 85–90 (2020)

24. Wang, J., et al.: A pseudo-relevance feedback framework combining relevance matching and semantic matching for information retrieval. Inf. Process. Manag. 57(6), 102342 (2020)

Print ISBN: 978-3-030-86323-4

Electronic ISBN: 978-3-030-86324-1

Copyright Year: 2021
DOI: https://doi.org/10.1007/978-3-030-86324-1_7
