2021 | Book

Linking Theory and Practice of Digital Libraries

25th International Conference on Theory and Practice of Digital Libraries, TPDL 2021, Virtual Event, September 13–17, 2021, Proceedings

About this book

This book constitutes the proceedings of the 25th International Conference on Theory and Practice of Digital Libraries, TPDL 2021, held in September 2021. Due to the COVID-19 pandemic, the conference was held virtually.

The 10 full papers, 3 short papers, and 13 other papers presented were carefully reviewed and selected from 53 submissions. TPDL 2021 aims to facilitate connections and convergences between diverse research communities, such as Digital Humanities and Information Sciences, that could benefit from the ecosystems offered by digital libraries and repositories. This edition of TPDL was held under the general theme of “Linking Theory and Practice”. The papers are organized in topical sections as follows: Document and Text Analysis; Data Repositories and Archives; Linked Data and Open Data; User Interfaces and Experience.

Table of Contents

Frontmatter

Document and Text Analysis

Frontmatter
FETD²: A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings
Abstract
Efforts by national libraries, institutions, and (inter-)national projects have led to increased activity in preserving textual contents, including non-digitally born data, for future generations. These activities have resulted in novel initiatives to preserve cultural heritage through digitization. However, a systematic approach toward Textual Data Denoising (TD²) is still in its infancy and is commonly limited to a primarily dominant language (mostly English), whereas digital preservation requires a universal approach. To this end, we introduce a “Framework for Enabling Textual Data Denoising via robust contextual embeddings” (FETD²). FETD² improves data quality by training language-specific data denoising models on small amounts of language-specific training data. Our approach employs bi-directional language modeling to produce noise-resilient deep contextualized embeddings. In experiments we show its superiority over the state of the art.
Govind, Céline Alec, Jean-Luc Manguin, Marc Spaniol
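
The bi-directional language modeling behind such denoising can be illustrated with an off-the-shelf masked language model: mask a suspect (e.g. OCR-garbled) token and let left and right context propose replacements. A minimal sketch, assuming the Hugging Face transformers library; the model choice and example sentence are illustrative, not the authors' FETD² implementation:

```python
# Minimal sketch of bi-directional (masked) LM prediction for text denoising.
# NOT the authors' FETD^2 model; any multilingual masked LM would do here.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# A sentence whose garbled OCR token is replaced by the mask token: the model
# proposes replacements using context from BOTH directions.
noisy = "The ship arrived at the [MASK] early in the morning."
for candidate in fill(noisy, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```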
Minimalist Fitted Bayesian Classifier Based on Likelihood Estimations and Bag-of-Words
Abstract
The expansion of institutional repositories involves new challenges for autonomous agents that control the quality of semantic annotations in large amounts of scholarly knowledge. While evaluating metadata integrity in documents has already been widely tackled in the literature, the majority of frameworks are intractable when confronted with a big-data environment. In this paper, we propose an optimal strategy based on feature engineering to identify spurious objects in large academic repositories. Through an application case dealing with a Brazilian institutional repository containing objects such as PhD theses and MSc dissertations, we use maximum likelihood estimations and bag-of-words techniques to fit a minimalist Bayesian classifier that can quickly detect inconsistencies in class assertions with approximately 94% accuracy.
Jean-Rémi Bourguet, Wesley Silva, Elias de Oliveira
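
A minimalist bag-of-words Bayesian classifier of the kind the abstract describes can be sketched in a few lines with scikit-learn; the data below is a toy stand-in, not the Brazilian repository or the feature engineering used in the paper:

```python
# Sketch: bag-of-words features + multinomial Naive Bayes (maximum likelihood
# with Laplace smoothing) to spot suspicious class assertions. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

titles = [
    "PhD thesis on distributed systems",
    "MSc dissertation about soil chemistry",
    "Doctoral thesis in applied physics",
    "Master's dissertation on urban mobility",
]
labels = ["thesis", "dissertation", "thesis", "dissertation"]

vectorizer = CountVectorizer()                 # bag-of-words features
X = vectorizer.fit_transform(titles)
clf = MultinomialNB(alpha=1.0).fit(X, labels)

# A low predicted probability for the asserted class flags an inconsistency.
probs = clf.predict_proba(vectorizer.transform(["Doctoral thesis on glaciers"]))
print(dict(zip(clf.classes_, probs[0].round(3))))
```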
Inventory and Content Separation in Grammatical Descriptions of Languages of the World
Abstract
Grammatical descriptions of languages of the world form a sub-genre of scholarly documents in the field of linguistics. A document of this genre may be modeled as a concatenation of table of contents, sociolinguistic description, phonological description, morphosyntactic description, comparative remarks, lexicon, text, bibliography and index (where the morphosyntactic description is the only mandatory section). Separation of these parts is useful for information extraction, bibliometrics and information content analysis. Using a collection of over 10 000 digitized grammatical descriptions and an associated bibliography with document-level categorizations, we show that standard techniques from text classification can be adapted to classify individual pages. Assuming that the divisions of interest form continuous page ranges, we can achieve the sought-after division in a transparent way. In contrast to previous work on similar tasks in other domains, no use is made of formatting cues, no additional annotated data is needed, high-quality OCR is not required, and the document collection is highly multilingual.
Harald Hammarström
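
The assumption that each division occupies a continuous page range turns per-page classification into a segmentation problem: choose cut points so the sections appear once, in order, with maximal total classifier score. A minimal dynamic-programming sketch under that assumption (the scores are hypothetical, not the paper's classifier output):

```python
# Sketch: given per-page scores for sections in a fixed order, choose cut
# points so each section is one contiguous page range with maximal total score.
def segment(scores, n_sections):
    """scores[p][s] = classifier score of page p for section s (sections ordered)."""
    n_pages = len(scores)
    NEG = float("-inf")
    best = [[NEG] * n_sections for _ in range(n_pages)]
    back = [[0] * n_sections for _ in range(n_pages)]
    best[0][0] = scores[0][0]                      # page 0 starts section 0
    for p in range(1, n_pages):
        for s in range(n_sections):
            stay = best[p - 1][s]                  # continue current section
            move = best[p - 1][s - 1] if s > 0 else NEG  # start next section
            best[p][s] = max(stay, move) + scores[p][s]
            back[p][s] = s if stay >= move else s - 1
    # Trace back the section label of every page.
    s, labels = n_sections - 1, []
    for p in range(n_pages - 1, -1, -1):
        labels.append(s)
        s = back[p][s]
    return labels[::-1]

print(segment([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]], 2))  # -> [0, 0, 1]
```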
An Empirical Study of Span Modeling in Science NER
Abstract
Little evaluation has been performed on the many modeling options for span-based approaches. This paper investigates the performance of a wide range of span and context representation methods and their combinations, with a focus on scientific named entity recognition (science NER). While some of the most common classical span encodings and their combinations prove effective, few conclusions can be drawn about context representations.
Xiaorui Jiang
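
A typical span-based NER setup enumerates all spans up to a maximum width and encodes each from its token embeddings. As a rough illustration of just one of the classical encodings the paper compares, the boundary (endpoint-concatenation) encoding with stand-in embeddings:

```python
# Sketch of one classical span encoding: concatenate the embeddings of a
# span's first and last tokens. Random stand-in embeddings, toy sentence.
import numpy as np

tokens = ["graphene", "oxide", "nanosheets", "were", "synthesized"]
emb = np.random.default_rng(0).normal(size=(len(tokens), 4))  # encoder output stand-in

MAX_WIDTH = 3
spans = []
for i in range(len(tokens)):
    for j in range(i, min(i + MAX_WIDTH, len(tokens))):
        rep = np.concatenate([emb[i], emb[j]])   # boundary (endpoint) encoding
        spans.append(((i, j), rep))

print(len(spans), "candidate spans; first:", spans[0][0], spans[0][1].shape)
```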
Terminology/Keyphrase Extraction for Creation of Book Indexes in Polish
Abstract
The paper addresses the problem of automatically identifying phrases to be included in back-of-book indexes. We analyzed books in Polish and English published with subject indexes compiled by their authors. We checked what kinds of phrases are placed in those indexes and how often they actually occur in the corresponding books. In the experiments, we use existing terminology extraction and keyphrase extraction tools. For Polish, the terminology extraction tool performs better than the keyphrase extraction tool, but for English texts the results are inconclusive.
Małgorzata Marciniak, Agnieszka Mykowiecka, Piotr Rychlik
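
Neither of the evaluated tools is reproduced here; as a rough illustration of the candidate-extraction step that terminology and keyphrase extractors build on, one can collect stopword-filtered n-grams and rank them by frequency in the book:

```python
# Naive index-candidate sketch: rank stopword-filtered n-grams by frequency.
# Toy stopword list and text; real tools add POS patterns, lemmatization, etc.
from collections import Counter

STOP = {"the", "of", "a", "an", "and", "in", "is", "to", "are"}

def candidates(text, max_n=3):
    words = [w.lower().strip(".,;") for w in text.split()]
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in STOP and gram[-1] not in STOP:
                grams[" ".join(gram)] += 1
    return grams.most_common(5)

print(candidates("The back-of-book index lists terms. An index entry points to pages."))
```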
Token-Level Multilingual Epidemic Dataset for Event Extraction
Abstract
In this paper, we present a dataset and a baseline evaluation for multilingual epidemic event extraction. We experiment with a multilingual news dataset that we annotate at the token level, following a tagging scheme commonly used in event extraction systems. We approach the task of extracting epidemic events by first detecting the relevant documents in a large collection of news reports. Event extraction (disease names and locations) is then performed on the detected relevant documents. Preliminary experiments with the entire dataset and with ground-truth relevant documents show promising results, while also establishing a stronger baseline for epidemiological event extraction.
Stephen Mutuvi, Emanuela Boros, Antoine Doucet, Gaël Lejeune, Adam Jatowt, Moses Odeo
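
Token-level annotation for event extraction typically follows a BIO-style tagging scheme, where each token carries a begin/inside/outside label for an event slot. A sketch of what one annotated sentence might look like (hypothetical example and label names, not taken from the released dataset):

```python
# Sketch of token-level BIO annotation for epidemic events: disease mentions
# and locations. Hypothetical sentence; the dataset defines its own label set.
tokens = ["Cholera", "outbreak", "reported", "in", "Port-au-Prince", "."]
tags   = ["B-DIS",   "O",        "O",        "O",  "B-LOC",          "O"]

for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```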

Open Access

A Semantic Search Engine for Historical Handwritten Document Images
Abstract
A very large number of historical manuscript collections are available only in image formats and require extensive manual processing before they can be searched. We therefore propose and build a search engine for automatically storing, indexing and efficiently searching manuscript images. First, a handwritten text recognition technique is used to convert the images into textual representations. We then apply named entity recognition and a historical knowledge graph to build a semantic search model, which can understand the user’s intent in the query and the contextual meaning of concepts in documents, and correctly return the transcriptions and their corresponding images to users.
Vuong M. Ngo, Gary Munnelly, Fabrizio Orlandi, Peter Crooks, Declan O’Sullivan, Owen Conlan
Temporal Analysis of Worldwide War
Abstract
In this paper, we study the wars fought in history and draw conclusions by analysing a curated temporal multi-graph. We explore the participation of countries in wars and the nature of relationships between various countries during different timelines. This study also attempts to shed light on different countries’ exposure to terrorist encounters.
Devansh Bajpai, Rishi Ranjan Singh

Data Repositories and Archives

Frontmatter
Where Did the Web Archive Go?
Abstract
To perform a longitudinal investigation of web archives and detect variations and changes in the replay of individual archived pages, or mementos, we created a sample of 16,627 mementos from 17 public web archives. Over the course of our 14-month study (November 2017 to January 2019), we found that four web archives changed their base URIs and did not leave a machine-readable method of locating their new base URIs, necessitating manual rediscovery. Of the 1,981 mementos in our sample from these four web archives, 537 were impacted: 517 mementos were rediscovered but with changes in their time of archiving (or Memento-Datetime), HTTP status code, or the string comprising their original URI (or URI-R), and 20 of the mementos could not be found at all.
Mohamed Aturban, Michael L. Nelson, Michele C. Weigle
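
The Memento-Datetime the study tracks is an HTTP response header defined by the Memento protocol (RFC 7089), so it can be read directly from a replayed page. A minimal check; the memento URI below is illustrative, and any Memento-compliant public web archive behaves similarly:

```python
# Sketch: read the Memento-Datetime header of an archived page (RFC 7089).
# The memento URI is illustrative; replay may involve redirects, hence follow them.
import requests

uri_m = "https://web.archive.org/web/20170101000000/https://example.com/"
resp = requests.head(uri_m, allow_redirects=True, timeout=30)
print(resp.status_code, resp.headers.get("Memento-Datetime"))
```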
What’s Data Got to Do with It? An Agenda for a New Generation of Digital Libraries
Abstract
Digital libraries have matured rapidly in recent years: practical large-scale libraries are now ubiquitous, and many fundamental problems are resolved. This paper addresses the future of one area of DL theory and practice, identifying common requirements and needs found in contemporary DLs. It shows that a new wave of research and engineering problems needs to be solved, and that corresponding theories and principles need to be developed. We draw on both the current literature and four ongoing data DL projects to demonstrate the next generation of data DL systems, and where new DL theory is needed.
George Buchanan, Dana McKay, David Bainbridge
Semantic Tagging via Entity-Level Analytics: Assessment of Concise Content Tagging
Abstract
Digital curation requires substantial human expertise in order to achieve and maintain document collections of high quality. It usually necessitates the expert knowledge of a librarian or curator to interpret the content and categorize it accordingly, a process that is both expensive and time-consuming. With the advent of knowledge bases and the plenitude of information contained within them, new opportunities emerge on the horizon. In particular, entity-level analytics allows contents to be semantically enriched via linked open data (LOD). To this end, we assess in this paper the approach of concise content annotation as a means of supporting the process of digital curation. In particular, we compare various entity-level annotation methods and highlight the importance of concise semantic tagging based on qualitative as well as quantitative evaluations.
Amit Kumar, Marc Spaniol
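
Entity-level analytics of this kind starts from linking text to LOD resources. As a rough illustration (a stand-in, not necessarily one of the annotation methods the paper compares), the public DBpedia Spotlight service links surface forms in a text to DBpedia URIs:

```python
# Sketch: entity-level annotation via the public DBpedia Spotlight endpoint,
# used here only as a stand-in for the entity annotators compared in the paper.
import requests

resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "The Louvre in Paris exhibits the Mona Lisa.", "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=30,
)
for res in resp.json().get("Resources", []):
    print(res["@surfaceForm"], "->", res["@URI"])  # links into the LOD cloud
```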
Automating the Selection of Emulated Rendering Environments for Born-Digital Data-Sets
Abstract
Digital and born-digital collections in libraries and archives are growing, and institutions cannot afford to manually process and make accessible the resulting backlogs of unprocessed, inaccessible digital content; automation is needed. Tool support is required both to help users create useful setups covering a relevant set of objects and to help them choose from a list of available setups for a given object. In this paper, we propose a method based on the co-occurrence of file formats to automate the selection of ready-made software setups for a given artifact to be accessed through emulation.
Julian Giessl, Rafael Gieschke, Klaus Rechert, Euan Cochrane
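
The co-occurrence signal the method relies on can be computed simply: count which file formats appear together across known datasets, then prefer the ready-made setup whose supported formats best cover a new artifact. A toy sketch (format names and setups are invented for illustration):

```python
# Sketch: file-format co-occurrence counts and setup selection. Toy data only.
from collections import Counter
from itertools import combinations

datasets = [{"doc", "xls", "mdb"}, {"doc", "xls"}, {"psd", "tif"}]
cooc = Counter()
for formats in datasets:
    cooc.update(combinations(sorted(formats), 2))
print(cooc.most_common(2))  # e.g. ("doc", "xls") co-occur most often

# Choose the ready-made setup covering most of a new artifact's formats.
setups = {"Win98+Office": {"doc", "xls", "mdb"}, "MacOS9+Photoshop": {"psd", "tif"}}
artifact = {"doc", "mdb"}
best = max(setups, key=lambda s: len(setups[s] & artifact))
print(best)
```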
Colabo.Space - Participatory Platform for Evolving Research and Publishing Workflows
Abstract
We explore and evaluate the Colabo.Space ecosystem as a basis for conducting literary (and, by extension, other) research. The key principle of the ecosystem is to support participatory design at each stage to enable visual, declarative and co-creative design and evolution of the ecosystem, its infrastructure, data types and contained (research) knowledge.
Accompanied by specialized platforms, it supports describing research workflows; collecting data; distant-reading research; and publishing and visualizing the findings.
We argue for supporting the continuous evolution of the ecosystem, its findings and its published content, with reference to the ecosystem and its publishing process (both the content and the metadata).
We evaluate the constituting components of the Colabo.Space ecosystem within three collaborative research projects.
Sasha Mile Rudan, Sinisha Rudan, Eugenia Kelbert, Andrija Sagic, Lazar Kovacevic, Matthew Reynolds
How Can an Archive Be Characterized?
Abstract
Archives are evolving. Analog archives are becoming increasingly digitized and linked with other cultural heritage institutions and information sources, and diverse forms of born-digital archives are appearing. This diversity calls for systematic ways to characterize existing archives managing physical or digital records. We conducted a systematic review to identify and understand how archives are characterized. Of the 885 identified articles, only 15 focused on the characterization of archives and were therefore included in the study. We found several characterization features, organized into three main groups: archival materials, provided services, and internal processes.
Marta Faria Araújo, Carla Teixeira Lopes
Visualizing Copyright-Protected Video Archive Content Through Similarity Search
Abstract
Providing access to protected media archives can be difficult due to licensing restrictions. In this paper, an alternative way to examine video content without violating terms of use is proposed. For this purpose, keyframes of the original, archived videos are replaced with images from publicly available sources using person recognition and visual similarity search for scenes and locations.
Kader Pustu-Iren, Eric Müller-Budack, Sherzod Hakimov, Ralph Ewerth
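
The visual similarity search behind the replacement step amounts to nearest-neighbour retrieval over image feature vectors. A minimal sketch with random stand-in features (the paper uses learned person, scene and location representations, not reproduced here):

```python
# Sketch: replace each protected keyframe by its most similar public image,
# using cosine similarity over feature vectors (random stand-ins here).
import numpy as np

rng = np.random.default_rng(1)
archive_feats = rng.normal(size=(5, 128))   # protected keyframes
public_feats = rng.normal(size=(100, 128))  # openly licensed images

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sims = normalize(archive_feats) @ normalize(public_feats).T
nearest = sims.argmax(axis=1)
print(nearest)  # index of the public stand-in image for each keyframe
```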

Open Access

Self-assessment and Monitoring of CHI Performance in Digital Transformation
Abstract
To fully reap the benefits of digitisation and sustainably create value for their audiences, cultural heritage institutions (CHIs) need to implement and monitor digital, data-driven strategies that touch upon all aspects of how organisations operate. This can range from staffing and skills development to the adoption of metadata models, novel audience engagement approaches and methods for collecting and using user data. We introduce the concept of the CHI Self-Assessment Tool, which enables institutions to assess their strategies and plans against several aspects of digital transformation. The tool embodies a novel approach by which CHIs can continuously gather data on their activities and use insights from this data to adjust their strategies and increase their digital maturity. Equally, this data can be used by policy-makers to implement more effective policies and support the sector with targeted capacity building.
Rasa Bocyte, Johan Oomen, Fred Truyen
Automatic Translation and Multilingual Cultural Heritage Retrieval: A Case Study with Transcriptions in Europeana
Abstract
Multilinguality is of particular interest for digital libraries in Cultural Heritage (CH), where the language of the data may not match users’ languages. However, multilingual access is rarely implemented beyond the use of multilingual interfaces. We have run an experiment using the Europeana CH digital library as a use case, evaluating the effectiveness of a multilingual information retrieval strategy that uses machine translation to English as a pivot language. We conducted an indirect evaluation that should be considered preliminary. Yet, together with a manual analysis of the query translations, it already shows (or confirms) some of the benefits and challenges of deploying such systems in CH.
Mónica Marrero, Antoine Isaac, Nuno Freire
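
Pivot-language retrieval of this kind can be sketched as: translate the user query to English, then search the English translations of the records. In the toy sketch below, the translate helper is a hypothetical stand-in for any machine translation service; Europeana's actual pipeline is not reproduced:

```python
# Sketch of pivot-language retrieval: query -> English -> search English index.
# `translate` is a hypothetical stand-in for a machine translation service.
def translate(text, target="en"):
    return {"gemälde": "painting"}.get(text.lower(), text)  # toy dictionary MT

index = {1: "oil painting of a windmill", 2: "photograph of a windmill"}

def search(query):
    q = translate(query)  # pivot to English
    return [doc_id for doc_id, text in index.items() if q in text]

print(search("Gemälde"))  # -> [1]
```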

Linked Data and Open Data

Frontmatter
Leveraging a Federation of Knowledge Graphs to Improve Faceted Search in Digital Libraries
Abstract
Scientists always look for the most accurate and relevant answers to their queries in the literature. Traditional scholarly digital libraries list documents in search results and are therefore unable to provide precise answers to search queries; search in digital libraries is metadata search and, where available, full-text search. We present a methodology for improving a faceted search system on structured content by leveraging a federation of scholarly knowledge graphs, and we implement the methodology on top of a scholarly knowledge graph. The search system can leverage content from third-party knowledge graphs to improve the exploration of scholarly content. A novelty of our approach is that we use dynamic facets on diverse data types, meaning that the facets can change according to the user query; the user can also adjust the granularity of the dynamic facets.
Golsa Heidari, Ahmad Ramadan, Markus Stocker, Sören Auer
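
Dynamic facets can be derived at query time by asking the knowledge graph which properties (and how many distinct values) the matching papers actually have. A hedged SPARQL sketch; the endpoint URL and schema below are hypothetical placeholders, not the graph actually used in the paper:

```python
# Sketch: derive dynamic facets from a scholarly knowledge graph with SPARQL.
# Endpoint URL and schema are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/scholarly-kg/sparql")
sparql.setQuery("""
    PREFIX : <https://example.org/schema#>
    SELECT ?property (COUNT(DISTINCT ?value) AS ?n) WHERE {
        ?paper a :Paper ;
               :about :ReproductionNumber ;   # papers matching the user query
               ?property ?value .
    } GROUP BY ?property ORDER BY DESC(?n)
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["property"]["value"], row["n"]["value"])  # candidate facets
```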
A Comprehensive Extraction of Relevant Real-World-Event Qualifiers for Semantic Search Engines
Abstract
In this paper, we present an efficient and accurate method to represent events drawn from numerous public sources, such as Wikidata or more specific knowledge bases. We focus on events happening in the real world, such as festivals or assassinations. Our method merges knowledge from Wikidata and Wikipedia article summaries to gather the entities involved in events, as well as their dates, types and labels. This event characterization procedure is extended by including vernacular languages. Our method is evaluated in a comparative experiment on two datasets, which shows that events are represented more accurately and exhaustively with vernacular languages. This can help extend research that mainly exploits hub languages, or the largest language editions of Wikipedia. The method and the tool we release can, for instance, enhance event-centered semantic search engines, a context in which we already use them. An additional contribution of this paper is the public release of the source code of the tool, as well as the corresponding datasets.
Guillaume Bernard, Cyrille Suire, Cyril Faucher, Antoine Doucet
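
Gathering event entities with their dates and multilingual labels from Wikidata can be done over its public SPARQL endpoint. A minimal sketch; the class QID and the choice of languages are illustrative assumptions, not the paper's actual extraction queries:

```python
# Sketch: fetch events with dates and vernacular labels from Wikidata.
# Q1190554 ("occurrence") and the language list are illustrative choices.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
    SELECT ?event ?date ?label WHERE {
      ?event wdt:P31 wd:Q1190554 ;      # instance of: occurrence
             wdt:P585 ?date ;           # point in time
             rdfs:label ?label .
      FILTER(LANG(?label) IN ("en", "fr", "sw"))
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["date"]["value"], row["label"]["value"])
```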
Citation Recommendation for Research Papers via Knowledge Graphs
Abstract
Citation recommendation for research papers is a valuable task that can help researchers improve the quality of their work by suggesting relevant related work. Current approaches for this task rely primarily on the text of the papers and the citation network. In this paper, we propose to exploit an additional source of information, namely research knowledge graphs (KGs) that interlink research papers based on mentioned scientific concepts. Our experimental results demonstrate that the combination of information from research KGs with existing state-of-the-art approaches is beneficial. Experimental results are presented for the STM-KG (STM: Science, Technology, Medicine), which is an automatically populated knowledge graph based on the scientific concepts extracted from papers of ten domains. The proposed approach outperforms the state of the art with a mean average precision of 20.6% (+0.8) for the top-50 retrieved results.
Arthur Brack, Anett Hoppe, Ralph Ewerth
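
The headline metric, mean average precision over the top-50 retrieved results, can be computed as below; the rankings and gold citations are toy stand-ins, not the paper's data:

```python
# Sketch: mean average precision (MAP) at a cutoff, the paper's headline metric.
def average_precision(ranked, relevant, k=50):
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / i          # precision at each relevant hit
    return score / max(len(relevant), 1)

# Toy example: two queries with their ranked candidates and gold citations.
runs = [(["a", "b", "c"], {"a", "c"}), (["x", "y"], {"y"})]
print(sum(average_precision(r, g) for r, g in runs) / len(runs))
```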
AnnoTag: Concise Content Annotation via LOD Tags derived from Entity-Level Analytics
Abstract
Digital libraries build on classifying contents by capturing their semantics and (optionally) aligning the description with an underlying categorization scheme. This process is usually based on human intervention, either by the content creator or by a curator; as such, it is highly time-consuming and, thus, expensive. In order to support humans in data curation, we introduce an annotation tagging system called “AnnoTag”. AnnoTag aims at providing concise content annotations by employing entity-level analytics to derive semantic descriptions in the form of tags. In particular, we generate “Semantic LOD Tags” (linked open data) that allow the derived tags to be interlinked with the LOD cloud. Based on a qualitative evaluation on Web news articles, we demonstrate the viability of our approach and the high quality of the automatically extracted information.
Amit Kumar, Marc Spaniol
SmartReviews: Towards Human- and Machine-Actionable Reviews
Abstract
Review articles summarize state-of-the-art work and provide a means to organize the growing number of scholarly publications. However, the current review method and publication mechanisms hinder the impact review articles can potentially have. Among other limitations, reviews only provide a snapshot of the current literature and are generally not readable by machines. In this work, we identify the weaknesses of the current review method. Afterwards, we present the SmartReview approach addressing those weaknesses. The approach pushes towards semantic community-maintained review articles. At the core of our approach, knowledge graphs are employed to make articles more machine-actionable and maintainable.
Allard Oelen, Markus Stocker, Sören Auer

User Interfaces and Experience

Frontmatter
Comparing Methods for Finding Search Sessions on a Specified Topic: A Double Case Study
Abstract
Users searching for different topics in a collection may show distinct search patterns. To analyze the search behavior of users searching for a specific topic, we need to retrieve the sessions containing that topic. In this paper, we compare different topic representations and approaches to find topic-specific sessions. We conduct our research as a double case study of two topics, World War II and feminism, using the search logs of a historical newspaper collection. We evaluate the results using manually created ground truths of over 600 sessions per topic. The two case studies show similar results: the query-based methods yield high precision at the expense of recall, while the document-based methods find more sessions at the expense of precision. In both approaches, precision improves significantly when the topic representations are manually curated. This study demonstrates how different methods to find sessions containing specific topics can be applied by digital humanities scholars and practitioners.
Tessel Bogaard, Aysenur Bilgin, Jan Wielemaker, Laura Hollink, Kees Ribbens, Jacco van Ossenbruggen
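
The two families of methods compared can be sketched as follows: a query-based method keeps sessions whose queries contain topic terms, while a document-based method keeps sessions that clicked documents from a topic-specific document set. A toy sketch (invented sessions and topic representations, not the paper's curated ones):

```python
# Sketch: query-based vs. document-based retrieval of topic-specific sessions.
sessions = [
    {"queries": ["world war ii photos"], "clicked": {"d1"}},
    {"queries": ["cheap bicycles"],      "clicked": {"d2"}},
    {"queries": ["liberation 1945"],     "clicked": {"d3"}},
]
topic_terms = {"world war", "1945"}          # query-based representation
topic_docs = {"d1", "d3"}                    # document-based representation

query_based = [s for s in sessions
               if any(t in q for q in s["queries"] for t in topic_terms)]
doc_based = [s for s in sessions if s["clicked"] & topic_docs]
print(len(query_based), len(doc_based))  # -> 2 2
```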
Clustering and Classifying Users from the National Museums Liverpool Website
Abstract
Museum websites are designed to provide access for different types of users, such as museum staff, teachers and the general public. Understanding user needs and demographics is therefore paramount to the provision of user-centred features, services and design. Various approaches exist for studying and grouping users, with a more recent emphasis on data-driven and automated methods. In this paper, we investigate the user groups of a large national museum’s website, using multivariate analysis and machine learning methods to cluster and categorise users based on an existing user survey. In particular, we apply the methods to the dominant group, the general public, and show that sub-groups exist, although they share similarities with the clusters for all users. We find that the clusters categorise users better than the self-assigned groups from the survey, potentially helping museums develop new and improved services.
David Walsh, Paul Clough, Mark Michael Hall, Frank Hopfgartner, Jonathan Foster
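
Clustering survey respondents can be sketched with k-means over encoded survey answers; the features below are toy stand-ins, and the paper's multivariate analysis and survey variables are considerably richer:

```python
# Sketch: k-means clustering of website survey respondents (toy features).
import numpy as np
from sklearn.cluster import KMeans

# Rows = respondents; columns = encoded survey answers
# (e.g. visit frequency, professional use, age band).
X = np.array([[5, 1, 30], [4, 1, 35], [1, 0, 60], [0, 0, 55], [2, 0, 22]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # candidate user sub-groups
```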
Humanities Scholars and Digital Humanities Projects: Practice Barriers in Tools Usage
Abstract
Humanities scholars face many problems when trying to design, build, present, and maintain digital humanities projects. To mitigate these problems and to improve the user experience of digital humanities collections, it is essential to understand the problems in detail. However, we currently have a fragmented and incomplete picture of what these problems actually are. This study presents a broad systematic literature review (SLR) of the problems encountered by humanities scholars when adopting particular software tools in digital humanities projects. The review identifies problems across the different categories of tools used in digital humanities. The practice barriers can be divided into four types: content, technique, interface, and storage. These results paint a fuller picture of the problems in tool usage, suggest how the digital humanities discipline can further improve the application of tools, and offer developers of software designed for humanities scholars feedback that can help them optimize these tools.
Rui Liu, Dana McKay, George Buchanan
Researching Pandemics Through Time: A Covid-19 Inspired Data-Driven Approach to Explore Historical Newspapers
Abstract
Heritage institutions are exploring new ways to open up their digital collections. In this context, the KB, national library of the Netherlands, has built a data-driven demonstration website based on historical newspapers. This website centers on a topic made highly relevant by the Covid-19 crisis: pandemics. A Toolbox with Notebooks and a sample data set is provided to support students and starting researchers. This paper describes the data selection process, the functionality of the website and the corresponding Toolbox, as well as their initial reception.
Mirjam Cuper
Backmatter
Metadata
Title
Linking Theory and Practice of Digital Libraries
Editors
Gerd Berget
Mark Michael Hall
Daniel Brenn
Sanna Kumpulainen
Copyright Year
2021
Electronic ISBN
978-3-030-86324-1
Print ISBN
978-3-030-86323-4
DOI
https://doi.org/10.1007/978-3-030-86324-1
