
2018 | Book

Digital Libraries for Open Knowledge

22nd International Conference on Theory and Practice of Digital Libraries, TPDL 2018, Porto, Portugal, September 10–13, 2018, Proceedings

Edited by: Eva Méndez, Fabio Crestani, Cristina Ribeiro, Gabriel David, João Correia Lopes

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 22nd International Conference on Theory and Practice of Digital Libraries, TPDL 2018, held in Porto, Portugal, in September 2018. The 51 full papers, 17 short papers, and 13 poster and tutorial papers presented in this volume were carefully reviewed and selected from 81 submissions. The general theme of TPDL 2018 was Digital Libraries for Open Knowledge. The papers cover a wide range of topics: Metadata, Entity Disambiguation, Data Management, Scholarly Communication, Digital Humanities, User Interaction, Resources, Information Extraction, Information Retrieval, and Recommendation.

Table of Contents

Frontmatter

Metadata

Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts Under Precision and Recall Constraints

Digital libraries strive to integrate automatic subject indexing methods into operative information retrieval systems, yet integration is prevented by misleading and incomplete semantic annotations. For this reason, we investigate approaches to detect documents where quality criteria are met. In contrast to mainstream methods, our approach, named Qualle, estimates quality at the document level rather than the concept level. Qualle combines different machine learning models into a deep, multi-layered regression architecture that comprises a variety of content-based indicators, in particular label set size calibration. We evaluated the approach on very short texts from law and economics, investigating the impact of different feature groups on recall estimation. Our results show that Qualle effectively determined subsets of previously unseen data where considerable gains in document-level recall can be achieved while upholding precision. Such filtering can therefore be used to control compliance with data quality standards in practice. Qualle allows trade-offs to be made between indexing quality and collection coverage, and it can complement semi-automatic indexing to process large datasets more efficiently.

Martin Toepfer, Christin Seifert
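A minimal sketch of the core idea as described above — document-level recall estimated by a regression model over content-based features such as label set size. The data, feature names, and choice of regressor are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch only: quality estimation as document-level regression,
# loosely following the idea in the abstract. Features are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row: [text length, label set size, mean confidence of assigned labels]
X_train = np.array([[120, 4, 0.8], [45, 1, 0.2], [300, 7, 0.9], [60, 2, 0.4]])
y_train = np.array([0.75, 0.20, 0.90, 0.40])  # observed document-level recall

model = GradientBoostingRegressor().fit(X_train, y_train)

X_new = np.array([[200, 5, 0.7], [30, 1, 0.1]])
predicted_recall = model.predict(X_new)

# Keep only documents expected to meet the quality standard.
threshold = 0.6
accepted = [i for i, r in enumerate(predicted_recall) if r >= threshold]
print(accepted, predicted_recall)
```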
Metadata Synthesis and Updates on Collections Harvested Using the Open Archives Initiative Protocol for Metadata Harvesting

Harvesting tasks gather information into a central repository. We studied the metadata returned by 744,179 harvesting tasks from 2,120 harvesting services in 529 harvesting rounds over a period of two years. To achieve that, we initiated nearly 1,500,000 tasks, because a significant part of the Open Archives Initiative harvesting services never worked or have ceased working, while many other services fail occasionally. We studied the synthesis (elements and verbosity of values) of the harvested metadata and how it evolved over time. We found that most services utilize almost all Dublin Core elements, but there are services with minimal descriptions. Most services have very minimal updates and, overall, the harvested metadata is slowly improving over time, with "description" and "relation" improving the most. Our results help us better understand how and when metadata are improved, and to set more realistic expectations about metadata quality when designing harvesting or information systems that rely on them.

Sarantos Kapidakis
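A minimal sketch of the kind of harvesting and element counting described above, using OAI-PMH's ListRecords verb with the oai_dc metadata prefix. The endpoint URL is a placeholder, and a production harvester would also follow resumption tokens and handle the failing services the paper reports:

```python
# Minimal OAI-PMH ListRecords request and Dublin Core element count.
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

url = "https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"
with urllib.request.urlopen(url, timeout=30) as response:
    root = ET.fromstring(response.read())

element_counts = Counter()
for record in root.iter(OAI + "record"):
    for field in record.iter():
        if field.tag.startswith(DC):
            element_counts[field.tag[len(DC):]] += 1

# Which Dublin Core elements (title, description, relation, ...) are used?
print(element_counts.most_common())
```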
Metadata Enrichment of Multi-disciplinary Digital Library: A Semantic-Based Approach

In scientific digital libraries, papers from different research communities can be described by community-dependent keywords even if they share a semantically similar topic. Articles that are not tagged with enough keyword variations are poorly indexed in any information retrieval system, which limits potentially fruitful exchanges between scientific disciplines. In this paper, we introduce a novel, experimentally designed pipeline for multi-label semantic-based tagging developed for open-access metadata digital libraries. The approach starts by learning from a standard scientific categorization and a sample of topic-tagged articles to find semantically relevant articles and enrich their metadata accordingly. Our proposed pipeline aims to enable researchers to reach articles from various disciplines that tend to use different terminologies. It allows retrieving semantically relevant articles given a limited known variation of search terms. In addition to achieving higher accuracy than an expanded-query method based on a topic synonym set extracted from a semantic network, our experiments also show better computational scalability than other comparable techniques. We created a new benchmark extracted from the open-access metadata of a scientific digital library and published it along with the experiment code to allow further research on the topic.

Hussein T. Al-Natsheh, Lucie Martinet, Fabrice Muhlenbach, Fabien Rico, Djamel Abdelkader Zighed
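The paper's pipeline is not reproduced in this abstract; the sketch below only illustrates the general tag-propagation idea (find semantically similar tagged articles and copy their topic tags), with TF-IDF similarity standing in for the paper's semantic representation and all data invented:

```python
# Sketch of semantic tag propagation (illustrative, not the authors' pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tagged = {"deep learning for image recognition": ["machine learning"],
          "convolutional networks in vision": ["machine learning"]}
untagged = ["learning image features with deep networks",
            "medieval trade routes in europe"]

corpus = list(tagged) + untagged
vectors = TfidfVectorizer().fit_transform(corpus)
sims = cosine_similarity(vectors[len(tagged):], vectors[:len(tagged)])

for i, doc in enumerate(untagged):
    j = sims[i].argmax()
    if sims[i, j] > 0.2:  # similarity threshold is an assumption
        print(doc, "->", list(tagged.values())[j])
```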

Entity Disambiguation

Harnessing Historical Corrections to Build Test Collections for Named Entity Disambiguation

Matching mentions of persons to the actual persons (the name disambiguation problem) is central to many digital library applications. Scientists have been working on algorithms to create this matching for decades without finding a universal solution. One obstacle is that test collections for this problem are often small and specific to a certain collection. In this work, we present an approach that can create large test collections from historical metadata with minimal extra cost. We apply this approach to the dblp collection to generate two freely available test collections. One focuses on the properties of name-related defects (such as similarities of synonymous names) and the other on the evaluation of disambiguation algorithms.

Florian Reitz
Homonym Detection in Curated Bibliographies: Learning from dblp’s Experience

Identifying (and fixing) homonymous and synonymous author profiles is one of the major tasks of curating personalized bibliographic metadata repositories like the dblp computer science bibliography. In this paper, we present a machine learning approach to identify homonymous profiles. We train our model on a novel gold-standard data set derived from the past years of active, manual curation at dblp.

Marcel R. Ackermann, Florian Reitz

Data Management

Research Data Preservation Using Process Engines and Machine-Actionable Data Management Plans

Scientific experiments in various domains nowadays require collecting, processing, and reusing data. Researchers have to comply with funder policies that prescribe how data should be managed, shared, and preserved. In most cases this has to be documented in data management plans. When data is selected and moved into a repository at the end of a project, it is often hard for researchers to identify which files need to be preserved and where they are located. For this reason, we need a mechanism that allows researchers to integrate preservation functionality into their daily data management workflows, to avoid situations in which scientific data is not properly preserved. In this paper we demonstrate how systems used for managing data during research can be extended with preservation functions using process engines that run pre-defined preservation workflows. We also show a prototype of a machine-actionable data management plan that is automatically generated during this process to document the actions performed. Thus, we break the traditional distinction between platforms for managing data during research and repositories used for preservation afterwards. Furthermore, we show how researchers can comply with funder requirements more easily while reducing their effort.

Asztrik Bakos, Tomasz Miksa, Andreas Rauber
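A sketch of what a machine-actionable DMP fragment emitted by such a preservation workflow might look like. The field names loosely echo the RDA DMP Common Standard and are assumptions, not the prototype's actual schema:

```python
# Sketch: emit a machine-actionable DMP fragment documenting a preservation
# action. The exact schema is an assumption, not the paper's implementation.
import json
from datetime import datetime, timezone

madmp = {
    "dmp": {
        "title": "Preservation record for experiment 42",
        "modified": datetime.now(timezone.utc).isoformat(),
        "dataset": [{
            "title": "sensor-readings.csv",
            "preservation_statement": "Copied to repository by workflow",
            "distribution": [{
                "host": {"title": "Institutional repository"},
                "access_url": "https://repo.example.org/datasets/42",
            }],
        }],
    }
}
print(json.dumps(madmp, indent=2))
```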
Maturity Models for Data and Information Management
A State of the Art

A maturity model is a widely used technique that has proved valuable for assessing business processes or certain aspects of organizations, as it represents a path towards an increasingly organized and systematic way of doing business. A maturity assessment can be used to measure the current maturity level of a certain aspect of an organization in a meaningful way, enabling stakeholders to clearly identify strengths and improvement points, and accordingly prioritize what to do in order to reach higher maturity levels. This paper collects and analyzes current practice on maturity models in the data and information management domains by analyzing a collection of maturity models from the literature. It also clarifies the options available to practitioners and opportunities for further research.

Diogo Proença, José Borbinha
An Operationalized DDI Infrastructure to Document, Publish, Preserve and Search Social Science Research Data

The social sciences are in a privileged position here, as there is already an existing metadata standard, defined by the Data Documentation Initiative (DDI), to document research data such as empirical surveys. Yet even though the DDI standard has existed since the year 2000, it is not widely used, because there are almost no (open source) tools available. In this article we present our technical infrastructure to operationalize DDI, to use DDI as a living standard for documentation and preservation, and to support the publishing process and search functions to foster re-use and research. The main contribution of this paper is to present our DDI architecture, to showcase how to operationalize DDI, and to show the efficient and effective handling and usage of complex metadata. The infrastructure can be adopted and used as a blueprint for other domains.

Claus-Peter Klas, Oliver Hopt

Scholarly Communication

Unveiling Scholarly Communities over Knowledge Graphs

Knowledge graphs represent the meaning of properties of real-world entities and the relationships among them in a natural way. Exploiting semantics encoded in knowledge graphs enables the implementation of knowledge-driven tasks such as semantic retrieval, query processing, and question answering, as well as solutions to knowledge discovery tasks including pattern discovery and link prediction. In this paper, we tackle the problem of knowledge discovery in scholarly knowledge graphs, i.e., graphs that integrate scholarly data, and present Korona, a knowledge-driven framework able to unveil scholarly communities for the prediction of scholarly networks. Korona implements a graph partition approach and relies on semantic similarity measures to determine relatedness between scholarly entities. As a proof of concept, we built a scholarly knowledge graph with data from researchers, conferences, and papers of the Semantic Web area, and applied Korona to uncover co-authorship networks. Results observed from our empirical evaluation suggest that exploiting semantics in scholarly knowledge graphs enables the identification of previously unknown relations between researchers. By extending the ontology, these observations can be generalized to other scholarly entities, e.g., articles or institutions, for the prediction of other scholarly patterns, e.g., co-citations or academic collaboration.

Sahar Vahdati, Guillermo Palma, Rahul Jyoti Nath, Christoph Lange, Sören Auer, Maria-Esther Vidal
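A toy illustration of the community-detection step on a co-authorship graph, with networkx's modularity-based communities standing in for Korona's partitioning and semantic similarity measures; names and weights are invented:

```python
# Illustrative community detection on a toy co-authorship graph; Korona's
# actual partitioning and similarity measures are more involved.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_weighted_edges_from([
    ("auer", "vahdati", 3), ("vahdati", "lange", 2), ("auer", "lange", 2),
    ("vidal", "palma", 4), ("palma", "nath", 1), ("vidal", "nath", 1),
    ("lange", "vidal", 1),  # weak cross-community tie
])

for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))
```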
Metadata Analysis of Scholarly Events of Computer Science, Physics, Engineering, and Mathematics

Although digitization has significantly eased publishing, finding a relevant and suitable publishing channel remains challenging. Scientific events such as conferences, workshops, or symposia are among the most popular channels, especially in computer science, the natural sciences, and technology. To obtain a better understanding of scholarly communication in different fields and of the role of scientific events, we analyzed metadata of scientific events from four research communities: Computer Science, Physics, Engineering, and Mathematics. Our transferable analysis methodology is based on descriptive statistics as well as exploratory data analysis. The metadata used in this work were collected from the OpenResearch.org community platform and SCImago, the main resources containing metadata of scientific events in a semantically structured way. There is no comprehensive information about submission numbers and acceptance rates in fields other than Computer Science. The evaluation uses metrics such as continuity, geographical and time-wise distribution, field popularity and productivity, as well as event progress ratio and rankings based on the SJR indicator and h5-indices. Recommendations are given for different stakeholders involved in the life cycle of events, such as chairs, potential authors, and sponsors.

Said Fathalla, Sahar Vahdati, Sören Auer, Christoph Lange
Venue Classification of Research Papers in Scholarly Digital Libraries

Open-access scholarly digital libraries periodically crawl a list of URLs in order to obtain appropriate collections of freely available research papers. The metadata of the crawled papers, e.g., title, authors, and references, are automatically extracted before the papers are indexed in a digital library. The venue of publication is another important aspect of a scientific paper, reflecting its authoritativeness. However, the venue is not always readily available for a paper. Instead, it needs to be extracted from the reference lists of other papers that cite the target paper. We explore a supervised learning approach to automatically classifying the venue of a research paper using information solely available from the content of the paper, and show experimentally on a dataset of approximately 44,000 papers that this approach outperforms several baselines and prior work.

Cornelia Caragea, Corina Florescu
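A minimal sketch of venue classification from paper content alone, with TF-IDF features and a linear SVM as stand-ins for whatever features and models the paper evaluates; the data is invented:

```python
# Sketch: supervised venue classification from paper content (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["ranking models for ad hoc retrieval",
         "convolutional architectures for object detection",
         "query expansion with relevance feedback",
         "image segmentation with deep networks"]
venues = ["SIGIR", "CVPR", "SIGIR", "CVPR"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, venues)
print(clf.predict(["dense retrieval with learned representations"]))
```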

Digital Humanities

Towards Better Understanding Researcher Strategies in Cross-Lingual Event Analytics

With an increasing amount of information on globally important events, there is a growing demand for efficient analytics of multilingual event-centric information. Such analytics is particularly challenging due to the large amount of content, the event dynamics and the language barrier. Although memory institutions increasingly collect event-centric Web content in different languages, very little is known about the strategies of researchers who conduct analytics of such content. In this paper we present researchers’ strategies for the content, method and feature selection in the context of cross-lingual event-centric analytics observed in two case studies on multilingual Wikipedia. We discuss the influence factors for these strategies, the findings enabled by the adopted methods along with the current limitations and provide recommendations for services supporting researchers in cross-lingual event-centric analytics.

Simon Gottschalk, Viola Bernacchi, Richard Rogers, Elena Demidova
Adding Words to Manuscripts: From PagesXML to TEITOK

This article describes a two-step method for transcribing historical manuscripts. The first step uses a page-based representation, making it easy to transcribe the document page by page and line by line; the second step converts this to the text-based TEI/XML format to make the document fully searchable.

Maarten Janssen

User Interaction

Predicting Retrieval Success Based on Information Use for Writing Tasks

This paper asks to what extent querying, clicking, and text editing behavior can predict the usefulness of the search results retrieved during essay writing. To render the usefulness of a search result directly observable for the first time in this context, we cast the writing task as “essay writing with text reuse,” where text reuse serves as usefulness indicator. Based on 150 essays written by 12 writers using a search engine to find sources for reuse, while their querying, clicking, reuse, and text editing activities were recorded, we build linear regression models for the two indicators (1) number of words reused from clicked search results, and (2) number of times text is pasted, covering 69% (90%) of the variation. The three best predictors from both models cover 91–95% of the explained variation. By demonstrating that straightforward models can predict retrieval success, our study constitutes a first step towards incorporating usefulness signals in retrieval personalization for general writing tasks.

Pertti Vakkari, Michael Völske, Martin Potthast, Matthias Hagen, Benno Stein
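A toy version of the regression setup described above: predict a usefulness indicator from a few behavioural predictors. The predictors and numbers are synthetic; only the modelling shape (linear regression, with R² playing the role of "variation covered") follows the abstract:

```python
# Toy linear regression in the spirit of the study (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed predictor columns: queries issued, results clicked, edits made
X = np.array([[3, 5, 40], [1, 2, 10], [6, 9, 80], [2, 4, 25]])
y = np.array([120, 30, 260, 90])  # words reused from clicked results

model = LinearRegression().fit(X, y)
print(model.coef_, model.score(X, y))  # R^2 parallels "variation covered"
```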
Personalised Session Difficulty Prediction in an Online Academic Search Engine

Search sessions consist of multiple user-system interactions. As a user-oriented measure for the difficulty of a session, we regard the time needed for finding the next relevant document (TTR). In this study, we analyse the search log of an academic search engine, focusing on the user interaction data without regarding the actual content. After observing a user for a short time, we predict the TTR for the remainder of the session. In addition to standard machine learning methods for numeric prediction, we investigate a new approach based on an ensemble of Markov models. Both types of methods yield similar performance. However, when we personalise the Markov models by adapting their parameters to the current user, this leads to significant improvements.

Vu Tran, Norbert Fuhr
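A sketch of one way to personalise a first-order Markov model over interaction events by blending user-specific and global transition estimates. The blending weight and event vocabulary are assumptions; the paper uses an ensemble of such models:

```python
# Sketch: first-order Markov transition model, personalised by blending a
# user's transition counts with global ones (alpha is an assumption).
from collections import Counter

def transition_probs(sessions):
    counts = Counter()
    for s in sessions:
        counts.update(zip(s, s[1:]))
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

global_sessions = [["query", "click", "relevant"], ["query", "query", "click"]]
user_sessions = [["query", "click", "click", "relevant"]]

g, u = transition_probs(global_sessions), transition_probs(user_sessions)
alpha = 0.7  # weight on the current user's own behaviour
personalised = {k: alpha * u.get(k, 0) + (1 - alpha) * g.get(k, 0)
                for k in set(g) | set(u)}
print(personalised[("click", "relevant")])
```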
User Engagement with Generous Interfaces for Digital Cultural Heritage

Digitisation has created vast digital cultural heritage collections and has spawned interest in novel interfaces that go beyond the search box and aim to engage users better. In this study we investigate this proposed link between generous interfaces and user engagement. The results indicate that while generous interfaces tend to focus on novel interface components and increasing visit duration, neither of these significantly influence user engagement.

Robert Speakman, Mark Michael Hall, David Walsh

Resources

Peer Review and Citation Data in Predicting University Rankings: A Large-Scale Analysis

Most Performance-based Research Funding Systems (PRFS) draw on peer review and bibliometric indicators, two different methodologies which are sometimes combined. A common argument against the use of indicators in such research evaluation exercises is their low correlation at the article level with peer review judgments. In this study, we analyse 191,000 papers from 154 higher education institutes which were peer reviewed in a national research evaluation exercise. We combine these data with 6.95 million citations to the original papers. We show that when citation-based indicators are applied at the institutional or departmental level, rather than at the level of individual papers, surprisingly large correlations with peer review judgments can be observed, up to r = 0.802 (n = 37, p < 0.001) for some disciplines. In our evaluation of ranking prediction performance based on citation data, we show we can reduce the mean rank prediction error by 25% compared to previous work. This suggests that citation-based indicators are sufficiently aligned with peer review results at the institutional level to be used to lessen the overall burden of peer review on national evaluation exercises leading to considerable cost savings.

David Pride, Petr Knoth
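The institution-level agreement reported above is a correlation coefficient; a toy computation with synthetic scores shows the shape of the calculation:

```python
# Toy Pearson correlation between peer review and citation indicators,
# aggregated per institution (synthetic numbers, not the study's data).
from scipy.stats import pearsonr

peer_review_score = [2.8, 3.1, 3.5, 2.2, 3.9, 3.0]   # per institution
citation_indicator = [1.1, 1.4, 1.8, 0.7, 2.2, 1.2]  # field-normalised

r, p = pearsonr(peer_review_score, citation_indicator)
print(f"r = {r:.3f}, p = {p:.4f}")
```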
The MUIR Framework: Cross-Linking MOOC Resources to Enhance Discussion Forums

New learning resources are created and minted in Massive Open Online Courses every week – new videos, quizzes, assessments, and discussion threads are deployed and interacted with – in the era of on-demand online learning. However, these resources are often artificially siloed between platforms and web application models. Linking such resources facilitates learning and multimodal understanding, improving learners' experience. We create a framework for MOOC Uniform Identifier for Resources (MUIR). MUIR enables applications to refer and link to such resources in a cross-platform way, allowing the easy minting of identifiers to MOOC resources, akin to #hashtags. We demonstrate the feasibility of this approach on the automatic identification, linking, and resolution – a task known as Wikification – of learning resources mentioned on MOOC discussion forums, from a harvested collection of 100K+ resources. Our Wikification system achieves a high initial rate of 54.6% successful resolutions on key resource mentions found in discussion forums, demonstrating the utility of the MUIR framework. Our analysis of this new problem shows that context is a key factor in determining the correct resolution of such mentions.

Ya-Hui An, Muthu Kumar Chandresekaran, Min-Yen Kan, Yan Fu
Figures in Scientific Open Access Publications

This paper summarizes the results of a comprehensive statistical analysis on a corpus of open access articles and contained figures. It gives an insight into quantitative relationships between illustrations or types of illustrations, caption lengths, subjects, publishers, author affiliations, article citations and others.

Lucia Sohmen, Jean Charbonnier, Ina Blümel, Christian Wartena, Lambert Heller

Information Extraction

Finding Person Relations in Image Data of News Collections in the Internet Archive

The amount of multimedia content in the World Wide Web is rapidly growing and contains valuable information for many applications in different domains. The Internet Archive initiative has gathered billions of time-versioned web pages since the mid-nineties. However, the huge amount of data is rarely labeled with appropriate metadata and automatic approaches are required to enable semantic search. Normally, the textual content of the Internet Archive is used to extract entities and their possible relations across domains such as politics and entertainment, whereas image and video content is usually disregarded. In this paper, we introduce a system for person recognition in image content of web news stored in the Internet Archive. Thus, the system complements entity recognition in text and allows researchers and analysts to track media coverage and relations of persons more precisely. Based on a deep learning face recognition approach, we suggest a system that detects persons of interest and gathers sample material, which is subsequently used to identify them in the image data of the Internet Archive. We evaluate the performance of the face recognition system on an appropriate standard benchmark dataset and demonstrate the feasibility of the approach with two use cases.

Eric Müller-Budack, Kader Pustu-Iren, Sebastian Diering, Ralph Ewerth
Ontology-Driven Information Extraction from Research Publications

Extraction of information from a research article, association with other sources and inference of new knowledge is a challenging task that has not yet been entirely addressed. We present Research Spotlight, a system that leverages existing information from DBpedia, retrieves articles from repositories, extracts and interrelates various kinds of named and non-named entities by exploiting article metadata, the structure of text as well as syntactic, lexical and semantic constraints, and populates a knowledge base in the form of RDF triples. An ontology designed to represent scholarly practices is driving the whole process. The system is evaluated through two experiments that measure the overall accuracy in terms of token- and entity-based precision, recall and F1 scores, as well as entity boundary detection, with promising results.

Vayianos Pertsas, Panos Constantopoulos
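A minimal sketch of the knowledge-base population step with rdflib; the ontology terms below are placeholders, not the Research Spotlight ontology:

```python
# Sketch: populating a knowledge base with RDF triples extracted from an
# article (ontology namespace and terms are illustrative placeholders).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHOLAR = Namespace("http://example.org/scholarly#")
g = Graph()

article = URIRef("http://example.org/articles/123")
method = URIRef("http://example.org/methods/topic-modelling")

g.add((article, RDF.type, SCHOLAR.Article))
g.add((article, SCHOLAR.title, Literal("A study of scholarly practices")))
g.add((article, SCHOLAR.employsMethod, method))

print(g.serialize(format="turtle"))
```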

Information Retrieval

Scientific Claims Characterization for Claim-Based Analysis in Digital Libraries

In this paper, we promote the idea of automatic semantic characterization of scientific claims to explore entity-entity relationships in digital collections. Our approach aims at alleviating the time-consuming analysis of query results when the information need is not just one document but an overview of a set of documents. For the semantic characterization, we propose to find what we call "dominant" claims, relying on two core properties: the consensual support of a claim in the light of the collection's previous knowledge, and the assertiveness of the language the authors use when expressing it. We discuss features that efficiently capture these two core properties and formalize the idea of finding "dominant" claims through Pareto dominance. We demonstrate the effectiveness of our method in a practical evaluation on a real-world document collection from the medical domain.

José María González Pinto, Wolf-Tilo Balke
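Pareto dominance over the two named properties is easy to make concrete. In this toy sketch each claim carries a (consensual support, assertiveness) score pair, and "dominant" claims are those on the Pareto front; the scores are invented:

```python
# "Dominant" claims via Pareto dominance on two scores (toy values).
claims = {"claim A": (0.9, 0.8), "claim B": (0.6, 0.9),
          "claim C": (0.5, 0.5), "claim D": (0.9, 0.4)}

def dominates(p, q):
    """p dominates q: at least as good everywhere, strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

pareto_front = [c for c, s in claims.items()
                if not any(dominates(t, s) for t in claims.values())]
print(pareto_front)  # claims no other claim dominates: A and B
```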
Automatic Segmentation and Semantic Annotation of Verbose Queries in Digital Library

In this paper, we propose a system for automatic segmentation and semantic annotation of verbose queries with predefined metadata fields. The problem of generating an optimal segmentation has been modeled as simulated annealing, with a proposed solution cost function and neighborhood function. The annotation problem has been modeled as sequence labeling and implemented with a Hidden Markov Model (HMM). Component-wise and holistic evaluations of the system have been performed using a gold-standard annotation developed over a query log collected from the National Digital Library of India (NDLI, https://ndl.iitkgp.ac.in). In the component-wise evaluation, the segmentation module yields 82% F1 and the annotation module performs with 56% accuracy. In the holistic evaluation, the system achieves an F1 of 33%.

Susmita Sadhu, Plaban Kumar Bhowmick
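A sketch of query segmentation by simulated annealing. The cost and neighbourhood functions below are simple stand-ins for the ones the paper proposes:

```python
# Sketch: segment a verbose query by simulated annealing over break points.
# Cost and neighbourhood are toy stand-ins for the paper's definitions.
import math
import random

tokens = "nineteenth century bengali literature digital archive".split()

def cost(breaks):  # assumption: prefer a few mid-sized segments
    bounds = [0, *sorted(breaks), len(tokens)]
    sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sum((s - 2) ** 2 for s in sizes) + len(sizes)

def neighbour(breaks):
    b = set(breaks)
    b.symmetric_difference_update({random.randrange(1, len(tokens))})
    return frozenset(b)  # one break point added or removed

state, temperature = frozenset({3}), 2.0
for _ in range(2000):
    candidate = neighbour(state)
    delta = cost(candidate) - cost(state)
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        state = candidate
    temperature *= 0.995  # cooling schedule

bounds = [0, *sorted(state), len(tokens)]
print([tokens[a:b] for a, b in zip(bounds, bounds[1:])])
```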

Recommendation

Open Source Software Recommendations Using GitHub

This work focuses on providing open source software recommendations using the GitHub API. Specifically, we propose a hybrid method that considers the programming languages, topics, and README documents that appear in a user's repositories. To demonstrate our approach, we implement a proof of concept that provides recommendations.

Miika Koskela, Inka Simola, Kostas Stefanidis
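A sketch of the hybrid idea: build a textual profile of a user from repository languages and topics via the GitHub API, then rank candidate projects by cosine similarity. Unauthenticated calls are rate-limited, field availability may vary, and the candidate set is invented:

```python
# Sketch of a content-based GitHub recommender (illustrative only).
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def repo_profile(user: str) -> str:
    repos = requests.get(f"https://api.github.com/users/{user}/repos",
                         timeout=30).json()
    parts = []
    for repo in repos:
        parts.append(repo.get("language") or "")
        parts.extend(repo.get("topics") or [])
    return " ".join(p for p in parts if p)

user_doc = repo_profile("octocat")
candidates = {"numpy/numpy": "Python scientific computing arrays",
              "torvalds/linux": "C kernel operating system"}

vectors = TfidfVectorizer().fit_transform([user_doc, *candidates.values()])
scores = cosine_similarity(vectors[:1], vectors[1:]).ravel()
for name, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {name}")
```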
Recommending Scientific Videos Based on Metadata Enrichment Using Linked Open Data

The number of videos available on the Web has increased significantly, not only for entertainment but also to convey educational or scientific information in an effective way. There are several web portals that offer access to the latter kind of video material. One of them is the TIB AV-Portal of the Leibniz Information Centre for Science and Technology (TIB), which hosts scientific and educational video content. In contrast to other video portals, automatic audiovisual analysis (visual concept classification, optical character recognition, speech recognition) is utilized to enhance metadata information and semantic search. In this paper, we propose to further exploit and enrich this automatically generated information by linking it to the Integrated Authority File (GND) of the German National Library. This information is used to derive a measure of the similarity of two videos, which serves as a basis for recommending semantically similar videos. A user study demonstrates the feasibility of the proposed approach.

Justyna Medrek, Christian Otto, Ralph Ewerth

Posters

TIB-arXiv: An Alternative Search Portal for the arXiv Pre-print Server

arXiv is a popular pre-print server focusing on natural science disciplines (e.g., physics, computer science, quantitative biology). As a platform with an emphasis on easy publishing services it does not provide enhanced search functionality, but it offers programming interfaces which allow external parties to add such services. This paper presents extensions of the open source framework arXiv Sanity Preserver (SP). With respect to the original framework, our extension removes SP's topical restriction and allows text-based search and visualisation of all papers in arXiv. To this end, all papers are stored in a unified back-end; the extension provides enhanced search and ranking facilities and allows the exploration of arXiv papers through a novel user interface.

Matthias Springstein, Huu Hung Nguyen, Anett Hoppe, Ralph Ewerth
An Analytics Tool for Exploring Scientific Software and Related Publications

Scientific software is one of the key elements for reproducible research. However, classic publications and related scientific software are typically not (sufficiently) linked, and tools are missing to jointly explore these artefacts. In this paper, we report on our work on developing the analytics tool SciSoftX (https://labs.tib.eu/info/projekt/scisoftx/) for jointly exploring software and publications. The presented prototype, a concept for automatic code discovery, and two use cases demonstrate the feasibility and usefulness of the proposal.

Anett Hoppe, Jascha Hagen, Helge Holzmann, Günter Kniesel, Ralph Ewerth
Digital Museum Map

The digitisation of cultural heritage has created large digital collections that have the potential to open up our cultural heritage. However, the search box, which presents a significant obstacle for non-expert users, remains the primary interface for accessing them. This demo presents a fully automated, data-driven system for generating a generous interface for exploring digital cultural heritage (DCH) collections.

Mark Michael Hall
ORCID iDs in the Open Knowledge Era

The focus of this poster is to highlight the importance of sufficient metadata in ORCID records for the purpose of name disambiguation. In 2017 the authors counted ORCID iDs containing minimal information. They invoked RESTful API calls using Postman software and searched ORCID records created between 2012–2017 that did not include affiliation or organization name, Ringgold ID, and any work titles. A year later, they reproduced the same API calls and compared with the results achieved the year before. The results reveal that a high number of records are still minimal or orphan, thus making the name disambiguation process difficult. The authors recognize the benefit of a unique identifier that facilitates name disambiguation and remain confident that with continued work in the areas of system interoperability and technical integration, alongside continued advocacy and outreach, ORCID will grow and develop not only in number of iDs but also in metadata robustness.

Marina Morgan, Naomi Eichenlaub
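A sketch of the kind of public-API probing described above: search for records and count their works, treating zero-work records as candidates for "minimal" profiles. The query field and JSON paths follow the public ORCID v3.0 API as best I can tell and should be verified against ORCID's documentation:

```python
# Sketch: probe public ORCID records for completeness via the v3.0 API.
import requests

HEADERS = {"Accept": "application/json"}
search = requests.get("https://pub.orcid.org/v3.0/search/",
                      params={"q": "family-name:Morgan", "rows": 3},
                      headers=HEADERS, timeout=30).json()

for result in search.get("result") or []:
    orcid_id = result["orcid-identifier"]["path"]
    record = requests.get(f"https://pub.orcid.org/v3.0/{orcid_id}/record",
                          headers=HEADERS, timeout=30).json()
    n_works = len(record["activities-summary"]["works"]["group"])
    print(orcid_id, "works:", n_works)  # 0 works: candidate "minimal" record
```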
Revealing Historical Events Out of Web Archives

As the living Web expands, worldwide volumes of Web archives constantly increase, making it difficult to identify relevant archived contents. Here we propose an application for detecting historical events in a corpus of Web archives, based on an entity called the Web fragment: a semantic and syntactic subset of a given Web page. The Web fragment has the particularity of being indexed by its edition date instead of its archiving date. We apply our framework to an archived Moroccan forum and observe how it reacted to the Arab Spring at the end of 2010.

Quentin Lobbé
The FAIR Accessor as a Tool to Reinforce the Authenticity of Digital Archival Information

The constant changeability of the digital environment raises a complex series of issues regarding the preservation of authentic, accessible, intelligible and reusable digital information. An implementation of the FAIR Accessor, a technology developed with the goal of delivering findable, accessible, interoperable and reusable research data, is discussed as a means of supporting archival description with the goal of ensuring its authenticity. A qualitative literature review focused on some of the main tenets of digital preservation in the fields of Information Science, Diplomatics and research data is followed by a discussion on how the core criteria of each area overlap and complement each other. It is concluded that the FAIR Accessor can assist in providing a rich archival description, ultimately helping to determine the authenticity of records.

André Pacheco
Who Cites What in Computer Science? - Analysing Citation Patterns Across Conference Rank and Gender

Citations are a means to refer to previous, relevant scientific bodies of work. However, little is known about how citations behave with respect to venue reputation. Are A* papers cited more often by C papers, or vice versa? What are the source and sink of a citation in terms of venue reputation? In this work, we investigate this issue by analysing the DBLP database of computer science publications, utilizing rank information from the CORE database. Our analysis shows that authors tend to cite publications from the same or higher-ranked venues more often than from lower-tier venues. Self-citations, on the contrary, are especially focused on same-tier venues. The gender of the first author does not seem to have any impact on citations from and to differently ranked venues.

Tobias Milz, Christin Seifert
Back to the Source: Recovering Original (Hebrew) Script from Transcribed Metadata

Due to technical constraints of the past, metadata in languages written with non-Latin scripts have frequently been entered using various systems of transcription. While this transcription is essential for data curators who may not be familiar with the source script, it is often an encumbrance for researchers in discovery and retrieval. Until 2011, the Judaica collection in Hebrew and Yiddish of the University Library J. C. Senckenberg was catalogued with transcription only. The aim of this work is to develop an open-source system to aid in the automatic conversion of Hebrew transcription back into Hebrew script, using a multi-faceted approach.

Aaron Christianson, Rachel Heuberger, Thomas Risse
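A toy greedy longest-match reverse transliteration shows both the mechanism and why a multi-faceted approach is needed; the mapping is a tiny invented subset, and real Hebrew romanisation is far more ambiguous:

```python
# Toy reverse transliteration with greedy longest-match (illustrative only).
MAPPING = {"sh": "ש", "b": "ב", "r": "ר", "a": "", "e": "", "i": "י",
           "t": "ת", "m": "מ", "o": "ו", "l": "ל"}

def to_hebrew(romanised: str) -> str:
    out, i = [], 0
    keys = sorted(MAPPING, key=len, reverse=True)  # try digraphs first
    while i < len(romanised):
        for key in keys:
            if romanised.startswith(key, i):
                out.append(MAPPING[key])
                i += len(key)
                break
        else:  # no mapping found: pass the character through
            out.append(romanised[i])
            i += 1
    return "".join(out)

# Prints "שלומ"; final-form letters (ם) are one of many ignored subtleties.
print(to_hebrew("shalom"))
```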
From Handwritten Manuscripts to Linked Data

Museums, archives, and digital libraries make increasing use of Semantic Web technologies to enrich and publish their collection items. The contents of those items, however, are not often enriched in the same way. Extracting named entities within historical manuscripts and disclosing the relationships between them would facilitate cultural heritage research, but it is a labour-intensive and time-consuming process, particularly for handwritten documents. It requires either automated handwriting recognition techniques or manual annotation by domain experts before the content can be semantically structured. Different workflows have been proposed to address this problem, involving full-text transcription and named entity extraction, with results ranging from unstructured files to semantically annotated knowledge bases. Here, we detail these workflows and describe the approach we have taken to disclose historical biodiversity data, which enables the direct labelling and semantic annotation of document images in handwritten archives.

Lise Stork, Andreas Weber, Jaap van den Herik, Aske Plaat, Fons Verbeek, Katherine Wolstencroft
A Study on the Monetary Value Estimation of User Satisfaction with the Digital Library Service Focused on Construction Technology Information in South Korea

The Korea Institute of Civil Engineering and Building Technology has, since 2001, been constructing a database by collecting, classifying, and processing the construction technology data required by construction engineers, and providing a database information service through the Construction Technology Digital Library portal. In this study, the monetary value of user satisfaction with the digital library service was estimated by applying the double-bounded dichotomous choice contingent valuation method, with the aim of using the limited information service budget to improve user satisfaction.

Seong-Yun Jeong
Visual Analysis of Search Results in Scopus Database

The enormous growth of research and development has been accompanied by a growing number of scientific publications in recent decades. These publications are collected and processed by a number of digital libraries. Although these digital libraries provide basic search tools, more advanced methods such as visualization and visual analysis can be implemented only by using special software. This article presents ways to visually analyse the content of digital libraries using the CiteViz tool developed in [6] and shows its application to the Scopus database.

Ondrej Klapka, Antonin Slaby
False-Positive Reduction in Ontology Matching Based on Concepts’ Domain Similarity

In this study we explore whether considering the domain similarity between concepts to be matched can help to filter out false-positive relations. This is particularly relevant in areas where the "universe of discourse" encompasses several diverse domains, such as cultural heritage. Our approach is based on an algorithm that employs the lexical resource WordNet Domains to filter out relations where the two concepts to be matched are associated with different domains. We evaluate our approach in an experiment involving Bibframe and Schema.org, two ontologies of complementary nature. The results from the evaluation show that the use of such a domain filter can indeed have a positive effect on reducing false positives while retaining true ones.

Audun Vennesland, Trond Aalberg
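The filtering idea can be made concrete with a hard-coded domain lookup standing in for WordNet Domains (the actual resource the paper uses); concepts and domain labels here are invented:

```python
# Sketch of the domain filter: drop candidate matches whose concepts fall
# in disjoint domains. DOMAINS is a stand-in for WordNet Domains.
DOMAINS = {"painting": {"art"}, "canvas": {"art", "fabric"},
           "bank": {"economy", "geography"}, "museum": {"art", "tourism"}}

candidate_matches = [("painting", "canvas"), ("painting", "bank")]

def domains_overlap(a: str, b: str) -> bool:
    return bool(DOMAINS.get(a, set()) & DOMAINS.get(b, set()))

filtered = [(a, b) for a, b in candidate_matches if domains_overlap(a, b)]
print(filtered)  # ("painting", "bank") is dropped as a likely false positive
```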
Association Rule Based Clustering of Electronic Resources in University Digital Library

Library analytics is used to analyze the huge amount of data that most colleges and universities collect when their electronic library resources are browsed. In this research work, we analyzed library usage data to accomplish the task of e-resource item clustering. We compared different clustering algorithms and found that association rule mining (ARM) based clustering is more accurate than the others, and that it also identifies hidden relationships between articles that are not similar in content. We also show that items in the same cluster offer a good source for recommendation.

Debashish Roy, Chen Ding, Lei Jin, Dana Thomas
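A sketch of association-rule mining over access sessions using mlxtend; the sessions and thresholds are invented, and the paper's clustering built on top of such rules is more involved:

```python
# Sketch: association rules over e-resource access sessions. Rules with
# high confidence suggest items that belong in one cluster.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

sessions = [["art1", "art2"], ["art1", "art2", "art3"],
            ["art3", "art4"], ["art1", "art2", "art4"]]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(sessions).transform(sessions),
                      columns=encoder.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "confidence"]])
```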
Hybrid Image Retrieval in Digital Libraries
A Large Scale Multicollection Experimentation of Deep Learning Techniques

While digital heritage libraries historically took advantage of OCR to index their printed collections, access to iconographic resources has not progressed in the same way, and the latter remain in the shadows. Today, it would be possible to make better use of these resources, especially by leveraging the illustrations recognized thanks to the OCR produced over the last two decades. This work presents an ETL (extract-transform-load) approach to this need, which aims to: identify iconography wherever it may be found; enrich the illustrations' metadata with deep learning approaches; and load it all into a web app for hybrid image retrieval.

Jean-Philippe Moreux, Guillaume Chiron
Grassroots Meets Grasstops: Integrated Research Data Management with EUDAT B2 Services, Dendro and LabTablet

We present an integrated research data management (RDM) workflow that captures data from the moment of creation until its deposit. We integrated LabTablet, our electronic laboratory notebook, Dendro, our data organisation and description platform aimed at collaborative management of research data, and EUDAT’s B2DROP and B2SHARE platforms. This approach combines the portability and automated metadata production abilities of LabTablet, Dendro as a collaborative RDM tool for dataset preparation, with the scalable storage of B2DROP and the long-term deposit of datasets in B2SHARE. The resulting workflow can be put to work in research groups where laboratorial or field work is central.

João Rocha da Silva, Nelson Pereira, Pedro Dias, Bruno Barros
Linked Publications and Research Data: Use Cases for Digital Libraries

Linking publications to research data is becoming important for a more complete picture of research. Silos of publication and data collections within institutions hamper this realization. We explore a few use cases that result from linking scholarly resources in a digital library setting.

Fidan Limani, Atif Latif, Klaus Tochtermann
The Emergence of Thai OER to Support Open Education

This paper presents the practical work of developing Open Educational Resources (OER) in a developing country, Thailand, to support open education and lifelong learning in society. Thai OER is an ongoing project under the Online Learning Resources for Distance Learning project in the Celebration of the Auspicious Occasion of Her Royal Highness Princess Maha Chakri Sirindhorn's 60th Birthday Anniversary on the 2nd April 2015. It is developed through the collaborative efforts of multiple stakeholders in the country to share educational materials via the Internet under an open licensing agreement. The goal is to reduce the cost, access, and usage barriers for students, teachers, and learners, especially disadvantaged and disabled children and young people who lack opportunities to access education and knowledge. The materials provided in Thai OER cover a range of topics in different fields, especially Thai local and indigenous knowledge, and in different formats for all users. This paper also presents the benefits of Thai OER at different levels and the major challenges in developing and adopting OER in a developing country.

Titima Thumbumrung, Boonlert Aroonpiboon
Fair Play at Carlos III University of Madrid Library

Our purpose is to show projects held at the Carlos III University Library related to the FAIR principles. Over time, the Library is evolving from a traditional library to an "as open as possible, as closed as necessary" digital library.

Belén Fernández-del-Pino Torres, Teresa Malo-de-Molina y Martín-Montalvo
Supporting Description of Research Data: Evaluation and Comparison of Term and Concept Extraction Approaches

The importance of research data management is widely recognized. Dendro is an ontology-based platform that allows researchers to describe datasets using generic and domain-specific descriptors from ontologies. Selecting or building the right ontologies for each research domain or group requires meetings between curators and researchers in order to capture the main concepts of their research. Envisioning a tool to assist curators through the automatic extraction of key concepts from research documents, we propose two concept extraction methods and compare them with a term extraction method. To compare the three approaches, we use as ground truth an ontology previously created by human curators.

Cláudio Monteiro, Carla Teixeira Lopes, João Rocha Silva
Anonymized Distributed PHR Using Blockchain for Openness and Non-repudiation Guarantee

We introduce a solution developed for data privacy, specifically for cognitive security, that can be enforced and guaranteed using blockchain technology in SAAL (Smart Ambient Assisted Living) environments. With our proposal, access to a patient's clinical record resists the tampering and ransomware attacks that have recently plagued hospital information systems (HIS) in various countries. One important side effect of this data infrastructure is that it can be accessed in open form, for research purposes for instance, since no individual re-identification or group profiling is possible by any means.

David Mendes, Irene Rodrigues, César Fonseca, Manuel Lopes, José Manuel García-Alonso, Javier Berrocal
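The tamper-resistance claim rests on hash chaining, which a few lines can illustrate; this sketch omits the distributed consensus a real blockchain provides and uses an invented pseudonym scheme:

```python
# Sketch: a hash-chained, pseudonymised record log. Shows why editing any
# past entry invalidates the chain; not the paper's actual implementation.
import hashlib
import json

def add_block(chain, payload):
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"payload": payload, "prev": previous_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    chain.append(block)

chain = []
add_block(chain, {"patient": "pseudonym-7f3a", "event": "blood pressure ok"})
add_block(chain, {"patient": "pseudonym-7f3a", "event": "medication taken"})

chain[0]["payload"]["event"] = "tampered"  # any edit breaks verification
recomputed = hashlib.sha256(json.dumps(
    {"payload": chain[0]["payload"], "prev": chain[0]["prev"]},
    sort_keys=True).encode()).hexdigest()
print(recomputed == chain[0]["hash"])  # False: tampering is detectable
```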
Backmatter
Metadata
Title
Digital Libraries for Open Knowledge
Edited by
Eva Méndez
Fabio Crestani
Cristina Ribeiro
Gabriel David
João Correia Lopes
Copyright Year
2018
Electronic ISBN
978-3-030-00066-0
Print ISBN
978-3-030-00065-3
DOI
https://doi.org/10.1007/978-3-030-00066-0