
## About this book

This book constitutes the proceedings of the 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, held in Hannover, Germany, in September 2016.

The 28 full papers, 5 posters and 8 short papers presented in this volume were carefully reviewed and selected from 93 submissions. They were organized in topical sections named: Digital Library Design; User Aspects; Search; Web Archives; Semantics; Multimedia and Time Aspects; Digital Library Evaluation; Digital Humanities; e-Infrastructures.

## Table of contents

### Erratum to: CERN Analysis Preservation: A Novel Digital Library Service to Enable Reusable and Reproducible Research

Xiaoli Chen, Sünje Dallmeier-Tiessen, Anxhela Dani, Robin Dasler, Javier Delgado Fernández, Pamfilos Fokianos, Patricia Herterich, Tibor Šimko

### Realizing Inclusive Digital Library Environments: Opportunities and Challenges

Universal design, also known as inclusive design, envisions the design of products and services to be accessible and usable to all irrespective of their disability status, cultural background, age, etc. Libraries have been benefiting from the breakthroughs in accessibility research to design their environments to be as friendly as possible for all groups of users. However, the present scenario of digital library environments, characterized by different types of resources acquired or subscribed from different vendors who operate with different rules and maintain some form of control over the collections, shows that adherence to guidelines by itself won’t ensure inclusive digital library environments. The paper attempts to explore the matter, taking the case of digital services run in selected libraries to identify trends that favor universal design and to point out challenges that need to be dealt with as part of further endeavors.

Wondwossen M. Beyene

### A Maturity Model for Information Governance

Information Governance (IG) as defined by Gartner is the “specification of decision rights and an accountability framework to encourage desirable behavior in the valuation, creation, storage, use, archival and deletion of information. Includes the processes, roles, standards and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals”. Organizations that wish to comply with IG best practices can seek support in the existing best practices, standards and other relevant references, not only in the core domain but also in relevant peripheral domains. However, despite the existence of these references, organizations are still unable, in many scenarios, to determine in a straightforward manner two fundamental business-related concerns: (1) to what extent their current processes comply with such standards; and, if they do not, (2) which goals they need to achieve in order to be compliant. In this paper, we present how to create an IG maturity model based on existing reference documents. The process is based on existing maturity model development methods that allow for a systematic approach to maturity model development, backed up by a well-known and proven scientific research method called Design Science Research.

Diogo Proença, Ricardo Vieira, José Borbinha

### Usage-Driven Dublin Core Descriptor Selection

A Case Study Using the Dendro Platform for Research Dataset Description

Dublin Core schemas are the core metadata models of most repositories, and this includes recent repositories dedicated to datasets. DC descriptors are generic and are being adapted to the needs of different communities with the so-called Dublin Core Application Profiles (DCAPs). DCAPs rely on agreement within user communities, in a process mainly driven by their evolving needs. In this paper, we propose a complementary automated process, designed to help curators and users discover the descriptors that better suit the needs of a specific research group. We target the description of datasets, and test our approach using Dendro, a prototype research data management platform, where an experimental method is used to rank and present DC Terms descriptors to the users based on their usage patterns. In a controlled experiment, we gathered the interactions of two groups as they used Dendro to describe datasets from selected sources. One of the groups had descriptor ranking on, while the other had the same list of descriptors throughout the whole experiment. Preliminary results show that (1) some DC Terms are filled in more often than others, with a different distribution in the two groups, (2) selected descriptors were increasingly accepted by users to the detriment of manual selection and (3) users were satisfied with the performance of the platform, as demonstrated by a post-study survey.

João Rocha da Silva, Cristina Ribeiro, João Correia Lopes

### Retrieving and Ranking Similar Questions from Question-Answer Archives Using Topic Modelling and Topic Distribution Regression

Presented herein is a novel model for similar question ranking within collaborative question answer platforms. The presented approach integrates a regression stage to relate topics derived from questions to those derived from question-answer pairs. This helps to avoid problems caused by the differences in vocabulary used within questions and answers, and the tendency for questions to be shorter than answers. The model is shown to outperform translation methods and topic modelling (without regression) on several real-world datasets.

Pedro Chahuara, Thomas Lampert, Pierre Gançarski

### Survey on High-level Search Activities Based on the Stratagem Level in Digital Libraries

High-level search activities for Digital Libraries (DLs) introduced by Fuhr et al. [8] go beyond basic query searches because they include targeted and structured searches such as a journal run or citation searching. In this paper, we investigate if and how typical high-level search activities are really used in current DLs. We conducted an online survey with 129 participating researchers from different fields of study that aims at getting a quantitative view on the usage of high-level search activities in DLs. Although our results indicate the usefulness of high-level search activities, they are not well supported by modern DLs with regard to the users’ state of search, e.g. looking at a relevant or not relevant document. Furthermore, we identified differences in the information seeking behavior across the respondents. Respondents with a higher academic degree considered journals and conference proceedings significantly more useful than respondents with a lower academic degree.

Zeljko Carevic, Philipp Mayr

### Content-Based Video Retrieval in Historical Collections of the German Broadcasting Archive

The German Broadcasting Archive (DRA) maintains the cultural heritage of radio and television broadcasts of the former German Democratic Republic (GDR). The uniqueness and importance of the video material stimulate a large scientific interest in the video content. In this paper, we present an automatic video analysis and retrieval system for searching in historical collections of GDR television recordings. It consists of video analysis algorithms for shot boundary detection, concept classification, person recognition, text recognition and similarity search. The performance of the system is evaluated from a technical and an archival perspective on 2,500 hours of GDR television recordings.

Markus Mühling, Manja Meister, Nikolaus Korfhage, Jörg Wehling, Angelika Hörth, Ralph Ewerth, Bernd Freisleben

### Profile-Based Selection of Expert Groups

In a wide variety of daily activities, the need to select a group of k experts from a larger pool of n candidates ($k<n$) based on some criteria often arises. Indicative examples, among many others, include the selection of program committee members for a research conference, staffing an organization’s board with competent members, forming a subject-specific task force, or building a group of project evaluators. Unfortunately, the process of expert group selection is typically carried out manually by a certain individual, which poses two significant shortcomings: (a) the task is particularly cumbersome, and (b) the selection process is largely subjective, thus leading to results of doubtful quality. To address these challenges, in this paper, we propose an automatic profile-based expert group selection mechanism that is supported by digital libraries. To this end, we build textual profiles of candidates and propose algorithms that follow an IR-based approach to perform the expert group selection. Our approach is generic and independent of the actual expert group selection problem, as long as the candidate profiles have been generated. To evaluate the effectiveness of our approach, we demonstrate its applicability on the scenario of automatically building a program committee for a research conference.

Georgios A. Sfyris, Nikolaos Fragkos, Christos Doulkeridis

### Tracking and Re-finding Printed Material Using a Personal Digital Library

Most web searches aim to re-find previously known information or documents. Keeping track of one’s digital and printed reading material is known to be a challenging and costly task. We describe the design, implementation and evaluation of our Human-centred workplace (HCW) – a system that supports the tracking of physical document printouts. HCW embeds QR codes in the document printout, stores the documents in a personal Digital Library, and uses cameras in the office to track changes in the document locations. We explored the HCW in three evaluations, using the system over several weeks in an office setting, a user study in a lab environment, and extensive functional tests.

Annika Hinze, Amay Dighe

### ArchiveWeb: Collaboratively Extending and Exploring Web Archive Collections

Curated web archive collections contain focused digital contents which are collected by archiving organizations to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. This paper describes the functionalities of our current prototype for searching, constructing, exploring and discussing web archive collections, as well as feedback on this prototype from seven archiving organizations, and our plans for improving the next release of the system.

Zeon Trevor Fernando, Ivana Marenzi, Wolfgang Nejdl, Rishita Kalyani

### Web Archive Profiling Through Fulltext Search

An archive profile is a high-level summary of a web archive’s holdings that can be used for routing Memento queries to the appropriate archives. It can be created by generating summaries from the CDX files (index of web archives), which we explored in an earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means to discover the holdings of an archive involve sampling-based approaches, such as fulltext keyword searching to learn the URIs present in the response, or looking up a sample set of URIs to see which of those are present in the archive. It is the fulltext-search-based discovery and profiling that is the scope of this paper. We developed the Random Searcher Model (RSM) to discover the holdings of an archive by a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can make routing decisions for 80% of the requests correctly while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.

Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, David S. H. Rosenthal

### Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts

Web archives preserve the fast changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web; this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web, both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls are different in terms of their size, number of included websites and domains. To allow fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.

Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries

### How to Search the Internet Archive Without Indexing It

Significant parts of cultural heritage have been produced on the web in recent decades. While easy accessibility to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and a lack of usage logs that contain the implicit human feedback most relevant for today’s web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. In addition, we link retrieved results to the Wayback Machine, thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools through a user-friendly interface, which comes close to the functionalities of modern web search engines (e.g., keyword search, query auto-completion and related query suggestion), and provides the great benefit of also taking user feedback on the current web into account for web archive search. Through extensive experiments, we conduct quantitative and qualitative analyses in order to provide insights that enable further research on and practical applications of web archives.

Nattiya Kanhabua, Philipp Kemkes, Wolfgang Nejdl, Tu Ngoc Nguyen, Felipe Reis, Nam Khanh Tran

### BIB-R: A Benchmark for the Interpretation of Bibliographic Records

In a global context which promotes the use of explicit semantics for sharing information and developing new services, the MAchine Readable Cataloguing (MARC) format that is commonly used by libraries worldwide has demonstrated its limitations. The semantic model for representing cultural items presented in the Functional Requirements for Bibliographic Records (FRBR) is expected to be a successor of MARC, and the complex transformation of MARC catalogs to FRBR catalogs (FRBRization) led to the proposition of various tools and approaches. However, these projects and the results they achieve are difficult to compare on a fair basis due to a lack of common datasets and appropriate metrics. Our contributions fill this gap by proposing the first public benchmark for the FRBRization process.

Joffrey Decourselle, Fabien Duchateau, Trond Aalberg, Naimdjon Takhirov, Nicolas Lumineau

### Querying the Web of Data with SPARQL-LD

A constantly increasing number of data providers publish their data on the Web in the RDF format as Linked Data. SPARQL is the standard query language for retrieving and manipulating RDF data. However, the majority of SPARQL implementations require the data to be available in advance (in main memory or in a repository), thereby not exploiting the real-time and dynamic nature of Linked Data. In this paper we present SPARQL-LD, an extension of SPARQL 1.1 Federated Query that allows RDF data to be fetched and queried directly from any Web source. Using SPARQL-LD, one can even query a dataset coming from the partial results of a query (i.e., discovered at query execution time), or RDF data that is dynamically created by Web Services. Such functionality motivates Web publishers to adopt the Linked Data principles and enrich their digital contents and services with RDF, since their data is made directly accessible and exploitable via SPARQL (without needing to set up and maintain an endpoint). In this paper, we showcase the benefits offered by SPARQL-LD through an example related to the Europeana digital library, we report experimental results that demonstrate the feasibility of SPARQL-LD, and we introduce optimizations that improve its efficiency.

Pavlos Fafalios, Thanos Yannakis, Yannis Tzitzikas

### A Scalable Approach to Incrementally Building Knowledge Graphs

We work on converting the metadata of 13 American art museums and archives into Linked Data, to be able to integrate and query the resulting data. While there are many good sources of artist data, no single source covers all artists. We thus address the challenge of building a comprehensive knowledge graph of artists that we can then use to link the data from each of the individual museums. We present a framework to construct and incrementally extend a knowledge graph, describe and evaluate techniques for efficiently building knowledge graphs through the use of the MinHash/LSH algorithm for generating candidate matches, and conduct an evaluation that demonstrates our approach can efficiently and accurately build a knowledge graph about artists.

Gleb Gawriljuk, Andreas Harth, Craig A. Knoblock, Pedro Szekely
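The abstract above names MinHash/LSH as the technique for generating candidate matches. As a rough illustration only (not the authors' implementation), the sketch below computes MinHash signatures over character trigrams and groups records whose signatures agree on at least one band. The function names, the trigram shingling, and the parameters `num_perm=32` and `bands=16` are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Split a string into a set of lowercase character k-grams (assumed tokenization)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed simulates a separate hash function."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_perm=32):
    """For each of num_perm seeded hash functions, keep the minimum hash over all tokens."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


def candidate_pairs(records, num_perm=32, bands=16):
    """Return pairs of record keys whose signatures are identical in at least one band.

    num_perm must be divisible by bands. More bands (fewer rows per band)
    makes the filter more permissive.
    """
    rows = num_perm // bands
    buckets = defaultdict(set)
    for key, text in records.items():
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(key)
    pairs = set()
    for keys in buckets.values():
        for left, right in combinations(sorted(keys), 2):
            pairs.add((left, right))
    return pairs


if __name__ == "__main__":
    # Hypothetical artist labels from two sources; keys are made-up identifiers.
    artists = {
        "museum_a/1": "Vincent van Gogh",
        "museum_b/7": "Vincent van Gogh",
        "museum_b/9": "Frida Kahlo",
    }
    print(candidate_pairs(artists))
```

Candidate pairs produced this way would still need a precise similarity check before being linked. The banding step only narrows down which pairs are worth comparing.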

### Sub-document Timestamping: A Study on the Content Creation Dynamics of Web Documents

The creation time of documents is an important kind of information in temporal information retrieval, especially for document clustering, timeline construction and search engine improvements. Considering the manner in which content on the Web is created, updated and deleted, the common assumption that each document has only one creation time is not suitable for Web documents. In this paper, we investigate to what extent this assumption is wrong. We introduce two methods to timestamp individual parts (sub-documents) of Web documents and analyze in detail the creation and update dynamics of three classes of Web documents.

Yue Zhao, Claudia Hauff

### Archiving Software Surrogates on the Web for Future Reference

Software has long been established as an essential aspect of the scientific process in mathematics and other disciplines. However, reliably referencing software in scientific publications is still challenging for various reasons. A crucial factor is that software dynamics with temporal versions or states are difficult to capture over time. We propose to archive and reference surrogates instead, which can be found on the Web and reflect the actual software to a remarkable extent. Our study shows that about half of the webpages of software are already archived, with almost all of them including some kind of documentation.

Helge Holzmann, Wolfram Sperber, Mila Runnwerth

### From Water Music to ‘Underwater Music’: Multimedia Soundtrack Retrieval with Social Mass Media Resources

In creative media, visual imagery is often combined with music soundtracks. In the resulting artefacts, the consumption of isolated music or imagery will not be the main goal, but rather the combined multimedia experience. Through frequent combination of music with non-musical information resources and the corresponding public exposure, certain types of music will get associated with certain types of non-musical contexts. As a consequence, when dealing with the problem of soundtrack retrieval for non-musical media, it would be appropriate to not only address corresponding music search engines in music-technical terms, but to also exploit typical surrounding contextual and connotative associations. In this work, we make use of this information, and present and validate a search engine framework based on collaborative and social Web resources on mass media and corresponding music usage. Making use of the SRBench dataset, we show that employing social folksonomic descriptions in search indices is effective for multimedia soundtrack retrieval.

Cynthia C. S. Liem

### The “Nomenclature of Multidimensionality” in the Digital Libraries Evaluation Domain

Digital libraries evaluation is characterised as an interdisciplinary and multidisciplinary domain posing a set of challenges to the research communities that intend to utilise and assess criteria, methods and tools. The amount of scientific production published in the field hinders and disorientates the researchers who are interested in the domain. The researchers need guidance in order to exploit the considerable amount of data and the diversity of methods effectively as well as to identify new research goals and develop their plans for future works. This paper proposes a methodological pathway to investigate the core topics of the digital library evaluation domain, author communities, their relationships, as well as the researchers who significantly contribute to major topics. The proposed methodology exploits topic modelling algorithms and network analysis on a corpus consisting of the digital library evaluation papers presented in the JCDL, ECDL/TPDL and ICADL conferences in the period 2001–2013.

Leonidas Papachristopoulos, Giannis Tsakonas, Michalis Sfakakis, Nikos Kleidis, Christos Papatheodorou

### Dissecting a Scholar Popularity Ranking into Different Knowledge Areas

In this paper, we analyze a ranking of the most “popular” scholars working in Brazilian institutions. The ranking was built by first sorting scholars according to their h-index (based on Google Scholar) and then by their total citation count. In our study, we correlate the positions of these top scholars with various academic features such as number of publications, years after doctorate, number of supervised students, as well as other popularity metrics. Moreover, we separate scholars by knowledge area so as to assess how each area is represented in the ranking as well as the importance of the academic features on ranking position across different areas. Our analyses help to dissect the ranking into each area, uncovering similarities and differences as to the relative importance of each feature to scholar popularity as well as the correlations between popularity metrics across knowledge areas.

Gabriel Pacheco, Pablo Figueira, Jussara M. Almeida, Marcos A. Gonçalves
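The ranking rule stated in the abstract above (h-index first, then total citation count) can be read as a two-key sort. The sketch below shows that reading; treating the citation count as a tie-breaker among scholars with equal h-index is an assumption, and the `Scholar` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scholar:
    # Hypothetical record; field names are illustrative, not from the paper.
    name: str
    h_index: int
    citations: int


def rank_scholars(scholars):
    """Sort by h-index (descending), breaking ties by total citations (descending)."""
    return sorted(scholars, key=lambda s: (-s.h_index, -s.citations))


if __name__ == "__main__":
    sample = [
        Scholar("A", h_index=10, citations=500),
        Scholar("B", h_index=12, citations=300),
        Scholar("C", h_index=10, citations=900),
    ]
    for position, scholar in enumerate(rank_scholars(sample), start=1):
        print(position, scholar.name, scholar.h_index, scholar.citations)
```

Under this reading, a scholar with a higher h-index always ranks above one with a lower h-index, regardless of citation totals.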

### Exploring Comparative Evaluation of Semantic Enrichment Tools for Cultural Heritage Metadata

Semantic enrichment of metadata is an important and difficult problem for digital heritage efforts such as Europeana. This paper gives motivations and presents the work of a recently completed Task Force that addressed the topic of evaluation of semantic enrichment. We especially report on the design and the results of a comparative evaluation experiment, where we have assessed the enrichments of seven tools (or configurations thereof) on a sample benchmark dataset from Europeana.

Hugo Manguinhas, Nuno Freire, Antoine Isaac, Juliane Stiller, Valentine Charles, Aitor Soroa, Rainer Simon, Vladimir Alexiev

### Data Integration for the Arts and Humanities: A Language Theoretical Concept

In the context of the arts and humanities, heterogeneity largely corresponds to the variety of disciplines, their research questions and communities. Resulting from the diversity of the application domain, the analysis of overall requirements and the subsequent derivation of appropriate unifying schemata is prevented by the complexity and size of the domain. The approach presented in this paper is based on the hypothesis that data integration problems in the arts and humanities can be solved on the theoretical foundation of formal languages. In applying a theoretically substantiated framework, integrative solutions on the formal basis of language specifications can be tailored to specific and individual research needs—abstracting from reoccurring technical difficulties and leading the focus of domain experts on semantic aspects.

### The Challenge of Creating Geo-Location Markup for Digital Books

The story lines of many books occupy real world locations. We have previously explored the challenges of creating automatic location markup that might be used in location-based audio services for digital books. This paper explores the challenges of manually creating location annotations for digital books. We annotated three very different books, and report here on the insights gained and lessons learned. We draw conclusions for the design of software that might support this annotation process in situ and ex situ.

Annika Hinze, David Bainbridge, Sally Jo Cunningham

David Zellhöfer

### Person-Centric Mining of Historical Newspaper Collections

We present a text mining environment that supports entity-centric mining of terascale historical newspaper collections. Information about entities and their relation to each other is often crucial for historical research. However, most text mining tools provide only very basic support for dealing with entities, typically at most including facilities for entity tagging. Historians, on the other hand, are typically interested in the relations between entities and the contexts in which these are mentioned. In this paper, we focus on person entities. We provide an overview of the tool and describe how person-centric mining can be integrated in a general-purpose text mining environment. We also discuss our approach for automatically extracting person networks from newspaper archives, which includes a novel method for person name disambiguation, which is particularly suited for the newspaper domain and obtains state-of-the-art disambiguation results.

Mariona Coll Ardanuy, Jürgen Knauth, Andrei Beliankou, Maarten van den Bos, Caroline Sporleder

### Usability in Digital Humanities - Evaluating User Interfaces, Infrastructural Components and the Use of Mobile Devices During Research Process

The usability of tools and services that form a digital research infrastructure is a key asset for their acceptance among researchers. When applied to infrastructures, the concept of usability needs to be extended to other aspects such as the interoperability between several infrastructure components. In this paper, we present the results of several usability studies. Our aim was not only to test the usability of single tools but also to assess the extent to which different tools and devices can be seamlessly integrated into a single digital research workflow. Our findings suggest that more resources need to be spent on testing digital tools and infrastructure components and that it is especially important to conduct user tests covering the whole knowledge process.

Natasa Bulatovic, Timo Gnadt, Matteo Romanello, Juliane Stiller, Klaus Thoden

Open Access

### CERN Analysis Preservation: A Novel Digital Library Service to Enable Reusable and Reproducible Research

The latest policy developments require immediate action for data preservation, as well as reproducible and Open Science. To address this, an unprecedented digital library service is presented to enable the High-Energy Physics community to preserve and share their research objects (such as data, code, documentation, notes) throughout their research process. While facing the challenges of a “big data” community, the internal service builds on existing internal databases to make the process as easy and intrinsic as possible for researchers. Given the “work in progress” nature of the objects preserved, versioning is supported. It is expected that the service will not only facilitate better preservation techniques in the community, but will foremost make collaborative research easier as detailed metadata and novel retrieval functionality provide better access to ongoing works. This new type of e-infrastructure, fully integrated into the research workflow, could help in fostering Open Science practices across disciplines.

Xiaoli Chen, Sünje Dallmeier-Tiessen, Anxhela Dani, Robin Dasler, Javier Delgado Fernández, Pamfilos Fokianos, Patricia Herterich, Tibor Šimko

### DataQ: A Data Flow Quality Monitoring System for Aggregative Data Infrastructures

Andrea Mannocci, Paolo Manghi

### Scientific Social Publications for Digital Libraries

Social web content is an important development in the scientific workflow. In this context, scientific blogs are an important medium: they play a significant role in the timely dissemination of scientific developments, and provide useful grounds for discussion and development via the readers’ feedback. Blogs from the domain of economics are no exception to this practice. A possible extension to Digital Library (DL) services, content- and service-wise, is to give their users access to these blogs. This paper demonstrates an approach for seamlessly integrating scientific blogs in DLs and, with the developed proof of concept application, showcases the resulting benefits for the users and DLs.

Fidan Limani, Atif Latif, Klaus Tochtermann

### Formal Representation of Socio-Legal Roles and Functions for the Description of History

We propose a modeling approach for formal descriptions of historical material. In our previous work, we defined the formal structures of social entities such as roles, rights and obligations, activities, and processes which appear in the Roman Constitution, as an application of Basic Formal Ontology (BFO). In this paper, we extend that approach by incorporating aspects of the Information Artifact Ontology (IAO) and the emerging Document Acts Ontology (DAO). We use these to describe relationships among realizable entities (role and function), rights and obligations that are aligned to Socio-Legal Generically Dependent Continuants (SGDCs) of DAO, and activities as subtypes of directive information entity of IAO. Two examples are discussed: a passage from a digitized historical newspaper and a description of citizenship in ancient Rome.

Yoonmi Chu, Robert B. Allen

### Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names

With the increasing size of digital libraries it has become a challenge to identify author names correctly. The situation becomes more critical when different persons share the same name (homonym problem) or when the names of authors are presented in several different ways (synonym problem). This paper focuses on homonym names in the computer science bibliography DBLP. The goal of this study is to evaluate a method which uses co-authorship networks and to analyze the effect of common names on it. For this purpose we clustered the publications of authors with the same name and measured the effectiveness of the method against a gold standard of manually assigned DBLP records. The results show that, despite the good performance of the implemented method for most names, we should optimize for common names. Hence community detection was employed to optimize the method. Results prove that the applied method improves the performance for these names.

Fakhri Momeni, Philipp Mayr
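To illustrate the general idea behind co-authorship-based disambiguation in the abstract above, here is a minimal sketch. It is not the authors' implementation: it assumes a simplified rule in which two publications carrying the same ambiguous name belong to the same person if they share at least one co-author, directly or through a chain of shared co-authors. The function name `cluster_by_coauthors` and the sample records are hypothetical.

```python
from collections import defaultdict


def cluster_by_coauthors(publications, target_name):
    """Group publications of `target_name` that are linked by shared co-authors.

    `publications` maps a publication id to its list of author names.
    Assumption: a shared co-author implies the same real-world person.
    """
    # Union-find structure: every publication starts in its own cluster.
    parent = {pub_id: pub_id for pub_id in publications}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first publication in which each co-author appears and
    # merge every later publication of that co-author into its cluster.
    first_pub_with = {}
    for pub_id, authors in publications.items():
        for coauthor in authors:
            if coauthor == target_name:
                continue  # the ambiguous name itself carries no evidence
            if coauthor in first_pub_with:
                union(first_pub_with[coauthor], pub_id)
            else:
                first_pub_with[coauthor] = pub_id

    clusters = defaultdict(set)
    for pub_id in publications:
        clusters[find(pub_id)].add(pub_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical records for one ambiguous name.
example = {
    "p1": ["Wei Wang", "A. Smith"],
    "p2": ["Wei Wang", "A. Smith", "B. Jones"],
    "p3": ["Wei Wang", "C. Lee"],
    "p4": ["Wei Wang", "B. Jones"],
}
clusters = cluster_by_coauthors(example, "Wei Wang")
```

With these sample records, `p1`, `p2`, and `p4` end up in one cluster and `p3` in another. For very common names, a single shared co-author is weaker evidence, which is consistent with the abstract's observation that such names need additional optimization.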

### Ten Months of Digital Reading: An Exploratory Log Study

We address digital reading practices in Russia by analyzing 10 months of logging data from a commercial ebook mobile app. We describe the data and focus on three aspects: reading schedule, reading speed, and book abandonment. The exploratory study demonstrates the high potential of the data and the proposed approach.

Pavel Braslavski, Vivien Petras, Valery Likhosherstov, Maria Gäde

### A Case Study of Summarizing and Normalizing the Properties of DBpedia Building Instances

The DBpedia ontology forms the structural backbone of the DBpedia linked open dataset. Among its classes, dbo:Building and dbo:HistoricBuilding entities hold information on thousands of important buildings and monuments, making DBpedia an international digital repository of architectural heritage. To be fully exploited for academic research and other purposes, this knowledge about architectural structures must be homogenized, because its richest source, the Wikipedia infobox template system, is a heterogeneous and non-standardized environment. This work summarizes the most widely used properties for buildings and categorizes and highlights structural and semantic heterogeneities, allowing DBpedia’s users to fully exploit the available information.

Michail Agathos, Eleftherios Kalogeros, Sarantos Kapidakis

### What Happens When the Untrained Search for Training Information

Unemployed and information-illiterate people often have the greatest need for information because it could change their lives. While a lot of information on jobs and training is available online, it is unclear whether the target users are actually able to find it. This paper presents the findings of a study of the expectations of low-skilled people with low information literacy when searching for information about training courses. The results indicate that users have access to technology and that the information is indeed available online, but the users who need this information most are not able to find it using conventional search engines.

Jorgina Paihama, Hussein Suleman

### InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives

We have integrated Web ARChive (WARC) files with the peer-to-peer content addressable InterPlanetary File System (IPFS) to allow the payload content of web archives to be easily propagated. We also provide an archival replay system extended from pywb to fetch the WARC content from IPFS and re-assemble the originally archived HTTP responses for replay. From a 1.0 GB sample Archive-It collection of WARCs containing 21,994 mementos, we show that extracting and indexing the HTTP response content of WARCs containing IPFS lookup hashes takes 66.6 min inclusive of dissemination into IPFS.

Mat Kelly, Sawood Alam, Michael L. Nelson, Michele C. Weigle
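The content-addressing idea in the abstract above can be sketched as follows. This is a toy illustration, not the authors' system and not the IPFS or pywb API: an in-memory dictionary stands in for IPFS, SHA-256 hex digests stand in for IPFS lookup hashes, and storing headers and payload as separate objects is an assumption made for the example. The names `ContentStore`, `archive_response`, and `replay` are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the hash of the content."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Identical content always maps to the same key, so it is stored once.
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]


def archive_response(store, index, url, timestamp, headers: bytes, payload: bytes):
    """Push headers and payload into the store; keep only hashes in the index."""
    index[(url, timestamp)] = (store.put(headers), store.put(payload))


def replay(store, index, url, timestamp) -> bytes:
    """Look up the hashes for a capture and re-assemble the HTTP response."""
    header_hash, payload_hash = index[(url, timestamp)]
    return store.get(header_hash) + b"\r\n\r\n" + store.get(payload_hash)


# Hypothetical capture of a single page.
store, index = ContentStore(), {}
headers = b"HTTP/1.1 200 OK\r\nContent-Type: text/html"
payload = b"<html><body>archived page</body></html>"
archive_response(store, index, "http://example.org/", "20160905120000", headers, payload)
response = replay(store, index, "http://example.org/", "20160905120000")
```

Because the index holds only hashes, any node that has the referenced content can serve it, which is what makes propagation between peers straightforward in a content-addressed design.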

### Exploring Metadata Providers Reliability and Update Behavior

Sarantos Kapidakis

### TIB|AV-Portal: Integrating Automatically Generated Video Annotations into the Web of Data

The German National Library of Science and Technology (TIB) aims to promote the use and distribution of its collections. In this context, TIB publishes metadata of scientific videos from the TIB|AV-Portal as linked open data. In contrast to other library metadata, the TIB|AV-Portal's metadata is generated by automated metadata extraction and named entity linking, which yield time-based semantic metadata. By publishing this metadata, TIB offers a new service that provides quarterly updated data in RDF format for reuse by third parties. This paper introduces the strategy and the challenges of the linked open data service.

Jörg Waitelonis, Margret Plank, Harald Sack

### Supporting Web Surfers in Finding Related Material in Digital Library Repositories

Web surfers often need additional information beyond the page they are currently reading. While such related material is available in digital library repositories, finding it within these repositories can be a challenging task. To ease the burden on the user, we present an approach to construct queries automatically from a textual paragraph. Named entities from the paragraph and a query scheme that includes the topic of the paragraph form the two pillars of this approach, which is applicable to any search system that supports keyword queries. Evaluation results suggest that users are unable to find optimal queries and need support in doing so.

Jörg Schlötterer, Christin Seifert, Michael Granitzer

### Do Ambiguous Words Improve Probing for Federated Search?

The core approach to distributed knowledge bases is federated search. Two of the main challenges for federated search are source representation and source selection. Different solutions to these problems have been proposed in the literature. In this work, we present a novel approach to query-based sampling that relies on knowledge bases. We show the basic correctness of our approach and find that the ambiguity of the probing terms has only a minor impact on the representation of the collection. Finally, we show that our method can be used to distinguish between niche and encyclopedic knowledge bases.

Günter Urak, Hermann Ziak, Roman Kern

### Automatic Recognition and Disambiguation of Library of Congress Subject Headings

In this article, we investigate the possibilities of extracting Library of Congress Subject Headings from texts. The large number of ambiguous terms turns out to be a problem. Disambiguation of subject headings appears to have the potential to improve the extraction results.

Rosa Tsegaye Aga, Christian Wartena, Michael Franke-Maier

### The Problem of Categorizing Conferences in Computer Science

Research in computer science (CS) is mainly published in conferences. It makes sense to study conferences to understand CS research. We present and discuss the problem of categorizing CS conferences as well as the challenges in doing so.

Suhendry Effendy, Roland H. C. Yap

### Germania Sacra Online – The Research Portal of Clerics and Religious Institutions Before 1810

The research project Germania Sacra provides a comprehensive prosopographical database that makes structured and comparable data on the Church of the Holy Roman Empire available for further research. The database contains approximately 38,000 records of premodern persons, and new data is continuously added. This digital index of persons is supplemented by the “Database of Monasteries, Convents and Collegiate Churches of the Old Empire”. Access through ecclesiastical institutions offers a broad variety of visualization possibilities for the prosopographical data. To make as much information as possible accessible for scholarly use, the next steps will be cross-institutional collaboration and the integration of scientific data resources from other research projects.

Bärbel Kröger, Christian Popp

### Open Digital Forms

The maintenance of digital libraries often relies on physical paper forms. Such forms are tedious to handle for both senders and receivers. Several commercial solutions exist for the digitization of forms. However, most of them are proprietary, expensive, centralized, or require software installation. With this demo, we propose a free, secure, and lightweight framework for digital forms. It is based on HTML documents with embedded JavaScript, uses exclusively open standards, and does not require a centralized architecture. Our forms can be digitally signed with the OpenPGP standard, and they contain machine-readable RDFa. Thus, they allow for the semantic analysis, sharing, re-use, or merger of documents across users or institutions.

Hiep Le, Thomas Rebele, Fabian Suchanek

### A Text Mining Framework for Accelerating the Semantic Curation of Literature

The Biodiversity Heritage Library is the world’s largest digital library of biodiversity literature. Currently containing almost 40 million pages, the library can be explored with a search interface employing keyword matching, which unfortunately fails to address issues brought about by ambiguity. Tools that automatically attach semantic metadata to documents, e.g., biodiversity concept recognisers, help alleviate these issues. However, gold standard, semantically annotated textual corpora are critical for the development of these advanced tools. In the biodiversity domain, such corpora are almost non-existent, especially since the construction of semantically annotated resources is typically a time-consuming and laborious process. Aiming to accelerate the development of a corpus of biodiversity documents, we propose a text mining framework that hastens curation through an iterative feedback-loop process of (1) manual annotation, and (2) training and application of statistical concept recognition models. Even after only a few iterations, our curators were observed to have spent less time and effort on annotation.

Riza Batista-Navarro, Jennifer Hammock, William Ulate, Sophia Ananiadou

### Backmatter
