
About this book

This book constitutes the thoroughly refereed proceedings of the 15th Italian Research Conference on Digital Libraries, IRCDL 2019, held in Pisa, Italy, in January/February 2019.

The 22 full papers and 5 short papers presented were carefully selected from 42 submissions. The papers are organized in topical sections on information retrieval, digital libraries and archives, information integration, open science, and data mining.



Information Retrieval


On Synergies Between Information Retrieval and Digital Libraries

In this paper we present the results of a longitudinal analysis of ACM SIGIR papers from 2003 to 2017. ACM SIGIR is the main venue where Information Retrieval (IR) research and innovative results are presented yearly; it is a highly competitive venue and only the best and most relevant works are accepted for publication. The analysis of ACM SIGIR papers gives us a unique opportunity to understand where the field is going and what are the most trending topics in information access and search.
In particular, we conduct this analysis with a focus on Digital Library (DL) topics to understand the relation between these two fields, which we know to be closely linked. We see that DLs provide document collections and challenging tasks to be addressed by the IR community and in turn exploit the latest advancements in IR to improve the services they offer.
We also point to the role of public investments in the DL field as one of the core drivers of DL research which in turn may also have a positive effect on information accessing and searching in general.
Maristella Agosti, Erika Fabris, Gianmaria Silvello

Making Large Collections of Handwritten Material Easily Accessible and Searchable

Libraries and cultural organisations hold a rich amount of digitised historical handwritten material in the form of scanned images. The vast majority of this material has not been transcribed yet, owing to technological challenges and lack of expertise. This makes it challenging to open these historical collections to public access, especially when it comes to performing a simple text search across the collection. Machine learning based methods for handwritten text recognition are gaining importance, but they require a huge amount of pre-transcribed text for training. However, it is impractical to obtain several thousand pre-transcribed documents, given the difficulties transcribers face. Therefore, this paper presents a training-free word spotting algorithm as an alternative for handwritten text transcription, with case studies on Alvin (a Swedish repository) and Clavius on the Web. The main focus of this work is on discussing the prospects of making materials in the Alvin platform and Clavius on the Web easily searchable using a word spotting based handwritten text recognition system.
Anders Hast, Per Cullhed, Ekta Vats, Matteo Abrate

Transparency in Keyword Faceted Search: An Investigation on Google Shopping

The most popular e-commerce search engines allow the user to run a keyword search, find relevant results, and narrow down the results by means of filters. The engines can also keep track of users’ data and activities to provide personalized content, thus automatically filtering out part of the results. Issues occur when personalization is not transparent and interferes with the user’s choices. Indeed, it has been noticed that, in some cases, a different ordering of search results is shown to different users. This becomes particularly critical when search results are associated with prices. Changing the order of search results according to prices is known as price steering. This study investigates whether and how price steering occurs, considering queries on Google Shopping by users searching from different geographic locations, distinguishable by their values of Gross Domestic Product.
The results confirm that products belonging to specific categories (e.g., electronic devices and apparel) are shown to users according to different price orderings, and the prices in the results list differ, on average, in a way that depends on users’ location. All results are validated through statistical tests.
Vittoria Cozza, Van Tien Hoang, Marinella Petrocchi, Rocco De Nicola

Predicting the Usability of the Dice CAPTCHA via Artificial Neural Network

This paper introduces a new study of CAPTCHA usability which analyses the predictability of the solution time, also called response time, needed to solve the Dice CAPTCHA. This is accomplished by proposing a new artificial neural network model for predicting the response time from known personal and demographic features of the users who solve the CAPTCHA: (i) age, (ii) device on which the CAPTCHA is solved, and (iii) Web use in years. The experiment involves a population of 197 Internet users, who are required to solve two types of Dice CAPTCHA on a laptop or tablet computer. The data collected from the experiment are fed to the artificial neural network model, which is trained and tested to predict the response time. The proposed analysis provides new results on the usability of the Dice CAPTCHA and important suggestions for designing new CAPTCHAs which could be closer to an “ideal” CAPTCHA.
Alessia Amelio, Radmila Janković, Dejan Tanikić, Ivo Rumenov Draganov

Digital Libraries and Archives


Water to the Thirsty: Reflections on the Ethical Mission of Libraries and Open Access

The shift to digital information determines a parallel shift in access modes, and digital libraries are called to action by the ethical foundations of their mission. Open Access makes information potentially available not just to researchers, but to everyone, yet there are still barriers to be overcome in terms of technical infrastructures, points of access, digital and cultural divide.
The mission of libraries, as stated by IFLA Manifesto for Digital Libraries and IFLA/FAIFE Code of Ethics for Librarians and other Information Workers, converges with the mission and ethics of the BBB declarations on Open Access: it is about delivering information to everyone, from scholars to the “curious minds”, and librarians can be mediators in the wide diffusion, at all levels of society, of scientific, scholarly knowledge, to foster “active” and “scientific” citizenship.
Matilde Fontanin, Paola Castellucci

Computational Terminology in eHealth

In this paper, we present a methodology for the development of a new eHealth resource in the context of Computational Terminology. This resource, named TriMED, is a digital library of terminological records designed to satisfy the information needs of different categories of users within the healthcare field: patients, language professionals and physicians. TriMED offers a wide range of information for the purpose of simplification of medical language in terms of understandability and readability. Finally, we present two applications of our resource in order to conduct different types of studies in particular in Information Retrieval and Literature Analysis.
Federica Vezzani, Giorgio Maria Di Nunzio

Connecting Researchers to Data Repositories in the Earth, Space, and Environmental Sciences

The Repository Finder tool was developed to help researchers in the domain of Earth, space, and environmental sciences to identify appropriate repositories where they can deposit their research data and to promote practices that implement the FAIR Principles, encouraging progress toward sharing data that are findable, accessible, interoperable, and reusable. Requirements for the design of the tool were gathered through a series of workshops and working groups as a part of the Enabling FAIR Data initiative led by the American Geophysical Union that included the development of a decision tree that researchers may follow in selecting a data repository, interviews with domain repository managers, and usability testing. The tool is hosted on the web by DataCite and enables a researcher to query all data repositories by keyword or to view a list of domain repositories that accept data for deposit, support open access, and provide persistent identifiers. The tool uses metadata records from the registry of research data repositories, and the returned results highlight repositories that have achieved trustworthy digital repository certification through a formal procedure such as the CoreTrustSeal.
Michael Witt, Shelley Stall, Ruth Duerr, Raymond Plante, Martin Fenner, Robin Dasler, Patricia Cruse, Sophie Hou, Robert Ulrich, Danie Kinkade

Learning to Cite: Transfer Learning for Digital Archives

We consider the problem of automatically creating citations for digital archives. We focus on the learning to cite framework that allows us to create citations without users or experts in the loop. In this work, we study the possibility of learning a citation model on one archive and then applying the model to another archive that has never been seen before by the system.
Dennis Dosso, Guido Setti, Gianmaria Silvello

Exploring Semantic Archival Collections: The Case of Piłsudski Institute of America

Over the last decades, a huge number of digital collections have been published on the Web, opening up new possibilities for solving old questions and posing new ones. However, finding pertinent information in archives is often not an easy task. Semantic Web technologies are rapidly changing archival research by providing a way to formally describe archival documents.
In this paper, we present the activities employed in building the semantic layer of the Piłsudski Institute of America digital archive. In order to accommodate the description of archival documents as well as historical references contained in these, we used the arkivo ontology, which aims at providing a reference schema for publishing Linked Data. Finally, we present some query examples that meet the domain experts’ information needs.
Laura Pandolfo, Luca Pulina, Marek Zieliński

Digital Libraries for Open Science: Using a Socio-Technical Interaction Network Approach

This paper argues for using Socio-Technical Interaction Networks to build on extensively used Digital Library infrastructures for supporting Open Science knowledge environments. Using a more socio-technical approach could lead to an evolutionary reconceptualization of Digital Libraries. Digital Libraries used as knowledge environments, built upon document repositories, will also emphasize the importance of user interaction and collaboration in carrying out those activities. That is to say, the primary goal of Digital Libraries is to help users convert information into knowledge; therefore, Digital Libraries examined in light of socio-technical interaction networks have the potential to shift from individual, isolated collections to more interoperable, interconnected knowledge-creating repositories that support an evolving relationship between open science users and the Digital Library environment.
Jennifer E. Beamer

Information Integration


OpenAIRE’s DOIBoost - Boosting Crossref for Research

Research in information science and scholarly communication strongly relies on the availability of openly accessible datasets of scholarly entities metadata and, where possible, their relative payloads. Since such metadata information is scattered across diverse, freely accessible, online resources (e.g. Crossref, ORCID), researchers in this domain are doomed to struggle with (meta)data integration problems, in order to produce custom datasets of often undocumented and rather obscure provenance. This practice leads to wasted time and duplicated effort, and typically infringes open science best practices of transparency and reproducibility of science. In this article, we describe how to generate DOIBoost, a metadata collection that enriches Crossref with inputs from Microsoft Academic Graph, ORCID, and Unpaywall for the purpose of supporting high-quality and robust research experiments, saving researchers time and enabling the comparison of such experiments. To this end, we describe the dataset value and its schema, analyse its actual content, and share the software Toolkit and experimental workflow required to reproduce it. The DOIBoost dataset and Software Toolkit are made openly available via Zenodo.org. DOIBoost will become an input source to the OpenAIRE information graph.
Sandro La Bruzzo, Paolo Manghi, Andrea Mannocci

Enriching Digital Libraries with Crowdsensed Data

Twitter Monitor and the SoBigData Ecosystem
SoBigData is a Research Infrastructure (RI) aiming to provide an integrated ecosystem for ethic-sensitive scientific discoveries and advanced applications of social data mining. A key milestone of the project focuses on data, methods and results sharing, in order to ensure the reproducibility, review and re-use of scientific works. For this reason, the Digital Library paradigm is implemented within the RI, providing users with virtual environments where datasets, methods and results can be collected, maintained, managed and preserved, granting full documentation, access and the possibility to re-use.
In this paper, we describe the results of our effort to integrate the Twitter Monitor, a tool for gathering messages from the Twitter Online Social Network, into the SoBigData RI. The Twitter Monitor provides a simple user interface, enabling researchers and stakeholders without programming skills to seamlessly (i) select relevant messages out of the huge Twitter stream by means of language, keyword, user tracking and geographical filters, (ii) store data in the user’s personal Workspace, and (iii) publish them in the SoBigData Resource Catalogue, which implements all the aforementioned Digital Library features.
Thanks to the seamless integration in the SoBigData RI, the Twitter Monitor allows researchers and stakeholders, belonging to different areas and having different backgrounds, to exploit the crowdsensing paradigm for enriching the SoBigData Digital Library. In this way, crowdsensing acquires the key features of openness, accessibility, interoperability and interdisciplinarity that characterize the Digital Libraries framework.
Stefano Cresci, Salvatore Minutoli, Leonardo Nizzoli, Serena Tardelli, Maurizio Tesconi

Populating Narratives Using Wikidata Events: An Initial Experiment

The study presented in this paper is part of our research aimed at improving the search functionalities of current Digital Libraries using formal narratives. Narratives are intended as sequences of events. We present the results of an initial experiment to detect and extract implicit events from the Wikidata knowledge base in order to construct a narrative in a semi-automatic way. Wikidata contains many historical entities, but comparatively few events. The reason is that most events in Wikidata are represented in an implicit way, e.g. by listing a date of birth instead of having an event of type “birth”. For this reason, we decided to generate what we call the Wikidata Event Graph (WEG), i.e. the graph of implicit events found in Wikidata. We performed an initial experiment taking as a case study the narrative of the life of the Italian poet Dante Alighieri. Only one event of the life of Dante is explicitly represented in Wikidata as an instance of the class Q1190554 Occurrence. Using the WEG, we were able to automatically detect 31 more events of Dante’s life that were present in Wikidata in an implicit way.
Daniele Metilli, Valentina Bartalesi, Carlo Meghini, Nicola Aloia
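
The notion of an implicit event described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' Wikidata Event Graph implementation: the `Event` class, the `extract_implicit_events` function, and the flat `statements` dictionary are hypothetical names introduced here, and only two Wikidata properties (P569, date of birth; P570, date of death) are mapped to event types.

```python
from dataclasses import dataclass

# Wikidata properties that implicitly encode an event.
# Illustrative subset only: P569 = date of birth, P570 = date of death.
IMPLICIT_EVENT_PROPERTIES = {
    "P569": "birth",
    "P570": "death",
}


@dataclass(frozen=True)
class Event:
    """A hypothetical explicit event derived from one statement."""
    kind: str      # e.g. "birth"
    subject: str   # label of the entity the event is about
    date: str      # date value copied from the statement


def extract_implicit_events(subject, statements):
    """Turn date-valued statements into explicit events, ordered by date.

    `statements` is assumed to be a flat mapping from property ID to a
    date string; real Wikidata statements are more deeply nested.
    """
    events = []
    for prop, value in statements.items():
        kind = IMPLICIT_EVENT_PROPERTIES.get(prop)
        if kind is not None:  # ignore properties that imply no event
            events.append(Event(kind, subject, value))
    return sorted(events, key=lambda e: e.date)


# Illustrative input: year-only values, simplified for the example.
dante = {"P570": "1321", "P569": "1265", "P106": "poet"}
narrative = extract_implicit_events("Dante Alighieri", dante)
for event in narrative:
    print(event.date, event.kind, event.subject)
```

Each matching statement becomes one event, so a narrative can be assembled as a date-ordered sequence; statements with no event reading (here the occupation, P106) are skipped.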

Metadata as Semantic Palimpsests: The Case of PHAIDRA@unipd

This paper illustrates the experience of the Library System of the University of Padova in reviewing the data model of Phaidra (Permanent Hosting, Archiving and Indexing of Digital Resources and Assets), the digital repository for the long-term management and preservation of digital objects in place since 2010, whose system was created and developed by the University of Vienna. In order to provide better informational representation and visualisation of data, both in terms of metadata quality and display, this re-examination consisted in a critical analysis of the foundational metadata profile of Phaidra, its mapping and conversion into the Dublin Core metadata schema (Dublin Core Metadata Element Set 1.1) and, at prototype level, into the Metadata Object Description Schema (MODS). This paper discusses the evidence of the identified solutions being guided by two core principles: on the one hand, the distinctive valorisation of the dual analogue-digital nature of the Phaidra cultural heritage object, on the other, the metadata reuse in the visual function for the graphic updating of the new web interface, which is being done in order to encourage the discovery, even serendipitously, of its content by the digital researcher. Finally, the presentation considers the development activities being carried out by the Phaidra working groups of the Universities of Padova and Vienna, focused on the semantic evolution of the concept of metadata to open data, by presenting here an unpublished example of the Simple Knowledge Organization System (SKOS) prototype and last, but not least, suggesting the definition of a new Phaidra data model.
Anna Bellotto, Cristiana Bettella

In Codice Ratio: Machine Transcription of Medieval Manuscripts

Our project, In Codice Ratio, is an interdisciplinary research initiative for analyzing the content of historical documents conserved in the Vatican Secret Archives (VSA). As most of these documents are digitized as images, Machine Transcription is both an enabler for the application of Knowledge Discovery techniques and a useful tool for the paleographer in speeding up the transcription process. Our approach involves a convolutional neural network to recognize characters, statistical language models to compose and rank word transcriptions, and crowdsourcing for scalable training data collection. We have conducted experiments on pages from the medieval manuscript collection known as the Vatican Registers. Our results show that almost all the considered words can be transcribed without significant spelling errors.
Serena Ammirati, Donatella Firmani, Marco Maiorino, Paolo Merialdo, Elena Nieddu

Open Science


Foundations of a Framework for Peer-Reviewing the Research Flow

Traditionally, peer review focuses on the evaluation of scientific publications, literature products that describe the research process and its final results in natural language. The adoption of ICT in support of science introduces new opportunities to support transparent evaluation, thanks to the possibility of sharing research products, including inputs and intermediate and negative results, and of repeating and reproducing the research activities conducted in a digital laboratory. Such an innovative shift also sets the conditions for novel peer review methodologies, as well as scientific reward policies, where scientific results can be transparently and objectively assessed via machine-assisted processes. This paper presents the foundations of a framework for the representation of a peer-reviewable research flow for a given discipline of science. Such a framework may become the scaffolding enabling the development of tools for supporting ongoing peer review of research flows. Such tools could be “hooked”, in real time, to the underlying digital laboratory, where scientists are carrying out their research flow, and they would abstract over the complexity of the research activity and offer user-friendly dashboards.
Alessia Bardi, Vittore Casarosa, Paolo Manghi

A Practical Workflow for an Open Scientific Lifecycle Project: EcoNAOS

This paper presents a review of the practical application, the work done, and the near-future perspectives of an open scientific lifecycle model. The EcoNAOS (Ecological North Adriatic Open Science Observatory System) project is an example of the application of Open Science principles to long-term marine research. By long-term marine research we mean here all marine research projects based on Long Term Ecological Data. In the paper, the structure of the lifecycle, modeled on Open Science principles, is presented. The project develops through some fundamental steps: database correction and harmonization, metadata collection, data exploitation by publication on a web infrastructure, and planning of dissemination events. The project also foresees the setting up of a data citation and versioning model (adapted to dynamic databases) and the final production of guidelines illustrating the whole process in detail. The state of progress of these steps is reviewed. Results achieved and expected outcomes are explained, with a particular focus on the upcoming work.
Annalisa Minelli, Alessandro Sarretta, Alessandro Oggioni, Caterina Bergami, Alessandra Pugnetti

Data Deposit in a CKAN Repository: A Dublin Core-Based Simplified Workflow

Researchers are currently encouraged by their institutions and the funding agencies to deposit data resulting from projects. Activities related to research data management, namely organization, description, and deposit, are not obvious for researchers due to the lack of knowledge on metadata and the limited data publication experience. Institutions are looking for solutions to help researchers organize their data and make them ready for publication. We consider here the deposit process for a CKAN-powered data repository managed as part of the IT services of a large research institute. A simplified data deposit process is illustrated here by means of a set of examples where researchers describe their data and complete the publication in the repository. The process is organised around a Dublin Core-based dataset deposit form, filled by the researchers as preparation for data deposit. The contacts with researchers provided the opportunity to gather feedback about the Dublin Core metadata and the overall experience. Reflections on the ongoing process highlight a few difficulties in data description, but also show that researchers are motivated to get involved in data publication activities.
Yulia Karimova, João Aguiar Castro, Cristina Ribeiro

Information Literacy Needs Open Access or: Open Access is not Only for Researchers

Open Access was initially conceived, albeit mildly, with not only researchers but also lay readers in mind; this perspective then slowly faded. The Information Literacy movement wants to teach citizens how to find trustworthy information, but the amount of paywalled knowledge is still large. Their lines of development are therefore complementary: Information Literacy needs Open Access for citizens to freely access high-quality information, while Open Access truly fulfils its scope when it is conceived and realized not only for researchers (an aristocratic view, which was the initial one) but for the whole of society.
Maurizio Lana

The OpenUP Pilot on Research Data Sharing, Validation and Dissemination in Social Sciences

The paper presents the results of a pilot carried out within the European project OpenUP (Opening up new methods, indicators and tools for peer review, dissemination of research results and impact measurement). The aim of the pilot is to investigate the applicability of peer review and/or Open Peer Review (OPR) to datasets in disciplines related to the social sciences. The main emphasis is on the characteristics and features of data sharing and validation in this heterogeneous scientific field, thus providing the basis for the selection of the community chosen for the pilot. Indications emerging from the analysis of the interviews carried out in the pilot can drive the adoption of data quality assessment, and hence peer review, as well as provide some principles that can incentivize other scientific communities to share their research data.
Daniela Luzi, Roberta Ruggieri, Lucio Pisacane

Crowdsourcing Peer Review: As We May Do

This paper describes Readersourcing 2.0, an ecosystem providing an implementation of the Readersourcing approach proposed by Mizzaro [10]. Readersourcing is proposed as an alternative to the standard peer review activity that aims to exploit the otherwise lost opinions of readers. Readersourcing 2.0 implements two different models based on the so-called codetermination algorithms. We describe the requirements, present the overall architecture, and show how the end-user can interact with the system. Readersourcing 2.0 will be used in the future to study also other topics, like the idea of shepherding the users to achieve a better quality of the reviews and the differences between a review activity carried out with a single-blind or a double-blind approach.
Michael Soprano, Stefano Mizzaro

Hands-On Data Publishing with Researchers: Five Experiments with Metadata in Multiple Domains

The current requirements for open data in the EU are increasing the awareness of researchers with respect to data management and data publication. Metadata is essential in research data management, namely for data discovery and reuse. Current practices tend to either leave metadata definition to researchers, or to assign their creation to curators. The former typically results in ad-hoc descriptors, while the latter follows standards but lacks specificity. In this exploratory study, we adopt a researcher-curator collaborative approach in five data publication cases, involving researchers in data description and discussing the use of both generic and domain-oriented metadata. The study shows that researchers working on familiar datasets can contribute effectively to the definition of metadata models, in addition to the actual metadata creation. The cases also provide preliminary evidence of cross-disciplinary descriptor use. Moreover, the interaction with curators highlights the advantages of data management, making researchers more open to participating in the corresponding tasks.
Joana Rodrigues, João Aguiar Castro, João Rocha da Silva, Cristina Ribeiro

Data Mining


Towards a Process Mining Approach to Grammar Induction for Digital Libraries

Syntax Checking and Style Analysis
Since most content in Digital Libraries and Archives is text, there is an interest in the application of Natural Language Processing (NLP) to extract valuable information from it in order to support various kinds of user activities. Most NLP techniques exploit linguistic resources that are language-specific, costly and error prone to produce manually, which motivates research for automatic ways to build them.
This paper extends the BLA-BLA tool for learning linguistic resources, adding a Grammar Induction feature based on the advanced process mining and management system WoMan. Experimental results are encouraging, envisaging interesting applications to Digital Libraries and motivating further research aimed at extracting an explicit grammar from the learned models.
Stefano Ferilli, Sergio Angelastro

Keyphrase Extraction via an Attentive Model

Keyphrase extraction is a task of crucial importance for digital libraries. When this task is performed automatically, the context in which a specific word is located seems to play a substantial role. To exploit this context, in this paper we propose an architecture based on an Attentive Model: a neural network designed to focus on the most relevant parts of data. A preliminary experimental evaluation on the widely used INSPEC dataset confirms the validity of the approach and shows that it achieves higher performance than the state of the art.
Marco Passon, Massimo Comuzzo, Giuseppe Serra, Carlo Tasso

Semantically Aware Text Categorisation for Metadata Annotation

In this paper we illustrate a system aimed at solving a long-standing and challenging problem: acquiring a classifier to automatically annotate bibliographic records by starting from a huge set of unbalanced and unlabelled data. We illustrate the main features of the dataset, the learning algorithm adopted, and how it was used to discriminate philosophical documents from documents of other disciplines. One strength of our approach lies in the novel combination of a standard learning approach with a semantic one: the results of the acquired classifier are improved by accessing a semantic network containing conceptual information. We illustrate the experimentation by describing the rationale behind the construction of the training and test sets, report and discuss the obtained results, and conclude by outlining future work.
Giulio Carducci, Marco Leontino, Daniele P. Radicioni, Guido Bonino, Enrico Pasini, Paolo Tripodi

Collecting and Controlling Distributed Research Information by Linking to External Authority Data - A Case Study

With the world wide web, scientific information has become distributed and is often held redundantly at different server locations. The vision of a current research information system (CRIS) as an environment for constantly monitoring and tracking a researcher’s output has become vivid, but it still struggles with issues such as legacy information and institutional repository structures that have yet to be established. We therefore suggest gathering this scattered research information by identifying its authors by means of authority data already associated with them. We introduce author pages as a proof-of-concept application that collects research information not only from a local source such as an institutional repository, but also from other external bibliographic sources.
Atif Latif, Timo Borst, Klaus Tochtermann

Interactive Text Analysis and Information Extraction

Much of the work done in the text mining field concerns the extraction of useful information from the full text of publications. Such information may be links to projects, acknowledgements to communities, citations to software entities or datasets, and more. Each category of entities, according to its special characteristics, requires a different approach. Thus it is not possible to build a generic mining platform that could text mine various publications to extract such information. Most of the time, a field expert is needed to supervise the mining procedure, decide the mining rules with the developer, and finally validate the results. This is an iterative procedure that requires a lot of communication among the experts and the developers, and thus is very time-consuming. In this paper, we present an interactive mining platform. Its purpose is to allow the experts to define the mining procedure, set and update the rules, and validate the results, while the actual text mining code is produced automatically. This significantly reduces the communication needed between the developers and the experts and moreover allows the experts to experiment themselves using a user-friendly graphical interface.
Tasos Giannakopoulos, Yannis Foufoulas, Harry Dimitropoulos, Natalia Manola

