This book constitutes the thoroughly refereed proceedings of the 8th Italian Research Conference on Digital Libraries, held in Bari, Italy, in February 2012. The 22 full papers, included together with 4 panel papers, were selected from extended versions of the presentations given at the conference, following an additional round of reviewing and revision after the event. The topics covered are as follows: legacy documents and cultural heritage; systems interoperability and data integration; formal and methodological foundations of digital libraries; semantic web and linked data for digital libraries; multilingual information access; digital library infrastructures; metadata creation and management; search engines for digital library systems; evaluation and log data; handling audio/visual and non-traditional objects; user interfaces and visualization; digital library quality; policies and copyright issues in digital libraries; scientific data curation, citation and scholarly publication; user behavior and modeling; and preservation and curation.




Experiences and Perspectives in Management for Digital Preservation of Cultural Heritage Resources (Panel)

This paper reports on the panel objectives, on the topics addressed during the panel, and on the following discussion.
A relevant conclusion that emerges from the panel is the need to discuss and define a shared Digital Agenda for Italy.
Maristella Agosti

Where Do Humanities Computing and Digital Libraries Meet?

It is in libraries that humanists have always found their basic and essential instrumentation. Libraries can be described as the humanist’s lab. Obviously this applies also to digital humanists, who deal with digital objects for research purposes, and to digital libraries that store collections in digital form. But digital objects produced for research purposes are not just inactive artefacts and ‘digital library objects are more than collections of bits,’ for ‘the content of even the most basic digital object has some structure’ and to enable access and transactions additional information or ‘metadata’ is required. [1] So ‘if, unlike print,’ digital editions ‘are also open-ended and collaborative work-sites rather than static closed electronic objects’ (p. 77), [2] it can be legitimately asked how a digital repository for objects of this kind can enable effective access to the interactive functionalities they provide. In a digital research context, the issue of how the architecture of a digital library could meet the needs of the working practices increasingly adopted by digital humanists seems therefore of primary importance.
Dino Buzzetti

The ArchiMEDE Project for an Electronically Digitized Archive of Historical Monographs

First of all I would like to thank the organizers for having invited me to this round table discussion, and in particular my colleagues Floriana Esposito, whom I studied with at university, and Maristella Agosti, whom I have had the pleasure of meeting for the first time on this occasion.
Onofrio Erriquez

Considerations on the Preservation of Base Digital Data of Cultural Resources

This paper does not aim to thoroughly list and discuss issues which are well-known in research circles working towards defining the correct strategies for the preservation of cultural heritage digital resources, nor is it an attempt to suggest possible solutions to the current problems, as this is a burdensome task which has been tackled by individuals much more professionally and theoretically qualified than myself.
Nicola Barbuti


Supporting Tabular Data Characterization in a Large Scale Data Infrastructure by Lexical Matching Techniques

Digital Libraries continue to evolve towards research environments supporting access and management of multiform Information Objects spread across multiple data sources and organizational domains. This evolution has introduced the need to deal with Information Objects having traits different from those characterizing Digital Libraries at their early stages and to revise the services supporting their management. Tabular data represent a class of Information Objects that need to be managed efficiently because of their core role in many eScience scenarios. This paper discusses the tabular data characterization problem, i.e., the problem of identifying the reference dataset of any column of the dataset. In particular, the paper presents an approach based on lexical matching techniques to support users during the data curation phase by providing them with a ranked list of reference datasets suitable for a dataset column.
Leonardo Candela, Gianpaolo Coro, Pasquale Pagano
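
The ranking idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' method: the scoring rule (the share of distinct column values found, after case and whitespace normalization, in each reference dataset) and the names `normalize` and `rank_reference_datasets` are assumptions made for illustration only.

```python
def normalize(value):
    """Lower-case a value and collapse runs of whitespace (assumed normalization)."""
    return " ".join(str(value).lower().split())


def rank_reference_datasets(column_values, reference_datasets):
    """Rank reference datasets by the share of distinct column values they contain.

    column_values: iterable of raw cell values from one column.
    reference_datasets: dict mapping a dataset name to an iterable of entries.
    Returns a list of (name, score) pairs, best match first.
    """
    column = {normalize(v) for v in column_values if str(v).strip()}
    scores = []
    for name, entries in reference_datasets.items():
        reference = {normalize(e) for e in entries}
        matched = len(column & reference)
        score = matched / len(column) if column else 0.0
        scores.append((name, score))
    # Highest score first; ties are broken alphabetically by dataset name.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


column = ["Thunnus albacares", "thunnus  obesus", "Gadus morhua"]
references = {
    "species": ["Thunnus albacares", "Thunnus obesus", "Gadus morhua"],
    "countries": ["Italy", "France"],
}
ranking = rank_reference_datasets(column, references)
```

Exact matching after normalization is the simplest possible lexical matcher; a realistic system would probably also tolerate spelling variants, which this sketch does not attempt.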

Data Interoperability and Curation: The European Film Gateway Experience

Film archives, containing collections of cinema-related digital material, have been created in many European countries. Today, the EC Best Practice Network Project EFG (European Film Gateway) provides a single access point to 59 collections from 19 archives across 14 European countries, for a total of 640,000 digital objects. This paper illustrates challenges and solutions in the realization of the EFG data infrastructure. These mainly concerned the curation and interoperability issues arising from the need to aggregate metadata from heterogeneous archives (different data models, hence metadata schemas, and exchange formats). EFG designed a common data model for movie information, onto which the archives’ data models can be optimally mapped. It realizes a data infrastructure based on the D-NET software toolkit, capable of dealing with data collection, mapping, cleaning, indexing, and access provision through web portals or standard access protocols. To achieve its objectives, EFG has extended D-NET with advanced tools for data curation.
Michele Artini, Alessia Bardi, Federico Biagini, Franca Debole, Sandro La Bruzzo, Paolo Manghi, Marko Mikulicic, Pasquale Savino, Franco Zoppi

Annotating Digital Libraries and Electronic Editions in a Collaborative and Semantic Perspective

The distinction between digital libraries and electronic editions is becoming more and more subtle. The practice of annotation represents a point of convergence of two worlds that are only apparently separate. The aim of this paper is to present a model of collaborative semantic annotation of texts (SemLib project), suggesting a system that finds in Semantic Web and Linked Data the technologies for enabling structured semantic annotation, also in the field of electronic editions in the Digital Humanities domain. The main purpose of SemLib is to develop an application that makes it easy for developers to integrate annotation software into digital libraries, which differ both in technical implementation and in managed content, and that provides users, regardless of their cultural background, with a simple system that can be used as a front-end. To this end, we present a final example of semantic annotation in a specific context: a digital edition of a literary text and the issues that an annotation task involves.
Michele Barbera, Federico Meschini, Christian Morbidoni, Francesca Tomasi

Empowering Archives through Annotations

The paper presents an integration and visualization service to enhance the use of annotations and to empower the role of the user and research community in the archival context. We show how this service allows us to address the interoperability between diverse digital archive and annotation systems. Furthermore, it promotes the use of annotations to enhance the user experience and to exploit the archivists’ expertise in both the description and consultation phases.
Nicola Ferro, Gianmaria Silvello

Metadata Inference for Description Authoring in a Document Composition Environment

In this paper, we propose a simple model for metadata management in a document composition environment. Our model considers (1) composite documents in the form of trees, whose nodes are either atomic documents, or other composite documents, and (2) metadata or descriptions of documents in the form of sets of terms taken from a taxonomy. We present a formal definition of our model and several concepts of inferred descriptions. Inferred descriptions can be used for term suggestion that allows users to easily define and manage document descriptions by taking into account what we call soundness of descriptions.
Tsuyoshi Sugibuchi, Ly Anh Tuan, Nicolas Spyratos

A Multi-layer Digital Library for Mediaeval Legal Manuscripts

This paper presents the results of the MOSAICO project, an Italian Government research project (2008–12) funded by the Italian Ministry of Education and Research, and carried out by an academic consortium. The goal of the MOSAICO project is to create a thematic and specialized digital library, relying on the Web 2.0 and the P5 TEI XML standard to manage heterogeneous descriptions of medieval codex images. The portal is designed for scholars of medieval legal history and emphasizes the intellectual path of the academic experts.
Monica Palmirani, Luca Cervone

Extracting Keyphrases from Web Pages

Social tagging systems allow people to classify Web resources by using a set of freely chosen terms commonly called tags. However, by shifting the classification task from a set of experts to a larger and untrained set of people, the results of the classification become less accurate. The lack of control and guidelines generates noisy tags (i.e. tags without clear semantics), which lower the precision of user-generated classifications. To address this limitation, several tools have been proposed in the literature for suggesting to users tags which properly describe a given resource. In contrast, we propose to suggest n-grams (named keyphrases), following the idea that sequences of two or three terms can better resolve potential ambiguities. More specifically, in this work we identify a set of features which characterize n-grams suitable for describing meaningful aspects reported in Web pages. By means of these features, we developed a mechanism which can support people when classifying Web pages by automatically suggesting meaningful keyphrases.
Felice Ferrara, Carlo Tasso
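
As a rough illustration of suggesting two- and three-term sequences, the sketch below collects candidate n-grams from a text and ranks them by raw frequency. The abstract does not list the features the authors use, so the frequency ranking, the stopword filter, the tiny `STOPWORDS` set and the function names are all assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "for", "is", "are"}


def candidate_keyphrases(text, sizes=(2, 3)):
    """Count n-grams of the given sizes that neither start nor end with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts


def suggest(text, top=5):
    """Return the most frequent candidate n-grams, ties broken alphabetically."""
    counts = candidate_keyphrases(text)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top]]


page = "Digital library systems store documents. A digital library needs search."
suggestions = suggest(page, top=1)
```

Frequency alone is a weak signal; it stands in here for whatever richer feature set a real keyphrase extractor would combine.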

Learning to Recognize Critical Cells in Document Tables

Tables are among the most informative components of documents, because they are exploited to compactly and intuitively represent data, typically for understandability purposes. Two needs arise: identifying and extracting tables from documents, and extracting the data they contain. The latter task involves understanding the table structure. Due to the variability in style, size, and aims of tables, algorithmic approaches to this task can be insufficient, and the exploitation of machine learning systems may represent an effective solution. This paper proposes the exploitation of a first-order logic representation that is able to capture the complex spatial relationships involved in a table structure, and of a learning system that can mix the power of this representation with the flexibility of statistical approaches. The encouraging results obtained suggest further investigation and refinement of the proposal.
Nicola Di Mauro, Stefano Ferilli, Floriana Esposito

Document Image Understanding through Iterative Transductive Learning

In Document Image Understanding, one of the fundamental tasks is that of recognizing semantically relevant components in the layout extracted from a document image. This process can be automated by learning classifiers able to automatically label such components. However, the learning process assumes the availability of a huge set of documents whose layout components have been previously manually labeled. This contrasts with the more common situation in which we have only a few labeled documents and an abundance of unlabeled ones. In addition, labeling layout documents introduces further complexity due to the multi-modal nature of the components (textual and spatial information may coexist). In this work, we investigate the application of a relational classifier that works in the transductive setting. The relational setting is justified by the multi-modal nature of the data we are dealing with, while transduction is justified by the possibility of exploiting the large amount of information conveyed in the unlabeled layout components. The classifier bootstraps the labeling process in an iterative way: reliable classifications are used in subsequent iterative steps as training examples. The proposed computational solution has been evaluated on document images of scientific literature.
Michelangelo Ceci, Corrado Loglisci, Lucrezia Macchia, Donato Malerba, Luciano Quercia
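
The iterative bootstrapping described in the abstract above (reliable classifications become training examples for the next round) can be sketched as a generic self-training loop. This is only an illustration: it swaps the paper's relational classifier for a plain nearest-centroid classifier over numeric feature vectors, and the reliability test (a distance gap of at least `margin`) and all names are assumptions.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Iteratively label unlabeled vectors, keeping only confident predictions.

    labeled: list of (features, label) pairs; unlabeled: list of feature vectors.
    Returns (all labeled pairs, vectors that never became confident).
    """
    labeled = list(labeled)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        by_label = {}
        for features, label in labeled:
            by_label.setdefault(label, []).append(features)
        centroids = {label: centroid(pts) for label, pts in by_label.items()}
        confident, remaining = [], []
        for features in pending:
            ranked = sorted(
                (math.dist(features, c), label) for label, c in centroids.items()
            )
            best_dist, best_label = ranked[0]
            # A prediction is "reliable" when the runner-up is clearly farther away.
            gap = ranked[1][0] - best_dist if len(ranked) > 1 else float("inf")
            if gap >= margin:
                confident.append((features, best_label))
            else:
                remaining.append(features)
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pending = remaining
    return labeled, pending


seed = [((0.0, 0.0), "title"), ((10.0, 10.0), "body")]
pool = [(1.0, 1.0), (9.0, 9.0), (5.0, 5.0)]
result, undecided = self_train(seed, pool)
```

In this toy run the point halfway between the two classes never passes the reliability test, so it stays unlabeled rather than being forced into a class.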

A Domain Based Approach to Information Retrieval in Digital Libraries

The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this aim, improving retrieval performance must necessarily go beyond a simple lexical interpretation of user queries, and pass through an understanding of their semantic content and aims. Any digital library would benefit enormously from the availability of effective Information Retrieval techniques to offer its users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest that the approach is worth extending and improving.
Fulvio Rotella, Stefano Ferilli, Fabio Leuzzi

Uncertain (Multi)Graphs for Personalization Services in Digital Libraries

Digital Libraries organize collections of multimedia objects in a computer-processable form. They also comprise services and infrastructures to manage, store, retrieve and share objects. Among these services, personalization services represent an active and broad area of digital library research. A popular way to realize personalization is by using information filtering techniques aiming to remove redundant or unwanted information from data. In this paper we propose to use a probabilistic framework based on uncertain graphs in order to deal with information filtering problems. Users, items and their relationships are encoded in a probabilistic graph that can be used to infer the probability of existence of a link between entities involved in the graph. The goal of the paper is to extend the definition of uncertain graphs to multigraphs and to study whether uncertain graphs could be used as a valuable tool for information filtering problems. The performance of the proposed probabilistic framework is reported when applied to a real-world domain.
Claudio Taranto, Nicola Di Mauro, Floriana Esposito
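
To make the notion of an uncertain multigraph concrete, the sketch below treats every edge as present independently with its own probability and computes, by exhaustive enumeration of possible worlds, the probability that two nodes are connected. The independence assumption, the use of connectivity as the inferred quantity, and the function names are illustrative assumptions, not the framework defined in the paper; enumeration is only feasible for very small graphs.

```python
from itertools import product


def reachable(edges, source, target):
    """Undirected reachability over a list of (u, v) edges."""
    frontier, seen = [source], {source}
    while frontier:
        node = frontier.pop()
        if node == target:
            return True
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                if a == node and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return False


def link_probability(edges, source, target):
    """Probability that source and target are connected.

    edges: list of (u, v, p) triples; parallel edges are allowed (multigraph)
    and each edge exists independently with probability p.
    """
    total = 0.0
    for world in product([False, True], repeat=len(edges)):
        prob = 1.0
        present = []
        for keep, (u, v, p) in zip(world, edges):
            prob *= p if keep else 1.0 - p
            if keep:
                present.append((u, v))
        if reachable(present, source, target):
            total += prob
    return total


# Two parallel user-item edges, e.g. two different kinds of relationship.
parallel = [("user", "item", 0.5), ("user", "item", 0.5)]
p_link = link_probability(parallel, "user", "item")
```

With two parallel edges of probability 0.5 each, the pair is disconnected only when both are absent, which is why the multigraph view yields a higher connection probability than either edge alone.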

Improving Online Access to Archival Data

Archives are memory institutions whose original mission was to preserve and provide access to a set of carefully selected, arranged and described documents to a small number of scholars interested in their contents. For those specialists, the usual way to find information in an archive is by way of “finding aids”, i.e. descriptions of the archive contents that reflect the hierarchical structure by which data are physically arranged in an archive. With the increased availability of archival holdings on the Web, archives are now widening their range of users, and the use of online finding aids has proved to be too complicated for non-specialists. This is mostly due to the hierarchical nature of the description, usually represented online with a standard called EAD (Encoded Archival Description). This paper is the synopsis of a Master’s thesis in which a methodology has been developed to represent the information contained in finding aids with a different standard, namely EDM (Europeana Data Model), which is used by the Europeana digital library and is becoming the de facto standard for metadata interoperability. EDM allows a much more intuitive representation of the archive content and the possibility to access data from many different access points.
Vittore Casarosa, Carlo Meghini, Stanislava Gardasevic

Quick and Easy Implementation of Approximate Similarity Search with Lucene

Similarity search has proved to be an effective way of retrieving multimedia content. However, as the amount of available multimedia data increases, the cost of developing a robust and scalable system with content-based image retrieval facilities from scratch becomes prohibitive.
In this paper, we propose to exploit an approach that allows us to convert low level features into a textual form. In this way, we are able to easily set up a retrieval system on top of the Lucene search engine library that combines full-text search with approximate similarity search capabilities.
Giuseppe Amato, Paolo Bolettieri, Claudio Gennaro, Fausto Rabitti
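
The abstract above does not say how low-level features are turned into text. One way to do it, shown here purely as an assumed illustration, is to describe each object by its nearest reference points ("pivots") and to repeat a pivot's name more often the closer it is, so that a term-frequency based text engine scores objects with similar pivot rankings as similar. The function name, the pivot names and the repetition rule are hypothetical, and no Lucene API is used.

```python
import math


def surrogate_text(features, pivots, k=3):
    """Encode a feature vector as a string of repeated pivot names.

    features: numeric vector; pivots: dict mapping a pivot name to a vector.
    The nearest pivot is repeated k times, the next k - 1 times, and so on.
    """
    ranked = sorted(
        pivots.items(),
        key=lambda item: (math.dist(features, item[1]), item[0]),
    )
    words = []
    for rank, (name, _) in enumerate(ranked[:k]):
        words.extend([name] * (k - rank))
    return " ".join(words)


pivots = {"p1": (0, 0), "p2": (10, 0), "p3": (0, 10), "p4": (10, 10)}
text = surrogate_text((1, 2), pivots, k=2)
```

The resulting string can be stored in an ordinary text field next to real textual metadata, which is what would let a single full-text query combine keyword matching with approximate similarity.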

Establishing a Digital Library in Wide-Ranging University’s Context: The Sapienza Digital Library Experience

The Sapienza Digital Library (SDL) is a research project undertaken by Sapienza Università di Roma, Europe’s largest campus, and the Italian supercomputer center Cineca.
The SDL project aims to build an infrastructure supporting the preservation, management and dissemination of the past, present and future digital resources that contain the overall intellectual production of Sapienza University. The solution adopted tries to find a tradeoff between the standardization of digital processes and products (which allows cost-effective, centralized and shared management and curation) and the preservation of the peculiarities of scientific materials belonging to disparate disciplines (which need to be digitally available for future initiatives more specifically tailored to the designated communities).
Angela Di Iorio, Marco Schaerf, Matteo Bertazzo

Digital Curators’ Education: Professional Identity vs. Convergence of LAM (Libraries, Archives, Museums)

Digital curation education is a new subject where the convergence between libraries, archives, museums and computer science seems to build an interdisciplinary bridge, with common competences needed by present and future professionals. The study methodology is based on a literature review, on the proceedings of the Puerto Rico Conference organised by IFLA on “Education for Digital Curation”, and on the findings of a Delphi study carried out for a thesis of the International Master DILL. Issues and problematic areas for further study and discussion are highlighted.
Anna Maria Tammaro, Melody Madrid, Vittore Casarosa

A Contribution for the Dissemination of Cultural Heritage Content to a Wider Public

Digital resources are becoming an important tool for research in all the domains related to cultural heritage. Scholars have special requirements that need to be met when developing digital library and digital archive systems that are to be used as tools to carry out scientific research. After having designed and developed a digital library application called IPSA as a system for researchers in illuminated manuscripts, we investigated how the digital library can be evaluated by non-domain users. Our goal was to highlight the overlaps and the differences in user requirements between specialists, who use the digital archive to fulfill their research goals, and non-domain users, who interact with the digital library system because of a general interest in its content. The results have been used to re-engineer the digital library system and extend the functions of the digital library application in order to open up its use to non-specialists as well.
Maristella Agosti, Lucio Benfante, Nicola Orio

Engaging the User: Elaboration and Execution of Trials with a Database of Illuminated Images

Currently, one of the most important challenges for curators and providers of digital cultural heritage is to increase and enhance the engagement of users and communities with digital humanities collections. The reflections and efforts made to open up the IPSA database to new user categories are part of an ongoing process that can offer useful suggestions and contributions to this field of investigation. The considerations taken into account in elaborating the IPSA database trials engaging non-domain users are presented, and the design of the trials is described.
Chiara Ponchia

Modeling Archives by Means of OAI-ORE

Currently, archival practice is moving towards the definition of complex relationships between the resources of interest as well as the constitution of compound digital objects. To this end archives can take advantage of using the Open Archives Initiative - Object Reuse and Exchange (OAI-ORE) providing additional and flexible visualizations of archival resources.
In this paper we define a formal basis that provides a means for defining OAI-ORE instances which are consistent with the fundamental archival principles.
Nicola Ferro, Gianmaria Silvello

Reflecting on the Europeana Data Model

We describe some issues arising while using Europeana, and analyze some features of the Europeana Data Model (EDM), starting from the rationale of the project. Some aspects of the theoretical model, derived mostly from the mapping between the provided Cultural Heritage Object (CHO) and the EDM, prevent users’ queries from returning useful results. The concept of media type, the multi-layer description and the relation between roles and values are some of the issues on which we reflect. The aim of Europeana to make records available as Linked Open Data on the Web could moreover require a redefinition of the implementation techniques.
Silvio Peroni, Francesca Tomasi, Fabio Vitali

The Europeana Linked Open Data Pilot Server

Linked Data is a set of principles and technologies providing a publishing paradigm for sharing and reusing RDF data on the Web. The Linked Data Cloud has been expanding rapidly since 2007, when the Linked Data Project was launched. Europeana, the European Digital Library, subscribes to the view of a web of data, and the distribution of cultural heritage data is one of the main objectives established by the Europeana Strategic Plan. The paper illustrates how Europeana publishes Linked Data, with a focus on the technological approach adopted.
Nicola Aloia, Cesare Concordia, Carlo Meghini

Managing Authenticity through the Digital Resource Lifecycle

On the basis of principles and methodologies developed by the major projects on digital preservation, the paper addresses the fundamental problem of authenticity management, and specifically of defining appropriate mechanisms and tools to transform the presumption of authenticity into the capacity to verify it. The approach we propose is to concentrate on the digital resource lifecycle, since, in order to make a proper assessment, one must be able to trace back all the transformations that the digital resource has undergone since its creation and that may have affected its authenticity. For these transformations one needs to collect and preserve the appropriate evidence that would allow the assessment to be made at a later time. We have therefore developed a model of the digital resource lifecycle in order to identify the main events that impact on authenticity and to define precise operational guidelines specifying which evidence should be collected and how to organize it. A case study analysis is currently being performed to check the validity of the model and to see how it specializes to several specific environments. Preliminary results are already available and confirm that the model is sound and that the implementation of the guidelines can be worked out effectively and with a fairly reasonable amount of effort.
Maria Guercio, Silvio Salza

An Innovative Character Recognition for Ancient Book and Archival Materials: A Segmentation and Self-learning Based Approach

The paper illustrates the invention of a method and an apparatus able to recognize the text in a set of digital images referring to pages of ancient manuscripts or printed books. It includes the following macro steps: identifying and connecting in sequence regions containing words in a subset of the images; structuring a thesaurus of fonts used in those regions; performing the character recognition of one or more images belonging to the set, associating to this recognition a first value of efficiency. The prototype is patent pending (National Pat. Pend. n. BA2011A000038 – Intern. Pat. Pend. n. I116-PCT).
Nicola Barbuti, Tommaso Caldarola


