Uses, Users, and User Interaction

Users and Uses of Online Digital Libraries in France

This article presents a study of online digital library (DL) uses, based on three data sources (online questionnaire, Internet traffic data and interviews). We show that DL users differ from average Internet users as well as from classical library users, and that their practices involve particular contexts, among them personal research and bibliophilism. These results lead us to reconsider the status of online documents, as well as the relationship between commercial and non-commercial Web sites. Digital libraries, far from being simple digital versions of library holdings, are now attracting a new type of public, bringing about new, unique and original ways of reading and understanding texts. They represent a new arena for reading and consultation of works alongside that of traditional libraries.

Houssem Assadi, Thomas Beauvisage, Catherine Lupovici, Thierry Cloarec

In Search for Patterns of User Interaction for Digital Libraries

The paper provides preliminary results from a major study of the users of academic and research libraries in Slovakia. The study is part of a larger research project on the interaction of man and the information environment. The goal of the research is to identify patterns of interaction of individuals and groups with information resources, and to derive models and information styles. The methodological model for a questionnaire survey of users is described. The first results confirm the need to support user strategies, collaboration, different stages of information seeking and knowledge states, closer links with learning and problem solving, easy and flexible access, human creative processes of analysis, synthesis, interpretation, and the need to develop new knowledge organization structures.

Jela Steinerová

Detecting Research Trends in Digital Library Readership

The research interests and preferences of the reader communities associated with any given digital library may change over the course of years. It is vital for digital library services and collection management to be informed of such changes, and to determine how they may point to future trends. We propose the Impact Discrepancy Ratio metric for the detection of research trends in a large digital library by comparing a reader-defined metric of journal impact to the Institute for Scientific Information Impact Factor (ISI IF) over the course of three years. An analysis for the Los Alamos National Laboratory (LANL) Research Library (RL) comparing reader impact to the ISI IF for 1998 and 2001 indicates journals relating to climatology have undergone a sharp increase in local impact. This evolution pinpoints specific shifts in the local strategies and reader interests of the LANL RL which were qualitatively validated by LANL RL management.

Johan Bollen, Rick Luce, Somasekhar Vemulapalli, Weining Xu

Evaluating the Changes in Knowledge and Attitudes of Digital Library Users

Medical digital libraries are essentially life-critical applications providing timely access for professionals and the public to current medical knowledge and practice. This paper presents a new methodology for evaluating the impact of the knowledge within a medical digital library on users by testing their knowledge improvements and attitude changes. Using pre- and post-use questionnaires we tested the impact of a small medical information website acting as an interface to the National electronic Library for Communicable Disease. The changes in user attitudes and the correlation with knowledge improvements observed indicate the potential for this methodology to be applied as a general evaluation technique of digital libraries and the impact of online information on user learning.

Gemma Madle, Patty Kostkova, Jane Mani-Saada, Julius R. Weinberg

Metadata Applications

Towards a Role-Based Metadata Scheme for Educational Digital Libraries: A Case Study in Singapore

In this paper, we describe the development of an appropriate metadata scheme for GeogDL, a Web-based digital library application containing past-year examination resources for students taking a Singapore national examination in geography. The new metadata scheme was developed from established metadata schemes on education and e-learning. Initial evaluation showed that a role-based approach would be more viable, adapting to the different roles of teachers/educators and librarians contributing geography resources to GeogDL. The paper concludes with concrete implementation of the role-based metadata schema for GeogDL.

Dian Melati Md Ismail, Ming Yin, Yin-Leng Theng, Dion Hoe-Lian Goh, Ee-Peng Lim

Incorporating Educational Vocabulary in Learning Object Metadata Schemas

Educational metadata schemas are obligated to provide learning-related attributes in learning objects. The examination of current educational metadata standards found that few of them have places for incorporating educational vocabulary. Even within the educational category of metadata standards there is a lack of learning-related vocabulary for characterizing attributes that can help users identify the type of learning, objective, or context. The paper also discussed the problems with examples from a learning object taxonomy compiled by the authors.

Jian Qin, Carol Jean Godby

Findings from the Mellon Metadata Harvesting Initiative

Findings are reported from four projects initiated through funding by the Andrew W. Mellon Foundation in 2001 to explore applications of metadata harvesting using the OAI-PMH. Metadata inconsistencies among providers have been encountered and strategies for normalization have been studied. Additional findings concerning harvesting are format conflicts, harvesting problems, provider system development, and questions regarding the entire cycle of metadata production, dissemination, and use (termed metadata gardening, rather than harvesting).

Martin Halbert, Joanne Kaczmarek, Kat Hagedorn

Semantic Browsing

We have created software applications that allow users to both author and use Semantic Web metadata. To create and use a layer of semantic content on top of the existing Web, we have (1) implemented a user interface that expedites the task of attributing metadata to resources on the Web, and (2) augmented a Web browser to leverage this semantic metadata to provide relevant information and tasks to the user. This project provides a framework for annotating and reorganizing existing files, pages, and sites on the Web that is similar to Vannevar Bush’s original concepts of trail blazing and associative indexing.

Alexander Faaborg, Carl Lagoze

Metadata Editing by Schema

Metadata creation and editing is a reasonably well-understood task which involves creating forms, checking the input data and generating appropriate storage formats. XML has largely become the standard storage representation for metadata records and various automatic mechanisms are becoming popular for validation of these records, including XML Schema and Schematron. However, there is no standard methodology for creating data manipulation mechanisms. This work presents a set of guidelines and extensions to use the XML Schema standard for this purpose. The experiences and issues involved in building such a generalised structured data editor are discussed, to support the notion that metadata editing, and not just validation, should be description-driven.

Hussein Suleman

Annotation and Recommendation

Annotations: Enriching a Digital Library

This paper presents the results of a study on the semantics of the concept of annotation. It specifically deals with annotations in the context of digital libraries. In the light of those considerations, general characteristics and features of an annotation service are introduced. The OpenDLib digital library is adopted as a framework of reference for our ongoing research, so the paper presents the annotations extension to the OpenDLib digital library, where the extension regards both the adopted document model and the architecture. The final part of the paper discusses and evaluates if OpenDLib has the expressive power of representing the presented semantics of annotations.

Maristella Agosti, Nicola Ferro

Identifying Useful Passages in Documents Based on Annotation Patterns

Many readers annotate passages that are important to their work. If we understand the relationship between the types of marks on a passage and the passage’s ultimate utility in a task, then we can design e-book software to facilitate access to the most important annotated parts of the documents. To investigate this hypothesis and to guide software design, we have analyzed annotations collected during an earlier study of law students reading printed case law and writing Moot Court briefs. This study has allowed us to characterize the relationship between the students’ annotations and the citations they use in their final written briefs. We think of annotations that relate directly to the written brief as high-value annotations; these annotations have particular, detectable characteristics. Based on this study we have designed a mark parser that analyzes freeform digital ink to identify such high-value annotations.

Frank Shipman, Morgan Price, Catherine C. Marshall, Gene Golovchinsky

Others Also Use: A Robust Recommender System for Scientific Libraries

Scientific digital library systems are a very promising application area for value-added expert advice services. Such systems could significantly reduce the search and evaluation costs of information products for students and scientists. This holds for pure digital libraries as well as for traditional scientific libraries with online public access catalogs (OPAC). In this contribution we first outline different types of recommendation services for scientific libraries and their general integration strategies. Then we focus on a recommender system based on log file analysis that is fully operational within the legacy library system of the Universität Karlsruhe (TH) since June 2002. Its underlying mathematical model, the implementation within the OPAC, as well as the first user evaluation is presented.

Andreas Geyer-Schulz, Andreas Neumann, Anke Thede

Automatic Classification and Indexing

Cross-Lingual Text Categorization

This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both when a sufficient number of training examples is available for each new language and when no training examples are available for some language. Experimental results of the bi-lingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bi-lingual training, terminology translation and profile-based translation.

Nuria Bel, Cornelis H. A. Koster, Marta Villegas

Automatic Multi-label Subject Indexing in a Multilingual Environment

This paper presents an approach to automatically subject index full-text documents with multiple labels based on binary support vector machines (SVM). The aim was to test the applicability of SVMs with a real world dataset. We have also explored the feasibility of incorporating multilingual background knowledge, as represented in thesauri or ontologies, into our text document representation for indexing purposes. The test set for our evaluations has been compiled from an extensive document base maintained by the Food and Agriculture Organization (FAO) of the United Nations (UN). Empirical results show that SVMs are a good method for automatic multi-label classification of documents in multiple languages.

Boris Lauser, Andreas Hotho

Automatic Induction of Rules for Classification and Interpretation of Cultural Heritage Material

This work presents the application of incremental symbolic learning strategies for the automatic induction of classification and interpretation rules in the cultural heritage domain. Specifically, such experience was carried out in the environment of the EU project COLLATE, in whose architecture the incremental learning system INTHELEX is used as a learning component. Results are reported, proving that the system was able to learn highly reliable rules for such a complex task.

S. Ferilli, F. Esposito, T. M. A. Basile, N. Di Mauro

An Integrated Digital Library Server with OAI and Self-Organizing Capabilities

The Open Archives Initiative (OAI) is an experimental initiative for the interoperability of Digital Libraries (DLs) based on metadata harvesting. The goal of OAI is to develop and promote interoperability solutions to facilitate the efficient dissemination of content. At present, however, there are still several challenging issues such as metadata incorrectness, poor quality of metadata, and metadata inconsistency that have to be solved in order to create a variety of high-quality services. In this paper we propose an integrated DL system with OAI and self-organizing capabilities. The system provides two value-added services, cross-archive searching and interactive concept browsing services, for organizing, exploring, and searching a collection of harvested metadata to satisfy users’ information needs. We also propose a multi-layered Self-Organizing Map (SOM) algorithm for building a subject-specific concept hierarchy using two input vector sets constructed by indexing the harvested metadata collection. By using the concept hierarchy, we can also automatically classify the harvested metadata collection for the purpose of selective harvesting.

Hyunki Kim, Chee-Yoong Choo, Su-Shing Chen

Web Technologies

YAPI: Yet Another Path Index for XML Searching

As many metadata are encoded in XML, and many digital libraries need to manage XML documents, efficient techniques for searching in such formatted data are required. In order to efficiently process path expressions with wildcards on XML data, a new path index is proposed. Extensive evaluation confirms better performance with respect to other techniques proposed in the literature. An extension of the proposed technique to deal with the content of XML documents in addition to their structure is also discussed.

Giuseppe Amato, Franca Debole, Pavel Zezula, Fausto Rabitti

Structure-Aware Query for Digital Libraries: Use Cases and Challenges for the Humanities

Much recent research in database design focuses on persistence models for semistructured data similar to the SGML and XML that humanities digital libraries have long used to encode digital editions of texts. Structure-aware querying promises to simplify the design of such digital repositories by allowing them to store and query texts using a single, unified information model. Using content the Perseus Project has acquired over the past ten years as a test case, we describe the advantages and delimit the problems in managing structure-aware queries over multiple or ambiguous schemas, evaluate the place of markup in digital libraries where much content is automatically generated, and examine the uses for structure-aware query in a system that stores both semistructured content and graph-structured metadata.

Christopher York, Clifford Wulfman, Greg Crane

Combining DAML+OIL, XSLT and Probabilistic Logics for Uncertain Schema Mappings in MIND

When distributed, heterogeneous digital libraries have to be integrated, one of the crucial tasks is to map between different schemas. As schemas may have different granularities, and as schema attributes do not always match precisely, a general-purpose schema mapping approach requires support for uncertain mappings. In this paper we present one of the very few approaches for defining and using uncertain schema mappings. We combine different technologies like DAML+OIL, probabilistic Datalog (since DAML+OIL—as similar ontology languages—lacks rules) and XSLT for actually transforming queries and documents. This declarative approach is fully implemented in the project MIND (which develops methods for retrieval in networked multimedia digital libraries). However, as DAML+OIL lacks some important features, the proposed approach is only a stepping stone for an integrated solution.

Henrik Nottelmann, Norbert Fuhr

Digitometric Services for Open Archives Environments

We describe “digitometric” services and tools that add value to open-access eprint archives using the Open Archives Initiative (OAI) Protocol for Metadata Harvesting. Celestial is an OAI cache and gateway tool. Citebase Search enhances OAI-harvested metadata with linked references harvested from the full-text to provide a web service for citation navigation and research impact analysis. Digitometrics builds on data harvested using OAI to provide advanced visualisation and hypertext navigation for the research community. Together these services provide a modular, distributed architecture for building a “semantic web” for the research literature.

Tim Brody, Simon Kampa, Stevan Harnad, Les Carr, Steve Hitchcock

Topical Crawling, Subject Gateways

Search Engine-Crawler Symbiosis: Adapting to Community Interests

Web crawlers have been used for nearly a decade as a search engine component to create and update large collections of documents. Typically the crawler and the rest of the search engine are not closely integrated. If the purpose of a search engine is to have as large a collection as possible to serve the general Web community, a close integration may not be necessary. However, if the search engine caters to a specific community with shared focused interests, it can take advantage of such an integration. In this paper we investigate a tightly coupled system in which the crawler and the search engine engage in a symbiotic relationship. The crawler feeds the search engine and the search engine in turn helps the crawler to better its performance. We show that the symbiosis can help the system learn about a community’s interests and serve such a community with better focus.

Gautam Pant, Shannon Bradshaw, Filippo Menczer

Topical Crawling for Business Intelligence

The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage on the basic intelligence to facilitate in-depth and up-to-date research. In this paper we investigate the use of topical crawlers in creating a small document collection that helps locate relevant business entities. The problem of locating business entities is encountered when an organization looks for competitors, partners or acquisitions. We formalize the problem, create a test bed, introduce metrics to measure the performance of crawlers, and compare the results of four different crawlers. Our results underscore the importance of identifying good hubs and exploiting link contexts based on tag trees for accelerating the crawl and improving the overall results.

Gautam Pant, Filippo Menczer

SozioNet: Networking Social Science Resources

SozioNet forms part of a forthcoming national social science information portal, which is currently being developed by the German Infoconnex initiative. Inspired by successful examples like MathNet or SOSIG, SozioNet provides access to freely available web resources with relevance to social science. It is based on a network of social science institutions and scientists, to agree on and establish common metadata standards. SozioNet implements a general infrastructure for the creation of semantically rich metadata, and for the harvesting and retrieval of relevant resources with a domain specific focus.

Wolfgang Meier, Natascha Schumann, Sue Heise, Rudi Schmiede

VASCODA: A German Scientific Portal for Cross-Searching Distributed Digital Resource Collections

The German information science community – with the support of the two main funding agencies in Germany – will develop a scientific portal, vascoda, for cross-searching distributed metadata collections. In platitudinous words, one of the services of vascoda is going to be a “Google”-like search for the academic community, an easy to use, yet sophisticated search-engine to supply information on high-quality resources from different media and technical environments. Reaching this objective requires considerable standardisation activity amongst the main players to harmonise the already existing services (e.g. regarding metadata, protocols, etc.). The co-operation amongst the participants including both of the funding agencies is creating a unique team-work situation in Germany thus strengthening the information science community.

Heike Neuroth, Tamara Pianos

Architectures and Systems

Scenario-Based Generation of Digital Library Services

We describe the development, implementation, and deployment of a new generic digital library generator yielding implementations of digital library services from models of DL “societies” and “scenarios”. The distinct aspects of our solution are: 1) approach based on a formal, theoretical framework; 2) use of state-of-the-art database and software engineering techniques such as domain-specific declarative languages, scenario synthesis, componentized and model driven architectures; 3) analysis centered on scenario-based design and DL societal relationships; 4) automatic transformations and mappings from scenarios to workflow designs and from these to Java implementations; 5) special attention paid to issues of simplicity of implementation, modularity, reusability, and extensibility. We demonstrate the feasibility of the approach through a number of examples.

Rohit Kelapure, Marcos André Gonçalves, Edward A. Fox

An Evaluation of Document Prefetching in a Distributed Digital Library

Latency is a fundamental problem for all distributed systems including digital libraries. To reduce user-perceived delays both caching – keeping accessed objects for future use – and prefetching – transferring objects ahead of access time – can be used. In a previous paper we have reported that caching is not worthwhile for digital libraries due to low re-access frequencies. In this paper we evaluate our previous findings that prefetching can be used instead. To do this we have set up an experimental prefetching proxy which is able to retrieve documents from remote fulltext archives before the user demands them. Using a simple prediction to keep the overhead of unnecessarily transferred data limited, we find that it is possible to cut the user-perceived average delay by a factor of two.

Jochen Hollmann, Anders Ardö, Per Stenström

An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment

The lack of information integration by existing online systems for resource sharing in a distributed environment directly affects the development and use of dynamically defined Virtual Union Catalogues. In this work we propose a design approach for the construction of an online system able to improve information integration when a Dynamic Resource Collection is used, by taking into account the restrictions imposed by the network environment and the Z39.50 protocol. The main strength of this architecture is the presentation of de-duplicated results to the user, by the gradual application of the duplicate detection process to small received packets (sets of results), as the data packets flow from the participating servers. While it presents results to the user, it also processes a limited amount of data ahead of time, so that they are ready before the user requests them.

Michalis Sfakakis, Sarantos Kapidakis
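
The gradual duplicate detection described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields (`title`, `year`), the matching key in `normalize_key`, and the function names are all hypothetical, and a real Z39.50 client would receive packets from several servers concurrently rather than from a simple iterable.

```python
from typing import Iterable, Iterator


def normalize_key(record: dict) -> tuple:
    """Hypothetical matching key: case-folded, whitespace-collapsed title plus year."""
    title = " ".join(str(record.get("title", "")).casefold().split())
    return (title, str(record.get("year", "")))


def deduplicate_stream(packets: Iterable[list[dict]]) -> Iterator[list[dict]]:
    """Yield each incoming packet with already-seen records removed.

    Duplicate detection is applied packet by packet, so results can be
    shown to the user before all servers have finished responding.
    """
    seen: set[tuple] = set()
    for packet in packets:
        fresh = []
        for record in packet:
            key = normalize_key(record)
            if key not in seen:
                seen.add(key)
                fresh.append(record)
        yield fresh


# Two small packets, as if they arrived from two different servers.
packets = [
    [{"title": "Moby Dick", "year": 1851}, {"title": "moby  dick", "year": 1851}],
    [{"title": "Moby Dick", "year": 1851}, {"title": "Walden", "year": 1854}],
]
result = list(deduplicate_stream(packets))
```

Because the function is a generator, each de-duplicated packet is available as soon as it has been processed, which mirrors the idea of presenting results while later packets are still in transit.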

Knowledge Organization: Concepts

The ADEPT Concept-Based Digital Learning Environment

We describe the design and application of a Digital Learning Environment (DLE) that is integrated with the collections and services of the Alexandria Digital Library (ADL). This DLE is in operational use in undergraduate teaching environments. Its design and development incorporate the assumption that deep understanding of both scientific phenomena and scientific methods is facilitated when learning materials are explicitly organized, accessed and presented at the level of granularity of appropriate sets of scientific concepts and their interrelationships. The DLE supports services for creating, searching, and displaying: (1) knowledge bases (KBs) of strongly structured models of scientific concepts; (2) DL collections of information objects organized and accessible by the concepts of the KBs; and (3) collections of presentation materials, such as lectures and laboratory materials, that are organized as trajectories through the KB of concepts.

T. R. Smith, D. Ancona, O. Buchel, M. Freeston, W. Heller, R. Nottrott, T. Tierney, A. Ushakov

A User Evaluation of Hierarchical Phrase Browsing

Phrase browsing interfaces based on hierarchies of phrases extracted automatically from document collections offer a useful compromise between automatic full-text searching and manually-created subject indexes. The literature contains descriptions of such systems that many find compelling and persuasive. However, evaluation studies have either been anecdotal, or focused on objective measures of the quality of automatically-extracted index terms, or restricted to questions of computational efficiency and feasibility. This paper reports on an empirical, controlled user study that compares hierarchical phrase browsing with full-text searching over a range of information seeking tasks. Users found the results located via phrase browsing to be relevant and useful but preferred keyword searching for certain types of queries. Users’ experiences were marred by interface details, including inconsistencies between the phrase browser and the surrounding digital library interface.

Katrina D. Edgar, David M. Nichols, Gordon W. Paynter, Kirsten Thomson, Ian H. Witten

Visual Semantic Modeling of Digital Libraries

The current interest from non-experts who wish to build digital libraries (DLs) is strong worldwide. However, since DLs are complex systems, it usually takes considerable time and effort to create and tailor a DL to satisfy specific needs and requirements of target communities/societies. What is needed is a simplified modeling process and rapid generation of DLs. To enable this, DLs can be modeled with descriptive domain-specific languages. A visual tool would be helpful to non-experts so they may model a DL without knowing the theoretical foundations and the syntactic details of the descriptive language. In this paper, we present a domain-specific visual DL modeling tool, 5SGraph. It employs a metamodel that describes DLs using the 5S theory. The output from 5SGraph is a DL model that is an instance of the metamodel, expressed in the 5S description language. Furthermore, 5SGraph maintains semantic constraints specified by the 5S metamodel and enforces these constraints over the instance model to ensure semantic consistency and correctness. 5SGraph enables component reuse to reduce the time and effort of designers. 5SGraph also is designed to accommodate and integrate several other complementary tools reflecting the interdisciplinary nature of DLs. Thus, tools based on concept maps to fulfill those roles are introduced. The 5SGraph tool has been tested with real users and several modeling tasks in a usability experiment, and its usefulness and learnability have been demonstrated.

Qinwei Zhu, Marcos André Gonçalves, Rao Shen, Lillian Cassell, Edward A. Fox

Collection Building and Management

Connecting Interface Metaphors to Support Creation of Path-Based Collections

Walden’s Paths is a suite of tools that supports the creation and presentation of linear hypermedia paths—targeted collections that enable authors to reorganize and contextualize Web-based information for presentation to an audience. Its current tools focus primarily on authoring and presenting paths, but not on the discovery and vetting of the materials that are included in the path. CollageMachine, on the other hand, focuses strongly on the exploration of Web spaces at the granularity of their media elements through presentation as a streaming collage, modified temporally through learning from user behavior. In this paper we present an initial investigation of the differences in expectations, assumptions, and work practices caused by the differing metaphors of browser based and CollageMachine Web search result representations, and how they affect the process of creating paths.

Unmil P. Karadkar, Andruid Kerne, Richard Furuta, Luis Francisco-Revilla, Frank Shipman, Jin Wang

Managing Change in a Digital Library System with Many Interface Languages

Managing the organizational and software complexity of a comprehensive open source digital library system presents a significant challenge. The challenge becomes even more imposing when the interface is available in different languages, for enhancements to the software and changes to the interface must be faithfully reflected in each language version. This paper describes the solution adopted by Greenstone, a multilingual digital library system distributed by UNESCO in a trilingual European version (English, French, Spanish), complete with all documentation, and whose interface is available in many further languages. Greenstone incorporates a language translation facility which allows authorized people to update the interface in specified languages. A standard version control system is used to manage software change, and from this the system automatically determines which language fragments need updating and presents them to the human translator.

David Bainbridge, Katrina D. Edgar, John R. McPherson, Ian H. Witten

A Service for Supporting Virtual Views of Large Heterogeneous Digital Libraries

This paper presents an innovative type of digital library basic architectural service, the Collection Service, that supports the dynamic construction of customized virtual user views of the digital library. These views make transparent to the users the real DL content, services and their physical organization. By realizing the independence between the physical digital library and the digital library perceived by the user, the Collection Service also creates the conditions for services optimization. The paper exemplifies this service by showing how it has been instantiated in the CYCLADES and SCHOLNET digital library systems.

Leonardo Candela, Donatella Castelli, Pasquale Pagano

Knowledge Organization: Authorities and Works

A Framework for Unified Authority Files: A Case Study of Corporate Body Names in the FAO Catalogue

We present a Unified Authority File for Names for use with the FAO Catalogue. This authority file will include all authorized forms of names, and can be used for highly precise resource discovery, as well as for record sharing. Other approaches of creating unified authority files are discussed. A major advantage of our proposal lies in the ease and sustainability of sharing records across authority files. The public would benefit from the Unified Authority File with its possibilities for cross-collection searching, and metadata creators would also have a greater possibility to utilize bibliographic records from other collections. A case study describes the treatment and use of corporate body names used in the catalogue of The Food and Agriculture Organization of the United Nations.

James Weinheimer, Kafkas Caprazli

geoXwalk – A Gazetteer Server and Service for UK Academia

This paper will summarise work undertaken on behalf of the UK academic community to evaluate and develop a gazetteer server and service which will underpin geographic searching within the UK distributed academic information network. It will outline the context and problem domain, report on issues investigated and the findings to date. Lastly, it poses some unresolved questions requiring further research and speculates on possible future directions.

James Reid

Utilizing Temporal Information in Topic Detection and Tracking

The harnessing of time-related information from text for use in information retrieval requires a leap from the surface forms of the expressions to a formalized time-axis. Often the expressions are used to form chronological sequences of events. However, we want to be able to determine the temporal similarity, i.e., the overlap of temporal references of two documents, and use this similarity in Topic Detection and Tracking, for example. We present a methodology for extraction of temporal expressions and a scheme for comparing the temporal evidence of news documents. We also examine the behavior of the temporal expressions and run experiments on an English news corpus.

Juha Makkonen, Helena Ahonen-Myka
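
The notion of temporal similarity in the abstract above, the overlap of the temporal references of two documents, can be illustrated with a minimal sketch. It assumes that temporal expressions have already been mapped to inclusive date intervals on a time-axis, and it uses a Jaccard ratio over covered days as the overlap measure. That measure and the function names are assumptions made for illustration, not the authors' scheme.

```python
from datetime import date


def covered_days(intervals):
    """Return the set of day ordinals covered by inclusive (start, end) date intervals."""
    days = set()
    for start, end in intervals:
        days.update(range(start.toordinal(), end.toordinal() + 1))
    return days


def temporal_similarity(intervals_a, intervals_b):
    """Jaccard overlap of the days referenced by two documents (0.0 if either has none)."""
    a = covered_days(intervals_a)
    b = covered_days(intervals_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical documents: each refers to a four-day span, two days of which coincide.
doc_a = [(date(2003, 8, 17), date(2003, 8, 20))]
doc_b = [(date(2003, 8, 19), date(2003, 8, 22))]
similarity = temporal_similarity(doc_a, doc_b)
```

A day-level set is the simplest possible time-axis; a real system would likely need to handle vague or open-ended expressions, which this sketch ignores.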

Automatic Conversion from MARC to FRBR

Catalogs have for centuries been the main tool that enabled users to search for items in a library by author, title, or subject. A catalog can be interpreted as a set of bibliographic records, where each record acts as a surrogate for a publication. Every record describes a specific publication and contains the data that is used to create the indexes of search systems and the information that is presented to the user. Bibliographic records are often captured and exchanged by the use of the MARC format. Although there are numerous "dialects" of the MARC format in use, they are usually crafted on the same basis and are interoperable with each other, to a certain extent. The data model of a MARC-based catalog, however, is "[...] extremely non-normalized with excessive replication of data" [1]. For instance, a literary work that exists in numerous editions and translations is likely to yield a large result set because each edition or translation is represented by an individual record that is unrelated to other records that describe the same work.

Christian Mönch, Trond Aalberg

Information Retrieval in Different Application Areas

Musescape: A Tool for Changing Music Collections into Libraries

Increases in hard disk capacity and audio compression technology have enabled the storage of large collections of music on personal computers and portable devices. As an example, a portable device with 20 Gigabytes of storage can hold up to 4000 songs in compressed audio format. Currently the only way of structuring these collections is using a file system hierarchy which allows very limited forms of searching and retrieval. These limitations are even more pronounced in the case of portable devices where there is less screen real estate and user attention is limited compared to a personal computer. Musescape is a prototype tool for organizing and interacting with large music collections in audio format with specific emphasis on portable devices. It provides a variety of automatic and manual ways to organize and interact with large music collections using a consistent continuous audio feedback user interface for browsing, searching and annotating. Using this system a user can convert an unstructured or partially structured collection of music with limited retrieval capabilities into a music library with enhanced functionality.

George Tzanetakis

A Digital GeoLibrary: Integrating Keywords And Place Names

A digital library typically includes a set of keywords (or subject terms) for each document in its collection(s). For some applications, including natural resource management, geographic location (e.g., the place of a study or a project) is very important. The metadata for such documents needs to indicate the location(s) associated with a document – and users need to be able to search for documents by keyword as well as location. We have developed and implemented a digital library that supports – but does not require – georeferenceable documents (i.e., documents with reference to geography through the use of a textual place name). Because of their implicit spatial footprint, place names benefit from spatial reasoning and querying (e.g., to find all documents that describe work performed within a five-mile radius of a certain point) in addition to traditional keyword-based search. This paper presents the architecture for a digital library that combines spatial reasoning and selection with traditional (non-spatial) search. The contributions of this work are: (1) the use of a traditional geographic information system (GIS) for spatial processing rather than a specially tailored GIS system or a separate gazetteer and (2) the seamless integration of GIS with our thesaurus-based Metadata++ system, so users can easily take advantage of the strengths of both systems.

Mathew Weaver, Lois Delcambre, Leonard Shapiro, Jason Brewster, Afrem Gutema, Timothy Tolle
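
The combination of keyword search with a spatial constraint described in the abstract above (for example, all documents about work within a five-mile radius of a point) can be sketched as follows. This is only an illustration under stated assumptions: the paper relies on a GIS for spatial processing, whereas this sketch substitutes a plain great-circle (haversine) distance over point coordinates. The document records, field names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def search(docs, keyword, center, radius_miles):
    """Titles of documents that carry the keyword and lie within the radius of center."""
    hits = []
    for doc in docs:
        location = doc.get("location")  # documents are not required to be georeferenced
        if keyword not in doc["keywords"] or location is None:
            continue
        if haversine_miles(center[0], center[1], location[0], location[1]) <= radius_miles:
            hits.append(doc["title"])
    return hits


# Hypothetical collection: one nearby match, one distant match, one without a location.
docs = [
    {"title": "Stream survey", "keywords": {"salmon"}, "location": (45.05, -122.0)},
    {"title": "Estuary study", "keywords": {"salmon"}, "location": (45.20, -122.0)},
    {"title": "Policy memo", "keywords": {"salmon"}},
]
nearby = search(docs, "salmon", center=(45.0, -122.0), radius_miles=5.0)
```

Treating each place name as a single point is a simplification; a GIS can reason over the full spatial footprint of a place rather than one coordinate pair.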

Document-Centered Collaboration for Scholars in the Humanities – The COLLATE System

In contrast to electronic document collections we find in contemporary digital libraries, systems applied in the cultural domain have to satisfy specific requirements with respect to data ingest, management, and access. Such systems should also be able to support the collaborative work of domain experts and furthermore offer mechanisms to exploit the value-added information resulting from a collaborative process like scientific discussions. In this paper, we present the solutions to these requirements developed and realized in the COLLATE system, where advanced methods for document classification, content management, and a new kind of context-based retrieval using scientific discourses are applied.

Ingo Frommholz, Holger Brocks, Ulrich Thiel, Erich Neuhold, Luigi Iannone, Giovanni Semeraro, Margherita Berardi, Michelangelo Ceci

Digital Preservation

DSpace as an Open Archival Information System: Current Status and Future Directions

As more and more output from research institutions is born digital, a means for capturing and preserving the results of this investment is required. To begin to understand and address the problems surrounding this task, Hewlett-Packard Laboratories collaborated with MIT Libraries over two years to develop DSpace, an open source institutional repository software system. This paper describes DSpace in the context of the Open Archival Information System (OAIS) reference model. Particular attention is given to the preservation aspects of DSpace, and the current status of the DSpace system with respect to addressing these aspects. The reasons for various design decisions and trade-offs that were necessary to develop the system in a timely manner are given, and directions for future development are explored. While DSpace is not yet a complete solution to the problem of preserving digital research output, it is a production-capable system, represents a significant step forward, and is an excellent platform for future research and development.

Robert Tansley, Mick Bass, MacKenzie Smith

Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives

This paper argues that the growing importance of the World Wide Web means that Web sites are key candidates for digital preservation. After a brief outline of some of the main reasons why the preservation of Web sites can be problematic, a review of selected Web archiving initiatives shows that most current initiatives are based on combinations of three main approaches: automatic harvesting, selection and deposit. The paper ends with a discussion of issues relating to collection and access policies, software, costs and preservation.

Michael Day

Implementing Preservation Strategies for Complex Multimedia Objects

Addressing the preservation and long-term access issues for digital resources is one of the key challenges facing informational organisations such as libraries, archives, cultural institutions and government agencies today. A number of major initiatives and projects have been established to investigate or develop strategies for preserving the burgeoning amounts of digital content being produced. To date, the alternative preservation approaches have been based on emulation, migration and metadata, or some combination of these. Most of the work has focussed on digital objects of a singular media type (text, HTML, images, video or audio), and to date few usable tools have been developed to support or implement such strategies or policies. In this paper we consider the preservation of composite, mixed-media objects, a rapidly growing class of resources. Using three exemplars of new media artwork as case studies, we describe the optimum preservation strategies that we have determined for each exemplar and the software tools that we have developed to support and implement those strategies.

Jane Hunter, Sharmin Choudhury

Indexing and Searching of Special Document and Collection Information

Distributed IR for Digital Libraries

This paper examines technology developed to support large-scale distributed digital libraries. We describe the method used for harvesting collection information using standard information retrieval protocols and how this information is used in collection ranking and retrieval. The system that we have developed takes a probabilistic approach to distributed information retrieval using a logistic regression algorithm for estimation of distributed collection relevance and fusion techniques to combine multiple sources of evidence. We discuss the harvesting method used and how it can be employed in building collection representatives using features of the Z39.50 protocol. The extracted collection representatives are ranked using a fusion of probabilistic retrieval methods. The effectiveness of our algorithm is compared to other distributed search methods using test collections developed for distributed search evaluation. We also describe how this system is currently being applied to operational systems in the U.K.

Ray R. Larson

Reference Directed Indexing: Redeeming Relevance for Subject Search in Citation Indexes

Citation indexes are valuable tools for research, in part because they provide a means with which to measure the relative impact of articles in a collection of scientific literature. Recent efforts demonstrate some value in retrieval systems for citation indexes based on measures of impact. However, such approaches use weak measures of relevance, ranking together a few useful documents with many that are frequently cited but irrelevant. We propose an indexing technique that joins measures of relevance and impact in a single retrieval metric. This approach, called Reference Directed Indexing (RDI), is based on a comparison of the terms authors use in reference to documents. Initial retrieval experiments with RDI indicate that it retrieves documents of a quality on par with current ranking metrics, but with significantly improved relevance.

Shannon Bradshaw

Space-Efficient Support for Temporal Text Indexing in a Document Archive Context

Support for temporal text-containment queries (queries for all versions of documents that contained one or more particular words at a particular time t) is of interest in a number of contexts, including web archives, smaller-scale temporal XML/web warehouses, and temporal document database systems in general. In the V2 temporal document database system we employed a combination of full-text indexes and variants of time indexes to perform efficient text-containment queries. That approach was optimized for moderately large temporal document databases. However, for "extremely large databases" the index space usage of the approach could be too large. In this paper, we present a more space-efficient solution to the problem: the interval-based temporal text index (ITTX). We also present appropriate algorithms for update and retrieval, and we discuss advantages and disadvantages of the V2 and ITTX approaches.

Kjetil Nørvåg
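
A temporal text-containment query of the kind described in the abstract above can be illustrated with a small in-memory sketch. It stores one posting per word and document for each interval during which the word was present, rather than one posting per version, which is the general idea behind interval-based indexing. The class, its methods and the integer timestamps are assumptions made for illustration and do not reproduce the actual ITTX structures or algorithms. For brevity the query returns document identifiers rather than version identifiers.

```python
from collections import defaultdict


class IntervalTextIndex:
    """Toy index: word -> postings [doc_id, start, end), with end=None while still current."""

    def __init__(self):
        self._postings = defaultdict(list)
        self._open = {}  # (doc_id, word) -> posting whose interval is still open

    def update(self, doc_id, text, t):
        """Record a new version of doc_id with the given text, valid from time t."""
        new_words = set(text.lower().split())
        # Close intervals of words that are no longer in the document.
        for (d, word), posting in list(self._open.items()):
            if d == doc_id and word not in new_words:
                posting[2] = t
                del self._open[(d, word)]
        # Open intervals for words that were not in the previous version.
        for word in new_words:
            if (doc_id, word) not in self._open:
                posting = [doc_id, t, None]
                self._postings[word].append(posting)
                self._open[(doc_id, word)] = posting

    def query(self, word, t):
        """Sorted ids of documents whose version valid at time t contained the word."""
        hits = set()
        for doc_id, start, end in self._postings.get(word.lower(), []):
            if start <= t and (end is None or t < end):
                hits.add(doc_id)
        return sorted(hits)


# Hypothetical update history with integer timestamps.
index = IntervalTextIndex()
index.update("d1", "digital library", 1)
index.update("d2", "library systems", 3)
index.update("d1", "digital archive", 5)
at_time_4 = index.query("library", 4)
```

In this sketch the word "digital" in document d1 occupies a single posting across both versions, which is where the space saving over per-version postings comes from.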

Clustering Top-Ranking Sentences for Information Access

In this paper we propose the clustering of top-ranking sentences (TRS) for effective information access. Top-ranking sentences are selected by a query-biased sentence extraction model. By clustering such sentences, we aim to generate and present to users a personalised information space. We outline our approach in detail and we describe how we plan to utilise user interaction with this space for effective information access. We present an initial evaluation of TRS clustering by comparing its effectiveness at providing access to useful information to that of document clustering.

Anastasios Tombros, Joemon M. Jose, Ian Ruthven
