
Springer Professional


2004 | Book | 2nd edition


Information Retrieval

Algorithms and Heuristics

Authors: David A. Grossman, Ophir Frieder

Publisher: Springer Netherlands

Book series: The Information Retrieval Series

Included in: Professional Book Archive


About this book

Interested in how an efficient search engine works? Want to know what algorithms are used to rank resulting documents in response to user requests? The authors answer these and other key information retrieval design and implementation questions.

This book is not yet another high-level text. Instead, algorithms are thoroughly described, making this book ideally suited for both computer science students and practitioners who work on search-related applications. As stated in the foreword, this book provides a current, broad, and detailed overview of the field and is the only one that does so. Examples are used throughout to illustrate the algorithms.

The authors explain how a query is ranked against a document collection using either a single retrieval strategy or a combination of strategies, and how an assortment of utilities is integrated into the query processing scheme to improve these rankings. Methods for building and compressing text indexes, querying and retrieving documents in multiple languages, and using parallel or distributed processing to expedite the search are likewise described.

This edition is a major expansion of the one published in 1998. Besides updating the entire book with current techniques, it includes new sections on language models, cross-language information retrieval, peer-to-peer processing, XML search, mediators, and duplicate document detection.


Table of contents

Frontmatter

Chapter 1. Introduction

Abstract

Since the near beginnings of civilization, human beings have focused on written communication. From cave drawings to scroll writings, from printing presses to electronic libraries, communicating was of primary concern to man’s existence. Today, with the emergence of digital libraries and electronic information exchange, there is a clear need for improved techniques to organize large quantities of information. Applied and theoretical research and development in the areas of information authorship, processing, storage, and retrieval is of interest to all sectors of the community. In this book, we survey recent research efforts that focus on the electronic searching and retrieving of documents.

David A. Grossman, Ophir Frieder

Chapter 2. Retrieval Strategies

Abstract

Retrieval strategies assign a measure of similarity between a query and a document. These strategies are based on the common notion that the more often terms are found in both the document and the query, the more “relevant” the document is deemed to be to the query. Some of these strategies employ countermeasures to alleviate problems that occur due to the ambiguities inherent in language: the same concept can often be described with many different terms (e.g., new york and the big apple can refer to the same concept). Additionally, the same term can have numerous semantic definitions (terms like bark and duck have very different meanings in their noun and verb forms).
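
The "shared terms" notion in this abstract can be illustrated with a minimal sketch. It is not taken from the book: the function names, the whitespace tokenizer, and the sample texts are assumptions chosen for illustration, and the score is a plain dot product of raw term counts rather than any specific strategy the authors describe.

```python
from collections import Counter


def term_counts(text):
    # Naive tokenizer (an assumption): lowercase, then split on whitespace.
    return Counter(text.lower().split())


def overlap_score(query, document):
    # Sum, over terms present in both texts, of the product of their counts.
    q = term_counts(query)
    d = term_counts(document)
    return sum(q[t] * d[t] for t in q if t in d)


def rank(query, documents):
    # Return document indices ordered from highest to lowest score.
    scores = [overlap_score(query, doc) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


# Hypothetical sample documents, for illustration only.
docs = ["the duck swam in the pond", "bark of a tree", "a duck"]
ranking = rank("duck pond", docs)
```

With this scoring, the query "new york" gets a score of zero against a document that only says "the big apple", which is exactly the synonym problem the abstract mentions; countermeasures for it are outside the scope of this sketch.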

David A. Grossman, Ophir Frieder

Chapter 3. Retrieval Utilities

Abstract

Many different utilities improve the results of a retrieval strategy. Most utilities add or remove terms from the initial query in an attempt to refine the query. Others simply refine the focus of the query by using subdocuments or passages instead of whole documents. The key is that each of these utilities (although rarely presented as such) is a plug-and-play utility that operates with any arbitrary retrieval strategy.

David A. Grossman, Ophir Frieder

Chapter 4. Cross-Language Information Retrieval

Abstract

Cross-Language Information Retrieval (CLIR) is quickly becoming a mature area in the information retrieval world. The goal is to allow a user to issue a query in language L and have that query retrieve documents in language L′ (see Figure 4.1). The idea is that the user wants to issue a single query against a document collection that contains documents in a myriad of languages. An implicit assumption is that the user understands results obtained in multiple languages. If this is not the case, it is necessary for the retrieval system to translate the selected foreign language documents into a language that the user can understand. Surveys of cross-language information retrieval techniques and multilingual processing include [Oard and Diekema, 1998; Haddouti, 1999].

David A. Grossman, Ophir Frieder

Chapter 5. Efficiency

Abstract

Thus far, we have discussed algorithms used to improve the effectiveness of query processing in terms of precision and recall. Retrieval strategies and utilities all focus on finding the relevant documents for a query. They are not concerned with how long it takes to find them.

David A. Grossman, Ophir Frieder

Chapter 6. Integrating Structured Data and Text

Abstract

Essential problems associated with searching and retrieving relevant documents were discussed in the preceding chapters. However, simply searching massive quantities of unstructured data is not sufficient.

David A. Grossman, Ophir Frieder

Chapter 7. Parallel Information Retrieval

Abstract

Parallel architectures are often described based on the number of instruction and data streams, namely single and multiple data and instruction streams. A complete taxonomy of different combinations of instruction streams and data was given in [Flynn, 1972]. To evaluate the performance delivered by these architectures on a given computation, speedup is defined as \(\frac{T_s}{T_p}\), where \(T_s\) is the time taken by the best sequential algorithm, and \(T_p\) is the time taken by the parallel algorithm under consideration. The higher the speedup, the better the performance. The motivation for measuring speedup is that it indicates whether or not an algorithm scales. An algorithm that has near linear speedup on sixteen processors may not exhibit similar speedup on hundreds of processors. However, an algorithm that delivers very little or no speedup on only two processors will certainly not scale to large numbers of processors.
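
The speedup definition above can be worked through with a short sketch. The timings are made-up numbers chosen only to mirror the scaling remark in the abstract (near linear on sixteen processors, far from linear on hundreds); they are not measurements from the book, and the function and variable names are assumptions.

```python
def speedup(t_sequential, t_parallel):
    # Speedup = T_s / T_p, following the definition in the abstract.
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_sequential / t_parallel


# Hypothetical timings in seconds (illustrative only, not measured data).
T_S = 120.0
parallel_times = {2: 60.0, 16: 8.0, 256: 4.0}

# Map each processor count to the speedup it achieves.
speedups = {p: speedup(T_S, t) for p, t in parallel_times.items()}

for p, s in speedups.items():
    print(f"{p} processors: speedup {s:.1f}")
```

In this invented data set the speedup is 2.0 on two processors and 15.0 on sixteen, but only 30.0 on 256, so the hypothetical algorithm stops scaling well before the largest configuration.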

David A. Grossman, Ophir Frieder

Chapter 8. Distributed Information Retrieval

Abstract

Until now, we focused strictly on the use of a single machine to provide an information retrieval service. In Chapter 7, we discussed the use of a single machine with multiple processors to improve performance. Although efficient performance is critical for user acceptance of the system, today, document collections are often scattered across many different geographical areas. Thus, the ability to process the data where they are located is arguably even more important than the ability to efficiently process them. Possible constraints prohibiting the centralization of the data include data security, their sheer volume prohibiting their physical transfer, their rate of change, political and legal constraints, as well as other proprietary motivations. For a comprehensive discussion from a data engineering perspective on the engineering of data processing systems in a distributed environment, see [Shuey et al., 1997].

David A. Grossman, Ophir Frieder

Chapter 9. Summary and Future Directions

Abstract

We described a variety of search and retrieval approaches, most of which primarily focused on improving the accuracy of information retrieval engines. Unlike other search and retrieval domains, e.g., traditional relational databases, the accuracy of retrieval is not constant. That is, in the traditional relational database domain all techniques result in perfect accuracy. Hence, the main concern, in terms of performance evaluation, is the overall system throughput and the individual query performance.

David A. Grossman, Ophir Frieder

Backmatter

Title: Information Retrieval
Authors: David A. Grossman, Ophir Frieder
Copyright year: 2004
Publisher: Springer Netherlands
Electronic ISBN: 978-1-4020-3005-5
Print ISBN: 978-1-4020-3004-8
DOI: https://doi.org/10.1007/978-1-4020-3005-5