About this book

Information retrieval (IR) is becoming an increasingly important area as scientific, business and government organisations take up the notion of "information superhighways" and make available their full text databases for searching. Containing a selection of 35 papers taken from the 17th Annual SIGIR Conference held in Dublin, Ireland in July 1994, the book addresses basic research and provides an evaluation of information retrieval techniques in applications. Topics covered include text categorisation, indexing, user modelling, IR theory and logic, natural language processing, statistical and probabilistic models of information retrieval systems, routing, passage retrieval, and implementation issues.



Text Categorisation


A Sequential Algorithm for Training Text Classifiers

The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.

David D. Lewis, William A. Gale
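
The sampling loop in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `train`, `oracle`, `rounds` and `batch_size` are hypothetical, and "most uncertain" is assumed to mean a predicted class probability closest to 0.5.

```python
from typing import Callable, List, Sequence, Tuple


def select_most_uncertain(probs: Sequence[float], batch_size: int) -> List[int]:
    """Return indices of the items whose predicted probability is closest to 0.5."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:batch_size]


def uncertainty_sampling(
    pool: Sequence[object],
    seed_labeled: Sequence[Tuple[object, bool]],
    train: Callable[[List[Tuple[object, bool]]], Callable[[object], float]],
    oracle: Callable[[object], bool],
    rounds: int,
    batch_size: int,
) -> List[Tuple[object, bool]]:
    """Grow a labeled set by repeatedly labeling the most uncertain pool items.

    train(labeled) is assumed to return a function mapping an item to an
    estimated probability of class membership; oracle(item) stands in for
    the human who supplies the true label.
    """
    labeled = list(seed_labeled)
    unlabeled = list(pool)
    for _ in range(rounds):
        if not unlabeled:
            break
        predict = train(labeled)
        probs = [predict(x) for x in unlabeled]
        picked = select_most_uncertain(probs, batch_size)
        # Pop from the highest index down so earlier indices stay valid.
        for i in sorted(picked, reverse=True):
            item = unlabeled.pop(i)
            labeled.append((item, oracle(item)))
    return labeled
```

Only the items the current classifier is least sure about are sent for manual labeling, which is where the saving in labeling effort would come from.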

Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval

Expert Network (ExpNet) is our new approach to automatic categorization and retrieval of natural language texts. We use a training set of texts with expert-assigned categories to construct a network which approximately reflects the conditional probabilities of categories given a text. The input nodes of the network are words in the training texts, the nodes on the intermediate level are the training texts, and the output nodes are categories. The links between nodes are computed based on statistics of the word distribution and the category distribution over the training set. ExpNet is used for relevance ranking of candidate categories of an arbitrary text in the case of text categorization, and for relevance ranking of documents via categories in the case of text retrieval. We have evaluated ExpNet in categorization and retrieval on a document collection of the MEDLINE database, and observed a performance in recall and precision comparable to the Linear Least Squares Fit (LLSF) mapping method, and significantly better than other methods tested. Computationally, ExpNet has an O(N log N) time complexity which is much more efficient than the cubic complexity of the LLSF method. The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.

Yiming Yang

Towards Language Independent Automated Learning of Text Categorization Models

We describe the results of extensive machine learning experiments on large collections of Reuters’ English and German newswires. The goal of these experiments was to automatically discover classification patterns that can be used for assignment of topics to the individual newswires. Our results with the English newswire collection show a very large gain in performance as compared to published benchmarks, while our initial results with the German newswires appear very promising. We present our methodology, which seems to be insensitive to the language of the document collections, and discuss issues related to the differences in results that we have obtained for the two collections.

Chidanand Apté, Fred Damerau, Sholom M. Weiss

Using IR Techniques for Text Classification in Document Analysis

This paper presents the INFOCLAS system applying statistical methods of information retrieval for the classification of German business letters into corresponding message types such as order, offer, enclosure, etc. INFOCLAS is a first step towards the understanding of documents proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). The system employs several knowledge sources including a letter database, word frequency statistics for German, lists of message type specific words, morphological knowledge as well as the underlying document structure. As output, the system evaluates a set of weighted hypotheses about the type of the actual letter. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.

Rainer Hoch



An Evaluation Method for Stemming Algorithms

The effectiveness of stemming algorithms has usually been measured in terms of their effect on retrieval performance with test collections. This however does not provide any insights which might help in stemmer optimisation. This paper describes a method in which stemming performance is assessed against predefined concept groups in samples of words. This enables various indices of stemming performance and weight to be computed. Results are reported for three stemming algorithms. The validity and usefulness of the approach, and the problems of conceptual grouping, are discussed, and directions for further research are identified.

Chris D. Paice

On the Measurement of Inter-Linker Consistency and Retrieval Effectiveness in Hypertext Databases

An important stage in the process of retrieval of objects from a hypertext database is the creation of a set of inter-nodal links that are intended to represent the relationships existing between objects; this operation is often undertaken manually, just as index terms are often manually assigned to documents in a conventional retrieval system. In this paper, a study is reported in which several different sets of links were inserted, each by a different person, between the paragraphs of each of a number of full-text documents. The degree of similarity between the members of each pair of link-sets (i.e., the degree of inter-linker consistency) was then evaluated. The results indicated that little similarity existed amongst the link-sets, a finding that is comparable with those of studies of inter-indexer consistency, which suggest that there is generally only a low level of agreement between the sets of index terms assigned to a document by different indexers. These latter studies have historically been considered significant on account of their common assumption that there exists a positive relationship between recorded levels of inter-indexer consistency and the levels of retrieval effectiveness that may be achieved by the systems studied. In order to test the validity of making a similar assumption in the context of link-assignment, the paper continues with a description of an investigation into the nature of the relationship existing between (i) the levels of inter-linker consistency obtaining among the group of hypertext databases used in our earlier experiments and (ii) the levels of effectiveness of a number of searches carried out in those databases. An account is given of the implementation of the searches and of the methods used in the calculation of numerical values expressing their effectiveness, and conclusions are drawn regarding the consistency-effectiveness relationship.

David Ellis, Jonathan Furner-Hines, Peter Willett

Query Expansion using Lexical-Semantic Relations

Applications such as office automation, news filtering, help facilities in complex systems, and the like require the ability to retrieve documents from full-text databases where vocabulary problems can be particularly severe. Experiments performed on small collections with single-domain thesauri suggest that expanding query vectors with words that are lexically related to the original query words can ameliorate some of the problems of mismatched vocabularies. This paper examines the utility of lexical query expansion in the large, diverse TREC collection. Concepts are represented by WordNet synonym sets and are expanded by following the typed links included in WordNet. Experimental results show this query expansion technique makes little difference in retrieval effectiveness if the original queries are relatively complete descriptions of the information being sought even when the concepts to be expanded are selected by hand. Less well developed queries can be significantly improved by expansion of hand-chosen concepts. However, an automatic procedure that can approximate the set of hand picked synonym sets has yet to be devised, and expanding by the synonym sets that are automatically generated can degrade retrieval performance.

Ellen M. Voorhees

User Modelling


Perceptual Speed, Learning and Information Retrieval Performance

Although the cognitive ability “perceptual speed” is known to influence search performance by end-users, previous research has not established the mechanism by which this influence occurred. Results from educational psychology suggest that learning that occurs during searching is likely to be influenced by perceptual speed. An experiment was designed to test how this cognitive ability would interact with a system feature designed to enhance learning of search vocabulary, specifically, presenting subject descriptors as the first element in the display of a reference. Results showed significant interactions between perceptual speed and the order of presentation of data elements in predicting both vocabulary learning and search performance. These results indicate that searchers with higher levels of perceptual speed will learn additional search vocabulary, and use that vocabulary to complete higher quality searches, when they use a system designed to optimize scanning of subject descriptors. This outcome supports the idea that cognitive abilities influence information system usability, and that usability is determined by interactions between characteristics of users and system features. The findings also suggest that system features that enhance the learning of search vocabulary, such as query expansion mechanisms, can have a significant positive effect on the quality of end-user searching.

Bryce Allen

Term Relevance Feedback and Query Expansion: Relation to Design

To improve information retrieval effectiveness, research in both the algorithmic and human approach to query expansion is required. This paper uses the human approach to examine the selection and effectiveness of search term sources for query expansion. The results show that the most effective sources were the user’s written question statement, user terms derived during the interaction and terms selected from particular database fields. These findings indicate the need for the design and testing of automatic relevance feedback techniques that place greater emphasis on these sources.

Amanda Spink

Modelling Information Retrieval Agents with Belief Revision

This paper describes the development and computational testing of a model of the information intermediary based on an AI theory of belief revision. We describe the theoretical foundations of the work in a general account of the way an agent’s beliefs and intentions are formed and modified, and in an analysis of the functional tasks an intermediary has to carry out; we indicate the specific developments required to automate and integrate both aspects of intermediary behaviour, as determinants of interactive dialogue with the user; and report, with illustrations, on tests and findings. The research shows that such approaches can be implemented in an essentially principled manner, though there are many large problems still to be overcome, and our experiments are only the first, extremely simple, trials of the basic strategy for intermediary simulation.

Brian Logan, Steven Reece, Karen Sparck Jones

Polyrepresentation of Information Needs and Semantic Entities: Elements of a Cognitive Theory for Information Retrieval Interaction

The paper outlines the principles underlying the theory of polyrepresentation applied to the user’s cognitive space and the information space of IR systems, set in a cognitive framework. By means of polyrepresentation it is suggested to represent the current user’s information need, problem state, and domain work task or interest in a structure of causality as well as to embody semantic full-text entities by means of the principle of ‘intentional redundancy’. In IR systems this principle implies simultaneously applying different methods of representation and a variety of IR techniques of different cognitive origin to each entity. The objective is to bring text retrieval as close as possible to retrieval of information in a cognitive sense.

Peter Ingwersen

Theory and Logic


Investigating Aboutness Axioms using Information Fields

This article proposes a framework, a so-called information field, which allows information retrieval mechanisms to be compared inductively instead of experimentally. Such a comparison occurs as follows: Both retrieval mechanisms are first mapped to an associated information field. Within the field, the axioms that drive the retrieval process can be filtered out. In this way, the implicit assumptions governing an information retrieval mechanism can be brought to light. The retrieval mechanisms can then be compared according to which axioms they are governed by. Using this method it is shown that Boolean retrieval is more powerful than a strict form of coordinate retrieval. The salient point is not this result in itself, but how the result was achieved.

P. D. Bruza, T. W. C. Huibers

A Probabilistic Terminological Logic for Modelling Information Retrieval

Some researchers have recently argued that the task of Information Retrieval (IR) may successfully be described by means of mathematical logic; accordingly, the relevance of a given document to a given information need should be assessed by checking the validity of the logical formula d → n, where d is the representation of the document, n is the representation of the information need and “→” is the conditional connective of the logic in question. In a recent paper we have proposed Terminological Logics (TLs) as suitable logics for modelling IR within the paradigm described above. This proposal, however, while making a step towards adequately modelling IR in a logical way, does not account for the fact that the relevance of a document to an information need can only be assessed up to a limited degree of certainty. In this work, we try to overcome this limitation by introducing a model of IR based on a Probabilistic TL, i.e. a logic allowing the expression of real-valued terms representing probability values and possibly involving expressions of a TL. Two different types of probabilistic information, i.e. statistical information and information about degrees of belief, can be accounted for in this logic. The paper presents a formal syntax and a denotational (possible-worlds) semantics for this logic, and discusses, by means of a number of examples, its adequacy as a formal tool for describing IR.

Fabrizio Sebastiani

Natural Language Processing


Retrieving Terms and their Variants in a Lexicalized Unification-Based Framework

Term extraction is a major concern for information retrieval. Terms are not fixed forms and their variations prevent them from being identified by a match with their initial string or inflection. We show that a local syntactic approach to this problem can give good results for both the quality of identification and parsing time. A specific tool, FASTR, is developed which handles an identification of basic terms and a parser of their variations as well. Terms are described by logic rules automatically generated from terms and their categorial structure. Variations are represented by metarules. The parser efficiently processes large size corpora with big dictionaries and mixes lexical identification with local syntactic analysis. We evaluate the accuracy of results produced by these metarules and improve these results with filtering metarules.

Christian Jacquemin, Jean Royaute

Word Sense Disambiguation and Information Retrieval

It has often been thought that word sense ambiguity is a cause of poor performance in Information Retrieval (IR) systems. The belief is that if ambiguous words can be correctly disambiguated, IR performance will increase. However, recent research into the application of a word sense disambiguator to an IR system failed to show any performance increase. From these results it has become clear that more basic research is needed to investigate the relationship between sense ambiguity, disambiguation, and IR. Using a technique that introduces additional sense ambiguity into a collection, this paper presents research that goes beyond previous work in this field to reveal the influence that ambiguity and disambiguation have on a probabilistic IR system. We conclude that word sense ambiguity is only problematic to an IR system when it is retrieving from very short queries. In addition we argue that if a word sense disambiguator is to be of any use to an IR system, the disambiguator must be able to resolve word senses to a high degree of accuracy.

Mark Sanderson

A Full-Text Retrieval System with a Dynamic Abstract Generation Function

We have developed a Japanese full-text retrieval system named BREVIDOC* that enables the user to specify an area within a text for abstraction and to control the volume of the abstract interactively. This system analyzes a document structure using linguistic knowledge only and thus is domain-independent. In its text structure analysis, the system determines relations among paragraphs and sentences, based on linguistic clues such as connectives, anaphoric expressions, and idiomatic expressions. The system analyzes and stores the text structure in advance so that it can generate an abstract in real time by selecting sentences according to relative importance of rhetorical relations among the sentences. The retrieval system works on an engineering workstation.

Seiji Miike, Etsuo Itoh, Kenji Ono, Kazuo Sumita

Statistical Models


A Document Retrieval Model Based on Term Frequency Ranks

This paper introduces a new full-text document retrieval model that is based on comparing occurrence frequency rank numbers of terms in queries and documents. More precisely, to compute the similarity between a query and a document, this new model first ranks the terms in the query and in the document on decreasing occurrence frequency. Next, for each term, it computes a local similarity between the query and the document, by calculating a weighted difference between the term’s rank number in the query and its rank number in the document. Finally, it collects all those local similarities and unifies them into one global similarity between the query and the document. In this paper we also demonstrate that the effectiveness of this new full-text document retrieval model is comparable with that of the standard vector-space retrieval model.

IJsbrand Jan Aalbersberg
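
The three steps named in the abstract above (rank terms by frequency, compute a local similarity per term, combine into a global similarity) can be sketched as follows. The abstract does not give the weighting, so the local similarity `1 / (1 + |rank difference|)` and the averaging over query terms are assumptions chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def frequency_ranks(tokens: Sequence[str]) -> Dict[str, int]:
    """Map each term to its rank (1 = most frequent); ties break alphabetically."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {term: rank for rank, term in enumerate(ordered, start=1)}


def rank_similarity(query_tokens: Sequence[str], doc_tokens: Sequence[str]) -> float:
    """Global similarity: mean of per-term local similarities over the query terms.

    Local similarity is an assumed form, 1 / (1 + |rank_in_query - rank_in_doc|);
    query terms absent from the document contribute 0.
    """
    query_ranks = frequency_ranks(query_tokens)
    doc_ranks = frequency_ranks(doc_tokens)
    if not query_ranks:
        return 0.0
    total = 0.0
    for term, q_rank in query_ranks.items():
        if term in doc_ranks:
            total += 1.0 / (1.0 + abs(q_rank - doc_ranks[term]))
    return total / len(query_ranks)
```

With this choice a document whose frequency ordering matches the query exactly scores 1.0, and a document sharing no terms with the query scores 0.0.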

Automatic Combination of Multiple Ranked Retrieval Systems

Retrieval performance can often be improved significantly by using a number of different retrieval algorithms and combining the results, in contrast to using just a single retrieval algorithm. This is because different retrieval algorithms, or retrieval experts, often emphasize different document and query features when determining relevance and therefore retrieve different sets of documents. However, it is unclear how the different experts are to be combined, in general, to yield a superior overall estimate. We propose a method by which the relevance estimates made by different experts can be automatically combined to result in superior retrieval performance. We apply the method to two expert combination tasks. The applications demonstrate that the method can identify high performance combinations of experts and also is a novel means for determining the combined effectiveness of experts.

Brian T. Bartell, Garrison W. Cottrell, Richard K. Belew

Properties of Extended Boolean Models in Information Retrieval

The conventional boolean retrieval system does not provide ranked retrieval output because it cannot compute similarity coefficients between queries and documents. Extended boolean models such as fuzzy set, Waller-Kraft, Paice, P-Norm and Infinite-One have been proposed in the past to add a ranking facility to the boolean retrieval system. In this paper, we analyze the behavioural aspects of these extended boolean models and identify important mathematical properties that affect retrieval effectiveness. We concentrate our description on evaluation formulas for AND and OR operations and query weights. Our analyses show that P-Norm is the most suitable for achieving high retrieval effectiveness.

Joon Ho Lee
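
For readers unfamiliar with the P-Norm model named in the abstract above, its AND and OR evaluation formulas can be sketched as below. The sketch uses the standard P-Norm definition with all query weights assumed equal to 1; the formulas are not quoted from this paper, and the function names are hypothetical.

```python
from typing import Sequence


def pnorm_or(term_weights: Sequence[float], p: float = 2.0) -> float:
    """P-Norm OR: p-norm distance from the all-zero point, scaled to [0, 1]."""
    if not term_weights:
        raise ValueError("at least one term weight is required")
    n = len(term_weights)
    return (sum(w ** p for w in term_weights) / n) ** (1.0 / p)


def pnorm_and(term_weights: Sequence[float], p: float = 2.0) -> float:
    """P-Norm AND: one minus the p-norm distance from the all-one point."""
    if not term_weights:
        raise ValueError("at least one term weight is required")
    n = len(term_weights)
    return 1.0 - (sum((1.0 - w) ** p for w in term_weights) / n) ** (1.0 / p)
```

Each weight is a document term weight in [0, 1]. With `p = 1` both operators reduce to the plain average, and as `p` grows they approach the strict boolean behaviour (maximum for OR, minimum for AND), which is what lets a single parameter tune how strictly the operators are interpreted.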

Performance Evaluation


OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research

A series of information retrieval experiments was carried out with a computer installed in a medical practice setting for relatively inexperienced physician end-users. Using a commercial MEDLINE product based on the vector space model, these physicians searched just as effectively as more experienced searchers using Boolean searching. The results of this experiment were subsequently used to create a new large medical test collection, which was used in experiments with the SMART retrieval system to obtain baseline performance data as well as compare SMART with the other searchers.

William Hersh, Chris Buckley, T. J. Leone, David Hickam

Results of Applying Probabilistic IR to OCR Text

Character accuracy of optically recognized text is considered a basic measure for evaluating OCR devices. In the broader sense, another fundamental measure of an OCR’s goodness is whether its generated text is usable for retrieving information. In this study, we evaluate retrieval effectiveness from OCR text databases using a probabilistic IR system. We compare these retrieval results to their manually corrected equivalent. We show there is no statistical difference in precision and recall using graded accuracy levels from three OCR devices. However, characteristics of the OCR data have side effects that could cause unstable results with this IR model. In particular, we found individual queries can be greatly affected. Knowing the qualities of OCR text, we compensate for them by applying an automatic post-processing system that improves effectiveness.

Kazem Taghva, Julie Borsack, Allen Condit

Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance

The results of experiments comparing the relative performance of natural language and Boolean query formulations are presented. The experiments show that on average a current generation natural language system provides better retrieval performance than expert searchers using a Boolean retrieval system when searching full-text legal materials. Methodological issues are reviewed and the effect of database size on query formulation strategy is discussed.

Howard Turtle

Probabilistic Models


Inferring Probability of Relevance Using the Method of Logistic Regression

This research evaluates a model for probabilistic text and document retrieval; the model utilizes the technique of logistic regression to obtain equations which rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. By transforming the distribution of each statistical clue into its standardized distribution (one with mean μ = 0 and standard deviation σ = 1), the method allows one to apply logistic coefficients derived from a training collection to other document collections, with little loss of predictive power. The model is applied to three well-known information retrieval test collections, and the results are compared directly to the particular vector space model of retrieval which uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference method performs significantly better than (in two collections) or equally well as (in the third collection) the tfidf/cosine vector space model. The differences in performances of the two models were subjected to statistical tests to see if the differences are statistically significant or could have occurred by chance.

Fredric C. Gey
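
The standardization step and the logistic ranking function described in the abstract above can be sketched as follows. The coefficient values, the intercept and the choice of clues are placeholders; the abstract does not list them, so everything passed to `relevance_probability` here is an assumption.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Transform a clue's values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def relevance_probability(
    standardized_clues: Sequence[float],
    coefficients: Sequence[float],
    intercept: float,
) -> float:
    """Logistic model: log-odds of relevance are linear in the standardized clues."""
    log_odds = intercept + sum(c * z for c, z in zip(coefficients, standardized_clues))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Because each clue is standardized within its own collection before the coefficients are applied, coefficients fitted on one collection can be reused on another whose raw clue values have a different scale.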

Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval

The 2-Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval. The variables concerned are within-document term frequency, document length, and within-query term frequency. Simple weighting functions are developed, and tested on the TREC test collection. Considerable performance improvements (over simple inverse collection frequency weighting) are demonstrated.

S. E. Robertson, S. Walker
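
To make the idea of a simple weighting function over term frequency and document length concrete, here is a minimal sketch. The abstract above does not state the functions, so the saturating form `tf / (K + tf)`, the length scaling of `K`, the constant `k1` and the particular inverse collection frequency formula are all assumptions used for illustration only.

```python
import math


def icf_weight(n_docs: int, doc_freq: int) -> float:
    """Inverse collection frequency weight (assumed form with 0.5 smoothing)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def saturating_tf_weight(
    tf: int,
    doc_len: float,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.0,
) -> float:
    """Term weight that grows with tf but levels off, penalising long documents.

    k1 is a hypothetical tuning constant; it is scaled by relative document
    length so that the same tf counts for less in a longer document.
    """
    scaled_k = k1 * doc_len / avg_doc_len
    return tf / (scaled_k + tf) * icf_weight(n_docs, doc_freq)
```

The weight is zero when the term is absent, rises with each further occurrence, and never exceeds the plain inverse collection frequency weight.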

The Formalism of Probability Theory in IR: A Foundation or an Encumbrance?

Probabilistic theories of retrieval bring to bear on the information search problem a high degree of theoretical coherence and deductive power. In principle, this power ought to be an invaluable asset. In practice, it has turned out to be a mixed blessing. The question considered here is whether the trappings of the probabilistic formalism strengthen or encumber IR research on balance.

Wm. S. Cooper



LyberWorld — A Visualization User Interface Supporting Fulltext Retrieval

LyberWorld is a prototype IR user interface. It implements visualizations of an abstract information space — fulltext. The paper derives a model for such visualizations and an exemplar user interface design is implemented for the probabilistic fulltext retrieval system INQUERY. Visualizations are used to communicate information search and browsing activities in a natural way by applying metaphors of spatial navigation in abstract information spaces. Visualization tools for exploring information spaces and judging relevance of information items are introduced and an example session demonstrates the prototype. The presence of a spatial model in the user’s mind and interaction with a system’s corresponding display methods is regarded as an essential contribution towards natural interaction and reduction of cognitive costs during e.g. query construction, orientation within the database content, relevance judgement and orientation within the retrieval context.

Matthias Hemmje, Clemens Kunkel, Alexander Willett

A System for Discovering Relationships by Feature Extraction from Text Databases

A method for accessing text-based information using domain-specific features rather than documents alone is presented. The basis of this approach is the ability to automatically extract features from large text databases, and identify statistically significant relationships or associations between those features. The techniques supporting this approach are discussed, and examples from an application using these techniques, named the Associations System, are illustrated using the Wall Street Journal database. In this particular application, the features extracted are company and person names. The series of tests run on the Associations System demonstrate that feature extraction can be quite accurate, and that the relationships generated are reliable. In addition to conventional measures of recall and precision, evaluation measures are currently being studied which will indicate the usefulness of the relationships identified, in various domain-specific contexts.

Jack G. Conrad, Mary Hunter Utt



Information Filtering Based on User Behavior Analysis and Best Match Text Retrieval

Information filtering systems have potential power that may provide an efficient means of navigating through large and diverse data space. However, current information filtering technology heavily depends on a user’s active participation for describing the user’s interest to information items, forcing the user to accept extra load to overcome the already loaded situation. Furthermore, because the user’s interests are often expressed in discrete format such as a set of keywords sometimes augmented with if-then rules, it is difficult to express ambiguous interests, which users often want to do. We propose a technique that uses user behavior monitoring to transparently capture the user’s interest in information, and a technique to use this interest to filter incoming information in a very efficient way. The proposed techniques are shown to perform very well in a field experiment and a series of simulations.

Masahiro Morita, Yoichi Shinoda

Improving Text Retrieval for the Routing Problem using Latent Semantic Indexing

Latent Semantic Indexing (LSI) is a novel approach to information retrieval that attempts to model the underlying structure of term associations by transforming the traditional representation of documents as vectors of weighted term frequencies to a new coordinate space where both documents and terms are represented as linear combinations of underlying semantic factors. In previous research, LSI has produced a small improvement in retrieval performance. In this paper, we apply LSI to the routing task, which operates under the assumption that a sample of relevant and non-relevant documents is available to use in constructing the query. Once again, LSI slightly improves performance. However, when LSI is used in conjunction with statistical classification, there is a dramatic improvement in performance.

David Hull

The Effect of Adding Relevance Information in a Relevance Feedback Environment

The effects of adding information from relevant documents are examined in the TREC routing environment. A modified Rocchio relevance feedback approach is used, with a varying number of relevant documents retrieved by an initial SMART search, and a varying number of terms from those relevant documents used to expand the initial query. Recall-precision evaluation reveals that as the amount of expansion of the query due to adding terms from relevant documents increases, so does the effectiveness. There appears to be a linear relationship between the log of the number of terms added and the recall-precision effectiveness. There also appears to be a linear relationship between the log of the number of known relevant documents and the recall-precision effectiveness.

Chris Buckley, Gerard Salton, James Allan
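
The query expansion step described in the abstract above can be sketched with a plain Rocchio update. The abstract says a modified Rocchio approach was used but does not spell out the modification, so this is the textbook form; `alpha`, `beta`, `gamma` and `max_new_terms` are hypothetical parameters, and vectors are represented as term-to-weight dictionaries.

```python
from collections import defaultdict
from typing import Dict, Sequence

Vector = Dict[str, float]


def rocchio_expand(
    query: Vector,
    relevant: Sequence[Vector],
    nonrelevant: Sequence[Vector],
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 0.5,
    max_new_terms: int = 10,
) -> Vector:
    """Reweight the query and add the best-scoring terms from relevant documents."""
    feedback: Dict[str, float] = defaultdict(float)
    # Add the centroid of relevant documents, subtract that of non-relevant ones.
    for docs, factor in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                feedback[term] += factor * weight / len(docs)
    # Original query terms are always kept and adjusted by the feedback.
    expanded = {t: alpha * w + feedback.pop(t, 0.0) for t, w in query.items()}
    # Only the max_new_terms highest-weighted new terms are added.
    candidates = sorted(((w, t) for t, w in feedback.items() if w > 0), reverse=True)
    for weight, term in candidates[:max_new_terms]:
        expanded[term] = weight
    return {t: w for t, w in expanded.items() if w > 0}
```

Varying `max_new_terms` and the number of vectors in `relevant` corresponds to the two quantities the abstract reports varying.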

Passage Retrieval


Passage-Level Evidence in Document Retrieval

The increasing lengths of documents in full-text collections encourage renewed interest in the ranking and retrieval of document passages. Past research showed that evidence from passages can improve retrieval results, but it also raised questions about how passages are defined, how they can be ranked efficiently, and what is their proper role in long, structured documents. This paper reports on experiments with passages in INQUERY, a probabilistic information retrieval system. Experiments were conducted with passages based on paragraphs, and with passages based on text windows of various sizes. Experimental results are given for three homogeneous and two heterogeneous document collections, ranging in size from three megabytes to two gigabytes.

James P. Callan
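
Passages based on text windows, as mentioned in the abstract above, can be illustrated with a short sketch. The half-window overlap and the use of a simple query-term count as the passage score are assumptions made for the example; INQUERY's actual belief computation is not reproduced here.

```python
from typing import Iterable, List, Optional, Sequence


def text_windows(
    tokens: Sequence[str], size: int, step: Optional[int] = None
) -> List[List[str]]:
    """Split tokens into fixed-size windows; by default each overlaps the next by half."""
    if size <= 0:
        raise ValueError("size must be positive")
    step = step or max(1, size // 2)
    if len(tokens) <= size:
        return [list(tokens)]
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return windows


def best_passage_score(query_terms: Iterable[str], tokens: Sequence[str], size: int) -> int:
    """Score a document by its best window: the largest count of query-term occurrences."""
    wanted = set(query_terms)
    return max(sum(1 for t in window if t in wanted) for window in text_windows(tokens, size))
```

Overlapping windows avoid the case where a cluster of query terms is split across a window boundary and so scores poorly in both halves.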

Effective Retrieval of Structured Documents

Information systems usually retrieve whole documents as answers to queries. However, it may in some circumstances be more appropriate to retrieve parts of documents. We consider formulas for retrieving whole documents and parts of documents from a large structured document collection. We consider what information is needed to retrieve effectively and show that knowledge of the structure of documents can lead to improved retrieval performance.

Ross Wilkinson

Document and Passage Retrieval Based on Hidden Markov Models

Introduced is a new approach to Information Retrieval developed on the basis of Hidden Markov Models (HMMs). HMMs are shown to provide a mathematically sound framework for retrieving documents: both documents with predefined boundaries and entities of information that are of arbitrary lengths and formats (passage retrieval). Our retrieval model is shown to encompass promising capabilities: First, the position of occurrences of indexing features can be used for indexing. Positional information is essential, for instance, when considering phrases, negation, and the proximity of features. Second, from training collections we can derive automatically optimal weights for arbitrary features. Third, a query dependent structure can be determined for every document by segmenting the documents into passages that are either relevant or irrelevant to the query. The theoretical analysis of our retrieval model is complemented by the results of preliminary experiments.

Elke Mittendorf, Peter Schäuble



Synthetic Workload Performance Analysis of Incremental Updates

Declining disk and CPU costs have kindled a renewed interest in efficient document indexing techniques. In this paper, the problem of incremental updates of inverted lists is addressed using a dual-structure index data structure that dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. The behavior of this index is studied with the use of a synthetically generated document collection and a simulation model of the algorithm. The index structure is shown to support rapid insertion of documents and fast queries, and to scale well to large document collections and many disks.

Kurt Shoens, Anthony Tomasic, Hector Garcia-Molina
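
The idea of keeping short and long inverted lists in separate structures can be sketched in a few lines of Python. This is a toy in-memory illustration, not the paper's on-disk design: the class name `DualStructureIndex`, the fixed length `threshold` that triggers promotion, and the use of two dictionaries in place of shared buckets and dedicated long-list storage are all assumptions.

```python
class DualStructureIndex:
    """Toy inverted index that separates short and long posting lists."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # hypothetical promotion threshold
        self.short = {}  # term -> postings; stands in for shared short-list storage
        self.long = {}   # term -> postings; stands in for dedicated long-list storage

    def add_document(self, doc_id, terms):
        """Incrementally index one document, promoting lists that grow long."""
        for term in set(terms):
            if term in self.long:
                self.long[term].append(doc_id)
                continue
            postings = self.short.setdefault(term, [])
            postings.append(doc_id)
            # Move a list to the long structure once it exceeds the threshold.
            if len(postings) > self.threshold:
                self.long[term] = self.short.pop(term)

    def postings(self, term):
        """Return the posting list for a term from whichever structure holds it."""
        if term in self.long:
            return list(self.long[term])
        return list(self.short.get(term, []))
```

Because each list lives in exactly one structure at a time, updates and lookups can be tuned separately for the many rare terms and the few frequent ones.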

Document Filtering for Fast Ranking

Ranking techniques are effective for finding answers in document collections but the cost of evaluation of ranked queries can be unacceptably high. We propose an evaluation technique that reduces both main memory usage and query evaluation time, based on early recognition of which documents are likely to be highly ranked. Our experiments show that, for our test data, the proposed technique evaluates queries in 20% of the time and 2% of the memory taken by the standard inverted file implementation, without degradation in retrieval effectiveness.

Michael Persin

Adapting a Full-text Information Retrieval System to the Computer Troubleshooting Domain

There has been much research in full-text information retrieval on automated and semi-automated methods of query expansion to improve the effectiveness of user queries. In this paper we consider the challenges of tuning an IR system to the domain of computer troubleshooting, where user queries tend to be very short and natural language query terms are intermixed with terminology from a variety of technical sublanguages. A number of heuristic techniques for domain knowledge acquisition are described in which the complementary contributions of query log data and corpus analysis are exploited. We discuss the implications of sublanguage domain tuning for run-time query expansion tools and document indexing, arguing that the conventional devices for more purely “natural language” domains may be inadequate.

Peter G. Anick

Panel Sessions


Panel: Integration of Information Retrieval and Database Systems

Bruce W. Croft, C. J. van Rijsbergen

Panel: Evaluating Interactive Retrieval Systems

Most current information retrieval systems are highly interactive. Users ask queries, get immediate feedback, refine their queries, and so on. Methods for evaluating these dynamic systems have not kept pace with the rapid advances in system design. It is no longer enough to use the standard precision-recall measures to evaluate and to improve interactive retrieval systems. There is often no single final query to evaluate, with useful information being gathered from many different queries along the way. In addition, interfaces play a critical role in building effective retrieval systems. The best retrieval algorithm can be rendered functionally useless if the interface to it is unusable. Conversely, of course, the spiffiest new interface is not worth much without a good retrieval engine behind it. It would be easy if one could study interfaces and retrieval engines separately and take the best of both worlds. Unfortunately, there are important interactions that cannot be evaluated by studying components in isolation: for example, how do you incorporate ranking or relevance feedback for a Boolean retrieval engine, or how do you highlight matching terms if complex syntactic and semantic processing of queries is used? The design of effective interactive retrieval environments will require careful attention to the larger system of human, interface, and retrieval engine.

Nicholas Belkin, Christine L. Borgman, Susan Dumais, Micheline Hancock-Beaulieu


Further Information