
2003 | Book

Advances in Information Retrieval

25th European Conference on IR Research, ECIR 2003, Pisa, Italy, April 14–16, 2003. Proceedings

Edited by: Fabrizio Sebastiani

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The European Conference on Information Retrieval Research, now in its 25th “Silver Jubilee” edition, was initially established by the Information Retrieval Specialist Group of the British Computer Society (BCS-IRSG) under the name “Annual Colloquium on Information Retrieval Research,” and was always held in the United Kingdom until 1997. Since 1998 the location of the colloquium has alternated between the United Kingdom and the rest of Europe, in order to reflect the growing European orientation of the event. For the same reason, in 2001 the event was renamed “European Annual Colloquium on Information Retrieval Research.” Since 2002, the proceedings of the Colloquium have been published by Springer-Verlag in their Lecture Notes in Computer Science series. In 2003 BCS-IRSG decided to rename the event “European Conference on Information Retrieval Research,” in order to reflect what the event had slowly turned into, i.e., a full-blown conference with a European program committee, strong peer reviewing, and a (mostly) European audience. However, ECIR still retains the strong student focus that has characterized the Colloquia since their inception: student fees are kept particularly low, a student travel grant program is available in order to encourage students to attend the conference (and encourage student authors to present their papers personally), and a Best Student Paper Award is assigned (conversely, ECIR has no best paper award).

Table of Contents

Frontmatter

Invited Papers

Document Retrieval: Shallow Data, Deep Theories; Historical Reflections, Potential Directions

This paper reviews the development of statistically-based retrieval. Since the 1950s, statistical techniques have clearly demonstrated their practical worth, and statistical theories their staying power, for document and text retrieval. In the last decade the TREC programme, and the Web, have offered new retrieval challenges to which these methods have successfully risen. They are now one element in the much wider and very productive spread of statistical methods to all areas of information and language processing, in which innovative approaches to modelling data and tasks are being applied.

Karen Spärck Jones
Annotation and Retrieval of Structured Video Documents

Live Logging and Posterity Logging are the two basic applications for video databases. The former aims at providing effective annotation of video in quasi-real time and supports extraction of meaningful clips from the live stream; the latter provides annotation for later reuse of video material and is the prerequisite for retrieval by content from video digital libraries. Both require that information be adequately structured in an interchange format and that annotation be performed, to a great extent, automatically. The video information structure must encompass both low- and intermediate-level video organization and the event relationships that define specific highlights and situations. Analysis of the visual data of the video stream makes it possible to extract hints, identify events, and detect highlights. All of this must be supported by a-priori knowledge of the subject and effective reasoning engines capable of capturing the inherent semantics of the visual events.

Marco Bertini, Alberto Del Bimbo, Walter Nunziati

Papers

Improving the Evaluation of Web Search Systems

Linkage analysis as an aid to Web search has been assumed to be of significant benefit, and we know that it is being implemented by many major search engines. Why, then, have few TREC participants been able to scientifically prove the benefits of linkage analysis over the past three years? In this paper we put forward reasons why disappointing results have been found, and we identify the linkage density requirements a dataset must meet to faithfully support experiments into linkage analysis. We also report a series of linkage-based retrieval experiments on a more densely linked dataset culled from the TREC Web documents.

Cathal Gurrin, Alan F. Smeaton
When Are Links Useful? Experiments in Text Classification

Link analysis methods have become popular for information access tasks, especially information retrieval, where the link information in a document collection is used to complement the traditionally used content information. However, there has been little firm evidence to confirm the utility of link information. We show that link information can be useful when the document collection has a sufficiently high link density and links are of sufficiently high quality. We report experiments on text classification of the Cora and WebKB data sets using Probabilistic Latent Semantic Analysis and Probabilistic Hypertext Induced Topic Selection. Comparison with manually assigned classes shows that link information enhances classification in data with sufficiently high link density, but is detrimental to performance at low link densities or if the quality of the links is degraded. We introduce a new frequency-based method for selecting the most useful citations from a document collection for use in the model.

Michelle Fisher, Richard Everson
Hierarchical Classification of HTML Documents with WebClassII

This paper describes a new method for the classification of an HTML document into a hierarchy of categories. The hierarchy of categories is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated threshold determination for classification scores, and an experimental study on real-world Web documents that can be associated with any node in the hierarchy. Moreover, a new measure for the evaluation of system performance has been introduced in order to compare three different techniques (flat, hierarchical with proper training sets, hierarchical with hierarchical training sets). The method has been implemented in the context of a client-server application, named WebClassII. Results show that for hierarchical techniques it is better to use hierarchical training sets.

Michelangelo Ceci, Donato Malerba
Hierarchical Indexing and Flexible Element Retrieval for Structured Document

As more and more structured documents, such as SGML or XML documents, become available on the Web, there is a growing demand for effective structured document retrieval which exploits both the content and the hierarchical structure of documents and returns document elements of appropriate granularity. Previous work on partial retrieval of structured documents has limited applicability due to the requirement of structured queries and the restriction that the document structure cannot be traversed according to queries. In this paper, we put forward a method for flexible element retrieval which can retrieve relevant document elements of arbitrary granularity against natural language queries. The proposed techniques constitute a novel hierarchical index propagation and pruning mechanism and an algorithm for ranking document elements based on the hierarchical index. The experimental results show that our method significantly outperforms other existing methods. Our method also shows robustness to the long-standing problems of text length normalization and threshold setting in structured document retrieval.

Hang Cui, Ji-Rong Wen, Tat-Seng Chua
Construction of a Test Collection for the Focussed Retrieval of Structured Documents

In this paper, we examine the methodological issues involved in constructing test collections of structured documents and obtaining best entry points for the evaluation of the focussed retrieval of document components. We describe a pilot test of the proposed test collection construction methodology performed on a document collection of Shakespeare plays. In our analysis, we examine the effect of query complexity and type on overall query difficulty, the use of multiple relevance judges for each query, the problem of obtaining exhaustive relevance assessments from participants, and the method of eliciting relevance assessments and best entry points. Our findings indicate that the methodology is indeed feasible in this small-scale context, and merits further investigation.

Gabriella Kazai, Mounia Lalmas, Jane Reid
User Behaviour in the Context of Structured Documents

This paper describes a small-scale experimental study examining user behaviour in the context of structured documents. Two variants of the same interface to support information seeking were implemented, one highlighting relevant objects and one highlighting best entry points (BEPs). BEPs are intended to support users’ natural information seeking behaviour by providing optimal starting points for browsing to relevant objects. Analysis of the results from the comparative study of these two interfaces shows that the BEP interface was strongly preferred to the relevant object interface, and that appropriate usage of BEPs can lead to improved task performance. However, the study also highlighted shortcomings related to the inconsistent nature of BEPs and to BEP interface design.

Karen Finesilver, Jane Reid
Attaining Fast and Successful Searches in E-commerce Environments

Current online stores suffer from a cardinal problem. There are too many products to offer, and customers find themselves lost due to the vast selection. As opposed to traditional stores, there is little or no guidance that helps the customers as they search. In this paper, we propose a new approach for searching in online stores. This approach is based on algorithms commonly used in recommendation systems, but which are rarely used for searches in online stores. We employ this approach for both keyword and browse searches, and present an implementation of this approach. We compared several search guide algorithms experimentally, and the results show that the suggested algorithms are applicable to the domain of online stores.

Raz Lin, Sarit Kraus, Jeffrey Tew
Learning User Similarity and Rating Style for Collaborative Recommendation

Information filtering is becoming ever more important as users are flooded with too much information. Product brokering in e-commerce is a typical example, and systems which can recommend products to their users in a personalized manner have been studied rigorously in recent years. Collaborative filtering is one of the commonly used approaches, where careful choices of the user similarity measure and the rating style representation are required, and yet there is no guarantee of their optimality. In this paper, we propose the use of machine learning techniques to learn the user similarity as well as the rating style. A criterion function measuring the prediction errors is used, and several problem formulations are proposed together with their learning algorithms. We have evaluated our proposed methods using the EachMovie dataset and obtained significant improvements in recommendation accuracy when compared with the standard correlation method.

Lily F. Tian, Kwok-Wai Cheung
Spoken Information Extraction from Italian Broadcast News

Current research on information extraction from spoken documents is mainly focused on the recognition of named entities, such as names of organizations, locations, and persons, within transcripts automatically generated by a speech recognizer. In this work we present research carried out at ITC-irst on named entity recognition in Italian broadcast news. In particular, an original statistical named entity tagger is described which can be trained with relatively few language resources: a seed list of named entities and a large untagged text corpus. Moreover, the paper presents and discusses named entity recognition experiments with case-sensitive automatic transcripts generated by the ITC-irst speech recognizer, training the named entity model with seed lists of different sizes.

Vanessa Sandrini, Marcello Federico
Taming Wild Phrases

In this paper the suitability of different document representations for automatic document classification is compared, investigating a whole range of representations between bag-of-words and bag-of-phrases. We look at some of their statistical properties, and determine for each representation the optimal choice of classification parameters and the effect of Term Selection. Phrases are represented by an abstraction called Head/Modifier pairs. Rather than just throwing phrases and keywords together, we shall start with pure HM pairs and gradually add more keywords to the document representation. We use the classification on keywords as the baseline, which we compare with the contribution of the pure HM pairs to classification accuracy, and the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline. We conclude that even the most careful term selection cannot overcome the differences in Document Frequency between phrases and words, and propose the use of term clustering to make phrases more cooperative.

Cornelis H. A. Koster, Mark Seutter
Stemming and Decompounding for German Text Retrieval

The stemming problem, i.e. finding a common stem for different forms of a term, has been extensively studied for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Rarely do studies on stemming for any language cover more than one or two different approaches. This paper makes a major contribution that transcends its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.

Martin Braschler, Bärbel Ripplinger
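
The abstract above gives no algorithm listing, but the core idea of decompounding can be illustrated with a minimal greedy, dictionary-based splitter. Everything in this sketch (the toy lexicon, the minimum part length, the handling of the German linking "s") is an illustrative assumption, not the authors' actual method:

```python
# Hedged sketch of dictionary-based German decompounding. The toy LEXICON,
# MIN_PART, and the linking-"s" (Fugen-s) handling are illustrative assumptions.

LEXICON = {"bundes", "tag", "wahl", "kampf"}  # toy lexicon of compound parts
MIN_PART = 3                                  # assumed minimum part length

def decompound(word, lexicon=LEXICON):
    """Return the list of lexicon parts `word` splits into, or [word] if none."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head, tail = word[:i], word[i:]
        candidates = [head]
        if head.endswith("s"):                # drop a possible linking "s"
            candidates.append(head[:-1])
        for cand in candidates:
            if cand in lexicon:
                rest = decompound(tail, lexicon)
                if all(part in lexicon for part in rest):
                    return [cand] + rest
    return [word]

print(decompound("Bundestagswahlkampf"))  # -> ['bundes', 'tag', 'wahl', 'kampf']
```

Each recovered part can then be stemmed and indexed alongside (or instead of) the full compound; the paper's carefully designed variants refine exactly this splitting step.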
Question Answering System for Incomplete and Noisy Data
Methods and Measures for Its Evaluation

We present a question answering system that can handle noisy and incomplete natural language data, and methods and measures for the evaluation of question answering systems. Our question answering system is based on the vector space model and linguistic analysis of the natural language data. In the evaluation procedure, we test eight different preprocessing schemes for the data, and come to the conclusion that lemmatization combined with breaking compound words into their constituents gives significantly better results than the baseline. The evaluation process is based on stratified random sampling and bootstrapping. To measure the correctness of an answer, we use partial credits as well as full credits.

Lili Aunimo, Oskari Heinonen, Reeta Kuuskoski, Juha Makkonen, Renaud Petit, Otso Virtanen
Term Proximity Scoring for Keyword-Based Retrieval Systems

This paper suggests the use of proximity measurement in combination with the Okapi probabilistic model. First, using the Okapi system, our investigation was carried out in a distributed retrieval framework to calculate the same relevance score as that achieved by a single centralized index. Second, by applying a term-proximity scoring heuristic to the top documents returned by a keyword-based system, we aim to enhance retrieval performance. Our experiments were conducted using the TREC8, TREC9 and TREC10 test collections, and show that the suggested approach is stable and generally tends to improve retrieval effectiveness, especially among the top-ranked documents.

Yves Rasolofo, Jacques Savoy
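
The abstract does not reproduce the heuristic itself; the sketch below only illustrates the general shape of such a scheme, re-ranking a keyword run by adding a distance-decaying bonus for query term pairs that occur close together. The decay constant and the linear combination are assumptions, not the authors' formula:

```python
from itertools import combinations

def proximity_bonus(positions, decay=5.0):
    """Sum a distance-decaying bonus over all pairs of query terms that
    co-occur in a document. `positions` maps each query term that occurs
    to the (non-empty) list of its token offsets in the document."""
    bonus = 0.0
    for t1, t2 in combinations(positions, 2):
        # distance between the closest pair of occurrences of t1 and t2
        d = min(abs(p1 - p2) for p1 in positions[t1] for p2 in positions[t2])
        bonus += 1.0 / (1.0 + d / decay)
    return bonus

def rerank(scored_docs, doc_positions, alpha=0.8):
    """Re-rank the top documents of a keyword run (e.g. Okapi scores) by
    mixing in the proximity bonus; alpha is an assumed mixing weight."""
    return sorted(
        ((doc, alpha * score + (1 - alpha) * proximity_bonus(doc_positions[doc]))
         for doc, score in scored_docs),
        key=lambda pair: pair[1], reverse=True)
```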
Propositional Logic Representations for Documents and Queries: A Large-Scale Evaluation

Expressive power is a potential source of benefits for Information Retrieval. Indeed, a number of works have traditionally devoted their efforts to defining models able to manage structured documents. Similarly, many researchers have looked at query formulation and proposed different methods to generate structured queries. Nevertheless, few attempts have addressed the combination of both expressive documents and expressive queries and its effects on retrieval performance. This is mostly due to the lack of a coherent and expressive framework in which both documents and queries can be handled in a homogeneous and efficient way. In this work we aim at filling this gap. We test the impact of logical representations for documents and queries in a large-scale evaluation. The experiments show clearly that, under the same conditions, the use of logical representations for both documents and queries leads to significant improvements in retrieval performance. Moreover, the overall performance results make it evident that logic-based approaches can be competitive in the field of Information Retrieval.

David E. Losada, Alvaro Barreiro
From Uncertain Inference to Probability of Relevance for Advanced IR Applications

Uncertain inference is a probabilistic generalisation of the logical view on databases, ranking documents according to the probability that they logically imply the query. For tasks other than ad-hoc retrieval, estimates of the actual probability of relevance are required. In this paper, we investigate mapping functions between these two types of probability. For this purpose, we consider linear and logistic functions. The former have been proposed before, whereas we give a new theoretic justification for the latter. In a series of upper-bound experiments, we compare the goodness of fit of the two models. A second series of experiments investigates the effect on the resulting retrieval quality in the fusion step of distributed retrieval. These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. However, retrieval quality for distributed retrieval (only merging, without resource selection) is only slightly improved by using the logistic function.

Henrik Nottelmann, Norbert Fuhr
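
As a rough illustration of the logistic mapping studied here, the sketch below fits P(relevant | s) = 1 / (1 + exp(-(a + b·s))) to binary relevance judgements by plain gradient descent; the paper's actual fitting procedure and parameterization may well differ:

```python
import math

def fit_logistic(scores, labels, lr=0.1, epochs=2000):
    """Fit P(relevant | s) = 1 / (1 + exp(-(a + b*s))) to binary relevance
    labels by gradient descent (a simple stand-in for a maximum-likelihood
    fit; the paper compares this logistic mapping with a linear one)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * s)))
            ga += (p - y) / n
            gb += (p - y) * s / n
        a -= lr * ga
        b -= lr * gb
    return a, b

a, b = fit_logistic([0.1, 0.4, 0.8, 0.9], [0, 0, 1, 1])
prob = 1.0 / (1.0 + math.exp(-(a + b * 0.7)))  # estimated P(relevance) for score 0.7
```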
Topic Detection and Tracking with Spatio-Temporal Evidence

Topic Detection and Tracking is an event-based information organization task where online news streams are monitored in order to spot new unreported events and link documents with previously detected events. Detection has proven to perform rather poorly with traditional information retrieval approaches. We present an approach that formalizes temporal expressions, augments spatial terms with ontological information, and uses this data in the detection. In addition, instead of using a single term vector as a document representation, we split the terms into four semantic classes and process and weigh the classes separately. The approach is motivated by experiments.

Juha Makkonen, Helena Ahonen-Myka, Marko Salmenkivi
Clustering and Visualization in a Multi-lingual Multi-document Summarization System

Measuring the similarity of words, sentences, and documents is one of the major issues in multi-lingual multi-document summarization. This paper presents five strategies to compute multilingual sentence similarity. The experimental results show that sentence alignment without considering word position or order in a sentence obtains the best performance. In addition, two strategies are proposed for multilingual document clustering. The two-phase strategy (translation after clustering) is better than the one-phase strategy (translation before clustering). Translation deferred to sentence clustering, which reduces the propagation of translation errors, is the most promising. Moreover, three strategies are proposed to tackle sentence clustering. Complete link within a cluster has the best performance; however, subsumption-based clustering has the advantage of lower computational complexity with similar performance. Finally, two visualization models (i.e., focusing and browsing), which take the user's language preference into account, are proposed.

Hsin-Hsi Chen, June-Jei Kuo, Tsei-Chun Su
A Hybrid Relevance-Feedback Approach to Text Retrieval

Relevance feedback (RF) has been an effective query modification approach to improving the performance of information retrieval (IR) by interactively asking a user whether a set of documents is relevant to a given query concept. Conventional RF algorithms either converge slowly or cost the user additional effort in reading irrelevant documents. This paper surveys several RF algorithms and introduces a novel hybrid RF approach using a support vector machine (HRFSVM), which actively selects the uncertain documents as well as the most relevant ones on which to ask users for feedback. It can efficiently rank documents in a natural way for user browsing. We conduct experiments on the Reuters-21578 dataset and track precision as a function of feedback iterations. Experimental results show that HRFSVM significantly outperforms two other RF algorithms.

Zhao Xu, Xiaowei Xu, Kai Yu, Volker Tresp
Experiments with Document Archive Size Detection

The size of a document archive is a very important parameter for resource selection in distributed information retrieval systems. In this paper, we present a method for automatically detecting the size (i.e., number of documents) of a document archive, in case the archive itself does not provide such information. In addition, a method for detecting incremental change of the archive size is also presented, which can be useful for deciding whether a resource description has become obsolete and needs to be regenerated. An experimental evaluation of these methods shows that they provide quite accurate information.

Shengli Wu, Forbes Gibb, Fabio Crestani
Using Kullback-Leibler Distance for Text Categorization

A system that performs text categorization aims to assign appropriate categories from a predefined classification scheme to incoming documents. These assignments might be used for varied purposes such as filtering or retrieval. This paper introduces a new, effective model for text categorization on a large corpus (roughly one million documents). Text categorization is performed using the Kullback-Leibler distance between the probability distribution of the document to classify and the probability distribution of each category. Using the same representation of categories, experiments show a significant improvement when the above-mentioned method is used: the KLD method achieves substantial improvements over a tf.idf-based method.

Brigitte Bigi
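
A minimal sketch of categorization by Kullback-Leibler distance, assuming unigram word distributions for the document and for each category; the smoothing and estimation details of the paper differ from the crude epsilon floor used here:

```python
import math

def kld(p, q, epsilon=1e-9):
    """Kullback-Leibler divergence D(p || q) between two term distributions
    given as {word: probability} dicts; epsilon floors zero probabilities."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), epsilon))
               for w, pw in p.items() if pw > 0)

def classify(doc_dist, category_dists):
    """Assign the category whose unigram distribution is closest to the
    document's distribution in KL divergence."""
    return min(category_dists, key=lambda c: kld(doc_dist, category_dists[c]))

cats = {"sport": {"game": 0.6, "team": 0.4}, "finance": {"stock": 0.7, "team": 0.3}}
print(classify({"game": 0.5, "team": 0.5}, cats))  # -> "sport"
```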
Discretizing Continuous Attributes in AdaBoost for Text Categorization

We focus on two recently proposed algorithms in the family of “boosting”-based learners for automated text classification, AdaBoost.MH and AdaBoost.MHKR. While the former is a realization of the well-known AdaBoost algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the “weighted” representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of AdaBoost-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.

Pio Nardiello, Fabrizio Sebastiani, Alessandro Sperduti
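
A one-level sketch of entropy-based discretization: choose the cut point on a continuous attribute (e.g., a term's tf.idf weight) that minimizes the class entropy of the induced binary split. Recursive, MDL-stopped variants of this step are common in the literature; this sketch only shows the core criterion:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Cut point on a continuous attribute minimizing the weighted class
    entropy of the resulting binary split (one step of entropy-based
    discretization)."""
    pairs = sorted(zip(values, labels))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between equal attribute values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= cut]
        right = [y for v, y in pairs if v > cut]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best

print(best_threshold([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1]))  # -> 0.45 (up to rounding)
```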
Combining Naive Bayes and n-Gram Language Models for Text Classification

We augment the naive Bayes model with an n-gram language model to address two shortcomings of naive Bayes text classifiers. The chain augmented naive Bayes classifiers we propose have two advantages over standard naive Bayes classifiers. First, a chain augmented naive Bayes model relaxes some of the independence assumptions of naive Bayes, allowing a local Markov chain dependence in the observed variables, while still permitting efficient inference and learning. Second, smoothing techniques from statistical language modeling can be used to recover better estimates than the Laplace smoothing techniques usually used in naive Bayes classification. Our experimental results on three real-world data sets show that we achieve substantial improvements over standard naive Bayes classification, while also achieving state-of-the-art performance that competes with the best known methods in these cases.

Fuchun Peng, Dale Schuurmans
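
A minimal sketch of the chain augmented naive Bayes idea: each class holds its own bigram language model, and a document goes to the class whose model gives it the highest probability. Add-one smoothing stands in here for the more refined language-model smoothing techniques the paper advocates:

```python
import math
from collections import Counter, defaultdict

class NGramClassifier:
    """Sketch of a chain augmented naive Bayes classifier: one bigram
    language model per class, with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # class -> bigram counts
        self.unigrams = defaultdict(Counter)  # class -> context counts
        self.priors = Counter()
        self.vocab = set()

    def train(self, docs):
        for tokens, label in docs:
            self.priors[label] += 1
            padded = ["<s>"] + tokens
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[label][(prev, cur)] += 1
                self.unigrams[label][prev] += 1

    def log_prob(self, tokens, label):
        padded = ["<s>"] + tokens
        v = len(self.vocab)
        lp = math.log(self.priors[label] / sum(self.priors.values()))
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[label][(prev, cur)] + 1   # add-one smoothing
            den = self.unigrams[label][prev] + v
            lp += math.log(num / den)
        return lp

    def classify(self, tokens):
        return max(self.priors, key=lambda c: self.log_prob(tokens, c))

clf = NGramClassifier()
clf.train([(["the", "match", "ended"], "sport"), (["shares", "fell"], "finance")])
print(clf.classify(["the", "match"]))  # -> 'sport'
```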
WebDocBall: A Graphical Visualization Tool for Web Search Results

In the Web search process, people often think that the hardest work is done by the search engines or the directories entrusted with finding the Web pages. While this is partially true, a no less important part of the work is done by the user, who has to decide which page is relevant among the huge set of retrieved pages. In this paper we present a graphical visualisation tool aimed at helping users determine the relevance of a Web page with respect to its structure. Such a tool can help the user in the often tedious task of deciding which page is relevant enough to deserve a visit.

Jesús Vegas, Pablo de la Fuente, Fabio Crestani
Relevance feedback for content-based image retrieval: what can three mouse clicks achieve?

We introduce a novel relevance feedback method for content-based image retrieval and demonstrate its effectiveness using a subset of the Corel Gallery photograph collection and five low-level colour descriptors. Relevance information is translated into updated, analytically computed descriptor weights and a new query representation, and thus the system combines movement in both query and weight space. To assess the effectiveness of relevance feedback, we first determine the weight set that is optimal on average for a range of possible queries. The resulting multiple-descriptor retrieval model yields significant performance gains over all the single-descriptor models and provides the benchmark against which we measure the additional improvement through relevance feedback. We model a number of scenarios of user-system interaction that differ with respect to the precise type and the extent of relevance feedback. In all scenarios, relevance feedback leads to a significant improvement of retrieval performance, suggesting that feedback-induced performance gain is a robust phenomenon. Based on a comparison of the different scenarios, we identify optimal interaction models that yield high performance gains at a low operational cost for the user. To support the proposed relevance feedback technique we developed a novel presentation paradigm that allows relevance to be treated as a continuous variable.

Daniel C. Heesch, Stefan Rüger
Query-Based Document Skimming: A User-Centred Evaluation of Relevance Profiling

We present a user-centred, task-oriented, comparative evaluation of two query-based document skimming tools. ProfileSkim bases within-document retrieval on computing a relevance profile for a document and query; FindSkim provides similar functionality to the web browser Find-command. A novel simulated work task was devised, where experiment participants are asked to identify (index) relevant pages of an electronic book, given subjects from the existing book index. This subject index provides the ground truth, against which the indexing results can be compared. Our major hypothesis was confirmed, namely ProfileSkim proved significantly more efficient than FindSkim, as measured by time for task. Moreover, indexing task effectiveness, measured by typical IR measures, demonstrated that ProfileSkim was better than FindSkim in identifying relevant pages, although not significantly so. The experiments confirm the potential of relevance profiling to improve query-based document skimming, which should prove highly beneficial for users trying to identify relevant information within long documents.

David J. Harper, Ivan Koychev, Yixing Sun
Representative Sampling for Text Classification Using Support Vector Machines

In order to reduce human effort, there has been increasing interest in applying active learning to training text classifiers. This paper describes a straightforward active learning heuristic, representative sampling, which explores the clustering structure of ‘uncertain’ documents and identifies representative samples with which to query the user's opinion, for the purpose of speeding up the convergence of Support Vector Machine (SVM) classifiers. Compared with other active learning algorithms, the proposed representative sampling explicitly addresses the problem of selecting more than one unlabeled document. In an empirical study we compared representative sampling both with random sampling and with SVM active learning. The results demonstrate that representative sampling offers excellent learning performance with fewer labeled documents and thus can reduce human effort in text classification tasks.

Zhao Xu, Kai Yu, Volker Tresp, Xiaowei Xu, Jizhi Wang
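
A hedged sketch of the representative sampling idea: take a pool of unlabeled documents closest to the current SVM decision boundary, cluster them, and query one representative per cluster. The batch and pool sizes, the clustering algorithm, and the closest-to-centroid choice are illustrative assumptions (scikit-learn is used for brevity):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def select_batch(clf: SVC, X_unlabeled: np.ndarray, batch_size: int = 5,
                 pool_size: int = 50):
    """Pick `batch_size` unlabeled documents for the user to label."""
    # 1. Pool of most "uncertain" documents: smallest |distance to hyperplane|.
    margins = np.abs(clf.decision_function(X_unlabeled))
    pool = np.argsort(margins)[:pool_size]
    # 2. Cluster the pool and take the document nearest each cluster centroid.
    km = KMeans(n_clusters=batch_size, n_init=10).fit(X_unlabeled[pool])
    picks = []
    for c in range(batch_size):
        members = pool[km.labels_ == c]
        if len(members) == 0:
            continue  # guard against an empty cluster
        dists = np.linalg.norm(
            X_unlabeled[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks  # indices into X_unlabeled to present for labeling
```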
Chinese Text Categorization Based on the Binary Weighting Model with Non-binary Smoothing

In Text Categorization (TC) based on the vector space model, feature weighting is vital for categorization effectiveness. Various non-binary weighting schemes are widely used for this purpose. Emphasizing the category discrimination capability of features, this paper first puts forward a new weighting scheme, TF*IDF*IG. Given that refined statistics are more likely to run into the sparse-data problem, we then re-evaluate the role of the Binary Weighting Model (BWM) in TC. As a consequence, a novel approach named the Binary Weighting Model with Non-Binary Smoothing (BWM-NBS) is proposed in order to overcome the drawback of BWM. A TC system for Chinese texts using words as features is implemented. Experiments on a large-scale Chinese document collection with 71,674 texts show that the F1 metric of categorization performance of BWM-NBS reaches 94.9% in the best case, which is 26.4% higher than that of TF*IDF, 19.1% higher than that of TF*IDF*IG, and 5.8% higher than that of BWM under the same conditions. Moreover, BWM-NBS exhibits strong stability in categorization performance.

Xue Dejun, Sun Maosong
A Study on Optimal Parameter Tuning for Rocchio Text Classifier

The current trend in operational text categorization is the design of fast classification tools. Several studies on improving the accuracy of fast but less accurate classifiers have recently been carried out. In particular, enhanced versions of the Rocchio text classifier, characterized by high performance, have been proposed. However, even in these extended formulations the problem of tuning the classifier's parameters is still neglected. In this paper, a study on the parameters of the Rocchio text classifier has been carried out to achieve its maximal accuracy. The result is a model for the automatic selection of parameters. Its main feature is to bound the search space so that optimal parameters can be selected quickly. The space is bounded by giving a feature selection interpretation of the Rocchio parameters. The benefit of the approach has been assessed via extensive cross-evaluation over three corpora in two languages. Comparative analysis shows that the performance achieved is relatively close to that of the best TC models (e.g., Support Vector Machines).

Alessandro Moschitti
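
For reference, the classic Rocchio profile with the β and γ control parameters whose tuning the paper studies; the numeric values below are placeholders, and choosing them automatically is precisely the paper's contribution:

```python
from collections import defaultdict

def rocchio_profile(pos_docs, neg_docs, beta=16.0, gamma=4.0):
    """Rocchio category profile: beta weighs the centroid of positive
    training documents, gamma the centroid of negative ones. Documents are
    {term: weight} dicts; beta/gamma here are placeholder values."""
    profile = defaultdict(float)
    for docs, sign, weight in ((pos_docs, 1.0, beta), (neg_docs, -1.0, gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, w in doc.items():
                profile[term] += sign * weight * w / len(docs)
    # Negative weights are clipped to zero, as in the standard Rocchio classifier.
    return {t: w for t, w in profile.items() if w > 0}

def score(doc, profile):
    """Dot product between a weighted document vector and the profile."""
    return sum(w * profile.get(t, 0.0) for t, w in doc.items())
```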
Optimization of Restricted Searches in Web Directories Using Hybrid Data Structures

The need for efficient tools to manage, retrieve, and filter information on the WWW is clear. Web directories are taxonomies for the classification of Web documents. This kind of information retrieval system presents a specific type of search where the document collection is restricted to one area of the category graph. This paper introduces a specific data architecture for Web directories that improves the performance of restricted searches. That architecture is based on a hybrid data structure composed of an inverted file with multiple embedded signature files. Two variants are presented: a hybrid architecture with total information and one with partial information. This architecture has been analyzed by implementing both variants and comparing them with a basic model. The performance of restricted queries was clearly improved, especially by the hybrid model with partial information, which yielded a positive response under any load of the search system.

Fidel Cacheda, Victor Carneiro, Carmen Guerrero, Angel Viña
Similarity Join in Metric Spaces

Similarity join in distance spaces constrained by the metric postulates is the necessary complement of the better-known similarity range and nearest neighbor search primitives. However, the quadratic computational complexity of similarity joins prevents their application to large data collections. We first study the underlying principles of such joins and suggest three categories of implementation strategies based on filtering, partitioning, or similarity range searching. Then we study an application of the D-index to implement the most promising alternative of range searching. Though this approach, too, is unable to eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.

Vlastislav Dohnal, Claudio Gennaro, Pasquale Savino, Pavel Zezula
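
The baseline the paper improves on is easy to state: a naive self-join materializes every pair of objects within distance eps, at quadratic cost. The sketch below is that baseline, for orientation only; the paper's filtering, partitioning, and D-index range-search strategies exist to avoid exactly this loop:

```python
def similarity_join(objects, dist, eps):
    """Naive O(n^2) similarity self-join: all pairs within distance eps.
    `dist` must satisfy the metric postulates for the paper's optimizations
    (filtering, partitioning, range searching) to apply."""
    return [(a, b)
            for i, a in enumerate(objects)
            for b in objects[i + 1:]
            if dist(a, b) <= eps]
```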
An Efficient Compression Code for Text Databases

We present a new compression format for natural language texts, allowing both exact and approximate search without decompression. This new code, called End-Tagged Dense Code, has some advantages with respect to other compression techniques with similar features, such as the Tagged Huffman Code of [Moura et al., ACM TOIS 2000]. Our compression method obtains (i) better compression ratios, (ii) a simpler vocabulary representation, and (iii) simpler and faster encoding. At the same time, it retains the most interesting features of the method based on the Tagged Huffman Code, i.e., exact search for words and phrases directly on the compressed text using any known sequential pattern matching algorithm, efficient word-based approximate and extended searches without any decoding, and efficient decompression of arbitrary portions of the text. As a side effect, our analytical results give new upper and lower bounds for the redundancy of d-ary Huffman codes.

Nieves R. Brisaboa, Eva L. Iglesias, Gonzalo Navarro, José R. Paramá
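
The byte-level mechanics of the End-Tagged Dense Code are simple enough to sketch: words are ranked by frequency, and rank i is written as a base-128 number whose final byte is tagged with the high bit, so codewords are self-delimiting and the compressed text can be searched with any byte-oriented pattern matcher. The sketch follows the published scheme, but is our own illustrative rendering:

```python
def etdc_encode(rank):
    """End-Tagged Dense Code for the word of frequency rank `rank`
    (0 = most frequent): base-128 digits, with the high bit set only
    on the last byte, so codeword boundaries are self-delimiting."""
    out = [rank % 128 + 128]          # final byte, tagged with the high bit
    rank = rank // 128 - 1
    while rank >= 0:                  # "dense": every digit combination is used
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))

def etdc_decode(code):
    """Invert etdc_encode for a single codeword."""
    rank, k = 0, 0
    for b in code:
        rank = rank * 128 + (b & 0x7F)
        if b & 0x80:                  # tagged byte ends the codeword
            return rank + sum(128 ** j for j in range(1, k + 1))
        k += 1

assert all(etdc_decode(etdc_encode(r)) == r for r in (0, 127, 128, 16511, 10**6))
```

Because the tag byte can never be confused with a continuation byte, a pattern matcher run over the compressed stream cannot report a false match starting mid-codeword, which is what enables searching without decompression.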

Posters

Compressing Semistructured Text Databases

We describe a compression model for semistructured documents, called Structural Contexts Model, which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate semiadaptive model to compress the text that lies inside each different structure type (e.g., different XML tag). The intuition behind the idea is that the distribution of all the texts that belong to a given structure type should be similar, and different from that of other structure types. We test our idea using a word-based Huffman coding, which is the standard for compressing large natural language textual databases, and show that our compression method obtains significant improvements in compression ratios. We also analyze the possibility that storing separate models may not pay off if the distributions of the different structure types are not different enough, and present a heuristic to merge models with the aim of minimizing the total size of the compressed database. This technique gives an additional improvement over the plain technique. The comparison against existing prototypes shows that our method is a competitive choice for compressed text databases.

Joaquín Adiego, Gonzalo Navarro, Pablo de la Fuente
Vertical Searching in Juridical Digital Libraries

In the world of modern digital libraries, searching for juridical information of interest is a current and relevant problem. We approach this problem from the perspective that a new search mechanism, specialized in the juridical area, will work better than standard solutions. We propose a specialized (or vertical) search mechanism that combines information from a juridical thesaurus with information generated by a standard search mechanism (the classic vector space model), using the framework of a Bayesian belief network. Our vertical search mechanism is evaluated using a reference collection of 552,573 documents. The results show improvements in retrieval performance, suggesting that the study and development of vertical search mechanisms is a promising research direction.

Maria de Lourdes da Silveira, Berthier Ribeiro-Neto, Rodrigo de Freitas Vale, Rodrigo Tôrres Assumpção
Corpus-Based Thesaurus Construction for Image Retrieval in Specialist Domains

This paper explores the use of texts that are related to an image collection, also known as collateral texts, for building thesauri in specialist domains to aid in image retrieval. Corpus linguistic and information extraction methods are used for identifying key terms and semantic relationships in specialist texts that may be used for query expansion purposes. The specialist domain context imposes certain constraints on the language used in the texts, which makes the texts computationally more tractable.

Khurshid Ahmad, Mariam Tariq, Bogdan Vrusias, Chris Handy
Generating Extracts with Genetic Algorithms

This paper describes an application of genetic algorithms to text summarisation. We have built a sentence extraction algorithm that overcomes some of the drawbacks of traditional sentence extractors and takes into consideration different features of the summaries. The fitness function can be easily modified in order to incorporate features such as user modelling and adaptation. The system has been evaluated with standard procedures, and the results obtained are very good.

Enrique Alfonseca, Pilar Rodríguez
The ITC-irst News on Demand Platform

The rapid growth of the Information Society is increasing the demand for technologies enabling access to multimedia data by content. An interesting example is given by broadcasting companies and content providers, which today require effective technologies to support the management of, and access to, their huge audiovisual archives, both for internal use and for public services. The Multilingual Natural Speech Technology lab at ITC-irst has been engaged for several years in developing core technologies suited for audio indexing and retrieval. In this paper, a news-on-demand experimental platform is presented which integrates several technologies in the area of spoken language processing and information retrieval. In particular, an overview of the whole system architecture is given, together with a short description of the state of the art achieved in each single technology.

Nicola Bertoldi, Fabio Brugnara, Mauro Cettolo, Marcello Federico, Diego Giuliani, Erwin Leeuwis, Vanessa Sandrini
Evaluating Peer-to-Peer Networking for Information Retrieval within the Context of Meta-searching

Peer-to-peer (P2P) computing has shown unexpected growth and development in recent years. P2P networking is being applied from B2B enterprise solutions to simpler, everyday file-sharing applications like Gnutella clients. In this paper we investigate the use of the Gnutella P2P protocol for Information Retrieval by means of building and evaluating a general-purpose Web meta-search engine. A similar project, Infrasearch, existed in the past only to demonstrate the usefulness of P2P computing beyond file-sharing. In this work, a Java-based, Gnutella-enabled meta-search engine has been developed and used in order to evaluate the suitability and usefulness of P2P networking in the aforementioned context. Our conclusions concerning time efficiency in different networking topologies are presented. Finally, limitations of this approach as well as future research areas and issues on the subject are discussed.

Iraklis A. Klampanos, James J. Barnes, Joemon M. Jose
Parallel Computing for Term Selection in Routing/Filtering

It has been postulated that a method of selecting terms in either routing or filtering using relevance feedback would be to evaluate every possible combination of terms in a training set and determine which combination yields the best retrieval results. Whilst this is not a realistic proposition because of the enormous size of the search space, some heuristics have been developed on the Okapi system to tackle the problem, which are computationally intensive. This paper describes parallel computing techniques that have been applied to these heuristics to reduce the time it takes to select terms.

Andy MacFarlane, Stephen E. Robertson, Julie A. McCann
A Weighting Scheme for Star-Graphs

A star-graph is a conceptual graph that contains a single relation, with some concepts linked to it. Star-graphs are elementary pieces of information describing combinations of concepts. We use star-graphs as descriptors — or index terms — for image content representation. This allows for relational indexing and the expression of complex user needs, in contrast to classical text retrieval, where simple keywords are generally used as document descriptors. In classical text retrieval, keywords are weighted to give emphasis to good document descriptors and discriminators; the most popular weighting schemes are based on variations of tf.idf. In this paper, we present an extension of tf.idf, introducing a new weighting scheme suited to star-graphs. This weighting scheme is based on a local analysis of the star-graphs indexing a document and a global analysis of star-graphs across the whole collection. We show and discuss some preliminary results evaluating the performance of this weighting scheme applied to image retrieval.

Jean Martinet, Iadh Ounis, Yves Chiaramella, Philippe Mulhem
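
The paper's scheme extends tf.idf with star-graph-specific local and global analyses; as background, a plain tf.idf weighting over star-graph descriptors looks like the sketch below. Representing a star-graph by a hashable canonical string such as "(relation concept1 concept2)" is our illustrative encoding, not the paper's:

```python
import math
from collections import Counter

def weight_star_graphs(doc_graphs, collection_dfs, n_docs):
    """tf.idf-style weights over star-graph descriptors: `doc_graphs` is the
    list of canonical star-graph strings indexing one document (local
    analysis), `collection_dfs` maps each star-graph to the number of
    documents it indexes in the collection (global analysis)."""
    tf = Counter(doc_graphs)
    return {g: (1 + math.log(f)) * math.log(n_docs / collection_dfs[g])
            for g, f in tf.items()}
```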
Phrase-Based Hierarchical Clustering of Web Search Results

The paper addresses the problem of clustering text documents coming from the Web. We apply clustering to support users in interactive browsing through hierarchically organized search results as opposed to the standard ranked-list presentation. We propose a clustering method that is tailored to on-line processing of Web documents and takes into account the time aspect, the particular requirements of clustering texts, and readability of the produced hierarchy. Finally, we present the user interface of an actual system in which the method is applied to the results of a popular search engine.

Irmina Masłowska
Aggregated Feature Retrieval for MPEG-7

In this paper we present an initial study on the use of both high- and low-level MPEG-7 descriptions for video retrieval. A brief survey of current XML indexing techniques shows that an IR-based retrieval method provides a better foundation for retrieval, as it satisfies important retrieval criteria such as content ranking and approximate matching. An aggregation technique for XML document retrieval is adapted to an MPEG-7 indexing structure by assigning semantic meanings to various audio/visual features; this adaptation is presented here.

Jiamin Ye, Alan F. Smeaton
Document Retrieval in the Context of Question Answering

Current question answering systems rely on document retrieval as a means of providing documents which are likely to contain an answer to a user’s question. A question answering system heavily depends on the effectiveness of a retrieval system: If a retrieval system fails to find any relevant documents for a question, further processing steps to extract an answer will inevitably fail, too. In this paper, we compare the effectiveness of some common retrieval techniques with respect to their usefulness for question answering.

Christof Monz
A Study of the Usefulness of Institutions’ Acronyms as Web Queries

Many people in Hungary use the Web to obtain information from public institutions and organizations. Because these users typically do not know the URL of the desired institution's home page, they use a Web search engine to get there. Institutions' names are usually difficult to recall exactly, and thus are not used as queries in search engines. Instead, the acronyms of institutions are used: they are easy to remember and are extensively used in the media and by people in everyday life. This paper studies the usefulness, as search queries, of the acronyms of Hungarian institutions and organisations present on the Web. The study shows that the majority of acronyms are not useful for this purpose. Causes are presented, and possible remedies are suggested. Because the method used in the paper is language independent, it can be used to carry out similar studies in other countries too.

Sándor Dominich, Júlia Góth, Adrienn Skrop
Building a Hierarchy of Events and Topics for Newspaper Digital Libraries

In this paper we propose an incremental hierarchical clustering algorithm for on-line event detection. This algorithm is applied to a set of newspaper articles in order to discover the structure of the topics and events that they describe. At the first level, articles with a high temporal-semantic similarity are clustered together into events. At the next levels of the hierarchy, these events are successively clustered so that composite events and topics can be discovered. The results obtained for the F1-measure and the Detection Cost demonstrate the validity of our algorithm for on-line event detection tasks.

Aurora Pons-Porrata, Rafael Berlanga-Llavori, José Ruiz-Shulcloper
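
A minimal sketch of the first level of such an algorithm: single-pass incremental clustering with a temporal-semantic similarity, here cosine similarity damped by the time gap to the cluster's most recent article. The exponential decay, the window, and the threshold are illustrative assumptions, not the paper's actual measure:

```python
import math

def cosine(a, b):
    """Cosine similarity between two {term: weight} vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def add_article(clusters, article, day, threshold=0.4, window=7.0):
    """Assign `article` (a {term: weight} dict, arriving in time order) to the
    best-matching event cluster, or start a new event if no temporal-semantic
    similarity exceeds the threshold."""
    best, best_sim = None, threshold
    for cl in clusters:
        sim = (cosine(article, cl["centroid"])
               * math.exp(-(day - cl["last_day"]) / window))  # time damping
        if sim > best_sim:
            best, best_sim = cl, sim
    if best is None:
        clusters.append({"centroid": dict(article), "last_day": day})
    else:
        for t, w in article.items():
            best["centroid"][t] = best["centroid"].get(t, 0.0) + w
        best["last_day"] = day
```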
A Machine Learning Approach for the Curation of Biomedical Literature

In the field of the biomedical sciences there exists a vast repository of information located within large quantities of research papers. Very often, researchers need to spend considerable amounts of time reading through entire papers before being able to determine whether or not they should be curated (archived). In this paper, we present an automated text classification system for biomedical papers. The classification is based on whether a given paper contains experimental evidence for the expression of molecular gene products for specified genes. The system performs preprocessing and data cleaning, followed by feature extraction from the raw text. It subsequently classifies the paper using the extracted features with a Naïve Bayes classifier. Our approach has made it possible to classify (and curate) biomedical papers automatically, thus potentially saving considerable time and resources. The system proved to be highly accurate, and won an honourable mention in task 1 of the KDD Cup 2002.

Min Shi, David S. Edwin, Rakesh Menon, Lixiang Shen, Jonathan Y. K. Lim, Han Tong Loh, S. Sathiya Keerthi, Chong Jin Ong
Automatic Construction of Theme Melody Index from Music Database for Fast Content-Based Retrievals

In traditional content-based music information retrieval systems, users may face long response times, since traditional systems mostly perform syntactic matching of the query melody against the whole melodies of the underlying music database. Hence, there has been a growing need for a theme melody index that supports quick retrieval of the music relevant to a user's query melody. In this paper, we suggest an automatic mechanism for constructing a theme melody index from a large music database and show how the theme melody index can be used for content-based music retrieval by implementing a prototype system.

Chang-Hwan Shin, Kyong-I Ku, KiChang Kim, Yoo-Sung Kim
A Personalized Information Search Process Based on Dialoguing Agents and User Profiling

The amount of information available on the Web, as well as the number of e-businesses and Web shoppers, is growing exponentially. Customers have to spend a lot of time browsing the net in order to find relevant information. One way to overcome this problem is to use dialoguing agents that exploit user profiles to generate personal recommendations. This paper presents a system, designed according to this approach, that adopts a query refinement mechanism to improve the search process of an Internet commerce web site.

Giovanni Semeraro, Marco Degemmis, Pasquale Lops, Ulrich Thiel, Marcello L’Abbate
Backmatter
Metadata
Title
Advances in Information Retrieval
Edited by
Fabrizio Sebastiani
Copyright year
2003
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-36618-8
Print ISBN
978-3-540-01274-0
DOI
https://doi.org/10.1007/3-540-36618-0