
2007 | Book

String Processing and Information Retrieval

14th International Symposium, SPIRE 2007, Santiago, Chile, October 29–31, 2007, Proceedings

Edited by: Nivio Ziviani, Ricardo Baeza-Yates

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter
A Chaining Algorithm for Mapping cDNA Sequences to Multiple Genomic Sequences
Abstract
Given a set of matches between a cDNA sequence and multiple genomic sequences, we present a subquadratic chaining algorithm for computing an optimal chain of colinear matches, while allowing overlaps between the matches. Our algorithm improves upon the quadratic graph-based solution, and extends previous algorithms, which are limited to matches between a cDNA sequence and a single genomic sequence. The matches of the resulting chain serve as anchors for computing a multiple alignment between the cDNA and the given sequences.
Mohamed Abouelhoda
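To make the chaining setup concrete, here is a minimal sketch of the quadratic baseline the abstract improves upon, under illustrative assumptions: matches are given as (start, end, weight) intervals on a single axis and are chained strictly without overlaps (handling overlaps and multiple genomic sequences is the paper's contribution).

```python
def chain_quadratic(matches):
    """Best total weight of a colinear chain of matches, O(n^2).

    matches: list of (start, end, weight) tuples; the interval fields
    are an illustrative assumption, not the paper's exact input format.
    """
    matches = sorted(matches, key=lambda m: m[1])   # sort by end position
    best = [m[2] for m in matches]                  # best chain ending at match i
    for i, (s_i, _, w_i) in enumerate(matches):
        for j in range(i):
            if matches[j][1] <= s_i:                # j ends before i starts
                best[i] = max(best[i], best[j] + w_i)
    return max(best, default=0.0)

print(chain_quadratic([(0, 4, 3.0), (2, 6, 2.0), (5, 9, 4.0)]))  # 7.0
```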
Edge-Guided Natural Language Text Compression
Abstract
We describe a novel compression technique for natural language text collections which takes advantage of the information provided by edges when a graph is used to model the text. This technique is called edge-guided compression. We propose an algorithm that transforms the text according to the edge-guided technique in conjunction with the spaceless words transformation. The result of these transformations is a PPM-friendly byte stream, which is then encoded with a PPM-family encoder. A comparison with state-of-the-art compressors shows that our proposal is a competitive choice for medium and large natural language text collections.
Joaquín Adiego, Miguel A. Martínez-Prieto, Pablo de la Fuente
Local Transpositions in Alignment of Polyphonic Musical Sequences
Abstract
Music retrieval systems based on melodic similarity consider sequences of notes. Adaptations of edit-based algorithms, mainly applied in bioinformatics, to the musical domain have led to promising results. However, new problems arise when considering polyphonic music. Existing representations of notes do not allow retrieval systems to be transposition invariant. In this article, we propose a new dynamic programming algorithm that takes multiple local transpositions into account. Experiments with MIREX collections have been performed to evaluate the improvements induced by this algorithm. The results clearly show its contribution: it is confirmed as the most accurate way to make an alignment-based music retrieval system transposition invariant.
Julien Allali, Pascal Ferraro, Pierre Hanna, Costas Iliopoulos
Efficient Computations of ℓ1 and ℓ∞ Rearrangement Distances
Abstract
Recently, a new pattern matching paradigm was proposed: pattern matching with address errors. In this paradigm, approximate string matching problems are studied in which the content is unaltered and only the locations of the different entries may change. Specifically, a broad class of problems in this new paradigm was defined – the class of rearrangement errors. In this type of error the pattern is transformed through a sequence of rearrangement operations, each with an associated cost. The natural ℓ1 and ℓ2 rearrangement systems were considered. A variant of the ℓ1 rearrangement distance problem, where the pattern is a general string that may have repeating symbols, seems more difficult; the best algorithm presented for this general case is O(nm). In this paper, we show that even for general strings the problem can be approximated in linear time! This paper also considers another natural rearrangement system – the ℓ∞ rearrangement distance. For this new rearrangement system we provide efficient exact solutions for different variants of the problem, as well as a faster approximation.
Amihood Amir, Yonatan Aumann, Piotr Indyk, Avivit Levy, Ely Porat
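To make the ℓ1 cost concrete, here is a sketch under simplifying assumptions stated explicitly: the pattern is a permutation of the text window, and equal symbols are paired in order of occurrence (which a standard exchange argument suggests is optimal for ℓ1). Handling general strings with repeats efficiently is exactly the paper's subject.

```python
from collections import defaultdict

def l1_rearrangement(pattern, window):
    """l1 rearrangement cost, assuming window is a permutation of pattern.

    Pairs the i-th occurrence of each symbol in the pattern with its
    i-th occurrence in the window (assumed optimal for l1) and sums
    the position displacements.
    """
    occ = defaultdict(list)
    for j, c in enumerate(window):
        occ[c].append(j)                 # positions of each symbol in the window
    nxt = defaultdict(int)
    cost = 0
    for i, c in enumerate(pattern):
        j = occ[c][nxt[c]]               # in-order pairing of equal symbols
        nxt[c] += 1
        cost += abs(i - j)
    return cost

print(l1_rearrangement("aab", "aba"))  # 2: two symbols each displaced by 1
```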
Generalized LCS
Abstract
The Longest Common Subsequence (LCS) is a well studied problem, having a wide range of implementations. Its motivation is in comparing strings. It has long been of interest to devise a similar measure for comparing higher dimensional objects, and more complex structures. In this paper we give, what is to our knowledge, the first inherently multi-dimensional definition of LCS. We discuss the Longest Common Substructure of two matrices and the Longest Common Subtree problem for multiple trees including a constrained version. Both problems cannot be solved by a natural extension of the original LCS solution. We investigate the tractability of the above problems. For the first we prove \(\cal{NP}\)-Completeness. For the latter \(\cal{NP}\)-hardness holds for two general unordered trees and for k trees in the constrained LCS.
Amihood Amir, Tzvika Hartman, Oren Kapah, B. Riva Shalom, Dekel Tsur
Exploiting Genre in Focused Crawling
Abstract
In this paper, we propose a novel approach to focused crawling that exploits genre and content-related information present in Web pages to guide the crawling process. The effectiveness, efficiency and scalability of this approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi (genre) of computer science courses (content). The results of these experiments show that focused crawlers constructed according to our approach achieve F1 levels above 92% (an average gain of 178% over traditional focused crawlers), requiring the analysis of no more than 60% of the visited pages in order to find 90% of the relevant pages (an average gain of 82% over traditional focused crawlers).
Guilherme T. de Assis, Alberto H. F. Laender, Marcos André Gonçalves, Altigran S. da Silva
Admission Policies for Caches of Search Engine Results
Abstract
This paper studies the impact of the tail of the query distribution on caches of Web search engines, and proposes a technique for achieving higher hit ratios than traditional heuristics such as LRU. The main problem we solve is that of identifying infrequent queries, which reduce the hit ratio because caching them rarely leads to hits. To mitigate this problem, we introduce a cache management policy that employs an admission policy to prevent infrequent queries from taking space away from more frequent queries in the cache. The admission policy uses either stateless features, which depend only on the query, or stateful features based on usage information. The proposed management policy is more general than existing policies for caching search engine results, and it is fully dynamic. Evaluation results on two different query logs show that our policy achieves higher hit ratios than previously proposed cache management policies.
Ricardo Baeza-Yates, Flavio Junqueira, Vassilis Plachouras, Hans Friedrich Witschel
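A minimal sketch of admission-controlled result caching, using an assumed stateless feature (query length as a cheap proxy for infrequency); the actual features and policies evaluated in the paper are more elaborate.

```python
from collections import OrderedDict

class AdmissionLRUCache:
    """LRU result cache with an admission filter (illustrative sketch)."""

    def __init__(self, capacity, max_terms=3):
        self.capacity = capacity
        self.max_terms = max_terms   # admit only short, likely-frequent queries
        self.cache = OrderedDict()

    def admit(self, query):
        # Stateless feature: long queries are assumed infrequent and rejected.
        return len(query.split()) <= self.max_terms

    def get(self, query):
        if query in self.cache:
            self.cache.move_to_end(query)   # refresh LRU position
            return self.cache[query]
        return None

    def put(self, query, results):
        if not self.admit(query):
            return                          # rejected: do not evict frequent entries
        self.cache[query] = results
        self.cache.move_to_end(query)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
```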
A Pocket Guide to Web History
Abstract
Web archives like the Internet Archive preserve the evolutionary history of large portions of the Web. Access to them, however, is still via rather limited interfaces – a search functionality is often missing or ignores the time axis. Time-travel search alleviates this shortcoming by enriching keyword queries with a time-context of interest. In order to be effective, time-travel queries require historical PageRank scores. In this paper, we address this requirement and propose rank synopses as a novel structure to compactly represent and reconstruct historical PageRank scores. Rank synopses can reconstruct the PageRank score of a web page as of any point during its lifetime, even in the absence of a snapshot of the Web as of that time. We further devise a normalization scheme for PageRank scores to make them comparable across different graphs. Through a comprehensive evaluation over different datasets, we demonstrate the accuracy and space-economy of the proposed methods.
Klaus Berberich, Srikanta Bedathur, Gerhard Weikum
Jump-Matching with Errors
Abstract
Two equal-length integer-valued strings jump-match if each pair of corresponding (locationwise) elements differs by the same value d. In jump-matching, one seeks all text substrings which jump-match the pattern. Strings approximately jump-match if all elements differ by the same value aside from at most k of them, where k is predefined. In approximate jump-matching, one seeks the text substrings which approximately jump-match the pattern.
We present innovative, efficient deterministic and randomized algorithms to solve the approximate jump-matching problem.
Ayelet Butman, Noa Lewenstein, Benny Porat, Ely Porat
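For concreteness, here is a naive O(nm) sketch of the approximate jump-matching test defined above; the paper's algorithms are far more efficient.

```python
from collections import Counter

def approx_jump_matches(text, pattern, k):
    """Report windows where all but at most k positions share one jump value d."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        diffs = Counter(text[i + j] - pattern[j] for j in range(m))
        _, count = diffs.most_common(1)[0]   # best single jump value d
        if m - count <= k:                   # at most k deviating positions
            hits.append(i)
    return hits

print(approx_jump_matches([3, 4, 5, 9, 6], [1, 2, 3, 4], k=1))  # [0]
```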
Estimating Number of Citations Using Author Reputation
Abstract
We study the problem of predicting the popularity of items in a dynamic environment in which authors continuously post new items and provide feedback on existing items. This problem can be applied to predicting the popularity of blog posts, ranking photographs in a photo-sharing system, or predicting the citations of a scientific article using author information and monitoring the items of interest for a short period of time after their creation. As a case study, we show how to estimate the number of citations of an academic paper using information about past articles written by the same author(s). If we use only the citation information over a short period of time, we obtain a predicted value that has a correlation of r = 0.57 with the actual value. This is our baseline prediction. Our best-performing system improves that prediction by adding features extracted from the past publishing history of the authors, increasing the correlation between the actual and the predicted values to r = 0.81.
Carlos Castillo, Debora Donato, Aristides Gionis
A Fast and Compact Web Graph Representation
Abstract
Compressed graph representation has become an attractive research topic because of its applications to the manipulation of huge Web graphs in main memory. By far the best current result is the technique by Boldi and Vigna, which takes advantage of several particular properties of Web graphs. In this paper we show that the same properties can be exploited with a different and elegant technique, built on Re-Pair compression, which achieves about the same space but much faster navigation of the graph. Moreover, the technique has the potential to adapt well to secondary memory. In addition, we introduce an approximate Re-Pair version that works efficiently with limited main memory.
Francisco Claude, Gonzalo Navarro
A Filtering Algorithm for k-Mismatch with Don’t Cares
Abstract
We present a filtering based algorithm for the k-mismatch pattern matching problem with don’t cares. Given a text t of length n and a pattern p of length m with don’t care symbols in either p or t and a bound k, our algorithm finds all the places where the pattern matches the text with at most k mismatches. The algorithm is deterministic and runs in \(O(nm^{1/3}k^{1/3}\log^{2/3} m)\) time. The locations of the mismatches at each alignment are also given at no extra cost.
Raphaël Clifford, Ely Porat
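As a baseline for the problem statement, a naive O(nm) sketch that reports every alignment with at most k mismatches, treating '?' as an assumed don't-care symbol; the mismatch locations come out as a by-product, as in the paper.

```python
def k_mismatch_wildcards(text, pattern, k, wild='?'):
    """All alignments of pattern in text with at most k mismatches.

    '?' is an assumed don't-care symbol that matches any character.
    Returns (position, mismatch_locations) pairs.
    """
    m = len(pattern)
    out = []
    for i in range(len(text) - m + 1):
        mism = [j for j in range(m)
                if text[i + j] != wild and pattern[j] != wild
                and text[i + j] != pattern[j]]
        if len(mism) <= k:
            out.append((i, mism))
    return out

print(k_mismatch_wildcards("abcabd", "a?c", k=1))  # [(0, []), (3, [2])]
```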
Compact Set Representation for Information Retrieval
Abstract
Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store the lists effectively, in a compressed form, and the desire to carry out intersection operations efficiently, using non-sequential processing modes. In this paper we evaluate intersection algorithms on compressed sets, comparing them to the best non-sequential array-based intersection algorithms. By adding a simple, low-cost, auxiliary index, we show that compressed storage need not hinder efficient and high-speed intersection operations.
J. Shane Culpepper, Alistair Moffat
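The kind of non-sequential, array-based intersection the paper benchmarks against can be sketched with galloping (doubling) search; the details below are illustrative assumptions rather than the paper's exact algorithms.

```python
from bisect import bisect_left

def gallop_search(lst, target, lo):
    """Smallest index >= lo with lst[index] >= target, via doubling steps."""
    step, hi = 1, lo
    while hi < len(lst) and lst[hi] < target:
        lo, hi, step = hi + 1, hi + step, step * 2
    return bisect_left(lst, target, lo, min(hi, len(lst)))

def intersect(a, b):
    """Intersect two sorted docid lists, galloping through the longer one."""
    if len(a) > len(b):
        a, b = b, a
    out, pos = [], 0
    for doc in a:
        pos = gallop_search(b, doc, pos)
        if pos < len(b) and b[pos] == doc:
            out.append(doc)
    return out

print(intersect([3, 9, 12], [1, 3, 5, 9, 10, 12, 40]))  # [3, 9, 12]
```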
Approximate Swap and Mismatch Edit Distance
Abstract
No known algorithm solves the general case of the approximate edit distance problem, where the edit operations are insertion, deletion, mismatch, and swap, in time o(nm), where n is the length of the text and m is the length of the pattern.
In the effort to study this problem, the edit operations were analyzed independently. Karloff [10] showed an algorithm that approximates the edit distance problem with only the mismatch operation in time \(O(\frac{1}{\epsilon^2}n\log^3 m)\). Amir et al. [3] showed that if the only edit operations allowed are swap and mismatch, then the exact edit distance problem can be solved in time \(O(n\sqrt{m} \log m)\).
In this paper, we discuss the problem of approximate edit distance with swap and mismatch. We show a randomized \(O(\frac{1}{\epsilon^3}n \log n \log^3 m)\) time algorithm for the problem. The algorithm guarantees an approximation factor of (1 + ε) with probability of at least \(1-\frac{1}{n}\).
Yair Dombb, Ohad Lipsky, Benny Porat, Ely Porat, Asaf Tsur
Approximating Constrained LCS
Abstract
The problem of finding the longest common subsequence (LCS) of two given strings A and B is a well-studied problem. The constrained longest common subsequence (C-LCS) of three strings A, B and C is the longest common subsequence of A and B that contains C as a subsequence. The fastest algorithm solving the C-LCS problem has a time complexity of O(mnk), where m, n and k are the lengths of A, B and C respectively. We propose to consider the approximate versions of the LCS and the constrained LCS. For LCS we propose a simple linear time approximation algorithm that yields an approximation ratio of \(\frac{1}{|\Sigma|}\). For C-LCS we obtain the first two approximation algorithms. Our first algorithm has an approximation factor of \(\frac{1}{\sqrt{\min(m,n)}}\) with an O(mn) running time, while the second algorithm yields a \(\frac{1}{\sqrt{\min(m,n)|\Sigma|}}\) approximation factor within a running time of O(m + n).
Zvi Gotthilf, Moshe Lewenstein
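A plausible instance of a linear-time 1/|Σ| LCS approximation (the paper's algorithm may differ): restrict attention to a single symbol. By pigeonhole, some symbol accounts for at least a 1/|Σ| fraction of the positions in any optimal LCS, and that many copies of it already form a common subsequence.

```python
from collections import Counter

def lcs_single_symbol_approx(a, b):
    """Length of the best single-symbol common subsequence: >= LCS(a,b)/|Sigma|."""
    ca, cb = Counter(a), Counter(b)
    return max((min(ca[ch], cb[ch]) for ch in ca if ch in cb), default=0)

print(lcs_single_symbol_approx("banana", "ananas"))  # 3 ('aaa'); true LCS is 5
```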
Tuning Approximate Boyer-Moore for Gene Sequences
Abstract
Recently a new variation of approximate Boyer-Moore string matching was presented for the k-mismatch problem. This variation was developed for gene sequences. We further tune this algorithm, gaining speedups in both preprocessing and search times. Our preprocessing has a lower time complexity than the previous algorithm, and our experiments show that our algorithm is over 30% faster than the previous one. We also present two variations of the algorithm for the k-difference problem.
Petri Kalsi, Leena Salmela, Jorma Tarhio
Optimal Self-adjusting Trees for Dynamic String Data in Secondary Storage
Abstract
We present a self-adjusting layout scheme for suffix trees in secondary storage that provides an optimal number of disk accesses for a sequence of string or substring queries. This has been an open problem since Sleator and Tarjan presented their splaying technique to create self-adjusting binary search trees in 1985. In addition to resolving this open problem, our scheme provides two additional advantages: (1) the partitions are readjusted slowly, requiring fewer disk accesses than splaying methods, and (2) the initial state of the layout is balanced, making it useful even when the sequence of queries is not highly skewed. Our method is also applicable to PATRICIA trees, and potentially to other data structures.
Pang Ko, Srinivas Aluru
Indexing a Dictionary for Subset Matching Queries
Abstract
We consider a subset matching variant of the dictionary query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p matches D. The pattern p is said to match D if there is some s ∈ D with |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|.
To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of \(O(nm +|\Sigma|^k n \lg (\min\{n,m\}))\), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer and the second construction uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain \(O(nm+|\Sigma|^k n +|\Sigma|^{k/2} n\lg (\min\{n,m\}))\) preprocessing time and \(O(|p|\lg\lg|\Sigma|+\min\{|p|,\lg(|\Sigma|^kn)\}\lg \lg(|\Sigma|^kn))\) query time by cutting the dictionary strings and constructing two compressed tries.
Our problem is motivated by haplotype inference from a library of genotypes [14,7]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D, as well as gathering statistical information about them, can be used to accelerate various haplotype inference algorithms. In particular, algorithms based on the “pure parsimony criterion” [13,16], greedy heuristics such as “Clark’s rule” [6,18], EM based algorithms [1,11,12,20,26,30], and algorithms for inferring haplotypes from a set of trios [4,27].
Gad M. Landau, Dekel Tsur, Oren Weimann
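The match relation above is easy to state directly in code; the sketch below checks it by a linear scan (the paper's contribution is the compressed-trie index that avoids this scan and answers queries in O(|p|)). Modelling each string location as a Python set is an illustrative choice.

```python
def matches(p, s):
    """p matches s iff |p| = |s| and p[i] is in the character set s[i]."""
    return len(p) == len(s) and all(c in loc for c, loc in zip(p, s))

def query(dictionary, p):
    """Does p match some string in the dictionary? (naive O(nm) scan)"""
    return any(matches(p, s) for s in dictionary)

D = [[{'A'}, {'C', 'T'}, {'G'}],   # one string with k = 1 wide location
     [{'A'}, {'A'}, {'A'}]]
print(query(D, "ATG"))   # True
print(query(D, "AGG"))   # False
```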
Extending Weighting Models with a Term Quality Measure
Abstract
Weighting models use lexical statistics, such as term frequencies, to derive term weights, which are used to estimate the relevance of a document to a query. Apart from the removal of stopwords, there is no other consideration of the quality of the words being ‘weighted’. It is often assumed that term frequency is a good indicator for deciding how relevant a document is to a query. Our intuition is that raw term frequency could be enhanced to better discriminate between terms. To do so, we propose using non-lexical features to predict the ‘quality’ of words before they are weighted for retrieval. Specifically, we show how parts of speech (e.g. nouns, verbs) can help estimate how informative a word generally is, regardless of its relevance to a query/document. Experimental results with two standard TREC collections show that integrating the proposed term quality into two established weighting models enhances retrieval performance, over a baseline that uses the original weighting models, in all cases.
Christina Lioma, Iadh Ounis
Highly Frequent Terms and Sentence Retrieval
Abstract
In this paper we propose a novel sentence retrieval method based on extracting highly frequent terms from top retrieved documents. We compare it against state-of-the-art sentence retrieval techniques, including those based on pseudo-relevance feedback, showing that the approach is robust and competitive. Our results reinforce the idea that top retrieved data is a valuable source for enhancing retrieval systems. This is especially true for short queries, where there are usually few query-sentence matching terms. Moreover, the approach is particularly promising for weak queries. We demonstrate that this novel method is able to significantly improve precision at top ranks when handling poorly specified information needs.
David E. Losada, Ronald T. Fernández
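A rough sketch of the high-level idea, with assumed tokenization and thresholds (the paper's method is more refined): collect the most frequent terms from the top retrieved documents, expand the query with them, and score sentences by overlap.

```python
from collections import Counter

def expand_query(query, top_docs, n_terms=10):
    """Expand a query with the most frequent terms of the top documents."""
    counts = Counter(w for doc in top_docs for w in doc.lower().split())
    frequent = [w for w, _ in counts.most_common(n_terms)]
    return set(query.lower().split()) | set(frequent)

def rank_sentences(sentences, expanded_terms):
    """Rank sentences by overlap with the expanded term set."""
    scored = [(len(expanded_terms & set(s.lower().split())), s)
              for s in sentences]
    return [s for score, s in sorted(scored, reverse=True) if score > 0]

docs = ["retrieval of sentences uses frequent terms",
        "frequent terms from top documents help retrieval"]
q = expand_query("sentence retrieval", docs, n_terms=3)
print(rank_sentences(["frequent terms help", "unrelated text"], q))
# ['frequent terms help']
```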
Implicit Compression Boosting with Applications to Self-indexing
Abstract
Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors’ performance to k-th order entropy. It works by constructing the Burrows-Wheeler transform of the input text, finding an optimal partitioning of the transform, and then compressing each piece using an arbitrary zeroth order compressor. The optimal partitioning has the property that the achieved compression is boosted to k-th order entropy, for any k. The technique has an application to text indexing: essentially, building a wavelet tree (Grossi et al., SODA 2003) for each piece in the partitioning yields a k-th order compressed full-text self-index providing efficient substring searches on the indexed text (Ferragina et al., SPIRE 2004). In this paper, we show that explicit compression boosting with wavelet trees is not necessary; our new analysis reveals that the size of the wavelet tree built for the complete Burrows-Wheeler transformed text is, in essence, the sum of the sizes of those built for the pieces in the optimal partitioning. Hence, the technique provides a way to do compression boosting implicitly, with a trivial linear time algorithm, though fixed to a specific zeroth order compressor (Raman et al., SODA 2002). In addition to these consequences for compression and static full-text self-indexes, the analysis shows that a recent dynamic zeroth order compressed self-index (Mäkinen & Navarro, CPM 2006) in fact occupies space proportional to k-th order entropy.
Veli Mäkinen, Gonzalo Navarro
A Web-Page Usage Prediction Scheme Using Weighted Suffix Trees
Abstract
In this paper we consider the problem of web page usage prediction in a web site by modeling users’ navigation history with weighted suffix trees. This navigation prediction can be exploited either in an on-line recommendation system or in a web-page cache system. The proposed method has the advantage that it demands a constant amount of computational effort per user action and consumes a relatively small amount of extra memory space. These features make the method ideal for an on-line working environment. Finally, we have evaluated the proposed scheme with experiments on various website log files and found that its prediction quality is fairly good, in many cases outperforming existing solutions.
Christos Makris, Yannis Panagis, Evangelos Theodoridis, Athanasios Tsakalidis
Enhancing Educational-Material Retrieval Using Authored-Lesson Metadata
Abstract
Many authors believe that in order to achieve coherence and flexibility at the same time in multimedia-based learning units, it is highly recommendable to structure the different components as a graph. In a lesson graph, educational resources are encapsulated into learning objects (LOs) along with their respective metadata and are interconnected through different kinds of rhetorical and semantic relationships. The LOs of these graphs are stored within repositories, where their metadata are used to ease their retrieval. In this paper we propose to integrate the processes of searching LOs and editing the lesson graph. This new framework extends traditional keyword and metadata search to take advantage of the information stored implicitly in the lesson graph structure, making LO retrieval more effective and the expression of queries more intuitive. The retrieval of the learning material consists of two processes: (1) the user first defines the topological location of a required piece of educational material within the lesson graph, that is, its relationships with other pieces; (2) the user then issues a traditional keyword query, which is processed by an IR system modified to take the graph structure into account. Experiments show the advantages of this approach.
Olivier Motelet, Benjamin Piwowarski, Georges Dupret, Jose A. Pino, Nelson Baloian
Approximate String Matching with Lempel-Ziv Compressed Indexes
Abstract
A compressed full-text self-index for a text T is a data structure requiring reduced space and capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, and thus it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has competitive performance and provides a useful space-time tradeoff compared to classical indexes.
Luís M. S. Russo, Gonzalo Navarro, Arlindo L. Oliveira
Algorithms for Weighted Matching
Abstract
We consider the matching of weighted patterns against an unweighted text. We adapt the shift-add algorithm for this problem. We also present an algorithm that enumerates all strings that produce a score higher than a given threshold when aligned against a weighted pattern, and then searches for all these strings using a standard exact multipattern algorithm. We show that both of these approaches are faster than previous algorithms on patterns of moderate length and high significance levels, and that the good performance of the shift-add algorithm extends to lower significance levels.
Leena Salmela, Jorma Tarhio
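A sketch of the enumeration approach described above, assuming the weighted pattern is a position weight matrix (one score per symbol per position): depth-first search with a best-possible-suffix bound lists every string scoring above the threshold; these strings would then be handed to an exact multipattern matcher.

```python
def enumerate_high_scoring(pwm, threshold):
    """All strings whose total score against the weighted pattern exceeds threshold.

    pwm: list of dicts mapping symbol -> score at that position
    (an assumed representation of a weighted pattern).
    """
    # Best achievable score from each position onward, for pruning.
    suffix_max = [0.0] * (len(pwm) + 1)
    for i in range(len(pwm) - 1, -1, -1):
        suffix_max[i] = suffix_max[i + 1] + max(pwm[i].values())

    results = []
    def dfs(i, prefix, score):
        if i == len(pwm):
            results.append(prefix)
            return
        for sym, w in pwm[i].items():
            if score + w + suffix_max[i + 1] > threshold:  # prune hopeless branches
                dfs(i + 1, prefix + sym, score + w)
    dfs(0, "", 0.0)
    return results

pwm = [{'a': 2.0, 'c': 0.5}, {'a': 0.1, 'c': 1.5}]
print(enumerate_high_scoring(pwm, 3.0))  # ['ac']
```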
Efficient Text Proximity Search
Abstract
In addition to purely occurrence-based relevance models, term proximity has frequently been used to enhance the retrieval quality of keyword-oriented retrieval systems. While there has been work on effective scoring functions that incorporate proximity, there has not been much work on algorithms or access methods for their efficient evaluation. This paper presents an efficient evaluation framework, including a proximity scoring function integrated within a top-k query engine for text retrieval. We propose precomputed and materialized index structures that boost performance. The increased retrieval effectiveness and efficiency of our framework are demonstrated through extensive experiments on a very large text benchmark collection. In combination with static index pruning for the proximity lists, our algorithm achieves an improvement of two orders of magnitude over a term-based top-k evaluation, with significantly improved result quality.
Ralf Schenkel, Andreas Broschart, Seungwon Hwang, Martin Theobald, Gerhard Weikum
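Purely as an illustration of proximity-aware scoring (not the paper's exact scoring function), the sketch below rewards close-by occurrences of distinct query terms with an inverse-square distance contribution.

```python
def proximity_score(positions, window=10):
    """Accumulate 1/d^2 for nearby occurrences of distinct query terms.

    positions: dict mapping term -> sorted list of token offsets in the
    document (an assumed input format for this illustration).
    """
    occs = sorted((pos, term) for term, plist in positions.items()
                  for pos in plist)
    score = 0.0
    for (p1, t1), (p2, t2) in zip(occs, occs[1:]):
        d = p2 - p1
        if t1 != t2 and 0 < d <= window:   # distinct terms, close together
            score += 1.0 / (d * d)
    return score

print(proximity_score({'text': [3, 40], 'proximity': [4, 90]}))  # 1.0
```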
Prefix-Shuffled Geometric Suffix Tree
Abstract
Protein structure analysis is one of the most important research issues in the post-genomic era, and faster and more accurate index data structures for such 3-D structures are highly desired for research on proteins. The geometric suffix tree is a sophisticated index structure that enables fast and accurate search on protein 3-D structures. By using it, we can search 3-D structure databases for all the substructures whose RMSDs (root mean square deviations) to a given query 3-D structure are not larger than a given bound. In this paper, we propose a new data structure based on the geometric suffix tree whose query performance is much better than that of the original. We call the modified data structure the prefix-shuffled geometric suffix tree (PSGST for short). According to our experiments, the PSGST outperforms the geometric suffix tree in most cases. The PSGST shows its best performance when the database does not have many substructures similar to the query; queries are sometimes 100 times faster than on the original geometric suffix tree in such cases.
Tetsuo Shibuya
Backmatter
Metadata
Title
String Processing and Information Retrieval
Edited by
Nivio Ziviani
Ricardo Baeza-Yates
Copyright year
2007
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-75530-2
Print ISBN
978-3-540-75529-6
DOI
https://doi.org/10.1007/978-3-540-75530-2
