Skip to main content

2014 | Buch

String Processing and Information Retrieval

21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014. Proceedings

herausgegeben von: Edleno Moura, Maxime Crochemore

Verlag: Springer International Publishing

Buchreihe : Lecture Notes in Computer Science

insite
SUCHEN

Über dieses Buch

This book constitutes the proceedings of the 21st International Symposium on String Processing and Information Retrieval, SPIRE 2014, held in Ouro Preto, Brazil, in October 2014. The 20 full and 6 short papers included in this volume were carefully reviewed and selected from 45 submissions. The papers focus not only on fundamental algorithms in string processing and information retrieval, but address also application areas such as computational biology, Web mining and recommender systems. They are organized in topical sections on compression, indexing, genome and related topics, sequences and strings, search, as well as on mining and recommending.

Inhaltsverzeichnis

Frontmatter

Compression

Strategic Pattern Search in Factor-Compressed Text
Abstract
We consider the problem of pattern-search in compressed text in a context in which: (a) the text is stored as a sequence of factors against a static phrase-book; (b) decoding of factors is from right-to-left; and (c) extraction of each symbol in each factor requires Θ(logσ) time, where σ is the size of the original alphabet. To determine possible alignments given information about decoded characters we introduce two Boyer-Moore-like searching mechanisms, including one that makes use of a suffix array constructed over the pattern. The new mechanisms decode fewer than half the symbols that are required by a sequential left-to-right search such as the Knuth-Morris-Pratt approach, a saving that translates directly into improved execution time. Experiments with a two-level suffix array index structure for 4 GB of English text demonstrate the usefulness of the new techniques.
Simon Gog, Alistair Moffat, Matthias Petri
Relative Lempel-Ziv with Constant-Time Random Access
Abstract
Relative Lempel-Ziv (RLZ) is a variant of LZ77 that can compress well collections of similar genomes while still allowing fast random access to them. In theory, at the cost of using sublinear extra space, accessing an arbitrary character takes constant time. We show that even in practice this works quite well: e.g., we can compress 36 S. cerevisiae genomes from a total of 464 MB to 11 MB and still support random access to them in under 50 nanoseconds per character, even when the accessed substrings are short. Our theoretical contribution is an optimized representation of RLZ’s pointers.
Héctor Ferrada, Travis Gagie, Simon Gog, Simon J. Puglisi
Efficient Compressed Indexing for Approximate Top-k String Retrieval
Abstract
Given a collection of strings (called documents), the top-k document retrieval problem is that of, given a string pattern p, finding the k documents where p appears most often. This is a basic task in most information retrieval scenarios. The best current implementations require 20–30 bits per character (bpc) and k to 4k microseconds per query, or 12–24 bpc and 1–10 milliseconds per query. We introduce a Lempel-Ziv compressed data structure that occupies 5–10 bpc to answer queries in around k microseconds. The drawback is that the answer is approximate, but we show that its quality improves asymptotically with the size of the collection, reaching over 85% of the accumulated term frequency of the real answer already for patterns of length 4–6 on rather small collections, and improving for larger ones.
Héctor Ferrada, Gonzalo Navarro
Grammar Compressed Sequences with Rank/Select Support
Abstract
Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical compression is ineffective. We introduce grammar-based representations for repetitive sequences, which use up to 10% of the space needed by representations based on statistical compression, and support direct access and rank/select operations within tens of microseconds.
Gonzalo Navarro, Alberto Ordóñez

Indexing

Algorithms for Jumbled Indexing, Jumbled Border and Jumbled Square on Run-Length Encoded Strings
Abstract
Jumbled Indexing, the problem of indexing a text for histogram queries, has been of much interest lately. In this paper we consider jumbled indexing for run-length encoded texts. We refute a former conjecture and show an algorithm for general sized alphabets. We also consider Jumbled Borders, the extension of borders to jumbled strings. Borders are the basis for various algorithms. Finally, we consider Jumbled Squares, strings which are of the form \(x\bar{x}\), where \(\bar{x}\) is a jumbling of x. We show efficient algorithms for these problems.
Amihood Amir, Alberto Apostolico, Tirza Hirst, Gad M. Landau, Noa Lewenstein, Liat Rozenberg
Relative FM-Indexes
Abstract
Intuitively, if two strings S 1 and S 2 are sufficiently similar and we already have an FM-index for S 1 then, by storing a little extra information, we should be able to reuse parts of that index in an FM-index for S 2. We formalize this intuition and show that it can lead to significant space savings in practice, as well as to some interesting theoretical problems.
Djamal Belazzougui, Travis Gagie, Simon Gog, Giovanni Manzini, Jouni Sirén
Efficient Indexing and Representation of Web Access Logs
Abstract
We present a space-efficient data structure, based on the Burrows-Wheeler Transform, especially designed to handle web sequence logs, which are needed by web usage mining processes. Our index is able to process a set of operations efficiently, while at the same time maintains the original information in compressed form. Results show that web access logs can be represented using 0.85 to 1.03 times their original (plain) size, while executing most of the operations within a few tens of microseconds.
Francisco Claude, Roberto Konow, Gonzalo Navarro
A Compressed Suffix-Array Strategy for Temporal-Graph Indexing
Abstract
Temporal graphs represent vertexes and binary relations that change over time. In this paper we consider a temporal graph as a set of 4-tuples ( v s , v e , t s , t e ) indicating that an edge from a vertex v s to a vertex v e is active during the time interval [t s , t e ). Representing those tuples involves the challenge of not only saving space but also of efficient query processing. Queries of interest for these graphs are both direct and reverse neighbors constrained by a time instant or a time interval. We show how to adapt a Compressed Suffix Array (CSA) to represent temporal graphs. The proposed structure, called Temporal Graph CSA (TGCSA), was experimentally compared with a compact data structure based on compressed inverted lists, which can be considered as a fair baseline in the state of the art. Our experimental results are promising. TGCSA obtains a good space-time trade-off, owns wider expressive capabilities than other alternatives, obtains reasonable space usage, and it is efficient even when performing the most complex temporal queries.
Nieves R. Brisaboa, Diego Caro, Antonio Fariña, M. Andrea Rodríguez
Succinct Indexes for Reporting Discriminating and Generic Words
Abstract
We consider the problem of indexing a collection \(\cal{D}\) of D strings (documents) of total n characters from an alphabet set of size σ, such that whenever a pattern P (of p characters) and an integer τ ∈ [1, D] comes as a query, we can efficiently report all (i) maximal generic words and (ii) minimal discriminating words as defined below:
  • maximal generic word is a maximal extension of P occurring in at least τ documents..
  • minimal discriminating word is a minimal extension of P occurring in at most τ documents.
These problems were introduced by Kucherov et al. [8], and they proposed linear space indexes occupying O(nlogn) bits with query times O(p + output) and O(p + loglogn + output) for Problem (i) and Problem (ii) respectively. In this paper, we describe succinct indexes of nlogσ + o(nlogσ) + O(n) bits space with near optimal query times i.e., O(p + loglogn + output) for both these problems.
Sudip Biswas, Manish Patil, Rahul Shah, Sharma V. Thankachan
Fast Construction of Wavelet Trees
Abstract
In this paper we describe a fast algorithm that creates a wavelet tree for a sequence of symbols. We show that a wavelet tree can be constructed in \(O(n\lceil{\frac{\log \sigma}{\sqrt{\log n}}}\rceil)\) time where n is the number of symbols and σ is the alphabet size.
J. Ian Munro, Yakov Nekrich, Jeffrey S. Vitter
Order Preserving Prefix Tables
Abstract
In the Order Preserving Pattern Matching (OPPM) problem, we have a text T and a pattern P on an integer alphabet as input. And the goal is to locate a fragment which is order-isomorphic with the pattern. Two sequences over integer alphabet are order-isomorphic if the relative order between any two elements at the same positions in both sequences is the same. In this paper we present an efficient algorithm to construct an interesting and useful data structure, namely, prefix table, from the order preserving point of view.
Md. Mahbubul Hasan, A. S. M. Sohidull Islam, Mohammad Saifur Rahman, M. Sohel Rahman

Genome and Related Topics

Alphabet-Independent Algorithms for Finding Context-Sensitive Repeats in Linear Time
Abstract
The identification of repetitive sequences (repeats) is an essential component of genome sequence analysis, and there are dozens of algorithms that search for exact or approximate repeats. The notions of maximal and supermaximal (exact) repeats have received special attention, and it is possible to simultaneously compute them on index data structures like the suffix tree or the enhanced suffix array. Very recently, this research has been extended in two directions. Gallé and Tealdi devised an alphabet-independent linear-time algorithm that finds all context-diverse repeats (which subsume maximal and supermaximal repeats as special cases), while Taillefer and Miller gave a quadratic-time algorithm that simultaneously computes and classifies maximal, near-supermaximal, and supermaximal repeats. In this paper, we provide new alphabet-independent linear-time algorithms for both tasks.
Enno Ohlebusch, Timo Beller
A 3-Approximation Algorithm for the Multiple Spliced Alignment Problem and Its Application to the Gene Prediction Task
Abstract
The Spliced Alignment Problem is a well-known problem in Bioinformatics with application to the gene prediction task. This problem consists in finding an ordered subset of non-overlapping substrings of a subject sequence g that best fits a target sequence t. In this work we present an approximation algorithm for a variant of the Spliced Alignment Problem, called Multiple Spliced Alignment Problem, that involves more than one target sequence. Under a metric, this algorithm is proved to be a 3-approximation for the problem and its good practical results compare to those obtained by four heuristics already developed for the Multiple Spliced Alignment Problem.
Regina Beretta Mazaro, Leandro Ishi Soares de Lima, Said Sadique Adi
Improved Filters for the Approximate Suffix-Prefix Overlap Problem
Abstract
Computing suffix-prefix overlaps for a large collection of strings is a fundamental building block for the analysis of genomic next-generation sequencing data. The approximate suffix-prefix overlap problem is to find all pairs of strings from a given set such that a prefix of one string is similar to a suffix of the other. Välimäki et al. (Information and Computation, 2012) gave a solution to this problem based on suffix filters. In this work, we propose two improvements to the method of Välimäki et al. that reduce the running time of the computation.
Gregory Kucherov, Dekel Tsur

Sequences and Strings

Sequence Decision Diagrams
Abstract
Compact encoding of finite sets of strings is a classic problem. The manipulation of large sets requires compact data structures that allow for efficient set operations. We define sequence decision diagrams (SeqDDs), which can encode arbitrary finite sets of strings over an alphabet. SeqDDs can be seen as a variant of classic decision diagrams such as BDDs and MDDs where, instead of a fixed number of levels, we simply require that the number of paths and the lengths of these paths be finite. However, the main difference between the two is the target application: while MDDs are suited to store and manipulate large sets of constant-length tuples, SeqDDs can store arbitrary finite languages and, as such, should be studied in relation to finite automata. We do so, examining in particular the size of equivalent representations.
Hind Alhakami, Gianfranco Ciardo, Marek Chrobak
Shortest Unique Queries on Strings
Abstract
Let D be a long input string of n characters (from an alphabet of size up to 2 w , where w is the number of bits in a machine word). Given a substring q of D, a shortest unique query returns a shortest unique substring of D that contains q. We present an optimal structure that consumes O(n) space, can be built in O(n) time, and answers a query in O(1) time. We also extend our techniques to solve several variants of the problem optimally.
Xiaocheng Hu, Jian Pei, Yufei Tao
Online Multiple Palindrome Pattern Matching
Abstract
A palindrome is a string that reads the same forward and backward. We say that two strings of the same length are pal-equivalent if for each possible center they have the same length of the maximal palindrome. Given a text T of length n and a set of patterns P 1,…,P k , we study the online multiple palindrome pattern matching problem that finds all pairs of an index i and a pattern P j such that T[i−|P j | + 1:i] and P j are pal-equivalent. We solve the problem in O(m k M) preprocessing time and O(m k n) query time using O(m k M) space, where M is the sum of all pattern lengths and m k is the longest pattern length.
Hwee Kim, Yo-Sub Han
Indexed Matching Statistics and Shortest Unique Substrings
Abstract
The unidirectional and bidirectional matching statistics between two strings s and t on alphabet Σ, and the shortest unique substrings of a single string t, are the cornerstone of a number of large-scale genome analysis applications, and they encode nontrivial structural properties of s and t. In this paper we compute for the first time the matching statistics between s and t in O((|s| + |t|)log|Σ|) time and in O(|s|log|Σ|) bits of space, circumventing the need for computing the depths of suffix tree nodes that characterized previous approaches. Symmetrically, we compute for the first time the shortest unique substrings of a string t in O(|t|log|Σ|) time and in O(|t|log|Σ|) bits of space. A key component of our methods is an encoding of both the unidirectional and the bidirectional statistics that takes 2|t| + o(|t|) bits of space and that allows constant-time access to every position.
Djamal Belazzougui, Fabio Cunial

Search

I/O-Efficient Dictionary Search with One Edit Error
Abstract
This paper studies the 1-error dictionary search problem in external memory. The input is a set D of strings whose characters are drawn from a constant-size alphabet. Given a string q, a query reports the ids of all strings in D that are within 1 edit distance from q. We give a structure occupying O(n/B) blocks that answers a query in \(O(1 + \frac{m}{wB} + \frac{k}{B})\) I/Os, where n is the total length of all strings in D, m is the length of q, k is the number of ids reported, w is the size of a machine word, and B is the number of words in a block.
Chin-Wan Chung, Yufei Tao, Wei Wang
Online Pattern Matching for String Edit Distance with Moves
Abstract
Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string to the other. Although optimizing EDM is intractable, it has many applications especially in error detections. Edit sensitive parsing (ESP) is an efficient parsing algorithm that guarantees an upper bound of parsing discrepancies between different appearances of the same substrings in a string. ESP can be used for computing an approximate EDM as the L 1 distance between characteristic vectors built by node labels in parsing trees. However, ESP is not applicable to a streaming text data where a whole text is unknown in advance. We present an online ESP (OESP) that enables an online pattern matching for EDM. OESP builds a parse tree for a streaming text and computes the L 1 distance between characteristic vectors in an online manner. For the space-efficient computation of EDM, OESP directly encodes the parse tree into a succinct representation by leveraging the idea behind recent results of a dynamic succinct tree. We experimentally test OESP on the ability to compute EDM in an online manner on benchmark datasets, and we show OESP’s efficiency.
Yoshimasa Takabatake, Yasuo Tabei, Hiroshi Sakamoto
K 2-Treaps: Range Top-k Queries in Compact Space
Abstract
Efficient processing of top-k queries on multidimensional grids is a common requirement in information retrieval and data mining, for example in OLAP cubes. We introduce a data structure, the K 2-treap, that represents grids in compact form and supports efficient prioritized range queries. We compare the K 2-treap with state-of-the-art solutions on synthetic and real-world datasets, showing that it uses 30% of the space of competing solutions while solving queries up to 10 times faster.
Nieves R. Brisaboa, Guillermo de Bernardo, Roberto Konow, Gonzalo Navarro
Performance Improvements for Search Systems Using an Integrated Cache of Lists+Intersections
Abstract
Modern information retrieval systems use several levels of caching to speedup computation by exploiting frequent, recent or costly data used in the past. In this study we propose and evaluate a static cache that works simultaneously as list and intersection cache, offering a more efficient way of handling cache space. In addition, we propose effective strategies to select the term pairs that should populate the cache. Simulation using two datasets and a real query log reveal that the proposed approach improves overall performance in terms of total processing time, achieving savings of up to 40% in the best case.
Gabriel Tolosa, Luca Becchetti, Esteban Feuerstein, Alberto Marchetti-Spaccamela

Mining and Recommending

Information-Theoretic Term Selection for New Item Recommendation
Abstract
Recommender systems aim at predicting the preference of a user towards a given item (e.g., a movie, a song). For systems that must cope with continuously evolving item catalogs, there will be a considerable rate of new items for which no past preference is known that could otherwise inform preference-based recommendations. In contrast, pure content-based recommendations may suffer from noisy item descriptions. To overcome these problems, we propose an information-theoretic approach that exploits a taxonomy of categories associated with the cataloged items in order to select informative terms for an improved recommendation. Our experiments using two publicly available datasets attest the effectiveness of the proposed approach, which significantly outperforms state-of-the-art content-based recommenders from the literature.
Thales F. Costa, Anisio Lacerda, Rodrygo L. T. Santos, Nivio Ziviani
On the String Consensus Problem and the Manhattan Sequence Consensus Problem
Abstract
In the Manhattan Sequence Consensus problem (MSC problem) we are given k integer sequences, each of length ℓ, and we are to find an integer sequence x of length ℓ (called a consensus sequence), such that the maximum Manhattan distance of x from each of the input sequences is minimized. For binary sequences Manhattan distance coincides with Hamming distance, hence in this case the string consensus problem (also called string center problem or closest string problem) is a special case of MSC. Our main result is a practically efficient \(\mathcal{O}(\ell)\)-time algorithm solving MSC for k ≤ 5 sequences. Practicality of our algorithms has been verified experimentally. It improves upon the quadratic algorithm by Amir et al. (SPIRE 2012) for string consensus problem for k = 5 binary strings. Similarly as in Amir’s algorithm we use a column-based framework. We replace the implied general integer linear programming by its easy special cases, due to combinatorial properties of the MSC for k ≤ 5. We also show that for a general parameter k any instance can be reduced in linear time to a kernel of size k!, so the problem is fixed-parameter tractable. Nevertheless, for k ≥ 4 this is still too much for any naive solution to be feasible in practice.
Tomasz Kociumaka, Jakub W. Pachocki, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń
Context-Aware Deal Size Prediction
Abstract
Daily deals sites, such as Groupon and LivingSocial, attract millions of customers in the hunt for products and services at substantially reduced prices (i.e., deals). An important aspect for the profitability of these sites is the correct prediction of how many coupons will be sold for each deal in their catalog—a task commonly referred to as deal size prediction. Existing solutions for the deal size prediction problem focus on one deal at a time, neglecting the existence of similar deals in the catalog. In this paper, we propose to improve deal size prediction by taking into account the context in which a given deal is offered. In particular, we propose a topic modeling approach to identify markets with similar deals and an expectation-maximization approach to model intra-market competition while minimizing the prediction error. A systematic set of experiments shows that our approach offers gains in precision ranging from 8.18% to 17.67% when compared against existing solutions.
Anisio Lacerda, Adriano Veloso, Rodrygo L. T. Santos, Nivio Ziviani
Simple and Efficient String Algorithms for Query Suggestion Metrics Computation
Abstract
In order to make query suggestion mechanisms more efficient, it is important to have metrics that will estimate query suggestions quality well. Recently, Kharitonov et al. [7] proposed a family of metrics that showed much better alignment with user satisfaction than previously known metrics. However, they did not address the problem of computing the proposed metrics. In this paper we show that the problem can be reduced to one of the two string problems which we call Top-k and Sorted-Top-k. Given an integer k and two sets of pairwise distinct strings (queries) with weights, Q and Q test , the Top-k problem is to find, for each query q ∈ Q test , its shortest prefix q[1..i] such that q belongs to the list of k heaviest queries in Q starting with q[1..i]. The Sorted-Top-k problem is to retrieve, for each q ∈ Q test and 1 ≤ i ≤ |q|, a position of q in the sorted list of the k heaviest queries in Q starting with q[1..i]. We show several linear-time solutions to these problems and compare them experimentally.
Alexander Loptev, Anna Selugina, Tatiana Starikovskaya
Backmatter
Metadaten
Titel
String Processing and Information Retrieval
herausgegeben von
Edleno Moura
Maxime Crochemore
Copyright-Jahr
2014
Verlag
Springer International Publishing
Electronic ISBN
978-3-319-11918-2
Print ISBN
978-3-319-11917-5
DOI
https://doi.org/10.1007/978-3-319-11918-2

Neuer Inhalt