
2010 | Book

Combinatorial Pattern Matching

21st Annual Symposium, CPM 2010, New York, NY, USA, June 21-23, 2010. Proceedings

Edited by: Amihood Amir, Laxmi Parida

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter
Algorithms for Forest Pattern Matching

Ordered labelled trees are trees where the left-to-right order among siblings is significant. An ordered labelled forest is a sequence of ordered labelled trees. Given an ordered labelled forest F (“the target forest”) and an ordered labelled forest G (“the pattern forest”), the forest pattern matching problem is to find a sub-forest F′ of F such that F′ and G are the most similar over all possible F′. In this paper, we present efficient algorithms for the forest pattern matching problem for two types of sub-forests: closed subforests and closed substructures. As RNA molecules’ secondary structures can be represented as ordered labelled forests, our algorithms can be used to locate the structural or functional regions in RNA secondary structures.

Kaizhong Zhang, Yunkun Zhu
Affine Image Matching Is Uniform TC⁰-Complete

Affine image matching is the computational problem of determining, for two given images A and B, how closely an affinely transformed A can resemble B. Research in combinatorial pattern matching led to a polynomial-time algorithm which solves this problem by a sophisticated search in the set D(A) of all affine transformations of A. This paper shows that polynomial time is not the lowest complexity class containing this problem by proving its TC⁰-completeness. This result not only means that there are extremely efficient parallel solutions, but also reveals further insight into the structural properties of image matching. The completeness in TC⁰ relates affine image matching to a number of the most basic problems in computer science, like integer multiplication and division.

Christian Hundt
Old and New in Stringology

Twenty-five years ago, in a paper titled “Open Problems in Stringology”, I listed thirteen open problems in a field I called Stringology. The first part of the talk will revisit the list. Some problems were solved, others were partially solved, and some resisted any progress.

The second part of the talk will review some recent results in Stringology, namely algorithms in the streaming model. In this model, the algorithms cannot store the entire input string(s) and can use only very limited space. Surprisingly, efficient algorithms were discovered for a number of string problems.

The talk will conclude with new open problems that are raised by these new results.

Zvi Galil
Small-Space 2D Compressed Dictionary Matching

The dictionary matching problem seeks all locations in a text that match any of the patterns in a dictionary. In the compressed dictionary matching problem, the input is in compressed form. In this paper we introduce the 2-dimensional compressed dictionary matching problem in Lempel-Ziv compressed images, and present an efficient solution for patterns whose rows are all periodic. Given k patterns, each of (uncompressed) size m × m, and a text of (uncompressed) size n × n, all in 2D-LZ compressed form, our algorithm finds all occurrences of the patterns in the text. The algorithm is strongly inplace, i.e., the extra space it uses is proportional to the optimal compression of the dictionary, which is O(km). The preprocessing time of the algorithm is O(km²), linear in the uncompressed dictionary size, and the time for performing the search is linear in the uncompressed text size, independent of the dictionary size. Our algorithm is general in the sense that it can be used for any 2D compression scheme which can be sequentially decompressed in small space.

Shoshana Neuburger, Dina Sokol
Bidirectional Search in a String with Wavelet Trees

Searching for genes encoding microRNAs (miRNAs) is an important task in genome analysis. Because the secondary structure of miRNA (but not the sequence) is highly conserved, the genes encoding it can be determined by finding regions in a genomic DNA sequence that match the structure. It is known that algorithms using a bidirectional search on the DNA sequence for this task outperform algorithms based on unidirectional search. The data structures supporting a bidirectional search (affix trees and affix arrays), however, are rather complex and suffer from their large space consumption. Here, we present a new data structure called the bidirectional wavelet index that supports bidirectional search with much less space. With this data structure, it is possible to search for RNA secondary structural patterns in large genomes, for example the human genome.

Thomas Schnattinger, Enno Ohlebusch, Simon Gog
A Minimal Periods Algorithm with Applications

Kosaraju in “Computation of squares in a string” briefly described a linear-time algorithm for computing the minimal squares starting at each position in a word. Using the same construction of suffix trees, we generalize his result and describe in detail how to compute the minimal α-power, with a period of length longer than s, starting at each position in a word w, for arbitrary exponent α > 1 and integer s ≥ 0. The algorithm runs in O(α|w|) time for s = 0 and in O(|w|²) time otherwise. We provide a complete proof of the correctness and computational complexity of the algorithm. The algorithm can be used to detect certain types of pseudo-patterns in words, which was our original goal in studying this generalization.

Zhi Xu
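The α = 2, s = 0 base case of this result, the minimal square starting at each position, can be made concrete with a brute-force baseline. This is only an illustrative sketch with names of our own choosing, not Kosaraju's suffix-tree construction; it runs in cubic rather than linear time:

```python
def minimal_square_lengths(w):
    """For each position i of w, the length of the shortest square (a word of
    the form xx) starting at i, or 0 if no square starts there. Brute force."""
    n = len(w)
    out = []
    for i in range(n):
        best = 0
        for p in range(1, (n - i) // 2 + 1):      # p = candidate period
            if w[i:i + p] == w[i + p:i + 2 * p]:  # square xx with |x| = p
                best = 2 * p
                break                              # shortest first, so stop
        out.append(best)
    return out
```

For example, on "aabab" the square "aa" starts at position 0 and "abab" at position 1.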
The Property Suffix Tree with Dynamic Properties

Recently there has been much interest in the Property Indexing Problem ([1],[7],[8]), where one wants to preprocess a text T of size n over an alphabet Σ (which we assume is of constant size) and a set of intervals π over the text positions, such that given a query pattern P of size m we can report all of the occurrences of P in T which are completely contained within some interval from π. This type of matching is extremely helpful in scenarios in molecular biology, where it has long been a practice to consider special areas in the genome by their structure.

The work done so far has focused on the static version of this problem, where the intervals are given a priori and never changed. This paper is the first to focus on several dynamic settings of π, including an incremental version where new intervals are inserted into π, a decremental version where intervals are deleted from π, a fully dynamic version where intervals may be inserted into or deleted from π, and batched insertions where a set of intervals is inserted into π. In particular, the batched version provides us with a new (optimal) algorithm for the static case.

Tsvi Kopelowitz
Approximate All-Pairs Suffix/Prefix Overlaps

Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of r strings of total length n and an error rate ε, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose new solutions for this problem based on backward backtracking (Lam et al., 2008) and suffix filters (Kärkkäinen and Na, 2008). Our techniques use nH_k + o(n log σ) + r log r bits of space, where H_k is the k-th order entropy and σ the alphabet size. In practice, the methods are easy to parallelize and scale up to millions of DNA reads.

Niko Välimäki, Susana Ladra, Veli Mäkinen
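The exact (k = 0) version of the all-pairs suffix/prefix problem is easy to state in code. The quadratic sketch below (function and parameter names are ours, not the paper's) is the naïve baseline that index-based methods like the above improve upon:

```python
def exact_overlaps(strings, min_len=1):
    """For every ordered pair of distinct strings (a, b), the length of the
    longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    out = {}
    for ia, a in enumerate(strings):
        for ib, b in enumerate(strings):
            if ia == ib:
                continue
            best = 0
            for l in range(min(len(a), len(b)), min_len - 1, -1):
                if a[-l:] == b[:l]:   # suffix of a matches prefix of b
                    best = l
                    break             # longest first, so stop
            out[(ia, ib)] = best
    return out
```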
Succinct Dictionary Matching with No Slowdown

The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over a (not necessarily constant) alphabet of size σ, build a data structure so that we can match in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a representation that occupies O(m log m) bits of space, where m ≤ n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log σ + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses O(n log σ) bits of space while answering queries in O(|T| log log n + occ) time. In the paper we also show how the space occupancy can be reduced to m(H₀ + O(1)) + O(d log(n/d)), where H₀ is the empirical entropy of the characters appearing in the trie representation of the set S, provided that σ < m^ε for any constant 0 < ε < 1. The query time remains unchanged.

Djamal Belazzougui
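As a reference point for the problem interface, here is a naïve dictionary-matching scan (names are ours). The Aho-Corasick automaton discussed above produces the same set of occurrences in O(|T| + occ) time rather than the O(|T| · n) of this sketch:

```python
def dictionary_matches(patterns, text):
    """All (position, pattern) occurrences of any pattern in text.
    Naïve stand-in for an Aho-Corasick scan."""
    return [(i, p) for p in patterns
            for i in range(len(text) - len(p) + 1)
            if text[i:i + len(p)] == p]
```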
Pseudo-realtime Pattern Matching: Closing the Gap

We consider the k-difference and k-mismatch problems in the pseudo-realtime model, where the text arrives online and the time complexity measure is per arriving character and unamortised. The well-known k-difference/k-mismatch problems are those of finding all alignments of a pattern of length m with a text of length n where the edit/Hamming distance is at most k. Offline, the literature gives efficient solutions in O(nk) and O(n√(k log k)) time, respectively. More recently, a pseudo-realtime solution was given for the former in O(k log m) time and for the latter in O(√(k log k) log m) time per arriving text character. Our work improves these complexities to O(k) time for the k-difference problem and O(√k log k + log m) for the k-mismatch problem. In the process of developing the main results, we also give a simple solution with optimal time complexity for performing longest common extension queries in the same pseudo-realtime setting, which may be of independent interest.

Raphaël Clifford, Benjamin Sach
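For concreteness, the k-mismatch output can be specified by a naïve offline baseline. This is a sketch of the problem definition only, not the per-character pseudo-realtime algorithm, and the names are ours:

```python
def k_mismatch(pattern, text, k):
    """All alignments i where the Hamming distance between pattern and
    text[i:i+m] is at most k. Naïve O(nm) definition-level baseline."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(pattern, text[i:i + m])) <= k]
```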
Breakpoint Distance and PQ-Trees

The PQ-tree is a fundamental data structure that can encode large sets of permutations. It has recently been used in comparative genomics to model ancestral genomes with some uncertainty: given a phylogeny for some species, extant genomes are represented by permutations on the leaves of the tree, and each internal node in the phylogenetic tree represents an extinct ancestral genome, represented by a PQ-tree. An open problem related to this approach is then to quantify the evolution between genomes represented by PQ-trees. In this paper we present results for two problems of PQ-tree comparison motivated by this application. First, we show that the problem of comparing two PQ-trees by computing the minimum breakpoint distance among all pairs of permutations generated respectively by the two considered PQ-trees is NP-complete for unsigned permutations. Next, we consider a generalization of the classical Breakpoint Median problem, where an ancestral genome is represented by a PQ-tree and p permutations are given, with p ≥ 1, and we want to compute a permutation generated by the PQ-tree that minimizes the sum of the breakpoint distances to the p permutations. We show that this problem is fixed-parameter tractable with respect to the breakpoint distance value. This last result applies to both signed and unsigned permutations, and to uni-chromosomal and multi-chromosomal permutations.

Haitao Jiang, Cedric Chauve, Binhai Zhu
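The breakpoint distance between two plain (unsigned, uni-chromosomal) permutations, the quantity minimized above over the permutations generated by the PQ-trees, can be computed directly. A small sketch with our own names:

```python
def breakpoints(p, q):
    """Number of adjacent pairs in p that are not adjacent (in either order)
    in q, i.e., the unsigned breakpoint distance from p to q."""
    adj_q = {frozenset(pair) for pair in zip(q, q[1:])}
    return sum(1 for pair in zip(p, p[1:]) if frozenset(pair) not in adj_q)
```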
On the Parameterized Complexity of Some Optimization Problems Related to Multiple-Interval Graphs

We show that for any constant t ≥ 2, k-Independent Set and k-Dominating Set in t-track interval graphs are W[1]-hard. This settles an open question recently raised by Fellows, Hermelin, Rosamond, and Vialette. We also give an FPT algorithm for k-Clique in t-interval graphs, parameterized by both k and t, with running time max{t^O(k), 2^O(k log k)} · poly(n), where n is the number of vertices in the graph. This slightly improves the previous FPT algorithm by Fellows, Hermelin, Rosamond, and Vialette. Finally, we use the W[1]-hardness of k-Independent Set in t-track interval graphs to obtain the first parameterized intractability result for a recent bioinformatics problem called Maximal Strip Recovery (MSR). We show that MSR-d is W[1]-hard for any constant d ≥ 4 when the parameter is either the total length of the strips, the total number of adjacencies in the strips, or the number of strips in the optimal solution.

Minghui Jiang
Succinct Representations of Separable Graphs

We consider the problem of highly space-efficient representation of separable graphs while supporting queries in constant time in the RAM with logarithmic word size. In particular, we show constant-time support for adjacency, degree and neighborhood queries. For any monotone class of separable graphs, the storage requirement of the representation is optimal to within lower order terms.

Separable graphs are those that admit an O(n^c)-separator theorem, where c < 1. Many graphs that arise in practice are indeed separable. For instance, graphs with bounded genus are separable. In particular, planar graphs (genus 0) are separable, and our scheme gives the first succinct representation of planar graphs with a storage requirement that matches the information-theoretic minimum to within lower order terms, with constant-time support for the queries.

We further show that we can also modify the scheme to succinctly represent the combinatorial planar embedding of planar graphs (and hence encode planar maps).

Guy E. Blelloch, Arash Farzan
Implicit Hitting Set Problems and Multi-genome Alignment

Let U be a finite set and S a family of subsets of U. Define a hitting set as a subset of U that intersects every element of S. The optimal hitting set problem is: given a positive weight for each element of U, find a hitting set of minimum total weight. This problem is equivalent to the classic weighted set cover problem. We consider the optimal hitting set problem in the case where the set system S is not explicitly given, but there is an oracle that will supply members of S satisfying certain conditions; for example, we might ask the oracle for a minimum-cardinality set in S that is disjoint from a given set Q. The problems of finding a minimum feedback arc set or minimum feedback vertex set in a digraph are examples of implicit hitting set problems. Our interest is in the number of oracle queries required to find an optimal hitting set. After presenting some generic algorithms for this problem we focus on our computational experience with an implicit hitting set problem related to multi-genome alignment in genomics. This is joint work with Erick Moreno Centeno.

Richard M. Karp
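For the explicit (non-oracle) setting, the standard greedy heuristic for weighted hitting sets gives a feel for the problem. This is a generic baseline with names of our own choosing, not one of the oracle-query algorithms of the talk, and it assumes every set in S contains at least one element of U:

```python
def greedy_hitting_set(universe, sets, weight):
    """Greedy heuristic: repeatedly pick the element with the best ratio of
    weight to the number of still-unhit sets it intersects."""
    remaining = [set(s) for s in sets]
    chosen = set()
    while remaining:
        best = min((u for u in universe if any(u in s for s in remaining)),
                   key=lambda u: weight[u] / sum(1 for s in remaining if u in s))
        chosen.add(best)
        remaining = [s for s in remaining if best not in s]  # drop hit sets
    return chosen
```

Greedy is an approximation, not an exact solver; the talk's interest is in exact optima found with few oracle queries.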
Bounds on the Minimum Mosaic of Population Sequences under Recombination

We study the minimum mosaic problem, an optimization problem originating in population genomics. We develop a new lower bound, called the C bound. The C bound is provably higher and significantly more accurate in practice than an existing bound. We show how to compute the exact C bound using integer linear programming. We also show that a weaker version of the C bound is also more accurate than the existing bound, and can be computed in polynomial time. Simulations show that the new bounds often match the exact optimum, at least for the range of data we tested. Moreover, we give an analytical upper bound for the minimum mosaic problem.

Yufeng Wu
The Highest Expected Reward Decoding for HMMs with Application to Recombination Detection

Hidden Markov models are traditionally decoded by the Viterbi algorithm which finds the highest probability state path in the model. In recent years, several limitations of the Viterbi decoding have been demonstrated, and new algorithms have been developed to address them (Kall et al., 2005; Brejova et al., 2007; Gross et al., 2007; Brown and Truszkowski, 2010). In this paper, we propose a new efficient highest expected reward decoding algorithm (HERD) that allows for uncertainty in boundaries of individual sequence features. We demonstrate usefulness of our approach on jumping HMMs for recombination detection in viral genomes.

Michal Nánási, Tomáš Vinař, Broňa Brejová
Phylogeny- and Parsimony-Based Haplotype Inference with Constraints

Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. One fast computational haplotyping method is based on an evolutionary model where a perfect phylogenetic tree is sought that explains the observed data. In their CPM 2009 paper, Fellows et al. studied an extension of this approach that incorporates prior knowledge in the form of a set of candidate haplotypes from which the right haplotypes must be chosen. While this approach may help to increase the accuracy of haplotyping methods, it was conjectured that the resulting formal problem, constrained perfect phylogeny haplotyping, might be NP-complete. In the present paper we present a polynomial-time algorithm for it. Our algorithmic ideas also yield new fixed-parameter algorithms for related haplotyping problems based on the maximum parsimony assumption.

Michael Elberfeld, Till Tantau
Faster Computation of the Robinson-Foulds Distance between Phylogenetic Networks

The Robinson-Foulds distance, which is the most widely used metric for comparing phylogenetic trees, has recently been generalized to phylogenetic networks. Given two networks N₁, N₂ with n leaves, m nodes, and e edges, the Robinson-Foulds distance measures the number of clusters of descendant leaves that are not shared by N₁ and N₂. The fastest known algorithm for computing the Robinson-Foulds distance between those networks runs in O(m(m + e)) time. In this paper, we improve the time complexity to O(n(m + e)/log n) for general networks and O(nm/log n) for general networks with bounded degree, and to optimal O(m + e) time for planar phylogenetic networks and bounded-level phylogenetic networks. We also introduce the natural concept of the minimum spread of a phylogenetic network and show how the running time of our new algorithm depends on this parameter. As an example, we prove that the minimum spread of a level-k phylogenetic network is at most k + 1, which implies that for two level-k phylogenetic networks, our algorithm runs in O((k + 1)(m + e)) time.

Tetsuo Asano, Jesper Jansson, Kunihiko Sadakane, Ryuhei Uehara, Gabriel Valiente
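For trees (the special case that the networks above generalize), the cluster-based distance can be sketched directly: collect each internal node's set of descendant leaves and count the clusters not shared by the two inputs. A small illustration with trees as nested tuples and our own function names:

```python
def clusters(tree):
    """All clusters (frozensets of descendant leaves) over the internal
    nodes of a tree given as nested tuples; leaves are non-tuple labels."""
    out = set()

    def walk(node):
        if not isinstance(node, tuple):          # leaf
            return frozenset([node])
        leaves = frozenset().union(*(walk(c) for c in node))
        out.add(leaves)                          # cluster of this internal node
        return leaves

    walk(tree)
    return out

def rf_distance(t1, t2):
    """Clusters present in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))
```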
Mod/Resc Parsimony Inference

We address in this paper a new computational biology problem that aims at understanding a mechanism that could potentially be used to genetically manipulate natural insect populations infected by inherited, intra-cellular parasitic bacteria. In this problem, which we denote Mod/Resc Parsimony Inference, we are given a boolean matrix and the goal is to find two other boolean matrices with a minimum number of columns such that an appropriately defined operation on these matrices gives back the input. We show that this is formally equivalent to the Bipartite Biclique Edge Cover problem and derive some complexity results for our problem using this equivalence. We provide a new fixed-parameter tractability approach for solving both problems that slightly improves upon a previously published algorithm for Bipartite Biclique Edge Cover. Finally, we present experimental results where we applied some of our techniques to a real-life data set.

Igor Nor, Danny Hermelin, Sylvain Charlat, Jan Engelstadter, Max Reuter, Olivier Duron, Marie-France Sagot
Extended Islands of Tractability for Parsimony Haplotyping

Parsimony haplotyping is the problem of finding a smallest size set of haplotypes that can explain a given set of genotypes. The problem is NP-hard, and many heuristic and approximation algorithms as well as polynomial-time solvable special cases have been discovered. We propose improved fixed-parameter tractability results with respect to the parameter “size of the target haplotype set” k by presenting an O*(k^{4k})-time algorithm. This also applies to the practically important constrained case, where we can only use haplotypes from a given set. Furthermore, we show that the problem becomes polynomial-time solvable if the given set of genotypes is complete, i.e., contains all possible genotypes that can be explained by the set of haplotypes.

Rudolf Fleischer, Jiong Guo, Rolf Niedermeier, Johannes Uhlmann, Yihui Wang, Mathias Weller, Xi Wu
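The underlying "explain" relation is simple to state in code. Below, genotypes are strings over {0,1,2} with 2 marking heterozygous sites, a common encoding that we assume here purely for illustration:

```python
def explains(h1, h2, g):
    """True iff haplotypes h1, h2 (0/1 strings) explain genotype g (0/1/2
    string): at a 0/1 site both haplotypes carry that value; at a 2 site
    (heterozygous) the haplotypes must differ."""
    return all((a == b == c) if c in "01" else (a != b)
               for a, b, c in zip(h1, h2, g))
```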
Sampled Longest Common Prefix Array

When augmented with the longest common prefix (LCP) array and some other structures, the suffix array can solve many string processing problems in optimal time and space. A compressed representation of the LCP array is also one of the main building blocks in many compressed suffix tree proposals. In this paper, we describe a new compressed LCP representation: the sampled LCP array. We show that when used with a compressed suffix array (CSA), the sampled LCP array often offers better time/space trade-offs than the existing alternatives. We also show how to construct the compressed representations of the LCP array directly from a CSA.

Jouni Sirén
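The uncompressed LCP array that such representations compress can be computed from a suffix array in linear time by Kasai's algorithm. A compact sketch (the naïve quadratic suffix-array construction is used only to keep the example self-contained):

```python
def suffix_array(s):
    """Naïve suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] is the length of the longest common prefix
    of the suffixes at ranks r-1 and r in sa (lcp[0] = 0). Runs in O(n)."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):                 # process suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]        # predecessor suffix in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1                 # LCP can drop by at most 1
        else:
            h = 0
    return lcp
```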
Verifying a Parameterized Border Array in O(n^1.5) Time

The parameterized pattern matching problem is to check if there exists a renaming bijection on the alphabet with which a given pattern can be transformed into a substring of a given text. A parameterized border array (p-border array) is a parameterized version of a standard border array, and we can efficiently solve the parameterized pattern matching problem using p-border arrays. In this paper we present an O(n^1.5)-time, O(n)-space algorithm to verify if a given integer array of length n is a valid p-border array for an unbounded alphabet. The best previously known solution takes time proportional to the n-th Bell number $\frac{1}{e} \sum_{k=0}^{\infty} \frac{k^{n}}{k!}$, and hence our algorithm is quite efficient.

Tomohiro I., Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
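The parameterized matching relation itself can be tested via the standard "previous occurrence" encoding: two strings p-match iff their encodings coincide. The sketch below uses our own names and re-encodes every window, so it is illustrative rather than efficient:

```python
def prev_encode(s):
    """Encode each position as the distance to the previous occurrence of the
    same symbol, or 0 at its first occurrence."""
    last = {}
    out = []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out

def param_match(p, t):
    """Positions i where some alphabet bijection maps p onto t[i:i+|p|]."""
    m = len(p)
    pe = prev_encode(p)
    return [i for i in range(len(t) - m + 1) if prev_encode(t[i:i + m]) == pe]
```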
Cover Array String Reconstruction

A proper factor u of a string y is a cover of y if every letter of y is within some occurrence of u in y. The concept generalises the notion of periods of a string. An integer array C is the minimal-cover (resp. maximal-cover) array of y if C[i] is the minimal (resp. maximal) length of covers of y[0..i], or zero if no cover exists.

In this paper, we present a constructive algorithm checking the validity of an array as a minimal-cover or maximal-cover array of some string. When the array is valid, the algorithm produces a string over an unbounded alphabet whose cover array is the input array. All algorithms run in linear time due to an interesting combinatorial property of cover arrays: the sum of important values in a cover array is bounded by twice the length of the string.

Maxime Crochemore, Costas S. Iliopoulos, Solon P. Pissis, German Tischler
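The cover relation at the heart of the paper is easy to check directly. A linear-scan sketch with our own names, using Python's str.find to enumerate occurrences:

```python
def is_cover(u, y):
    """True iff u is a proper factor of y and every letter of y lies within
    some occurrence of u in y."""
    if not u or len(u) >= len(y):      # a cover must be a proper factor
        return False
    covered_up_to = 0                  # first position not yet covered
    i = y.find(u)
    while i != -1:
        if i > covered_up_to:          # gap: position covered_up_to is exposed
            return False
        covered_up_to = i + len(u)
        i = y.find(u, i + 1)           # next (possibly overlapping) occurrence
    return covered_up_to >= len(y)
```

For example, "aba" covers "ababa" (occurrences at 0 and 2 overlap on the middle "a"), while "ab" does not, since the final letter is uncovered.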
Compression, Indexing, and Retrieval for Massive String Data

The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.

Wing-Kai Hon, Rahul Shah, Jeffrey Scott Vitter
Building the Minimal Automaton of A * X in Linear Time, When X Is of Bounded Cardinality

We present an algorithm for constructing the minimal automaton recognizing A*X, where the pattern X is a set of m (a fixed integer) non-empty words over a finite alphabet A whose sum of lengths is n. This algorithm, inspired by Brzozowski’s minimization algorithm, uses sparse lists to achieve a linear time complexity with respect to n.

Omar AitMous, Frédérique Bassino, Cyril Nicaud
A Compact Representation of Nondeterministic (Suffix) Automata for the Bit-Parallel Approach

We present a novel technique, suitable for bit-parallelism, for representing both the nondeterministic automaton and the nondeterministic suffix automaton of a given string in a more compact way. Our approach is based on a particular factorization of strings which, on average, allows packing into a machine word of w bits the automata state configurations for strings of length greater than w. We adapted the Shift-And and BNDM algorithms using our encoding and compared them with the original algorithms. Experimental results show that the new variants are generally faster for long patterns.

Domenico Cantone, Simone Faro, Emanuele Giaquinta
Algorithms for Three Versions of the Shortest Common Superstring Problem

The input to the Shortest Common Superstring (SCS) problem is a set S of k words of total length n. In the classical version the output is an explicit word SCS(S) in which each s ∈ S occurs at least once. In our paper we consider two versions with multiple occurrences, in which the input includes additional numbers (multiplicities), given in binary. Our output is the word SCS(S) given implicitly in a compact form, since its real size could be exponential. We also consider a case when all input words are of length two, where our main algorithmic tool is a compact representation of Eulerian cycles in multigraphs. Due to exponential multiplicities of edges such cycles can be exponential and the compact representation is needed. Other tools used in our paper are a polynomial case of integer linear programming and a min-plus product of matrices.

Maxime Crochemore, Marek Cygan, Costas Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń
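For the classical single-occurrence version, the well-known greedy merge heuristic illustrates what computing a superstring looks like. This is a standard baseline with hypothetical names (it assumes distinct input words and is a heuristic, not the paper's compact-representation algorithms):

```python
def greedy_scs(words):
    """Greedy superstring: repeatedly merge the pair with the largest
    suffix/prefix overlap until one word remains. Assumes distinct words."""
    def overlap(a, b):
        for l in range(min(len(a), len(b)), 0, -1):
            if a[-l:] == b[:l]:
                return l
        return 0

    # Drop words that are substrings of other words.
    words = [w for i, w in enumerate(words)
             if not any(w in v for j, v in enumerate(words) if i != j)]
    while len(words) > 1:
        l, i, j = max(((overlap(a, b), i, j)
                       for i, a in enumerate(words)
                       for j, b in enumerate(words) if i != j),
                      key=lambda t: t[0])
        merged = words[i] + words[j][l:]       # glue b after a, sharing l chars
        words = [w for k, w in enumerate(words) if k not in (i, j)] + [merged]
    return words[0]
```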
Finding Optimal Alignment and Consensus of Circular Strings

We consider the problem of finding the optimal alignment and consensus (string) of circular strings. Circular strings are different from linear strings in that the first (leftmost) symbol of a circular string is wrapped around next to the last (rightmost) symbol. In nature, for example, bacterial and mitochondrial DNAs typically form circular strings. The consensus string problem is finding a representative string (consensus) of a given set of strings, and it has been studied on linear strings extensively. However, only a few efforts have been made for the consensus problem for circular strings, even though circular strings are biologically important. In this paper, we introduce the consensus problem for circular strings and present novel algorithms to find the optimal alignment and consensus of circular strings under the Hamming distance metric. They are O(n² log n)-time algorithms for three circular strings and an O(n³ log n)-time algorithm for four circular strings. Our algorithms are O(n/log n) times faster than the naïve algorithm directly using the solutions for the linear consensus problems, which takes O(n³) time for three circular strings and O(n⁴) time for four circular strings. We achieved this speedup by adopting a convolution and a system of linear equations into our algorithms to reflect the characteristics of circular strings that we found.

Taehyung Lee, Joong Chae Na, Heejin Park, Kunsoo Park, Jeong Seop Sim
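The basic pairwise objective, the minimum Hamming distance over all rotations of one circular string against another, can be written down naïvely; the paper's contribution is precisely avoiding this kind of quadratic-per-pair brute force (names are ours):

```python
def best_rotation_distance(x, y):
    """Minimum Hamming distance between x and all rotations of y, both of
    the same length n. Naïve O(n^2) brute force over rotations."""
    n = len(x)
    assert len(y) == n
    best = n + 1
    for r in range(n):
        rot = y[r:] + y[:r]                       # rotate y by r positions
        best = min(best, sum(a != b for a, b in zip(x, rot)))
    return best
```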
Optimizing Restriction Site Placement for Synthetic Genomes

Restriction enzymes are the workhorses of molecular biology. We introduce a new problem that arises in the course of our project to design virus variants to serve as potential vaccines: we wish to modify virus-length genomes to introduce large numbers of unique restriction enzyme recognition sites while preserving wild-type function by substitution of synonymous codons. We show that the resulting problem is NP-complete, give an exponential-time algorithm, and propose effective heuristics, which we show give excellent results for five sample viral genomes. Our resulting modified genomes have several times more unique restriction sites and reduce the maximum gap between adjacent sites by three- to nine-fold.

Pablo Montes, Heraldo Memelli, Charles Ward, Joondong Kim, Joseph S. B. Mitchell, Steven Skiena
Extension and Faster Implementation of the GRP Transform for Lossless Compression

The GRP transform, or generalized radix permutation transform, was proposed as a parametric generalization of the BWT of the block-sorting data compression algorithm. This paper develops an extension that can be applied with any combination of parameters. By using the technique developed for the linear time/space implementation of the sort transform, we propose an efficient implementation of the inverse transformation of the GRP transform. It works for arbitrary parameter values, and can convert the transformed string to the original string in time linear in the string length.

Hidetoshi Yokoo
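Since the GRP transform generalizes the BWT, a minimal BWT round-trip (the textbook sorted-rotations version with a '$' terminator, not the GRP transform itself) shows the kind of invertibility the paper's linear-time inverse provides:

```python
def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotations of
    s + '$' (the '$' terminator is assumed absent from s and to sort lowest)."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(b):
    """Naïve O(n^2 log n) inversion: rebuild the rotation table column by
    column, then read off the row ending in '$'."""
    table = [""] * len(b)
    for _ in range(len(b)):
        table = sorted(c + row for c, row in zip(b, table))
    return next(row for row in table if row.endswith("$"))[:-1]
```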
Parallel and Distributed Compressed Indexes

We study parallel and distributed compressed indexes. Compressed indexes are a new and functional way to index text strings. They exploit the compressibility of the text, so that their size is a function of the compressed text size. Moreover, they support a considerable number of operations, more than many classical indexes. We make use of this extended functionality to obtain, in a shared-memory parallel machine, near-optimal speedups for solving several stringology problems. We also show how to distribute compressed indexes across several machines.

Luís M. S. Russo, Gonzalo Navarro, Arlindo L. Oliveira
Backmatter
Metadata
Titel
Combinatorial Pattern Matching
Edited by
Amihood Amir
Laxmi Parida
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-13509-5
Print ISBN
978-3-642-13508-8
DOI
https://doi.org/10.1007/978-3-642-13509-5
