
2007 | Book

Combinatorial Pattern Matching

18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007. Proceedings

Edited by: Bin Ma, Kaizhong Zhang

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The papers contained in this volume were presented at the 18th Annual Symposium on Combinatorial Pattern Matching (CPM 2007), held at the University of Western Ontario in London, Ontario, Canada, from July 9 to 11, 2007. All the papers presented at the conference are original research contributions on computational pattern matching and analysis, data compression and compressed text processing, suffix arrays and trees, and computational biology. They were selected from 64 submissions. Each submission was reviewed by at least three reviewers. The committee decided to accept 32 papers. The programme also included three invited talks by Tao Jiang from the University of California, Riverside, USA, S. Muthukrishnan from Rutgers University, USA, and Frances Yao from City University of Hong Kong, Hong Kong. Combinatorial pattern matching addresses issues of searching and matching strings and more complicated patterns such as trees, regular expressions, graphs, point sets, and arrays. The goal is to derive non-trivial combinatorial properties of such structures and to exploit these properties in order to either achieve superior performance for the corresponding computational problems or pinpoint conditions under which searches cannot be performed efficiently.

Table of contents

Frontmatter

Invited Talks (Abstracts)

A Combinatorial Approach to Genome-Wide Ortholog Assignment: Beyond Sequence Similarity Search

The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes of the same families. In this paper, we present a new approach to ortholog assignment that takes into account both sequence similarity and evolutionary events at the genome level, where orthologous genes are assumed to correspond to each other in the most parsimonious evolutionary scenario under genome rearrangement and gene duplication. It is then formulated as a problem of computing the signed reversal distance with duplicates between two genomes of interest, for which an efficient heuristic algorithm was constructed based on solutions to two new optimization problems, minimum common partition and maximum cycle decomposition. Following this approach, we have implemented a high-throughput system for assigning orthologs on a genome scale, called MSOAR, and tested it on both simulated data and real genome sequence data. Our predicted orthologs between the human and mouse genomes are strongly supported by ortholog and protein function information in authoritative databases, and by predictions made by other key ortholog assignment methods such as Ensembl, Homologene, INPARANOID, and HGNC. The simulation results demonstrate that MSOAR in general performs better than D. Sankoff's iterated exemplar algorithm in terms of identifying true exemplar genes.

This is joint work with X. Chen (Nanyang Tech. Univ., Singapore), Z. Fu (UCR), J. Zheng (NCBI), V. Vacic (UCR), P. Nan (SCBIT), Y. Zhong (SCBIT), and S. Lonardi (UCR).

Tao Jiang
Stringology: Some Classic and Some Modern Problems

We examine some of the classic problems related to suffix trees from the 1970s and show some recent results on sorting suffixes with small space and on suffix selection. Further, we introduce modern versions of suffix sorting and their application to XML processing. The study of combinatorial aspects of strings continues to flourish, and we present several open problems with modern applications.

S. Muthukrishnan
Algorithmic Problems in Scheduling Jobs on Variable-Speed Processors

Power and heat have become two of the major concerns for the computer industry, which struggles to cope with the energy and cooling costs for servers, as well as the short battery life of portable devices. Dynamic Voltage Scaling (DVS) has emerged as a useful technique: e.g., Intel's newest Foxton technology enables a chip to run at 64 different speed levels. Equipped with DVS technology, the operating system can then reduce the CPU's energy consumption by scheduling tasks wisely. A schedule that finishes the given tasks within their timing constraints while using minimum total energy (among all feasible schedules) is called an optimal DVS schedule. A theoretical model for DVS scheduling was proposed in a paper by Yao, Demers and Shenker in 1995, along with a well-formed characterization of the optimum and an algorithm for computing it. This algorithm has remained the most efficient known despite many investigations of this model. In this talk, we will first give an overview of the DVS scheduling problem, and then present the latest improved results for computing the optimal schedule in both the finite and the continuous (infinite speed levels) models. Related results on efficient on-line scheduling heuristics will also be discussed.

Frances F. Yao

Session 1: Algorithmic Techniques I

Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions

We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi's decoding and training algorithms [21], as well as to the forward-backward and Baum-Welch [4] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. We describe three algorithms based alternatively on byte pair encoding (BPE) [19], run length encoding (RLE) and Lempel-Ziv (LZ78) parsing [12]. Compared to Viterbi's algorithm, we achieve a speedup of $\Omega(r)$ using BPE, a speedup of $\Omega(\frac{r}{\log r})$ using RLE, and a speedup of $\Omega(\frac{\log n}{k})$ using LZ78, where $k$ is the number of hidden states, $n$ is the length of the observed sequence and $r$ is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi's algorithm, our algorithms are highly parallelizable.
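The dynamic program being accelerated is the classical Viterbi recursion. As a point of reference, a minimal uncompressed Viterbi decoder might look as follows; this is our own sketch in Python using log-probabilities, with dictionary-based parameters of our choosing, and it does not show the paper's repetition-exploiting speedups:

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Standard Viterbi decoding: most likely hidden-state path.

    obs       -- list of observed symbols
    states    -- list of hidden states
    log_init  -- {state: log P(state at t=0)}
    log_trans -- {(s, s2): log P(s2 | s)}
    log_emit  -- {(s, sym): log P(sym | s)}
    """
    # dp[s] = best log-probability of any path ending in state s
    dp = {s: log_init[s] + log_emit[(s, obs[0])] for s in states}
    back = []  # back[t][s] = predecessor of s at time t
    for sym in obs[1:]:
        prev, dp, ptr = dp, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + log_trans[(p, s)])
            dp[s] = prev[best_prev] + log_trans[(best_prev, s)] + log_emit[(s, sym)]
            ptr[s] = best_prev
        back.append(ptr)
    # Reconstruct the optimal path by following back-pointers.
    last = max(states, key=lambda s: dp[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Each observation step costs $O(k^2)$ here; the paper's idea is to precompute the effect of whole repeated substrings so that such steps are not repeated.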

Shay Mozes, Oren Weimann, Michal Ziv-Ukelson
On Demand String Sorting over Unbounded Alphabets

On-demand string sorting is the problem of preprocessing a set of $n$ strings to allow subsequent queries of finding the $k < n$ lexicographically smallest strings (and afterwards the next $k$, etc.). This on-demand variant strongly resembles search engine queries, which return the best $k$-ranked pages recurringly.

We present a data structure that supports this in $O(n)$ preprocessing time, and answers queries in $O(\log n)$ time. There is also a cost of $O(N)$ time amortized over all operations, where $N$ is the total length of the strings.

Our data structure is a heap of strings, which supports heapify and delete-min operations. As it turns out, implementing a full heap with all operations is not that simple. For the sake of completeness we propose a heap with full operations, based on balanced indexing trees, that supports the heap operations in optimal time.
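As a toy illustration of the on-demand idea (not the paper's structure), a plain binary heap already gives lazy batched sorting; here is a minimal Python sketch of ours:

```python
import heapq

def on_demand_sorted(strings, k):
    """Yield the strings in sorted order, k at a time.

    heapify costs O(n) comparisons and each batch of k pops costs
    O(k log n) comparisons.  Unlike the paper's structure, a single
    comparison here may scan two whole strings, so these bounds count
    comparisons rather than characters.
    """
    heap = list(strings)
    heapq.heapify(heap)
    while heap:
        yield [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
```

Each yielded batch contains the next $k$ strings in lexicographic order, mirroring the recurring top-$k$ queries described above.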

Carmel Kent, Moshe Lewenstein, Dafna Sheinwald

Session 2: Approximate Pattern Matching

Finding Witnesses by Peeling

In the $k$-matches problem, we are given a pattern and a text, and for each text location the goal is to list all, but not more than $k$, matches between the pattern and the text. This problem is one of several string matching problems that ask not only to find where the pattern matches the text, under different "match" definitions, but also to provide witnesses to the match. Other such problems include: $k$-aligned ones [4], $k$-witnesses, and $k$-mismatches [18]. In addition, the solution to several other string matching problems relies on the efficient solution of the witness finding problems.

In this paper we provide a general efficient method for solving such witness finding problems. We do so by casting the problem as a generalization of group testing, which we then solve by a process we call peeling. Using this general framework we obtain improved results for all of the above problems. We also show that our method solves a couple of problems outside the pattern matching domain.

Yonatan Aumann, Moshe Lewenstein, Noa Lewenstein, Dekel Tsur
Cache-Oblivious Index for Approximate String Matching

This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text $T$ of length $n$ and a positive integer $k$, we want to construct an index of $T$ such that for any input pattern $P$, we can find all its $k$-error matches in $T$ efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies $O((n \log^k n)/B)$ disk pages and finds all $k$-error matches with $O((|P|+occ)/B + \log^k n \log \log_{\scriptscriptstyle B} n)$ I/Os, where $B$ denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require $\Omega(|P| + occ + \mbox{poly}(\log n))$ I/Os. The second index reduces the space to $O((n \log n)/B)$ disk pages, and the I/O complexity is $O((|P| + occ)/B + \log^{k(k+1)} n \log\log n)$.

Wing-Kai Hon, Tak-Wah Lam, Rahul Shah, Siu-Lung Tam, Jeffrey Scott Vitter
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds. In practical applications the space is likely to be a bottleneck and therefore this is of crucial importance.

Philip Bille, Rolf Fagerberg, Inge Li Gørtz
Self-normalised Distance with Don’t Cares

We present $O(n\log m)$ algorithms for a new class of problems termed self-normalised distance with don't cares. The input is a pattern $p$ of length $m$ and a text $t$ of length $n > m$. The elements of these strings are either integers or wild card symbols. In the shift version, the problem is to compute $\min_{\alpha}\sum_{j=0}^{m-1}(\alpha + p_j - t_{i+j})^2$ for all $i$, where wild cards do not contribute to the sum. In the shift-scale version, the objective is to compute $\min_{\alpha,\beta}\sum_{j=0}^{m-1}(\alpha+ \beta p_j - t_{i+j})^2$ for all $i$, similarly. We show that the algorithms have the additional benefit of providing simple $O(n\log m)$ solutions for the problems of exact matching with don't cares, exact shift matching with don't cares and exact shift-scale matching with don't cares.

Peter Clifford, Raphaël Clifford

Session 3: Data Compression I

Move-to-Front, Distance Coding, and Inversion Frequencies Revisited

Move-to-Front, Distance Coding and Inversion Frequencies are three somewhat related techniques used to process the output of the Burrows-Wheeler Transform. In this paper we analyze these techniques from the point of view of how effective they are in the task of compressing low-entropy strings, that is, strings which have many regularities and are therefore highly compressible. This is a non-trivial task since many compressors have non-constant overheads that become non-negligible when the input string is highly compressible.

Because of the properties of the Burrows-Wheeler Transform, being locally optimal ensures that an algorithm compresses low-entropy strings effectively. Informally, local optimality implies that an algorithm is able to effectively compress an arbitrary partition of the input string. We show that in their original formulation neither Move-to-Front, nor Distance Coding, nor Inversion Frequencies is locally optimal. Then, we describe simple variants of the above algorithms which are locally optimal. To achieve local optimality with Move-to-Front it suffices to combine it with Run Length Encoding. To achieve local optimality with Distance Coding and Inversion Frequencies we use a novel "escape and re-enter" strategy. Since we build on previous results, our analyses are simple and shed new light on the inner workings of the three techniques considered in this paper.

Travis Gagie, Giovanni Manzini
A Lempel-Ziv Text Index on Secondary Storage

Full-text searching consists in locating the occurrences of a given pattern $P[1..m]$ in a text $T[1..u]$, both sequences over an alphabet of size $\sigma$. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring $8uH_k + o(u\log\sigma)$ bits of space, where $H_k$ denotes the $k$-th order empirical entropy of $T$, for any $k = o(\log_\sigma u)$. Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size including the text, which means 39%–65% the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.
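The index is built over a Lempel-Ziv parse of the text. For intuition, a bare-bones LZ78-style parser (our sketch; an actual index stores the phrases in trie form with succinct encodings):

```python
def lz78_parse(s):
    """LZ78 parsing: greedily extend the longest known phrase by one
    character; each phrase is emitted as (previous-phrase-index, char)."""
    dictionary = {"": 0}   # phrase -> index
    phrases, cur = [], ""
    for c in s:
        if cur + c in dictionary:
            cur += c
        else:
            phrases.append((dictionary[cur], c))
            dictionary[cur + c] = len(dictionary)
            cur = ""
    if cur:
        # Flush a trailing phrase that duplicates an earlier one, encoded
        # via its longest proper prefix (always in the dictionary).
        phrases.append((dictionary[cur[:-1]], cur[-1]))
    return phrases
```

The number of phrases governs the compressed size, which is why the space bound above is stated in terms of the empirical entropy $H_k$.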

Diego Arroyuelo, Gonzalo Navarro
Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts
(Extended Abstract)

Given an $n$-length text over a $\sigma$-size alphabet, we propose a dynamic rank-select structure that supports $O((1+\frac{\log\sigma}{\log\log n})\log n)$ time operations in $n\log\sigma + o(n\log\sigma)$ bits of space. If $\sigma < \log n$, then the operation time is $O(\log n)$. In addition, we consider both static and dynamic rank-select structures on the run-length encoding (RLE) of a text. For an $n'$-length RLE of an $n$-length text, we present a static structure that gives $O(1)$ time select and $O(\log\log\sigma)$ time rank using $n'\log\sigma + O(n)$ bits, and a dynamic structure that provides $O((1+\frac{\log\sigma}{\log\log n})\log n)$ time operations in $n'\log\sigma + o(n'\log\sigma) + O(n)$ bits.

Sunho Lee, Kunsoo Park
Most Burrows-Wheeler Based Compressors Are Not Optimal

We present a technique for proving lower bounds on the compression ratio of algorithms which are based on the Burrows-Wheeler Transform (BWT). We study three well known BWT-based compressors: the original algorithm suggested by Burrows and Wheeler; BWT with distance coding; and BWT with run-length encoding. For each compressor, we show a Markov source such that for asymptotically-large text generated by the source, the compression ratio divided by the entropy of the source is a constant greater than 1. This constant is $2-\varepsilon$, 1.26, and 1.29 for the three compressors respectively. Our technique is robust, and can be used to prove similar claims for most BWT-based compressors (with a few notable exceptions). This stands in contrast to statistical compressors and Lempel-Ziv-style dictionary compressors, which have long been known to be optimal, in the sense that for any Markov source the compression ratio divided by the entropy of the source asymptotically tends to 1.

We experimentally corroborate our theoretical bounds. Furthermore, we compare BWT-based compressors to other compressors and show that for "realistic" Markov sources they indeed perform badly, and often worse than other compressors. This is in contrast with the well known fact that on English text, BWT-based compressors are superior to many other types of compressors.
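For reference, the transform these compressors share is easy to state. A quadratic teaching version of ours (production implementations derive it from a suffix array in linear time):

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations.

    Append a sentinel smaller than every symbol, sort all rotations,
    and read off the last column.  This is O(n^2 log n); suffix-array
    construction makes it O(n).
    """
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

The last column groups symbols by their following context, producing the long runs that Move-to-Front, distance coding, and run-length encoding then compress.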

Haim Kaplan, Elad Verbin

Session 4: Computational Biology I

Non-breaking Similarity of Genomes with Gene Repetitions

In this paper we define a new similarity measure, the non-breaking similarity, which is the complement of the famous breakpoint distance between genomes (in general, between any two sequences drawn from the same alphabet). When the two input genomes ${\cal G}$ and ${\cal H}$, drawn from the same set of $n$ gene families, contain gene repetitions, we consider the corresponding Exemplar Non-breaking Similarity problem (ENbS), in which we need to delete repeated genes in ${\cal G}$ and ${\cal H}$ such that the resulting genomes $G$ and $H$ have the maximum non-breaking similarity. We have the following results.

For the Exemplar Non-breaking Similarity problem, we prove that the Independent Set problem can be linearly reduced to this problem. Hence, ENbS does not admit any factor-$n^{1-\varepsilon}$ polynomial-time approximation unless P=NP. (Also, ENbS is W[1]-complete.)

We show that for several practically interesting cases of the Exemplar Non-breaking Similarity problem, there are polynomial time algorithms.
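For genomes without gene repetitions, both measures are straightforward to compute. A small sketch of ours, ignoring gene orientation (which a signed model would track):

```python
def breakpoints(g, h):
    """Number of adjacencies of g that are not adjacencies of h
    (unordered pairs, so [3, 4] and [4, 3] count as the same adjacency)."""
    adj = {frozenset(pair) for pair in zip(h, h[1:])}
    return sum(1 for pair in zip(g, g[1:]) if frozenset(pair) not in adj)

def non_breaking_similarity(g, h):
    """Complement of the breakpoint distance: adjacencies preserved."""
    return len(g) - 1 - breakpoints(g, h)
```

With gene repetitions the exemplar version must first choose which copies to delete, and that choice is what makes ENbS hard.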

Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang, Zhiyu Zhao, Binhai Zhu
A New and Faster Method of Sorting by Transpositions

Some of the classical comparisons of DNA molecules consist in computing rearrangement distances between them, i.e., a minimal number of rearrangements needed to change one molecule into another. One such rearrangement is the transposition. At this time, it is not known whether a polynomial time algorithm exists to compute the exact transposition distance between two permutations. In this article, we present a new and faster method of sorting by transpositions. While there do exist better algorithms with regard to distance approximation, our approach relies on a simpler structure, which makes for a significantly faster computation time while keeping an acceptably close approximation.

Maxime Benoît-Gagné, Sylvie Hamel
Finding Compact Structural Motifs

Protein structure motif detection is one of the fundamental problems in Structural Bioinformatics. Compared with sequence motifs, structural motifs are more sensitive in detecting the evolutionary relationships among proteins. A variety of algorithms have been proposed to attack this problem. However, they are either heuristic without theoretical performance guarantee, or inefficient for employing an exhaustive search strategy. Here, we study a reasonably restricted version of this problem: the compact structural motif problem. In this paper, we prove that this restricted version is still NP-hard, and we present a polynomial-time approximation scheme to solve it. To the best of our knowledge, this is the first approximation algorithm with a guarantee ratio for the protein structural motif problem.

Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, Jinbo Xu

Session 5: Computational Biology II

Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants

Detecting historical recombination is an important computational problem which has received great attention recently. Due to recombination, input sequences form a mosaic, where each input sequence is composed of segments from founder sequences. In this paper, we present improved algorithms for the problem of finding the minimum mosaic (a mosaic containing the fewest ancestral segments) of a set of recombinant sequences. This problem was first formulated in [15], where an exponential-time algorithm was described. It is also known that a restricted version of this problem (assuming recombination occurs only at predefined block boundaries) can be solved in polynomial time [15, 11]. We give a polynomial-time algorithm for a special case of the minimum (blockless) mosaic problem, and a practical algorithm for the general case. Experiments with our method show that it is practical in a range of data much larger than could be handled by the algorithm described in [15].

Yufeng Wu, Dan Gusfield
Computing Exact p-Value for Structured Motif

Extracting motifs from a set of DNA sequences is important in computational biology. Occurrence probability is a commonly used statistic to evaluate the statistical significance of a motif. A main problem is how to calculate the occurrence probability of the motif under a random model of DNA sequences efficiently and accurately. In this paper, we are interested in a particular motif model which is useful in the transcription process. This motif, called a structured motif, is composed of two motif words over the single-nucleotide alphabet, with fixed spacers between them. We present an efficient algorithm to calculate the exact occurrence probability of a structured motif in a given sequence. It is the first non-trivial algorithm to calculate the exact p-value for such kind of motifs.

Jing Zhang, Xi Chen, Ming Li

Session 6: Algorithmic Techniques II

Improved Sketching of Hamming Distance with Error Correcting

We address the problem of sketching the Hamming distance of data streams. We develop Fixable Sketches, which compare data streams or files and restore the differences between them. Our contribution: for two streams with Hamming distance bounded by $k$, we show a sketch of size $O(k\log n)$ with $O(\log n)$ processing time per new element in the stream, and we show how to restore all locations where the two streams differ in time linear in the sketch size. The probability of error is less than $1/n$.

Ely Porat, Ohad Lipsky
Deterministic Length Reduction: Fast Convolution in Sparse Data and Applications

In this paper a deterministic algorithm for the length reduction problem is presented. This algorithm enables a new tool for performing fast convolution of sparse data. The proposed algorithm performs the convolution in $O(n_1 \log^3 n_1)$ time, where $n_1$ is the number of non-zero values in $V_1$. This algorithm assumes that $V_1$ is given in advance, and that $V_2$ is given at run time.
Amihood Amir, Oren Kapah, Ely Porat
Guided Forest Edit Distance: Better Structure Comparisons by Using Domain-knowledge

We introduce the guided forest edit distance problem, which measures the similarity of two forests under the guidance of a third forest. We give an efficient algorithm for the problem. Our problem is a natural generalization of many important structure comparison problems such as the forest edit distance problem, constrained sequence alignment problem and the longest constrained common subsequence problem. Our algorithm matches the performance of the best known algorithms for these problems.

Zeshan Peng, Hing-fung Ting
Space-Efficient Algorithms for Document Retrieval

We study the Document Listing problem, where a collection $D$ of documents $d_1,\ldots,d_k$ of total length $\sum_i |d_i| = n$ is to be preprocessed, so that one can later efficiently list all the $\textrm{ndoc}$ documents containing a given query pattern $P$ of length $m$ as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem: with $O(n)$ time preprocessing, one can answer the queries in $O(m+\textrm{ndoc})$ time. In this paper, we improve the space requirement of Muthukrishnan's solution from $O(n\log n)$ bits to $|CSA| + 2n + n\log k\,(1 + o(1))$ bits, where $|CSA| \leq n\log|\Sigma|(1 + o(1))$ is the size of any suitable compressed suffix array (CSA), and $\Sigma$ is the underlying alphabet of documents. The time requirement depends on the CSA used; for example, the optimal $O(m+\textrm{ndoc})$ time can be obtained in some cases. For general $|\Sigma|$ and $k$, the time requirement becomes $O(m \log |\Sigma|+\textrm{ndoc} \log k)$. Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.

Niko Välimäki, Veli Mäkinen

Session 7: Data Compression II

Compressed Text Indexes with Fast Locate

Compressed text (self-)indexes have matured up to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, supports indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text. Yet, they are still hundreds to thousands of times slower when it comes to locating those occurrences in the text. In this paper we introduce a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes. In addition, our index permits a very efficient secondary-memory implementation, where compression reduces the amount of I/O needed to answer queries.

Rodrigo González, Gonzalo Navarro
Processing Compressed Texts: A Tractability Border

What kind of operations can we perform effectively (without full unpacking) with compressed texts? In this paper we consider three fundamental problems: (1) check the equality of two compressed texts, (2) check whether one compressed text is a substring of another compressed text, and (3) compute the number of different symbols (Hamming distance) between two compressed texts of the same length.

We present an algorithm that solves the first problem in $O(n^3)$ time and the second problem in $O(n^2 m)$ time. Here $n$ is the size of the compressed representation (we consider representations by straight-line programs) of the text and $m$ is the size of the compressed representation of the pattern. Next, we prove that the third problem is actually #P-complete. Thus, we indicate a pair of similar problems (equivalence checking, Hamming distance computation) that have radically different complexity on compressed texts. Our algorithmic technique used for problems (1) and (2) also helps in computing minimal periods and covers of compressed texts.

Yury Lifshits

Session 8: Computational Biology III

Common Structured Patterns in Linear Graphs: Approximation and Combinatorics

A linear graph is a graph whose vertices are linearly ordered. This linear ordering allows pairs of disjoint edges to be either preceding ($<$), nesting ($\sqsubset$) or crossing ($\between$). Given a family of linear graphs, and a non-empty subset $\mathcal{R} \subseteq \{<,\sqsubset,\between\}$, we are interested in the PBMCSP problem: find a maximum size edge-disjoint graph, with edge-pairs all comparable by one of the relations in $\mathcal{R}$, that occurs as a subgraph in each of the linear graphs of the family. In this paper, we generalize the framework of Davydov and Batzoglou by considering patterns comparable by all possible subsets $\mathcal{R} \subseteq \{<,\sqsubset,\between\}$. This is motivated by the fact that many biological applications require considering crossing structures, and by the fact that different combinations of the relations above give rise to different generalizations of natural combinatorial problems. Our results can be summarized as follows. We give tight hardness results for the PBMCSP problem for $\{<,\between\}$-structured patterns and $\{\sqsubset,\between\}$-structured patterns. Furthermore, we prove that the problem is approximable within ratios (i) … for $\{<,\between\}$-structured patterns, (ii) $k^{1/2}$ for $\{\sqsubset, \between\}$-structured patterns, and (iii) $\mathcal{O}(\sqrt{k \lg k})$ for $\{<,\sqsubset,\between\}$-structured patterns, where $k$ is the size of the optimal solution and … is the $k$-th harmonic number.

Guillaume Fertin, Danny Hermelin, Romeo Rizzi, Stéphane Vialette
Identification of Distinguishing Motifs

Motivation: Motif identification for sequences has many important applications in biological studies, e.g., diagnostic probe design, locating binding sites and regulatory signals, and potential drug target identification. There are two versions.

1. Single Group: given a group of $n$ sequences, find a length-$l$ motif that appears in each of the given sequences, where those occurrences of the motif are similar.

2. Two Groups: given two groups of sequences $B$ and $G$, find a length-$l$ (distinguishing) motif that appears in every sequence in $B$ and does not appear anywhere in the sequences in $G$.

Here the occurrences of the motif in the given sequences may contain errors. Currently, most existing programs can only handle the single-group case. Moreover, it is very difficult to use edit distance (allowing indels and replacements) for motif detection.

Results: (1) We propose a randomized algorithm for the one-group problem that can handle indels in the occurrences of the motif. (2) We give an algorithm for the two-groups problem. (3) Extensive simulations have been done to evaluate the algorithms.

WangSen Feng, Zhanyong Wang, Lusheng Wang
Algorithms for Computing the Longest Parameterized Common Subsequence

In this paper, we revisit the classic and well-studied longest common subsequence (LCS) problem and study some new variants, first introduced and studied by Rahman and Iliopoulos [Algorithms for Computing Variants of the Longest Common Subsequence Problem, ISAAC 2006]. Here we define a generalization of these variants, the longest parameterized common subsequence (LPCS) problem, and show how to solve it in $O(n^2)$ and … time. Furthermore, we show how to compute two variants of LCS, RELAG and RIFIG, in … time.
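The baseline these variants build on is the textbook quadratic LCS recurrence; a compact reference implementation of ours:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    dp[i][j] = LCS length of a[:i] and b[:j]; extend on a match,
    otherwise take the better of dropping one symbol from either side.
    """
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]
```

The parameterized variants constrain which symbol pairs may match; the same table structure carries over with a modified match condition.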

Costas S. Iliopoulos, Marcin Kubica, M. Sohel Rahman, Tomasz Waleń
Fixed-Parameter Tractability of the Maximum Agreement Supertree Problem
Extended Abstract

Given a ground set $L$ of labels and a collection of trees whose leaves are bijectively labelled by some elements of $L$, the Maximum Agreement Supertree problem (SMAST) is the following: find a tree $T$ on a largest label set $L' \subseteq L$ that homeomorphically contains every input tree restricted to $L'$. The problem finds applications in several fields, e.g. phylogenetics. In this paper we focus on the parameterized complexity of this NP-hard problem. We consider different combinations of parameters for SMAST, as well as particular cases, providing both FPT algorithms and intractability results.

Sylvain Guillemot, Vincent Berry

Session 9: Pattern Analysis

Two-Dimensional Range Minimum Queries

We consider the two-dimensional Range Minimum Query problem: for a static $(m \times n)$-matrix of size $N = mn$ which may be preprocessed, answer on-line queries of the form "where is the position of a minimum element in an axis-parallel rectangle?". Unlike the one-dimensional version of this problem, which can be solved in provably optimal time and space, the higher-dimensional case has received much less attention. The only result we are aware of is due to Gabow, Bentley and Tarjan [1], who solve the problem in $O(N\log N)$ preprocessing time and space and $O(\log N)$ query time. We present a class of algorithms which can solve the 2-dimensional RMQ problem with $O(kN)$ additional space, … preprocessing time and $O(1)$ query time for any $k > 1$, where … denotes the iterated application of $k+1$ logarithms. The solution converges towards an algorithm with … preprocessing time and space and $O(1)$ query time. All these algorithms are significant improvements over the previous results: query time is optimal, preprocessing time is quasi-linear in the input size, and space is linear. While this paper is of theoretical nature, we believe that our algorithms will turn out to have applications in different fields of computer science, e.g., in computational biology.

Amihood Amir, Johannes Fischer, Moshe Lewenstein
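As a point of reference for the query semantics, a brute-force baseline (not the paper's constant-time scheme) simply scans the rectangle:

```python
def rmq2d(matrix, r1, r2, c1, c2):
    """Position (row, col) of a minimum in the axis-parallel rectangle
    [r1..r2] x [c1..c2], inclusive. O(area) per query; the paper's
    algorithms answer this in O(1) after near-linear preprocessing."""
    return min((matrix[r][c], (r, c))
               for r in range(r1, r2 + 1)
               for c in range(c1, c2 + 1))[1]
```

Ties are broken by position here; any minimum position is a valid answer to the query as stated.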
Tiling Periodicity

We contribute to the combinatorics and algorithmics of words by introducing new types of periodicities in words. A tiling period of a word w is a partial word u such that w can be decomposed into several disjoint parallel copies of u; e.g., a⋄b is a tiling period of aabb. We investigate properties of tiling periodicities and design an algorithm working in O(n log(n) log log(n)) time which finds a tiling period of minimal size, the number of such periods, and their compact representation. The combinatorics of tiling periods differs significantly from that of classical full periods; for example, unlike in the classical case, the same word can have many different primitive tiling periods. We also consider a related new type of periods, called multi-periods in this paper. As a side product of the paper we solve an open problem posed by T. Harju.

Juhani Karhumäki, Yury Lifshits, Wojciech Rytter
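A candidate tiling period can be verified greedily: in any valid tiling, the leftmost not-yet-covered position must align with the first defined symbol of a fresh copy of u, so the placement of copies is forced. A minimal sketch (not the paper's algorithm), writing "?" for the hole symbol ⋄:

```python
def tiles(w, u, hole="?"):
    """Check whether the partial word u is a tiling period of w, i.e. whether w
    decomposes into disjoint parallel copies of u. Greedy placement: the
    leftmost uncovered position forces the shift of the next copy."""
    D = [i for i, c in enumerate(u) if c != hole]   # defined offsets of u
    cover = [False] * len(w)
    while not all(cover):
        p = cover.index(False)                      # leftmost uncovered position
        s = p - D[0]                                # only possible shift for next copy
        for d in D:
            j = s + d
            if j < 0 or j >= len(w) or cover[j] or w[j] != u[d]:
                return False                        # overlap, overflow, or mismatch
            cover[j] = True
    return True
```

For the example from the abstract, tiles("aabb", "a?b") succeeds with copies at shifts 0 and 1.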
Fast and Practical Algorithms for Computing All the Runs in a String

A repetition in a string x is a substring w = u^e of x with maximum e ≥ 2, where u is not itself a repetition in w. A run in x is a substring w = u^e u* of "maximal periodicity", where u^e is a repetition and u* is a maximum-length, possibly empty, proper prefix of u. A run may encode as many as |u| repetitions. The maximum number of repetitions in any string x = x[1..n] is well known to be Θ(n log n). In 2000 Kolpakov and Kucherov showed that the maximum number of runs in x is O(n); they also described a Θ(n)-time algorithm, based on Farach's Θ(n)-time suffix tree construction algorithm (STCA), Θ(n)-time Lempel-Ziv factorization, and Main's Θ(n)-time leftmost runs algorithm, to compute all the runs in x. Recently Abouelhoda et al. proposed a Θ(n)-time Lempel-Ziv factorization algorithm based on an "enhanced" suffix array, that is, a suffix array together with other supporting data structures. In this paper we introduce a collection of fast, space-efficient algorithms for computing all the runs in a string that appear in many circumstances to be superior to those previously proposed.

Gang Chen, Simon J. Puglisi, W. F. Smyth
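For small inputs, runs can be enumerated by brute force, one candidate period at a time. This quadratic sketch only illustrates the objects the linear-time algorithms compute; it is in no way the authors' method:

```python
def runs(x):
    """All runs as (start, end, period) triples over x[start:end], brute force.
    For each candidate period p, find maximal stretches with x[k] == x[k+p]
    spanning at least two full copies; record the smallest such p per span."""
    n = len(x)
    found = {}
    for p in range(1, n // 2 + 1):                  # candidate period, smallest first
        k = 0
        while k + p < n:
            if x[k] != x[k + p]:
                k += 1
                continue
            i = k                                   # left-maximal start of the stretch
            while k + p < n and x[k] == x[k + p]:
                k += 1
            end = k + p                             # x[i:end] has period p
            if end - i >= 2 * p and (i, end) not in found:
                found[(i, end)] = p
    return sorted((i, e, q) for (i, e), q in found.items())
```

On "aabaabaa" this reports the three squares "aa" and the period-3 run covering the whole string.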
Longest Common Separable Pattern Among Permutations

In this paper, we study the problem of finding the longest common separable pattern among several permutations. We first give a polynomial-time algorithm when the number of input permutations is fixed and next show that the problem is NP-hard for an arbitrary number of input permutations, even if these permutations are separable.

On the other hand, we show that the NP-hard problem of finding the longest common pattern between two permutations cannot be approximated better than within a ratio of √Opt (where Opt is the size of an optimal solution) when taking common patterns belonging to pattern-avoiding permutation classes.

Mathilde Bouvel, Dominique Rossin, Stéphane Vialette
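Classical pattern containment, the basic relation underlying this problem, can be tested naively by checking every subsequence for order-isomorphism. The sketch below is exponential and purely illustrative:

```python
from itertools import combinations

def contains(perm, patt):
    """Does perm contain patt as a (classical) pattern, i.e. a subsequence whose
    elements are in the same relative order? Brute force over all subsequences."""
    k = len(patt)
    shape = sorted(range(k), key=lambda i: patt[i])   # argsort = order type
    return any(sorted(range(k), key=lambda i: sub[i]) == shape
               for sub in combinations(perm, k))
```

Two sequences of distinct values are order-isomorphic exactly when their argsorts coincide, which is what the comparison checks.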

Session 10: Suffix Arrays and Trees

Suffix Arrays on Words

Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at word boundaries of a text T[1, n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves such optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1,2] and (compact) DAWGs [3,4]. Our solution inherits the simplicity and efficiency of suffix arrays, as compared with these other word-based indexes, and thus it foresees applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 20%. The space reduction of the final word-based suffix array also improves query time (i.e., fewer random-access binary-search steps), making queries faster by a factor of up to 3.

Paolo Ferragina, Johannes Fischer
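The index itself is easy to state: sort only the suffixes that start at word boundaries. A naive sketch follows; it uses quadratic-work comparison sorting rather than the optimal construction the abstract describes, and the single-space separator is an assumption for illustration:

```python
def word_suffix_array(text, sep=" "):
    """Word-based suffix array: the word-boundary positions of text, sorted by
    the lexicographic order of the suffixes starting there. Naive version."""
    starts = [0] + [i + 1 for i, c in enumerate(text)
                    if c == sep and i + 1 < len(text)]
    return sorted(starts, key=lambda i: text[i:])
```

Binary-searching this array answers word-aligned pattern queries while storing only k positions instead of n.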
Efficient Computation of Substring Equivalence Classes with Suffix Arrays

This paper considers enumeration of the substring equivalence classes introduced by Blumer et al. [1]. They used the equivalence classes to define an index structure called compact directed acyclic word graphs (CDAWGs). In text analysis, considering these equivalence classes is useful since they group together redundant substrings with essentially identical occurrences. In this paper, we present how to enumerate those equivalence classes using suffix arrays. Our algorithm uses rank and lcp arrays for traversing the corresponding suffix trees, but does not need any other additional data structure. The algorithm runs in time linear in the length of the input string. We show experimental results comparing the running times and space consumption of our algorithm with those of suffix-tree- and CDAWG-based approaches.

Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
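One of the Blumer et al. relations groups substrings that share the same set of ending positions in the text. A naive quadratic sketch of that grouping follows (the paper's algorithm instead achieves linear time via rank and lcp arrays):

```python
from collections import defaultdict

def equivalence_classes(s):
    """Group the distinct substrings of s by their sets of ending positions
    (right-equivalence in the sense of Blumer et al.). Naive enumeration."""
    by_ends = defaultdict(list)
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    for w in subs:
        ends = frozenset(j for j in range(len(w), len(s) + 1)
                         if s.startswith(w, j - len(w)))
        by_ends[ends].append(w)
    return [sorted(g) for g in by_ends.values()]
```

On "abab", for instance, "b" and "ab" fall into one class because both occur ending exactly at positions 2 and 4.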
A Simple Construction of Two-Dimensional Suffix Trees in Linear Time
(Extended Abstract)

The two-dimensional suffix tree of a matrix A is a compacted trie that represents all square submatrices of A. There exists a linear-time construction of two-dimensional suffix trees using the odd-even scheme. This algorithm has the drawback that its merging step is quite complicated. In this paper, we propose a new and simple algorithm to construct two-dimensional suffix trees in linear time by applying the skew scheme to square matrices. To do this, we present a simple algorithm to merge two Isuffix trees in linear time.

Dong Kyue Kim, Joong Chae Na, Jeong Seop Sim, Kunsoo Park
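The indexed objects can be made concrete by enumerating the square submatrices a two-dimensional suffix tree represents. This is a naive enumeration for illustration only, not a construction algorithm:

```python
def square_submatrices(A):
    """All square submatrices of a rectangular matrix A, as tuples of row
    tuples. A 2D suffix tree represents exactly this set, in compacted form."""
    m, n = len(A), len(A[0])
    out = set()
    for k in range(1, min(m, n) + 1):           # side length of the square
        for i in range(m - k + 1):
            for j in range(n - k + 1):
                out.add(tuple(tuple(A[i + r][j:j + k]) for r in range(k)))
    return out
```

A 2×2 matrix with distinct entries yields five squares: four 1×1 cells and the matrix itself.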
Backmatter
Metadata
Title
Combinatorial Pattern Matching
Edited by
Bin Ma
Kaizhong Zhang
Copyright year
2007
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-73437-6
Print ISBN
978-3-540-73436-9
DOI
https://doi.org/10.1007/978-3-540-73437-6