
2017 | Book

String Processing and Information Retrieval

24th International Symposium, SPIRE 2017, Palermo, Italy, September 26–29, 2017, Proceedings


About this book

This book constitutes the proceedings of the 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017, held in Palermo, Italy, in September 2017.
The 26 papers presented in this volume were carefully reviewed and selected from 71 submissions. They cover fundamental studies in string processing and information retrieval, as well as in computational biology.

Table of Contents

Frontmatter
Greedy Shortest Common Superstring Approximation in Compact Space
Abstract
Given a set of strings, the shortest common superstring problem is to find the shortest possible string that contains all the input strings. The problem is NP-hard, but a lot of work has gone into designing approximation algorithms for solving it. We present the first time- and space-efficient implementation of the classic greedy heuristic, which merges strings in decreasing order of overlap length. Our implementation runs in \(O(n \log \sigma )\) time and uses \(O(n \log \sigma )\) bits of space, where n is the total length of the input strings in characters and \(\sigma \) is the size of the alphabet. After index construction, a practical implementation of our algorithm uses roughly \(5 n \log \sigma \) bits of space and reasonable time for a real dataset consisting of DNA fragments.
Jarno Alanko, Tuukka Norri
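
For context, a minimal quadratic-time sketch of the classic greedy heuristic described in the abstract above: repeatedly merge the two strings with the longest overlap. Function names and the toy dataset are illustrative only; the paper's contribution, running this heuristic in compact space, is not reflected here.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(strings):
    """Naive greedy shortest-common-superstring approximation."""
    # Remove strings contained in others; they never change the answer.
    strings = [s for s in strings
               if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        # Find the pair with the longest overlap and merge it.
        best = (-1, 0, 1)                      # (overlap length, index i, index j)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""

print(greedy_scs(["ATTAGA", "AGACCT", "CCTGG"]))   # ATTAGACCTGG
```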
Longest Common Factor After One Edit Operation
Abstract
It is well known that the longest common factor (LCF) of two strings over an integer alphabet can be computed in time linear in the total length of the two strings. Our aim here is to present an algorithm that preprocesses two strings S and T in order to answer the following type of queries: Given a position i on S and a letter \(\alpha \), return an LCF of T and \(S'\), where \(S'\) is the string resulting from S after substituting S[i] with \(\alpha \). In what follows, we present an algorithm that, given two strings of length at most n, constructs in \(\mathcal {O}(n \log ^4 n)\) expected time a data structure of \(\mathcal {O}(n\log ^3 n)\) space that answers such queries in \(\mathcal {O}(\log ^3 n)\) time per query. After some trivial modifications, our approach can also support the case of single letter insertions or deletions in S.
Amihood Amir, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis, Jakub Radoszewski
Distinct Squares in Circular Words
Abstract
A circular word, or a necklace, is an equivalence class under conjugation of a word. A fundamental question concerning regularities in standard words is bounding the number of distinct squares in a word of length n. The famous conjecture attributed to Fraenkel and Simpson is that there are at most n such distinct squares, yet the best known upper bound is 1.84n by Deza et al. [Discr. Appl. Math. 180, 52–69 (2015)]. We consider a natural generalization of this question to circular words: how many distinct squares can there be in all cyclic rotations of a word of length n? We prove an upper bound of 3.14n. This is complemented with an infinite family of words implying a lower bound of 1.25n.
Mika Amit, Paweł Gawrychowski
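
A brute-force baseline for the quantity bounded above, feasible only for short words: every factor of length at most n of a rotation of w is a factor of ww, so it suffices to scan ww from the n starting positions. All names are illustrative.

```python
def distinct_squares_in_rotations(w):
    """Distinct squares (factors of the form uu) over all cyclic rotations of w."""
    n = len(w)
    doubled = w + w          # every factor of length <= n of a rotation occurs here
    squares = set()
    for i in range(n):                        # starting position in doubled
        for half in range(1, n // 2 + 1):     # a square uu must fit in one rotation
            u = doubled[i:i + half]
            if u == doubled[i + half:i + 2 * half]:
                squares.add(u + u)
    return squares

squares = distinct_squares_in_rotations("abaab")
print(len(squares), sorted(squares))          # 3 ['aa', 'abab', 'baba']
```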
LZ78 Compression in Low Main Memory Space
Abstract
We present the first algorithms that perform the LZ78 compression of a text of length n over alphabet \([1..\sigma ]\), whose output is z integers, using only \(O(z\lg \sigma )\) bits of main memory. The algorithms read the input text from disk in a single pass, and write the compressed output to disk. The text can also be decompressed within the same main memory usage, which is unprecedented too. The algorithms are based on hashing and, under some simplifying assumptions, run in O(n) expected time. We experimentally verify that our algorithms use 2–9 times less time and/or space than previously implemented LZ78 compressors.
Diego Arroyuelo, Rodrigo Cánovas, Gonzalo Navarro, Rajeev Raman
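
As a point of reference, a textbook in-memory LZ78 parser and decompressor; the paper's algorithms produce the same parsing in a single disk pass using only \(O(z\lg \sigma )\) bits of main memory, which this sketch does not attempt.

```python
def lz78_compress(text):
    """Textbook LZ78: returns a list of (previous-phrase index, extension char) pairs."""
    dictionary = {"": 0}               # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch               # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                         # flush a pending phrase at end of input
        output.append((dictionary[phrase], ""))
    return output

def lz78_decompress(pairs):
    phrases = [""]
    out = []
    for idx, ch in pairs:
        phrases.append(phrases[idx] + ch)
        out.append(phrases[-1])
    return "".join(out)

pairs = lz78_compress("abababababa")
print(pairs)
print(lz78_decompress(pairs) == "abababababa")   # True
```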
On Two LZ78-style Grammars: Compression Bounds and Compressed-Space Computation
Abstract
We investigate two closely related LZ78-based compression schemes: LZMW (an old scheme by Miller and Wegman) and LZD (a recent variant by Goto et al.). Both LZD and LZMW naturally produce a grammar for a string of length n; we show that the size of this grammar can be larger than the size of the smallest grammar by a factor \(\varOmega (n^{\frac{1}{3}})\) but is always within a factor \(O((\frac{n}{\log n})^{\frac{2}{3}})\). In addition, we show that the standard algorithms using \(\varTheta (z)\) working space to construct the LZD and LZMW parsings, where z is the size of the parsing, work in \(\varOmega (n^{\frac{5}{4}})\) time in the worst case. We then describe a new Las Vegas LZD/LZMW parsing algorithm that uses \(O (z \log n)\) space and \(O(n + z \log ^2 n)\) time w.h.p.
Golnaz Badkobeh, Travis Gagie, Shunsuke Inenaga, Tomasz Kociumaka, Dmitry Kosolobov, Simon J. Puglisi
On Suffix Tree Breadth
Abstract
The suffix tree—the compacted trie of all the suffixes of a string—is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes \(\nu _S(d)\) can there be at (string) depth d in its suffix tree? We prove \(\nu (n,d)=\max _{S\in \varSigma ^n} \nu _S(d)\) is \(O((n/d)\log n)\), and show that this bound is almost tight, describing strings for which \(\nu _S(d)\) is \(\varOmega ((n/d)\log (n/d))\).
Golnaz Badkobeh, Juha Kärkkäinen, Simon J. Puglisi, Bella Zhukova
Pattern Matching on Elastic-Degenerate Text with Errors
Abstract
An elastic-degenerate string is a sequence of n sets of strings of total length N. It has been introduced to represent a multiple alignment of several closely-related sequences (e.g. pan-genome) compactly. In this representation, substrings of these sequences that match exactly are collapsed, while in positions where the sequences differ, all possible variants observed at that location are listed. The natural problem that arises is finding all matches of a deterministic pattern of length m in an elastic-degenerate text. There exists an \(\mathcal {O}(nm^2 + N)\)-time algorithm to solve this problem on-line after a pre-processing stage with time and space \(\mathcal {O}(m)\). In this paper, we study the same problem under the edit distance model and present an \(\mathcal {O}(k^2mG+kN)\)-time and \(\mathcal {O}(m)\)-space algorithm, where G is the total number of strings in the elastic-degenerate text and k is the maximum edit distance allowed. We also present a simple \(\mathcal {O}(kmG+kN)\)-time and \(\mathcal {O}(m)\)-space algorithm for Hamming distance.
Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone
Succinct Partial Sums and Fenwick Trees
Abstract
We consider the well-studied partial sums problem in succinct space, where one must maintain an array of n k-bit integers subject to updates such that partial sums queries can be answered efficiently. We present two succinct versions of the Fenwick Tree, which is known for its simplicity and practicality. Our results hold in the encoding model, where one is allowed to reuse the space from the input data. Our main result is the first that requires only \(nk + o(n)\) bits of space while still supporting sum/update in \(\mathcal {O}(\log _bn)\) / \(\mathcal {O}(b \log _bn)\) time, where \(2 \le b \le \log ^{\mathcal {O}(1)}n\). The second result shows how optimal time for sum/update can be achieved while only slightly increasing the space usage to \(nk + o(nk)\) bits. Beyond Fenwick Trees, the results are primarily based on bit-packing and sampling, making them very practical, and they also allow for simple optimal parallelization.
Philip Bille, Anders Roy Christiansen, Nicola Prezza, Frederik Rye Skjoldjensen
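
For background, a plain (non-succinct) Fenwick tree over an ordinary integer array; the succinct versions above keep the same query pattern while packing the structure into \(nk + o(n)\) bits. Indices are 0-based on the public interface and 1-based internally; this is a sketch, not the paper's data structure.

```python
class FenwickTree:
    """Classic binary indexed tree: point update and prefix-sum query in O(log n)."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)         # 1-based internal array
        for i, v in enumerate(values):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to element i (0-based)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i                        # jump to the next responsible node

    def prefix_sum(self, i):
        """Sum of elements 0..i-1 (0-based, exclusive end)."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i                        # strip the lowest set bit
        return s

ft = FenwickTree([5, 3, 7, 2])
ft.update(1, 4)                # array is now [5, 7, 7, 2]
print(ft.prefix_sum(3))        # 5 + 7 + 7 = 19
```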
Tight Bounds for Top Tree Compression
Abstract
We consider compressing labeled, ordered and rooted trees using DAG compression and top tree compression. We show that there exists a family of trees such that the size of the DAG compression is always a logarithmic factor smaller than the size of the top tree compression (even for an alphabet of size 1). The result settles an open problem from Bille et al. (Inform. and Comput., 2015).
Philip Bille, Finn Fernstrøm, Inge Li Gørtz
Efficient Compression and Indexing of Trajectories
Abstract
We present a new compressed representation of free trajectories of moving objects. It combines a partial-sums-based structure that retrieves in constant time the position of the object at any instant, with a hierarchical minimum-bounding-boxes representation that allows determining if the object is seen in a certain rectangular area during a time period. Combined with spatial snapshots at regular intervals, the representation is shown to outperform classical ones by orders of magnitude in space, and also to outperform previous compressed representations in time performance, when using the same amount of space.
Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, Gonzalo Navarro, José R. Paramá
Fast Construction of Compressed Web Graphs
Abstract
Several compressed graph representations have been proposed in the last 15 years. Today, all these representations are highly relevant in practice, since they make it possible to keep large-scale web and social graphs in the main memory of a single machine and consequently enable fast random access to nodes and edges.
While much effort was spent on finding space-efficient and fast representations, one issue was only partially addressed: developing resource-efficient construction algorithms. In this paper, we engineer the construction of regular and hybrid \(k^2\)-trees. We show that algorithms based on Z-order sorting reduce the memory footprint significantly and are at the same time faster than previous approaches. We also engineer a parallel version, which fully utilizes all CPUs and caches. We show the practicality of the latter version by constructing partitioned hybrid \(k^2\)-trees for web graphs on the scale of a billion nodes and up to 100 billion edges.
Jan Broß, Simon Gog, Matthias Hauck, Marcus Paradies
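
A small illustration of the Z-order (Morton) key such constructions sort edges by, assuming coordinates fit in 32 bits: interleaving the bits of the source and target of an edge groups the adjacency matrix into the quadrants that become \(k^2\)-tree nodes. The function names and the toy edge list are made up for the example.

```python
def interleave_bits(x, bits=32):
    """Spread the low `bits` bits of x so they occupy every other bit position."""
    r = 0
    for i in range(bits):
        r |= ((x >> i) & 1) << (2 * i)
    return r

def morton_key(row, col, bits=32):
    """Z-order key of matrix cell (row, col): bits of row and col interleaved."""
    return (interleave_bits(row, bits) << 1) | interleave_bits(col, bits)

# Sorting edges by Morton key lays them out in the order in which a k^2-tree
# (with k = 2) visits the quadrants of the adjacency matrix.
edges = [(3, 1), (0, 2), (1, 1), (2, 0)]
print(sorted(edges, key=lambda e: morton_key(*e)))   # [(1, 1), (0, 2), (2, 0), (3, 1)]
```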
Constructing a Consensus Phylogeny from a Leaf-Removal Distance (Extended Abstract)
Abstract
Understanding the evolution of a set of genes or species is a fundamental problem in evolutionary biology. The problem we study here takes as input a set of trees describing possibly discordant evolutionary scenarios for a given set of genes or species, and aims at finding a single tree that minimizes the leaf-removal distance to the input trees. This problem is a specific instance of the general consensus/supertree problem, widely used to combine or summarize discordant evolutionary trees. The problem we introduce is specifically tailored to address the case of discrepancies between the input trees due to the misplacement of individual taxa. Most supertree or consensus tree problems are computationally intractable, and we show that the problem we introduce is also NP-hard. We provide tractability results in the form of a 2-approximation algorithm and a parameterized algorithm with respect to the number of removed leaves. We also introduce a variant that minimizes the maximum number d of leaves that are removed from any input tree, and provide a parameterized algorithm for this problem with parameter d.
Cedric Chauve, Mark Jones, Manuel Lafond, Céline Scornavacca, Mathias Weller
Listing Maximal Independent Sets with Minimal Space and Bounded Delay
Abstract
An independent set is a set of nodes of a graph such that no two of them are adjacent. It is maximal if no node outside the set can be added to it. Listing the maximal independent sets of a graph can be applied, for example, to sample nodes belonging to different communities or clusters in network analysis and document clustering. The problem has a rich history, as it is related to maximal cliques, dominating sets, vertex covers and 3-colorings in graphs. We are interested in reducing the delay, which is the worst-case time between any two consecutively output solutions, and the memory footprint, which is the additional working space beyond the read-only input graph.
Alessio Conte, Roberto Grossi, Andrea Marino, Takeaki Uno, Luca Versari
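
For comparison, a classic exponential-time enumeration of maximal independent sets obtained by running plain Bron-Kerbosch on the complement graph; unlike the algorithm studied above, it offers neither bounded delay nor a small memory footprint. The graph encoding is illustrative.

```python
def maximal_independent_sets(n, edges):
    """Yield every maximal independent set of the graph on vertices 0..n-1.

    Maximal independent sets of G are exactly the maximal cliques of the
    complement of G, so we run plain Bron-Kerbosch on the complement.
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    non_adj = [set(range(n)) - adj[v] - {v} for v in range(n)]   # complement graph

    def bron_kerbosch(r, p, x):
        if not p and not x:
            yield sorted(r)
        for v in list(p):
            yield from bron_kerbosch(r | {v}, p & non_adj[v], x & non_adj[v])
            p.remove(v)
            x.add(v)

    yield from bron_kerbosch(set(), set(range(n)), set())

# Path 0-1-2-3: the maximal independent sets are {0,2}, {0,3} and {1,3}.
for mis in maximal_independent_sets(4, [(0, 1), (1, 2), (2, 3)]):
    print(mis)
```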
Fast Label Extraction in the CDAWG
Abstract
The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from \(O(m\log {\log {n}})\) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from \(O(m\log {\log {n}}+\mathtt {occ})\) to \(O(m+\mathtt {occ})\) in the time needed to locate all the \(\mathtt {occ}\) occurrences of the pattern. We also reduce from \(O(k\log {\log {n}})\) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from \(O(m\log {\log {n}})\) to O(m) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
Djamal Belazzougui, Fabio Cunial
Lightweight BWT and LCP Merging via the Gap Algorithm
Abstract
Recently, Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a collection of strings. In this paper we show that their algorithm can be improved so that, in addition to the BWTs, it also merges the Longest Common Prefix (LCP) arrays. Because of its small memory footprint this new algorithm can be used for the final merge of BWT and LCP arrays computed by a faster but memory intensive construction algorithm.
Lavinia Egidi, Giovanni Manzini
Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries
Abstract
We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar.
Johannes Fischer, Dominik Köppl
Regular Abelian Periods and Longest Common Abelian Factors on Run-Length Encoded Strings
Abstract
Two strings are considered Abelian equivalent if one is a permutation of the other. We deal with two problems from Abelian stringology: computing regular Abelian periods of a given string and computing the longest common Abelian factor (LCAF) of two given strings. For the former problem our solution works in \(O(n\log m)\) time, where m is the length of the run-length encoded string, which improves the O(nm)-time result from [5]. For LCAF we propose two solutions, one working in \(O(n+m^4)\) time and O(n) space, the other requiring \(O(n^{3/2}\sigma \sqrt{m\log n})\) time and \(O(n\sigma )\) space (for \(m = O(n / \log n)\)).
Szymon Grabowski
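
A brute-force reference for the LCAF problem on plain (not run-length encoded) strings, usable only for small inputs; all names are illustrative and none of the paper's improvements are reflected here.

```python
from collections import Counter

def lcaf(s, t):
    """Length of the longest common Abelian factor of s and t (brute force).

    Two factors are Abelian-equivalent when one is a permutation of the other,
    i.e. they have the same Parikh vector (multiset of letter counts).
    """
    def parikh_vectors(w, length):
        return {frozenset(Counter(w[i:i + length]).items())
                for i in range(len(w) - length + 1)}

    for length in range(min(len(s), len(t)), 0, -1):
        if parikh_vectors(s, length) & parikh_vectors(t, length):
            return length
    return 0

print(lcaf("aabca", "cabb"))   # 3: "abc" is a permutation of "cab"
```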
Mining Bit-Parallel LCS-length Algorithms
Abstract
Some of the most efficient algorithms for computing the length of a longest common subsequence (LLCS) between two strings are based on so-called “bit-parallelism”. They achieve \(O(\lceil m/w \rceil n)\) time, where m and n are the string lengths and w is the computer word size. The first such algorithm was presented by Allison and Dix [3] and performs 6 bit-vector operations per step. The number of operations per step has later been improved to 5 by Crochemore et al. [5] and to 4 by Hyyrö [6]. In this short paper we explore whether further improvement is possible. We find that under fairly reasonable assumptions, the LLCS problem requires at least 4 bit-vector operations per step. As a byproduct we also present five new 4-operation bit-parallel LLCS algorithms.
Heikki Hyyrö
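
For reference, a Python transcription of one classic 4-operation bit-vector step of the kind discussed above (one AND, one addition, one subtraction and one OR per text character), with Python's unbounded integers standing in for w-bit machine words; it reproduces the LLCS value, but not the word-RAM speed.

```python
def llcs_bitparallel(a, b):
    """Length of a longest common subsequence of a and b, bit-parallel style.

    One bit per position of a; each character of b is processed with four
    bit-vector operations on the state vector v.
    """
    m = len(a)
    mask = (1 << m) - 1
    # Match vector: bit i of match[c] is set iff a[i] == c.
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)

    v = mask                        # all ones: nothing matched yet
    for c in b:
        u = v & match.get(c, 0)     # positions of a that can extend the LCS
        v = ((v + u) | (v - u)) & mask
    # Zero bits of v mark matched positions of a.
    return m - bin(v).count("1")

print(llcs_bitparallel("abcbdab", "bdcaba"))   # 4 (e.g. "bcba")
```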
Practical Implementation of Space-Efficient Dynamic Keyword Dictionaries
Abstract
A keyword dictionary is an associative array with string keys. Although it is a classical data structure, recent applications require managing massive string data with an in-memory keyword dictionary, so space-efficient implementations are very important. If limited to static applications, there are a number of very compact dictionary implementations; however, existing dynamic implementations consume much more space than static ones. In this paper, we propose a new practical implementation of space-efficient dynamic keyword dictionaries. Our implementation uses path decomposition, originally proposed for constructing cache-friendly trie structures, in a different way: for dynamic construction in compact space. Using experiments on real-world datasets, we show that our implementation can construct keyword dictionaries in up to 2.8x less space than the most compact existing dynamic implementation.
Shunsuke Kanda, Kazuhiro Morita, Masao Fuketa
Faster Practical Block Compression for Rank/Select Dictionaries
Abstract
This paper presents faster practical encoding and decoding procedures for block compression that underlies both static and dynamic compressed rank/select bitmaps based on the RRR (Raman, Raman, and Rao) scheme. Our procedures use a novel combination of universal tables for chunkwise processing. Experimental results showed that our procedures were faster than existing ones on 64-bit blocks.
Yusaku Kaneta
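
For background, a direct, table-free implementation of the (class, offset) block code that RRR-style bitmaps rely on, assuming small blocks: the class of a block is its popcount, and the offset is the block's rank among all blocks of that class, computed with binomial coefficients. The paper's procedures replace this bit-by-bit loop with chunkwise universal-table lookups on 64-bit blocks.

```python
from math import comb

def rrr_encode(block, b):
    """Encode a b-bit block (given as an int) into its (class, offset) pair."""
    cls = bin(block).count("1")        # class = number of set bits
    offset, remaining = 0, cls
    for pos in range(b - 1, -1, -1):   # scan bits from most significant to least
        if (block >> pos) & 1:
            offset += comb(pos, remaining)   # blocks of this class with a 0 here come first
            remaining -= 1
    return cls, offset

def rrr_decode(cls, offset, b):
    """Inverse of rrr_encode: rebuild the b-bit block from (class, offset)."""
    block, remaining = 0, cls
    for pos in range(b - 1, -1, -1):
        if remaining > 0 and offset >= comb(pos, remaining):
            offset -= comb(pos, remaining)
            block |= 1 << pos
            remaining -= 1
    return block

b = 8
for block in (0b00000000, 0b00010110, 0b11111111):
    cls, off = rrr_encode(block, b)
    assert rrr_decode(cls, off, b) == block
    print(f"{block:08b} -> class {cls}, offset {off}")
```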
Optimal Skeleton Huffman Trees
Abstract
A skeleton Huffman tree is a Huffman tree from which all complete subtrees of depth \(h \ge 1\) have been pruned. Skeleton Huffman trees are used to save storage and enhance processing time in several applications such as decoding, compressed pattern matching and Wavelet trees for random access. However, the straightforward way of basing the construction of a skeleton tree on a canonical Huffman tree does not necessarily yield the least number of nodes. The notion of optimal skeleton trees is introduced, and an algorithm for achieving such trees is investigated. The resulting more compact trees can be used to further enhance the time and space complexities of the corresponding algorithms.
Shmuel T. Klein, Tamar C. Serebro, Dana Shapira
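
For context, a standard Huffman construction that yields the codeword lengths from which a canonical Huffman tree, and then a skeleton tree, can be derived; the pruning of complete subtrees and the optimal skeleton construction studied above are not shown. The example frequencies are arbitrary.

```python
import heapq

def huffman_code_lengths(freqs):
    """Codeword length per symbol under a standard Huffman code.

    freqs: dict mapping symbol -> frequency. Returns dict symbol -> bit length.
    """
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                         # a single symbol gets a one-bit code
        return {sym: 1 for sym in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)        # merge the two lightest subtrees
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_code_lengths({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
# {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}
```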
Detecting One-Variable Patterns
Abstract
Given a pattern \(p = s_1x_1s_2x_2\cdots s_{r-1}x_{r-1}s_r\) such that \(x_1,x_2,\ldots ,x_{r-1}\in \{x,\overleftarrow{x}\}\), where x is a variable and \(\overleftarrow{x}\) its reversal, and \(s_1,s_2,\ldots ,s_r\) are strings that contain no variables, we describe an algorithm that constructs in O(rn) time a compact representation of all P instances of p in an input string of length n over a polynomially bounded integer alphabet, so that one can report those instances in O(P) time.
Dmitry Kosolobov, Florin Manea, Dirk Nowotka
Order Preserving Pattern Matching on Trees and DAGs
Abstract
The order preserving pattern matching (OPPM) problem is, given a pattern string p and a text string t, find all substrings of t which have the same relative orders as p. In this paper, we consider two variants of the OPPM problem where a set of text strings is given as a tree or a DAG. We show that the OPPM problem for a single pattern p of length m and a text tree T of size N can be solved in \(O(m+N)\) time with O(m) working space if the characters of p are drawn from an integer alphabet of polynomial size. The time complexity becomes \(O(m \log m + N)\) if the pattern p is over a general ordered alphabet. We then show that the OPPM problem for a single pattern and a text DAG is NP-complete.
Temma Nakamura, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
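
A brute-force reference for the order-preserving matching condition on a plain text string (not on trees or DAGs), assuming numeric values: it checks order-isomorphism pairwise in \(O(m^2)\) time per window. Names and the toy data are illustrative.

```python
def order_isomorphic(x, y):
    """True iff x and y have the same relative order at every pair of positions."""
    if len(x) != len(y):
        return False
    return all((x[i] < x[j]) == (y[i] < y[j]) and (x[i] == x[j]) == (y[i] == y[j])
               for i in range(len(x)) for j in range(len(x)))

def oppm_naive(pattern, text):
    """Start positions of all substrings of text that are order-isomorphic to pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if order_isomorphic(pattern, text[i:i + m])]

# Pattern (1, 4, 2) matches any window whose values go "low, high, middle".
print(oppm_naive((1, 4, 2), (3, 10, 5, 7, 20, 9, 1)))   # [0, 3]
```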
A Self-index on Block Trees
Abstract
The Block Tree is a recently proposed data structure that reaches compression close to Lempel-Ziv while supporting efficient direct access to text substrings. In this paper we show how a self-index can be built on top of a Block Tree so that it provides efficient pattern searches while using space proportional to that of the original data structure. More precisely, if a Lempel-Ziv parse cuts a text of length n into z non-overlapping phrases, then our index uses \(O(z\lg (n/z))\) words and finds the occ occurrences of a pattern of length m in time \(O(m^2\lg n+occ\lg ^\epsilon n)\) for any constant \(\epsilon >0\).
Gonzalo Navarro
Counting Palindromes in Substrings
Abstract
We propose a data structure and an online algorithm to report the number of distinct palindromes in any substring of an input string. Assume that the string S of length n arrives symbol-by-symbol and every symbol is followed by zero or more queries of the form “report the number of distinct palindromes in S[i..j]”. We use \(O(n\log n)\) total time to process the string plus \(O(\log n)\) time per query. The required space is \(O(n\log n)\) in general and O(n) in a natural particular case. As a simple application, we describe an algorithm reporting all palindromic rich substrings of an input string in \(O(n\log n)\) time and O(n) space.
Mikhail Rubinchik, Arseny M. Shur
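
A brute-force baseline for the query supported above, practical only for short strings: collect the distinct palindromic factors of the queried substring directly. The data structure of the paper answers the same count in \(O(\log n)\) time per query after \(O(n\log n)\) preprocessing.

```python
def count_distinct_palindromes(s, i, j):
    """Number of distinct palindromic factors of s[i..j] (0-based, inclusive)."""
    window = s[i:j + 1]
    palindromes = set()
    for start in range(len(window)):
        for end in range(start + 1, len(window) + 1):
            factor = window[start:end]
            if factor == factor[::-1]:
                palindromes.add(factor)
    return len(palindromes)

s = "abaab"
print(count_distinct_palindromes(s, 0, 4))   # 5: a, b, aa, aba, baab
```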
Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression
Abstract
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with \(O(\tilde{e}_T \log n)\) bits of space allowing for \(O(\log n)\)-time random and O(1)-time sequential accesses to edge labels, and \(O(m \log \sigma + occ)\)-time pattern matching. Here, \(\tilde{e}_T\) is the number of all extensions of maximal repeats in T, n and m are respectively the lengths of the text T and a given pattern, \(\sigma \) is the alphabet size, and \( occ \) is the number of occurrences of the pattern in T. The repetitiveness measure \(\tilde{e}_T\) is known to be much smaller than the text length n for highly repetitive text. For constant alphabets, our L-CDAWGs achieve \(O(m + occ )\) pattern matching time with \(O(e_T^r \log n)\) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of \(\log \log n\), with the same space complexity. Here, \(e_T^r\) is the number of right extensions of maximal repeats in T. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size \(O(\tilde{e}_T)\) for a given text T in \(O(n + \tilde{e}_T \log \sigma )\) time.
Takuya Takagi, Keisuke Goto, Yuta Fujishige, Shunsuke Inenaga, Hiroki Arimura
Backmatter
Metadata
Title
String Processing and Information Retrieval
Edited by
Gabriele Fici
Marinella Sciortino
Dr. Rossano Venturini
Copyright Year
2017
Electronic ISBN
978-3-319-67428-5
Print ISBN
978-3-319-67427-8
DOI
https://doi.org/10.1007/978-3-319-67428-5
