
2018 | Book

String Processing and Information Retrieval

25th International Symposium, SPIRE 2018, Lima, Peru, October 9-11, 2018, Proceedings

Edited by: Travis Gagie, Alistair Moffat, Gonzalo Navarro, Ernesto Cuadros-Vargas

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 25th International Symposium on String Processing and Information Retrieval, SPIRE 2018, held in Lima, Peru, in October 2018.
The 22 full papers and 6 short papers presented were carefully reviewed and selected from 51 submissions. They focus on fundamental studies on string processing and information retrieval, as well as on computational biology.

Table of contents

Frontmatter
Recoloring the Colored de Bruijn Graph
Abstract
The colored de Bruijn graph, an extension of the de Bruijn graph, is routinely applied for variant calling, genotyping, genome assembly, and various other applications [11]. In this data structure, the edges are labeled with one or more colors from a set \(\{c_1, \dots , c_{\alpha } \}\) and are stored as an \(m \times \alpha \) matrix, where m is the number of edges. Recently, there has been a significant amount of work on compacted representations of this matrix, but all existing methods have focused on compressing the color matrix [3, 10, 12, 14]. In this paper, we explore the problem of recoloring the graph in order to reduce the number of colors and thus decrease the size of the color matrix. We show that finding the minimum number of colors needed for recoloring is not only NP-hard but also difficult to approximate within a reasonable factor. These hardness results motivate the recoloring heuristic that we present in this paper. Our results show that this heuristic reduces the number of colors by between one and two orders of magnitude. More specifically, when the number of colors is large (>5,000,000), our heuristic reduces it by a factor of 136. An implementation of this heuristic is publicly available at https://github.com/baharpan/cosmo/tree/Recoloring.
Bahar Alipanahi, Alan Kuhnle, Christina Boucher
Efficient Computation of Sequence Mappability
Abstract
Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices \(j \ne i\) such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches that compute a rough approximation of the result, or on the case of \(k=1\). We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in \(\mathcal {O}(n \min \{m^k,\log ^{k+1} n\})\) time and \(\mathcal {O}(n)\) space for \(k=\mathcal {O}(1)\). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show \(\mathcal {O}(n^2)\)-time algorithms to compute all results for a fixed m and all \(k=0,\ldots ,m\), or for a fixed k and all \(m=k,\ldots ,n-1\). Finally, we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for \(k,m = \varTheta (\log n)\) unless the Strong Exponential Time Hypothesis fails.
Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Juliusz Straszyński
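As an aside for readers new to the notion, the following brute-force sketch computes the (k, m)-mappability table exactly as defined in the abstract above; it runs in roughly O(n²m) time and is only meant to pin down the definition on small inputs, not to reflect the paper's algorithms. The example string is arbitrary.

def mappability(T: str, k: int, m: int) -> list[int]:
    # Entry i counts positions j != i whose length-m substring of T is
    # within Hamming distance k of the substring starting at i.
    windows = [T[i:i + m] for i in range(len(T) - m + 1)]
    table = []
    for i, wi in enumerate(windows):
        count = 0
        for j, wj in enumerate(windows):
            if i != j and sum(a != b for a, b in zip(wi, wj)) <= k:
                count += 1
        table.append(count)
    return table

print(mappability("abaababa", k=1, m=3))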
Longest Common Prefixes with k-Errors and Applications
Abstract
Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing, for each suffix of a given string of length n, the longest prefix of that suffix which occurs elsewhere in the string with at most k errors. This problem has already been studied under the Hamming distance model. Our first result improves upon the state-of-the-art average-case time complexity for non-constant k, using only linear space, under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in \(\mathcal {O}(n\frac{(c\log n)^k}{k!})\) time on average, where \(c>1\) is a constant, using \(\mathcal {O}(n)\) space. Finally, we show that our technique is applicable to several algorithmic problems found in computational biology and elsewhere. The importance of our technique lies in the fact that it is the first one achieving this bound for non-constant k and using \(\mathcal {O}(n)\) space.
Lorraine A. K. Ayad, Carl Barton, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis
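To make the quantity studied above concrete, here is a cubic-time brute-force sketch under the Hamming distance model: for each position i it reports the length of the longest prefix of the suffix starting at i that also occurs at some other position with at most k mismatches. It is an illustration of the definition only, far from the average-case bounds of the paper.

def longest_prefix_with_k_errors(x: str, k: int) -> list[int]:
    n = len(x)
    result = []
    for i in range(n):
        best = 0
        for j in range(n):
            if j == i:
                continue
            mismatches, length = 0, 0
            # extend the match at positions i and j while at most k mismatches
            while i + length < n and j + length < n:
                if x[i + length] != x[j + length]:
                    mismatches += 1
                    if mismatches > k:
                        break
                length += 1
            best = max(best, length)
        result.append(best)
    return result

print(longest_prefix_with_k_errors("abbabba", 1))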
Longest Property-Preserved Common Factor
Abstract
In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider two fundamental string properties: square-free factors and periodic factors under two different settings, one per property. In the first setting, we are given a string x and we are asked to construct a data structure over x answering the following type of on-line queries: given string y, find a longest square-free factor common to x and y. In the second setting, we are given k strings and an integer \(1 < k'\le k\) and we are asked to find a longest periodic factor common to at least \(k'\) strings. We present linear-time solutions for both settings. We anticipate that our paradigm can be extended to other string properties.
Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone
Adaptive Computation of the Discrete Fréchet Distance
Abstract
The discrete Fréchet distance is a measure of similarity between point sequences which abstracts away differences of resolution between the two curves, approximating the original Fréchet distance between curves. Such a distance between sequences of respective lengths n and m can be computed in time within O(nm) and space within \({O(n+m)}\) using classical dynamic programming techniques, a complexity likely to be optimal in the worst case over sequences of similar length unless the Strong Exponential Time Hypothesis is proved incorrect. We propose a parameterized analysis of the computational complexity of the discrete Fréchet distance as a function of the area of the dynamic programming matrix relevant to the computation, measured by its certificate width \(\omega \). We prove that the discrete Fréchet distance can be computed in time within \(O((n+m)\omega )\) and space within \(O(n+m)\).
Jérémy Barbay
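For reference, the classical dynamic program mentioned in the abstract can be written with two rolling rows, giving the O(nm) time and O(n + m) space baseline that the adaptive analysis refines; this sketch is that baseline, not the parameterized algorithm of the paper.

from math import hypot

def discrete_frechet(P, Q):
    # P and Q are lists of 2D points; d is the Euclidean distance.
    def d(p, q):
        return hypot(p[0] - q[0], p[1] - q[1])

    prev = [0.0] * len(Q)
    for j in range(len(Q)):
        prev[j] = max(d(P[0], Q[j]), prev[j - 1]) if j else d(P[0], Q[0])
    for i in range(1, len(P)):
        curr = [max(d(P[i], Q[0]), prev[0])] + [0.0] * (len(Q) - 1)
        for j in range(1, len(Q)):
            curr[j] = max(d(P[i], Q[j]),
                          min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]

print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (2, 1)]))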
Indexed Dynamic Programming to Boost Edit Distance and LCSS Computation
Abstract
There are efficient dynamic programming solutions to the computation of the Edit Distance from \(S\in [1..\sigma ]^n\) to \(T\in [1..\sigma ]^m\) for many natural subsets of edit operations, typically in time within O(nm) in the worst case over strings of respective lengths n and m (which is likely to be optimal), and in time within \(O(n+m)\) in some special cases (e.g., disjoint alphabets). We describe how indexing the strings (in linear time), and using such an index to refine the recurrence formulas underlying the dynamic programs, yield faster algorithms in a variety of models, on a continuum of classes of instances of intermediate difficulty between the worst and the best case, thus refining the analysis beyond the worst case. As a side result, we describe similar properties for the computation of the Longest Common Subsequence \(\mathtt {LCSS}(S,T)\) between S and T, since it is a particular case of Edit Distance, and we discuss the application of similar algorithmic and analysis techniques to other dynamic programming solutions. More formally, we propose a parameterized analysis of the computational complexity of the Edit Distance for various sets of operators and of the Longest Common Subsequence as a function of the area of the dynamic programming matrix relevant to the computation.
Jérémy Barbay, Andrés Olivares
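The dynamic programs being refined are the textbook ones; a compact reminder of both (unit-cost edit distance and longest common subsequence, each with rolling rows) follows. These are the plain O(nm) baselines, not the indexed variants proposed in the paper.

def edit_distance(S: str, T: str) -> int:
    prev = list(range(len(T) + 1))
    for i, s in enumerate(S, 1):
        curr = [i] + [0] * len(T)
        for j, t in enumerate(T, 1):
            curr[j] = min(prev[j] + 1,             # delete s
                          curr[j - 1] + 1,         # insert t
                          prev[j - 1] + (s != t))  # match or substitute
        prev = curr
    return prev[-1]

def lcs_length(S: str, T: str) -> int:
    prev = [0] * (len(T) + 1)
    for s in S:
        curr = [0] * (len(T) + 1)
        for j, t in enumerate(T, 1):
            curr[j] = prev[j - 1] + 1 if s == t else max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"), lcs_length("kitten", "sitting"))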
Compressed Communication Complexity of Longest Common Prefixes
Abstract
We consider the communication complexity of fundamental longest common prefix \(({{\mathrm{\textsc {Lcp}}}})\) problems. In the simplest version, two parties, Alice and Bob, each hold a string, A and B, and we want to determine the length of their longest common prefix \(\ell ={{\mathrm{\textsc {Lcp}}}}(A,B)\) using as few rounds and bits of communication as possible. We show that if the longest common prefix of A and B is compressible, then we can significantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication. Namely, if the longest common prefix has an LZ77 parse of z phrases, only \(O(\lg z)\) rounds and \(O(\lg \ell )\) total communication are necessary. We extend the result to the natural case where Bob holds a set of strings \(B_1, \ldots , B_k\) and the goal is to find the length of the longest prefix shared by A and any of \(B_1, \ldots , B_k\). Here, we give a protocol with \(O(\lg z)\) rounds and \(O(\lg z \lg k + \lg \ell )\) total communication. We present our results in the public-coin model of computation, but by a standard technique they generalize to the private-coin model. Furthermore, if we view the input strings as integers, the problems become the greater-than problem and the predecessor problem.
Philip Bille, Mikko Berggreen Ettienne, Roberto Grossi, Inge Li Gørtz, Eva Rotenberg
New Structures to Solve Aggregated Queries for Trips over Public Transportation Networks
Abstract
Representing the trajectories of mobile objects is a hot topic given the widespread use of smartphones and other GPS devices. However, few works have focused on representing trips over public transportation networks (buses, subway, and trains), where users' trips can be seen as a sequence of stages performed within a vehicle shared with many other users. In this context, representing vehicle journeys reduces the redundancy because all the passengers inside a vehicle share the same arrival time for each stop. In addition, each vehicle journey follows exactly the sequence of stops corresponding to its line, which makes it unnecessary to represent that sequence for each journey.
To address data management for transportation systems, we designed a conceptual model that gave us a better insight into this data domain and allowed us to define relevant terms and to detect sources of redundancy among those data. Then, we designed two compact representations focused on users' trips (\({\mathsf {TTCTR}}\)) and on vehicle trips (\({\mathsf {AcumM}}\)), respectively. Each approach has its own strengths and is able to answer certain queries efficiently.
We include experimental results over synthetic trips generated from accurate schedules obtained from a real network description (from the bus transportation system of Madrid) to show the space/time trade-off of both approaches. We considered a wide range of different queries about the use of the transportation network such as counting-based/aggregate queries regarding the load of any line of the network at different times.
Nieves R. Brisaboa, Antonio Fariña, Daniil Galaktionov, Tirso V. Rodeiro, M. Andrea Rodríguez
3DGraCT: A Grammar-Based Compressed Representation of 3D Trajectories
Abstract
Much research has been published about trajectory management on the ground or at sea, but compression and indexing of flight trajectories have usually been less explored. However, air traffic management is a challenge because airspace is becoming more and more congested, and large flight data collections must be preserved and exploited for varied purposes. This paper proposes 3DGraCT, a new method for representing these flight trajectories. It extends the GraCT compact data structure to cope with a third dimension (altitude), while retaining its space/time complexities. 3DGraCT improves the space requirements of traditional spatio-temporal data structures by two orders of magnitude, while remaining competitive for the considered types of queries and even leading the comparison for one of them.
Nieves R. Brisaboa, Adrián Gómez-Brandón, Miguel A. Martínez-Prieto, José Ramón Paramá
Towards a Compact Representation of Temporal Rasters
Abstract
Considerable research effort has been devoted to efficiently managing spatio-temporal data. However, most works have focused on vector data, and much less on raster data. This work presents a new representation for raster data that evolve over time, named Temporal \(\mathsf {k^2raster} \). It addresses the two main issues that arise when dealing with spatio-temporal data: space consumption and query response times. It extends a compact data structure for raster data to manage time, and thus it can be queried directly in compressed form, instead of following the classical approach that requires a complete decompression before any manipulation. In addition, in the same compressed space, the new data structure includes two indexes, a spatial index and an index on the values of the cells, thus becoming a self-index for raster data.
Ana Cerdeira-Pena, Guillermo de Bernardo, Antonio Fariña, José Ramón Paramá, Fernando Silva-Coira
On Extended Special Factors of a Word
Abstract
An extended special factor of a word x is a factor of x whose longest infix can be extended by at least two distinct letters to the left or to the right and still occur in x. It is called extended bispecial if it can be extended in both directions and still occur in x. Let \(\rho (n)\) be the maximum number of extended bispecial factors over all words of length n. Almirantis et al. have shown that \(2n - 6 \le \rho (n) \le 3n-4\) [WABI 2017]. In this article, we show that there is no constant \(c<3\) such that \(\rho (n) \le cn\). We then exploit the connection between extended special factors and minimal absent words to construct a data structure for computing minimal absent words of a specific length in optimal time for integer alphabets generalising a result by Fujishige et al. [MFCS 2016]. As an application of our data structure, we show how to compare two words over an integer alphabet in optimal time improving on another result by Charalampopoulos et al. [Inf. Comput. 2018].
Panagiotis Charalampopoulos, Maxime Crochemore, Solon P. Pissis
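Since the data structure above targets minimal absent words, a small brute-force illustration of that notion may help: a word is a minimal absent word of x if it does not occur in x but its longest proper prefix and longest proper suffix both do. The sketch enumerates candidates a·f·b built from factors f of x, bounded by an arbitrary maximum length; it is for intuition only.

def minimal_absent_words(x: str, max_len: int) -> set[str]:
    # All factors of x of length at most max_len (including the empty word).
    factors = {x[i:j] for i in range(len(x) + 1)
                      for j in range(i, min(len(x), i + max_len) + 1)}
    alphabet = sorted(set(x))
    maws = set()
    for f in factors:
        if len(f) + 2 > max_len:
            continue
        for a in alphabet:
            for b in alphabet:
                w = a + f + b
                # w is a MAW if it is absent while its longest proper
                # prefix (a + f) and suffix (f + b) both occur in x.
                if w not in factors and a + f in factors and f + b in factors:
                    maws.add(w)
    return maws

print(sorted(minimal_absent_words("abaab", max_len=4)))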
Truncated DAWGs and Their Application to Minimal Absent Word Problem
Abstract
The directed acyclic word graph (DAWG) of a string y of length n is the smallest (partial) DFA that recognizes all suffixes of y; it has O(n) nodes and edges. Na et al. [11] proposed the k-truncated suffix tree, a compressed trie that represents the substrings of a string of length up to k. In this paper, we present a new data structure called the k-truncated DAWG, which can be obtained by pruning the DAWG. We show that the size complexity of the k-truncated DAWG of a string y of length n is \(O(\min \{n,kz\})\), which is equal to that of the truncated suffix tree, where z is the size of the LZ77 factorization of y. We also present an \(O(n\log \sigma )\) time and \(O(\min \{ n,kz\})\) space algorithm for constructing the k-truncated DAWG of y, where \(\sigma \) is the alphabet size. As an application of truncated DAWGs, we show that the set \( MAW _k(y)\) of all minimal absent words of y whose length is at most k can be computed using the k-truncated DAWG of y in \(O(\min \{ n, kz\} + | MAW _k(y)|)\) time and \(O(\min \{ n,kz\})\) working space.
Yuta Fujishige, Takuya Takagi, Diptarama Hendrian
The Colored Longest Common Prefix Array Computed via Sequential Scans
Abstract
Due to the increased availability of large datasets of biological sequences, tools for sequence comparison now rely on efficient alignment-free approaches to a greater extent. Most alignment-free approaches require the computation of statistics when comparing sequences, even if such computations may not scale well in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows several problems to be tackled efficiently with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. Using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, which consists in comparing a single string against a collection of m strings simultaneously in order to obtain m ACS-induced distances. Experimental results confirm the high practical efficiency of our approach.
Fabio Garofalo, Giovanna Rosone, Marinella Sciortino, Davide Verzotto
Early Commenting Features for Emotional Reactions Prediction
Abstract
Nowadays, one of the main sources for people to access and read news is social media platforms. Different types of news trigger different emotional reactions in users, who may feel happy or sad after reading a news article. In this paper, we focus on the problem of predicting the emotional reactions that are triggered in users after they read a news post. In particular, we try to predict the number of emotional reactions that users express regarding a news post published on social media. We propose features extracted from users' comments published about a news post shortly after its publication to predict the triggered emotional reactions. We explore two different sets of features extracted from users' comments. The first group represents the activity of users in publishing comments, whereas the second refers to the comments' content. In addition, we combine the features extracted from the comments with textual features extracted from the news post. Our results show that features extracted from users' comments are very important for predicting the emotional reactions to news posts and that combining textual and commenting features can effectively address the problem of emotional reactions prediction.
Anastasia Giachanou, Paolo Rosso, Ida Mele, Fabio Crestani
Block Palindromes: A New Generalization of Palindromes
Abstract
We study a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and in particular, study substrings of a string which are block palindromes. In so doing, we introduce the notion of a maximal block palindrome, which leads to a compact representation of all block palindromes that occur in a string. We also propose an algorithm which enumerates all maximal block palindromes that appear in a given string \(T\) in \(O(|T| + \Vert MBP (T)\Vert )\) time, where \(\Vert MBP (T)\Vert \) is the output size, which is optimal unless all the maximal block palindromes can be represented in a more compact way.
Keisuke Goto, I Tomohiro, Hideo Bannai, Shunsuke Inenaga
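To illustrate the definition above, the following sketch greedily peels off the shortest non-empty prefix that equals a (non-overlapping) suffix, pairing them as mirrored blocks, until no such pair remains; whatever is left becomes the centre block. Replacing equal blocks by a single character then yields a palindrome. This is just one way to exhibit a block-palindrome factorization and is not the enumeration algorithm of the paper.

def block_palindrome_factorization(T: str) -> list[str]:
    left_blocks, right_blocks = [], []
    while T:
        for length in range(1, len(T) // 2 + 1):
            if T[:length] == T[-length:]:
                left_blocks.append(T[:length])
                right_blocks.append(T[-length:])
                T = T[length:-length]
                break
        else:  # no matching prefix/suffix pair: T is the centre block
            left_blocks.append(T)
            T = ""
    return left_blocks + right_blocks[::-1]

print(block_palindrome_factorization("tokyotokyo"))
print(block_palindrome_factorization("abcab"))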
Maximal Motif Discovery in a Sliding Window
Abstract
Motifs are relatively short sequences that are biologically significant, and their discovery in molecular sequences is a well-researched subject. A don’t care is a special letter that matches every letter in the alphabet. Formally, a motif is a sequence of letters of the alphabet and don’t care letters. A motif \(\tilde{m}_{d,k}\) that occurs at least k times in a sequence is maximal if it cannot be extended (to the left or right) nor can it be specialised (that is, its \(d' \le d\) don’t cares cannot be replaced with letters from the alphabet) without reducing its number of occurrences. Here we present a new dynamic data structure, and the first on-line algorithm, to discover all maximal motifs in a sliding window of length \(\ell \) on a sequence x of length n in \(\mathcal {O}(nd\ell + d\lceil \frac{\ell }{w}\rceil \cdot \sum _{i = \ell }^{n-1} |{\textsc {diff}}_{i-1}^{i}|)\) time, where w is the size of the machine word and \({\textsc {diff}}_{i-1}^{i}\) is the symmetric difference of the sets of occurrences of maximal motifs at \(x[i-\ell \mathinner {.\,.}i-1]\) and at \(x[i-\ell +1 \mathinner {.\,.}i]\).
Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, Fatima Vayani
Compressed Range Minimum Queries
Abstract
Given a string S of n integers in \([0,\sigma )\), a range minimum query \({\textsf {RMQ}}(i, j)\) asks for the index of the smallest integer in \(S[i \dots j]\). It is well known that the problem can be solved with a succinct data structure of size \(2n + o(n)\) and constant query-time. In this paper we show how to preprocess S into a compressed representation that allows fast range minimum queries. This allows for sublinear size data structures with logarithmic query time. The most natural approach is to use string compression and construct a data structure for answering range minimum queries directly on the compressed string. We investigate this approach using grammar compression. We then consider an alternative approach. Even if S is not compressible, its Cartesian tree necessarily is. Therefore, instead of compressing S using string compression, we compress the Cartesian tree of S using tree compression. We show that this approach can be exponentially better than the former, and is never worse by more than an \(O(\sigma )\) factor (i.e. for constant alphabets it is never asymptotically worse).
Seungbum Jo, Shay Mozes, Oren Weimann
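The Cartesian tree mentioned above is what makes compression of RMQ information possible: RMQ(i, j) on S equals the lowest common ancestor of nodes i and j in the (min-oriented) Cartesian tree of S, so the tree alone determines all answers. Below is the classical stack-based O(n) construction of that tree, as a plain uncompressed sketch rather than the tree-compressed structure of the paper.

def cartesian_tree(S):
    # Min-oriented Cartesian tree; ties broken so the leftmost minimum wins.
    n = len(S)
    parent, left, right = [-1] * n, [-1] * n, [-1] * n
    stack = []  # indices on the rightmost spine, values strictly increasing
    for i in range(n):
        last = -1
        while stack and S[stack[-1]] > S[i]:
            last = stack.pop()
        if last != -1:
            parent[last] = i
            left[i] = last
        if stack:
            parent[i] = stack[-1]
            right[stack[-1]] = i
        stack.append(i)
    return parent, left, right

parent, left, right = cartesian_tree([3, 1, 4, 1, 5, 9, 2, 6])
print(parent.index(-1), parent)  # index of the root, then the parent array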
Fast Wavelet Tree Construction in Practice
Abstract
The wavelet tree is a compact data structure that supports various types of operations on a sequence of n integers in \([0,\sigma )\). Although Munro et al. (SPIRE 2014 and Theoretical Computer Science 2016) and Babenko et al. (SODA 2015) showed that wavelet trees can be constructed in \(O(n\lceil \lg \sigma /\sqrt{\lg n}\rceil )\) time, there has been no empirical study of their construction methods, possibly due to their heavy use of precomputed tables, which seemingly limits their practicality. In this paper, we propose practical variants of their methods. Instead of using huge precomputed tables, we introduce new techniques based on broadword programming and special CPU instructions available on modern processors. Experiments on real-world texts demonstrated that our proposed methods were up to 2.2 and 4.5 times as fast as the naive ones for the wavelet tree and the wavelet matrix (a variant of wavelet trees), respectively, and up to 1.9 times as fast as a state-of-the-art method for the wavelet matrix.
Yusaku Kaneta
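As background, a plain pointer-based wavelet tree with a rank operation looks as follows; the paper is about constructing such trees quickly in practice, which this naive Python sketch makes no attempt to do.

class WaveletTree:
    # Sequence of integers in [lo, hi); each node splits the range in half
    # and stores one bit per element (0 = goes left, 1 = goes right).
    def __init__(self, seq, lo=0, hi=None):
        if hi is None:
            hi = (max(seq) + 1) if seq else 1
        self.lo, self.hi = lo, hi
        self.leaf = (hi - lo <= 1) or not seq
        if self.leaf:
            return
        mid = (lo + hi) // 2
        bits = [1 if x >= mid else 0 for x in seq]
        self.prefix = [0]
        for b in bits:                      # prefix sums of the bitvector
            self.prefix.append(self.prefix[-1] + b)
        self.left = WaveletTree([x for x in seq if x < mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x >= mid], mid, hi)

    def rank(self, c, i):
        # Number of occurrences of symbol c among the first i elements.
        if self.leaf:
            return i if self.lo <= c < self.hi else 0
        mid = (self.lo + self.hi) // 2
        ones = self.prefix[i]
        if c < mid:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)

wt = WaveletTree([3, 1, 4, 1, 5, 2, 2, 6])
print(wt.rank(1, 5), wt.rank(2, 8))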
Faster Recovery of Approximate Periods over Edit Distance
Abstract
The approximate period recovery problem asks to compute all approximate word-periods of a given word S of length n: all primitive words P (\(|P|=p\)) which have a periodic extension at edit distance smaller than \(\tau _p\) from S, where \(\tau _p = \lfloor \frac{n}{(3.75+\epsilon )\cdot p} \rfloor \) for some \(\epsilon >0\). Here, the set of periodic extensions of P consists of all finite prefixes of \(P^\infty \).
We improve the time complexity of the fastest known algorithm for this problem of Amir et al. [Theor. Comput. Sci., 2018] from \({\mathcal {O}}(n^{4/3})\) to \({\mathcal {O}}(n \log n)\). Our tool is a fast algorithm for Approximate Pattern Matching in Periodic Text. We consider only verification for the period recovery problem when the candidate approximate word-period P is explicitly given up to cyclic rotation; the algorithm of Amir et al. reduces the general problem in \({\mathcal {O}}(n)\) time to a logarithmic number of such more specific instances.
Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba
Searching for a Modified Pattern in a Changing Text
Abstract
Much attention has been devoted recently to the dynamic model of pattern matching, in which the input is updated or changed locally. One is interested in obtaining the appropriate search result in time shorter than the time necessary to search without any previous knowledge. In particular, searching for a pattern P in an indexed text is done in optimal O(|P|) time. Work has been done on searching for a fixed pattern in a dynamic text, as well as on finding all maximum common substrings of two strings in a dynamic setting.
There are real-world applications where the text is unchanged and the pattern is slightly modified at every query. However, in the current state of the art, a new search is required whenever the pattern is modified. In this paper we present an algorithm that reduces this search time to sublinear for a variety of types of pattern modification: in addition to the insertion, deletion, and replacement of symbols, we allow copy-paste and delete-substring operations.
We also make a step toward a fully dynamic pattern matching model by supporting text changes as well, albeit much more modest ones than the pattern changes. We support dynamic pattern matching where symbols may be added to or deleted from either the beginning or the end of the text. We show that we can support such a model in \(O(\log n)\) time for every pattern modification or text change. We can then report all occ occurrences of P in the text in O(occ) time.
Amihood Amir, Eitan Kondratovsky
Recovering, Counting and Enumerating Strings from Forward and Backward Suffix Arrays
Abstract
The suffix array \( SA _{w}\) of a string w of length n is a permutation of [1..n] such that \( SA _{w}[i] = j\) iff w[j..n] is the lexicographically i-th suffix of w. In this paper, we consider variants of the reverse-engineering problem on suffix arrays with two given permutations P and Q of [1..n], such that P refers to the forward suffix array of some string w and Q refers to the backward suffix array of the reversed string \(w^R\). Our results are the following: (1) An algorithm which computes a solution string over an alphabet of the smallest size, in O(n) time. (2) The exact number of solution strings over an alphabet of size \(\sigma \). (3) An efficient algorithm which computes all solution strings in lexicographical order, in time that is optimal up to a \(\log n\) factor.
Yuki Kuhara, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
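A toy instance of the forward/backward suffix array pair considered above can be produced naively as follows (1-based indices, O(n² log n) comparison sort); the example string is arbitrary, and the reverse-engineering task of the paper goes in the opposite direction, recovering strings from such a pair P, Q.

def suffix_array(w: str) -> list[int]:
    # SA[i] = j (1-based) iff w[j..n] is the lexicographically i-th suffix.
    return sorted(range(1, len(w) + 1), key=lambda j: w[j - 1:])

w = "banana"
P = suffix_array(w)        # forward suffix array of w
Q = suffix_array(w[::-1])  # taken here as the suffix array of the reversed string w^R
print(P, Q)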
Optimal In-Place Suffix Sorting
Abstract
The suffix array is a fundamental data structure for many applications that involve string searching and data compression. Designing time- and space-efficient suffix array construction algorithms has attracted significant attention, and considerable advances have been made over the past 20 years. We obtain the first in-place linear-time suffix array construction algorithms that are optimal in both time and space for (read-only) integer alphabets. Our algorithm settles the open problem posed by Franceschini and Muthukrishnan in ICALP 2007, which asked for in-place algorithms running in \(o(n\log n)\) time and, ultimately, in O(n) time for (read-only) integer alphabets with \(|\varSigma | \le n\). Our result is in fact slightly stronger since we allow \(|\varSigma |=O(n)\). Besides, we provide an optimal in-place \(O(n\log n)\) time suffix sorting algorithm for read-only general alphabets (i.e., only comparisons are allowed), recovering the result obtained by Franceschini and Muthukrishnan, which had been an open problem posed by Manzini and Ferragina in ESA 2002.
Zhize Li, Jian Li, Hongwei Huo
Computing Burrows-Wheeler Similarity Distributions for String Collections
Abstract
In this article we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution for all pairs of strings in a collection. Our algorithms take advantage of the Burrows-Wheeler transform (BWT) computed for the concatenation of all strings, instead of the pairwise construction of BWTs performed by the straightforward approach, and use compressed data structures that allow reductions of running time while still keeping a small memory footprint, as shown by a set of experiments with real datasets.
Felipe A. Louza, Guilherme P. Telles, Simon Gog, Liang Zhao
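For readers unfamiliar with the transform involved, a naive single-string BWT (via sorted rotations of s + '$') is sketched below; the paper instead computes one BWT for the concatenation of the whole collection and derives the pairwise similarity values from it, which this sketch does not cover.

def bwt(s: str, sentinel: str = "$") -> str:
    # Last column of the sorted rotation matrix of s + sentinel,
    # assuming the sentinel is lexicographically smallest and absent from s.
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))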
Better Heuristic Algorithms for the Repetition Free LCS and Other Variants
Abstract
In Discrete Applied Mathematics 2010, Adi et al. introduce and study a variant of the well-known Longest Common Subsequence problem, named Repetition Free Longest Common Subsequence (RFLCS). In RFLCS the input consists of two strings A and B over an alphabet \(\varSigma \), and the goal is to find the longest common subsequence containing only distinct characters from \(\varSigma \). Adi et al. prove that the problem is \(\mathcal {APX}\)-hard and present three approximation algorithms. Castelli et al. (Operations Research Letters 2013) propose a heuristic genetic algorithm, and Blum and Blesa introduce metaheuristic algorithms (International Conference on Artificial Evolution 2013 and Evolutionary Computation in Combinatorial Optimization 2016).
In this paper we design and test several new heuristic algorithms for RFLCS. The first algorithm uses dynamic programming and, in our testing setup, outperforms the algorithms of Adi et al. The second heuristic algorithm improves upon the first and becomes comparable to the state-of-the-art algorithms of Blum and Blesa. The third algorithm transforms the RFLCS instance into an instance of the Maximum Independent Set (MIS) problem with the same value of the optimum solution; we then apply known algorithms for the MIS problem. We also augment one of the approximation algorithms of Adi et al. and prove that it achieves an approximation factor of \(2\sqrt{\min \{|A|,|B|\}}\).
Finally, we introduce a new variant of the LCS problem, named Multiset Restricted Common Subsequence (MRCS), which is a generalization of RFLCS. We present an exact polynomial-time algorithm for MRCS for constant-size alphabets. Additionally, we show that MRCS admits a \(2\sqrt{\min \{|A|,|B|\}}\)-approximation.
Radu Stefan Mincu, Alexandru Popa
Linear-Time Online Algorithm Inferring the Shortest Path from a Walk
Abstract
We consider the problem of inferring an edge-labeled graph from the sequence of edge labels seen in a walk of that graph. It has been known that this problem is solvable in \(\mathrm {O}(n \log n)\) time when the target graphs are paths or cycles. This paper presents an online algorithm for this restricted case that runs in \(\mathrm {O}(n)\) time, based on Manacher's algorithm for computing all the maximal palindromes in a string.
Shintaro Narisada, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
Trickier XBWT Tricks
Abstract
A trie [11] is one of the best data structures for implementing and searching a dictionary. However, building the trie structure for larger collections of strings takes up a lot of memory. Since the eXtended Burrows-Wheeler Transform (XBWT) [8, 9] is able to compactly represent a labeled tree, it can naturally be used to succinctly represent a trie. The XBWT also supports navigational operations on the trie, but it does not support failure links. For example, the Aho-Corasick algorithm [1] for simultaneously searching for several patterns in a text achieves its good worst-case time complexity only with the aid of failure links. Manzini [18] showed that a balanced parentheses sequence P can be used to support failure links in constant time with only \(2n+o(n)\) bits of space, where n is the number of internal nodes in the trie. Besides practical algorithms that construct the XBWT, he also provided two different algorithms that construct P. In this paper, we suggest an alternative way of constructing P that outperforms the previous algorithms.
Enno Ohlebusch, Stefan Stauß, Uwe Baier
Fast and Effective Neural Networks for Translating Natural Language into Denotations
Abstract
In this paper we study the semantic parsing problem of mapping natural language utterances into machine-interpretable meaning representations. We consider a text-to-denotation application scenario in which a user interacts with a non-human assistant by entering a question, which is then translated into a logical structured query, and the result of running this query is returned as the response to the user. We propose encoder-decoder models that are trained end-to-end using the input questions and the corresponding logical structured queries. In order to ensure fast response times, our models do not condition the target string generation on previously generated tokens. We evaluate our models on real data obtained from a conversational banking chat service, and we show that conditionally independent translation models offer accuracy similar to that of sophisticated translation models while providing response times that are one order of magnitude faster.
Tiago Pimentel, Juliano Viana, Adriano Veloso, Nivio Ziviani
Faster and Smaller Two-Level Index for Network-Based Trajectories
Abstract
Two-level indexes have been widely used to handle trajectories of moving objects that are constrained to a network. The top level of these indexes handles the spatial dimension, whereas the bottom level handles the temporal dimension. The latter turns out to be an instance of the interval-intersection problem, but it has so far been tackled by non-specialized spatial indexes. In this work, we propose the use of a compact data structure at the bottom level of these indexes. Our experimental evaluation shows that our approach is both faster and smaller than existing solutions.
Rodrigo Rivera, M. Andrea Rodríguez, Diego Seco
Backmatter
Metadata
Title
String Processing and Information Retrieval
Edited by
Travis Gagie
Alistair Moffat
Gonzalo Navarro
Ernesto Cuadros-Vargas
Copyright year
2018
Electronic ISBN
978-3-030-00479-8
Print ISBN
978-3-030-00478-1
DOI
https://doi.org/10.1007/978-3-030-00479-8
