Skip to main content
Top

2025 | Book

String Processing and Information Retrieval

31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23–25, 2024, Proceedings

insite
SEARCH

About this book

This book constitutes the refereed proceedings of the 31st International Symposium on String Processing and Information Retrieval, SPIRE 2024, held in Puerto Vallarta, Mexico, during September 23–25, 2024.

The 22 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 41 submissions. The papers reflect the continuation of the long and well-established tradition of encouraging high-quality research at the broad nexus of string processing, information retrieval, and computational biology.

Table of Contents

Frontmatter
Linear Time Reconstruction of Parameterized Strings from Parameterized Suffix and LCP Arrays for Constant-Sized Alphabets
Abstract
A parameterized string (p-string) is a string that can contain two kinds of characters, static symbols and parameter characters. Parameterized pattern matching is a form of pattern matching that allows parameters to be renamed by applying a one-to-one function. The parameterized suffix array is a data structure that is useful in efficient parameterized pattern matching when accompanied by the parameterized longest common prefix (LCP) array. Reconstructing input from a given instance of a data structure is the task of determining whether the instance is valid or not, and if valid, producing a plausible set of data that it can represent. In this paper we consider parameterized suffix and LCP arrays and reconstruct a corresponding p-string that they can represent. In previous work, an algorithm can determine in \(O(n^2)\) time whether a p-string can be constructed to correspond to the input parameterized suffix and LCP arrays of size n. In this work, we develop an algorithm that accomplishes this in O(n) time for constant-sized alphabets, and \(O(n \log n)\) time for general alphabets. Furthermore, when reconstruction is possible, we demonstrate that a p-string can be reconstructed over the minimal alphabet in \(O(n^2)\) time.
Amihood Amir, Eitan Kondratovsky, Shoshana Marcus, Dina Sokol
Bijective BWT Based Compression Schemes
Abstract
We investigate properties of the bijective Burrows-Wheeler transform (BBWT). We show that for any string w, a bidirectional macro scheme of size \(O(r_B )\) can be induced from the BBWT of w, where \(r_B \) is the number of maximal character runs in the BBWT. We also show that \(r_B = O(z\log ^2 n)\), where n is the length of w and z is the number of Lempel-Ziv 77 factors of w. Then, we show a separation between BBWT and BWT by a family of strings with \(r_B = \varOmega (\log n)\) but having only \(r=2\) maximal character runs in the standard Burrows–Wheeler transform (BWT). However, we observe that the smallest \(r_B \) among all cyclic rotations of w is always at most \(r \). While an \(o(n^2)\) algorithm for computing an optimal rotation giving the smallest \(r_B\) is still open, we show how to compute the Lyndon factorizations – a component for computing BBWT – of all cyclic rotations in O(n) time. Furthermore, we conjecture that we can transform two strings having the same Parikh vector to each other by BBWT and rotation operations, and prove this conjecture for the case of binary alphabets and permutations.
Golnaz Badkobeh, Hideo Bannai, Dominik Köppl
Indexing Finite-State Automata Using Forward-Stable Partitions
Abstract
An index on a finite-state automaton is a data structure able to locate specific patterns on the automaton’s paths and consequently on the regular language accepted by the automaton itself. Cotumaccio and Prezza [SODA ’21], introduced a data structure able to solve pattern matching queries on automata, generalizing the famous FM-index for strings of Ferragina and Manzini [FOCS ’00]. The efficiency of their index depends on the width of a particular partial order of the automaton’s states, the smaller the width of the partial order, the faster is the index. However, computing the partial order of minimal width is NP-hard. This problem was mitigated by Cotumaccio [DCC ’22], who relaxed the conditions on the partial order, allowing it to be a partial preorder. This relaxation yields the existence of a unique partial preorder of minimal width that can be computed in polynomial time. In the paper at hand, we present a new class of partial preorders and show that they have the following useful properties: (i) they can be computed in polynomial time, (ii) their width is never larger than the width of Cotumaccio’s preorders, and (iii) there exist infinite classes of automata on which the width of Cotumaccio’s preorder is linearly larger than the width of our preorder.
Ruben Becker, Sung-Hwan Kim, Nicola Prezza, Carlo Tosoni
Burst Edit Distance
Abstract
In this paper we define two types of burst edit errors that occur in Text Editing scenarios when the communication speed is unstable within a wireless keyboard usage: (1) A Burst of Errors (BE) involves a sequence of erroneous identical symbols and allows a single edit operation applied to a sequence of identical symbols; (2) A Burst of Operations (BO) involves a sequence of erroneous symbols that are not necessarily identical and allows a single edit operation applied to a sequence of symbols. In both burst types, every burst operation has a penalty, which is a cost function F(k), where k is the burst length. The burst edit distance of two strings S and T is: (1) The minimum cost of a sequence of BE operations that transforms S into T in the bursts of errors variant (EDBE); (2) The minimum cost of a sequence of BO operations that transforms S into T in the bursts of operations variant (EDBO). We describe solutions to both problems for general natural penalty functions families. A conditional lower bound for the EDBE problem is also given. The \(\mathcal {K}\)-bounded versions of the problems are considered as well.
Itai Boneh, Shay Golan, Avivit Levy, Ely Porat, B. Riva Shalom
Generalization of Repetitiveness Measures for Two-Dimensional Strings
Abstract
Detecting and measuring repetitiveness of strings is a problem that has been extensively studied in data compression and text indexing. When the data are structured in a non-linear way, as in two-dimensional strings, inherent redundancy offers a rich source for compression, yet systematic studies on repetitiveness measures are still lacking. In this paper, we extend to two dimensions the measures \(\delta \) and \(\gamma \), defined in terms of the submatrices of the input, as well as the measures g, \(g_{rl}\), and b, which are based on copy-paste mechanisms. We study their properties and mutual relationships, and we show that the two classes of measures become incomparable when two-dimensional inputs are considered. We also compare our measures with the 2D Block Tree data structure [Brisaboa et al., Computer J., 2024], and provide some insights for the design of effective 2D compressors.
Lorenzo Carfagna, Giovanni Manzini, Giuseppe Romana, Marinella Sciortino, Cristian Urbina
On Computing the Smallest Suffixient Set
Abstract
Let \(T \in \varSigma ^n\) be a text over alphabet \(\varSigma \). A suffixient set \(\mathcal {S} \subseteq [n]\) for T is a set of positions such that, for every one-character right-extension T[ij] of every right-maximal substring \(T[i,j-1]\) of T, there exists \(x\in S\) such that T[ij] is a suffix of T[1, x]. It was recently shown that, given a suffixient set of cardinality q and an oracle offering fast random access on T (for example, a straight-line program), there is a data structure of O(q) words (on top of the oracle) that can quickly find all Maximal Exact Matches (MEMs) of any query pattern P in T with high probability. The paper introducing suffixient sets left open the problem of computing the smallest such set; in this paper, we solve this problem by describing a simple quadratic-time algorithm, a \(O(n + \bar{r}|\varSigma |)\)-time algorithm running in compressed working space (\(\bar{r}\) is the number of runs in the Burrows-Wheeler transform of T reversed), and an optimal O(n)-time algorithm computing the smallest suffixient set. We present an implementation of our compressed-space algorithm and show experimentally that it uses a small memory footprint on repetitive text collections.
Davide Cenzato, Francisco Olivares, Nicola Prezza
Revisiting the Folklore Algorithm for Random Access to Grammar-Compressed Strings
Abstract
Grammar-based compression is a widely-accepted model of string compression that allows for efficient and direct manipulations on the compressed data. Most, if not all, such manipulations rely on the primitive random access queries, a task of quickly returning the character at a specified position of the original uncompressed string without explicit decompression. While there are advanced data structures for random access to grammar-compressed strings that guarantee theoretical query time and space bounds, little has been done for the practical perspective of this important problem. In this paper, we revisit a well-known folklore random access algorithm for grammars in the Chomsky normal form, modify it to work directly on general grammars, and show that this modified version is fast and memory efficient in practice.
Alan M. Cleary, Joseph Winjum, Jordan Dood, Shunsuke Inenaga
Logarithmic-Time Internal Pattern Matching Queries in Compressed and Dynamic Texts
Abstract
Internal Pattern Matching (IPM) queries on a text T, given two fragments X and Y of T such that \(|Y|<2|X|\), ask to compute all exact occurrences of X within Y. IPM queries have been introduced by Kociumaka, Radoszewski, Rytter, and Waleń [SODA’15], who showed that they can be answered in \(\mathcal {O}(1)\) time using a data structure of size \(\mathcal {O}(n)\) and used this result to answer various queries about fragments of T. In this work, we study IPM queries on compressed and dynamic strings. Our result is an \(\mathcal {O}(\log n)\)-time query algorithm applicable to any balanced recompression-based run-length straight-line program (RLSLP). In particular, one can use it on top of the RLSLP of Kociumaka, Navarro, and Prezza [IEEE TIT’23], whose size \(\mathcal {O}(\delta \log \frac{n\log \sigma }{\delta \log n})\) is optimal (among all text representations) as a function of the text length n, the alphabet size \(\sigma \), and the substring complexity \(\delta \). Our procedure does not rely on any preprocessing of the underlying RLSLP, which makes it readily applicable on top of the dynamic strings data structure of Gawrychowski, Karczmarz, Kociumaka, Łącki and Sankowski [SODA’18], which supports fully persistent updates in logarithmic time with high probability.
Anouk Duyster, Tomasz Kociumaka
Bounded-Ratio Gapped String Indexing
Abstract
In the gapped string indexing problem, one is given a text \(T[1\mathinner {.\,.}n]\) to preprocess. At query time, a gapped pattern \(P = P_1[\alpha \mathinner {.\,.}\beta ] P_2\) and an integer range \([\alpha \mathinner {.\,.}\beta ]\) are provided, where \(P_1\) and \(P_2\) are strings of total length m. The goal of the query is to report all pairs of occurrences of \(P_1\) and \(P_2\) with a gap falling within \([\alpha \mathinner {.\,.}\beta ]\). An existing (conditional) lower bound reveals that any index with query time \(\widetilde{\mathcal {O}}(m + occ)\) must occupy almost quadratic space, where occ is the output size. However, there are interesting special cases where more efficient solutions are possible. For example, queries with a bounded gap, i.e., \(\beta \le G\) (fixed at construction) can be answered optimally using an \(\widetilde{\mathcal {O}}(nG)\) space structure. In this paper, we bring out an interesting version of the problem where rather than having a fixed upper bound on \(\beta \), we fix \(\gamma \) and allow any \(\beta \le \gamma \cdot m\) (i.e., allow longer gaps for longer patterns; gap-to-pattern ratio is bounded). We show that such queries can be answered optimally using an \(\widetilde{\mathcal {O}}(n\gamma )\) space structure.
Arnab Ganguly, Daniel Gibney, Paul MacNichol, Sharma V. Thankachan

Open Access

Simultaneously Building and Reconciling a Synteny Tree
Abstract
We present FullSynesth, a tree reconciliation algorithm predicting the evolution of a set of homologous genomic regions or syntenies, inside a species tree. The considered evolutionary model involves segmental events (i.e. acting on multiple genes) including duplications (D), losses (L), synteny fissions and transfers possibly going through unsampled or extinct species. Formally, given a set of syntenies in a set of genomes and a set \(\mathcal {G}\) of consistent gene trees for the gene families composing the syntenies, the problem is to infer a most parsimonious evolutionary history explaining the observed gene trees and syntenies given a species tree. The problem is NP-hard for the DL distance. FullSynesth is based on Synesth explicating the evolution of a set of syntenies given a single synteny tree, which can be obtained from \(\mathcal {G}\) by selecting an “optimal” supertree. Rather than trying each supertree in turn, FullSynesth is based on a two-in-one approach simultaneously building and reconciling a synteny supertree. The running time of this algorithm is exponential in the number of gene trees rather than in the size of gene trees. We show on simulated datasets that FullSynesth significantly improves the running time of Synesth applied to each possible supertree. An implementation of the algorithm is available at: http://​www.​iro.​umontreal.​ca/​~mabrouk/​.
Mathieu Gascon, Mattéo Delabre, Nadia El-Mabrouk
Quantum Algorithms for Longest Common Substring with a Gap
Abstract
Recent breakthroughs have provided a sublinear time quantum algorithm for the Longest Common Substring Problem running in \(\widetilde{\mathcal {O}}(n^{2/3}/d^{1/6})\) time for two strings of length at most n, where d is the length of the solution. At the same time, no subquadratic time quantum algorithm for the Longest Common Subsequence Problem is known, implying increasing difficulty as gaps are allowed within the solution. In this work, we consider the problem of finding two ordered matching substrings such that their total length is maximized. We present a strongly sublinear-time quantum algorithm.
Daniel Gibney, Md Helal Hossen
Online Computation of String Net Frequency
Abstract
The net frequency (NF) of a string, of length m, in a text, of length n, is the number of occurrences of the string in the text with unique left and right extensions. Recently, Guo et al. [CPM 2024] showed that NF is combinatorially interesting and how two key questions can be computed efficiently in the offline setting. First, single-nf: reporting the NF of a query string in an input text. Second, all-nf: reporting an occurrence and the NF of each string of positive NF in an input text. For many applications, however, facilitating these computations in an online manner is highly desirable. We are the first to solve the above two problems in the online setting, and we do so in optimal time, assuming, as is common, a constant-size alphabet: single-nf in https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-72200-4_12/MediaObjects/630698_1_En_12_Figa_HTML.gif time and all-nf in https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-72200-4_12/MediaObjects/630698_1_En_12_Figb_HTML.gif time. Our results are achieved by first designing new and simpler offline algorithms using suffix trees, proving additional properties of NF, and exploiting Ukkonen’s online suffix tree construction algorithm and results on implicit node maintenance in an implicit suffix tree by Breslauer and Italiano.
Peaker Guo, Seeun William Umboh, Anthony Wirth, Justin Zobel
On the Number of Non-equivalent Parameterized Squares in a String
Abstract
A string s is called a parameterized square when \(s = xy\) for strings x, y and x and y are parameterized equivalent. Kociumaka et al. showed the number of parameterized squares, which are non-equivalent in parameterized equivalence, in a string of length n that contains \(\sigma \) distinct characters is at most \(2 \sigma ! n\) [TCS 2016]. In this paper, we show that the maximum number of non-equivalent parameterized squares is less than \(\sigma n\), which significantly improves the best-known upper bound by Kociumaka et al.
Rikuya Hamai, Kazushi Taketsugu, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai
Another Virtue of Wavelet Forests
Abstract
A wavelet forest for a text T[1..n] over an alphabet \(\sigma \) takes \(n H_0 (T) + o (n \log \sigma )\) bits of space and supports access and rank on T in \(O (\log \sigma )\) time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when T is the Burrows-Wheeler Transform (BWT) of a string S, then a wavelet forest for T occupies space bounded in terms of higher-order empirical entropies of S even when the forest is implemented with uncompressed bitvectors. In this paper we show experimentally that wavelet forests also have better access locality than wavelet trees and are thus interesting even when higher-order compression is not effective on S, or when T is not a BWT at all.
Aaron Hong, Christina Boucher, Travis Gagie, Yansong Li, Norbert Zeh
All-Pairs Suffix-Prefix on Dynamic Set of Strings
Abstract
The all-pairs suffix-prefix (APSP) problem is a classical problem in string processing which has important applications in bioinformatics. Given a set \(\mathcal {S} = \{S_1, \ldots , S_k\}\) of k strings, the APSP problem asks one to compute the longest suffix of \(S_i\) that is a prefix of \(S_j\) for all \(k^2\) ordered pairs \(\langle S_i, S_j \rangle \) of strings in \(\mathcal {S}\). In this paper, we consider the dynamic version of the APSP problem that allows for insertions of new strings to the set of strings. Our objective is, each time a new string \(S_i\) arrives to the current set \(\mathcal {S}_{i-1} = \{S_1, \ldots , S_{i-1}\}\) of \(i-1\) strings, to compute (1) the longest suffix of \(S_i\) that is a prefix of \(S_j\) and (2) the longest prefix of \(S_i\) that is a suffix of \(S_j\) for all \(1 \le j \le i\). We propose an O(n)-space data structure which computes (1) and (2) in \(O(|S_i| \log \sigma + i)\) time for each new given string \(S_i\), where n is the total length of the strings.
Masaru Kikuchi, Shunsuke Inenaga
Adaptive Dynamic Bitvectors
Abstract
While operations rank and select on static bitvectors can be supported in constant time, lower bounds show that supporting updates raises the cost per operation to \(\varTheta (\log n/ \log \log n)\). This is a shame in scenarios where updates are possible but uncommon. We develop a representation of bitvectors that, if there are q queries per update, supports all the operations in \(O(\log (n/q))\) amortized time. Our experimental results support the theoretical findings, displaying speedups of orders of magnitude compared to standard dynamic implementations.
Gonzalo Navarro
Compressed Graph Representations for Evaluating Regular Path Queries
Abstract
Regular Path Queries (RPQs) are at the core of graph database query languages like SPARQL. They consist, essentially, of regular expressions that must match the sequence of edge labels of paths in the database graph. A way to answer them is to traverse the graph and the automaton of the RPQ in synchronization, reporting the graph nodes where the automaton reaches final states. We implement this approach on top of a compact graph representation that is particularly well suited for this task. The result is an index using considerably less space and/or query time than all existing approaches.
Gonzalo Navarro, Josefa Robert
Greedy Conjecture for the Shortest Common Superstring Problem and Its Strengthenings
Abstract
In the Shortest Common Superstring problem, one needs to find the shortest superstring for a set of strings. This problem is APX-hard, and many approximation algorithms were proposed, with the current best approximation factor of 2.466. Whereas these algorithms are technically involved, for more than thirty years the Greedy Conjecture remains unsolved, that states that the Greedy Algorithm “take two strings with the maximum overlap; merge them; repeat” is a 2-approximation. This conjecture is still open, and one way to approach it is to consider its stronger version, which may make the proof easier due to the stronger premise or provide insights from its refutation. In this paper, we propose two directions to strengthen the conjecture. First, we introduce the Locally Greedy Algorithm (LGA), that selects a pair of strings not with the largest overlap but with the locally largest overlap, that is, the largest among all pairs of strings with the same first or second string. Second, we change the quality metric: instead of length, we evaluate the solution by the number of occurrences of an arbitrary symbol.
Despite the double strengthening, we prove that LGA is a uniform 4-approximation, that is, it always constructs a superstring with no more than four times as many occurrences of an arbitrary symbol as any other superstring. At the same time, we discover the limitations of the greedy heuristic: we show that LGA is at least 3-approximation, and the Greedy Algorithm is at least uniform 2.5-approximation. These result show that if the Greedy Conjecture is true, it is not because the Greedy Algorithm is locally greedy or is uniformly 2-approximation.
Maksim S. Nikolaev
Faster Computation of Chinese Frequent Strings and Their Net Frequencies
Abstract
A Chinese frequent string (CFS) is a repeated string in a Chinese text that has at least one net occurrence (i.e., an occurrence with the property that both its left-extension and its right-extension by one character are unique in the text). The net frequency of a CFS is the number of its net occurrences in the text. In this short paper, we improve recent work of Guo et al. [4] in such a way that the computation of Chinese frequent strings and their net frequencies becomes much faster.
Enno Ohlebusch, Thomas Büchler, Jannik Olbrich
Faster Algorithms for Ranking/Unranking Bordered and Unbordered Words
Abstract
We show how the arithmetic structure of the set of borders (periods) of a word can be used to substantially reduce complexity of an interesting problem in combinatorics on words. A word w is a bordered word if it has a non-empty proper border (a prefix which is a suffix); equivalently, it has a period smaller than |w|. Words which are not bordered are called unbordered. The problem of ranking/unranking such words of a given length n over an alphabet of size k was considered  by Gabric (Inf. Process. Lett., 2024). We improve his results as follows: complexity of ranking is reduced by a factor \(nk/\log n\) and complexity of unranking by \(n^2k/ \log n\) (for large alphabets these improvement factors are \(n^2/\log n\) and \(n^3/ \log n\), respectively). We use the unit-cost RAM model (the same model was used by Gabric).
Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń
Computing String Covers in Sublinear Time
Abstract
In the word RAM model, a string T of length n over an integer alphabet of size \(\sigma \) can be represented in \(\mathcal {O}(n /\log _\sigma n)\) space. We show that a representation of all covers of T can be computed in the optimal \(\mathcal {O}(n/\log _\sigma n)\) time; in particular, the shortest cover can be computed within this time. We also design an \(\mathcal {O}(n(\log \sigma + \log \log n)/\log n)\)-sized data structure that computes in \(\mathcal {O}(1)\) time any element of the so-called (shortest) cover array of T, that is, the length of the shortest cover of any given prefix of T. As a by-product, we describe the structure of the cover array of Fibonacci strings. On the negative side, we show that the shortest cover of a length-n string cannot be computed using \(o(n/\log n)\) operations in the PILLAR model of Charalampopoulos, Kociumaka, and Wellnitz (FOCS 2020).
Jakub Radoszewski, Wiktor Zuba
LZ78 Substring Compression with CDAWGs
Abstract
The Lempel–Ziv 78 (LZ78) factorization is a well-studied technique for data compression. It and its derivates are used in compression formats such as compress or gif. While most research focuses on the factorization of plain data, not much research has been conducted on indexing the data for fast LZ78 factorization. Here, we study the LZ78 factorization in the substring compression model, where we are allowed to index the data and have to return the factorization of a substring specified at query time. In that model, we propose an algorithm that works in CDAWG-compressed space, computing the factorization with a logarithmic slowdown compared to the optimal time complexity.
Hiroki Shibata, Dominik Köppl
2d Side-Sharing Tandems with Mismatches
Abstract
One form of 2d periodicity is encapsulated by the definitions of 2d side-sharing tandems and runs. A 2d side-sharing tandem consists of two adjacent non-overlapping occurrences of the same rectangular block, and a side-sharing run is a maximally extended chain of side-sharing tandems. Furthering our understanding of 2d periodicity has long been an important goal, with motivation in the fields of image matching and multi-dimensional compression schemes. Much research has been accomplished on exact 2d periodicity, however, there have been few results on approximate 2d periodicity. In this paper we introduce several versions of approximate side-sharing tandems with k mismatches along with efficient algorithms for locating them in a rectangular array.
Shoshana Marcus, Dina Sokol, Sarah Zelikovitz
Faster and Simpler Online/Sliding Rightmost Lempel-Ziv Factorizations
Abstract
We tackle the problems of computing the rightmost variant of the Lempel-Ziv factorizations in the online/sliding model. Previous best bounds for this problem are \(O(n \log n)\) time with O(n) space, due to Amir et al. [IPL 2002] for the online model, and due to Larsson [CPM 2014] for the sliding model. In this paper, we present faster \(O(n \log n / \log \log n)\)-time solutions to both of the online/sliding models. Our algorithms are built on a simple data structure named BP-linked trees, and on a slightly improved version of the range minimum/maximum query (RmQ/RMQ) data structure on a dynamic list of integers. We also present other applications of our algorithms.
Wataru Sumiyoshi, Takuya Mieno, Shunsuke Inenaga
Space-Efficient SLP Encoding for O(log N)-Time Random Access
Abstract
A Straight-Line Program (SLP) \(\mathcal {G}\) for a string \(\mathcal {T}\) is a context-free grammar (CFG) that derives \(\mathcal {T}\) only, which can be considered as a compressed representation of \(\mathcal {T}\).In this paper, we show how to encode \(\mathcal {G}\) in \(n \lceil \lg N \rceil + (n + n') \lceil \lg (n+\sigma ) \rceil + 4n - 2n' + o(n)\) bits to support random access queries of extracting \(\mathcal {T}[p..q]\) in worst-case \(O(\log N + q - p)\) time, where N is the length of \(\mathcal {T}\), \(\sigma \) is the alphabet size, n is the number of variables in \(\mathcal {G}\) and \(n' \le n\) is the number of symmetric centroid paths in the DAG representation for \(\mathcal {G}\).
Akito Takasaka, Tomohiro I
Simple Linear-Time Repetition Factorization
Abstract
A factorization \(f_1, \ldots , f_m\) of a string w of length n is called a repetition factorization of w if \(f_i\) is a repetition, i.e., \(f_i\) is a form of \(x^kx'\) where x is a non-empty string, \(x'\) is a (possibly-empty) proper prefix of x, and \(k \ge 2\). Dumitran et al. [SPIRE 2015] presented an O(n)-time and space algorithm for computing an arbitrary repetition factorization of a given string of length n. Their algorithm heavily relies on the Union-Find data structure on trees proposed by Gabow and Tarjan [JCSS 1985] that works in linear time on the word RAM model, and an interval stabbing data structure of Schmidt [ISAAC 2009]. In this paper, we explore more combinatorial insights into the problem, and present a simple algorithm to compute an arbitrary repetition factorization of a given string of length n in O(n) time, without relying on data structures for Union-Find and interval stabbing. Our algorithm follows the approach by Inoue et al. [ToCS 2022] that computes the smallest/largest repetition factorization in \(O(n \log n)\) time.
Yuki Yonemoto, Shunsuke Inenaga
Backmatter
Metadata
Title
String Processing and Information Retrieval
Editors
Zsuzsanna Lipták
Edleno Moura
Karina Figueroa
Ricardo Baeza-Yates
Copyright Year
2025
Electronic ISBN
978-3-031-72200-4
Print ISBN
978-3-031-72199-1
DOI
https://doi.org/10.1007/978-3-031-72200-4

Premium Partner