2015 | OriginalPaper | Buchkapitel
Longest Common Prefix with Mismatches
verfasst von : Giovanni Manzini
Erschienen in: String Processing and Information Retrieval
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
The Longest Common Prefix (LCP) array is a data structure commonly used in combination with the Suffix Array. However, in some settings we are interested in the LCP values
per se
since they provide useful information on the repetitiveness of the underlying sequence.
Since sequences can contain alterations, which can be either malicious (plagiarism attempts) or pseudo-random (as in sequencing experiments), when using LCP values to measure repetitiveness it makes sense to allow for a small number of errors. In this paper we formalize this notion by considering the longest common prefix in the presence of mismatches. In particular, we propose an algorithm that computes, for each text suffix, the length of its longest prefix that occurs elsewhere in the text with at most one mismatch. For a sequence of length
n
our algorithm uses
$$\Theta (n\log n)$$
bits and runs in
$$\mathcal {O}(n \text{L}_{ave}\log n/\log \log n)$$
time where
$$\text{L}_{ave}$$
is the average LCP of the input sequence. Although
$$\text{L}_{ave}$$
is
$$\Theta (n)$$
in the worst case, recent analyses of real world data show that it usually grows logarithmically with the input size. We then describe and analyse a second algorithm that uses a greedy strategy to reduce the amount of computation and that can be turned into an even faster algorithm if allow an additive one-sided error.
Finally, we consider the related problem of computing the 1-mappability of a sequence. In this problem we are asked to compute, for each length-
m
substring of the input sequence, the number of other substrings which are at Hamming distance one. For this problem we propose an algorithm that takes
$$\mathcal {O}(m n \log n/\log \log n)$$
time using
$$\Theta (n \log n)$$
bits of space.