2015 | Original Paper | Book Chapter
Space-Efficient Detection of Unusual Words
Authors: Djamal Belazzougui, Fabio Cunial
Published in: String Processing and Information Retrieval
Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2 \log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
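To make the underlying problem concrete, the following minimal sketch scores every length-k substring of a text against an IID model estimated from the text's own character frequencies. It is not the BWT-based algorithm of the paper: it is a naive hash-table enumeration, and the function name `iid_scores`, the fixed word length `k`, and the binomial approximation of the variance (which ignores overlapping occurrences) are all assumptions made for illustration. Unlike the method described above, it only scores words that actually occur in the text.

```python
from collections import Counter
from math import sqrt


def iid_scores(text, k):
    """Score each length-k substring of `text` under an IID model.

    Returns a dict mapping word -> (observed, expected, z).
    Hypothetical helper for illustration; not the paper's algorithm.
    """
    n = len(text)
    positions = n - k + 1  # number of windows of length k
    if k <= 0 or positions <= 0:
        return {}

    # IID model: each character is drawn independently with its
    # empirical frequency in the text.
    char_counts = Counter(text)
    prob = {c: char_counts[c] / n for c in char_counts}

    # Observed number of occurrences of every length-k word.
    observed = Counter(text[i:i + k] for i in range(positions))

    scores = {}
    for word, count in observed.items():
        p = 1.0
        for c in word:
            p *= prob[c]
        expected = positions * p
        # Assumption: binomial variance, a rough approximation that
        # ignores the dependence between overlapping windows.
        variance = positions * p * (1.0 - p)
        z = (count - expected) / sqrt(variance) if variance > 0 else 0.0
        scores[word] = (count, expected, z)
    return scores


if __name__ == "__main__":
    for word, (obs, exp, z) in sorted(iid_scores("abababab", 2).items()):
        print(word, obs, round(exp, 3), round(z, 3))
```

A positive z marks a word that occurs more often than the model predicts, a negative z one that occurs less often. The sketch uses memory proportional to the number of distinct words, which is the kind of overhead a BWT-based approach with a small stack is designed to avoid.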