2007 | OriginalPaper | Buchkapitel
A Lempel-Ziv Text Index on Secondary Storage
verfasst von : Diego Arroyuelo, Gonzalo Navarro
Erschienen in: Combinatorial Pattern Matching
Verlag: Springer Berlin Heidelberg
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
Full-text
searching consists in locating the occurrences of a given pattern
P
[1..
m
] in a text
T
[1..
u
], both sequences over an alphabet of size
σ
. In this paper we define a new index for full-text searching on
secondary storage
, based on the Lempel-Ziv compression algorithm and requiring 8
uH
k
+
o
(
u
log
σ
) bits of space, where
H
k
denotes the
k
-th order empirical entropy of
T
, for any
k
=
o
(log
σ
u
). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size
including the text
, which means 39%–65% the size of traditional indexes like
String B-trees
[Ferragina and Grossi,
JACM
1999]. In exchange, our index requires more disk access to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to
count
pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.