2007 | OriginalPaper | Chapter
Space-Efficient Algorithms for Document Retrieval
Authors : Niko Välimäki, Veli Mäkinen
Published in: Combinatorial Pattern Matching
Publisher: Springer Berlin Heidelberg
Activate our intelligent search to find suitable subject content or patents.
Select sections of text to find matching patents with Artificial Intelligence. powered by
Select sections of text to find additional relevant content using AI-assisted search. powered by
We study the
Document Listing
problem, where a collection
D
of documents
d
1
,...,
d
k
of total length ∑
i
d
i
=
n
is to be preprocessed, so that one can later efficiently list all the
$\textrm{ndoc}$
documents containing a given query pattern
P
of length
m
as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with
O
(
n
) time preprocessing, one can answer the queries in
$O(m+\textrm{ndoc})$
time. In this paper, we improve the space-requirement of the Muthukrishnan’s solution from
O
(
n
log
n
) bits to |
CSA
| + 2
n
+
n
log
k
(1 +
o
(1)) bits, where |
CSA
| ≤
n
log|
Σ
|(1 +
o
(1)) is the size of any suitable
compressed suffix array
(
CSA
), and
Σ
is the underlying alphabet of documents. The time requirement depends on the
CSA
used, but we can obtain e.g. the optimal
$O(m+\textrm{ndoc})$
time when
. For general |
Σ
|,
k
the time requirement becomes
$O(m \log |\Sigma|+\textrm{ndoc} \log k)$
. Sadakane (ISAAC 2002) has developed a similar space-efficient variant of the Muthukrishnan’s solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.