Elsevier

Theoretical Computer Science

Volume 542, 3 July 2014, Pages 83-97

New space/time tradeoffs for top-k document retrieval on sequences

https://doi.org/10.1016/j.tcs.2014.05.005
Under an Elsevier user license
open archive

Abstract

We address the problem of indexing a collection $\mathcal{D}=\{T_1,T_2,\ldots,T_D\}$ of $D$ string documents of total length $n$, so that we can efficiently answer top-$k$ queries: retrieve $k$ documents most relevant to a pattern $P$ of length $p$ given at query time. There exist linear-space data structures, that is, using $O(n)$ words, that answer such queries in optimal $O(p+k)$ time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times $P$ appears in a document $T_i$), an index occupying $|\mathrm{CSA}|+o(n)$ bits answers the query in time $O(t_{\mathrm{search}}(p)+k\lg^2 k\lg^{\varepsilon} n)$, where CSA is a compressed suffix array indexing $\mathcal{D}$, $t_{\mathrm{search}}(p)$ is its time to find the suffix array interval of $P$, and $\varepsilon>0$ is any constant. (2) With the same measure of relevance, an index occupying $|\mathrm{CSA}|+n\lg D+o(n\lg\sigma+n\lg D)$ bits answers the query in time $O(t_{\mathrm{search}}(p)+k\lg^* k)$, where $\lg^* k$ is the iterated logarithm of $k$. (3) When the relevance depends only on the documents, an index occupying $|\mathrm{CSA}|+O(n\lg\lg n)$ bits answers the query in $O(t_{\mathrm{search}}(p)+k\,t_{\mathrm{SA}})$ time, where $t_{\mathrm{SA}}$ is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.
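
To make the query semantics concrete, the following is a minimal brute-force sketch of a top-k query under term-frequency relevance, together with the iterated logarithm that appears in result (2). It is not the compressed index described in the paper: it scans every document on each query and serves only as a reference for what such an index is expected to return. The function names, the tie-breaking rule (smaller document index first), and the choice to count overlapping occurrences are illustrative assumptions.

```python
import heapq
import math


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included.

    Assumption: term frequency counts every starting position of the
    pattern, so "aa" occurs 3 times in "aaaa".
    """
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(documents, pattern, k):
    """Return up to k (document_index, frequency) pairs, most frequent first.

    Brute-force baseline: scans all documents on every query.
    Documents that do not contain the pattern are never reported.
    Ties are broken by smaller document index (an arbitrary choice).
    """
    scored = []
    for index, doc in enumerate(documents):
        tf = count_occurrences(doc, pattern)
        if tf > 0:
            scored.append((tf, index))
    best = heapq.nsmallest(k, scored, key=lambda s: (-s[0], s[1]))
    return [(index, tf) for tf, index in best]


def iterated_log(k: int) -> int:
    """lg* k: how many times lg must be applied before the value is <= 1."""
    steps = 0
    value = float(k)
    while value > 1:
        value = math.log2(value)
        steps += 1
    return steps
```

For example, `top_k_by_term_frequency(["abracadabra", "banana", "aaaa", "xyz"], "a", 2)` reports documents 0 and 2, with frequencies 5 and 4. The iterated logarithm grows extremely slowly: `iterated_log(65536)` is 4, which is why the $k\lg^* k$ term in result (2) is close to linear in $k$.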

Keywords

Document retrieval
Top-k queries
String databases
Compressed data structures

Early parts of this work appeared in SPIRE 2013 and ISAAC 2013. Work supported by Fondecyt Grant 1-110066, Chile.