New space/time tradeoffs for top-k document retrieval on sequences

doi:10.1016/j.tcs.2014.05.005

Abstract

We address the problem of indexing a collection $\mathcal{D} = \{T_{1}, T_{2}, \dots, T_{D}\}$ of $D$ string documents of total length $n$, so that we can efficiently answer top-$k$ queries: retrieve the $k$ documents most relevant to a pattern $P$ of length $p$ given at query time. There exist linear-space data structures, that is, structures using $O(n)$ words, that answer such queries in optimal $O(p + k)$ time for an ample set of notions of relevance. However, linear space is still too much for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (the number of times $P$ appears in a document $T_{i}$), an index occupying $|CSA| + o(n)$ bits answers the query in time $O(t_{search}(p) + k \lg^{2} k \lg^{\varepsilon} n)$, where $CSA$ is a compressed suffix array indexing $\mathcal{D}$, $t_{search}(p)$ is its time to find the suffix array interval of $P$, and $\varepsilon > 0$ is any constant. (2) With the same measure of relevance, an index occupying $|CSA| + n \lg D + o(n \lg \sigma + n \lg D)$ bits answers the query in time $O(t_{search}(p) + k \lg^{*} k)$, where $\lg^{*} k$ is the iterated logarithm of $k$. (3) When the relevance depends only on the documents, an index occupying $|CSA| + O(n \lg \lg n)$ bits answers the query in $O(t_{search}(p) + k \, t_{SA})$ time, where $t_{SA}$ is the time the $CSA$ needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.
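To make the query concrete, the following is a minimal sketch of top-$k$ retrieval by term frequency using a plain, uncompressed suffix array. It is a naive baseline for illustration only and is not any of the indexes described in the abstract: it uses far more than linear space in practice, and a query costs time proportional to the total number of occurrences of $P$ rather than to $k$. The function names, the `"\x00"` separator, and the tie-breaking rule (lower document index first) are assumptions made for this example; the pattern is assumed to be non-empty and free of the separator.

```python
import heapq
from collections import Counter


def build_suffix_array(text):
    # Naive construction by sorting suffixes; acceptable only for tiny inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffixes that start with pattern."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-p prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-p prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k_by_term_frequency(docs, pattern, k):
    """Return up to k (document index, term frequency) pairs, most frequent first."""
    sep = "\x00"  # assumed not to occur in any document or in the pattern
    text = sep.join(docs) + sep
    # doc_of[i] is the index of the document that owns text position i.
    doc_of = []
    for d, t in enumerate(docs):
        doc_of.extend([d] * (len(t) + 1))
    sa = build_suffix_array(text)
    lo, hi = sa_interval(text, sa, pattern)
    # Count occurrences per document by scanning the whole interval.
    tf = Counter(doc_of[sa[i]] for i in range(lo, hi))
    # Ties in frequency are broken in favor of the lower document index.
    return heapq.nlargest(k, tf.items(), key=lambda item: (item[1], -item[0]))
```

For example, `top_k_by_term_frequency(["abracadabra", "banana", "cabana"], "ana", 2)` returns `[(1, 2), (2, 1)]`: the pattern occurs twice in the second document and once in the third. The scan over the full suffix array interval is exactly the cost that the compressed indexes in the abstract are designed to avoid.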

Theoretical Computer Science
