Skip to main content
Log in

Space-Efficient Substring Occurrence Estimation

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

In this paper we study the problem of estimating the number of occurrences of substrings in textual data: A text \(T\) on some alphabet \(\varSigma =[\sigma ]\) of length \(n\) is preprocessed and an index \({\mathcal {I}}\) is built. The index is used in lieu of the text to answer queries of the form \(\mathsf{Count}\approx (P)\), returning an approximated number of the occurrences of an arbitrary pattern \(P\) as a substring of \(T\). The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that, for a given additive error \(\ell \ge 1\), requires \(\varTheta \left( \frac{n}{\ell }\log \sigma \right) \) bits. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

Notes

  1. Recall that the condition on \(m\) is not enough to obtain a small \(\mathsf{PST}_\ell \) due to the edge labels.

  2. In the following we will adopt the common assumption that \(\sigma = O(n)\).

  3. We recall that a predecessor query Pred \((x,A)\) returns the predecessor of \(x\) in a set \(A\), namely, \(\max \{y \mid y \le x\; \wedge \; y \in A\}\). A successor query is similar but finds the minimum \(y\) such that \(y \ge x\).

  4. Notice that \(C(u)\) is the number of leaves in the subtree of \(u\) in the original suffix tree.

  5. We use the notation \(0^x\) to denote the binary value 0 repeated \(x\) times.

  6. Notice that \(w\) and \(v\) coincide whenever \(w\) has an inverse suffix link for \({c}\).

  7. We notice that this lemma is also observed by Russo et al. [24] (Lemma 5.1).

  8. A similar method of traversing a suffix tree by means of inverse suffix links encoded in a string has been proposed by Arroyuelo et al. [1].

  9. There is a caveat: in case the first node of the subtree of \({c}\) has an edge label with length greater than 1, then the \(+1\) factor must be eliminated, since that same node becomes a destination.

  10. Notice that we can use the same pre-filter described at the end of Sect. 4 whenever the term \(O(\sigma \log n)\) bits is dominant.

  11. This implementation can be downloaded at the address http://pizzachili.dcc.uchile.cl/indexes/FM-indexV2.

References

  1. Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica 62(1), 54–101 (2012)

    Article  MATH  MathSciNet  Google Scholar 

  2. Barbay, J., Gagie, T., Navarro, G., Nekrich, Y.: Alphabet partitioning for compressed rank/select and applications. In: Proceedings of the 21st International Symposium on Algorithms and Computation (ISAAC), pp. 315–326 (2010)

  3. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with applications. In: Proceedings of the 18th Annual European Symposium on Algorithms (ESA), pp. 427–438 (2010)

  4. Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proceedings of the 20th Annual European Symposium on Algorithms (ESA), pp. 181–192 (2012)

  5. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  6. Chaudhuri, S., Ganti, V., Gravano, L.: Selectivity estimation for string predicates: overcoming the underestimation problem. In: Proceedings of the 20th International Conference on Data Engineering (ICDE), p. 227 (2004)

  7. Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)

    Article  MATH  MathSciNet  Google Scholar 

  8. Fano, R.M.: On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC, MIT, Cambridge, MA (1971)

  9. Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 12–31 (2008)

  10. Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM 46, 236–280 (1999)

    Article  MATH  MathSciNet  Google Scholar 

  11. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  Google Scholar 

  12. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10 (2010)

    Article  MathSciNet  Google Scholar 

  13. Ferragina, P., Venturini, R.: Compressed cache-oblivious String B-tree. In: Proceedings of 21th Annual European Symposium on Algorithms (ESA), pp. 469–480 (2013)

  14. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. ACM Trans. Algorithms 8(1), 4 (2012)

    Article  MathSciNet  Google Scholar 

  15. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

    Article  MATH  MathSciNet  Google Scholar 

  16. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

    Book  MATH  Google Scholar 

  17. Hagerup, T., Tholey, T.: Efficient minimal perfect hashing in nearly minimal space. In: Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pp. 317–326 (2001)

  18. Jagadish, H.V., Ng, R.T., Srivastava, D.: Substring selectivity estimation. In: Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of database systems (PODS), pp. 249–260 (1999)

  19. Krishnan, P., Vitter, J.S., Iyer, B.R.: Estimating alphanumeric selectivity in the presence of wildcards. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 282–293 (1996)

  20. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)

    Article  MathSciNet  Google Scholar 

  21. Morrison, D.R.: PATRICIA: practical algorithm to retrieve coded in alphanumeric. J. ACM 15(4), 514–534 (1968)

    Article  Google Scholar 

  22. Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Comput. Surv. 39(1), 2 (2007)

  23. Orlandi, A., Venturini, R.: Space-efficient substring occurrence estimation. In: Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 95–106 (2011)

  24. Russo, L.M.S., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008: Theoretical Informatics, LNCS, vol. 4957, pp. 362–373. Springer, Berlin (2008)

  25. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann Publishers, Los Altos, CA (1999)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rossano Venturini.

Additional information

The work is an extended version of the paper appeared in the Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) 2011 [23]. This work has been partially supported by MIUR of Italy under projects PRIN ARS Technomedia 2012, the Midas EU Project, Grant Agreement No. 318786 and the InGeoCloudS EU Project, Grant Agreement No. 297300. The work was conducted while the first author was a Ph.D. student at the Department of Computer Science, University of Pisa.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Orlandi, A., Venturini, R. Space-Efficient Substring Occurrence Estimation. Algorithmica 74, 65–90 (2016). https://doi.org/10.1007/s00453-014-9936-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-014-9936-y

Keywords

Navigation