Space-Efficient Substring Occurrence Estimation

Orlandi, Alessio; Venturini, Rossano

doi:10.1007/s00453-014-9936-y

Space-Efficient Substring Occurrence Estimation

Published: 10 September 2014

Volume 74, pages 65–90, (2016)
Cite this article

Algorithmica Aims and scope Submit manuscript

Alessio Orlandi¹ &
Rossano Venturini²

250 Accesses
2 Citations
Explore all metrics

Abstract

In this paper we study the problem of estimating the number of occurrences of substrings in textual data: A text \(T\) on some alphabet \(\varSigma =[\sigma ]\) of length \(n\) is preprocessed and an index \({\mathcal {I}}\) is built. The index is used in lieu of the text to answer queries of the form \(\mathsf{Count}\approx (P)\), returning an approximated number of the occurrences of an arbitrary pattern \(P\) as a substring of \(T\). The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that, for a given additive error \(\ell \ge 1\), requires \(\varTheta \left( \frac{n}{\ell }\log \sigma \right) \) bits. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Position-Restricted Substring Searching over Small Alphabets

Frequency-Constrained Substring Complexity

Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees

Notes

Recall that the condition on \(m\) is not enough to obtain a small \(\mathsf{PST}_\ell \) due to the edge labels.
In the following we will adopt the common assumption that \(\sigma = O(n)\).
We recall that a predecessor query Pred \((x,A)\) returns the predecessor of \(x\) in a set \(A\), namely, \(\max \{y \mid y \le x\; \wedge \; y \in A\}\). A successor query is similar but finds the minimum \(y\) such that \(y \ge x\).
Notice that \(C(u)\) is the number of leaves in the subtree of \(u\) in the original suffix tree.
We use the notation \(0^x\) to denote the binary value 0 repeated \(x\) times.
Notice that \(w\) and \(v\) coincide whenever \(w\) has an inverse suffix link for \({c}\).
We notice that this lemma is also observed by Russo et al. [24] (Lemma 5.1).
A similar method of traversing a suffix tree by means of inverse suffix links encoded in a string has been proposed by Arroyuelo et al. [1].
There is a caveat: in case the first node of the subtree of \({c}\) has an edge label with length greater than 1, then the \(+1\) factor must be eliminated, since that same node becomes a destination.
Notice that we can use the same pre-filter described at the end of Sect. 4 whenever the term \(O(\sigma \log n)\) bits is dominant.
This implementation can be downloaded at the address http://pizzachili.dcc.uchile.cl/indexes/FM-indexV2.

References

Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica 62(1), 54–101 (2012)
Article MATH MathSciNet Google Scholar
Barbay, J., Gagie, T., Navarro, G., Nekrich, Y.: Alphabet partitioning for compressed rank/select and applications. In: Proceedings of the 21st International Symposium on Algorithms and Computation (ISAAC), pp. 315–326 (2010)
Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with applications. In: Proceedings of the 18th Annual European Symposium on Algorithms (ESA), pp. 427–438 (2010)
Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proceedings of the 20th Annual European Symposium on Algorithms (ESA), pp. 181–192 (2012)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Chaudhuri, S., Ganti, V., Gravano, L.: Selectivity estimation for string predicates: overcoming the underestimation problem. In: Proceedings of the 20th International Conference on Data Engineering (ICDE), p. 227 (2004)
Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)
Article MATH MathSciNet Google Scholar
Fano, R.M.: On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC, MIT, Cambridge, MA (1971)
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 12–31 (2008)
Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM 46, 236–280 (1999)
Article MATH MathSciNet Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Article MathSciNet Google Scholar
Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10 (2010)
Article MathSciNet Google Scholar
Ferragina, P., Venturini, R.: Compressed cache-oblivious String B-tree. In: Proceedings of 21th Annual European Symposium on Algorithms (ESA), pp. 469–480 (2013)
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. ACM Trans. Algorithms 8(1), 4 (2012)
Article MathSciNet Google Scholar
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
Article MATH MathSciNet Google Scholar
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Book MATH Google Scholar
Hagerup, T., Tholey, T.: Efficient minimal perfect hashing in nearly minimal space. In: Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pp. 317–326 (2001)
Jagadish, H.V., Ng, R.T., Srivastava, D.: Substring selectivity estimation. In: Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of database systems (PODS), pp. 249–260 (1999)
Krishnan, P., Vitter, J.S., Iyer, B.R.: Estimating alphanumeric selectivity in the presence of wildcards. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 282–293 (1996)
Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)
Article MathSciNet Google Scholar
Morrison, D.R.: PATRICIA: practical algorithm to retrieve coded in alphanumeric. J. ACM 15(4), 514–534 (1968)
Article Google Scholar
Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Comput. Surv. 39(1), 2 (2007)
Orlandi, A., Venturini, R.: Space-efficient substring occurrence estimation. In: Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 95–106 (2011)
Russo, L.M.S., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008: Theoretical Informatics, LNCS, vol. 4957, pp. 362–373. Springer, Berlin (2008)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann Publishers, Los Altos, CA (1999)
Google Scholar

Download references

Author information

Authors and Affiliations

Independent researcher, Zurich, Switzerland
Alessio Orlandi
Department of Computer Science, University of Pisa, Pisa, Italy
Rossano Venturini

Authors

Alessio Orlandi
View author publications
You can also search for this author in PubMed Google Scholar
Rossano Venturini
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Rossano Venturini.

Additional information

The work is an extended version of the paper appeared in the Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) 2011 [23]. This work has been partially supported by MIUR of Italy under projects PRIN ARS Technomedia 2012, the Midas EU Project, Grant Agreement No. 318786 and the InGeoCloudS EU Project, Grant Agreement No. 297300. The work was conducted while the first author was a Ph.D. student at the Department of Computer Science, University of Pisa.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Orlandi, A., Venturini, R. Space-Efficient Substring Occurrence Estimation. Algorithmica 74, 65–90 (2016). https://doi.org/10.1007/s00453-014-9936-y

Download citation

Received: 21 November 2013
Accepted: 26 August 2014
Published: 10 September 2014
Issue Date: January 2016
DOI: https://doi.org/10.1007/s00453-014-9936-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Space-Efficient Substring Occurrence Estimation

Abstract

Access this article

Similar content being viewed by others

Position-Restricted Substring Searching over Small Alphabets

Frequency-Constrained Substring Complexity

Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Space-Efficient Substring Occurrence Estimation

Abstract

Access this article

Similar content being viewed by others

Position-Restricted Substring Searching over Small Alphabets

Frequency-Constrained Substring Complexity

Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation