Skip to main content
Log in

Efficient Fully-Compressed Sequence Representations

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in \(n\mathcal{H}_{0}(s) + o(n)(\mathcal {H}_{0}(s){+}1)\) bits, where \(\mathcal{H}_{0}(s)\) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time \({\mathcal{O} ( {\lg\lg\sigma} )}\) and average time \({\mathcal{O} ( {\lg\mathcal{H}_{0}(s)} )}\). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using \(n\mathcal{H}_{0}(s)+o(n\lg \sigma)\) bits. On highly compressible sequences the o(nlgσ) bits of the redundancy may be significant compared to the \(n\mathcal{H}_{0}(s)\) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented.

Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy.

The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π −1() improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors.

Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Algorithm 1
Algorithm 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Notes

  1. Because of these good results on polylog-sized alphabets, we focus on larger alphabets in this article, and therefore do not distinguish between redundancies of the form o(n)lgσ and no(lgσ), writing o(nlgσ) for all. See also footnote 6 of Barbay et al. [4].

  2. The representation actually compresses to the k-th order entropy of a different sequence, not s (A. Golynski, personal communication).

  3. By effective we mean that every character appears in s, and thus σn. In Sect. 3.3 we handle the case of larger alphabets.

  4. This is achieved by using block sizes of length \(\frac{\lg n}{2}\) and not \(\frac{\lg|s_{\ell}|}{2}\), at the price of storing universal tables of size \({\mathcal{O} ( {\sqrt{n} \textrm {polylog}(n)} )}=o(n)\) bits. Therefore all of our o(⋅) expressions involving n and other variables will be asymptotic in n.

  5. Given σ pairs of numbers a i ,b i >0, it holds that \(\sum a_{i} \lg\frac{a_{i}}{b_{i}} \ge (\sum a_{i} )\lg\frac{\sum a_{i}}{\sum b_{i}}\). Use a i =|s| i /n and b i =−a i lga i to obtain the result.

  6. One can retain lglgn in the denominator by using block sizes depending on n and not on |s i|, as explained in the footnote at the beginning of Sect. 3.2.

  7. The code for generating these sequences is available at https://github.com/fclaude/txtinvlists.

  8. We use G. Navarro’s Huffman implementation; the code is available in Libcds.

  9. We use the code by J. Carpinelli, A. Moffat, R. Neal, W. Salamonsen, L. Stuiver, A. Turpin and I. Witten, available at http://ww2.cs.mu.oz.au/~alistair/arith_coder/arith_coder-3.tar.gz. We modified the decompressor to read the whole stream before timing decompression.

  10. Note that this works well as long as each node points to at least one node. We solve this problem by keeping an additional bitmap marking the nodes whose list is not empty.

  11. http://law.dsi.unimi.it.

  12. Note that WTRRR is almost 50 % larger than WTRG. This is because the former is a more complex structure and requires a larger (constant) number of pointers to be represented. Multiplying by the σ nodes of the Huffman-shaped wavelet tree makes a significant difference when the alphabet is so large.

  13. Djamal Belazzougui, personal communication.

References

  1. Arroyuelo, D., González, S., Oyarzún, M.: Compressed self-indices supporting conjunctive queries on document collections. In: Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 43–54 (2010)

    Chapter  Google Scholar 

  2. Barbay, J., Claude, F., Navarro, G.: Compact rich-functional binary relation representations. In: Proc. 9th Latin American Symposium on Theoretical Informatics (LATIN). LNCS, vol. 6034, pp. 170–183 (2010)

    Google Scholar 

  3. Barbay, J., Golynski, A., Munro, J.I., Rao, S.S.: Adaptive searching in succinctly encoded binary relations and tree-structured documents. Theor. Comput. Sci. 387(3), 284–297 (2007)

    Article  MATH  MathSciNet  Google Scholar 

  4. Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms 7(4), 52 (2011)

    Article  MathSciNet  Google Scholar 

  5. Barbay, J., López-Ortiz, A., Lu, T., Salinger, A.: An experimental investigation of set intersection algorithms for text searching. ACM J. Exp. Algorithmics 14(3), 7 (2009)

    Google Scholar 

  6. Barbay, J., Navarro, G.: Compressed representations of permutations, and applications. In: Proc. 26th Symposium on Theoretical Aspects of Computer Science (STACS), pp. 111–122 (2009)

    Google Scholar 

  7. Barbay, J., Navarro, G.: On compressing permutations and adaptive sorting. CoRR (2011). 1108.4408v1

  8. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proc. 19th Annual European Symposium on Algorithms (ESA). LNCS, vol. 6942, pp. 748–759 (2011)

    Google Scholar 

  9. Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proc. 20th Annual European Symposium on Algorithms (ESA). LNCS, vol. 7501, pp. 181–192 (2012)

    Google Scholar 

  10. Boldi, P., Vigna, S.: The WebGraph framework I: compression techniques. In: Proc. 13th World Wide Web Conference (WWW), pp. 595–602 (2004)

    Chapter  Google Scholar 

  11. Brisaboa, N., Luaces, M., Navarro, G., Seco, D.: A new point access method based on wavelet trees. In: Proc. 3rd International Workshop on Semantic and Conceptual Issues in GIS (SeCoGIS). LNCS, vol. 5833, pp. 297–306 (2009)

    Google Scholar 

  12. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  13. Clark, D.: Compact Pat Trees. Ph.D. Thesis, University of Waterloo, Canada (1996)

  14. Clarke, C., Cormack, G., Tudhope, E.: Relevance ranking for one to three term queries. In: Proc. 5th International Conference on Computer-Assisted Information Retrieval (RIAO), pp. 388–401 (1997)

    Google Scholar 

  15. Claude, F., Navarro, G.: Practical rank/select queries over arbitrary sequences. In: Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 176–187 (2008)

    Google Scholar 

  16. Claude, F., Navarro, G.: Extended compact web graph representations. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) Algorithms and Applications (Ukkonen Festschrift). LNCS, vol. 6060, pp. 77–91. Springer, Berlin (2010)

    Chapter  Google Scholar 

  17. Claude, F., Navarro, G.: Fast and compact web graph representations. ACM Trans. Web 4(4), 16 (2010)

    Article  Google Scholar 

  18. Demaine, E., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 743–752 (2000)

    Google Scholar 

  19. Fariña, A., Brisaboa, N., Navarro, G., Claude, F., Places, A., Rodríguez, E.: Word-based self-indexes for natural language text. ACM Trans. Inf. Syst. 30(1), 1 (2012)

    Article  Google Scholar 

  20. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1), 4 (2009)

    Article  MathSciNet  Google Scholar 

  21. Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  Google Scholar 

  22. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20 (2007)

    Article  MathSciNet  Google Scholar 

  23. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115–121 (2007)

    Article  MATH  MathSciNet  Google Scholar 

  24. Gagie, T., Nekrich, Y.: Worst-case optimal adaptive prefix coding. In: Proc. 11th International Symposium on Algorithms and Data Structures (WADS). LNCS, vol. 5664, pp. 315–326 (2009)

    Chapter  Google Scholar 

  25. Golynski, A.: Optimal lower bounds for rank and select indexes. Theor. Comput. Sci. 387(3), 348–359 (2007)

    Article  MATH  MathSciNet  Google Scholar 

  26. Golynski, A.: Cell probe lower bounds for succinct data structures. In: Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 625–634 (2009)

    Chapter  Google Scholar 

  27. Golynski, A., Grossi, R., Gupta, A., Raman, R., Srinivasa Rao, S.: On the size of succinct indices. In: Proc. 15th Annual European Symposium on Algorithms (ESA). LNCS, vol. 4698, pp. 371–382 (2007)

    Google Scholar 

  28. Golynski, A., Munro, J.I., Rao, S.S.: Rank/select operations on large alphabets: a tool for text indexing. In: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 368–373 (2006)

    Google Scholar 

  29. Golynski, A., Raman, R., Rao, S.: On the redundancy of succinct data structures. In: Proc. 11th Scandinavian Workshop on Algorithm Theory (SWAT). LNCS, vol. 5124, pp. 148–159 (2008)

    Google Scholar 

  30. González, R., Grabowski, Sz., Mäkinen, V., Navarro, G.: Practical implementation of rank and select queries. In: Proc. 4th Workshop on Efficient and Experimental Algorithms (WEA), pp. 27–38 (2005). Posters

    Google Scholar 

  31. González, R., Navarro, G.: Statistical encoding of succinct data structures. In: Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 294–305 (2006)

    Chapter  Google Scholar 

  32. Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)

    Google Scholar 

  33. Grossi, R., Orlandi, A., Raman, R.: Optimal trade-offs for succinct string indexes. In: Proc. 37th International Colloquim on Automata, Languages and Programming (ICALP), pp. 678–689 (2010)

    Chapter  Google Scholar 

  34. Grossi, R., Orlandi, A., Raman, R., Srinivasa Rao, S.: More haste, less waste: lowering the redundancy in fully indexable dictionaries. In: Proc. 26th Symposium on Theoretical Aspects of Computer Science (STACS), pp. 517–528 (2009)

    Google Scholar 

  35. Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2006)

    Article  MathSciNet  Google Scholar 

  36. Haskel, B., Puri, A., Netravali, A.: Digital Video: an Introduction to MPEG-2. Chapman & Hall, London (1997)

    Google Scholar 

  37. Hernández, C., Navarro, G.: Compression of web and social graphs supporting neighbor and community queries. In: Proc. 5th ACM Workshop on Social Network Mining and Analysis (SNA-KDD). ACM, New York (2011)

    Google Scholar 

  38. Hreinsson, J.B., Krøyer, M., Pagh, R.: Storing a compressed function with constant time access. In: Proc. 17th European Symposium on Algorithms (ESA), pp. 730–741 (2009)

    Google Scholar 

  39. Huffman, D.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1090–1101 (1952)

    Article  Google Scholar 

  40. Levcopoulos, C., Petersson, O.: Sorting shuffled monotone sequences. Inf. Comput. 112(1), 37–50 (1994)

    Article  MATH  MathSciNet  Google Scholar 

  41. Mäkinen, V., Navarro, G.: Rank and select revisited and extended. Theor. Comput. Sci. 387(3), 332–347 (2007)

    Article  MATH  Google Scholar 

  42. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms 4(3), 32 (2008)

    Article  MathSciNet  Google Scholar 

  43. Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)

    Article  MathSciNet  Google Scholar 

  44. Mehlhorn, K.: Sorting presorted files. In: Proc. 4th GI-Conference on Theoretical Computer Science. LNCS, vol. 67, pp. 199–212 (1979)

    Chapter  Google Scholar 

  45. Moffat, A., Turpin, A.: On the implementation of minimum-redundancy prefix codes. IEEE Trans. Commun. 45(10), 1200–1207 (1997)

    Article  Google Scholar 

  46. Munro, I.: Tables. In: Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS, vol. 1180, pp. 37–42 (1996)

    Chapter  Google Scholar 

  47. Munro, I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations and functions. Theor. Comput. Sci. 438, 74–88 (2012)

    Article  MATH  MathSciNet  Google Scholar 

  48. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007)

    Article  Google Scholar 

  49. Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. In: Proc. 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2013, to appear)

  50. Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proc. 10th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 60–70 (2007)

    Google Scholar 

  51. Pearlman, W., Islam, A., Nagaraj, N., Said, A.: Efficient, low-complexity image coding with a set-partitioning embedded block coder. IEEE Trans. Circuits Syst. Video Technol. 14(11), 1219–1235 (2004)

    Article  Google Scholar 

  52. Pennebaker, W., Mitchell, J.: JPEG: Still Image Data Compression Standard. Van Nostrand-Reinhold, New York (1992)

    Google Scholar 

  53. Pătraşcu, M.: Succincter. In: Proc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 305–313 (2008)

    Google Scholar 

  54. Pătraşcu, M.: A lower bound for succinct rank queries. CoRR (2009). arXiv:0907.1103v1 [cs.DS]

  55. Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)

    Article  MathSciNet  Google Scholar 

  56. Russo, L., Navarro, G., Oliveira, A., Morales, P.: Approximate string matching with compressed indexes. Algorithms 2(3), 1105–1136 (2009)

    Article  MathSciNet  Google Scholar 

  57. Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294–313 (2003)

    Article  MATH  MathSciNet  Google Scholar 

  58. Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1230–1239 (2006)

    Google Scholar 

  59. Said, A.: Efficient alphabet partitioning algorithms for low-complexity entropy coding. In: Proc. 15th Data Compression Conference (DCC), pp. 193–202 (2005)

    Google Scholar 

  60. Tarjan, R.E., van Leeuwen, J.: Worst-case analysis of set union algorithms. J. ACM 31(2), 245–281 (1984)

    Article  MATH  Google Scholar 

  61. Witten, I., Moffat, A., Bell, T.: Managing Gigabytes, 2nd edn. Morgan Kaufmann, San Mateo (1999)

    Google Scholar 

Download references

Acknowledgements

We thank Djamal Belazzougui for helpful comments on a draft of this paper, and Meg Gagie for righting our grammar.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Gonzalo Navarro.

Additional information

An early version of this article appeared in Proc. 21st Annual International Symposium on Algorithms and Computation (ISAAC), part II, pp. 215–326, 2010. J. Barbay was funded in part by Fondecyt grants 1-110066 and 1120054, Chile. F. Claude was funded in part by Google PhD Fellowship Program. G. Navarro was funded in part by Fondecyt grant 1-110066, Chile. Work partially done during Y. Nekrich stay at the University of Chile.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Barbay, J., Claude, F., Gagie, T. et al. Efficient Fully-Compressed Sequence Representations. Algorithmica 69, 232–268 (2014). https://doi.org/10.1007/s00453-012-9726-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-012-9726-3

Keywords

Navigation