Efficient Fully-Compressed Sequence Representations

Barbay, Jérémy; Claude, Francisco; Gagie, Travis; Navarro, Gonzalo; Nekrich, Yakov

doi:10.1007/s00453-012-9726-3

Efficient Fully-Compressed Sequence Representations

Published: 15 December 2012

Volume 69, pages 232–268, (2014)
Cite this article

Algorithmica Aims and scope Submit manuscript

Jérémy Barbay¹,
Francisco Claude²,
Travis Gagie³,
Gonzalo Navarro¹ &
…
Yakov Nekrich⁴

606 Accesses
39 Citations
Explore all metrics

Abstract

We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in \(n\mathcal{H}_{0}(s) + o(n)(\mathcal {H}_{0}(s){+}1)\) bits, where \(\mathcal{H}_{0}(s)\) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time \({\mathcal{O} ( {\lg\lg\sigma} )}\) and average time \({\mathcal{O} ( {\lg\mathcal{H}_{0}(s)} )}\). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using \(n\mathcal{H}_{0}(s)+o(n\lg \sigma)\) bits. On highly compressible sequences the o(nlgσ) bits of the redundancy may be significant compared to the \(n\mathcal{H}_{0}(s)\) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented.

Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy.

The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π ⁻¹() improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors.

Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Longest Common Substring with Approximately k Mismatches

Article Open access 16 February 2019

A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees

Article 15 February 2022

Algorithms for generating all possible spanning trees of a simple undirected connected graph: an extensive review

Article Open access 13 August 2018

Notes

Because of these good results on polylog-sized alphabets, we focus on larger alphabets in this article, and therefore do not distinguish between redundancies of the form o(n)lgσ and no(lgσ), writing o(nlgσ) for all. See also footnote 6 of Barbay et al. [4].
The representation actually compresses to the k-th order entropy of a different sequence, not s (A. Golynski, personal communication).
By effective we mean that every character appears in s, and thus σ≤n. In Sect. 3.3 we handle the case of larger alphabets.
This is achieved by using block sizes of length \(\frac{\lg n}{2}\) and not \(\frac{\lg|s_{\ell}|}{2}\), at the price of storing universal tables of size \({\mathcal{O} ( {\sqrt{n} \textrm {polylog}(n)} )}=o(n)\) bits. Therefore all of our o(⋅) expressions involving n and other variables will be asymptotic in n.
Given σ pairs of numbers a _i,b _i>0, it holds that \(\sum a_{i} \lg\frac{a_{i}}{b_{i}} \ge (\sum a_{i} )\lg\frac{\sum a_{i}}{\sum b_{i}}\). Use a _i=|s|_i/n and b _i=−a _ilga _i to obtain the result.
One can retain lglgn in the denominator by using block sizes depending on n and not on |s ⁱ|, as explained in the footnote at the beginning of Sect. 3.2.
The code for generating these sequences is available at https://github.com/fclaude/txtinvlists.
We use G. Navarro’s Huffman implementation; the code is available in Libcds.
We use the code by J. Carpinelli, A. Moffat, R. Neal, W. Salamonsen, L. Stuiver, A. Turpin and I. Witten, available at http://ww2.cs.mu.oz.au/~alistair/arith_coder/arith_coder-3.tar.gz. We modified the decompressor to read the whole stream before timing decompression.
Note that this works well as long as each node points to at least one node. We solve this problem by keeping an additional bitmap marking the nodes whose list is not empty.
http://law.dsi.unimi.it.
Note that WTRRR is almost 50 % larger than WTRG. This is because the former is a more complex structure and requires a larger (constant) number of pointers to be represented. Multiplying by the σ nodes of the Huffman-shaped wavelet tree makes a significant difference when the alphabet is so large.
Djamal Belazzougui, personal communication.

References

Arroyuelo, D., González, S., Oyarzún, M.: Compressed self-indices supporting conjunctive queries on document collections. In: Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 43–54 (2010)
Chapter Google Scholar
Barbay, J., Claude, F., Navarro, G.: Compact rich-functional binary relation representations. In: Proc. 9th Latin American Symposium on Theoretical Informatics (LATIN). LNCS, vol. 6034, pp. 170–183 (2010)
Google Scholar
Barbay, J., Golynski, A., Munro, J.I., Rao, S.S.: Adaptive searching in succinctly encoded binary relations and tree-structured documents. Theor. Comput. Sci. 387(3), 284–297 (2007)
Article MATH MathSciNet Google Scholar
Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms 7(4), 52 (2011)
Article MathSciNet Google Scholar
Barbay, J., López-Ortiz, A., Lu, T., Salinger, A.: An experimental investigation of set intersection algorithms for text searching. ACM J. Exp. Algorithmics 14(3), 7 (2009)
Google Scholar
Barbay, J., Navarro, G.: Compressed representations of permutations, and applications. In: Proc. 26th Symposium on Theoretical Aspects of Computer Science (STACS), pp. 111–122 (2009)
Google Scholar
Barbay, J., Navarro, G.: On compressing permutations and adaptive sorting. CoRR (2011). 1108.4408v1
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proc. 19th Annual European Symposium on Algorithms (ESA). LNCS, vol. 6942, pp. 748–759 (2011)
Google Scholar
Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proc. 20th Annual European Symposium on Algorithms (ESA). LNCS, vol. 7501, pp. 181–192 (2012)
Google Scholar
Boldi, P., Vigna, S.: The WebGraph framework I: compression techniques. In: Proc. 13th World Wide Web Conference (WWW), pp. 595–602 (2004)
Chapter Google Scholar
Brisaboa, N., Luaces, M., Navarro, G., Seco, D.: A new point access method based on wavelet trees. In: Proc. 3rd International Workshop on Semantic and Conceptual Issues in GIS (SeCoGIS). LNCS, vol. 5833, pp. 297–306 (2009)
Google Scholar
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Clark, D.: Compact Pat Trees. Ph.D. Thesis, University of Waterloo, Canada (1996)
Clarke, C., Cormack, G., Tudhope, E.: Relevance ranking for one to three term queries. In: Proc. 5th International Conference on Computer-Assisted Information Retrieval (RIAO), pp. 388–401 (1997)
Google Scholar
Claude, F., Navarro, G.: Practical rank/select queries over arbitrary sequences. In: Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 176–187 (2008)
Google Scholar
Claude, F., Navarro, G.: Extended compact web graph representations. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) Algorithms and Applications (Ukkonen Festschrift). LNCS, vol. 6060, pp. 77–91. Springer, Berlin (2010)
Chapter Google Scholar
Claude, F., Navarro, G.: Fast and compact web graph representations. ACM Trans. Web 4(4), 16 (2010)
Article Google Scholar
Demaine, E., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 743–752 (2000)
Google Scholar
Fariña, A., Brisaboa, N., Navarro, G., Claude, F., Places, A., Rodríguez, E.: Word-based self-indexes for natural language text. ACM Trans. Inf. Syst. 30(1), 1 (2012)
Article Google Scholar
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1), 4 (2009)
Article MathSciNet Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005)
Article MathSciNet Google Scholar
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20 (2007)
Article MathSciNet Google Scholar
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115–121 (2007)
Article MATH MathSciNet Google Scholar
Gagie, T., Nekrich, Y.: Worst-case optimal adaptive prefix coding. In: Proc. 11th International Symposium on Algorithms and Data Structures (WADS). LNCS, vol. 5664, pp. 315–326 (2009)
Chapter Google Scholar
Golynski, A.: Optimal lower bounds for rank and select indexes. Theor. Comput. Sci. 387(3), 348–359 (2007)
Article MATH MathSciNet Google Scholar
Golynski, A.: Cell probe lower bounds for succinct data structures. In: Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 625–634 (2009)
Chapter Google Scholar
Golynski, A., Grossi, R., Gupta, A., Raman, R., Srinivasa Rao, S.: On the size of succinct indices. In: Proc. 15th Annual European Symposium on Algorithms (ESA). LNCS, vol. 4698, pp. 371–382 (2007)
Google Scholar
Golynski, A., Munro, J.I., Rao, S.S.: Rank/select operations on large alphabets: a tool for text indexing. In: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 368–373 (2006)
Google Scholar
Golynski, A., Raman, R., Rao, S.: On the redundancy of succinct data structures. In: Proc. 11th Scandinavian Workshop on Algorithm Theory (SWAT). LNCS, vol. 5124, pp. 148–159 (2008)
Google Scholar
González, R., Grabowski, Sz., Mäkinen, V., Navarro, G.: Practical implementation of rank and select queries. In: Proc. 4th Workshop on Efficient and Experimental Algorithms (WEA), pp. 27–38 (2005). Posters
Google Scholar
González, R., Navarro, G.: Statistical encoding of succinct data structures. In: Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 294–305 (2006)
Chapter Google Scholar
Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)
Google Scholar
Grossi, R., Orlandi, A., Raman, R.: Optimal trade-offs for succinct string indexes. In: Proc. 37th International Colloquim on Automata, Languages and Programming (ICALP), pp. 678–689 (2010)
Chapter Google Scholar
Grossi, R., Orlandi, A., Raman, R., Srinivasa Rao, S.: More haste, less waste: lowering the redundancy in fully indexable dictionaries. In: Proc. 26th Symposium on Theoretical Aspects of Computer Science (STACS), pp. 517–528 (2009)
Google Scholar
Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2006)
Article MathSciNet Google Scholar
Haskel, B., Puri, A., Netravali, A.: Digital Video: an Introduction to MPEG-2. Chapman & Hall, London (1997)
Google Scholar
Hernández, C., Navarro, G.: Compression of web and social graphs supporting neighbor and community queries. In: Proc. 5th ACM Workshop on Social Network Mining and Analysis (SNA-KDD). ACM, New York (2011)
Google Scholar
Hreinsson, J.B., Krøyer, M., Pagh, R.: Storing a compressed function with constant time access. In: Proc. 17th European Symposium on Algorithms (ESA), pp. 730–741 (2009)
Google Scholar
Huffman, D.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1090–1101 (1952)
Article Google Scholar
Levcopoulos, C., Petersson, O.: Sorting shuffled monotone sequences. Inf. Comput. 112(1), 37–50 (1994)
Article MATH MathSciNet Google Scholar
Mäkinen, V., Navarro, G.: Rank and select revisited and extended. Theor. Comput. Sci. 387(3), 332–347 (2007)
Article MATH Google Scholar
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms 4(3), 32 (2008)
Article MathSciNet Google Scholar
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
Article MathSciNet Google Scholar
Mehlhorn, K.: Sorting presorted files. In: Proc. 4th GI-Conference on Theoretical Computer Science. LNCS, vol. 67, pp. 199–212 (1979)
Chapter Google Scholar
Moffat, A., Turpin, A.: On the implementation of minimum-redundancy prefix codes. IEEE Trans. Commun. 45(10), 1200–1207 (1997)
Article Google Scholar
Munro, I.: Tables. In: Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS, vol. 1180, pp. 37–42 (1996)
Chapter Google Scholar
Munro, I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations and functions. Theor. Comput. Sci. 438, 74–88 (2012)
Article MATH MathSciNet Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007)
Article Google Scholar
Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. In: Proc. 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2013, to appear)
Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proc. 10th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 60–70 (2007)
Google Scholar
Pearlman, W., Islam, A., Nagaraj, N., Said, A.: Efficient, low-complexity image coding with a set-partitioning embedded block coder. IEEE Trans. Circuits Syst. Video Technol. 14(11), 1219–1235 (2004)
Article Google Scholar
Pennebaker, W., Mitchell, J.: JPEG: Still Image Data Compression Standard. Van Nostrand-Reinhold, New York (1992)
Google Scholar
Pătraşcu, M.: Succincter. In: Proc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 305–313 (2008)
Google Scholar
Pătraşcu, M.: A lower bound for succinct rank queries. CoRR (2009). arXiv:0907.1103v1 [cs.DS]
Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)
Article MathSciNet Google Scholar
Russo, L., Navarro, G., Oliveira, A., Morales, P.: Approximate string matching with compressed indexes. Algorithms 2(3), 1105–1136 (2009)
Article MathSciNet Google Scholar
Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294–313 (2003)
Article MATH MathSciNet Google Scholar
Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1230–1239 (2006)
Google Scholar
Said, A.: Efficient alphabet partitioning algorithms for low-complexity entropy coding. In: Proc. 15th Data Compression Conference (DCC), pp. 193–202 (2005)
Google Scholar
Tarjan, R.E., van Leeuwen, J.: Worst-case analysis of set union algorithms. J. ACM 31(2), 245–281 (1984)
Article MATH Google Scholar
Witten, I., Moffat, A., Bell, T.: Managing Gigabytes, 2nd edn. Morgan Kaufmann, San Mateo (1999)
Google Scholar

Download references

Acknowledgements

We thank Djamal Belazzougui for helpful comments on a draft of this paper, and Meg Gagie for righting our grammar.

Author information

Authors and Affiliations

Department of Computer Science, University of Chile, Santiago, Chile
Jérémy Barbay & Gonzalo Navarro
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
Francisco Claude
Department of Computer Science, Aalto University, Helsinki, Finland
Travis Gagie
Department of Electrical Engineering & Computer Science, University of Kansas, Kansas, USA
Yakov Nekrich

Authors

Jérémy Barbay
View author publications
You can also search for this author in PubMed Google Scholar
Francisco Claude
View author publications
You can also search for this author in PubMed Google Scholar
Travis Gagie
View author publications
You can also search for this author in PubMed Google Scholar
Gonzalo Navarro
View author publications
You can also search for this author in PubMed Google Scholar
Yakov Nekrich
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Gonzalo Navarro.

Additional information

An early version of this article appeared in Proc. 21st Annual International Symposium on Algorithms and Computation (ISAAC), part II, pp. 215–326, 2010. J. Barbay was funded in part by Fondecyt grants 1-110066 and 1120054, Chile. F. Claude was funded in part by Google PhD Fellowship Program. G. Navarro was funded in part by Fondecyt grant 1-110066, Chile. Work partially done during Y. Nekrich stay at the University of Chile.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Barbay, J., Claude, F., Gagie, T. et al. Efficient Fully-Compressed Sequence Representations. Algorithmica 69, 232–268 (2014). https://doi.org/10.1007/s00453-012-9726-3

Download citation

Received: 04 February 2012
Accepted: 04 December 2012
Published: 15 December 2012
Issue Date: May 2014
DOI: https://doi.org/10.1007/s00453-012-9726-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Efficient Fully-Compressed Sequence Representations

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees

Algorithms for generating all possible spanning trees of a simple undirected connected graph: an extensive review

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Efficient Fully-Compressed Sequence Representations

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees

Algorithms for generating all possible spanning trees of a simple undirected connected graph: an extensive review

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation