Abstract
We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in \(n\mathcal{H}_{0}(s) + o(n)(\mathcal {H}_{0}(s){+}1)\) bits, where \(\mathcal{H}_{0}(s)\) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time \({\mathcal{O} ( {\lg\lg\sigma} )}\) and average time \({\mathcal{O} ( {\lg\mathcal{H}_{0}(s)} )}\). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using \(n\mathcal{H}_{0}(s)+o(n\lg \sigma)\) bits. On highly compressible sequences the o(nlgσ) bits of the redundancy may be significant compared to the \(n\mathcal{H}_{0}(s)\) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented.
Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy.
The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π −1() improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors.
Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
Similar content being viewed by others
Notes
Because of these good results on polylog-sized alphabets, we focus on larger alphabets in this article, and therefore do not distinguish between redundancies of the form o(n)lgσ and no(lgσ), writing o(nlgσ) for all. See also footnote 6 of Barbay et al. [4].
The representation actually compresses to the k-th order entropy of a different sequence, not s (A. Golynski, personal communication).
By effective we mean that every character appears in s, and thus σ≤n. In Sect. 3.3 we handle the case of larger alphabets.
This is achieved by using block sizes of length \(\frac{\lg n}{2}\) and not \(\frac{\lg|s_{\ell}|}{2}\), at the price of storing universal tables of size \({\mathcal{O} ( {\sqrt{n} \textrm {polylog}(n)} )}=o(n)\) bits. Therefore all of our o(⋅) expressions involving n and other variables will be asymptotic in n.
Given σ pairs of numbers a i ,b i >0, it holds that \(\sum a_{i} \lg\frac{a_{i}}{b_{i}} \ge (\sum a_{i} )\lg\frac{\sum a_{i}}{\sum b_{i}}\). Use a i =|s| i /n and b i =−a i lga i to obtain the result.
One can retain lglgn in the denominator by using block sizes depending on n and not on |s i|, as explained in the footnote at the beginning of Sect. 3.2.
The code for generating these sequences is available at https://github.com/fclaude/txtinvlists.
We use G. Navarro’s Huffman implementation; the code is available in Libcds.
We use the code by J. Carpinelli, A. Moffat, R. Neal, W. Salamonsen, L. Stuiver, A. Turpin and I. Witten, available at http://ww2.cs.mu.oz.au/~alistair/arith_coder/arith_coder-3.tar.gz. We modified the decompressor to read the whole stream before timing decompression.
Note that this works well as long as each node points to at least one node. We solve this problem by keeping an additional bitmap marking the nodes whose list is not empty.
Note that WTRRR is almost 50 % larger than WTRG. This is because the former is a more complex structure and requires a larger (constant) number of pointers to be represented. Multiplying by the σ nodes of the Huffman-shaped wavelet tree makes a significant difference when the alphabet is so large.
Djamal Belazzougui, personal communication.
References
Arroyuelo, D., González, S., Oyarzún, M.: Compressed self-indices supporting conjunctive queries on document collections. In: Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 43–54 (2010)
Barbay, J., Claude, F., Navarro, G.: Compact rich-functional binary relation representations. In: Proc. 9th Latin American Symposium on Theoretical Informatics (LATIN). LNCS, vol. 6034, pp. 170–183 (2010)
Barbay, J., Golynski, A., Munro, J.I., Rao, S.S.: Adaptive searching in succinctly encoded binary relations and tree-structured documents. Theor. Comput. Sci. 387(3), 284–297 (2007)
Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms 7(4), 52 (2011)
Barbay, J., López-Ortiz, A., Lu, T., Salinger, A.: An experimental investigation of set intersection algorithms for text searching. ACM J. Exp. Algorithmics 14(3), 7 (2009)
Barbay, J., Navarro, G.: Compressed representations of permutations, and applications. In: Proc. 26th Symposium on Theoretical Aspects of Computer Science (STACS), pp. 111–122 (2009)
Barbay, J., Navarro, G.: On compressing permutations and adaptive sorting. CoRR (2011). 1108.4408v1
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proc. 19th Annual European Symposium on Algorithms (ESA). LNCS, vol. 6942, pp. 748–759 (2011)
Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proc. 20th Annual European Symposium on Algorithms (ESA). LNCS, vol. 7501, pp. 181–192 (2012)
Boldi, P., Vigna, S.: The WebGraph framework I: compression techniques. In: Proc. 13th World Wide Web Conference (WWW), pp. 595–602 (2004)
Brisaboa, N., Luaces, M., Navarro, G., Seco, D.: A new point access method based on wavelet trees. In: Proc. 3rd International Workshop on Semantic and Conceptual Issues in GIS (SeCoGIS). LNCS, vol. 5833, pp. 297–306 (2009)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Clark, D.: Compact Pat Trees. Ph.D. Thesis, University of Waterloo, Canada (1996)
Clarke, C., Cormack, G., Tudhope, E.: Relevance ranking for one to three term queries. In: Proc. 5th International Conference on Computer-Assisted Information Retrieval (RIAO), pp. 388–401 (1997)
Claude, F., Navarro, G.: Practical rank/select queries over arbitrary sequences. In: Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 176–187 (2008)
Claude, F., Navarro, G.: Extended compact web graph representations. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) Algorithms and Applications (Ukkonen Festschrift). LNCS, vol. 6060, pp. 77–91. Springer, Berlin (2010)
Claude, F., Navarro, G.: Fast and compact web graph representations. ACM Trans. Web 4(4), 16 (2010)
Demaine, E., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 743–752 (2000)
Fariña, A., Brisaboa, N., Navarro, G., Claude, F., Places, A., Rodríguez, E.: Word-based self-indexes for natural language text. ACM Trans. Inf. Syst. 30(1), 1 (2012)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1), 4 (2009)
Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20 (2007)
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115–121 (2007)
Gagie, T., Nekrich, Y.: Worst-case optimal adaptive prefix coding. In: Proc. 11th International Symposium on Algorithms and Data Structures (WADS). LNCS, vol. 5664, pp. 315–326 (2009)
Golynski, A.: Optimal lower bounds for rank and select indexes. Theor. Comput. Sci. 387(3), 348–359 (2007)
Golynski, A.: Cell probe lower bounds for succinct data structures. In: Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 625–634 (2009)
Golynski, A., Grossi, R., Gupta, A., Raman, R., Srinivasa Rao, S.: On the size of succinct indices. In: Proc. 15th Annual European Symposium on Algorithms (ESA). LNCS, vol. 4698, pp. 371–382 (2007)
Golynski, A., Munro, J.I., Rao, S.S.: Rank/select operations on large alphabets: a tool for text indexing. In: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 368–373 (2006)
Golynski, A., Raman, R., Rao, S.: On the redundancy of succinct data structures. In: Proc. 11th Scandinavian Workshop on Algorithm Theory (SWAT). LNCS, vol. 5124, pp. 148–159 (2008)
González, R., Grabowski, Sz., Mäkinen, V., Navarro, G.: Practical implementation of rank and select queries. In: Proc. 4th Workshop on Efficient and Experimental Algorithms (WEA), pp. 27–38 (2005). Posters
González, R., Navarro, G.: Statistical encoding of succinct data structures. In: Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 294–305 (2006)
Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)
Grossi, R., Orlandi, A., Raman, R.: Optimal trade-offs for succinct string indexes. In: Proc. 37th International Colloquim on Automata, Languages and Programming (ICALP), pp. 678–689 (2010)
Grossi, R., Orlandi, A., Raman, R., Srinivasa Rao, S.: More haste, less waste: lowering the redundancy in fully indexable dictionaries. In: Proc. 26th Symposium on Theoretical Aspects of Computer Science (STACS), pp. 517–528 (2009)
Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2006)
Haskel, B., Puri, A., Netravali, A.: Digital Video: an Introduction to MPEG-2. Chapman & Hall, London (1997)
Hernández, C., Navarro, G.: Compression of web and social graphs supporting neighbor and community queries. In: Proc. 5th ACM Workshop on Social Network Mining and Analysis (SNA-KDD). ACM, New York (2011)
Hreinsson, J.B., Krøyer, M., Pagh, R.: Storing a compressed function with constant time access. In: Proc. 17th European Symposium on Algorithms (ESA), pp. 730–741 (2009)
Huffman, D.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1090–1101 (1952)
Levcopoulos, C., Petersson, O.: Sorting shuffled monotone sequences. Inf. Comput. 112(1), 37–50 (1994)
Mäkinen, V., Navarro, G.: Rank and select revisited and extended. Theor. Comput. Sci. 387(3), 332–347 (2007)
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms 4(3), 32 (2008)
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
Mehlhorn, K.: Sorting presorted files. In: Proc. 4th GI-Conference on Theoretical Computer Science. LNCS, vol. 67, pp. 199–212 (1979)
Moffat, A., Turpin, A.: On the implementation of minimum-redundancy prefix codes. IEEE Trans. Commun. 45(10), 1200–1207 (1997)
Munro, I.: Tables. In: Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS, vol. 1180, pp. 37–42 (1996)
Munro, I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations and functions. Theor. Comput. Sci. 438, 74–88 (2012)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007)
Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. In: Proc. 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2013, to appear)
Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proc. 10th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 60–70 (2007)
Pearlman, W., Islam, A., Nagaraj, N., Said, A.: Efficient, low-complexity image coding with a set-partitioning embedded block coder. IEEE Trans. Circuits Syst. Video Technol. 14(11), 1219–1235 (2004)
Pennebaker, W., Mitchell, J.: JPEG: Still Image Data Compression Standard. Van Nostrand-Reinhold, New York (1992)
Pătraşcu, M.: Succincter. In: Proc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 305–313 (2008)
Pătraşcu, M.: A lower bound for succinct rank queries. CoRR (2009). arXiv:0907.1103v1 [cs.DS]
Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)
Russo, L., Navarro, G., Oliveira, A., Morales, P.: Approximate string matching with compressed indexes. Algorithms 2(3), 1105–1136 (2009)
Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294–313 (2003)
Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1230–1239 (2006)
Said, A.: Efficient alphabet partitioning algorithms for low-complexity entropy coding. In: Proc. 15th Data Compression Conference (DCC), pp. 193–202 (2005)
Tarjan, R.E., van Leeuwen, J.: Worst-case analysis of set union algorithms. J. ACM 31(2), 245–281 (1984)
Witten, I., Moffat, A., Bell, T.: Managing Gigabytes, 2nd edn. Morgan Kaufmann, San Mateo (1999)
Acknowledgements
We thank Djamal Belazzougui for helpful comments on a draft of this paper, and Meg Gagie for righting our grammar.
Author information
Authors and Affiliations
Corresponding author
Additional information
An early version of this article appeared in Proc. 21st Annual International Symposium on Algorithms and Computation (ISAAC), part II, pp. 215–326, 2010. J. Barbay was funded in part by Fondecyt grants 1-110066 and 1120054, Chile. F. Claude was funded in part by Google PhD Fellowship Program. G. Navarro was funded in part by Fondecyt grant 1-110066, Chile. Work partially done during Y. Nekrich stay at the University of Chile.
Rights and permissions
About this article
Cite this article
Barbay, J., Claude, F., Gagie, T. et al. Efficient Fully-Compressed Sequence Representations. Algorithmica 69, 232–268 (2014). https://doi.org/10.1007/s00453-012-9726-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-012-9726-3