skip to main content
article

Boosting textual compression in optimal linear time

Published:01 July 2005Publication History
Skip Abstract Section

Abstract

We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.Technically, our boosting technique builds upon three main ingredients: the Burrows--Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows--Wheeler Transform of a string s that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost.Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows--Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process.

References

  1. Arnavut, Z. 2002. Generalization of the BWT transformation and inversion ranks. In Proceedings of the IEEE Data Compression Conference, IEEE Computer Society Press, Los Alamitos, Calif., page 447. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Balkenhol, B., Kurtz, S., and Shtarkov, Y. M. 1999. Modification of the Burrows and Wheeler data compression algorithm. In Proceedings of the IEEE Data Compression Conference, IEEE Computer Society Press, Los Alamitos, Calif., page 188. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bentley, J., Sleator, D., Tarjan, R., and Wei, V. 1986. A locally adaptive compression scheme. Commun. ACM 29, 4, 320--330. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.Google ScholarGoogle Scholar
  5. Capocelli, R. M., Giancarlo, R., and Taneja, I. J. 1986. Bounds on the redundancy of Huffman codes. IEEE Trans. Inf. Theory 32, 854--857. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Cleary, J. G., and Teahan, W. J. 1997. Unbounded length contexts for PPM. Comput. J. 40, 67--75.Google ScholarGoogle ScholarCross RefCross Ref
  7. Cormak, G. V., and Horspool, R. N. S. 1987. Data compression using dynamic Markov modelling. Comput. J. 30, 541--550. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Cover, T. M., and Thomas, J. A. 1990. Elements of Information Theory. Wiley Interscience, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Crochemore, M., Désarménien, J., and Perrin, D. 2005. A note on the Burrows--Wheeler transform. Theoret. Comput. Sci. 332, 567--572. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Deorowicz, S. 2002. Second step algorithms in the Burrows--Wheeler compression algorithm. Softw. Pract. Exper. 32, 2, 99--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Effros, M., Visweswariah, K., Kulkarni, S., and Verdu, S. 2002. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 48, 5, 1061--1081. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Elias, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21, 2, 194--203.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Fenwick, P. 1996. The Burrows--Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39, 9, 731--740.Google ScholarGoogle ScholarCross RefCross Ref
  14. Gessel, I. M., and Reutenauer, C. 1993. Counting permutations with given cycle structure and descend set. J. Combinat. Theory, Ser. A 64, 189--215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., pp. 251--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Karp, R., Pippenger, N., and Sipser, M. 1985. A Time-Randomness tradeoff. In Proceedings of the Conference on Probabilistic Computational Complexity. AMS, 150--159.Google ScholarGoogle Scholar
  17. Kosaraju, R., and Manzini, G. 1999. Compression of low entropy strings with Lempel--Ziv algorithms. SIAM Journal on Computing 29, 3, 893--911. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Larsson, N. J. 1998. The context trees of block sorting compression. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., pp. 189--198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Levenshtein, V. I. 1968. On the redundancy and delay of decodable coding of natural numbers. In (Translation from) Problems in Cybernetics (Nauka, Moscow) vol. 20, 173--179.Google ScholarGoogle Scholar
  20. Manzini, G. 2001. An analysis of the Burrows--Wheeler transform. J. ACM 48, 3, 407--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Moffat, A. 1990. Implementing the PPM data compression scheme. IEEE Trans. Commun. 38, 1917--1921.Google ScholarGoogle ScholarCross RefCross Ref
  23. Nisan, N., and Ta-Shma, A. 1999. Extracting randomness: A survey and new constructions. J. Comput. Syst. Sci. 58, 148--173. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Rissanen, J. 1984. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory IT-29, 629--636.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Sadakane, K. 1998. On optimality of variants of the block sorting compression. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., page 570. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Schapire, R. E. 1990. The strength of weak learnability. Mach. Learn. 2, 197--227. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Schapire, R. E. 2002. The boosting approach to Machine Learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, and B. Yu, Eds. Springer-Verlag Lecture Notes in Statistics n. 171, 149--172.Google ScholarGoogle Scholar
  28. Seward, J. 1997. The bzip2 home page. http://sources.redhat.com/bzip2.Google ScholarGoogle Scholar
  29. Schindler, M. 1997. A fast block-sorting algorithm for lossless data compression. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., page 469. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Storer, J. 1992. Image and Text Compression. Kluwer Academic Press, Norwell, Mass. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Trevisan, L. 2001. Extractors and pseudorandom generators. J. ACM 48, 860--789. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Valiant, L. G. 1984. A theory of the Learnable. Commun. ACM 27, 1134--1142. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Wirth, A., and Moffat, A. 2001. Can we do without ranks in Burrows--Wheeler transform compression? In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., pp. 419--428. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed. Morgan Kaufmann Publishers, Los Altos, Calif. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Ziv, J., and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 337--343.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24, 530--536.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Boosting textual compression in optimal linear time

                        Recommendations

                        Comments

                        Login options

                        Check if you have access through your login credentials or your institution to get full access on this article.

                        Sign in

                        Full Access

                        • Published in

                          cover image Journal of the ACM
                          Journal of the ACM  Volume 52, Issue 4
                          July 2005
                          199 pages
                          ISSN:0004-5411
                          EISSN:1557-735X
                          DOI:10.1145/1082036
                          Issue’s Table of Contents

                          Copyright © 2005 ACM

                          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                          Publisher

                          Association for Computing Machinery

                          New York, NY, United States

                          Publication History

                          • Published: 1 July 2005
                          Published in jacm Volume 52, Issue 4

                          Permissions

                          Request permissions about this article.

                          Request Permissions

                          Check for updates

                          Qualifiers

                          • article

                        PDF Format

                        View or Download as a PDF file.

                        PDF

                        eReader

                        View online with eReader.

                        eReader