article

Boosting textual compression in optimal linear time

Authors:
Paolo Ferragina

Università di Pisa, Pisa, Italy

Università di Pisa, Pisa, Italy
View Profile

,
Raffaele Giancarlo

Università di Palermo, Palermo, Italy

Università di Palermo, Palermo, Italy
View Profile

,
Giovanni Manzini

Università del Piemonte Orientale, Alessandria, Italy

Università del Piemonte Orientale, Alessandria, Italy
View Profile

,
Marinella Sciortino

Università di Palermo, Palermo, Italy

Università di Palermo, Palermo, Italy
View Profile

Authors Info & Claims

Journal of the ACM Volume 52 Issue 4pp 688–713https://doi.org/10.1145/1082036.1082043

Published:01 July 2005Publication History

Journal of the ACM

Abstract

We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.Technically, our boosting technique builds upon three main ingredients: the Burrows--Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows--Wheeler Transform of a string s that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost.Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows--Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process.

References

Arnavut, Z. 2002. Generalization of the BWT transformation and inversion ranks. In Proceedings of the IEEE Data Compression Conference, IEEE Computer Society Press, Los Alamitos, Calif., page 447. Google ScholarDigital Library
Balkenhol, B., Kurtz, S., and Shtarkov, Y. M. 1999. Modification of the Burrows and Wheeler data compression algorithm. In Proceedings of the IEEE Data Compression Conference, IEEE Computer Society Press, Los Alamitos, Calif., page 188. Google ScholarDigital Library
Bentley, J., Sleator, D., Tarjan, R., and Wei, V. 1986. A locally adaptive compression scheme. Commun. ACM 29, 4, 320--330. Google ScholarDigital Library
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.Google Scholar
Capocelli, R. M., Giancarlo, R., and Taneja, I. J. 1986. Bounds on the redundancy of Huffman codes. IEEE Trans. Inf. Theory 32, 854--857. Google ScholarDigital Library
Cleary, J. G., and Teahan, W. J. 1997. Unbounded length contexts for PPM. Comput. J. 40, 67--75.Google ScholarCross Ref
Cormak, G. V., and Horspool, R. N. S. 1987. Data compression using dynamic Markov modelling. Comput. J. 30, 541--550. Google ScholarDigital Library
Cover, T. M., and Thomas, J. A. 1990. Elements of Information Theory. Wiley Interscience, New York. Google ScholarDigital Library
Crochemore, M., Désarménien, J., and Perrin, D. 2005. A note on the Burrows--Wheeler transform. Theoret. Comput. Sci. 332, 567--572. Google ScholarDigital Library
Deorowicz, S. 2002. Second step algorithms in the Burrows--Wheeler compression algorithm. Softw. Pract. Exper. 32, 2, 99--111. Google ScholarDigital Library
Effros, M., Visweswariah, K., Kulkarni, S., and Verdu, S. 2002. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 48, 5, 1061--1081. Google ScholarDigital Library
Elias, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21, 2, 194--203.Google ScholarDigital Library
Fenwick, P. 1996. The Burrows--Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39, 9, 731--740.Google ScholarCross Ref
Gessel, I. M., and Reutenauer, C. 1993. Counting permutations with given cycle structure and descend set. J. Combinat. Theory, Ser. A 64, 189--215. Google ScholarDigital Library
Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., pp. 251--260. Google ScholarDigital Library
Karp, R., Pippenger, N., and Sipser, M. 1985. A Time-Randomness tradeoff. In Proceedings of the Conference on Probabilistic Computational Complexity. AMS, 150--159.Google Scholar
Kosaraju, R., and Manzini, G. 1999. Compression of low entropy strings with Lempel--Ziv algorithms. SIAM Journal on Computing 29, 3, 893--911. Google ScholarDigital Library
Larsson, N. J. 1998. The context trees of block sorting compression. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., pp. 189--198. Google ScholarDigital Library
Levenshtein, V. I. 1968. On the redundancy and delay of decodable coding of natural numbers. In (Translation from) Problems in Cybernetics (Nauka, Moscow) vol. 20, 173--179.Google Scholar
Manzini, G. 2001. An analysis of the Burrows--Wheeler transform. J. ACM 48, 3, 407--430. Google ScholarDigital Library
McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272. Google ScholarDigital Library
Moffat, A. 1990. Implementing the PPM data compression scheme. IEEE Trans. Commun. 38, 1917--1921.Google ScholarCross Ref
Nisan, N., and Ta-Shma, A. 1999. Extracting randomness: A survey and new constructions. J. Comput. Syst. Sci. 58, 148--173. Google ScholarDigital Library
Rissanen, J. 1984. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory IT-29, 629--636.Google ScholarDigital Library
Sadakane, K. 1998. On optimality of variants of the block sorting compression. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., page 570. Google ScholarDigital Library
Schapire, R. E. 1990. The strength of weak learnability. Mach. Learn. 2, 197--227. Google ScholarDigital Library
Schapire, R. E. 2002. The boosting approach to Machine Learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, and B. Yu, Eds. Springer-Verlag Lecture Notes in Statistics n. 171, 149--172.Google Scholar
Seward, J. 1997. The bzip2 home page. http://sources.redhat.com/bzip2.Google Scholar
Schindler, M. 1997. A fast block-sorting algorithm for lossless data compression. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., page 469. Google ScholarDigital Library
Storer, J. 1992. Image and Text Compression. Kluwer Academic Press, Norwell, Mass. Google ScholarDigital Library
Trevisan, L. 2001. Extractors and pseudorandom generators. J. ACM 48, 860--789. Google ScholarDigital Library
Valiant, L. G. 1984. A theory of the Learnable. Commun. ACM 27, 1134--1142. Google ScholarDigital Library
Wirth, A., and Moffat, A. 2001. Can we do without ranks in Burrows--Wheeler transform compression? In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., pp. 419--428. Google ScholarDigital Library
Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed. Morgan Kaufmann Publishers, Los Altos, Calif. Google ScholarDigital Library
Ziv, J., and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 337--343.Google ScholarDigital Library
Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24, 530--536.Google ScholarDigital Library

Index Terms

Recommendations

Indexing compressed text

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ ...
Read More
Microarray Image Lossless Compression Using General Entropy Coders and Image Compression Standards
Abstract
Lossless compression is still a challenging task in the case of microarray images. This research proposes two algorithms that aim to improve the lossless compression efficiency for high spatial resolution microarray images using general entropy ...
Read More
Application of Arithmetic Coding to Compression of VLSI Test Data

This paper proposes arithmetic coding for application to data compression for VLSI testing. The use of arithmetic codes results in a codeword whose length is close to the optimal value (as predicted by entropy in information theory), thus achieving a ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Journal of the ACM Volume 52, Issue 4
July 2005
199 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1082036
Issue’s Table of Contents

Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 July 2005
Published in jacm Volume 52, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Arithmetic coding
Burrows--Wheeler transform
Huffman coding
Lempel--Ziv compressors
empirical entropy
suffix tree
text compression
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 103
  Total Citations
  View Citations
- 1,171
  Total Downloads
- Downloads (Last 12 months)18
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Boosting textual compression in optimal linear time

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Indexing compressed text

Microarray Image Lossless Compression Using General Entropy Coders and Image Compression Standards

Application of Arithmetic Coding to Compression of VLSI Test Data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Boosting textual compression in optimal linear time

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Indexing compressed text

Microarray Image Lossless Compression Using General Entropy Coders and Image Compression Standards

Application of Arithmetic Coding to Compression of VLSI Test Data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media