
Data compression

Published: 1 September 1987

Abstract

This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.

References

  1. ABRAMSON, N. 1963. Information Theory and Coding. McGraw-Hill, New York.
  2. APOSTOLICO, A., AND FRAENKEL, A. S. 1985. Robust transmission of unbounded strings using Fibonacci representations. Tech. Rep. CS85-14, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.
  3. ARC 1986. ARC File Archive Utility, Version 5.1. System Enhancement Associates, Wayne, N.J.
  4. ASH, R. B. 1965. Information Theory. Interscience, New York.
  5. BENTLEY, J. L., SLEATOR, D. D., TARJAN, R. E., AND WEI, V. K. 1986. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr.), 320-330.
  6. BOBROW, L. S., AND HAKIMI, S. L. 1969. Graph theoretic prefix codes and their synchronizing properties. Inf. Control 15, 1 (July), 70-94.
  7. BRENT, R. P., AND KUNG, H. T. 1978. Fast algorithms for manipulating formal power series. J. ACM 25, 4 (Oct.), 581-595.
  8. CAPOCELLI, R. M., GIANCARLO, R., AND TANEJA, I. J. 1986. Bounds on the redundancy of Huffman codes. IEEE Trans. Inf. Theory 32, 6 (Nov.), 854-857.
  9. CAPPELLINI, V., ED. 1985. Data Compression and Error Control Techniques with Applications. Academic Press, London.
  10. CONNELL, J. B. 1973. A Huffman-Shannon-Fano code. Proc. IEEE 61, 7 (July), 1046-1047.
  11. CORMACK, G. V. 1985. Data compression on a database system. Commun. ACM 28, 12 (Dec.), 1336-1342.
  12. CORMACK, G. V., AND HORSPOOL, R. N. 1984. Algorithms for adaptive Huffman codes. Inf. Process. Lett. 18, 3 (Mar.), 159-165.
  13. CORTESI, D. 1982. An effective text-compression algorithm. BYTE 7, 1 (Jan.), 397-403.
  14. COT, N. 1977. Characterization and design of optimal prefix codes. Ph.D. dissertation, Computer Science Dept., Stanford Univ., Stanford, Calif.
  15. ELIAS, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21, 2 (Mar.), 194-203.
  16. ELIAS, P. 1987. Interval and recency rank source coding: Two on-line adaptive variable-length schemes. IEEE Trans. Inf. Theory 33, 1 (Jan.), 3-10.
  17. FALLER, N. 1973. An adaptive system for data compression. In Record of the 7th Asilomar Conference on Circuits, Systems and Computers (Pacific Grove, Calif., Nov.). Naval Postgraduate School, Monterey, Calif., pp. 593-597.
  18. FANO, R. M. 1949. Transmission of Information. M.I.T. Press, Cambridge, Mass.
  19. FRAENKEL, A. S., AND KLEIN, S. T. 1985. Robust universal complete codes as alternatives to Huffman codes. Tech. Rep. CS85-16, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.
  20. FRAENKEL, A. S., MOR, M., AND PERL, Y. 1983. Is text compression by prefixes and suffixes practical? Acta Inf. 20, 4 (Dec.), 371-375.
  21. GALLAGER, R. G. 1968. Information Theory and Reliable Communication. Wiley, New York.
  22. GALLAGER, R. G. 1978. Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24, 6 (Nov.), 668-674.
  23. GAREY, M. R. 1974. Optimal binary search trees with restricted maximal depth. SIAM J. Comput. 3, 2 (June), 101-110.
  24. GILBERT, E. N. 1971. Codes based on inaccurate source probabilities. IEEE Trans. Inf. Theory 17, 3 (May), 304-314.
  25. GILBERT, E. N., AND MOORE, E. F. 1959. Variable-length binary encodings. Bell Syst. Tech. J. 38, 4 (July), 933-967.
  26. GLASSEY, C. R., AND KARP, R. M. 1976. On the optimality of Huffman trees. SIAM J. Appl. Math. 31, 2 (Sept.), 368-378.
  27. GONZALEZ, R. C., AND WINTZ, P. 1977. Digital Image Processing. Addison-Wesley, Reading, Mass.
  28. HAHN, B. 1974. A new technique for compression and storage of data. Commun. ACM 17, 8 (Aug.), 434-436.
  29. HESTER, J. H., AND HIRSCHBERG, D. S. 1985. Self-organizing linear search. ACM Comput. Surv. 17, 3 (Sept.), 295-311.
  30. HORSPOOL, R. N., AND CORMACK, G. V. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792-794.
  31. HU, T. C., AND TAN, K. C. 1972. Path length of binary search trees. SIAM J. Appl. Math. 22, 2 (Mar.), 225-234.
  32. HU, T. C., AND TUCKER, A. C. 1971. Optimal computer search trees and variable-length alphabetic codes. SIAM J. Appl. Math. 21, 4 (Dec.), 514-532.
  33. HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (Sept.), 1098-1101.
  34. INGELS, F. M. 1971. Information and Coding Theory. Intext, Scranton, Penn.
  35. ITAI, A. 1976. Optimal alphabetic trees. SIAM J. Comput. 5, 1 (Mar.), 9-18.
  36. KARP, R. M. 1961. Minimum redundancy coding for the discrete noiseless channel. IRE Trans. Inf. Theory 7, 1 (Jan.), 27-38.
  37. KNUTH, D. E. 1971. Optimum binary search trees. Acta Inf. 1, 1 (Jan.), 14-25.
  38. KNUTH, D. E. 1985. Dynamic Huffman coding. J. Algorithms 6, 2 (June), 163-180.
  39. KRAUSE, R. M. 1962. Channels which transmit letters of unequal duration. Inf. Control 5, 1 (Mar.), 13-24.
  40. LAESER, R. P., MCLAUGHLIN, W. I., AND WOLFF, D. M. 1986. Engineering Voyager 2's encounter with Uranus. Sci. Am. 255, 5 (Nov.), 36-45.
  41. LANGDON, G. G., AND RISSANEN, J. J. 1983. A double-adaptive file compression algorithm. IEEE Trans. Commun. 31, 11 (Nov.), 1253-1255.
  42. LLEWELLYN, J. A. 1987. Data compression for a source with Markov characteristics. Comput. J. 30, 2, 149-156.
  43. MCINTYRE, D. R., AND PECHURA, M. A. 1985. Data compression using static Huffman code-decode tables. Commun. ACM 28, 6 (June), 612-616.
  44. MEHLHORN, K. 1980. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans. Inf. Theory 26, 5 (Sept.), 513-517.
  45. NEUMANN, P. G. 1962. Efficient error-limiting variable-length codes. IRE Trans. Inf. Theory 8, 4 (July), 292-304.
  46. PARKER, D. S. 1980. Conditions for the optimality of the Huffman algorithm. SIAM J. Comput. 9, 3 (Aug.), 470-489.
  47. PASCO, R. 1976. Source coding algorithms for fast data compression. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif.
  48. PECHURA, M. 1982. File archival techniques using data compression. Commun. ACM 25, 9 (Sept.), 605-609.
  49. PERL, Y., GAREY, M. R., AND EVEN, S. 1975. Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters. J. ACM 22, 2 (Apr.), 202-214.
  50. PKARC FAST! 1987. PKARC FAST! File Archival Utility, Version 3.5. PKWARE, Inc., Glendale, Wis.
  51. REGHBATI, H. K. 1981. An overview of data compression techniques. Computer 14, 4 (Apr.), 71-75.
  52. RISSANEN, J. J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20 (May), 198-203.
  53. RISSANEN, J. J. 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 5 (Sept.), 656-664.
  54. RODEH, M., PRATT, V. R., AND EVEN, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.
  55. RUBIN, F. 1976. Experiments in text file compression. Commun. ACM 19, 11 (Nov.), 617-623.
  56. RUBIN, F. 1979. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory 25, 6 (Nov.), 672-675.
  57. RUDNER, B. 1971. Construction of minimum-redundancy codes with optimal synchronizing property. IEEE Trans. Inf. Theory 17, 4 (July), 478-487.
  58. RUTH, S. S., AND KREUTZER, P. J. 1972. Data compression for large business files. Datamation 18, 9 (Sept.), 62-66.
  59. RYABKO, B. Y. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792.
  60. SAMET, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2 (June), 187-260.
  61. SCHWARTZ, E. S. 1964. An optimum encoding with minimum longest code and total number of digits. Inf. Control 7, 1 (Mar.), 37-44.
  62. SCHWARTZ, E. S., AND KALLICK, B. 1964. Generating a canonical prefix encoding. Commun. ACM 7, 3 (Mar.), 166-169.
  63. SEVERANCE, D. G. 1983. A practitioner's guide to data base compression. Inf. Syst. 8, 1, 51-62.
  64. SHANNON, C. E., AND WEAVER, W. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill.
  65. SNYDERMAN, M., AND HUNT, B. 1970. The myriad virtues of text compaction. Datamation 16, 12 (Dec.), 36-40.
  66. STANDISH, T. A. 1980. Data Structure Techniques. Addison-Wesley, Reading, Mass.
  67. STIFFLER, J. J. 1971. Theory of Synchronous Communications. Prentice-Hall, Englewood Cliffs, N.J.
  68. STORER, J. A., AND SZYMANSKI, T. G. 1982. Data compression via textual substitution. J. ACM 29, 4 (Oct.), 928-951.
  69. TANAKA, H. 1987. Data structure of Huffman codes and its application to efficient encoding and decoding. IEEE Trans. Inf. Theory 33, 1 (Jan.), 154-156.
  70. TROPPER, R. 1982. Binary-coded text, a text-compression method. BYTE 7, 4 (Apr.), 398-413.
  71. UNIX 1984. UNIX User's Manual, Version 4.2. Berkeley Software Distribution, Virtual VAX-11 Version, Univ. of California, Berkeley, Calif.
  72. VARN, B. 1971. Optimal variable length codes (arbitrary symbol cost and equal code word probability). Inf. Control 19, 4 (Nov.), 289-301.
  73. VITTER, J. S. 1987. Design and analysis of dynamic Huffman codes. J. ACM 34, 4 (Oct.), 825-845.
  74. WAGNER, R. A. 1973. Common phrases and minimum-space text storage. Commun. ACM 16, 3 (Mar.), 148-152.
  75. WELCH, T. A. 1984. A technique for high-performance data compression. Computer 17, 6 (June), 8-19.
  76. WILKINS, L. C., AND WINTZ, P. A. 1971. Bibliography on data compression, picture properties and picture coding. IEEE Trans. Inf. Theory 17, 2, 180-197.
  77. WITTEN, I. H., NEAL, R. M., AND CLEARY, J. G. 1987. Arithmetic coding for data compression. Commun. ACM 30, 6 (June), 520-540.
  78. ZIMMERMAN, S. 1959. An optimal search procedure. Am. Math. Monthly 66 (Oct.), 690-693.
  79. ZIV, J., AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May), 337-343.
  80. ZIV, J., AND LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5 (Sept.), 530-536.

Reviews

Ian Hugh Witten

This paper reviews the techniques of data compression that have been developed since Shannon's and Fano's pioneering work in the late 1940s. It deals exclusively with reversible (sometimes called noiseless) compression, where the decoder can recover exactly what the transmitter sent (barring transmission errors). Nonreversible methods are normally used for coding speech, gray-scale or color pictures, and telemetry data, where significantly increased compression can be obtained if one is satisfied with a high-fidelity approximation to the original signal rather than bit-for-bit reproduction. Reversible methods are used for text, program, and database compression, where errors cannot generally be tolerated.

The authors have compiled an impressive variety of approaches to coding for data compression, including Shannon-Fano coding, Huffman coding and numerous elaborations such as efficient methods for adaptive Huffman coding, Elias's variable-length representation of the integers, Fibonacci codes, arithmetic coding, Ziv-Lempel methods, and an adaptive move-to-front word code. They quote a number of empirical results on coding efficiency from other papers. Several coding schemes that depend on the semantics of particular domains are mentioned, although the authors correctly point out that general-purpose methods can perform just as well and are less sensitive to the type of data being encoded. A number of applications are mentioned to motivate the survey, including commonly used file archival programs such as ARC, file compression systems such as UNIX compress, and even some work that apparently uses Huffman coding to find leaks in a pipeline.

The past 40 years have been dominated by Huffman's algorithm, published in 1952; so, too, is the paper, which devotes a large proportion of its length to this method. By choosing to cover the history of compression fairly evenly, the authors have given Huffman coding undue prominence, and consequently the primary methods currently used, arithmetic coding and Ziv-Lempel coding, receive short shrift. For this reason, while the paper may be excellent as historical background for new researchers, it is of limited value to the practitioner seeking to understand state-of-the-art compression methods. One of the reviewers has argued the benefits of arithmetic over Huffman coding elsewhere [1], and this is not the place to repeat those arguments. Suffice it to say that arithmetic coding gives greater compression, can be faster for adaptive methods, and clearly separates the model from the channel encoding. In this survey, however, arithmetic coding is tucked away at the end of the section on nonadaptive methods (although the fact that it can be used adaptively is mentioned), while a substantial section is devoted to adaptive Huffman coding in its own right, quite apart from the material on nonadaptive Huffman coding.

One of the most important advances in the theory of data compression over the last decade is the insight that data compression can be split into two parts: an encoder and a separate modeler that feeds information to it [2]. This survey is principally devoted to the study of coding, where a probability distribution (model) is assumed and particular messages are represented as compactly as possible based on these probabilities. The authors do not sufficiently stress this model-dependence, however. For example, they often speak of “the” entropy of a message as though it imposes a lower bound on the number of bits required for coding; this is of course only true with respect to a particular model of the source that generated the message. It is quite easy to represent text with fewer bits than its entropy calculated from letter frequencies would imply (for example, by using a model based on words). While this is obvious enough, it deserves to be emphasized. Indeed, although a wide range of coding algorithms are surveyed, only the simplest of models, those based on single-character frequencies, are considered. These are capable of only mediocre compression. More sophisticated models (for example, ones based on characters in context) are required to achieve better compression, but these are only touched on in the “New directions” section.
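The model-dependence point is easy to make concrete. The following small Python sketch (an illustration added here, not code from the survey or the review; the helper name entropy_bits is invented) costs the same text under an order-0 character-frequency model and under an order-0 word-frequency model; the word model needs far fewer bits than the “character entropy” would suggest is possible:

    import math
    from collections import Counter

    def entropy_bits(symbols):
        # Total bits to code the sequence under an order-0 model of its own
        # symbol frequencies: -sum over symbols s of n_s * log2(n_s / n).
        counts = Counter(symbols)
        n = len(symbols)
        return -sum(c * math.log2(c / n) for c in counts.values())

    text = "the quick brown fox jumps over the lazy dog " * 50
    print(entropy_bits(text))          # model: single characters
    print(entropy_bits(text.split()))  # model: whole words -- far fewer
                                       # bits (separator cost ignored)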
The section on Ziv-Lempel coding is confused and confusing. The main algorithm described is Welch's version [3], but parts of the description pertain to other versions. The trouble is that these versions are moderately different from each other. For example, Welch's uses a fixed code length of 12 bits, which performs poorly on large files. The UNIX compress utility uses an increasing code length, and so it is not stuck with the behavior it learns initially. Also, the complexity of LZ coding is given as O(n^2), or O(n) with a complex data structure. This is true only of the version given by Even et al. [4]; others (including Welch's) can quite easily be implemented in O(n) time using a trie; indeed, some are O(n) with a linear search (see Ziv and Lempel [5]). Much of this confusion stems from the fact that Ziv and Lempel published two papers [5, 6] that describe rather different methods for data compression, yet one invariably reads in the literature about “the” Ziv-Lempel method (and, to make matters worse, it is abbreviated with the initials reversed, as “LZ coding”).
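For reference, here is a minimal Python sketch of the dictionary-building loop in Welch's variant (a sketch under assumed naming, not the paper's or Welch's own code). A hash table stands in for the trie, so each input byte does O(1) expected work and the whole pass runs in O(n) time:

    def lzw_compress(data):
        # Dictionary maps phrases (byte strings) to integer codes;
        # it starts preloaded with all 256 single-byte phrases.
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        phrase = b""
        output = []
        for b in data:
            candidate = phrase + bytes([b])
            if candidate in dictionary:
                phrase = candidate                 # keep extending the match
            else:
                output.append(dictionary[phrase])  # emit code for the match
                dictionary[candidate] = next_code  # learn the new phrase
                next_code += 1
                # Welch's scheme emits fixed 12-bit codes and stops learning
                # at 4096 entries; UNIX compress instead widens the code
                # length each time next_code crosses a power of two.
                phrase = bytes([b])
        if phrase:
            output.append(dictionary[phrase])
        return output

    print(lzw_compress(b"abababababab"))   # 6 codes for 12 input bytes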
Most coding methods are optimal in one sense or another, and the authors mention many important optimality results. For example, Huffman's algorithm generates an optimal “defined-word” code (i.e., one with integer-length codes), Ziv and Lempel's 1978 method approaches the source entropy as the size of the input increases, and arithmetic coding represents the input in a number of bits that is arbitrarily close to its entropy with respect to the model used for coding. Unfortunately, the authors fail to point out the practical relevance of these results. One is that arithmetic coding is guaranteed to give compression superior to Huffman coding, assuming that both are based on the same model. Another is that while Ziv-Lempel methods yield asymptotically optimal compression, they are exceedingly slow to approach the asymptote (and do so at the expense of unbounded memory requirements); in consequence, they are invariably outperformed by probability-based methods.

The section on empirical results is disappointing because there is no comparison between the performance of the methods on the same test files. Instead, unrelated results from research papers are quoted, and in some cases these are not even converted to a common unit for comparison. The crucial influence of the underlying models on compression performance is hardly acknowledged. Also, there is little mention of the relative speeds of the methods, which is often an important factor in practice.

A final section about errors considers in detail how each coding method performs when the encoded data are transmitted on a noisy channel. The authors might have pointed out that a code can only be robust if it contains redundancy, and that robustness is thus directly opposed to compression. It is far better to separate the two processes, making use of the wealth of error-correcting techniques available in the literature after the message has been compressed.

The survey covers a wide range of coding methods in a fair amount of detail. It is particularly useful as a historical record of implementation techniques for the now-superseded method of Huffman coding. The main disappointment is that the reader may find it difficult to distinguish the relative merits of the bewildering battery of methods available.
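The compress-first, protect-afterward separation advocated above can be sketched with modern Python library calls (zlib postdates the survey; this illustrates the principle only, and a CRC merely detects errors where a real channel code would correct them):

    import zlib

    def protect(message):
        compressed = zlib.compress(message)   # stage 1: remove redundancy
        checksum = zlib.crc32(compressed)     # stage 2: add controlled
        return checksum.to_bytes(4, "big") + compressed   # redundancy back

    def unprotect(packet):
        checksum = int.from_bytes(packet[:4], "big")
        compressed = packet[4:]
        if zlib.crc32(compressed) != checksum:
            raise ValueError("channel error detected")
        return zlib.decompress(compressed)

    assert unprotect(protect(b"some message " * 100)) == b"some message " * 100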


  • Published in

    ACM Computing Surveys, Volume 19, Issue 3 (Sept. 1987), 96 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/45072

    Copyright © 1987 ACM

    Publisher

    Association for Computing Machinery, New York, NY, United States
