
Data compression

Published: 1 September 1987

Abstract

This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.

References

  1. ABRAMSON, N. 1963. Information Theory and Coding. McGraw-Hill, New York.
  2. APOSTOLICO, A., AND FRAENKEL, A. S. 1985. Robust transmission of unbounded strings using Fibonacci representations. Tech. Rep. CS85-14, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.
  3. ARC 1986. ARC File Archive Utility, Version 5.1. System Enhancement Associates, Wayne, N.J.
  4. ASH, R. B. 1965. Information Theory. Interscience, New York.
  5. BENTLEY, J. L., SLEATOR, D. D., TARJAN, R. E., AND WEI, V. K. 1986. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr.), 320-330.
  6. BOBROW, L. S., AND HAKIMI, S. L. 1969. Graph theoretic prefix codes and their synchronizing properties. Inf. Control 15, 1 (July), 70-94.
  7. BRENT, R. P., AND KUNG, H. T. 1978. Fast algorithms for manipulating formal power series. J. ACM 25, 4 (Oct.), 581-595.
  8. CAPOCELLI, R. M., GIANCARLO, R., AND TANEJA, I. J. 1986. Bounds on the redundancy of Huffman codes. IEEE Trans. Inf. Theory 32, 6 (Nov.), 854-857.
  9. CAPPELLINI, V., ED. 1985. Data Compression and Error Control Techniques with Applications. Academic Press, London.
  10. CONNELL, J. B. 1973. A Huffman-Shannon-Fano code. Proc. IEEE 61, 7 (July), 1046-1047.
  11. CORMACK, G. V. 1985. Data compression on a database system. Commun. ACM 28, 12 (Dec.), 1336-1342.
  12. CORMACK, G. V., AND HORSPOOL, R. N. 1984. Algorithms for adaptive Huffman codes. Inf. Process. Lett. 18, 3 (Mar.), 159-165.
  13. CORTESI, D. 1982. An effective text-compression algorithm. BYTE 7, 1 (Jan.), 397-403.
  14. COT, N. 1977. Characterization and design of optimal prefix codes. Ph.D. dissertation, Computer Science Dept., Stanford Univ., Stanford, Calif.
  15. ELIAS, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21, 2 (Mar.), 194-203.
  16. ELIAS, P. 1987. Interval and recency rank source coding: Two on-line adaptive variable-length schemes. IEEE Trans. Inf. Theory 33, 1 (Jan.), 3-10.
  17. FALLER, N. 1973. An adaptive system for data compression. In Record of the 7th Asilomar Conference on Circuits, Systems and Computers (Pacific Grove, Calif., Nov.). Naval Postgraduate School, Monterey, Calif., pp. 593-597.
  18. FANO, R. M. 1949. Transmission of Information. M.I.T. Press, Cambridge, Mass.
  19. FRAENKEL, A. S., AND KLEIN, S. T. 1985. Robust universal complete codes as alternatives to Huffman codes. Tech. Rep. CS85-16, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.
  20. FRAENKEL, A. S., MOR, M., AND PERL, Y. 1983. Is text compression by prefixes and suffixes practical? Acta Inf. 20, 4 (Dec.), 371-375.
  21. GALLAGER, R. G. 1968. Information Theory and Reliable Communication. Wiley, New York.
  22. GALLAGER, R. G. 1978. Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24, 6 (Nov.), 668-674.
  23. GAREY, M. R. 1974. Optimal binary search trees with restricted maximal depth. SIAM J. Comput. 3, 2 (June), 101-110.
  24. GILBERT, E. N. 1971. Codes based on inaccurate source probabilities. IEEE Trans. Inf. Theory 17, 3 (May), 304-314.
  25. GILBERT, E. N., AND MOORE, E. F. 1959. Variable-length binary encodings. Bell Syst. Tech. J. 38, 4 (July), 933-967.
  26. GLASSEY, C. R., AND KARP, R. M. 1976. On the optimality of Huffman trees. SIAM J. Appl. Math. 31, 2 (Sept.), 368-378.
  27. GONZALEZ, R. C., AND WINTZ, P. 1977. Digital Image Processing. Addison-Wesley, Reading, Mass.
  28. HAHN, B. 1974. A new technique for compression and storage of data. Commun. ACM 17, 8 (Aug.), 434-436.
  29. HESTER, J. H., AND HIRSCHBERG, D. S. 1985. Self-organizing linear search. ACM Comput. Surv. 17, 3 (Sept.), 295-311.
  30. HORSPOOL, R. N., AND CORMACK, G. V. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792-794.
  31. HU, T. C., AND TAN, K. C. 1972. Path length of binary search trees. SIAM J. Appl. Math. 22, 2 (Mar.), 225-234.
  32. HU, T. C., AND TUCKER, A. C. 1971. Optimal computer search trees and variable-length alphabetic codes. SIAM J. Appl. Math. 21, 4 (Dec.), 514-532.
  33. HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (Sept.), 1098-1101.
  34. INGELS, F. M. 1971. Information and Coding Theory. Intext, Scranton, Penn.
  35. ITAI, A. 1976. Optimal alphabetic trees. SIAM J. Comput. 5, 1 (Mar.), 9-18.
  36. KARP, R. M. 1961. Minimum redundancy coding for the discrete noiseless channel. IRE Trans. Inf. Theory 7, 1 (Jan.), 27-38.
  37. KNUTH, D. E. 1971. Optimum binary search trees. Acta Inf. 1, 1 (Jan.), 14-25.
  38. KNUTH, D. E. 1985. Dynamic Huffman coding. J. Algorithms 6, 2 (June), 163-180.
  39. KRAUSE, R. M. 1962. Channels which transmit letters of unequal duration. Inf. Control 5, 1 (Mar.), 13-24.
  40. LAESER, R. P., MCLAUGHLIN, W. I., AND WOLFF, D. M. 1986. Engineering Voyager 2's encounter with Uranus. Sci. Am. 255, 5 (Nov.), 36-45.
  41. LANGDON, G. G., AND RISSANEN, J. J. 1983. A double-adaptive file compression algorithm. IEEE Trans. Commun. 31, 11 (Nov.), 1253-1255.
  42. LLEWELLYN, J. A. 1987. Data compression for a source with Markov characteristics. Comput. J. 30, 2, 149-156.
  43. MCINTYRE, D. R., AND PECHURA, M. A. 1985. Data compression using static Huffman code-decode tables. Commun. ACM 28, 6 (June), 612-616.
  44. MEHLHORN, K. 1980. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans. Inf. Theory 26, 5 (Sept.), 513-517.
  45. NEUMANN, P. G. 1962. Efficient error-limiting variable-length codes. IRE Trans. Inf. Theory 8, 4 (July), 292-304.
  46. PARKER, D. S. 1980. Conditions for the optimality of the Huffman algorithm. SIAM J. Comput. 9, 3 (Aug.), 470-489.
  47. PASCO, R. 1976. Source coding algorithms for fast data compression. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif.
  48. PECHURA, M. 1982. File archival techniques using data compression. Commun. ACM 25, 9 (Sept.), 605-609.
  49. PERL, Y., GAREY, M. R., AND EVEN, S. 1975. Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters. J. ACM 22, 2 (Apr.), 202-214.
  50. PKARC FAST! 1987. PKARC FAST! File Archival Utility, Version 3.5. PKWARE, Inc., Glendale, Wis.
  51. REGHBATI, H. K. 1981. An overview of data compression techniques. Computer 14, 4 (Apr.), 71-75.
  52. RISSANEN, J. J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20 (May), 198-203.
  53. RISSANEN, J. J. 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 5 (Sept.), 656-664.
  54. RODEH, M., PRATT, V. R., AND EVEN, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.
  55. RUBIN, F. 1976. Experiments in text file compression. Commun. ACM 19, 11 (Nov.), 617-623.
  56. RUBIN, F. 1979. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory 25, 6 (Nov.), 672-675.
  57. RUDNER, B. 1971. Construction of minimum-redundancy codes with optimal synchronizing property. IEEE Trans. Inf. Theory 17, 4 (July), 478-487.
  58. RUTH, S. S., AND KREUTZER, P. J. 1972. Data compression for large business files. Datamation 18, 9 (Sept.), 62-66.
  59. RYABKO, B. Y. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792.
  60. SAMET, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2 (June), 187-260.
  61. SCHWARTZ, E. S. 1964. An optimum encoding with minimum longest code and total number of digits. Inf. Control 7, 1 (Mar.), 37-44.
  62. SCHWARTZ, E. S., AND KALLICK, B. 1964. Generating a canonical prefix encoding. Commun. ACM 7, 3 (Mar.), 166-169.
  63. SEVERANCE, D. G. 1983. A practitioner's guide to data base compression. Inf. Syst. 8, 1, 51-62.
  64. SHANNON, C. E., AND WEAVER, W. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill.
  65. SNYDERMAN, M., AND HUNT, B. 1970. The myriad virtues of text compaction. Datamation 16, 12 (Dec.), 36-40.
  66. STANDISH, T. A. 1980. Data Structure Techniques. Addison-Wesley, Reading, Mass.
  67. STIFFLER, J. J. 1971. Theory of Synchronous Communications. Prentice-Hall, Englewood Cliffs, N.J.
  68. STORER, J. A., AND SZYMANSKI, T. G. 1982. Data compression via textual substitution. J. ACM 29, 4 (Oct.), 928-951.
  69. TANAKA, H. 1987. Data structure of Huffman codes and its application to efficient encoding and decoding. IEEE Trans. Inf. Theory 33, 1 (Jan.), 154-156.
  70. TROPPER, R. 1982. Binary-coded text, a text-compression method. BYTE 7, 4 (Apr.), 398-413.
  71. UNIX 1984. UNIX User's Manual, Version 4.2. Berkeley Software Distribution, Virtual VAX-11 Version, Univ. of California, Berkeley, Calif.
  72. VARN, B. 1971. Optimal variable length codes (arbitrary symbol cost and equal code word probability). Inf. Control 19, 4 (Nov.), 289-301.
  73. VITTER, J. S. 1987. Design and analysis of dynamic Huffman codes. J. ACM 34, 4 (Oct.), 825-845.
  74. WAGNER, R. A. 1973. Common phrases and minimum-space text storage. Commun. ACM 16, 3 (Mar.), 148-152.
  75. WELCH, T. A. 1984. A technique for high-performance data compression. Computer 17, 6 (June), 8-19.
  76. WILKINS, L. C., AND WINTZ, P. A. 1971. Bibliography on data compression, picture properties and picture coding. IEEE Trans. Inf. Theory 17, 2, 180-197.
  77. WITTEN, I. H., NEAL, R. M., AND CLEARY, J. G. 1987. Arithmetic coding for data compression. Commun. ACM 30, 6 (June), 520-540.
  78. ZIMMERMAN, S. 1959. An optimal search procedure. Am. Math. Monthly 66 (Oct.), 690-693.
  79. ZIV, J., AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May), 337-343.
  80. ZIV, J., AND LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5 (Sept.), 530-536.

Reviews

Ian Hugh Witten

This paper reviews the techniques of data compression that have been developed since Shannon's and Fano's pioneering work in the late 1940s. It deals exclusively with reversible (sometimes called noiseless) compression, where the decoder can recover exactly what the transmitter sent (barring transmission errors). Nonreversible methods are normally used for coding speech, gray-scale or color pictures, and telemetry data, where significantly increased compression can be obtained if one is satisfied with a high-fidelity approximation to the original signal rather than bit-for-bit reproduction. Reversible methods are used for text, program, and database compression, where errors cannot generally be tolerated.

The authors have compiled an impressive variety of approaches to coding for data compression, including Shannon-Fano coding, Huffman coding and numerous elaborations such as efficient methods for adaptive Huffman coding, Elias's variable-length representation of the integers, Fibonacci codes, arithmetic coding, Ziv-Lempel methods, and an adaptive move-to-front word code. They quote a number of empirical results on coding efficiency from other papers. Several coding schemes that depend on the semantics of particular domains are mentioned, although the authors correctly point out that general-purpose methods can perform just as well and are less sensitive to the type of data being encoded. A number of applications are mentioned to motivate the survey, including commonly used file archival programs such as ARC, file compression systems such as UNIX compress, and even some work that apparently uses Huffman coding to find leaks in a pipeline.

The past 40 years have been dominated by Huffman's algorithm, published in 1952; so, too, is the paper, which devotes a large proportion of its length to this method. By choosing to cover the history of compression fairly evenly, the authors have given Huffman coding undue prominence, and consequently the primary methods currently used, arithmetic coding and Ziv-Lempel coding, receive short shrift. For this reason, while the paper may be excellent as historical background for new researchers, it is of limited value to the practitioner seeking to understand state-of-the-art compression methods. One of the reviewers has argued the benefits of arithmetic over Huffman coding elsewhere [1], and this is not the place to repeat those arguments. Suffice it to say that arithmetic coding gives greater compression, can be faster for adaptive methods, and clearly separates the model from the channel encoding. In this survey, however, arithmetic coding is tucked away at the end of the section on nonadaptive methods (although the fact that it can be used adaptively is mentioned), while a substantial section is devoted to adaptive Huffman coding in its own right, quite apart from the material on nonadaptive Huffman coding.

One of the most important advances in the theory of data compression over the last decade is the insight that data compression can be split into two parts: an encoder and a separate modeler that feeds information to it [2]. This survey is principally devoted to the study of coding, where a probability distribution (model) is assumed and particular messages are represented as compactly as possible based on these probabilities. The authors do not sufficiently stress this model-dependence, however. For example, they often speak of “the” entropy of a message as though it imposes a lower bound on the number of bits required for coding; this is of course only true with respect to a particular model of the source that generated the message. It is quite easy to represent text with fewer bits than its entropy calculated from letter frequencies would imply (for example, by using a model based on words). While this is obvious enough, it deserves to be emphasized. Indeed, although a wide range of coding algorithms are surveyed, only the simplest of models, those based on single-character frequencies, are considered. These are capable of only mediocre compression. More sophisticated models (for example, ones based on characters in context) are required to achieve better compression, but these are only touched on in the “New directions” section.
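The model-dependence point is easy to make concrete. The following small Python sketch (an illustration added here, not code from the survey or the review; the helper name entropy_bits is invented) costs the same text under an order-0 character-frequency model and under an order-0 word-frequency model; the word model needs far fewer bits than the “character entropy” would suggest is possible:

    import math
    from collections import Counter

    def entropy_bits(symbols):
        # Total bits to code the sequence under an order-0 model of its own
        # symbol frequencies: -sum over symbols s of n_s * log2(n_s / n).
        counts = Counter(symbols)
        n = len(symbols)
        return -sum(c * math.log2(c / n) for c in counts.values())

    text = "the quick brown fox jumps over the lazy dog " * 50
    print(entropy_bits(text))          # model: single characters
    print(entropy_bits(text.split()))  # model: whole words -- far fewer
                                       # bits (separator cost ignored)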
The section on Ziv-Lempel coding is confused and confusing. The main algorithm described is Welch's version [3], but parts of the description pertain to other versions. The trouble is that these versions are moderately different from each other. For example, Welch's uses a fixed code length of 12 bits, which performs poorly on large files. The UNIX compress utility uses an increasing code length, and so it is not stuck with the behavior it learns initially. Also, the complexity of LZ coding is given as O(n^2), or O(n) with a complex data structure. This is true only of the version given by Even et al. [4]; others (including Welch's) can quite easily be implemented in O(n) time using a trie; indeed, some are O(n) with a linear search (see Ziv and Lempel [5]). Much of this confusion stems from the fact that Ziv and Lempel published two papers [5, 6] that describe rather different methods for data compression, yet one invariably reads in the literature about “the” Ziv-Lempel method (and, to make matters worse, it is abbreviated with the initials reversed, as “LZ coding”).
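For reference, here is a minimal Python sketch of the dictionary-building loop in Welch's variant (a sketch under assumed naming, not the paper's or Welch's own code). A hash table stands in for the trie, so each input byte does O(1) expected work and the whole pass runs in O(n) time:

    def lzw_compress(data):
        # Dictionary maps phrases (byte strings) to integer codes;
        # it starts preloaded with all 256 single-byte phrases.
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        phrase = b""
        output = []
        for b in data:
            candidate = phrase + bytes([b])
            if candidate in dictionary:
                phrase = candidate                 # keep extending the match
            else:
                output.append(dictionary[phrase])  # emit code for the match
                dictionary[candidate] = next_code  # learn the new phrase
                next_code += 1
                # Welch's scheme emits fixed 12-bit codes and stops learning
                # at 4096 entries; UNIX compress instead widens the code
                # length each time next_code crosses a power of two.
                phrase = bytes([b])
        if phrase:
            output.append(dictionary[phrase])
        return output

    print(lzw_compress(b"abababababab"))   # 6 codes for 12 input bytes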
Most coding methods are optimal in one sense or another, and the authors mention many important optimality results. For example, Huffman's algorithm generates an optimal “defined-word” code (i.e., one with integer-length codes), Ziv and Lempel's 1978 method approaches the source entropy as the size of the input increases, and arithmetic coding represents the input in a number of bits that is arbitrarily close to its entropy with respect to the model used for coding. Unfortunately, the authors fail to point out the practical relevance of these results. One is that arithmetic coding is guaranteed to give compression superior to Huffman coding, assuming that both are based on the same model. Another is that while Ziv-Lempel methods yield asymptotically optimal compression, they are exceedingly slow to approach the asymptote (and do so at the expense of unbounded memory requirements); in consequence, they are invariably outperformed by probability-based methods.

The section on empirical results is disappointing because there is no comparison between the performance of the methods on the same test files. Instead, unrelated results from research papers are quoted, and in some cases these are not even converted to a common unit for comparison. The crucial influence of the underlying models on compression performance is hardly acknowledged. Also, there is little mention of the relative speeds of the methods, which is often an important factor in practice.

A final section about errors considers in detail how each coding method performs when the encoded data are transmitted on a noisy channel. The authors might have pointed out that a code can only be robust if it contains redundancy, and that robustness is thus directly opposed to compression. It is far better to separate the two processes, making use of the wealth of error-correcting techniques available in the literature after the message has been compressed.

The survey covers a wide range of coding methods in a fair amount of detail. It is particularly useful as a historical record of implementation techniques for the now-superseded method of Huffman coding. The main disappointment is that the reader may find it difficult to distinguish the relative merits of the bewildering battery of methods available.
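The compress-first, protect-afterward separation advocated above can be sketched with modern Python library calls (zlib postdates the survey; this illustrates the principle only, and a CRC merely detects errors where a real channel code would correct them):

    import zlib

    def protect(message):
        compressed = zlib.compress(message)   # stage 1: remove redundancy
        checksum = zlib.crc32(compressed)     # stage 2: add controlled
        return checksum.to_bytes(4, "big") + compressed   # redundancy back

    def unprotect(packet):
        checksum = int.from_bytes(packet[:4], "big")
        compressed = packet[4:]
        if zlib.crc32(compressed) != checksum:
            raise ValueError("channel error detected")
        return zlib.decompress(compressed)

    assert unprotect(protect(b"some message " * 100)) == b"some message " * 100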


  • Published in

    ACM Computing Surveys, Volume 19, Issue 3 (Sept. 1987), 96 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/45072

    Copyright © 1987 ACM

    Publisher

    Association for Computing Machinery, New York, NY, United States
