ABSTRACT
Full-text retrieval systems often use either a bitmap or an inverted file to identify which documents contain which terms, so that the documents containing any combination of query terms can be quickly located. Bitmaps of term occurrences are large, but are usually sparse, and thus are amenable to a variety of compression techniques. Here we consider techniques in which the encoding of each bitvector within the bitmap is parameterised, so that a different code can be used for each bitvector. Our experimental results show that the new methods yield better compression than previous techniques.
- 1.A. Bookstein and S.T. Klein. Compression of correlated bit-vectors. Information Systems, 16(4):387-400, 1991. Google ScholarDigital Library
- 2.A. Bookstein and S.T. Klein. Flexible compression for bitmap sets. In j.A. Storer and J.H. Reif, editors~ Proc. IEEE Data Compression Conference, pages 402-410, Snowbird, Utah, April 1991.Google ScholarCross Ref
- 3.A. Bookstein and S.T. Klein. Generative roodels for bitmap sets with compression applications. In Proc. 1~ ~th ACM-SIGIR Conference on Information Retrieval, pages 63-71, Chicago, 1991. Google ScholarDigital Library
- 4.A. Bookstein, 8.T. Klein, and T. Raita. Model based concordance compression. In J.A. Storer and J.H. Reif, editors, Proc. IEEE Data Compression Conference, Snowbird, Utah, March 1992. To appear.Google Scholar
- 5.Y. Choueka, A.S. Fraenkel, and S.T. Klein. Compression of concordances in full-text retrieval systems. In Proc. 11'th A CM-SIGIR Conference on Information Retrieval, pages 597-612, Grenoble, France, June 1988. ACM, New York. Google ScholarDigital Library
- 6.Y. Choueka, A.S. Fraenkel, S.T. Klein, and E. Segal. Improved hierarchical bit-vector compression in document retrieval systems. In Proc. 9'th A CM-SIGIR Conference on Information Retrieval, pages 88-97, Pisa, Italy, September 1986. ACM, New York. Google ScholarDigital Library
- 7.W.B. Croft and P. Savino. Implementing ranking strategies using text signatures. A CM Transactions on Office Information Systems, 6( 1):42-62, 1988. Google ScholarDigital Library
- 8.P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21:194-203, March 1975.Google ScholarDigital Library
- 9.A.S. Fraenkel and $.T. Klein. Novel compression of sparse bit-strings~Preliminary report. In A. Apostolico and Z. Galil, editors, Combinatorial Algorithms on Words, Volume 12, NATO ASI Series F, pages 169-183, Berlin, 1985. Springer-Verlag.Google ScholarCross Ref
- 10.R.G. Gallager and D.C. Van Voorhis. OptimM source codes for geometrically distributed alphabets. IEEE Transactions on Information Theory, IT-21(2):228-230, March 1975.Google ScholarDigital Library
- 11.S.W. Golomb. Run-length encodings. IEEE Transactions on Information Theory, IT- 12(3):399-401, July 1966.Google ScholarDigital Library
- 12.D.A. Huffman. A method for the construction of minimum redundancy codes. Proc. IRE, 40(9):1098-1101, September 1952.Google ScholarCross Ref
- 13.M. Jakobsson. Huffman coding in bit-vector compression. Information Processing Letters, 7(6):304-307, October 1978.Google ScholarCross Ref
- 14.M. Lesk. Some applications of inverted indexes on the Unix system. In Unix Programmers Manual, Volume 214. Bell Laboratories, Murray Hill, New Jersey, 1988.Google Scholar
- 15.M.D. Mcilroy. Development of a spelling fist. IEEE Transactions on Communications, COM-30(1):91-99, January 1982.Google Scholar
- 16.A.M. Moffat. Economical inversion of large text files. Computing Systems, 5(2), 1992. To appear.Google Scholar
- 17.A.M. Moffat and J. Zobel. Coding for compression in full-text retrieval systems. In J.A. Storer and J.H. Reif, editors, Proc. IEEE Data Compression Conference, Snowbird, Utah, March 1992. To appear.Google Scholar
- 18.G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983. Google ScholarDigital Library
- 19.E.J. Schuegraf. Compression of large inverted files with hyperbolic term distribution. Information Processing ~Z Management, 12:377- 384, 1976.Google Scholar
- 20.Z. Somogyi. The Melbourne University bibliography system. Technical Report 90/3, Department of Computer Science, The University of Melbourne, Parkville, Victoria 3052, Australia, March 1990.Google Scholar
- 21.J. Teuhola. A compression method for clustered bit-vectors. Information Processing Letters, 7(6):308-311, October 1978.Google ScholarCross Ref
- 22.B. Tuthill. Refer~a bibliography system. In Unix User's Manual Supplementary Documents. 4.2 Berkeley Software Distribution, Berkeley, California, 1984.Google Scholar
- 23.I.H. Witten, T.C. Bell, and C. Nevill. Models for compression in full-text retrieval systems. In J.A. Storer and J.H. Reif, editors, Proc. IEEE Data Compression Conference, pages 23-32, Snowbird, Utah, April 1991.Google ScholarCross Ref
- 24.I.H. Witten, R. Neal, and J.G. C/eary. Arithmetic coding for data compression. Communications of the A CM, 30(6):520-541, June 1987. Google ScholarDigital Library
- 25.J. Zobel and A.M. Moffat. Adding compression to a full-text retrieval system. In Proc. 15'th Australian Computer Science Conference, pages 1077-1089, Hobart, January 1992. University of Tasmania.Google Scholar
- 26.J. Zobel, A.M. Moffat, and It. Sacks-Davis. An efficient indexing technique for full-text database systems. Technical Report 92/21, Collaborative Information Technology Research Institute, Department of Computer Science, Royal Melbourne Institute of Technology, Australia, February 1992.Google Scholar
Index Terms
- Parameterised compression for sparse bitmaps
Recommendations
Optimizing bitmap indices with efficient compression
Bitmap indices are efficient for answering queries on low-cardinality attributes. In this article, we present a new compression scheme called Word-Aligned Hybrid (WAH) code that makes compressed bitmap indices efficient even for high-cardinality ...
Better bitmap performance with Roaring bitmaps
Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus, we might prefer compressed bitmap indexes. Following Oracle's ...
Estimating JPEG compression history of bitmaps based on factor histogram
Estimation of JPEG compression history for bitmaps has drawn increasing attention in the past decade due to its extensive applications in image processing, image forensics and steganalysis. In this paper, we propose a novel statistic named factor ...
Comments