skip to main content
research-article

Compressed text indexes: From theory to practice

Published:23 February 2009Publication History
Skip Abstract Section

Abstract

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.

The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective test-beds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel algorithmic technology.

References

  1. Aluru, S., and Ko, P. 2008. Encyclopedia of Algorithms. Springer, Chapter on “Lookup Tables, Suffix Trees and Suffix Arrays”. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Andersson, A. and Nilsson, S. 1995. Efficient implementation of suffix trees. Software Practice and Experience 25, 2, 129--141. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Arroyuelo, D., and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings 16th Annual International Symposium on Algorithms and Computation (ISAAC). Lecture Notes in Computer Science, vol. 3827. Springer, 1143--1152.Google ScholarGoogle Scholar
  4. Arroyuelo, D., and Navarro, G. 2008. Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices. Tech. Rep. TR/DCC-2008-9, Dept. of Computer Science, Univ. of Chile. July.Google ScholarGoogle Scholar
  5. Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer, 319--330.Google ScholarGoogle Scholar
  6. Brisaboa, N., Fariña, A., Ladra, S., and Navarro, G. 2008. Reorganizing compressed text. In Proceedings 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM Press, 139--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Brisaboa, N., Fariña, A., Navarro, G., Places, A., and Rodríguez, E. 2008. Self-indexing natural language. In Proceedings 15th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Burrows, M. and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.Google ScholarGoogle Scholar
  9. Claude, F. 2008. Compressed data structures for Web graphs. M.S. thesis, Dept. of Computer Science, Univ. of Chile. Also as Tech. Report TR/DCC-2008-12.Google ScholarGoogle Scholar
  10. Claude, F. and Navarro, G. 2008. Practical rank/select queries over arbitrary sequences. In Proceedings 15th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Cole, R., Kopelowitz, T., and Lewenstein, M. 2006. Suffix trays and suffix trists: Structures for faster text indexing. In Proceedings 33th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 4051. Springer, 358--369.Google ScholarGoogle Scholar
  12. Fariña, A., Navarro, G., and Paramá, J. 2008. Word-based statistical compressors as natural language compression boosters. In Proceedings 18th Data Compression Conference (DCC). 162--171. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. Journal of the ACM 52, 688--713. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2005. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 184--196. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2006. Compressing and searching xml data via two zips. In Proceedings 15th World Wide Web Conference (WWW). 751--760. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Ferragina, P. and Manzini, G. 2001. An experimental study of an opportunistic index. In Proceedings 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 269--278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Ferragina, P. and Manzini, G. 2005. Indexing compressed texts. Journal of the ACM 52, 4, 552--581. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3, 2, article 20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ferragina, P. and Venturini, R. 2007a. Compressed permuterm indexes. In Proceedings 30th ACM SIGIR Conference on Research and Development in Information Retrieval. 535--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Ferragina, P. and Venturini, R. 2007b. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372, 1, 115--121. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Foschini, L., Grossi, R., Gupta, A., and Vitter, J. 2006. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms 2, 4, 611--639. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Software Practice and Experience 33, 11, 1035--1049.Google ScholarGoogle ScholarCross RefCross Ref
  23. González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA). 27--38.Google ScholarGoogle Scholar
  24. González, R. and Navarro, G. 2006. Statistical encoding of succinct data structures. In Proceedings 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer, 295--306.Google ScholarGoogle Scholar
  25. González, R. and Navarro, G. 2007. Compressed text indexes with fast locate. In Proceedings 18th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4580. Springer, 216--227. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proceedings 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 841--850. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Grossi, R. and Vitter, J. 2006. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35, 2, 378--407. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 251--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Jokinen, P. and Ukkonen, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings 2nd Annual Symposium on Mathematical Foundations of Computer Science (MFCS). Vol. 16. 240--248.Google ScholarGoogle Scholar
  31. Kärkkäinen, J. 1995. Suffix cactus: a cross between suffix tree and suffix array. In Proceedings 6th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 937. Springer, 191--204.Google ScholarGoogle Scholar
  32. Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings 3rd South American Workshop on String Processing (WSP). Carleton University Press, 141--155.Google ScholarGoogle Scholar
  33. Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Proceedings 2nd Annual International Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 1090. Springer, 219--230. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Lam, T.-W., Sadakane, K., Sung, W.-K., and Yiu, S.-M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings 8th Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 2387. Springer, 401--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Larsson, N. J. and Sadakane, K. 2007. Faster suffix sorting. Theoretical Computer Science (TCS) 387, 3, 258--272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Lehtinen, O., Sutinen, E., and Tarhio, J. 1996. Experiments on block indexing. In Proceedings 3rd South American Workshop on String Processing (WSP). Carleton University Press, 183--193.Google ScholarGoogle Scholar
  37. Lemström, K. and Perttu, S. 2000. SEMEX -- an efficient music retrieval prototype. In Proceedings 1st International Symposium on Music Information Retrieval (ISMIR).Google ScholarGoogle Scholar
  38. Mäkinen, V. and Navarro, G. 2005. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12, 1, 40--66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Mäkinen, V. and Navarro, G. 2008a. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4, 3, article 32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Mäkinen, V. and Navarro, G. 2008b. On self-indexing images—image compression with added value. In Proceedings 18th Data Compression Conference (DCC). 422--431. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Manber, U. and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal of Computing 22, 935--948. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Manber, U. and Wu, S. 1994. Glimpse: a tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference. USENIX Association, 4--4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. Journal of the ACM 48, 3, 407--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Manzini, G. and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1, 33--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Munro, I. 1996. Tables. In Proceedings 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). Lecture Notes in Computer Science, vol. 1180. Springer, 37--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Munro, I., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings 30th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer, 345--356.Google ScholarGoogle Scholar
  47. Munro, I. and Raman, V. 1997. Succinct representation of balanced parentheses, static trees and planar graphs. In Proceedings 38th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 118--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms (JDA) 2, 1, 87--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Navarro, G. and Baeza-Yates, R. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal 1, 2. http://www.clei.cl.Google ScholarGoogle Scholar
  50. Navarro, G., Baeza-Yates, R., Sutinen, E., and Tarhio, J. 2001. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24, 4, 19--27.Google ScholarGoogle Scholar
  51. Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Navarro, G. and Tarhio, J. 2005. LZgrep: A Boyer-Moore string matching tool for Ziv-Lempel compressed text. Software Practice and Experience 35, 12, 1107--1130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Puglisi, S., Smyth, W., and Turpin, A. 2007. A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39, 2, article 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Puglisi, S. J., Smyth, W. F., and Turpin, A. 2006. Inverted files versus suffix arrays for locating patterns in primary memory. In Proceedings 13th String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 4209. Springer, 122--133.Google ScholarGoogle Scholar
  56. Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 225--232. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Sadakane, K. 2003. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48, 2, 294--313. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings 7th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 1075. Springer, 50--61. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Ullman, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. The Computer Journal 10, 141--147.Google ScholarGoogle ScholarCross RefCross Ref
  60. Williams, H. E. and Zobel, J. 2002. Indexing and retrieval for genomic databases. IEEE Transaction on Knowledge and Data Engineering 14, 1, 63--78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, second ed. Morgan Kaufmann Publishers. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 5, 530--536.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Computing Surveys 38, 2, 6. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Compressed text indexes: From theory to practice

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in

                Full Access

                • Published in

                  cover image ACM Journal of Experimental Algorithmics
                  ACM Journal of Experimental Algorithmics  Volume 13, Issue
                  2009
                  482 pages
                  ISSN:1084-6654
                  EISSN:1084-6654
                  DOI:10.1145/1412228
                  Issue’s Table of Contents

                  Copyright © 2009 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 23 February 2009
                  Published in jea Volume 13, Issue

                  Qualifiers

                  • research-article
                  • Research
                  • Refereed

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader