Abstract
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.
The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective test-beds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel algorithmic technology.
- Aluru, S., and Ko, P. 2008. Encyclopedia of Algorithms. Springer, Chapter on “Lookup Tables, Suffix Trees and Suffix Arrays”. Google ScholarDigital Library
- Andersson, A. and Nilsson, S. 1995. Efficient implementation of suffix trees. Software Practice and Experience 25, 2, 129--141. Google ScholarDigital Library
- Arroyuelo, D., and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings 16th Annual International Symposium on Algorithms and Computation (ISAAC). Lecture Notes in Computer Science, vol. 3827. Springer, 1143--1152.Google Scholar
- Arroyuelo, D., and Navarro, G. 2008. Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices. Tech. Rep. TR/DCC-2008-9, Dept. of Computer Science, Univ. of Chile. July.Google Scholar
- Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer, 319--330.Google Scholar
- Brisaboa, N., Fariña, A., Ladra, S., and Navarro, G. 2008. Reorganizing compressed text. In Proceedings 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM Press, 139--146. Google ScholarDigital Library
- Brisaboa, N., Fariña, A., Navarro, G., Places, A., and Rodríguez, E. 2008. Self-indexing natural language. In Proceedings 15th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. Springer. Google ScholarDigital Library
- Burrows, M. and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.Google Scholar
- Claude, F. 2008. Compressed data structures for Web graphs. M.S. thesis, Dept. of Computer Science, Univ. of Chile. Also as Tech. Report TR/DCC-2008-12.Google Scholar
- Claude, F. and Navarro, G. 2008. Practical rank/select queries over arbitrary sequences. In Proceedings 15th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. Springer. Google ScholarDigital Library
- Cole, R., Kopelowitz, T., and Lewenstein, M. 2006. Suffix trays and suffix trists: Structures for faster text indexing. In Proceedings 33th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 4051. Springer, 358--369.Google Scholar
- Fariña, A., Navarro, G., and Paramá, J. 2008. Word-based statistical compressors as natural language compression boosters. In Proceedings 18th Data Compression Conference (DCC). 162--171. Google ScholarDigital Library
- Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. Journal of the ACM 52, 688--713. Google ScholarDigital Library
- Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2005. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 184--196. Google ScholarDigital Library
- Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2006. Compressing and searching xml data via two zips. In Proceedings 15th World Wide Web Conference (WWW). 751--760. Google ScholarDigital Library
- Ferragina, P. and Manzini, G. 2001. An experimental study of an opportunistic index. In Proceedings 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 269--278. Google ScholarDigital Library
- Ferragina, P. and Manzini, G. 2005. Indexing compressed texts. Journal of the ACM 52, 4, 552--581. Google ScholarDigital Library
- Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3, 2, article 20. Google ScholarDigital Library
- Ferragina, P. and Venturini, R. 2007a. Compressed permuterm indexes. In Proceedings 30th ACM SIGIR Conference on Research and Development in Information Retrieval. 535--542. Google ScholarDigital Library
- Ferragina, P. and Venturini, R. 2007b. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372, 1, 115--121. Google ScholarDigital Library
- Foschini, L., Grossi, R., Gupta, A., and Vitter, J. 2006. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms 2, 4, 611--639. Google ScholarDigital Library
- Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Software Practice and Experience 33, 11, 1035--1049.Google ScholarCross Ref
- González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA). 27--38.Google Scholar
- González, R. and Navarro, G. 2006. Statistical encoding of succinct data structures. In Proceedings 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer, 295--306.Google Scholar
- González, R. and Navarro, G. 2007. Compressed text indexes with fast locate. In Proceedings 18th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4580. Springer, 216--227. Google ScholarDigital Library
- Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proceedings 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 841--850. Google ScholarDigital Library
- Grossi, R. and Vitter, J. 2006. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35, 2, 378--407. Google ScholarDigital Library
- Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. Google ScholarDigital Library
- Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 251--260. Google ScholarDigital Library
- Jokinen, P. and Ukkonen, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings 2nd Annual Symposium on Mathematical Foundations of Computer Science (MFCS). Vol. 16. 240--248.Google Scholar
- Kärkkäinen, J. 1995. Suffix cactus: a cross between suffix tree and suffix array. In Proceedings 6th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 937. Springer, 191--204.Google Scholar
- Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings 3rd South American Workshop on String Processing (WSP). Carleton University Press, 141--155.Google Scholar
- Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Proceedings 2nd Annual International Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 1090. Springer, 219--230. Google ScholarDigital Library
- Lam, T.-W., Sadakane, K., Sung, W.-K., and Yiu, S.-M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings 8th Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 2387. Springer, 401--410. Google ScholarDigital Library
- Larsson, N. J. and Sadakane, K. 2007. Faster suffix sorting. Theoretical Computer Science (TCS) 387, 3, 258--272. Google ScholarDigital Library
- Lehtinen, O., Sutinen, E., and Tarhio, J. 1996. Experiments on block indexing. In Proceedings 3rd South American Workshop on String Processing (WSP). Carleton University Press, 183--193.Google Scholar
- Lemström, K. and Perttu, S. 2000. SEMEX -- an efficient music retrieval prototype. In Proceedings 1st International Symposium on Music Information Retrieval (ISMIR).Google Scholar
- Mäkinen, V. and Navarro, G. 2005. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12, 1, 40--66. Google ScholarDigital Library
- Mäkinen, V. and Navarro, G. 2008a. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4, 3, article 32. Google ScholarDigital Library
- Mäkinen, V. and Navarro, G. 2008b. On self-indexing images—image compression with added value. In Proceedings 18th Data Compression Conference (DCC). 422--431. Google ScholarDigital Library
- Manber, U. and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal of Computing 22, 935--948. Google ScholarDigital Library
- Manber, U. and Wu, S. 1994. Glimpse: a tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference. USENIX Association, 4--4. Google ScholarDigital Library
- Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. Journal of the ACM 48, 3, 407--430. Google ScholarDigital Library
- Manzini, G. and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1, 33--50. Google ScholarDigital Library
- Munro, I. 1996. Tables. In Proceedings 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). Lecture Notes in Computer Science, vol. 1180. Springer, 37--42. Google ScholarDigital Library
- Munro, I., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings 30th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer, 345--356.Google Scholar
- Munro, I. and Raman, V. 1997. Succinct representation of balanced parentheses, static trees and planar graphs. In Proceedings 38th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 118--126. Google ScholarDigital Library
- Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms (JDA) 2, 1, 87--114. Google ScholarDigital Library
- Navarro, G. and Baeza-Yates, R. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal 1, 2. http://www.clei.cl.Google Scholar
- Navarro, G., Baeza-Yates, R., Sutinen, E., and Tarhio, J. 2001. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24, 4, 19--27.Google Scholar
- Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2. Google ScholarDigital Library
- Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarDigital Library
- Navarro, G. and Tarhio, J. 2005. LZgrep: A Boyer-Moore string matching tool for Ziv-Lempel compressed text. Software Practice and Experience 35, 12, 1107--1130. Google ScholarDigital Library
- Puglisi, S., Smyth, W., and Turpin, A. 2007. A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39, 2, article 4. Google ScholarDigital Library
- Puglisi, S. J., Smyth, W. F., and Turpin, A. 2006. Inverted files versus suffix arrays for locating patterns in primary memory. In Proceedings 13th String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 4209. Springer, 122--133.Google Scholar
- Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 225--232. Google ScholarDigital Library
- Sadakane, K. 2003. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48, 2, 294--313. Google ScholarDigital Library
- Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings 7th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 1075. Springer, 50--61. Google ScholarDigital Library
- Ullman, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. The Computer Journal 10, 141--147.Google ScholarCross Ref
- Williams, H. E. and Zobel, J. 2002. Indexing and retrieval for genomic databases. IEEE Transaction on Knowledge and Data Engineering 14, 1, 63--78. Google ScholarDigital Library
- Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, second ed. Morgan Kaufmann Publishers. Google ScholarDigital Library
- Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 5, 530--536.Google ScholarDigital Library
- Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Computing Surveys 38, 2, 6. Google ScholarDigital Library
Index Terms
- Compressed text indexes: From theory to practice
Recommendations
Compressed full-text indexes
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that ...
Compressed representations of sequences and full-text indexes
Given a sequence S = s1s2…sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order ...
Indexing compressed text
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ ...
Comments