research-article

Compressed text indexes: From theory to practice

Authors:
Paolo Ferragina

University of Pisa

University of Pisa
View Profile

,
Rodrigo González

University of Chile

University of Chile
View Profile

,
Gonzalo Navarro

University of Chile

University of Chile
View Profile

,
Rossano Venturini

University of Pisa

University of Pisa
View Profile

ACM Journal of Experimental Algorithmics Volume 13Article No.: 12pp 1.12–1.31https://doi.org/10.1145/1412228.1455268

Published:23 February 2009Publication History

ACM Journal of Experimental Algorithmics

Abstract

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.

The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective test-beds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel algorithmic technology.

References

Aluru, S., and Ko, P. 2008. Encyclopedia of Algorithms. Springer, Chapter on “Lookup Tables, Suffix Trees and Suffix Arrays”. Google ScholarDigital Library
Andersson, A. and Nilsson, S. 1995. Efficient implementation of suffix trees. Software Practice and Experience 25, 2, 129--141. Google ScholarDigital Library
Arroyuelo, D., and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings 16th Annual International Symposium on Algorithms and Computation (ISAAC). Lecture Notes in Computer Science, vol. 3827. Springer, 1143--1152.Google Scholar
Arroyuelo, D., and Navarro, G. 2008. Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices. Tech. Rep. TR/DCC-2008-9, Dept. of Computer Science, Univ. of Chile. July.Google Scholar
Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer, 319--330.Google Scholar
Brisaboa, N., Fariña, A., Ladra, S., and Navarro, G. 2008. Reorganizing compressed text. In Proceedings 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM Press, 139--146. Google ScholarDigital Library
Brisaboa, N., Fariña, A., Navarro, G., Places, A., and Rodríguez, E. 2008. Self-indexing natural language. In Proceedings 15th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. Springer. Google ScholarDigital Library
Burrows, M. and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.Google Scholar
Claude, F. 2008. Compressed data structures for Web graphs. M.S. thesis, Dept. of Computer Science, Univ. of Chile. Also as Tech. Report TR/DCC-2008-12.Google Scholar
Claude, F. and Navarro, G. 2008. Practical rank/select queries over arbitrary sequences. In Proceedings 15th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. Springer. Google ScholarDigital Library
Cole, R., Kopelowitz, T., and Lewenstein, M. 2006. Suffix trays and suffix trists: Structures for faster text indexing. In Proceedings 33th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 4051. Springer, 358--369.Google Scholar
Fariña, A., Navarro, G., and Paramá, J. 2008. Word-based statistical compressors as natural language compression boosters. In Proceedings 18th Data Compression Conference (DCC). 162--171. Google ScholarDigital Library
Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. Journal of the ACM 52, 688--713. Google ScholarDigital Library
Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2005. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 184--196. Google ScholarDigital Library
Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2006. Compressing and searching xml data via two zips. In Proceedings 15th World Wide Web Conference (WWW). 751--760. Google ScholarDigital Library
Ferragina, P. and Manzini, G. 2001. An experimental study of an opportunistic index. In Proceedings 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 269--278. Google ScholarDigital Library
Ferragina, P. and Manzini, G. 2005. Indexing compressed texts. Journal of the ACM 52, 4, 552--581. Google ScholarDigital Library
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3, 2, article 20. Google ScholarDigital Library
Ferragina, P. and Venturini, R. 2007a. Compressed permuterm indexes. In Proceedings 30th ACM SIGIR Conference on Research and Development in Information Retrieval. 535--542. Google ScholarDigital Library
Ferragina, P. and Venturini, R. 2007b. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372, 1, 115--121. Google ScholarDigital Library
Foschini, L., Grossi, R., Gupta, A., and Vitter, J. 2006. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms 2, 4, 611--639. Google ScholarDigital Library
Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Software Practice and Experience 33, 11, 1035--1049.Google ScholarCross Ref
González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA). 27--38.Google Scholar
González, R. and Navarro, G. 2006. Statistical encoding of succinct data structures. In Proceedings 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer, 295--306.Google Scholar
González, R. and Navarro, G. 2007. Compressed text indexes with fast locate. In Proceedings 18th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4580. Springer, 216--227. Google ScholarDigital Library
Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proceedings 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 841--850. Google ScholarDigital Library
Grossi, R. and Vitter, J. 2006. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35, 2, 378--407. Google ScholarDigital Library
Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. Google ScholarDigital Library
Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 251--260. Google ScholarDigital Library
Jokinen, P. and Ukkonen, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings 2nd Annual Symposium on Mathematical Foundations of Computer Science (MFCS). Vol. 16. 240--248.Google Scholar
Kärkkäinen, J. 1995. Suffix cactus: a cross between suffix tree and suffix array. In Proceedings 6th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 937. Springer, 191--204.Google Scholar
Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings 3rd South American Workshop on String Processing (WSP). Carleton University Press, 141--155.Google Scholar
Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Proceedings 2nd Annual International Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 1090. Springer, 219--230. Google ScholarDigital Library
Lam, T.-W., Sadakane, K., Sung, W.-K., and Yiu, S.-M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings 8th Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 2387. Springer, 401--410. Google ScholarDigital Library
Larsson, N. J. and Sadakane, K. 2007. Faster suffix sorting. Theoretical Computer Science (TCS) 387, 3, 258--272. Google ScholarDigital Library
Lehtinen, O., Sutinen, E., and Tarhio, J. 1996. Experiments on block indexing. In Proceedings 3rd South American Workshop on String Processing (WSP). Carleton University Press, 183--193.Google Scholar
Lemström, K. and Perttu, S. 2000. SEMEX -- an efficient music retrieval prototype. In Proceedings 1st International Symposium on Music Information Retrieval (ISMIR).Google Scholar
Mäkinen, V. and Navarro, G. 2005. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12, 1, 40--66. Google ScholarDigital Library
Mäkinen, V. and Navarro, G. 2008a. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4, 3, article 32. Google ScholarDigital Library
Mäkinen, V. and Navarro, G. 2008b. On self-indexing images—image compression with added value. In Proceedings 18th Data Compression Conference (DCC). 422--431. Google ScholarDigital Library
Manber, U. and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal of Computing 22, 935--948. Google ScholarDigital Library
Manber, U. and Wu, S. 1994. Glimpse: a tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference. USENIX Association, 4--4. Google ScholarDigital Library
Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. Journal of the ACM 48, 3, 407--430. Google ScholarDigital Library
Manzini, G. and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1, 33--50. Google ScholarDigital Library
Munro, I. 1996. Tables. In Proceedings 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). Lecture Notes in Computer Science, vol. 1180. Springer, 37--42. Google ScholarDigital Library
Munro, I., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings 30th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer, 345--356.Google Scholar
Munro, I. and Raman, V. 1997. Succinct representation of balanced parentheses, static trees and planar graphs. In Proceedings 38th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 118--126. Google ScholarDigital Library
Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms (JDA) 2, 1, 87--114. Google ScholarDigital Library
Navarro, G. and Baeza-Yates, R. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal 1, 2. http://www.clei.cl.Google Scholar
Navarro, G., Baeza-Yates, R., Sutinen, E., and Tarhio, J. 2001. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24, 4, 19--27.Google Scholar
Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2. Google ScholarDigital Library
Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarDigital Library
Navarro, G. and Tarhio, J. 2005. LZgrep: A Boyer-Moore string matching tool for Ziv-Lempel compressed text. Software Practice and Experience 35, 12, 1107--1130. Google ScholarDigital Library
Puglisi, S., Smyth, W., and Turpin, A. 2007. A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39, 2, article 4. Google ScholarDigital Library
Puglisi, S. J., Smyth, W. F., and Turpin, A. 2006. Inverted files versus suffix arrays for locating patterns in primary memory. In Proceedings 13th String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 4209. Springer, 122--133.Google Scholar
Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 225--232. Google ScholarDigital Library
Sadakane, K. 2003. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48, 2, 294--313. Google ScholarDigital Library
Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings 7th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 1075. Springer, 50--61. Google ScholarDigital Library
Ullman, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. The Computer Journal 10, 141--147.Google ScholarCross Ref
Williams, H. E. and Zobel, J. 2002. Indexing and retrieval for genomic databases. IEEE Transaction on Knowledge and Data Engineering 14, 1, 63--78. Google ScholarDigital Library
Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, second ed. Morgan Kaufmann Publishers. Google ScholarDigital Library
Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 5, 530--536.Google ScholarDigital Library
Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Computing Surveys 38, 2, 6. Google ScholarDigital Library

Index Terms

Compressed text indexes: From theory to practice
1. Information systems
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Pattern matching
      2. Sorting and searching
  2. Randomness, geometry and discrete structures

Recommendations

Compressed full-text indexes

Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that ...
Read More
Compressed representations of sequences and full-text indexes

Given a sequence S = s₁s₂…s_n of integers smaller than r = O(polylog(n)), we show how S can be represented using nH₀(S) + o(n) bits, so that we can know any s_q, as well as answer rank and select queries on S, in constant time. H₀(S) is the zero-order ...
Read More
Indexing compressed text

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Journal of Experimental Algorithmics Volume 13, Issue
2009
482 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/1412228
Issue’s Table of Contents

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 February 2009
Published in jea Volume 13, Issue
Author Tags
Text indexing
bioinformatics databases
coding and information theory
data storage representation
data structures
indexing methods
text compression
textual databases
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 91
  Total Citations
  View Citations
- 1,302
  Total Downloads
- Downloads (Last 12 months)25
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Compressed text indexes: From theory to practice

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Compressed full-text indexes

Compressed representations of sequences and full-text indexes

Indexing compressed text

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Compressed text indexes: From theory to practice

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Compressed full-text indexes

Compressed representations of sequences and full-text indexes

Indexing compressed text

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media