article

Indexing compressed text

Authors:
Paolo Ferragina

Università di Pisa, Pisa, Italy

Università di Pisa, Pisa, Italy
View Profile

,
Giovanni Manzini

Università del Piemonte Orientale, Alessandria, Italy

Università del Piemonte Orientale, Alessandria, Italy
View Profile

Authors Info & Claims

Journal of the ACM Volume 52 Issue 4pp 552–581https://doi.org/10.1145/1082036.1082039

Published:01 July 2005Publication History

Journal of the ACM

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log^1+ε n) time for any chosen ε, 0<ε<1. This data structure uses at most 5nH_k(T) + o(n) bits of storage, where H_k(T) is the kth order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows--Wheeler Transform, and can be regarded as a compressed suffix array.Our second compressed data structure achieves O(p+occ) query time using O(nH_k(T)log^ε n) + o(n) bits of storage for any chosen ε, 0<ε<1. Therefore, it provides optimal output-sensitive query time using o(nlog n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.

References

Alstrup, S., Brodal, G. S., and Rauhe, T. 2000. New data structures for orthogonal range searching. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 198--207.]] Google ScholarDigital Library
Andersson, A. 1996. Sorting and searching revisited. In Proceedings of the 5th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science, vol. 1097. Springer-Verlag, New York, 185--197.]] Google ScholarDigital Library
Bentley, J., Sleator, D., Tarjan, R., and Wei, V. 1986. A locally adaptive compression scheme. Commun. ACM 29, 4, 320--330.]] Google ScholarDigital Library
Brodnik, A., and Munro, I. 1999. Membership in constant time and almost-minimum space. SIAM J. Comput. 28, 5, 1627--1640.]] Google ScholarDigital Library
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.]]Google Scholar
Chan, H., Hon, W., and Lam, T. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 445--456.]]Google Scholar
Clark, D. R., and Munro, I. 1996. Efficient suffix trees on secondary storage. In Proceedings of the 7th ACM-SIAM Symposium on Discrete Algorithms. ACM, New York, 383--391.]] Google ScholarDigital Library
Colussi, L., and De Col, A. 1996. A time and space efficient data structure for string searching on large texts. Info. Proc. Lett. 58, 5, 217--222.]]Google ScholarCross Ref
Farach, M., and Thorup, M. 1998. String matching in Lempel--Ziv compressed strings. Algorithmica 20, 4, 388--404.]]Google ScholarCross Ref
Fenwick, P. 1996. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer J. 39, 9, 731--740.]]Google ScholarCross Ref
Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. J. ACM 52, 4 (July), 688--713.]] Google ScholarDigital Library
Ferragina, P., and Grossi, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280.]] Google ScholarDigital Library
Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 390--398.]] Google ScholarDigital Library
Ferragina, P., and Manzini, G. 2001. An experimental study of a compressed index. Information Sciences: Special issue on “Dictionary Based Compression” 135, 13--28.]] Google ScholarDigital Library
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 150--160.]]Google Scholar
Gonnet, G. H., Baeza-Yates, R. A., and Snider, T. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, B. Frakes and R. A. Baeza-Yates Eds. Prentice-Hall, Englewood Cliffs, N.J., Chapter 5, 66--82.]] Google ScholarDigital Library
Grabowski, S., Mäkinen, V., and Navarro, G. 2004. First Huffman, then Burrows-Wheeler: An alphabet-independent FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 210--211.]]Google Scholar
Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 841--850.]] Google ScholarDigital Library
Grossi, R., Gupta, A., and Vitter, J. 2004. When indexing equals compression: Experiments on compressing suffix arrays and applications. In Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 636--645.]] Google ScholarDigital Library
Grossi, R., and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing. ACM, New York, 397--406.]] Google ScholarDigital Library
Healy, J., Thomas, E. E., Schwartz, J. T., and Wigler, M. 2003. Annotating large genomes with exact word matches. Genome Research 13, 2306--2315.]]Google ScholarCross Ref
Hon, W., Lam, T., Sadakane, K., Sung, W., and Yiu, S. 2004a. Compressed index for dynamic text. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., 102--111.]] Google ScholarDigital Library
Hon, W., Lam, T., Sung, W., Tse, W., Wong, C., and Yiu, S. 2004b. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments. SIAM Press, Philadelphia, Pa., 31--38.]]Google Scholar
Huynh, N., Hon, W., Lam, T., and Sung, W. 2004. Approximate string matching using compressed suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 434--444.]]Google Scholar
Kärkkäinen, J., and Sutinen, E. 1998. Lempel-Ziv index for q-grams. Algorithmica 21, 137--154.]]Google ScholarCross Ref
Kärkkäinen, J., and Ukkonen, E. 1996. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing, N. Ziviani, R. Baeza-Yates, and K. Guimarães, Eds. Carleton University Press, 141--155.]]Google Scholar
Kosaraju, R., and Manzini, G. 1999. Compression of low entropy strings with Lempel--Ziv algorithms. SIAM J. Comput. 29, 3, 893--911.]] Google ScholarDigital Library
Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software---Practice and Experience 29, 13, 1149--1171.]] Google ScholarDigital Library
Mäkinen, V., and Navarro, G. 2004. Compressed compact suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 420--433.]]Google Scholar
Mäkinen, V., Navarro, G., and Sadakane, K. 2004. Advantages of backward searching---efficient secondary memory and distributed implementation of compressed suffix arrays. In Proceedings of the 15th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, Springer-Verlag, New York.]]Google Scholar
Manber, U., and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.]] Google ScholarDigital Library
Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430.]] Google ScholarDigital Library
McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262-- 272.]] Google ScholarDigital Library
Munro, J. I., Raman, V., and Rao, S. S. 2001. Space efficient suffix trees. J. Algor. 39, 2, 205--222.]] Google ScholarDigital Library
Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. J. Discr. Algor. 2, 1, 87--114.]] Google ScholarDigital Library
Pagh, R. 2001. Low redundancy in static dictionaries with constant query time. SIAM J. Comput. 31, 2, 353--363.]] Google ScholarDigital Library
Raman, R., Raman, V., and Rao, S. S. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 233--242.]] Google ScholarDigital Library
Rao, S. S. 2002. Time-space trade-offs for compressed suffix arrays. Inf. Proc. Lett. 82, 6, 307--311.]] Google ScholarDigital Library
Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceeding of the 11th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, vol. 1969. Springer-Verlag, New York, 410--421.]] Google ScholarDigital Library
Sadakane, K. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 225--232.]] Google ScholarDigital Library
Sadakane, K. 2003. New text indexing functionalities of compressed suffix arrays. J. Algor. 48, 2, 294--313.]] Google ScholarDigital Library
Sadakane, K., and Shibuya, T. 2001. Indexing huge genome sequences for solving various problems. Genome Informatics 12, 175--183.]]Google Scholar
Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed., Morgan Kaufmann Publishers, Los Altos, Calif.]] Google ScholarDigital Library
Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. Info. Theory 24, 530--536.]]Google ScholarCross Ref

Index Terms

Recommendations

Compressed representations of sequences and full-text indexes

Given a sequence S = s₁s₂…s_n of integers smaller than r = O(polylog(n)), we show how S can be represented using nH₀(S) + o(n) bits, so that we can know any s_q, as well as answer rank and select queries on S, in constant time. H₀(S) is the zero-order ...
Read More
When indexing equals compression: Experiments with compressing suffix arrays and applications

We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice. Our experiments indicate that the resulting text index offers ...
Read More
Boosting textual compression in optimal linear time

We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Journal of the ACM Volume 52, Issue 4
July 2005
199 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1082036
Issue’s Table of Contents

Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 July 2005
Published in jacm Volume 52, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Burrows--Wheeler transform
Lempel--Ziv compressor
full-text indexing
indexing data structure
pattern searching
suffix array
suffix tree
text compression
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 522
  Total Citations
  View Citations
- 3,000
  Total Downloads
- Downloads (Last 12 months)95
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Indexing compressed text

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Compressed representations of sequences and full-text indexes

When indexing equals compression: Experiments with compressing suffix arrays and applications

Boosting textual compression in optimal linear time

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Indexing compressed text

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Compressed representations of sequences and full-text indexes

When indexing equals compression: Experiments with compressing suffix arrays and applications

Boosting textual compression in optimal linear time

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media