skip to main content
article

Indexing compressed text

Published:01 July 2005Publication History
Skip Abstract Section

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log1+ε n) time for any chosen ε, 0<ε<1. This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows--Wheeler Transform, and can be regarded as a compressed suffix array.Our second compressed data structure achieves O(p+occ) query time using O(nHk(T)logε n) + o(n) bits of storage for any chosen ε, 0<ε<1. Therefore, it provides optimal output-sensitive query time using o(nlog n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.

References

  1. Alstrup, S., Brodal, G. S., and Rauhe, T. 2000. New data structures for orthogonal range searching. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 198--207.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Andersson, A. 1996. Sorting and searching revisited. In Proceedings of the 5th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science, vol. 1097. Springer-Verlag, New York, 185--197.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bentley, J., Sleator, D., Tarjan, R., and Wei, V. 1986. A locally adaptive compression scheme. Commun. ACM 29, 4, 320--330.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Brodnik, A., and Munro, I. 1999. Membership in constant time and almost-minimum space. SIAM J. Comput. 28, 5, 1627--1640.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.]]Google ScholarGoogle Scholar
  6. Chan, H., Hon, W., and Lam, T. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 445--456.]]Google ScholarGoogle Scholar
  7. Clark, D. R., and Munro, I. 1996. Efficient suffix trees on secondary storage. In Proceedings of the 7th ACM-SIAM Symposium on Discrete Algorithms. ACM, New York, 383--391.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Colussi, L., and De Col, A. 1996. A time and space efficient data structure for string searching on large texts. Info. Proc. Lett. 58, 5, 217--222.]]Google ScholarGoogle ScholarCross RefCross Ref
  9. Farach, M., and Thorup, M. 1998. String matching in Lempel--Ziv compressed strings. Algorithmica 20, 4, 388--404.]]Google ScholarGoogle ScholarCross RefCross Ref
  10. Fenwick, P. 1996. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer J. 39, 9, 731--740.]]Google ScholarGoogle ScholarCross RefCross Ref
  11. Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. J. ACM 52, 4 (July), 688--713.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Ferragina, P., and Grossi, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 390--398.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Ferragina, P., and Manzini, G. 2001. An experimental study of a compressed index. Information Sciences: Special issue on “Dictionary Based Compression” 135, 13--28.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 150--160.]]Google ScholarGoogle Scholar
  16. Gonnet, G. H., Baeza-Yates, R. A., and Snider, T. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, B. Frakes and R. A. Baeza-Yates Eds. Prentice-Hall, Englewood Cliffs, N.J., Chapter 5, 66--82.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Grabowski, S., Mäkinen, V., and Navarro, G. 2004. First Huffman, then Burrows-Wheeler: An alphabet-independent FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 210--211.]]Google ScholarGoogle Scholar
  18. Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 841--850.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Grossi, R., Gupta, A., and Vitter, J. 2004. When indexing equals compression: Experiments on compressing suffix arrays and applications. In Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 636--645.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Grossi, R., and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing. ACM, New York, 397--406.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Healy, J., Thomas, E. E., Schwartz, J. T., and Wigler, M. 2003. Annotating large genomes with exact word matches. Genome Research 13, 2306--2315.]]Google ScholarGoogle ScholarCross RefCross Ref
  22. Hon, W., Lam, T., Sadakane, K., Sung, W., and Yiu, S. 2004a. Compressed index for dynamic text. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., 102--111.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Hon, W., Lam, T., Sung, W., Tse, W., Wong, C., and Yiu, S. 2004b. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments. SIAM Press, Philadelphia, Pa., 31--38.]]Google ScholarGoogle Scholar
  24. Huynh, N., Hon, W., Lam, T., and Sung, W. 2004. Approximate string matching using compressed suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 434--444.]]Google ScholarGoogle Scholar
  25. Kärkkäinen, J., and Sutinen, E. 1998. Lempel-Ziv index for q-grams. Algorithmica 21, 137--154.]]Google ScholarGoogle ScholarCross RefCross Ref
  26. Kärkkäinen, J., and Ukkonen, E. 1996. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing, N. Ziviani, R. Baeza-Yates, and K. Guimarães, Eds. Carleton University Press, 141--155.]]Google ScholarGoogle Scholar
  27. Kosaraju, R., and Manzini, G. 1999. Compression of low entropy strings with Lempel--Ziv algorithms. SIAM J. Comput. 29, 3, 893--911.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software---Practice and Experience 29, 13, 1149--1171.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Mäkinen, V., and Navarro, G. 2004. Compressed compact suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 420--433.]]Google ScholarGoogle Scholar
  30. Mäkinen, V., Navarro, G., and Sadakane, K. 2004. Advantages of backward searching---efficient secondary memory and distributed implementation of compressed suffix arrays. In Proceedings of the 15th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, Springer-Verlag, New York.]]Google ScholarGoogle Scholar
  31. Manber, U., and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262-- 272.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Munro, J. I., Raman, V., and Rao, S. S. 2001. Space efficient suffix trees. J. Algor. 39, 2, 205--222.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. J. Discr. Algor. 2, 1, 87--114.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Pagh, R. 2001. Low redundancy in static dictionaries with constant query time. SIAM J. Comput. 31, 2, 353--363.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Raman, R., Raman, V., and Rao, S. S. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 233--242.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Rao, S. S. 2002. Time-space trade-offs for compressed suffix arrays. Inf. Proc. Lett. 82, 6, 307--311.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceeding of the 11th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, vol. 1969. Springer-Verlag, New York, 410--421.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Sadakane, K. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 225--232.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Sadakane, K. 2003. New text indexing functionalities of compressed suffix arrays. J. Algor. 48, 2, 294--313.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Sadakane, K., and Shibuya, T. 2001. Indexing huge genome sequences for solving various problems. Genome Informatics 12, 175--183.]]Google ScholarGoogle Scholar
  43. Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed., Morgan Kaufmann Publishers, Los Altos, Calif.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. Info. Theory 24, 530--536.]]Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Indexing compressed text

                      Recommendations

                      Comments

                      Login options

                      Check if you have access through your login credentials or your institution to get full access on this article.

                      Sign in

                      Full Access

                      • Published in

                        cover image Journal of the ACM
                        Journal of the ACM  Volume 52, Issue 4
                        July 2005
                        199 pages
                        ISSN:0004-5411
                        EISSN:1557-735X
                        DOI:10.1145/1082036
                        Issue’s Table of Contents

                        Copyright © 2005 ACM

                        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                        Publisher

                        Association for Computing Machinery

                        New York, NY, United States

                        Publication History

                        • Published: 1 July 2005
                        Published in jacm Volume 52, Issue 4

                        Permissions

                        Request permissions about this article.

                        Request Permissions

                        Check for updates

                        Qualifiers

                        • article

                      PDF Format

                      View or Download as a PDF file.

                      PDF

                      eReader

                      View online with eReader.

                      eReader