Abstract
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log1+ε n) time for any chosen ε, 0<ε<1. This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows--Wheeler Transform, and can be regarded as a compressed suffix array.Our second compressed data structure achieves O(p+occ) query time using O(nHk(T)logε n) + o(n) bits of storage for any chosen ε, 0<ε<1. Therefore, it provides optimal output-sensitive query time using o(nlog n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.
- Alstrup, S., Brodal, G. S., and Rauhe, T. 2000. New data structures for orthogonal range searching. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 198--207.]] Google ScholarDigital Library
- Andersson, A. 1996. Sorting and searching revisited. In Proceedings of the 5th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science, vol. 1097. Springer-Verlag, New York, 185--197.]] Google ScholarDigital Library
- Bentley, J., Sleator, D., Tarjan, R., and Wei, V. 1986. A locally adaptive compression scheme. Commun. ACM 29, 4, 320--330.]] Google ScholarDigital Library
- Brodnik, A., and Munro, I. 1999. Membership in constant time and almost-minimum space. SIAM J. Comput. 28, 5, 1627--1640.]] Google ScholarDigital Library
- Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.]]Google Scholar
- Chan, H., Hon, W., and Lam, T. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 445--456.]]Google Scholar
- Clark, D. R., and Munro, I. 1996. Efficient suffix trees on secondary storage. In Proceedings of the 7th ACM-SIAM Symposium on Discrete Algorithms. ACM, New York, 383--391.]] Google ScholarDigital Library
- Colussi, L., and De Col, A. 1996. A time and space efficient data structure for string searching on large texts. Info. Proc. Lett. 58, 5, 217--222.]]Google ScholarCross Ref
- Farach, M., and Thorup, M. 1998. String matching in Lempel--Ziv compressed strings. Algorithmica 20, 4, 388--404.]]Google ScholarCross Ref
- Fenwick, P. 1996. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer J. 39, 9, 731--740.]]Google ScholarCross Ref
- Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. J. ACM 52, 4 (July), 688--713.]] Google ScholarDigital Library
- Ferragina, P., and Grossi, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280.]] Google ScholarDigital Library
- Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 390--398.]] Google ScholarDigital Library
- Ferragina, P., and Manzini, G. 2001. An experimental study of a compressed index. Information Sciences: Special issue on “Dictionary Based Compression” 135, 13--28.]] Google ScholarDigital Library
- Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 150--160.]]Google Scholar
- Gonnet, G. H., Baeza-Yates, R. A., and Snider, T. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, B. Frakes and R. A. Baeza-Yates Eds. Prentice-Hall, Englewood Cliffs, N.J., Chapter 5, 66--82.]] Google ScholarDigital Library
- Grabowski, S., Mäkinen, V., and Navarro, G. 2004. First Huffman, then Burrows-Wheeler: An alphabet-independent FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 210--211.]]Google Scholar
- Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 841--850.]] Google ScholarDigital Library
- Grossi, R., Gupta, A., and Vitter, J. 2004. When indexing equals compression: Experiments on compressing suffix arrays and applications. In Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 636--645.]] Google ScholarDigital Library
- Grossi, R., and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing. ACM, New York, 397--406.]] Google ScholarDigital Library
- Healy, J., Thomas, E. E., Schwartz, J. T., and Wigler, M. 2003. Annotating large genomes with exact word matches. Genome Research 13, 2306--2315.]]Google ScholarCross Ref
- Hon, W., Lam, T., Sadakane, K., Sung, W., and Yiu, S. 2004a. Compressed index for dynamic text. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., 102--111.]] Google ScholarDigital Library
- Hon, W., Lam, T., Sung, W., Tse, W., Wong, C., and Yiu, S. 2004b. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments. SIAM Press, Philadelphia, Pa., 31--38.]]Google Scholar
- Huynh, N., Hon, W., Lam, T., and Sung, W. 2004. Approximate string matching using compressed suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 434--444.]]Google Scholar
- Kärkkäinen, J., and Sutinen, E. 1998. Lempel-Ziv index for q-grams. Algorithmica 21, 137--154.]]Google ScholarCross Ref
- Kärkkäinen, J., and Ukkonen, E. 1996. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing, N. Ziviani, R. Baeza-Yates, and K. Guimarães, Eds. Carleton University Press, 141--155.]]Google Scholar
- Kosaraju, R., and Manzini, G. 1999. Compression of low entropy strings with Lempel--Ziv algorithms. SIAM J. Comput. 29, 3, 893--911.]] Google ScholarDigital Library
- Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software---Practice and Experience 29, 13, 1149--1171.]] Google ScholarDigital Library
- Mäkinen, V., and Navarro, G. 2004. Compressed compact suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 420--433.]]Google Scholar
- Mäkinen, V., Navarro, G., and Sadakane, K. 2004. Advantages of backward searching---efficient secondary memory and distributed implementation of compressed suffix arrays. In Proceedings of the 15th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, Springer-Verlag, New York.]]Google Scholar
- Manber, U., and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.]] Google ScholarDigital Library
- Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430.]] Google ScholarDigital Library
- McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262-- 272.]] Google ScholarDigital Library
- Munro, J. I., Raman, V., and Rao, S. S. 2001. Space efficient suffix trees. J. Algor. 39, 2, 205--222.]] Google ScholarDigital Library
- Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. J. Discr. Algor. 2, 1, 87--114.]] Google ScholarDigital Library
- Pagh, R. 2001. Low redundancy in static dictionaries with constant query time. SIAM J. Comput. 31, 2, 353--363.]] Google ScholarDigital Library
- Raman, R., Raman, V., and Rao, S. S. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 233--242.]] Google ScholarDigital Library
- Rao, S. S. 2002. Time-space trade-offs for compressed suffix arrays. Inf. Proc. Lett. 82, 6, 307--311.]] Google ScholarDigital Library
- Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceeding of the 11th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, vol. 1969. Springer-Verlag, New York, 410--421.]] Google ScholarDigital Library
- Sadakane, K. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 225--232.]] Google ScholarDigital Library
- Sadakane, K. 2003. New text indexing functionalities of compressed suffix arrays. J. Algor. 48, 2, 294--313.]] Google ScholarDigital Library
- Sadakane, K., and Shibuya, T. 2001. Indexing huge genome sequences for solving various problems. Genome Informatics 12, 175--183.]]Google Scholar
- Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed., Morgan Kaufmann Publishers, Los Altos, Calif.]] Google ScholarDigital Library
- Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. Info. Theory 24, 530--536.]]Google ScholarCross Ref
Index Terms
- Indexing compressed text
Recommendations
Compressed representations of sequences and full-text indexes
Given a sequence S = s1s2…sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order ...
When indexing equals compression: Experiments with compressing suffix arrays and applications
We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice. Our experiments indicate that the resulting text index offers ...
Boosting textual compression in optimal linear time
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (...
Comments