Abstract
The LZ-index is a theoretical proposal of a lightweight data structure for text indexing, based on the Ziv-Lempel trie. If a text of u characters over an alphabet of size σ is compressible to n symbols using the LZ78 algorithm, then the LZ-index takes 4n log2 n (1+o(1)) bits of space (that is, 4 times the entropy of the text) and reports the R occurrences of a pattern of length m in worst case time O(m3 log σ + (m + R)log n). In this paper we face the challenge of obtaining a practical implementation of the LZ-index, which is not at all straightforward from the theoretical proposal. We end up with a prototype that takes the promised space and has average search time O(σ m log u + √uR). This prototype is shown to be faster than other competing approaches when we take into account the time to report the positions or text contexts of the occurrences found. We show in detail the process of implementing the index, which involves interesting lessons of theory versus practice.
- Abouelhoda, M., Ohlebusch, E., and Kurtz, S. 2002. Optimal exact string matching based on suffix arrays. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 31--43. Google ScholarDigital Library
- Agarwal, P. and Erickson, J. 1999. Geometric range searching and its relatives. Contemporary Mathematics 23: Advances in Discrete and Computational Geometry, 1--56.Google Scholar
- Apostolico, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, 85--96.Google Scholar
- Arroyuelo, D. and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings of the 16th Annual International Symposium on Algorithms and Computation (ISAAC). LNCS 3827. 1143--1152. Google ScholarDigital Library
- Arroyuelo, D. and Navarro, G. 2007a. A lempel-ziv text index on secondary storage. In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 83--94. Google ScholarDigital Library
- Arroyuelo, D. and Navarro, G. 2007b. Smaller and faster lempel-ziv indices. In Proceedings of the 18th International Workshop on Combinatorial Algorithms (IWOCA). College Publications, UK, 11--20.Google Scholar
- Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4009. 319--330. Google ScholarDigital Library
- Bell, T., Cleary, J., and Witten, I. 1990. Text compression. Prentice Hall. Google ScholarDigital Library
- Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., and Rao, S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.Google ScholarDigital Library
- Chazelle, B. 1988. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing 17, 3, 427--462. Google ScholarDigital Library
- Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium Foundations of Computer Science (FOCS). 390--398. Google ScholarDigital Library
- Ferragina, P. and Manzini, G. 2001. An experimental study of an opportunistic index. In Proceedings of the 12th ACM Symposium on Discrete Algorithms (SODA). 269--278. Google ScholarDigital Library
- Ferragina, P. and Manzini, G. 2002. On compressing and indexing data. Tech. Rep. TR-02-01, Dipartamento di Informatica, Univ. of Pisa. Google ScholarDigital Library
- Geary, R., Rahman, N., Raman, R., and Raman, V. 2004. A simple optimal representation for balanced parentheses. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS v. 3109. 159--172.Google Scholar
- González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of the 4th Workshop on Efficient and Experimental Algorithms (WEA). CTI Press and Ellinika Grammata, Greece, 27--38.Google Scholar
- Grossi, R. and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium Theory of Computing (STOC). 397--406. Google ScholarDigital Library
- Harman, D. 1995. Overview of the Third Text REtrieval Conference. In Proceedings of the 3rd Text REtrieval Conf. (TREC-3). 1--19. NIST Special Publication 500--507.Google Scholar
- Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium Foundations of Computer Science (FOCS). 549--554. Google ScholarDigital Library
- Kärkkäinen, J. 1995. Suffix cactus: a cross between suffix tree and suffix array. In Proceedings of the 6th Annual Symposium Combinatorial Pattern Matching (CPM). LNCS 937. 191--204.Google ScholarCross Ref
- Kärkkäinen, J. 1999. Repetition-based text indexes. Ph.D. thesis, Dept. of Computer Science, University of Helsinki, Finland. Also available as Report A-1999-4, Series A.Google Scholar
- Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing (WSP). Carleton University Press, 141--155.Google Scholar
- Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Proceedings of the 2nd Ann. International Conf. on Computing and Combinatorics (COCOON). LNCS 1090. Google ScholarDigital Library
- Kosaraju, R. and Manzini, G. 1999. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing 29, 3, 893--911. Google ScholarDigital Library
- Kurtz, S. 1998. Reducing the space requirements of suffix trees. Report 98-03, Technische Kakultät, Universität Bielefeld.Google Scholar
- Mäkinen, V. 2003. Compact suffix array—a space-efficient full-text index. Fundamenta Informaticae 56, 1--2, 191--210. Google ScholarDigital Library
- Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 935--948. Google ScholarDigital Library
- Munro, I. 1996. Tables. In Proceedings of the 16th Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS 1180. 37--42. Google ScholarDigital Library
- Munro, I. and Raman, V. 1997. Succint representation of balanced parentheses, static trees and planar graphs. In Proceedings of the 38th IEEE Symposium Foundations of Computer Science (FOCS). 118--126. Google ScholarDigital Library
- Munro, I., Raman, V., and Rao, S. 2001. Space efficient suffix trees. Journal of Algorithms, 205--222. Google ScholarDigital Library
- Navarro, G. 2002. Indexing text using the Ziv-Lempel trie. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 325--336. Google ScholarDigital Library
- Navarro, G. 2004. Indexing text using the ziv-lempel trie. Journal of Discrete Algorithms (JDA) 2, 1, 87--114. Google ScholarDigital Library
- Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2. Google ScholarDigital Library
- Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarDigital Library
- Russo, L., Navarro, G., and Oliveira, A. 2007. Approximate string matching with Lempel-Ziv compressed indexes. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4726. 265--275. Google ScholarDigital Library
- Russo, L. and Oliveira, A. 2006. A compressed self-index using a ziv-lempel dictionary. In Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4209. 163--180. Google ScholarDigital Library
- Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th International Symposium Algorithms and Computation (ISAAC). LNCS 1969. 410--421. Google ScholarDigital Library
- Sadakane, K. 2002. Succint representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM Symposium on Discrete Algorithms (SODA). 225--232. Google ScholarDigital Library
- Welch, T. 1984. A technique for high performance data compression. IEEE Computer Magazine 17, 6 (June), 8--19. Google ScholarDigital Library
- Witten, I., Moffat, A., and Bell, T. 1999. Managing Gigabytes, second ed. Morgan Kaufmann Publishers, New York.Google Scholar
- Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. on Information Theory 24, 530--536.Google ScholarCross Ref
Index Terms
- Implementing the LZ-index: Theory versus practice
Recommendations
Compressed text indexes: From theory to practice
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several ...
Improved Word-Aligned Binary Compression for Text Indexing
We present an improved compression mechanism for handling the compressed inverted indexes used in text retrieval systems, extending the word-aligned binary coding carry method. Experiments using two typical document collections show that the new method ...
Dynamic index and LZ factorization in compressed space
AbstractIn this paper, we propose a new dynamic compressed index of O ( w ) space for a dynamic text T, where w = O ( min ( z log N log ∗ M , N ) ) is the size of the signature encoding of T, z is the size of the Lempel–Ziv77 (LZ77) ...
Comments