skip to main content
research-article

Implementing the LZ-index: Theory versus practice

Published:23 February 2009Publication History
Skip Abstract Section

Abstract

The LZ-index is a theoretical proposal of a lightweight data structure for text indexing, based on the Ziv-Lempel trie. If a text of u characters over an alphabet of size σ is compressible to n symbols using the LZ78 algorithm, then the LZ-index takes 4n log2 n (1+o(1)) bits of space (that is, 4 times the entropy of the text) and reports the R occurrences of a pattern of length m in worst case time O(m3 log σ + (m + R)log n). In this paper we face the challenge of obtaining a practical implementation of the LZ-index, which is not at all straightforward from the theoretical proposal. We end up with a prototype that takes the promised space and has average search time Om log u + √uR). This prototype is shown to be faster than other competing approaches when we take into account the time to report the positions or text contexts of the occurrences found. We show in detail the process of implementing the index, which involves interesting lessons of theory versus practice.

References

  1. Abouelhoda, M., Ohlebusch, E., and Kurtz, S. 2002. Optimal exact string matching based on suffix arrays. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 31--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Agarwal, P. and Erickson, J. 1999. Geometric range searching and its relatives. Contemporary Mathematics 23: Advances in Discrete and Computational Geometry, 1--56.Google ScholarGoogle Scholar
  3. Apostolico, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, 85--96.Google ScholarGoogle Scholar
  4. Arroyuelo, D. and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings of the 16th Annual International Symposium on Algorithms and Computation (ISAAC). LNCS 3827. 1143--1152. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Arroyuelo, D. and Navarro, G. 2007a. A lempel-ziv text index on secondary storage. In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 83--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Arroyuelo, D. and Navarro, G. 2007b. Smaller and faster lempel-ziv indices. In Proceedings of the 18th International Workshop on Combinatorial Algorithms (IWOCA). College Publications, UK, 11--20.Google ScholarGoogle Scholar
  7. Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4009. 319--330. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bell, T., Cleary, J., and Witten, I. 1990. Text compression. Prentice Hall. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., and Rao, S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chazelle, B. 1988. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing 17, 3, 427--462. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium Foundations of Computer Science (FOCS). 390--398. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Ferragina, P. and Manzini, G. 2001. An experimental study of an opportunistic index. In Proceedings of the 12th ACM Symposium on Discrete Algorithms (SODA). 269--278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Ferragina, P. and Manzini, G. 2002. On compressing and indexing data. Tech. Rep. TR-02-01, Dipartamento di Informatica, Univ. of Pisa. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Geary, R., Rahman, N., Raman, R., and Raman, V. 2004. A simple optimal representation for balanced parentheses. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS v. 3109. 159--172.Google ScholarGoogle Scholar
  15. González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of the 4th Workshop on Efficient and Experimental Algorithms (WEA). CTI Press and Ellinika Grammata, Greece, 27--38.Google ScholarGoogle Scholar
  16. Grossi, R. and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium Theory of Computing (STOC). 397--406. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Harman, D. 1995. Overview of the Third Text REtrieval Conference. In Proceedings of the 3rd Text REtrieval Conf. (TREC-3). 1--19. NIST Special Publication 500--507.Google ScholarGoogle Scholar
  18. Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium Foundations of Computer Science (FOCS). 549--554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Kärkkäinen, J. 1995. Suffix cactus: a cross between suffix tree and suffix array. In Proceedings of the 6th Annual Symposium Combinatorial Pattern Matching (CPM). LNCS 937. 191--204.Google ScholarGoogle ScholarCross RefCross Ref
  20. Kärkkäinen, J. 1999. Repetition-based text indexes. Ph.D. thesis, Dept. of Computer Science, University of Helsinki, Finland. Also available as Report A-1999-4, Series A.Google ScholarGoogle Scholar
  21. Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing (WSP). Carleton University Press, 141--155.Google ScholarGoogle Scholar
  22. Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Proceedings of the 2nd Ann. International Conf. on Computing and Combinatorics (COCOON). LNCS 1090. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Kosaraju, R. and Manzini, G. 1999. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing 29, 3, 893--911. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Kurtz, S. 1998. Reducing the space requirements of suffix trees. Report 98-03, Technische Kakultät, Universität Bielefeld.Google ScholarGoogle Scholar
  25. Mäkinen, V. 2003. Compact suffix array—a space-efficient full-text index. Fundamenta Informaticae 56, 1--2, 191--210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 935--948. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Munro, I. 1996. Tables. In Proceedings of the 16th Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS 1180. 37--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Munro, I. and Raman, V. 1997. Succint representation of balanced parentheses, static trees and planar graphs. In Proceedings of the 38th IEEE Symposium Foundations of Computer Science (FOCS). 118--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Munro, I., Raman, V., and Rao, S. 2001. Space efficient suffix trees. Journal of Algorithms, 205--222. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Navarro, G. 2002. Indexing text using the Ziv-Lempel trie. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 325--336. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Navarro, G. 2004. Indexing text using the ziv-lempel trie. Journal of Discrete Algorithms (JDA) 2, 1, 87--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Russo, L., Navarro, G., and Oliveira, A. 2007. Approximate string matching with Lempel-Ziv compressed indexes. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4726. 265--275. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Russo, L. and Oliveira, A. 2006. A compressed self-index using a ziv-lempel dictionary. In Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4209. 163--180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th International Symposium Algorithms and Computation (ISAAC). LNCS 1969. 410--421. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Sadakane, K. 2002. Succint representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM Symposium on Discrete Algorithms (SODA). 225--232. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Welch, T. 1984. A technique for high performance data compression. IEEE Computer Magazine 17, 6 (June), 8--19. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Witten, I., Moffat, A., and Bell, T. 1999. Managing Gigabytes, second ed. Morgan Kaufmann Publishers, New York.Google ScholarGoogle Scholar
  40. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. on Information Theory 24, 530--536.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Implementing the LZ-index: Theory versus practice

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in

                Full Access

                • Published in

                  cover image ACM Journal of Experimental Algorithmics
                  ACM Journal of Experimental Algorithmics  Volume 13, Issue
                  2009
                  482 pages
                  ISSN:1084-6654
                  EISSN:1084-6654
                  DOI:10.1145/1412228
                  Issue’s Table of Contents

                  Copyright © 2009 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 23 February 2009
                  • Accepted: 1 June 2003
                  • Received: 1 January 2003
                  Published in jea Volume 13, Issue

                  Qualifiers

                  • research-article
                  • Research
                  • Refereed

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader