research-article

Implementing the LZ-index: Theory versus practice

Author:
Gonzalo Navarro

University of Chile, Santiago, Chile

University of Chile, Santiago, Chile
View Profile

ACM Journal of Experimental Algorithmics Volume 13Article No.: 2pp 1.2–1.49https://doi.org/10.1145/1412228.1412230

Published:23 February 2009Publication History

ACM Journal of Experimental Algorithmics

Abstract

The LZ-index is a theoretical proposal of a lightweight data structure for text indexing, based on the Ziv-Lempel trie. If a text of u characters over an alphabet of size σ is compressible to n symbols using the LZ78 algorithm, then the LZ-index takes 4n log₂ n (1+o(1)) bits of space (that is, 4 times the entropy of the text) and reports the R occurrences of a pattern of length m in worst case time O(m³ log σ + (m + R)log n). In this paper we face the challenge of obtaining a practical implementation of the LZ-index, which is not at all straightforward from the theoretical proposal. We end up with a prototype that takes the promised space and has average search time O(σ m log u + √uR). This prototype is shown to be faster than other competing approaches when we take into account the time to report the positions or text contexts of the occurrences found. We show in detail the process of implementing the index, which involves interesting lessons of theory versus practice.

References

Abouelhoda, M., Ohlebusch, E., and Kurtz, S. 2002. Optimal exact string matching based on suffix arrays. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 31--43. Google ScholarDigital Library
Agarwal, P. and Erickson, J. 1999. Geometric range searching and its relatives. Contemporary Mathematics 23: Advances in Discrete and Computational Geometry, 1--56.Google Scholar
Apostolico, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, 85--96.Google Scholar
Arroyuelo, D. and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings of the 16th Annual International Symposium on Algorithms and Computation (ISAAC). LNCS 3827. 1143--1152. Google ScholarDigital Library
Arroyuelo, D. and Navarro, G. 2007a. A lempel-ziv text index on secondary storage. In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 83--94. Google ScholarDigital Library
Arroyuelo, D. and Navarro, G. 2007b. Smaller and faster lempel-ziv indices. In Proceedings of the 18th International Workshop on Combinatorial Algorithms (IWOCA). College Publications, UK, 11--20.Google Scholar
Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4009. 319--330. Google ScholarDigital Library
Bell, T., Cleary, J., and Witten, I. 1990. Text compression. Prentice Hall. Google ScholarDigital Library
Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., and Rao, S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.Google ScholarDigital Library
Chazelle, B. 1988. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing 17, 3, 427--462. Google ScholarDigital Library
Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium Foundations of Computer Science (FOCS). 390--398. Google ScholarDigital Library
Ferragina, P. and Manzini, G. 2001. An experimental study of an opportunistic index. In Proceedings of the 12th ACM Symposium on Discrete Algorithms (SODA). 269--278. Google ScholarDigital Library
Ferragina, P. and Manzini, G. 2002. On compressing and indexing data. Tech. Rep. TR-02-01, Dipartamento di Informatica, Univ. of Pisa. Google ScholarDigital Library
Geary, R., Rahman, N., Raman, R., and Raman, V. 2004. A simple optimal representation for balanced parentheses. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS v. 3109. 159--172.Google Scholar
González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of the 4th Workshop on Efficient and Experimental Algorithms (WEA). CTI Press and Ellinika Grammata, Greece, 27--38.Google Scholar
Grossi, R. and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium Theory of Computing (STOC). 397--406. Google ScholarDigital Library
Harman, D. 1995. Overview of the Third Text REtrieval Conference. In Proceedings of the 3rd Text REtrieval Conf. (TREC-3). 1--19. NIST Special Publication 500--507.Google Scholar
Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium Foundations of Computer Science (FOCS). 549--554. Google ScholarDigital Library
Kärkkäinen, J. 1995. Suffix cactus: a cross between suffix tree and suffix array. In Proceedings of the 6th Annual Symposium Combinatorial Pattern Matching (CPM). LNCS 937. 191--204.Google ScholarCross Ref
Kärkkäinen, J. 1999. Repetition-based text indexes. Ph.D. thesis, Dept. of Computer Science, University of Helsinki, Finland. Also available as Report A-1999-4, Series A.Google Scholar
Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing (WSP). Carleton University Press, 141--155.Google Scholar
Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Proceedings of the 2nd Ann. International Conf. on Computing and Combinatorics (COCOON). LNCS 1090. Google ScholarDigital Library
Kosaraju, R. and Manzini, G. 1999. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing 29, 3, 893--911. Google ScholarDigital Library
Kurtz, S. 1998. Reducing the space requirements of suffix trees. Report 98-03, Technische Kakultät, Universität Bielefeld.Google Scholar
Mäkinen, V. 2003. Compact suffix array—a space-efficient full-text index. Fundamenta Informaticae 56, 1--2, 191--210. Google ScholarDigital Library
Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 935--948. Google ScholarDigital Library
Munro, I. 1996. Tables. In Proceedings of the 16th Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS 1180. 37--42. Google ScholarDigital Library
Munro, I. and Raman, V. 1997. Succint representation of balanced parentheses, static trees and planar graphs. In Proceedings of the 38th IEEE Symposium Foundations of Computer Science (FOCS). 118--126. Google ScholarDigital Library
Munro, I., Raman, V., and Rao, S. 2001. Space efficient suffix trees. Journal of Algorithms, 205--222. Google ScholarDigital Library
Navarro, G. 2002. Indexing text using the Ziv-Lempel trie. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 325--336. Google ScholarDigital Library
Navarro, G. 2004. Indexing text using the ziv-lempel trie. Journal of Discrete Algorithms (JDA) 2, 1, 87--114. Google ScholarDigital Library
Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2. Google ScholarDigital Library
Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarDigital Library
Russo, L., Navarro, G., and Oliveira, A. 2007. Approximate string matching with Lempel-Ziv compressed indexes. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4726. 265--275. Google ScholarDigital Library
Russo, L. and Oliveira, A. 2006. A compressed self-index using a ziv-lempel dictionary. In Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4209. 163--180. Google ScholarDigital Library
Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th International Symposium Algorithms and Computation (ISAAC). LNCS 1969. 410--421. Google ScholarDigital Library
Sadakane, K. 2002. Succint representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM Symposium on Discrete Algorithms (SODA). 225--232. Google ScholarDigital Library
Welch, T. 1984. A technique for high performance data compression. IEEE Computer Magazine 17, 6 (June), 8--19. Google ScholarDigital Library
Witten, I., Moffat, A., and Bell, T. 1999. Managing Gigabytes, second ed. Morgan Kaufmann Publishers, New York.Google Scholar
Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. on Information Theory 24, 530--536.Google ScholarCross Ref

Index Terms

Implementing the LZ-index: Theory versus practice
1. Information systems
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Pattern matching
      2. Sorting and searching
  2. Randomness, geometry and discrete structures

Recommendations

Compressed text indexes: From theory to practice

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several ...
Read More
Improved Word-Aligned Binary Compression for Text Indexing

We present an improved compression mechanism for handling the compressed inverted indexes used in text retrieval systems, extending the word-aligned binary coding carry method. Experiments using two typical document collections show that the new method ...
Read More
Dynamic index and LZ factorization in compressed space
Abstract
In this paper, we propose a new dynamic compressed index of O ( w ) space for a dynamic text T, where w = O ( min ( z log N log ∗ M , N ) ) is the size of the signature encoding of T, z is the size of the Lempel–Ziv77 (LZ77) ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Journal of Experimental Algorithmics Volume 13, Issue
2009
482 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/1412228
Issue’s Table of Contents

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 February 2009
- Accepted: 1 June 2003
- Received: 1 January 2003
Published in jea Volume 13, Issue
Author Tags
Data structures
coding and information theory
data storage representations
indexing methods
textual databases
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 12
  Total Citations
  View Citations
- 621
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Implementing the LZ-index: Theory versus practice

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Compressed text indexes: From theory to practice

Improved Word-Aligned Binary Compression for Text Indexing

Dynamic index and LZ factorization in compressed space

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Implementing the LZ-index: Theory versus practice

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Compressed text indexes: From theory to practice

Improved Word-Aligned Binary Compression for Text Indexing

Dynamic index and LZ factorization in compressed space

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media