research-article

Compressed suffix trees: Efficient computation and storage of LCP-values

Authors:
Simon Gog

The University of Melbourne, Parkville, Australia

The University of Melbourne, Parkville, Australia
View Profile

,
Enno Ohlebusch

Ulm University, Ulm, Germany

Ulm University, Ulm, Germany
View Profile

ACM Journal of Experimental Algorithmics Volume 18Article No.: 2.1pp 2.1–2.31https://doi.org/10.1145/2444016.2461327

Published:17 May 2013Publication History

ACM Journal of Experimental Algorithmics

Abstract

The suffix tree is a very important data structure in string processing, but typical implementations suffer from huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (compressed) components: the suffix array, the longest common prefix (LCP)-array and data structures for simulating navigational operations on the suffix tree. The LCP-array stores the lengths of the LCPs of lexicographically adjacent suffixes, and it can be computed in linear time. In this article, we present a new LCP-array construction algorithm that is fast and very space efficient. In practice, our algorithm outperforms alternative algorithms. Moreover, we introduce a new compressed representation of LCP-arrays.

References

Mohamed I. Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--83. Google ScholarDigital Library
Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. 2011. Computing the longest common prefix array based on the Burrows-Wheeler Transform. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 7024, Springer, 197--208. Google ScholarDigital Library
Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. Directly addressable variable-length codes, string processing and information retrieval. In Proceedings of the 16th International Symposium (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 122--130. Google ScholarDigital Library
Rodrigo Cánovas and Gonzalo Navarro. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Lecture Notes in Computer Science, vol. 6049, Springer, 94--105. Google ScholarDigital Library
David R. Clark. 1996. Compact pat trees. Ph.D. dissertation. University of Waterloo. Google ScholarDigital Library
Jasbir Dhaliwal, Simon J. Puglisi, and Andrew Turpin. 2012. Practical efficient string mining. IEEE Trans. Knowl. Data Eng. 24, 4, 735--744. Google ScholarDigital Library
Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data Structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 390--398. Google ScholarDigital Library
Johannes Fischer. 2010a. Optimal Succinctness for range minimum queries. In Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN'00). Lecture Notes in Computer Science, vol. 6034, Springer, 158--169. Google ScholarDigital Library
Johannes Fischer. 2010b. Wee LCP. Inform. Process. Lett. 110, 8--9, 317--320. Google ScholarDigital Library
Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. 2009. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 410, 51, 5354--5364. Google ScholarDigital Library
Simon Gog. 2011. Compressed suffix trees: Design, construction, and applications. Ph.D. dissertation. Ulm University.Google Scholar
Simon Gog and Johannes Fischer. 2010. Advantages of shared data structures for sequences of balanced parentheses. In Proceedings of the Data Compression Conference (DCC'10). IEEE, 406--415. Google ScholarDigital Library
Simon Gog and Enno Ohlebusch. 2011. Fast and lightweight LCP-array construction algorithms. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX'11). SIAM, 25--34.Google ScholarCross Ref
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 841--850. Google ScholarDigital Library
Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press. Google ScholarDigital Library
Wing-Kai Hon and Kunihiko Sadakane. 2002. Space-economical algorithms for finding maximal unique matches. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'02). Lecture Notes in Computer Science, vol. 2373, Springer, 144--152. Google ScholarDigital Library
David A. Huffman. 1952. A Method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng. 40, 9, 1098--1101.Google ScholarCross Ref
Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science. IEEE, 549--554. Google ScholarDigital Library
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. 2009. Permuted Longest-common-prefix array. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Lecture Notes in Computer Science, vol. 5577, Springer, 181--192.Google ScholarCross Ref
Juha Kärkkäinen and Peter Sanders. 2003. Simple Linear work suffix array construction. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP'03). Lecture Notes in Computer Science, vol. 2719, Springer, 943--955. Google ScholarDigital Library
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM'01). Lecture Notes in Computer Science, vol. 2089, Springer, 181--192. Google ScholarDigital Library
Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching (CPM'03). Lecture Notes in Computer Science, vol. 2676, Springer, 200--210. Google ScholarDigital Library
Markus Kowarschik and Christian Weiss. 2002. An overview of cache optimization techniques and cache-aware numerical algorithms. In Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, vol. 2625, Springer, 213--232.Google Scholar
Stefan Kurtz. 1999. Reducing the space requirement of suffix trees. Softw., Pract. Exper. 29, 13 (1999), 1149--1171. Google ScholarDigital Library
Veli Mäkinen. 2003. Compact suffix array—a space-efficient full-text index. Foundamenta Informaticae 56, 1--2, 191--210. Google ScholarDigital Library
Giovanni Manzini. 2004. Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory (SWAT'04). Lecture Notes in Computer Science, vol. 3111, Springer, 372--383.Google ScholarCross Ref
Giovanni Manzini and Paolo Ferragina. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1, 33--50. Google ScholarDigital Library
Gonzalo Navarro and Veli Mäkinen. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1. Google ScholarDigital Library
Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In Proceedings of the Data Compression Conference (DCC'09). IEEE, 193--202. Google ScholarDigital Library
Enno Ohlebusch, Johannes Fischer, and Simon Gog. 2010. CST++. In Proceedings of the 17th International Symposium on String Processing and Information Retrieval (SPIRE'10). Lecture Notes in Computer Science, vol. 6393, Springer, 322--333. Google ScholarDigital Library
Enno Ohlebusch and Simon Gog. 2009. A Compressed enhanced suffix array supporting fast string matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 51--62. Google ScholarDigital Library
Daisuke Okanohara and Kunihiko Sadakane. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 90--101. Google ScholarDigital Library
Simon J. Puglisi, William F. Smyth, and Andrew Turpin. 2007. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39, 2. Google ScholarDigital Library
Simon J. Puglisi and Andrew Turpin. 2008. Space-time tradeoffs for longest-common-prefix array computation. In Proceedings of the 19th International Symposium on Algorithms and Computation (ISAAC'09). Lecture Notes in Computer Science, vol. 5369, Springer, 124--135. Google ScholarDigital Library
Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. 2008. Fully-compressed suffix trees. In Proceedings of the 12th Latin American Symposium on Theoretical Informatics (LATIN'08). Lecture Notes in Computer Science, vol. 4957, Springer, 362--373. Google ScholarDigital Library
Kunihiko Sadakane. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 2002 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'02). SIAM, 225--232. Google ScholarDigital Library
Kunihiko Sadakane. 2007. Compressed suffix trees with full functionality. Theory Comput. Syst. 41, 4, 589--607. Google ScholarDigital Library
Kunihiko Sadakane and Gonzalo Navarro. 2010. Fully-functional succinct trees. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'10). SIAM, 134--149. Google ScholarDigital Library
Jouni Sirén. 2010. Sampled longest common prefix array. In Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching (CPM'10). Lecture Notes in Computer Science, vol. 6129, Springer, 227--237. Google ScholarDigital Library
Niko Välimäki, Veli Mäkinen, Wolfgang Gerlach, and Kashyap Dixit. 2009. Engineering a compressed suffix tree implementation. ACM J. Exp. Algor. 14, 2. Google ScholarDigital Library

Index Terms

Recommendations

Fully compressed suffix trees

Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n) bits of space, for a string of ...
Read More
Engineering a compressed suffix tree implementation

Suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes ...
Read More
Reverse engineering of compact suffix trees and links

Invented in the 1970s, the Suffix Tree (ST) is a data structure that indexes all substrings of a text in linear space. Although more space demanding than other indexes, the ST remains likely an inspiring index because it represents substrings in a ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Journal of Experimental Algorithmics Volume 18, Issue
2013
329 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/2444016
Issue’s Table of Contents

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 May 2013
- Accepted: 1 March 2013
- Revised: 1 February 2013
- Received: 1 March 2012
Published in jea Volume 18, Issue
Author Tags
Algorithm engineering
Burrows-Wheeler transform
LCP-array
balanced parentheses
text compression
text indexing
wavelet tree
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 15
  Total Citations
  View Citations
- 383
  Total Downloads
- Downloads (Last 12 months)12
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Compressed suffix trees: Efficient computation and storage of LCP-values

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Fully compressed suffix trees

Engineering a compressed suffix tree implementation

Reverse engineering of compact suffix trees and links

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Compressed suffix trees: Efficient computation and storage of LCP-values

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Fully compressed suffix trees

Engineering a compressed suffix tree implementation

Reverse engineering of compact suffix trees and links

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media