skip to main content
research-article

Compressed suffix trees: Efficient computation and storage of LCP-values

Published:17 May 2013Publication History
Skip Abstract Section

Abstract

The suffix tree is a very important data structure in string processing, but typical implementations suffer from huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (compressed) components: the suffix array, the longest common prefix (LCP)-array and data structures for simulating navigational operations on the suffix tree. The LCP-array stores the lengths of the LCPs of lexicographically adjacent suffixes, and it can be computed in linear time. In this article, we present a new LCP-array construction algorithm that is fast and very space efficient. In practice, our algorithm outperforms alternative algorithms. Moreover, we introduce a new compressed representation of LCP-arrays.

References

  1. Mohamed I. Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--83. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. 2011. Computing the longest common prefix array based on the Burrows-Wheeler Transform. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 7024, Springer, 197--208. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. Directly addressable variable-length codes, string processing and information retrieval. In Proceedings of the 16th International Symposium (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 122--130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Rodrigo Cánovas and Gonzalo Navarro. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Lecture Notes in Computer Science, vol. 6049, Springer, 94--105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. David R. Clark. 1996. Compact pat trees. Ph.D. dissertation. University of Waterloo. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Jasbir Dhaliwal, Simon J. Puglisi, and Andrew Turpin. 2012. Practical efficient string mining. IEEE Trans. Knowl. Data Eng. 24, 4, 735--744. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data Structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 390--398. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Johannes Fischer. 2010a. Optimal Succinctness for range minimum queries. In Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN'00). Lecture Notes in Computer Science, vol. 6034, Springer, 158--169. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Johannes Fischer. 2010b. Wee LCP. Inform. Process. Lett. 110, 8--9, 317--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. 2009. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 410, 51, 5354--5364. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Simon Gog. 2011. Compressed suffix trees: Design, construction, and applications. Ph.D. dissertation. Ulm University.Google ScholarGoogle Scholar
  12. Simon Gog and Johannes Fischer. 2010. Advantages of shared data structures for sequences of balanced parentheses. In Proceedings of the Data Compression Conference (DCC'10). IEEE, 406--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Simon Gog and Enno Ohlebusch. 2011. Fast and lightweight LCP-array construction algorithms. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX'11). SIAM, 25--34.Google ScholarGoogle ScholarCross RefCross Ref
  14. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 841--850. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Wing-Kai Hon and Kunihiko Sadakane. 2002. Space-economical algorithms for finding maximal unique matches. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'02). Lecture Notes in Computer Science, vol. 2373, Springer, 144--152. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. David A. Huffman. 1952. A Method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng. 40, 9, 1098--1101.Google ScholarGoogle ScholarCross RefCross Ref
  18. Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science. IEEE, 549--554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. 2009. Permuted Longest-common-prefix array. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Lecture Notes in Computer Science, vol. 5577, Springer, 181--192.Google ScholarGoogle ScholarCross RefCross Ref
  20. Juha Kärkkäinen and Peter Sanders. 2003. Simple Linear work suffix array construction. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP'03). Lecture Notes in Computer Science, vol. 2719, Springer, 943--955. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM'01). Lecture Notes in Computer Science, vol. 2089, Springer, 181--192. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching (CPM'03). Lecture Notes in Computer Science, vol. 2676, Springer, 200--210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Markus Kowarschik and Christian Weiss. 2002. An overview of cache optimization techniques and cache-aware numerical algorithms. In Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, vol. 2625, Springer, 213--232.Google ScholarGoogle Scholar
  24. Stefan Kurtz. 1999. Reducing the space requirement of suffix trees. Softw., Pract. Exper. 29, 13 (1999), 1149--1171. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Veli Mäkinen. 2003. Compact suffix array—a space-efficient full-text index. Foundamenta Informaticae 56, 1--2, 191--210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Giovanni Manzini. 2004. Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory (SWAT'04). Lecture Notes in Computer Science, vol. 3111, Springer, 372--383.Google ScholarGoogle ScholarCross RefCross Ref
  27. Giovanni Manzini and Paolo Ferragina. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1, 33--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Gonzalo Navarro and Veli Mäkinen. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In Proceedings of the Data Compression Conference (DCC'09). IEEE, 193--202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Enno Ohlebusch, Johannes Fischer, and Simon Gog. 2010. CST++. In Proceedings of the 17th International Symposium on String Processing and Information Retrieval (SPIRE'10). Lecture Notes in Computer Science, vol. 6393, Springer, 322--333. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Enno Ohlebusch and Simon Gog. 2009. A Compressed enhanced suffix array supporting fast string matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 51--62. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Daisuke Okanohara and Kunihiko Sadakane. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 90--101. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Simon J. Puglisi, William F. Smyth, and Andrew Turpin. 2007. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Simon J. Puglisi and Andrew Turpin. 2008. Space-time tradeoffs for longest-common-prefix array computation. In Proceedings of the 19th International Symposium on Algorithms and Computation (ISAAC'09). Lecture Notes in Computer Science, vol. 5369, Springer, 124--135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. 2008. Fully-compressed suffix trees. In Proceedings of the 12th Latin American Symposium on Theoretical Informatics (LATIN'08). Lecture Notes in Computer Science, vol. 4957, Springer, 362--373. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Kunihiko Sadakane. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 2002 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'02). SIAM, 225--232. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Kunihiko Sadakane. 2007. Compressed suffix trees with full functionality. Theory Comput. Syst. 41, 4, 589--607. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Kunihiko Sadakane and Gonzalo Navarro. 2010. Fully-functional succinct trees. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'10). SIAM, 134--149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Jouni Sirén. 2010. Sampled longest common prefix array. In Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching (CPM'10). Lecture Notes in Computer Science, vol. 6129, Springer, 227--237. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Niko Välimäki, Veli Mäkinen, Wolfgang Gerlach, and Kashyap Dixit. 2009. Engineering a compressed suffix tree implementation. ACM J. Exp. Algor. 14, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Compressed suffix trees: Efficient computation and storage of LCP-values

                      Recommendations

                      Comments

                      Login options

                      Check if you have access through your login credentials or your institution to get full access on this article.

                      Sign in

                      Full Access

                      • Published in

                        cover image ACM Journal of Experimental Algorithmics
                        ACM Journal of Experimental Algorithmics  Volume 18, Issue
                        2013
                        329 pages
                        ISSN:1084-6654
                        EISSN:1084-6654
                        DOI:10.1145/2444016
                        Issue’s Table of Contents

                        Copyright © 2013 ACM

                        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                        Publisher

                        Association for Computing Machinery

                        New York, NY, United States

                        Publication History

                        • Published: 17 May 2013
                        • Accepted: 1 March 2013
                        • Revised: 1 February 2013
                        • Received: 1 March 2012
                        Published in jea Volume 18, Issue

                        Qualifiers

                        • research-article
                        • Research
                        • Refereed

                      PDF Format

                      View or Download as a PDF file.

                      PDF

                      eReader

                      View online with eReader.

                      eReader