Abstract
The suffix tree is a very important data structure in string processing, but typical implementations suffer from huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (compressed) components: the suffix array, the longest common prefix (LCP)-array and data structures for simulating navigational operations on the suffix tree. The LCP-array stores the lengths of the LCPs of lexicographically adjacent suffixes, and it can be computed in linear time. In this article, we present a new LCP-array construction algorithm that is fast and very space efficient. In practice, our algorithm outperforms alternative algorithms. Moreover, we introduce a new compressed representation of LCP-arrays.
- Mohamed I. Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--83. Google ScholarDigital Library
- Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. 2011. Computing the longest common prefix array based on the Burrows-Wheeler Transform. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 7024, Springer, 197--208. Google ScholarDigital Library
- Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. Directly addressable variable-length codes, string processing and information retrieval. In Proceedings of the 16th International Symposium (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 122--130. Google ScholarDigital Library
- Rodrigo Cánovas and Gonzalo Navarro. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Lecture Notes in Computer Science, vol. 6049, Springer, 94--105. Google ScholarDigital Library
- David R. Clark. 1996. Compact pat trees. Ph.D. dissertation. University of Waterloo. Google ScholarDigital Library
- Jasbir Dhaliwal, Simon J. Puglisi, and Andrew Turpin. 2012. Practical efficient string mining. IEEE Trans. Knowl. Data Eng. 24, 4, 735--744. Google ScholarDigital Library
- Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data Structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 390--398. Google ScholarDigital Library
- Johannes Fischer. 2010a. Optimal Succinctness for range minimum queries. In Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN'00). Lecture Notes in Computer Science, vol. 6034, Springer, 158--169. Google ScholarDigital Library
- Johannes Fischer. 2010b. Wee LCP. Inform. Process. Lett. 110, 8--9, 317--320. Google ScholarDigital Library
- Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. 2009. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 410, 51, 5354--5364. Google ScholarDigital Library
- Simon Gog. 2011. Compressed suffix trees: Design, construction, and applications. Ph.D. dissertation. Ulm University.Google Scholar
- Simon Gog and Johannes Fischer. 2010. Advantages of shared data structures for sequences of balanced parentheses. In Proceedings of the Data Compression Conference (DCC'10). IEEE, 406--415. Google ScholarDigital Library
- Simon Gog and Enno Ohlebusch. 2011. Fast and lightweight LCP-array construction algorithms. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX'11). SIAM, 25--34.Google ScholarCross Ref
- Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 841--850. Google ScholarDigital Library
- Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press. Google ScholarDigital Library
- Wing-Kai Hon and Kunihiko Sadakane. 2002. Space-economical algorithms for finding maximal unique matches. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'02). Lecture Notes in Computer Science, vol. 2373, Springer, 144--152. Google ScholarDigital Library
- David A. Huffman. 1952. A Method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng. 40, 9, 1098--1101.Google ScholarCross Ref
- Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science. IEEE, 549--554. Google ScholarDigital Library
- Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. 2009. Permuted Longest-common-prefix array. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Lecture Notes in Computer Science, vol. 5577, Springer, 181--192.Google ScholarCross Ref
- Juha Kärkkäinen and Peter Sanders. 2003. Simple Linear work suffix array construction. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP'03). Lecture Notes in Computer Science, vol. 2719, Springer, 943--955. Google ScholarDigital Library
- Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM'01). Lecture Notes in Computer Science, vol. 2089, Springer, 181--192. Google ScholarDigital Library
- Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching (CPM'03). Lecture Notes in Computer Science, vol. 2676, Springer, 200--210. Google ScholarDigital Library
- Markus Kowarschik and Christian Weiss. 2002. An overview of cache optimization techniques and cache-aware numerical algorithms. In Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, vol. 2625, Springer, 213--232.Google Scholar
- Stefan Kurtz. 1999. Reducing the space requirement of suffix trees. Softw., Pract. Exper. 29, 13 (1999), 1149--1171. Google ScholarDigital Library
- Veli Mäkinen. 2003. Compact suffix array—a space-efficient full-text index. Foundamenta Informaticae 56, 1--2, 191--210. Google ScholarDigital Library
- Giovanni Manzini. 2004. Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory (SWAT'04). Lecture Notes in Computer Science, vol. 3111, Springer, 372--383.Google ScholarCross Ref
- Giovanni Manzini and Paolo Ferragina. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1, 33--50. Google ScholarDigital Library
- Gonzalo Navarro and Veli Mäkinen. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1. Google ScholarDigital Library
- Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In Proceedings of the Data Compression Conference (DCC'09). IEEE, 193--202. Google ScholarDigital Library
- Enno Ohlebusch, Johannes Fischer, and Simon Gog. 2010. CST++. In Proceedings of the 17th International Symposium on String Processing and Information Retrieval (SPIRE'10). Lecture Notes in Computer Science, vol. 6393, Springer, 322--333. Google ScholarDigital Library
- Enno Ohlebusch and Simon Gog. 2009. A Compressed enhanced suffix array supporting fast string matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 51--62. Google ScholarDigital Library
- Daisuke Okanohara and Kunihiko Sadakane. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 90--101. Google ScholarDigital Library
- Simon J. Puglisi, William F. Smyth, and Andrew Turpin. 2007. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39, 2. Google ScholarDigital Library
- Simon J. Puglisi and Andrew Turpin. 2008. Space-time tradeoffs for longest-common-prefix array computation. In Proceedings of the 19th International Symposium on Algorithms and Computation (ISAAC'09). Lecture Notes in Computer Science, vol. 5369, Springer, 124--135. Google ScholarDigital Library
- Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. 2008. Fully-compressed suffix trees. In Proceedings of the 12th Latin American Symposium on Theoretical Informatics (LATIN'08). Lecture Notes in Computer Science, vol. 4957, Springer, 362--373. Google ScholarDigital Library
- Kunihiko Sadakane. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 2002 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'02). SIAM, 225--232. Google ScholarDigital Library
- Kunihiko Sadakane. 2007. Compressed suffix trees with full functionality. Theory Comput. Syst. 41, 4, 589--607. Google ScholarDigital Library
- Kunihiko Sadakane and Gonzalo Navarro. 2010. Fully-functional succinct trees. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'10). SIAM, 134--149. Google ScholarDigital Library
- Jouni Sirén. 2010. Sampled longest common prefix array. In Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching (CPM'10). Lecture Notes in Computer Science, vol. 6129, Springer, 227--237. Google ScholarDigital Library
- Niko Välimäki, Veli Mäkinen, Wolfgang Gerlach, and Kashyap Dixit. 2009. Engineering a compressed suffix tree implementation. ACM J. Exp. Algor. 14, 2. Google ScholarDigital Library
Index Terms
- Compressed suffix trees: Efficient computation and storage of LCP-values
Recommendations
Fully compressed suffix trees
Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n) bits of space, for a string of ...
Engineering a compressed suffix tree implementation
Suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes ...
Reverse engineering of compact suffix trees and links
Invented in the 1970s, the Suffix Tree (ST) is a data structure that indexes all substrings of a text in linear space. Although more space demanding than other indexes, the ST remains likely an inspiring index because it represents substrings in a ...
Comments