ABSTRACT
Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm.
To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting -- for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation -- to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.
- S. Bedathur and J. Haritsa. Engineering a fast online persistent suffix tree construction. In Proceedings of the IEEE International Conference on Data Engineering 2004. Google ScholarDigital Library
- S. Bedathur and J. Haritsa. Search optimized suffix tree storage for biological applications. In Proceedings of the IEEE International Conference on High Performance Computing 2005. Google ScholarDigital Library
- N. Bray, I. Dubchak, and L. Pachter. AVID:A global alignment program. Genome Research 13(1), 2003.Google Scholar
- A. Brown. Constructing genome scale suffix trees. In Proceedings of the Asia-Pacific Bioinformatics Conference 2004. Google ScholarDigital Library
- M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, 1994.Google Scholar
- A. Carvalho, A. Freitas, A. Oliveira, and M. Sagot. Efficient extraction of structured motifs using box links. In Proceedings of the 11th Conference on String Processing and Information Retrieval 2004.Google ScholarCross Ref
- W. Chang and E. Lawler. Sublinear approximate string matching and biological applications. Algorithmica 12(4/5), 1994.Google Scholar
- C. Cheung, J. Yu, and H. Lu. Constructing suffix trees for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 2005. Google ScholarDigital Library
- R. Clifford and M. Sergot. Distributed and paged suffix trees for large genetic databases. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching 2003. Google ScholarDigital Library
- A. Delcher, S. Kasif, R. Fleischmann, J. Peterson, O. White, and S. Salzberg. Alignment of whole genomes. Nucleic Acids Res. 27(11), 1999.Google Scholar
- A. Delcher, A. Phillippy, J. Carlton, and S. Salzberg. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30(1), 2002.Google Scholar
- M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. Overcoming the memory bottleneck in suffix tree construction. In Proceedings of the Annual Symposium on Foundations of Computer Science 1998. Google ScholarDigital Library
- D. Gus field. Algorithms on strings, trees, and sequences: Computer science and computational biology Cambridge University Press, Cambridge, 1997. Google ScholarDigital Library
- D. Gus Field and J. Stoye. Linear time algorithms for Finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences 69(4), 2004. Google ScholarDigital Library
- E. Hunt, M. Atkinson, and R. Irving. A database index to large biological sequences. In Proceedings of 27th International Conference on Very Large Databases 2001. Google ScholarDigital Library
- R. Japp. The top-compressed suffix tree:A disk resident index for large sequences. In Proceedings of the Bioinformatics Workshop at the 21st Annual British National Conference on Databases 2004.Google Scholar
- S. Kurtz, J. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. Reputer: The manifold applications of repeat analysis on a genome scale. Nucleic Acids Res. 29, 2001.Google Scholar
- S. Kurtz, A. Phillippy, A. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. Salzberg. Versatile and open software for comparing large genomes. Genome Bio. 5(R12), 2004.Google Scholar
- E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 1976. Google ScholarDigital Library
- C. Meek, J. Patel, and S. Kasetty. Oasis: An online and accurate technique for local-alignment searches on biological sequences. In Proceedings of 29th International Conference on Very Large Databases 2003. Google ScholarDigital Library
- NCBI. Public collections of dna and rna sequence reach 100 gigabases. http://www.nlm.nih.gov/news/press_releases/dna_rna_100_gig.html.Google Scholar
- B. Phoophakdee and M. Zaki. Genome-scale disk-based suffix tree indexing. In Proceedings of the ACM International Conference on Management of Data 2007. Google ScholarDigital Library
- B. Phoophakdee and M. Zaki. Trellis+: An effective approach for indexing massive sequences. In Proceedings of the Pacific Symposium on Biocomputing 2008.Google Scholar
- K. Schurmann and J. Stoye. suffix tree construction and storage with limited main memory. Technical report, Universitat Bielefeld, 2003.Google Scholar
- Y. Tian, S. Tata, R. Hankins, and J. Patel. Practical methods for constructing suffix trees. VLDB Journal 14(3), 2005. Google ScholarDigital Library
- E. Ukkonen. Constructing suffix trees on-line in linear time. In Proceedings of the IFIP 12th Work Computer Congress on Algorithms, Software, Architecture: Information Processing 1992. Google ScholarDigital Library
- P. Weiner. Linear pattern matching algorithms. In Proceedings of 14th Annual Symposium on Switch and Automata Theory 1973. Google ScholarDigital Library
- D. Yankov, E. Keogh, and U. Rebbapragada. Disk-aware discord discovery:Finding unusual time series in tera-byte sized datasets. In Proceedings of the IEEE International Conference on Data Mining 2007. Google ScholarDigital Library
- O. Zamir and O. Etzioni. Web document clustering: a feasibility demonstration. In Proceedings of 21th International Conference on Research and Development in Information Retrieval 1998. Google ScholarDigital Library
Index Terms
- Serial and parallel methods for i/o efficient suffix tree construction
Recommendations
A simple parallel cartesian tree algorithm and its application to parallel suffix tree construction
Inaugural Issue and Special Section on Top Papers from PACT-21, and Regular PapersWe present a simple linear work and space, and polylogarithmic time parallel algorithm for generating multiway Cartesian trees. We show that bottom-up traversals of the multiway Cartesian tree on the interleaved suffix array and longest common prefix ...
I/O efficient algorithms for serial and parallel suffix tree construction
Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the ...
Genome-scale disk-based suffix tree indexing
SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of dataWith the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing these massive data. Suffix trees are widely used to solve many sequence-based problems, and they ...
Comments