skip to main content
10.1145/1559845.1559931acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Serial and parallel methods for i/o efficient suffix tree construction

Published:29 June 2009Publication History

ABSTRACT

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm.

To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting -- for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation -- to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.

References

  1. S. Bedathur and J. Haritsa. Engineering a fast online persistent suffix tree construction. In Proceedings of the IEEE International Conference on Data Engineering 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Bedathur and J. Haritsa. Search optimized suffix tree storage for biological applications. In Proceedings of the IEEE International Conference on High Performance Computing 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. N. Bray, I. Dubchak, and L. Pachter. AVID:A global alignment program. Genome Research 13(1), 2003.Google ScholarGoogle Scholar
  4. A. Brown. Constructing genome scale suffix trees. In Proceedings of the Asia-Pacific Bioinformatics Conference 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, 1994.Google ScholarGoogle Scholar
  6. A. Carvalho, A. Freitas, A. Oliveira, and M. Sagot. Efficient extraction of structured motifs using box links. In Proceedings of the 11th Conference on String Processing and Information Retrieval 2004.Google ScholarGoogle ScholarCross RefCross Ref
  7. W. Chang and E. Lawler. Sublinear approximate string matching and biological applications. Algorithmica 12(4/5), 1994.Google ScholarGoogle Scholar
  8. C. Cheung, J. Yu, and H. Lu. Constructing suffix trees for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. R. Clifford and M. Sergot. Distributed and paged suffix trees for large genetic databases. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Delcher, S. Kasif, R. Fleischmann, J. Peterson, O. White, and S. Salzberg. Alignment of whole genomes. Nucleic Acids Res. 27(11), 1999.Google ScholarGoogle Scholar
  11. A. Delcher, A. Phillippy, J. Carlton, and S. Salzberg. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30(1), 2002.Google ScholarGoogle Scholar
  12. M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. Overcoming the memory bottleneck in suffix tree construction. In Proceedings of the Annual Symposium on Foundations of Computer Science 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Gus field. Algorithms on strings, trees, and sequences: Computer science and computational biology Cambridge University Press, Cambridge, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D. Gus Field and J. Stoye. Linear time algorithms for Finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences 69(4), 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. E. Hunt, M. Atkinson, and R. Irving. A database index to large biological sequences. In Proceedings of 27th International Conference on Very Large Databases 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. Japp. The top-compressed suffix tree:A disk resident index for large sequences. In Proceedings of the Bioinformatics Workshop at the 21st Annual British National Conference on Databases 2004.Google ScholarGoogle Scholar
  17. S. Kurtz, J. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. Reputer: The manifold applications of repeat analysis on a genome scale. Nucleic Acids Res. 29, 2001.Google ScholarGoogle Scholar
  18. S. Kurtz, A. Phillippy, A. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. Salzberg. Versatile and open software for comparing large genomes. Genome Bio. 5(R12), 2004.Google ScholarGoogle Scholar
  19. E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 1976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. Meek, J. Patel, and S. Kasetty. Oasis: An online and accurate technique for local-alignment searches on biological sequences. In Proceedings of 29th International Conference on Very Large Databases 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. NCBI. Public collections of dna and rna sequence reach 100 gigabases. http://www.nlm.nih.gov/news/press_releases/dna_rna_100_gig.html.Google ScholarGoogle Scholar
  22. B. Phoophakdee and M. Zaki. Genome-scale disk-based suffix tree indexing. In Proceedings of the ACM International Conference on Management of Data 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. B. Phoophakdee and M. Zaki. Trellis+: An effective approach for indexing massive sequences. In Proceedings of the Pacific Symposium on Biocomputing 2008.Google ScholarGoogle Scholar
  24. K. Schurmann and J. Stoye. suffix tree construction and storage with limited main memory. Technical report, Universitat Bielefeld, 2003.Google ScholarGoogle Scholar
  25. Y. Tian, S. Tata, R. Hankins, and J. Patel. Practical methods for constructing suffix trees. VLDB Journal 14(3), 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. E. Ukkonen. Constructing suffix trees on-line in linear time. In Proceedings of the IFIP 12th Work Computer Congress on Algorithms, Software, Architecture: Information Processing 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. P. Weiner. Linear pattern matching algorithms. In Proceedings of 14th Annual Symposium on Switch and Automata Theory 1973. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. Yankov, E. Keogh, and U. Rebbapragada. Disk-aware discord discovery:Finding unusual time series in tera-byte sized datasets. In Proceedings of the IEEE International Conference on Data Mining 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. O. Zamir and O. Etzioni. Web document clustering: a feasibility demonstration. In Proceedings of 21th International Conference on Research and Development in Information Retrieval 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Serial and parallel methods for i/o efficient suffix tree construction

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in
                • Published in

                  cover image ACM Conferences
                  SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
                  June 2009
                  1168 pages
                  ISBN:9781605585512
                  DOI:10.1145/1559845

                  Copyright © 2009 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 29 June 2009

                  Permissions

                  Request permissions about this article.

                  Request Permissions

                  Check for updates

                  Qualifiers

                  • research-article

                  Acceptance Rates

                  Overall Acceptance Rate785of4,003submissions,20%

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader