skip to main content
research-article

ERA: efficient serial and parallel suffix tree construction for very long strings

Published:01 September 2011Publication History
Skip Abstract Section

Abstract

The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient.

This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.

References

  1. A. Amir, G. M. Landau, M. Lewenstein, and D. Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3, Issue 2, Article 19, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Barsky, U. Stege, A. Thomo, and C. Upton. Suffix trees for very large genomic sequences. In Proc. of ACM CIKM, pages 1417--1420, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. Charras and T. Lecroq. Handbook of Exact String Matching Algorithms. King's College London Publications, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. H. Chim and X. Deng. A new suffix tree similarity measure for document clustering. In Proc. of ACM WWWW, pages 121--130, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. Ferragina, R. Giancarlo, G. Manzini, and M. Sciortino. Boosting textual compression in optimal linear time. Journal of ACM, 52:688--713, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Ghoting and K. Makarychev. Indexing genomic sequences on the IBM Blue Gene. In Proc. of Conf. on High Performance Computing Networking, Storage and Analysis (SC), pages 1--11, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Ghoting and K. Makarychev. Serial and parallel methods for I/O efficient suffix tree construction. In Proc. of ACM SIGMOD, pages 827--840, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. T. Hey, S. Tansley, and K. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.Google ScholarGoogle Scholar
  10. E. Hunt, M. P. Atkinson, and R. W. Irving. Database indexing for large DNA and protein sequence collections. The VLDB Journal, 11:256--271, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. T. W. Lam, R. Li, A. Tam, S. C. K. Wong, E. Wu, and S.-M. Yiu. High throughput short read alignment via bi-directional BWT. In BIBM, pages 31--36, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of ACM, 23:262--272, 1976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. B. Phoophakdee and M. J. Zaki. Genome-scale disk-based suffix tree indexing. In Proc. of ACM SIGMOD, pages 833--844, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. J. Puglisi, W. F. Smyth, and A. H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. F. Rasheed, M. Alshalalfa, and R. Alhajj. Efficient periodicity mining in time series databases using suffix trees. IEEE TKDE, 23:79--94, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Shieh and E. J. Keogh. iSAX: disk-aware mining and indexing of massive time series datasets. Data Min. Knowl. Discov., 19(1):24--57, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Tata, R. A. Hankins, and J. M. Patel. Practical suffix tree construction. In Proc. of VLDB, pages 36--47, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. Practical methods for constructing suffix trees. The VLDB Journal, 14(3):281--299, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. ERA: efficient serial and parallel suffix tree construction for very long strings

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 5, Issue 1
          September 2011
          84 pages

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 September 2011
          Published in pvldb Volume 5, Issue 1

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader