research-article

Serial and parallel methods for i/o efficient suffix tree construction

Authors:
Amol Ghoting, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Konstantin Makarychev, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, June 2009, Pages 827–840. https://doi.org/10.1145/1559845.1559931

Published: 29 June 2009

ABSTRACT

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered because suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm.

To address this challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting: for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation: to achieve reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and a fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire human genome on a 1024-processor system in under 15 minutes.
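As a rough illustration of what a simple suffix tree construction can look like, the Python sketch below builds a suffix tree by inserting suffixes one at a time (the naive quadratic-time method). It is an assumption that this resembles the starting point the abstract alludes to; the abstract does not name the algorithm, and the tiled, disk-aware, and parallel restructuring it describes is not shown here. All names (Node, build_naive_suffix_tree, contains, count_leaves) and the "$" terminator are hypothetical choices made for this example.

```python
class Node:
    """Suffix tree node. Each outgoing edge is keyed by its first character
    and stored as (start, end, child), meaning the label s[start:end]."""

    def __init__(self):
        self.children = {}
        self.suffix_index = None  # set on leaves only


def build_naive_suffix_tree(text, terminator="$"):
    """Insert every suffix of text + terminator, one at a time (O(n^2) worst case)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    n = len(s)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            edge = node.children.get(s[pos])
            if edge is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.suffix_index = i
                node.children[s[pos]] = (pos, n, leaf)
                break
            start, end, child = edge
            # Match the suffix against the edge label character by character.
            k = 0
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:
                # Whole edge matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it and attach a new leaf.
            mid = Node()
            mid.children[s[start + k]] = (start + k, end, child)
            node.children[s[start]] = (start, start + k, mid)
            leaf = Node()
            leaf.suffix_index = i
            mid.children[s[pos + k]] = (pos + k, n, leaf)
            break
    return s, root


def contains(s, root, pattern):
    """Return True if pattern occurs as a substring of the indexed string."""
    node, i = root, 0
    while i < len(pattern):
        edge = node.children.get(pattern[i])
        if edge is None:
            return False
        start, end, child = edge
        m = min(end - start, len(pattern) - i)
        if s[start:start + m] != pattern[i:i + m]:
            return False
        i += m
        node = child
    return True


def count_leaves(root):
    """Count leaves iteratively; there is one leaf per inserted suffix."""
    stack, leaves = [root], 0
    while stack:
        node = stack.pop()
        if not node.children:
            leaves += 1
        stack.extend(child for (_, _, child) in node.children.values())
    return leaves
```

A call such as `s, root = build_naive_suffix_tree("banana")` followed by `contains(s, root, "nan")` exercises the sketch. Each insertion walks from the root and may touch nodes anywhere in the tree, which gives a feel for why locality of reference matters once the tree no longer fits in main memory.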



Published in
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN: 9781605585512
DOI: 10.1145/1559845
Editors: Carsten Binnig, Benoit Dageville
General Chairs: Uğur Çetintemel (Brown University, USA), Stan Zdonik (Brown University, USA)
Program Chair: Donald Kossmann (ETH Zurich, Switzerland)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
disk-based
external memory
genome indexing
parallel
sequence indexing
suffix tree
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%