research-article

ERA: efficient serial and parallel suffix tree construction for very long strings

Authors:
Essam Mansour (King Abdullah Univ. of Science and Technology),
Amin Allam (King Abdullah Univ. of Science and Technology),
Spiros Skiadopoulos (University of Peloponnese),
Panos Kalnis (King Abdullah Univ. of Science and Technology)

Proceedings of the VLDB Endowment, Volume 5, Issue 1, pp. 49–60. https://doi.org/10.14778/2047485.2047490

Published: 01 September 2011

Abstract

The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient.
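
To make the indexing idea concrete, here is a minimal in-memory sketch. It builds an uncompressed suffix trie rather than a compact suffix tree, takes quadratic time and space, and is not the construction method described in this paper. The function names `build_suffix_trie` and `contains` and the `$` terminator are assumptions chosen for illustration.

```python
def build_suffix_trie(text, terminator="$"):
    """Naive suffix trie as nested dicts: one root-to-leaf path per suffix.

    Illustrative only: O(n^2) time and space, everything held in memory.
    """
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        # Mark the leaf with the suffix start position.
        # The key "pos" cannot clash with single-character edge labels.
        node["pos"] = start
    return root


def contains(root, pattern):
    """Return True if pattern occurs in the indexed text."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("banana")
found = contains(trie, "nan")      # substring of "banana"
missing = contains(trie, "nab")    # not a substring
```

Every substring of the text is a prefix of some suffix, so a lookup only has to walk one path from the root. The cost of that convenience is the size of the index, which is what makes construction hard once the string no longer fits in memory.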

This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.
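
The abstract does not spell out how the partitions are defined, so the following is only a rough sketch of one plausible reading of the vertical step. It assumes that a vertical partition is the set of suffixes sharing a common prefix, that the memory budget can be expressed as a maximum number of suffixes per partition (the hypothetical parameter `max_freq`), and that grouping is a simple greedy packing. It works in memory on a short string and does not model horizontal partitioning or I/O.

```python
from collections import Counter


def vertical_partition(text, max_freq):
    """Find variable-length prefixes that each start at most max_freq suffixes.

    Prefixes that are too frequent are extended by one symbol and recounted.
    max_freq is a hypothetical stand-in for a memory budget.
    """
    if max_freq < 1:
        raise ValueError("max_freq must be at least 1")
    s = text + "$"  # unique terminator guarantees termination
    alphabet = sorted(set(s))
    done = {}
    pending = list(alphabet)
    while pending:
        length = len(pending[0])  # all pending prefixes share one length
        wanted = set(pending)
        counts = Counter()
        for i in range(len(s) - length + 1):  # one scan of the string per round
            window = s[i:i + length]
            if window in wanted:
                counts[window] += 1
        next_pending = []
        for prefix in pending:
            freq = counts[prefix]
            if freq == 0:
                continue  # prefix never occurs; nothing to index
            if freq <= max_freq:
                done[prefix] = freq
            else:
                next_pending.extend(prefix + c for c in alphabet)
        pending = next_pending
    return done


def group_partitions(freqs, max_freq):
    """Greedily pack prefixes into groups whose total frequency fits the budget."""
    groups = []
    for prefix, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
        for group in groups:
            if group["total"] + freq <= max_freq:
                group["prefixes"].append(prefix)
                group["total"] += freq
                break
        else:
            groups.append({"prefixes": [prefix], "total": freq})
    return groups


parts = vertical_partition("banana", max_freq=2)
groups = group_partitions(parts, max_freq=2)
```

In this sketch every suffix of the terminated string falls under exactly one prefix, so the partition frequencies sum to the string length. Packing several small partitions into one group is one way a single pass over the input could serve more than one subtree.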

Index Terms

ERA: efficient serial and parallel suffix tree construction for very long strings
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information storage systems
    1. Record storage systems
      1. Directory structures
        B-trees
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Trees

Published in

Proceedings of the VLDB Endowment, Volume 5, Issue 1
September 2011
84 pages
ISSN: 2150-8097
Publisher: VLDB Endowment