Abstract
The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient.
This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.
- A. Amir, G. M. Landau, M. Lewenstein, and D. Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3, Issue 2, Article 19, 2007. Google ScholarDigital Library
- M. Barsky, U. Stege, A. Thomo, and C. Upton. Suffix trees for very large genomic sequences. In Proc. of ACM CIKM, pages 1417--1420, 2009. Google ScholarDigital Library
- C. Charras and T. Lecroq. Handbook of Exact String Matching Algorithms. King's College London Publications, 2004. Google ScholarDigital Library
- H. Chim and X. Deng. A new suffix tree similarity measure for document clustering. In Proc. of ACM WWWW, pages 121--130, 2007. Google ScholarDigital Library
- P. Ferragina, R. Giancarlo, G. Manzini, and M. Sciortino. Boosting textual compression in optimal linear time. Journal of ACM, 52:688--713, 2005. Google ScholarDigital Library
- A. Ghoting and K. Makarychev. Indexing genomic sequences on the IBM Blue Gene. In Proc. of Conf. on High Performance Computing Networking, Storage and Analysis (SC), pages 1--11, 2009. Google ScholarDigital Library
- A. Ghoting and K. Makarychev. Serial and parallel methods for I/O efficient suffix tree construction. In Proc. of ACM SIGMOD, pages 827--840, 2009. Google ScholarDigital Library
- D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. Google ScholarDigital Library
- T. Hey, S. Tansley, and K. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.Google Scholar
- E. Hunt, M. P. Atkinson, and R. W. Irving. Database indexing for large DNA and protein sequence collections. The VLDB Journal, 11:256--271, 2002. Google ScholarDigital Library
- T. W. Lam, R. Li, A. Tam, S. C. K. Wong, E. Wu, and S.-M. Yiu. High throughput short read alignment via bi-directional BWT. In BIBM, pages 31--36, 2009. Google ScholarDigital Library
- E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of ACM, 23:262--272, 1976. Google ScholarDigital Library
- B. Phoophakdee and M. J. Zaki. Genome-scale disk-based suffix tree indexing. In Proc. of ACM SIGMOD, pages 833--844, 2007. Google ScholarDigital Library
- S. J. Puglisi, W. F. Smyth, and A. H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39, 2007. Google ScholarDigital Library
- F. Rasheed, M. Alshalalfa, and R. Alhajj. Efficient periodicity mining in time series databases using suffix trees. IEEE TKDE, 23:79--94, 2011. Google ScholarDigital Library
- J. Shieh and E. J. Keogh. iSAX: disk-aware mining and indexing of massive time series datasets. Data Min. Knowl. Discov., 19(1):24--57, 2009. Google ScholarDigital Library
- S. Tata, R. A. Hankins, and J. M. Patel. Practical suffix tree construction. In Proc. of VLDB, pages 36--47, 2004. Google ScholarDigital Library
- Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. Practical methods for constructing suffix trees. The VLDB Journal, 14(3):281--299, 2005. Google ScholarDigital Library
- E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarDigital Library
Index Terms
- ERA: efficient serial and parallel suffix tree construction for very long strings
Recommendations
OpenMP for Networks of SMPs
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between ...
SPMD OpenMP versus MPI on a IBM SMP for 3 Kernels of the NAS Benchmarks
ISHPC '02: Proceedings of the 4th International Symposium on High Performance ComputingShared Memory Multiprocessors are becoming more popular since they are used to deploy large parallel computers. The current trend is to enlarge the number of processors inside such multiprocessor nodes. However a lot of existing applications are using ...
SPMD OpenMP versus MPI on a IBM SMP for 3 Kernels of the NAS Benchmarks
ISHPC '02: Proceedings of the 4th International Symposium on High Performance ComputingShared Memory Multiprocessors are becoming more popular since they are used to deploy large parallel computers. The current trend is to enlarge the number of processors inside such multiprocessor nodes. However a lot of existing applications are using ...
Comments