Abstract
The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. We present a recursive technique for building suffix trees that yields optimal algorithms in different computational models. Sorting is an inherent bottleneck in building suffix trees and our algorithms match the sorting lower bound. Specifically, we present the following results. (1) Weiner [1973], who introduced the data structure, gave an optimal 0(n)-time algorithm for building the suffix tree of an n-character string drawn from a constant-size alphabet. In the comparison model, there is a trivial Ο(n log n)-time lower bound based on sorting, and Weiner's algorithm matches this bound. For integer alphabets, the fastest known algorithm is the O(n log n)time comparison-based algorithm, but no super-linear lower bound is known. Closing this gap is the main open question in stringology. We settle this open problem by giving a linear time reduction to sorting for building suffix trees. Since sorting is a lower-bound for building suffix trees, this algorithm is time-optimal in every alphabet mode. In particular, for an alphabet consisting of integers in a polynomial range we get the first known linear-time algorithm. (2) All previously known algorithms for building suffix trees exhibit a marked absence of locality of reference, and thus they tend to elicit many page faults (I/Os) when indexing very long strings. They are therefore unsuitable for building suffix trees in secondary storage devices, where I/Os dominate the overall computational cost. We give a linear-I/O reduction to sorting for suffix tree construction. Since sorting is a trivial I/O-lower bound for building suffix trees, our algorithm is I/O-optimal.
- AGGARWAL, A., CHANDRA, A., AND SNIR, M. 1987. Heirarchical memory with block transfer. In Proceedings of the 28th IEEE Annual Symposium on Foundation of Computer Science. IEEE Computer Science Press, Los Alamitos, Calif., pp. 204-216.]]Google Scholar
- AGGARWAL, A., AND VITTER, J. S. 1988. The Input/Output complexity of sorting and related problems. Commun. ACM 31, 9 (Sept.), 1116-1127.]] Google Scholar
- AHO, A. V., HOPCROFT, J. E., AND ULLMAN, J. D. 1974. The Design and Analysis of Computer Algorithms Addison-Wesley, Reading, Mass.]] Google Scholar
- ALPERN, B., CARTER, L., AND FEIG, E. 1990. Uniform memory hierarchies. In Proceedings of the 31st IEEE Annual Symposium on Foundation of Computer Science, IEEE Computer Science Press, Los Alamos, Calif., pp. 600-609.]]Google Scholar
- ARGE, L., FERRAGINA, P., GROSSI,R.AND VITTER, J. S. 1997. On sorting strings in external memory. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (El Paso, Tex., May 4-6). ACM, New York, pp. 540-548.]] Google Scholar
- BENDER, M. AND FARACH-COLTON, M. 2000. Least common ancestors revisited. Latin '00.]]Google Scholar
- CHEN, M. T. AND SEIFERAS, J. 1985. Efficient and elegant subword tree construction. In Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, eds. NATO ASI Series F.: Computer and System Sciences, chap. 12, pp. 97-107.]]Google Scholar
- CHIANG, Y., GOODRICH, M. T., GROVE, E. F., TAMASSIA, R., VENGROFF, D. E. AND VITTER, J. S. 1995. External-memory graph algorithms. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, (San Francisco, Calif., Jan. 22-24). ACM, New York, pp. 139-149.]] Google Scholar
- CRAUSER, A. AND FERRAGINA, P. 2000. On constructing suffix arrays in external memory. Algorithmica, to appear.]]Google Scholar
- FARACH, M. 1997. Optimal suffix tree construction with large alphabets. In Proceedings of the 38th IEEE Annual Symposium on Foundation of Computer Science. IEEE Computer Science Press Los Alamitos, Calif., pp. 137-143.]] Google Scholar
- FARACH, M., FERRAGINA,P.AND MUTHUKRISHNAN, S. 1998. Overcoming the memory bottleneck in suffix tree construction. In Proceedings of the 39th IEEE Annual Symposium on Foundation of Computer Science. IEEE Computer Science Press, Los Alamitos, Calif., pp. 174-183.]] Google Scholar
- FARACH, M. AND MUTHUKRISHNAN, S. 1996. Optimal logarithmic time randomized suffix tree construction. In Proceedings of 23rd International Colloquium on Automata Languages and Programming. Elsevier Science Publishers, Amsterdam, The Netherlands, pp. 550-561.]] Google Scholar
- FERRAGINA, P. AND GROSSI, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46 (2), pp 236-280.]] Google Scholar
- GALIL, Z. 1985. Open problems in stringology. In Combinatorial Algorithms on Words, vol. 12, Z. Galil and A. Apostolico, eds. NATO ASI Series F, pp. 1-8.]]Google Scholar
- GUSFIELD, D. 1998. Algorithms on Strings, Trees, and Sequences. Addison-Wesley, Reading, Mass.]] Google Scholar
- HAREL, D. AND TARJAN, R. E. 1984. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13, 338-355.]] Google Scholar
- HARIHARAN, R. 1997. Optimal parallel suffix tree construction. J. Comput. Syst. Sci. 55, 1, 44-69.]] Google Scholar
- KOSARAJU, S. R. 1994. Real-time pattern matching and quasi-real-time construction of suffix trees. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (Montreal, Que., Canada, May 23-25). ACM, New York, pp. 310-316.]] Google Scholar
- KOSARAJU, S. AND DELCHER, A. 1995. Large-scale assembly of DNA strings and space-efficient construction of suffix trees. In Proceedings of the 27th Annual ACM Symposium on Theory of Computing. (Las Vegas, Nev., May 29-June 1). ACM, New York, pp. 169-177.]] Google Scholar
- KOSARAJU, S. AND DELCHER, A. 1996. Large-scale assembly of DNA strings and space-efficient construction of suffix trees (corrections). In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (Philadelphia, Pa., May 22-24). ACM, New York, page 659.]] Google Scholar
- MANBER, U. AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5, 935-948,]] Google Scholar
- MCCREIGHT, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (Apr.), 262-272.]] Google Scholar
- NODINE, M. H. AND VITTER, J. S. 1995. Greed sort: Optimal deterministic sorting on parallel disks. J. ACM 42, 4 (July), pp 919-933.]] Google Scholar
- RUEMMLER, C., AND WILKES, J. 1994. An introduction to disk drive modeling. IEEE Comput. 27, 3, 17-29.]] Google Scholar
- SAHINALP, S. C. AND VISHKIN, U. 1994. Symmetry breaking for suffix tree construction. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (Montreal, Que., Canada, May 23-25). ACM, New York, pp. 300-309.]] Google Scholar
- SHRIVER, E. 1997. Performance modelling for realistic storage devices. Ph.D. dissertation. Dept. Computer Science, New York Univ., New York, N.Y.]] Google Scholar
- THORUP, M. 1998. Faster deterministic sorting and priority queues in linear space. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, Calif., Jan.) ACM, New York, 550-555.]] Google Scholar
- VISHKIN, U. 1994. Consistent naming for suffix tree construction. DIMACS talk on joint work with S. C. Sahinalp.]]Google Scholar
- VITTER, J. S. 1998. External memory algorithms. Proceedings of the 6th Annual European Symposium on Algorithms (ESA '98). Lecture Notes in Computer Science, vol. 1461. Springer-Verlag, New York.]] Google Scholar
- VITTER, J. S. AND SHRIVER, E. 1994. Algorithms for parallel memory: Two-level memories. Algorithmica 12, 110-147.]]Google Scholar
- WEINER, P. 1973. Linear pattern matching algorithm. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory. IEEE Computer Society Press, Los Alamitos, Calif., pp. 1-11.]]Google Scholar
Index Terms
- On the sorting-complexity of suffix tree construction
Recommendations
The suffix binary search tree and suffix AVL tree
Suffix trees and suffix arrays are classical data structures that are used to represent the set of suffixes of a given string, and thereby facilitate the efficient solution of various string processing problems--in particular on-line string searching. ...
A Suffix Tree Or Not a Suffix Tree?
Combinatorial AlgorithmsAbstractIn this paper we study the structure of suffix trees. Given an unlabeled tree on n nodes and suffix links of its internal nodes, we ask the question “Is a suffix tree?", i.e., is there a string S whose suffix tree has the same topological ...
Optimal suffix tree construction with large alphabets
FOCS '97: Proceedings of the 38th Annual Symposium on Foundations of Computer ScienceThe suffix tree of a string is the fundamental data structure of combinatorial pattern matching. Weiner (1973), who introduced the data structure, gave an O(n)-time algorithm for building the suffix tree of an n-character string drawn from a constant ...
Comments