skip to main content
article
Free Access

On the sorting-complexity of suffix tree construction

Published:01 November 2000Publication History
Skip Abstract Section

Abstract

The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. We present a recursive technique for building suffix trees that yields optimal algorithms in different computational models. Sorting is an inherent bottleneck in building suffix trees and our algorithms match the sorting lower bound. Specifically, we present the following results. (1) Weiner [1973], who introduced the data structure, gave an optimal 0(n)-time algorithm for building the suffix tree of an n-character string drawn from a constant-size alphabet. In the comparison model, there is a trivial Ο(n log n)-time lower bound based on sorting, and Weiner's algorithm matches this bound. For integer alphabets, the fastest known algorithm is the O(n log n)time comparison-based algorithm, but no super-linear lower bound is known. Closing this gap is the main open question in stringology. We settle this open problem by giving a linear time reduction to sorting for building suffix trees. Since sorting is a lower-bound for building suffix trees, this algorithm is time-optimal in every alphabet mode. In particular, for an alphabet consisting of integers in a polynomial range we get the first known linear-time algorithm. (2) All previously known algorithms for building suffix trees exhibit a marked absence of locality of reference, and thus they tend to elicit many page faults (I/Os) when indexing very long strings. They are therefore unsuitable for building suffix trees in secondary storage devices, where I/Os dominate the overall computational cost. We give a linear-I/O reduction to sorting for suffix tree construction. Since sorting is a trivial I/O-lower bound for building suffix trees, our algorithm is I/O-optimal.

References

  1. AGGARWAL, A., CHANDRA, A., AND SNIR, M. 1987. Heirarchical memory with block transfer. In Proceedings of the 28th IEEE Annual Symposium on Foundation of Computer Science. IEEE Computer Science Press, Los Alamitos, Calif., pp. 204-216.]]Google ScholarGoogle Scholar
  2. AGGARWAL, A., AND VITTER, J. S. 1988. The Input/Output complexity of sorting and related problems. Commun. ACM 31, 9 (Sept.), 1116-1127.]] Google ScholarGoogle Scholar
  3. AHO, A. V., HOPCROFT, J. E., AND ULLMAN, J. D. 1974. The Design and Analysis of Computer Algorithms Addison-Wesley, Reading, Mass.]] Google ScholarGoogle Scholar
  4. ALPERN, B., CARTER, L., AND FEIG, E. 1990. Uniform memory hierarchies. In Proceedings of the 31st IEEE Annual Symposium on Foundation of Computer Science, IEEE Computer Science Press, Los Alamos, Calif., pp. 600-609.]]Google ScholarGoogle Scholar
  5. ARGE, L., FERRAGINA, P., GROSSI,R.AND VITTER, J. S. 1997. On sorting strings in external memory. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (El Paso, Tex., May 4-6). ACM, New York, pp. 540-548.]] Google ScholarGoogle Scholar
  6. BENDER, M. AND FARACH-COLTON, M. 2000. Least common ancestors revisited. Latin '00.]]Google ScholarGoogle Scholar
  7. CHEN, M. T. AND SEIFERAS, J. 1985. Efficient and elegant subword tree construction. In Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, eds. NATO ASI Series F.: Computer and System Sciences, chap. 12, pp. 97-107.]]Google ScholarGoogle Scholar
  8. CHIANG, Y., GOODRICH, M. T., GROVE, E. F., TAMASSIA, R., VENGROFF, D. E. AND VITTER, J. S. 1995. External-memory graph algorithms. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, (San Francisco, Calif., Jan. 22-24). ACM, New York, pp. 139-149.]] Google ScholarGoogle Scholar
  9. CRAUSER, A. AND FERRAGINA, P. 2000. On constructing suffix arrays in external memory. Algorithmica, to appear.]]Google ScholarGoogle Scholar
  10. FARACH, M. 1997. Optimal suffix tree construction with large alphabets. In Proceedings of the 38th IEEE Annual Symposium on Foundation of Computer Science. IEEE Computer Science Press Los Alamitos, Calif., pp. 137-143.]] Google ScholarGoogle Scholar
  11. FARACH, M., FERRAGINA,P.AND MUTHUKRISHNAN, S. 1998. Overcoming the memory bottleneck in suffix tree construction. In Proceedings of the 39th IEEE Annual Symposium on Foundation of Computer Science. IEEE Computer Science Press, Los Alamitos, Calif., pp. 174-183.]] Google ScholarGoogle Scholar
  12. FARACH, M. AND MUTHUKRISHNAN, S. 1996. Optimal logarithmic time randomized suffix tree construction. In Proceedings of 23rd International Colloquium on Automata Languages and Programming. Elsevier Science Publishers, Amsterdam, The Netherlands, pp. 550-561.]] Google ScholarGoogle Scholar
  13. FERRAGINA, P. AND GROSSI, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46 (2), pp 236-280.]] Google ScholarGoogle Scholar
  14. GALIL, Z. 1985. Open problems in stringology. In Combinatorial Algorithms on Words, vol. 12, Z. Galil and A. Apostolico, eds. NATO ASI Series F, pp. 1-8.]]Google ScholarGoogle Scholar
  15. GUSFIELD, D. 1998. Algorithms on Strings, Trees, and Sequences. Addison-Wesley, Reading, Mass.]] Google ScholarGoogle Scholar
  16. HAREL, D. AND TARJAN, R. E. 1984. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13, 338-355.]] Google ScholarGoogle Scholar
  17. HARIHARAN, R. 1997. Optimal parallel suffix tree construction. J. Comput. Syst. Sci. 55, 1, 44-69.]] Google ScholarGoogle Scholar
  18. KOSARAJU, S. R. 1994. Real-time pattern matching and quasi-real-time construction of suffix trees. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (Montreal, Que., Canada, May 23-25). ACM, New York, pp. 310-316.]] Google ScholarGoogle Scholar
  19. KOSARAJU, S. AND DELCHER, A. 1995. Large-scale assembly of DNA strings and space-efficient construction of suffix trees. In Proceedings of the 27th Annual ACM Symposium on Theory of Computing. (Las Vegas, Nev., May 29-June 1). ACM, New York, pp. 169-177.]] Google ScholarGoogle Scholar
  20. KOSARAJU, S. AND DELCHER, A. 1996. Large-scale assembly of DNA strings and space-efficient construction of suffix trees (corrections). In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (Philadelphia, Pa., May 22-24). ACM, New York, page 659.]] Google ScholarGoogle Scholar
  21. MANBER, U. AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5, 935-948,]] Google ScholarGoogle Scholar
  22. MCCREIGHT, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (Apr.), 262-272.]] Google ScholarGoogle Scholar
  23. NODINE, M. H. AND VITTER, J. S. 1995. Greed sort: Optimal deterministic sorting on parallel disks. J. ACM 42, 4 (July), pp 919-933.]] Google ScholarGoogle Scholar
  24. RUEMMLER, C., AND WILKES, J. 1994. An introduction to disk drive modeling. IEEE Comput. 27, 3, 17-29.]] Google ScholarGoogle Scholar
  25. SAHINALP, S. C. AND VISHKIN, U. 1994. Symmetry breaking for suffix tree construction. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (Montreal, Que., Canada, May 23-25). ACM, New York, pp. 300-309.]] Google ScholarGoogle Scholar
  26. SHRIVER, E. 1997. Performance modelling for realistic storage devices. Ph.D. dissertation. Dept. Computer Science, New York Univ., New York, N.Y.]] Google ScholarGoogle Scholar
  27. THORUP, M. 1998. Faster deterministic sorting and priority queues in linear space. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, Calif., Jan.) ACM, New York, 550-555.]] Google ScholarGoogle Scholar
  28. VISHKIN, U. 1994. Consistent naming for suffix tree construction. DIMACS talk on joint work with S. C. Sahinalp.]]Google ScholarGoogle Scholar
  29. VITTER, J. S. 1998. External memory algorithms. Proceedings of the 6th Annual European Symposium on Algorithms (ESA '98). Lecture Notes in Computer Science, vol. 1461. Springer-Verlag, New York.]] Google ScholarGoogle Scholar
  30. VITTER, J. S. AND SHRIVER, E. 1994. Algorithms for parallel memory: Two-level memories. Algorithmica 12, 110-147.]]Google ScholarGoogle Scholar
  31. WEINER, P. 1973. Linear pattern matching algorithm. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory. IEEE Computer Society Press, Los Alamitos, Calif., pp. 1-11.]]Google ScholarGoogle Scholar

Index Terms

  1. On the sorting-complexity of suffix tree construction

                  Recommendations

                  Comments

                  Login options

                  Check if you have access through your login credentials or your institution to get full access on this article.

                  Sign in

                  Full Access

                  PDF Format

                  View or Download as a PDF file.

                  PDF

                  eReader

                  View online with eReader.

                  eReader