skip to main content
article

Linear work suffix array construction

Published:01 November 2006Publication History
Skip Abstract Section

Abstract

Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space--time tradeoff. For any v ∈ [1,√n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.

References

  1. Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. J. Disc. Algor. 2, 1, 53--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Abouelhoda, M. I., Ohlebusch, E., and Kurtz, S. 2002. Optimal exact string matching based on suffix arrays. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 2476. Springer-Verlag, Berlin/Heidelberg, Germany, 31--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Andersson, A., Larsson, N. J., and Swanson, K. 1999. Suffix trees on words. Algorithmica 23, 3, 246--260.Google ScholarGoogle ScholarCross RefCross Ref
  4. Bertoni, A., Mereghetti, C., and Palano, B. 2003. Golomb rulers and difference sets for succinct quantum automata. Int. J. Found. Comput. Sci. 14, 5, 871--888.Google ScholarGoogle ScholarCross RefCross Ref
  5. Burkhardt, S., and Kärkkäinen, J. 2003. Fast lightweight suffix array construction and checking. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag Berlin/Heidelberg, Germany, 55--69. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Burrows, M., and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, SRC (digital, Palo Alto).Google ScholarGoogle Scholar
  7. Chan, A., and Dehne, F. 1999. A note on coarse grained parallel integer sorting. Parall. Proc. Lett. 9, 4, 533--538.Google ScholarGoogle ScholarCross RefCross Ref
  8. Clifford, R., and Sergot, M. 2003. Distributed and paged suffix trees for large genetic databases. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag Berlin/Heidelberg, Germany, 70--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Colbourn, C. J., and Ling, A. C. H. 2000. Quorums from difference covers. Inf. Process. Lett. 75, 1--2 (July), 9--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Cole, R. 1988. Parallel merge sort. SIAM J. Comput. 17, 4, 770--785. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Crauser, A., and Ferragina, P. 2002. Theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1, 1--35.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Crochemore, M., and Rytter, W. 2002. Jewels of Stringology. World Scientific, Singapore.Google ScholarGoogle Scholar
  13. Dementiev, R., Mehnert, J., Kärkkäinen, J., and Sanders, P. 2006. Better external memory suffix array construction. ACM Journal of Experimental Algorithmics. To appear. (Earlier version in ALENEX'05.) Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Dementiev, R., and Sanders, P. 2003. Asynchronous parallel disk sorting. In Proceedings of the 15th Annual Symposium on Parallelism in Algorithms and Architectures. ACM, New York, 138--148. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Farach, M. 1997. Optimal suffix tree construction with large alphabets. In Proceedings of the 38th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 137--143. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Farach, M., Ferragina, P., and Muthukrishnan, S. 1998. Overcoming the memory bottleneck in suffix tree construction. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 174--183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Farach, M., and Muthukrishnan, S. 1996. Optimal logarithmic time randomized suffix tree construction. In Proceedings of the 23th International Conference on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 1099. Springer-Verlag London, UK, 550--561. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Farach-Colton, M., Ferragina, P., and Muthukrishnan, S. 2000. On the sorting-complexity of suffix tree construction. J. ACM 47, 6, 987--1011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ferragina, P., and Grossi, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. 1999. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 285--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Gerbessiotis, A. V., and Siniolakis, C. J. 2001. Merging on the BSP model. Paral. Comput. 27, 809--822. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Gonnet, G., Baeza-Yates, R., and Snider, T. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures & Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Englewood Cliffs, NJ. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Goodrich, M. T. 1999. Communication-efficient parallel sorting. SIAM J. Comput. 29, 2, 416--432. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Grossi, R., and Italiano, G. F. 1996. Suffix trees and their applications in string algorithms. Rapporto di Ricerca CS-96-14, Università “Ca' Foscari” di Venezia, Italy.Google ScholarGoogle Scholar
  25. Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Hagerup, T., and Raman, R. 1992. Waste makes haste: Tight bounds for loose parallel sorting. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 628--637.Google ScholarGoogle Scholar
  27. Hagerup, T., and Rüb, C. 1989. Optimal merging and sorting on the EREW-PRAM. Inf. Proc. Lett. 33, 4, 181--185. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Hon, W.-K., Lam, T.-W., Sadakane, K., and Sung, W.-K. 2003a. Constructing compressed suffix arrays with large alphabets. In Proceedings of the 14th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, vol. 2906. Springer, Berlin/Heidelberg, Germany, 240--249.Google ScholarGoogle Scholar
  29. Hon, W.-K., Sadakane, K., and Sung, W.-K. 2003b. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 251--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Jájá, J. 1992. An Introduction to Parallel Algorithms. Addison Wesley, Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Kärkkäinen, J. 1995. Suffix cactus: A cross between suffix tree and suffix array. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 937. Springer-Verlag, Berlin/Heidelberg, UK, 191--204.Google ScholarGoogle Scholar
  32. Kärkkäinen, J. 2006. Fast BWT in small space by blockwise suffix sorting. Theor. Comput. Sci. To appear.Google ScholarGoogle Scholar
  33. Kärkkäinen, J., and Sanders, P. 2003. Simple linear work suffix array construction. In Proceedings of the 30th International Conference on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 2719. Springer-Verlag, Berlin/Heidelberg, Germany, 943--955. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Kärkkäinen, J., and Ukkonen, E. 1996. Sparse suffix trees. In Proceedings of the 2nd Annual International Conference on Computing and Combinatorics. Lecture Notes in Computer Science, vol. 1090. Springer-Verlag, Berlin/Heidelberg, Germany, 219--230. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 2089. Springer-Verlag, Berlin/Heidelberg, Germany, 181--192. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Kilian, J., Kipnis, S., and Leiserson, C. E. 1990. The organization of permutation architectures with bused interconnections. IEEE Trans. Comput. 39, 11 (Nov.), 1346--1358. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Kim, D. K., Sim, J. S., Park, H., and Park, K. 2005. Constructing suffix arrays in linear time. J. Disc. Algor. 3, 2--4 (June), 126--142.Google ScholarGoogle ScholarCross RefCross Ref
  38. Ko, P., and Aluru, S. 2005. Space efficient linear time construction of suffix arrays. J. Disc. Algor. 3, 2--4 (June), 143--156.Google ScholarGoogle ScholarCross RefCross Ref
  39. Kulla, F., and Sanders, P. 2006. Scalable parallel suffix array construction. In Proceedings of the 13th European PVM/MPI User's Group Meeting. Lecture Notes in Computer Science, vol. 4192. Springer-Verlag, Berlin/Heidelberg, Germany, 22--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Lam, T.-W., Sadakane, K., and Sung, W.-K., and Yiu, S.-M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings of the 8th Annual Itnternational Conference on Computing and Combinatorics. Lecture Notes in Computer Science, vol. 2387. Springer-Verlag, Berlin/Heidelberg, Germany, 401--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Larsson, N. J., and Sadakane, K. 1999. Faster suffix sorting. Tech. Rep. LU-CS-TR:99-214, Dept. Computer Science, Lund University, Sweden.Google ScholarGoogle Scholar
  42. Luk, W.-S., and Wong, T.-T. 1997. Two new quorum based algorithms for distributed mutual exclusion. In Proceedings of the 17th International Conference on Distributed Computing Systems. IEEE Computer Society, Los Alamitos, CA, 100--106. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Manber, U., and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (Oct.), 935--948. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Manzini, G. 2004. Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science, vol. 3111. Springer-Verlag, Berlin/Heidelberg, Germany, 372--383.Google ScholarGoogle Scholar
  45. Manzini, G., and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1 (June), 33--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. McCreight, E. M. 1976. A space-economic suffix tree construction algorithm. J. ACM 23, 2, 262--272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. McIlroy, P. M., Bostic, K., and McIlroy, M. D. 1993. Engineering radix sort. Comput. Syst. 6, 1, 5--27.Google ScholarGoogle Scholar
  48. Na, J. C. 2005. Linear-time construction of compressed suffix arrays using o(n log n)-bit working space for large alphabets. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching. Springer-Verlag, Berlin/Heidelberg, Germany, 57--67. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Navarro, G., and Baeza-Yates, R. 2000. A hybrid indexing method for approximate string matching. J. Disc. Algor. 1, 1, 205--239. (Special issue on Matching Patterns.)Google ScholarGoogle Scholar
  50. Neubert, K.-D. 1998. The Flashsort1 algorithm. Dr. Dobb's J. 23, 2 (Feb.), 123--125.Google ScholarGoogle Scholar
  51. Nodine, M. H., and Vitter, J. S. 1993. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the 5th Annual Symposium on Parallel Algorithms and Architectures. ACM, New York, 120--129. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Nodine, M. H., and Vitter, J. S. 1995. Greed sort: An optimal sorting algorithm for multiple disks. J. ACM 42, 4, 919--933. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Puglisi, S., Smyth, W., and Turpin, A. 2005. The performance of linear time suffix sorting algorithms. In Proceedings of the Data Compression Conference. IEEE Computer Society, Los Alamitos, CA, 358--367. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Rajasekaran, S., and Reif, J. H. 1989. Optimal and sublogarithmic time randomized parallel sorting algorithms. SIAM J. Comput. 18, 3, 594--607. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Smyth, B. 2003. Computing Patterns in Strings. Pearson Addison--Wesley, Harlow, England.Google ScholarGoogle Scholar
  56. Ukkonen, E. 1995. On-line construction of suffix trees. Algorithmica 14, 3, 249--260.Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Valiant, L. G. 1990. A bridging model for parallel computation. Commun. ACM 22, 8 (Aug.), 103--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Vitter, J. S., and Shriver, E. A. M. 1994. Algorithms for parallel memory, I: Two level memories. Algorithmica 12, 2/3, 110--147.Google ScholarGoogle Scholar
  59. Weiner, P. 1973. Linear pattern matching algorithm. In Proceedings of the 14th Symposium on Switching and Automata Theory. IEEE Computer Society, Los Alamitos, CA, 1--11.Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Linear work suffix array construction

                  Recommendations

                  Comments

                  Login options

                  Check if you have access through your login credentials or your institution to get full access on this article.

                  Sign in

                  Full Access

                  PDF Format

                  View or Download as a PDF file.

                  PDF

                  eReader

                  View online with eReader.

                  eReader