Abstract
Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space--time tradeoff. For any v ∈ [1,√n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
- Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. J. Disc. Algor. 2, 1, 53--86. Google ScholarDigital Library
- Abouelhoda, M. I., Ohlebusch, E., and Kurtz, S. 2002. Optimal exact string matching based on suffix arrays. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 2476. Springer-Verlag, Berlin/Heidelberg, Germany, 31--43. Google ScholarDigital Library
- Andersson, A., Larsson, N. J., and Swanson, K. 1999. Suffix trees on words. Algorithmica 23, 3, 246--260.Google ScholarCross Ref
- Bertoni, A., Mereghetti, C., and Palano, B. 2003. Golomb rulers and difference sets for succinct quantum automata. Int. J. Found. Comput. Sci. 14, 5, 871--888.Google ScholarCross Ref
- Burkhardt, S., and Kärkkäinen, J. 2003. Fast lightweight suffix array construction and checking. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag Berlin/Heidelberg, Germany, 55--69. Google ScholarDigital Library
- Burrows, M., and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, SRC (digital, Palo Alto).Google Scholar
- Chan, A., and Dehne, F. 1999. A note on coarse grained parallel integer sorting. Parall. Proc. Lett. 9, 4, 533--538.Google ScholarCross Ref
- Clifford, R., and Sergot, M. 2003. Distributed and paged suffix trees for large genetic databases. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag Berlin/Heidelberg, Germany, 70--82. Google ScholarDigital Library
- Colbourn, C. J., and Ling, A. C. H. 2000. Quorums from difference covers. Inf. Process. Lett. 75, 1--2 (July), 9--12. Google ScholarDigital Library
- Cole, R. 1988. Parallel merge sort. SIAM J. Comput. 17, 4, 770--785. Google ScholarDigital Library
- Crauser, A., and Ferragina, P. 2002. Theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1, 1--35.Google ScholarDigital Library
- Crochemore, M., and Rytter, W. 2002. Jewels of Stringology. World Scientific, Singapore.Google Scholar
- Dementiev, R., Mehnert, J., Kärkkäinen, J., and Sanders, P. 2006. Better external memory suffix array construction. ACM Journal of Experimental Algorithmics. To appear. (Earlier version in ALENEX'05.) Google ScholarDigital Library
- Dementiev, R., and Sanders, P. 2003. Asynchronous parallel disk sorting. In Proceedings of the 15th Annual Symposium on Parallelism in Algorithms and Architectures. ACM, New York, 138--148. Google ScholarDigital Library
- Farach, M. 1997. Optimal suffix tree construction with large alphabets. In Proceedings of the 38th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 137--143. Google ScholarDigital Library
- Farach, M., Ferragina, P., and Muthukrishnan, S. 1998. Overcoming the memory bottleneck in suffix tree construction. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 174--183. Google ScholarDigital Library
- Farach, M., and Muthukrishnan, S. 1996. Optimal logarithmic time randomized suffix tree construction. In Proceedings of the 23th International Conference on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 1099. Springer-Verlag London, UK, 550--561. Google ScholarDigital Library
- Farach-Colton, M., Ferragina, P., and Muthukrishnan, S. 2000. On the sorting-complexity of suffix tree construction. J. ACM 47, 6, 987--1011. Google ScholarDigital Library
- Ferragina, P., and Grossi, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280. Google ScholarDigital Library
- Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. 1999. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 285--298. Google ScholarDigital Library
- Gerbessiotis, A. V., and Siniolakis, C. J. 2001. Merging on the BSP model. Paral. Comput. 27, 809--822. Google ScholarDigital Library
- Gonnet, G., Baeza-Yates, R., and Snider, T. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures & Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Englewood Cliffs, NJ. Google ScholarDigital Library
- Goodrich, M. T. 1999. Communication-efficient parallel sorting. SIAM J. Comput. 29, 2, 416--432. Google ScholarDigital Library
- Grossi, R., and Italiano, G. F. 1996. Suffix trees and their applications in string algorithms. Rapporto di Ricerca CS-96-14, Università “Ca' Foscari” di Venezia, Italy.Google Scholar
- Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK. Google ScholarDigital Library
- Hagerup, T., and Raman, R. 1992. Waste makes haste: Tight bounds for loose parallel sorting. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 628--637.Google Scholar
- Hagerup, T., and Rüb, C. 1989. Optimal merging and sorting on the EREW-PRAM. Inf. Proc. Lett. 33, 4, 181--185. Google ScholarDigital Library
- Hon, W.-K., Lam, T.-W., Sadakane, K., and Sung, W.-K. 2003a. Constructing compressed suffix arrays with large alphabets. In Proceedings of the 14th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, vol. 2906. Springer, Berlin/Heidelberg, Germany, 240--249.Google Scholar
- Hon, W.-K., Sadakane, K., and Sung, W.-K. 2003b. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 251--260. Google ScholarDigital Library
- Jájá, J. 1992. An Introduction to Parallel Algorithms. Addison Wesley, Reading, MA. Google ScholarDigital Library
- Kärkkäinen, J. 1995. Suffix cactus: A cross between suffix tree and suffix array. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 937. Springer-Verlag, Berlin/Heidelberg, UK, 191--204.Google Scholar
- Kärkkäinen, J. 2006. Fast BWT in small space by blockwise suffix sorting. Theor. Comput. Sci. To appear.Google Scholar
- Kärkkäinen, J., and Sanders, P. 2003. Simple linear work suffix array construction. In Proceedings of the 30th International Conference on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 2719. Springer-Verlag, Berlin/Heidelberg, Germany, 943--955. Google ScholarDigital Library
- Kärkkäinen, J., and Ukkonen, E. 1996. Sparse suffix trees. In Proceedings of the 2nd Annual International Conference on Computing and Combinatorics. Lecture Notes in Computer Science, vol. 1090. Springer-Verlag, Berlin/Heidelberg, Germany, 219--230. Google ScholarDigital Library
- Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 2089. Springer-Verlag, Berlin/Heidelberg, Germany, 181--192. Google ScholarDigital Library
- Kilian, J., Kipnis, S., and Leiserson, C. E. 1990. The organization of permutation architectures with bused interconnections. IEEE Trans. Comput. 39, 11 (Nov.), 1346--1358. Google ScholarDigital Library
- Kim, D. K., Sim, J. S., Park, H., and Park, K. 2005. Constructing suffix arrays in linear time. J. Disc. Algor. 3, 2--4 (June), 126--142.Google ScholarCross Ref
- Ko, P., and Aluru, S. 2005. Space efficient linear time construction of suffix arrays. J. Disc. Algor. 3, 2--4 (June), 143--156.Google ScholarCross Ref
- Kulla, F., and Sanders, P. 2006. Scalable parallel suffix array construction. In Proceedings of the 13th European PVM/MPI User's Group Meeting. Lecture Notes in Computer Science, vol. 4192. Springer-Verlag, Berlin/Heidelberg, Germany, 22--29. Google ScholarDigital Library
- Lam, T.-W., Sadakane, K., and Sung, W.-K., and Yiu, S.-M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings of the 8th Annual Itnternational Conference on Computing and Combinatorics. Lecture Notes in Computer Science, vol. 2387. Springer-Verlag, Berlin/Heidelberg, Germany, 401--410. Google ScholarDigital Library
- Larsson, N. J., and Sadakane, K. 1999. Faster suffix sorting. Tech. Rep. LU-CS-TR:99-214, Dept. Computer Science, Lund University, Sweden.Google Scholar
- Luk, W.-S., and Wong, T.-T. 1997. Two new quorum based algorithms for distributed mutual exclusion. In Proceedings of the 17th International Conference on Distributed Computing Systems. IEEE Computer Society, Los Alamitos, CA, 100--106. Google ScholarDigital Library
- Manber, U., and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (Oct.), 935--948. Google ScholarDigital Library
- Manzini, G. 2004. Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science, vol. 3111. Springer-Verlag, Berlin/Heidelberg, Germany, 372--383.Google Scholar
- Manzini, G., and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1 (June), 33--50. Google ScholarDigital Library
- McCreight, E. M. 1976. A space-economic suffix tree construction algorithm. J. ACM 23, 2, 262--272. Google ScholarDigital Library
- McIlroy, P. M., Bostic, K., and McIlroy, M. D. 1993. Engineering radix sort. Comput. Syst. 6, 1, 5--27.Google Scholar
- Na, J. C. 2005. Linear-time construction of compressed suffix arrays using o(n log n)-bit working space for large alphabets. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching. Springer-Verlag, Berlin/Heidelberg, Germany, 57--67. Google ScholarDigital Library
- Navarro, G., and Baeza-Yates, R. 2000. A hybrid indexing method for approximate string matching. J. Disc. Algor. 1, 1, 205--239. (Special issue on Matching Patterns.)Google Scholar
- Neubert, K.-D. 1998. The Flashsort1 algorithm. Dr. Dobb's J. 23, 2 (Feb.), 123--125.Google Scholar
- Nodine, M. H., and Vitter, J. S. 1993. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the 5th Annual Symposium on Parallel Algorithms and Architectures. ACM, New York, 120--129. Google ScholarDigital Library
- Nodine, M. H., and Vitter, J. S. 1995. Greed sort: An optimal sorting algorithm for multiple disks. J. ACM 42, 4, 919--933. Google ScholarDigital Library
- Puglisi, S., Smyth, W., and Turpin, A. 2005. The performance of linear time suffix sorting algorithms. In Proceedings of the Data Compression Conference. IEEE Computer Society, Los Alamitos, CA, 358--367. Google ScholarDigital Library
- Rajasekaran, S., and Reif, J. H. 1989. Optimal and sublogarithmic time randomized parallel sorting algorithms. SIAM J. Comput. 18, 3, 594--607. Google ScholarDigital Library
- Smyth, B. 2003. Computing Patterns in Strings. Pearson Addison--Wesley, Harlow, England.Google Scholar
- Ukkonen, E. 1995. On-line construction of suffix trees. Algorithmica 14, 3, 249--260.Google ScholarDigital Library
- Valiant, L. G. 1990. A bridging model for parallel computation. Commun. ACM 22, 8 (Aug.), 103--111. Google ScholarDigital Library
- Vitter, J. S., and Shriver, E. A. M. 1994. Algorithms for parallel memory, I: Two level memories. Algorithmica 12, 2/3, 110--147.Google Scholar
- Weiner, P. 1973. Linear pattern matching algorithm. In Proceedings of the 14th Symposium on Switching and Automata Theory. IEEE Computer Society, Los Alamitos, CA, 1--11.Google ScholarDigital Library
- Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA. Google ScholarDigital Library
Index Terms
- Linear work suffix array construction
Recommendations
A taxonomy of suffix array construction algorithms
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction ...
Two Efficient Algorithms for Linear Time Suffix Array Construction
We present, in this paper, two efficient algorithms for linear time suffix array construction. These two algorithms achieve their linear time complexities, using the techniques of divide-and-conquer, and recursion. What distinguish the proposed ...
The suffix binary search tree and suffix AVL tree
Suffix trees and suffix arrays are classical data structures that are used to represent the set of suffixes of a given string, and thereby facilitate the efficient solution of various string processing problems--in particular on-line string searching. ...
Comments