research-article

A Linear-Time Algorithm for Seeds Computation

Authors:
Tomasz Kociumaka

University of Warsaw and Bar-Ilan University, Ramat-Gan, Israel

University of Warsaw and Bar-Ilan University, Ramat-Gan, Israel

0000-0002-2477-1702
View Profile

,
Marcin Kubica

University of Warsaw, Banacha, Warsaw, Poland

University of Warsaw, Banacha, Warsaw, Poland
View Profile

,
Jakub Radoszewski

University of Warsaw, Banacha, Warsaw, Poland

University of Warsaw, Banacha, Warsaw, Poland

0000-0002-0067-6401
View Profile

,
Wojciech Rytter

University of Warsaw, Banacha, Warsaw, Poland

University of Warsaw, Banacha, Warsaw, Poland
View Profile

,
Tomasz Waleń

University of Warsaw, Banacha, Warsaw, Poland

University of Warsaw, Banacha, Warsaw, Poland
View Profile

Authors Info & Claims

ACM Transactions on Algorithms Volume 16 Issue 2Article No.: 27pp 1–23https://doi.org/10.1145/3386369

Published:21 April 2020Publication History

ACM Transactions on Algorithms

Abstract

A seed in a word is a relaxed version of a period in which the occurrences of the repeating subword may overlap. Our first contribution is a linear-time algorithm computing a linear-size representation of all seeds of a word (the number of seeds might be quadratic). In particular, one can easily derive the shortest seed and the number of seeds from our representation. Thus, we solve an open problem stated in a survey by Smyth from 2000 and improve upon a previous O(n log n)-time algorithm by Iliopoulos et al. from 1996. Our approach is based on combinatorial relations between seeds and subword complexity (used here for the first time in the context of seeds).

In previous papers, compact representations of seeds consisted of two independent parts operating on the suffix tree of the input word and the suffix tree of its reverse, respectively. Our second contribution is a novel and significantly simpler representation of all seeds that avoids dealing with the suffix tree of the reversed word. This result is also of independent interest from a combinatorial point of view.

A preliminary version of this work, with a much more complex algorithm constructing a representation of seeds on two suffix trees, was presented at the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’12).

References

Hayam Alamro, Golnaz Badkobeh, Djamal Belazzougui, Costas S. Iliopoulos, and Simon J. Puglisi. 2019. Computing the antiperiod(s) of a string. In 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, N. Pisanti and S. P. Pissis (Eds.), Vol. 128. Leibniz International Proceedings in Informatics. Schloss Dagstuhl--Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 32, 11 pages. DOI:https://doi.org/10.4230/LIPIcs.CPM.2019.32Google Scholar
Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. 2019. Quasi-linear-time algorithm for longest common circular factor. In 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, N. Pisanti and S. P. Pissis (Eds.), Vol. 128. Leibniz International Proceedings in Informatics. Schloss Dagstuhl--Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 25, 14 pages. DOI:https://doi.org/10.4230/LIPIcs.CPM.2019.25Google Scholar
Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. 2007. Dynamic text and static pattern matching. ACM Transactions on Algorithms 3, 2 (2007), 19. DOI:https://doi.org/10.1145/1240233.1240242Google ScholarDigital Library
Amihood Amir, Avivit Levy, Moshe Lewenstein, Ronit Lubin, and Benny Porat. 2019. Can we recover the cover? Algorithmica 81, 7 (2019), 2857--2875. DOI:https://doi.org/10.1007/s00453-019-00559-8Google ScholarDigital Library
Amihood Amir, Avivit Levy, Ronit Lubin, and Ely Porat. 2019. Approximate cover of strings. Theoretical Computer Science 793 (2019), 59--69. DOI:https://doi.org/10.1016/j.tcs.2019.05.020Google ScholarCross Ref
Amihood Amir, Avivit Levy, and Ely Porat. 2018. Quasi-periodicity under mismatch errors. In 29th Annual Symposium on Combinatorial Pattern Matching, CPM 2018, G. Navarro, D. Sankoff, and B. Zhu (Eds.), Vol. 105. Leibniz International Proceedings in Informatics. Schloss Dagstuhl--Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 4, 15 pages. DOI:https://doi.org/10.4230/LIPIcs.CPM.2018.4Google Scholar
Alberto Apostolico and Andrzej Ehrenfeucht. 1993. Efficient detection of quasiperiodicities in strings. Theoretical Computer Science 119, 2 (1993), 247--265. DOI:https://doi.org/10.1016/0304-3975(93)90159-QGoogle ScholarDigital Library
Alberto Apostolico, Martin Farach, and Costas S. Iliopoulos. 1991. Optimal superprimitivity testing for strings. Information Processing Letters 39, 1 (1991), 17--20. DOI:https://doi.org/10.1016/0020-0190(91)90056-NGoogle ScholarDigital Library
Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski. 2020. Indexing weighted sequences: Neat and efficient. Information and Computation 270 (2020), 104462. DOI:https://doi.org/10.1016/j.ic.2019.104462Google ScholarCross Ref
Omer Berkman, Costas S. Iliopoulos, and Kunsoo Park. 1995. The subtree max gap problem with application to parallel string covering. Information and Computation 123, 1 (1995), 127--137. DOI:https://doi.org/10.1006/inco.1995.1162Google ScholarDigital Library
Omer Berkman, Baruch Schieber, and Uzi Vishkin. 1993. Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values. Journal of Algorithms 14, 3 (1993), 344--370. DOI:https://doi.org/10.1006/jagm.1993.1018Google ScholarDigital Library
Dany Breslauer. 1992. An on-line string superprimitivity test. Information Processing Letters 44, 6 (1992), 345--347. DOI:https://doi.org/10.1016/0020-0190(92)90111-8Google ScholarDigital Library
Gerth Stølting Brodal and Christian N. S. Pedersen. 2000. Finding maximal quasiperiodicities in strings. In 11th Annual Symposium on Combinatorial Pattern Matching, CPM 2000 (LNCS), Raffaele Giancarlo and David Sankoff (Eds.), Vol. 1848. Springer, 397--411. https://doi.org/10.1007/3-540-45123-4_33Google Scholar
Manolis Christodoulakis, Costas S. Iliopoulos, Kunsoo Park, and Jeong Seop Sim. 2005. Approximate seeds of strings. Journal of Automata, Languages and Combinatorics 10, 5–6 (2005), 609--626.Google Scholar
Michalis Christou, Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, and Tomasz Waleń. 2013. Efficient seed computation revisited. Theoretical Computer Science 483 (2013), 171--181. DOI:https://doi.org/10.1016/j.tcs.2011.12.078Google ScholarDigital Library
Richard Cole, Costas S. Iliopoulos, Manal Mohamed, William F. Smyth, and Lu Yang. 2005. The complexity of the minimum k-cover problem. Journal of Automata, Languages and Combinatorics 10, 5-6 (2005), 641--653.Google Scholar
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. 2007. Algorithms on Strings. Cambridge University Press. DOI:https://doi.org/10.1017/cbo9780511546853Google Scholar
Maxime Crochemore and Wojciech Rytter. 2003. Jewels of Stringology. World Scientific. DOI:https://doi.org/10.1142/4838Google Scholar
Patryk Czajka and Jakub Radoszewski. 2019. Experimental evaluation of algorithms for computing quasiperiods. arxiv:1909.11336.Google Scholar
Martin Farach and S. Muthukrishnan. 1996. Perfect hashing for strings: Formalization and algorithms. In 7th Annual Symposium on Combinatorial Pattern Matching, CPM 1996 (LNCS), Daniel S. Hirschberg and Eugene W. Myers (Eds.), Vol. 1075. Springer, 130--140. https://doi.org/10.1007/3-540-61258-0_11Google Scholar
Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. 2000. On the sorting-complexity of suffix tree construction. Journal of the ACM 47, 6 (2000), 987--1011. DOI:https://doi.org/10.1145/355541.355547Google ScholarDigital Library
Nathan J. Fine and Herbert S. Wilf. 1965. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society 16, 1 (1965), 109--114. DOI:https://doi.org/10.2307/2034009Google ScholarCross Ref
Tomás Flouri, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Simon J. Puglisi, William F. Smyth, and Wojciech Tyczyński. 2013. Enhanced string covering. Theoretical Computer Science 506 (2013), 102--114. DOI:https://doi.org/10.1016/j.tcs.2013.08.013Google ScholarDigital Library
Harold N. Gabow and Roberd E. Tarjan. 1985. A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences 30, 2 (1985), 209--221. DOI:https://doi.org/10.1016/0022-0000(85)90014-5Google ScholarCross Ref
Paweł Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. 2014. Weighted ancestors in suffix trees. In 22th Annual European Symposium on Algorithms, ESA 2014 (LNCS), Andreas S. Schulz and Dorothea Wagner (Eds.), Vol. 8737. Springer, 455--466. https://doi.org/10.1007/978-3-662-44777-2_38Google Scholar
Qing Guo, Hui Zhang, and Costas S. Iliopoulos. 2006. Computing the λ-seeds of a string. In 2nd International Conference on Algorithmic Aspects in Information and Management, AAIM 2006 (LNCS), Siu-Wing Cheng and Chung Keung Poon (Eds.), Vol. 4041. Springer, 303--313. https://doi.org/10.1007/11775096_28Google Scholar
Qing Guo, Hui Zhang, and Costas S. Iliopoulos. 2006. Computing the minimum approximate λ-cover of a string. In 13th International Conference on String Processing and Information Retrieval, SPIRE 2006 (LNCS), Fabio Crestani, Paolo Ferragina, and Mark Sanderson (Eds.), Vol. 4209. Springer, 49--60. https://doi.org/10.1007/11880561_5Google Scholar
Qing Guo, Hui Zhang, and Costas S. Iliopoulos. 2007. Computing the λ-covers of a string. Information Sciences 177, 19 (2007), 3957--3967. DOI:https://doi.org/10.1016/j.ins.2007.02.020Google ScholarDigital Library
Ondřej Guth. 2019. On approximate enhanced covers under Hamming distance. Discrete Applied Mathematics 274 (2019), 67--80. DOI:https://doi.org/10.1016/j.dam.2019.01.015Google ScholarCross Ref
Ondřej Guth, Borivoj Melichar, and Miroslav Balík. 2008. Searching all approximate covers and their distance using finite automata. In Proceedings of the 20th Conference on Theory and Practice of Information Technologies (ITAT’08). http://ceur-ws.org/Vol-414/paper4.pdf.Google Scholar
Lucian Ilie, Sheng Yu, and Kaizhong Zhang. 2002. Repetition complexity of words. In 8th Annual International Conference on Computing and Combinatorics, COCOON 2002 (LNCS), Oscar H. Ibarra and Louxin Zhang (Eds.), Vol. 2387. Springer, 320--329. https://doi.org/10.1007/3-540-45655-4_35Google ScholarCross Ref
Costas S. Iliopoulos, Manal Mohamed, and William F. Smyth. 2011. New complexity results for the k-covers problem. Information Sciences 181, 12 (2011), 2571--2575. DOI:https://doi.org/10.1016/j.ins.2011.02.009Google ScholarDigital Library
Costas S. Iliopoulos, Dennis W. G. Moore, and Kunsoo Park. 1996. Covering a string. Algorithmica 16, 3 (1996), 288--297. DOI:https://doi.org/10.1007/BF01955677Google ScholarCross Ref
Costas S. Iliopoulos and Laurent Mouchard. 1999. Quasiperiodicity: From detection to normal forms. Journal of Automata, Languages and Combinatorics 4, 3 (1999), 213--228.Google Scholar
Costas S. Iliopoulos and William F. Smyth. 1998. An on-line algorithm of computing a minimum set of k-covers of a string. In Proceedings of the 9th Australasian Workshop on Combinatorial Algorithms (AWOCA’98). 97--106.Google Scholar
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. 2006. Linear work suffix array construction. Journal of the ACM 53, 6 (2006), 918--936. DOI:https://doi.org/10.1145/1217856.1217858Google ScholarDigital Library
Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. 2012. A linear time algorithm for seeds computation. In 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'12), Yuval Rabani (Ed.). SIAM, 1095--1112. https://doi.org/10.1137/1.9781611973099Google ScholarCross Ref
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. 2019. Towards a definitive measure of repetitiveness. arxiv:1910.02151.Google Scholar
Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. 2018. Efficient algorithms for shortest partial seeds in words. Theoretical Computer Science 710 (2018), 139--147. DOI:https://doi.org/10.1016/j.tcs.2016.11.035Google ScholarCross Ref
Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. 2015. Fast algorithm for partial covers in words. Algorithmica 73, 1 (2015), 217--233. DOI:https://doi.org/10.1007/s00453-014-9915-3Google ScholarDigital Library
Roman M. Kolpakov and Gregory Kucherov. 1999. Finding maximal repetitions in a word in linear time. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS’99). IEEE, Los Alamitos, CA, 596--604. DOI:https://doi.org/10.1109/SFFCS.1999.814634Google Scholar
Yin Li and William F. Smyth. 2002. Computing the cover array in linear time. Algorithmica 32, 1 (2002), 95--106. DOI:https://doi.org/10.1007/s00453-001-0062-2Google ScholarDigital Library
M. Lothaire. 2002. Algebraic Combinatorics on Words. Encyclopedia of Mathematics and Its Applications (1st edition). Cambridge University Press. DOI:https://doi.org/10.1017/cbo9781107326019Google Scholar
M. Lothaire. 2005. Applied Combinatorics on Words. Encyclopedia of Mathematics and Its Applications (1st edition). Cambridge University Press. DOI:https://doi.org/10.1017/cbo9781107341005Google Scholar
Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22, 5 (1993), 935--948. DOI:https://doi.org/10.1137/0222058Google ScholarDigital Library
Dennis W. G. Moore and William F. Smyth. 1994. An optimal algorithm to compute all the covers of a string. Information Processing Letters 50, 5 (1994), 239--246. DOI:https://doi.org/10.1016/0020-0190(94)00045-XGoogle ScholarDigital Library
Dennis W. G. Moore and William F. Smyth. 1995. A correction to “An Optimal Algorithm to Compute All the Covers of a String”. Information Processing Letters 54, 2 (1995), 101--103. DOI:https://doi.org/10.1016/0020-0190(94)00235-QGoogle ScholarDigital Library
James H. Morris Jr. and Vaughan R. Pratt. 1970. A Linear Pattern-Matching Algorithm. Technical Report 40. Department of Computer Science, University of California, Berkeley.Google Scholar
Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. 2013. Sublinear algorithms for approximating string compressibility. Algorithmica 65, 3 (2013), 685--709. DOI:https://doi.org/10.1007/s00453-012-9618-6Google ScholarDigital Library
Jeong Seop Sim, Kunsoo Park, Sung-Ryul Kim, and Jee-Soo Lee. 2002. Finding approximate covers of strings. Journal of KIISE: Computer Systems and Theory 29, 1 (2002), 16--21. http://www.koreascience.or.kr/article/ArticleFullRecord.jsp?cn=JBGHG6_2002_v29n1_16.Google Scholar
William F. Smyth. 2000. Repetitive perhaps, but certainly not boring. Theoretical Computer Science 249, 2 (2000), 343--355. DOI:https://doi.org/10.1016/S0304-3975(00)00067-0Google ScholarDigital Library
Peter Weiner. 1973. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT’73). IEEE, Los Alamitos, CA, 1--11. DOI:https://doi.org/10.1109/SWAT.1973.13Google ScholarDigital Library

Index Terms

A Linear-Time Algorithm for Seeds Computation
1. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Pattern matching

Recommendations

Linear-time String Indexing and Analysis in Small Space

The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications ...
Read More
On left and right seeds of a string

We consider the problem of finding the repetitive structure of a given string y of length n. A factor u of y is a cover of y, if every letter of y lies within some occurrence of u in y. A string v is a seed of y, if it is a cover of a superstring of y. ...
Read More
Linear-size suffix tries

Suffix trees are highly regarded data structures for text indexing and string algorithms MCreight 76, Weiner 73. For any given string w of length n = | w | , a suffix tree for w takes O ( n ) nodes and links. It is often presented as a compacted version ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Algorithms Volume 16, Issue 2
April 2020
372 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/3386689
Editor:
Aravind Srinivasan
University of Maryland, USA
Issue’s Table of Contents
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 April 2020
- Accepted: 1 January 2020
- Revised: 1 October 2019
- Received: 1 March 2019
Published in talg Volume 16, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Seed of a word
quasiperiodicity
subword complexity
suffix tree
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 10
  Total Citations
  View Citations
- 250
  Total Downloads
- Downloads (Last 12 months)18
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

A Linear-Time Algorithm for Seeds Computation

ACM Transactions on Algorithms

Abstract

References

Cited By

Index Terms

Recommendations

Linear-time String Indexing and Analysis in Small Space

On left and right seeds of a string

Linear-size suffix tries

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

HTML Format

Caption

A Linear-Time Algorithm for Seeds Computation

ACM Transactions on Algorithms

Abstract

References

Cited By

Index Terms

Recommendations

Linear-time String Indexing and Analysis in Small Space

On left and right seeds of a string

Linear-size suffix tries

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

HTML Format

Share this Publication link

Share on Social Media