skip to main content
article

Average-optimal single and multiple approximate string matching

Published:31 December 2004Publication History
Skip Abstract Section

Abstract

We present a new algorithm for multiple approximate string matching. It is based on reading backwards enough l-grams from text windows so as to prove that no occurrence can contain the part of the window read, and then shifting the window.We show analytically that our algorithm is optimal on average. Hence our first contribution is to fill an important gap in the area, since no average-optimal algorithm existed for multiple approximate string matching.We consider several variants and practical improvements to our algorithm, and show experimentally that they are resistant to the number of patterns and the fastest for low difference ratios, displacing the long-standing best algorithms. Hence our second contribution is to give a practical algorithm for this problem, by far better than any existing alternative in many cases of interest. On real-life texts, our algorithm is especially interesting for computational biology applications.In particular, we show that our algorithm can be successfully used to search for one pattern, where many more competing algorithms exist. Our algorithm is also average-optimal in this case, being the second after that of Chang and Marr. However, our algorithm permits higher difference ratios than Chang and Marr, and this is our third contribution.In practice, our algorithm is competitive in this scenario too, being the fastest for low difference ratios and moderate alphabet sizes. This is our fourth contribution, which also answers affirmatively the question of whether a practical average-optimal approximate string-matching algorithm existed.

References

  1. Baeza-Yates, R. and Navarro, G. 2000. New models and algorithms for multidimensional approximate pattern matching. Journal of Discrete Algorithms 1, 1, 21--49. Special issue on Matching Patterns.]]Google ScholarGoogle Scholar
  2. Baeza-Yates, R. and Navarro, G. 2002. New and faster filters for multiple approximate string matching. Random Structures and Algorithms 20, 23--49.]] Google ScholarGoogle Scholar
  3. Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, MA.]] Google ScholarGoogle Scholar
  4. Baeza-Yates, R. A. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.]]Google ScholarGoogle Scholar
  5. Chang, W. and Lawler, E. 1994. Sublinear approximate string matching and biological applications. Algorithmica 12, 4/5, 327--344.]]Google ScholarGoogle Scholar
  6. Chang, W. and Marr, T. 1994. Approximate string matching and local similarity. In Proceedings of 5th Combinatorial Pattern Matching (CPM'94). LNCS, vol. 807. Springer-Verlag, Berlin, 259--273.]] Google ScholarGoogle Scholar
  7. Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., and Rytter, W. 1994. Speeding up two string matching algorithms. Algorithmica 12, 4/5, 247--267.]]Google ScholarGoogle Scholar
  8. Crochemore, M. and Rytter, W. 1994. Text Algorithms. Oxford University Press, Oxford, UK.]] Google ScholarGoogle Scholar
  9. Dixon, R. and Martin, T., Eds. 1979. Automatic Speech and Speaker Recognition. IEEE Press.]] Google ScholarGoogle Scholar
  10. Elliman, D. and Lancaster, I. 1990. A review of segmentation and contextual analysis techniques for text recognition. Pattern Recogn. 23, 3/4, 337--346.]] Google ScholarGoogle Scholar
  11. Fredriksson, K. 2003. Row-wise tiling for the Myers' bit-parallel approximate string matching algorithm. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE'03). LNCS, vol. 2857. Springer-Verlag, Berlin, 66--79.]]Google ScholarGoogle Scholar
  12. Fredriksson, K. and Navarro, G. 2003. Average-optimal multiple approximate string matching. In Proceedings of 14th Combinatorial Pattern Matching (CPM'03). LNCS, vol. 2676. 109--128.]] Google ScholarGoogle Scholar
  13. Fredriksson, K. and Navarro, G. 2004. Improved single and multiple approximate string matching. In Proceedings of 15th Combinatorial Pattern Matching (CPM'04). LNCS, vol. 3109. Springer-Verlag, Berlin, 457--471.]]Google ScholarGoogle Scholar
  14. Grossi, R. and Luccio, F. 1989. Simple and efficient string matching with k mismatches. Information Processing Letters 33, 3, 113--120.]]Google ScholarGoogle Scholar
  15. Horspool, R. 1980. Practical fast searching in strings. Software Practice and Experience 10, 501--506.]]Google ScholarGoogle Scholar
  16. Hyyrö, H., Fredriksson, K., and Navarro, G. 2004. Increased bit-parallelism for approximate string matching. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA'04). LNCS, vol. 3059. Springer-Verlag, Berlin, 285--298.]]Google ScholarGoogle Scholar
  17. Hyyrö, H. and Navarro, G. 2002. Faster bit-parallel approximate string matching. In Proceedings of 13th Combinatorial Pattern Matching (CPM'02). LNCS, vol. 2373. Springer-Verlag, Berlin, 203--224. Extended version to appear in Algorithmica.]] Google ScholarGoogle Scholar
  18. Jokinen, P., Tarhio, J., and Ukkonen, E. 1996. A comparison of approximate string matching algorithms. Software Practice and Experience 26, 12, 1439--1458.]] Google ScholarGoogle Scholar
  19. Kukich, K. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys 24, 4, 377--439.]] Google ScholarGoogle Scholar
  20. Kumar, S. and Spafford, E. 1994. A pattern-matching model for intrusion detection. In Proceedings of National Computer Security Conference. 11--21.]]Google ScholarGoogle Scholar
  21. Lopresti, D. and Tomkins, A. 1994. On the searchability of electronic ink. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156--165.]]Google ScholarGoogle Scholar
  22. Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proceedings of 7th Combinatorial Pattern Matching (CPM'96). LNCS, vol. 1075. Springer-Verlag, Berlin, 75--86.]] Google ScholarGoogle Scholar
  23. Myers, E. W. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395--415.]] Google ScholarGoogle Scholar
  24. Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33, 1, 31--88.]] Google ScholarGoogle Scholar
  25. Navarro, G. and Baeza-Yates, R. 1999. Very fast and simple approximate string matching. Inf. Process. Lett. 72, 65--70.]] Google ScholarGoogle Scholar
  26. Navarro, G. and Baeza-Yates, R. 2001. Improving an algorithm for approximate pattern matching. Algorithmica 30, 4, 473--502.]]Google ScholarGoogle Scholar
  27. Navarro, G. and Fredriksson, K. 2004. Average complexity of exact and approximate multiple string matching. Theor. Comput. Sci. 321, 2-3, 283--290.]] Google ScholarGoogle Scholar
  28. Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algorithmics 5, 4.]] Google ScholarGoogle Scholar
  29. Navarro, G. and Raffinot, M. 2002. Flexible Pattern Matching in Strings---Practical on-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press, Cambridge, UK.]] Google ScholarGoogle Scholar
  30. Navarro, G., Sutinen, E., Tanninen, J., and Tarhio, J. 2000. Indexing text with approximate q-grams. In Proceedings of 11th Combinatorial Pattern Matching (CPM'00). LNCS, vol. 1848. Springer-Verlag, Berlin, 350--363.]] Google ScholarGoogle Scholar
  31. Paul, W. and Simon, J. 1980. Decision trees and random access machines. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331--340.]]Google ScholarGoogle Scholar
  32. Sankoff, D. and Kruskal, J., Eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.]]Google ScholarGoogle Scholar
  33. Sellers, P. 1980. The theory and computation of evolutionary distances: Pattern recognition. J. Algorithms 1, 359--373.]]Google ScholarGoogle Scholar
  34. Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol. 1075. Springer-Verlag, Berlin, 50--63.]] Google ScholarGoogle Scholar
  35. Tarhio, J. and Ukkonen, E. 1993. Approximate Boyer--Moore string matching. SIAM J. Comput. 22, 2, 243--260.]] Google ScholarGoogle Scholar
  36. Ukkonen, E. 1985. Finding approximate patterns in strings. J. Algorithms 6, 132--137.]]Google ScholarGoogle Scholar
  37. Waterman, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.]]Google ScholarGoogle Scholar
  38. Yao, A. C. 1979. The complexity of pattern matching for a random string. SIAM J. Comput. 8, 3, 368--387.]]Google ScholarGoogle Scholar

Index Terms

  1. Average-optimal single and multiple approximate string matching

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader