Abstract
We present a new algorithm for multiple approximate string matching. It is based on reading backwards enough l-grams from text windows so as to prove that no occurrence can contain the part of the window read, and then shifting the window.We show analytically that our algorithm is optimal on average. Hence our first contribution is to fill an important gap in the area, since no average-optimal algorithm existed for multiple approximate string matching.We consider several variants and practical improvements to our algorithm, and show experimentally that they are resistant to the number of patterns and the fastest for low difference ratios, displacing the long-standing best algorithms. Hence our second contribution is to give a practical algorithm for this problem, by far better than any existing alternative in many cases of interest. On real-life texts, our algorithm is especially interesting for computational biology applications.In particular, we show that our algorithm can be successfully used to search for one pattern, where many more competing algorithms exist. Our algorithm is also average-optimal in this case, being the second after that of Chang and Marr. However, our algorithm permits higher difference ratios than Chang and Marr, and this is our third contribution.In practice, our algorithm is competitive in this scenario too, being the fastest for low difference ratios and moderate alphabet sizes. This is our fourth contribution, which also answers affirmatively the question of whether a practical average-optimal approximate string-matching algorithm existed.
- Baeza-Yates, R. and Navarro, G. 2000. New models and algorithms for multidimensional approximate pattern matching. Journal of Discrete Algorithms 1, 1, 21--49. Special issue on Matching Patterns.]]Google Scholar
- Baeza-Yates, R. and Navarro, G. 2002. New and faster filters for multiple approximate string matching. Random Structures and Algorithms 20, 23--49.]] Google Scholar
- Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, MA.]] Google Scholar
- Baeza-Yates, R. A. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.]]Google Scholar
- Chang, W. and Lawler, E. 1994. Sublinear approximate string matching and biological applications. Algorithmica 12, 4/5, 327--344.]]Google Scholar
- Chang, W. and Marr, T. 1994. Approximate string matching and local similarity. In Proceedings of 5th Combinatorial Pattern Matching (CPM'94). LNCS, vol. 807. Springer-Verlag, Berlin, 259--273.]] Google Scholar
- Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., and Rytter, W. 1994. Speeding up two string matching algorithms. Algorithmica 12, 4/5, 247--267.]]Google Scholar
- Crochemore, M. and Rytter, W. 1994. Text Algorithms. Oxford University Press, Oxford, UK.]] Google Scholar
- Dixon, R. and Martin, T., Eds. 1979. Automatic Speech and Speaker Recognition. IEEE Press.]] Google Scholar
- Elliman, D. and Lancaster, I. 1990. A review of segmentation and contextual analysis techniques for text recognition. Pattern Recogn. 23, 3/4, 337--346.]] Google Scholar
- Fredriksson, K. 2003. Row-wise tiling for the Myers' bit-parallel approximate string matching algorithm. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE'03). LNCS, vol. 2857. Springer-Verlag, Berlin, 66--79.]]Google Scholar
- Fredriksson, K. and Navarro, G. 2003. Average-optimal multiple approximate string matching. In Proceedings of 14th Combinatorial Pattern Matching (CPM'03). LNCS, vol. 2676. 109--128.]] Google Scholar
- Fredriksson, K. and Navarro, G. 2004. Improved single and multiple approximate string matching. In Proceedings of 15th Combinatorial Pattern Matching (CPM'04). LNCS, vol. 3109. Springer-Verlag, Berlin, 457--471.]]Google Scholar
- Grossi, R. and Luccio, F. 1989. Simple and efficient string matching with k mismatches. Information Processing Letters 33, 3, 113--120.]]Google Scholar
- Horspool, R. 1980. Practical fast searching in strings. Software Practice and Experience 10, 501--506.]]Google Scholar
- Hyyrö, H., Fredriksson, K., and Navarro, G. 2004. Increased bit-parallelism for approximate string matching. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA'04). LNCS, vol. 3059. Springer-Verlag, Berlin, 285--298.]]Google Scholar
- Hyyrö, H. and Navarro, G. 2002. Faster bit-parallel approximate string matching. In Proceedings of 13th Combinatorial Pattern Matching (CPM'02). LNCS, vol. 2373. Springer-Verlag, Berlin, 203--224. Extended version to appear in Algorithmica.]] Google Scholar
- Jokinen, P., Tarhio, J., and Ukkonen, E. 1996. A comparison of approximate string matching algorithms. Software Practice and Experience 26, 12, 1439--1458.]] Google Scholar
- Kukich, K. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys 24, 4, 377--439.]] Google Scholar
- Kumar, S. and Spafford, E. 1994. A pattern-matching model for intrusion detection. In Proceedings of National Computer Security Conference. 11--21.]]Google Scholar
- Lopresti, D. and Tomkins, A. 1994. On the searchability of electronic ink. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156--165.]]Google Scholar
- Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proceedings of 7th Combinatorial Pattern Matching (CPM'96). LNCS, vol. 1075. Springer-Verlag, Berlin, 75--86.]] Google Scholar
- Myers, E. W. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395--415.]] Google Scholar
- Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33, 1, 31--88.]] Google Scholar
- Navarro, G. and Baeza-Yates, R. 1999. Very fast and simple approximate string matching. Inf. Process. Lett. 72, 65--70.]] Google Scholar
- Navarro, G. and Baeza-Yates, R. 2001. Improving an algorithm for approximate pattern matching. Algorithmica 30, 4, 473--502.]]Google Scholar
- Navarro, G. and Fredriksson, K. 2004. Average complexity of exact and approximate multiple string matching. Theor. Comput. Sci. 321, 2-3, 283--290.]] Google Scholar
- Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algorithmics 5, 4.]] Google Scholar
- Navarro, G. and Raffinot, M. 2002. Flexible Pattern Matching in Strings---Practical on-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press, Cambridge, UK.]] Google Scholar
- Navarro, G., Sutinen, E., Tanninen, J., and Tarhio, J. 2000. Indexing text with approximate q-grams. In Proceedings of 11th Combinatorial Pattern Matching (CPM'00). LNCS, vol. 1848. Springer-Verlag, Berlin, 350--363.]] Google Scholar
- Paul, W. and Simon, J. 1980. Decision trees and random access machines. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331--340.]]Google Scholar
- Sankoff, D. and Kruskal, J., Eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.]]Google Scholar
- Sellers, P. 1980. The theory and computation of evolutionary distances: Pattern recognition. J. Algorithms 1, 359--373.]]Google Scholar
- Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol. 1075. Springer-Verlag, Berlin, 50--63.]] Google Scholar
- Tarhio, J. and Ukkonen, E. 1993. Approximate Boyer--Moore string matching. SIAM J. Comput. 22, 2, 243--260.]] Google Scholar
- Ukkonen, E. 1985. Finding approximate patterns in strings. J. Algorithms 6, 132--137.]]Google Scholar
- Waterman, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.]]Google Scholar
- Yao, A. C. 1979. The complexity of pattern matching for a random string. SIAM J. Comput. 8, 3, 368--387.]]Google Scholar
Index Terms
- Average-optimal single and multiple approximate string matching
Recommendations
Increased bit-parallelism for approximate and multiple string matching
Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(...
Average-optimal string matching
The exact string matching problem is to find the occurrences of a pattern of length m from a text of length n symbols. We develop a novel and unorthodox filtering technique for this problem. Our method is based on transforming the problem into multiple ...
Approximate Boyer-Moore String Matching for Small Alphabets
Recently a new variation of approximate Boyer-Moore string matching was presented for the k-mismatch problem. The variation, called FAAST, is specifically tuned for small alphabets. We further improve this algorithm gaining speedups in both ...
Comments