skip to main content
article

Indexing methods for approximate dictionary searching: Comparative analysis

Published:28 May 2011Publication History
Skip Abstract Section

Abstract

The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k. Benchmark results are presented for the practically important cases of k=1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article is concluded with a discussion of benchmark results and directions for future research.

Skip Supplemental Material Section

Supplemental Material

References

  1. Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N., and Rodeh, M. 2000. Text indexing and dictionary matching with one error. J. Algorithms 37, 2, 309--325. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Angell, R. C., Freund, G. E., and Willett, P. 1983. Automatic spelling correction using a trigram similarity measure. Inf. Process. Manage. 19, 4, 443--453.Google ScholarGoogle ScholarCross RefCross Ref
  3. Baeza-Yates, R., Cunto, W., Manber, U., and Wu, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the 5th Annual Symposium on Pattern Matching. Springer, Berlin, 198--212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Baeza-Yates, R. and Gonnet, G. 1992. A new approach to text searching. Commun. ACM 35, 10, 74--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Baeza-Yates, R. and Navarro, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the String Processing and Information Retrieval: A South American Symposium (SPIRE'98). IEEE, Los Alamitos, CA, 14--22.Google ScholarGoogle Scholar
  6. Baeza-Yates, R. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.Google ScholarGoogle ScholarCross RefCross Ref
  7. Baeza-Yates, R. and Salinger, A. 2005. Experimental analysis of a fast intersection algorithm for sorted sequences. In Proceedings of the String Processing and Information Retrieval (SPIRE'05). Springer, Berlin, 13--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Baeza-Yates, R. A. and Gonnet, G. H. 1990. All-against-all sequence matching: Preliminary version. http://www.dcc.uchile.cl/~rbaeza/cv/reports.html.Google ScholarGoogle Scholar
  9. Baeza-Yates, R. A. and Gonnet, G. H. 1999. A fast algorithm on average for all-against-all sequence matching. In Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware (SPIRE'99). IEEE, Los Alamitos, CA. 16--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Barbay, J., López-Ortiz, A., and Lu, T. 2006. Faster adaptive set intersections for text searching. In Proceedings of the 5th International Workshop on Experimental Algorithms (WEA'06). Springer, Berlin, 146--157. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B. 1990. The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of 1990 ACM SIGMOD International Conference on Management of Data. ACM, New York, 322--331. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Belazzougui, D. 2009. Faster and space-optimal edit distance 1 dictionary. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Springer, Berlin, 154--167.Google ScholarGoogle ScholarCross RefCross Ref
  13. Benoit, D., Demaine, E. D., Munro, J. I., Raman, R., Raman, V., and Rao, S. S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Bentley, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509--517. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Berchtold, S., Keim, D. A., and Kriegel, H. P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 28--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Berman, A. 1994. A new data structure for fast approximate matching. Tech. rep. 1994-03-02, Department of Computer Science and Engineering, University of Washington.Google ScholarGoogle Scholar
  17. Bieganski, P., Riedl, J., Cartis, J., and Retzel, E. 1994. Generalized suffix trees for biological sequence data: applications and implementation. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences. IEEE, Los Alamitos, CA, 35--44.Google ScholarGoogle Scholar
  18. Böhm, C., Berchtold, S., and Keim, D. A. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Boitsov, L. 2004. Classification and experimental comparison of modern dictionary fuzzy search algorithms (in Russian). In Proceedings of the 6th Russian Conference on Digital Libraries (RCDL'04).Google ScholarGoogle Scholar
  20. Botelho, F., Pagh, R., and Ziviani, N. 2007. Simple and space-efficient minimal perfect hash functions. In Proceedings of the 10th International Workshop on Algorithms and Data Structures. Springer, Berlin, 139--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Brill, E. and Moore, R. C. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, 286--293. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Burkhard, W. and Keller, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230--236. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Burkhardt, S. and Kärkkäinen, J. 2002. One-gapped q-gram filter for Levenshtein distance. In Proceedings of the 13th Symposium on Combinatorial Pattern Matching (CPM'02). Springer, Berlin, 225--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Cánovas, R. and Navarro, G. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Springer, Berlin, 94--105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Carterette, B. and Can, F. 2005. Comparing inverted files and signature files for searching in a large lexicon. Inf. Process. Manage. 41, 3, 613--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., and Wong, S.-S. 2006. Compressed indexes for approximate string matching. In Proceedings of the 14th Annual European Symposium (ESA'06). Springer-Verlag, Berlin, 208--219. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Chávez, E., Marroquín, J. L., and Navarro, G. 2001. Fixed Queries Array: A fast and economical data structure for proximity searching. Multimedia Tools Appl. 14, 2, 113--135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Chávez, E. and Navarro, G. 2003. Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces. Inf. Process. Lett. 85, 1, 39--46. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquin, J. L. 2001. Searching in metric spaces. ACM Comput. Surv. 33, 3, 273--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Cho, J. and Rajagopalan, S. 2002. A fast regular expression indexing engine. In Proceedings of the 18th International Conference on Data Engineering. IEEE, Los Alamitos, CA, 419--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Claude, F., Fariña, A., and Navarro, G. 2009. Re-Pair compression of inverted lists. CoRR abs/0911.3318.Google ScholarGoogle Scholar
  32. Coelho, L. and Oliveira, A. 2006. Dotted Suffix Trees a structure for approximate text indexing. In Proceedings of the 13th International Conference on String Processing and Information Retrieval (SPIRE'06). Springer, Berlin, 329--336. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Cole, R., Gottlieb, L.-A., and Lewenstein, M. 2004. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC'04). ACM, New York, 91--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Collins, P. 2009. Definately* the most misspelled word in the english language (*it should be definitely). The Daily Record. June 15, 2009. http://www.dailyrecord.co.uk/news/editors-choice/2009/06/15/definately-the-most-misspelled-word-in-the-english-language-it-should-be-definitely-86908-21441847/.Google ScholarGoogle Scholar
  35. Covington, M. A. 1996. An algorithm to align words for historical comparison. Comput. Ling. 22, 4, 481--496. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E. 2000. Incremental construction of minimal acyclic finite-state automata. Comput. Ling. 26, 1, 3--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Damerau, F. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3, 171--176. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. D'Amore, R. J. and Mah, C. P. 1985. One-time complete indexing of text: theory and practice. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'85). ACM, New York, 155--164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Demaine, E. D., López-Ortiz, A., and Munro, J. I. 2001. Experiments on adaptive set intersections for text retrieval systems. In Revised Papers from the 3rd International Workshop on Algorithm Engineering and Experimentation (ALENEX'01). Springer-Verlag, Berlin, 91--104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput. 23, 4, 738--761. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Dömölki, B. 1968. A universal compiler system based on production rules. BIT Numer. Math. 8, 262--275.Google ScholarGoogle ScholarCross RefCross Ref
  42. Doster, W. 1977. Contextual postprocessing system for cooperation with a multiple-choice character-recognition system. IEEE Trans. Comput. 26, 1090--1101. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Du, M. W. and Chang, S. C. 1994. An approach to designing very fast approximate string matching algorithms. IEEE Trans. Knowl. Data Eng. 6, 4, 620--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Elias, P. 1974. Efficient storage and retrieval by content and address of static files. J. ACM 21, 2, 246--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Faloutsos, C. 1996. Searching Multimedia Databases by Content. Kluwer Academic Publisher, Dordrecht, The Netherlands. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Ferragina, P., Muthukrishnan, S., and de Berg, M. 1999. Multi-method dispatching: a geometric approach with applications to string matching problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC'99). ACM, New York, 483--491. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Ferragina, P. and Venturini, R. 2007. Compressed permuterm index. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'07). ACM, New York, 535--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Figueroa, K. and Fredriksson, K. 2007. Simple space-time trade-offs for AESA. In Proceedings of the 6th International Workshop on Experimental Algorithms (WEA'07). Springer, Berlin, 229--241. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Fischer, J., Mäkinen, V., and Navarro, G. 2008. An(other) entropy-bounded compressed suffix tree. In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM'08). Springer-Verlag, Berlin, 152--165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Ford, G., Hause, S., Le, D. X., and Thoma, G. R. 2001. Pattern matching techniques for correcting low confidence OCR words in a known context. In Proceedings of the 8th SPIE International Conference on Document Recognition and Retrieval. SPIE, Bellingham, WA, 241--249.Google ScholarGoogle Scholar
  51. Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a sparse table with o(1) worst case access time. J. ACM 31, 3, 538--544. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Fredriksson, K. 2007. Engineering efficient metric indexes. Pattern Recogn. Lett. 28, 1, 75--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Friedman, J. H., Bentley, J. L., and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3, 3, 209--226. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Gaede, V. and Günther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Softw., Pract. Exper. 33, 11, 1035--1049.Google ScholarGoogle ScholarCross RefCross Ref
  56. Giladi, E., Walker, M. G., Wang, J. Z., and Volkmuth, W. 2002. SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size. Bioinformatics 18, 6, 873--877.Google ScholarGoogle ScholarCross RefCross Ref
  57. Glass, J. R. 2003. A probabilistic framework for segment-based speech recognition. Comput. Speech Lang. 17, 2-3, 137 -- 152.Google ScholarGoogle ScholarCross RefCross Ref
  58. Gollapudi, S. and Panigrahy, R. 2006. A dictionary for approximate string search and longest prefix search. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06). ACM, New York, 768--775. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Gorin, R. E. 1971. SPELL: Spelling check and correction program. http://pdp-10.trailing-edge.com/decuslib10-03/01/43,50270/spell.doc.html.Google ScholarGoogle Scholar
  60. Gravano, L., Ipeirotis, P., Jagadish, H. V., Koudas, N., Muthukrishnan, S., Pietarinen, L., and Srivastava, D. 2001. Using q-grams in a DBMS for approximate string processing. IEEE Data Eng. Bull. 24, 4, 28--34.Google ScholarGoogle Scholar
  61. Grossi, R. and Luccio, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113--120.Google ScholarGoogle ScholarCross RefCross Ref
  62. Grossi, R. and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Gusfield, D. 1999. Algorithms on Strings, Trees, and Sequences -- Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Guttman, A. 1984. R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, 47--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Gwehenberger, G. 1968. Anwendung einer binären verweiskettenmethode beim aufbau von listen. Elektronische Rechenanlagen 10, 223--226.Google ScholarGoogle Scholar
  66. Hall, P. and Dowling, G. 1980. Approximate string matching. ACM Computing Surveys 12, 4, 381--402. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Hellerstein, J. and Pfeffer, A. 1994. The RD-tree: An index structure for sets. Tech. rep. 1252, University of Wisconsin at Madison.Google ScholarGoogle Scholar
  68. Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 301, 13--30.Google ScholarGoogle ScholarCross RefCross Ref
  69. Hyyrö, H. 2003a. Bit-parallel approximate string matching algorithms with transposition. In Proceedings of the 10th International Symposium on String Processing. Springer-Verlag, Berlin, 95--107.Google ScholarGoogle ScholarCross RefCross Ref
  70. Hyyrö, H. 2003b. Practical methods for approximate string matching. Ph.D. thesis, Department of Computer Science, University of Tampere, Finland.Google ScholarGoogle Scholar
  71. Hyyrö, H. 2005. Bit-parallel approximate string matching algorithms with transposition. J. Discrete Algorithms 3, 2-4, 215--229.Google ScholarGoogle ScholarCross RefCross Ref
  72. James, E. B. and Partridge, D. P. 1973. Adaptive correction of program statements. Commun. ACM 16, 1, 27--37. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Jeong, I.-S., Park, K.-W., Kang, S.-H., and Lim, H.-S. 2010. An efficient similarity search based on indexing in large dna databases. Comput. Biol. Chem. 34, 2, 131--136. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Jokinen, P., Tarhio, J., and Ukkonen, E. 1996. A comparison of approximate string matching algorithms. Software Pract. Exp. 26, 12, 1439--1458. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Jokinen, P. and Ukkonen, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 16th International Symposium on Mathematical Foundations of Computer Science. Springer, Berlin, 240--248.Google ScholarGoogle Scholar
  76. Kahveci, T., Ljosa, V., and Singh, A. K. 2004. Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics 20, 13, 2122--2134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Kahveci, T. and Singh, A. 2001. An efficient index structure for string databases. In Proceedings of the 27th International Conference on Very Large Databases. Morgan Kaufmann, San Francisco, CA, 351--360. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Klovstad, J. and Mondshein, L. 1975. The CASPERS linguistic analysis system. IEEE Trans. Acoust. Speech Signal Process. 23, 118--123.Google ScholarGoogle ScholarCross RefCross Ref
  79. Knuth, D. 1973. The Art of Computer Programming. Sorting and Searching, 1st ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.Google ScholarGoogle Scholar
  80. Knuth, D. 1997. The Art of Computer Programming. Sorting and Searching, 2d ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.Google ScholarGoogle Scholar
  81. Kondrak, G. 2003. Phonetic alignment and similarity. Computers and the Humanities 37, 3, 273--291.Google ScholarGoogle ScholarCross RefCross Ref
  82. Kuenning, G., Gorin, R. E., Willisson, P., Buehring, W., and Stevens, K. 1988. International spell: a fast screen-oriented spelling checker. http://www.lasr.cs.ucla.edu/geoff/ispell.html.Google ScholarGoogle Scholar
  83. Kukich, K. 1992. Technique for automatically correcting words in text. ACM Comput. Surv. 24, 2, 377--439. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Kurtz, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 156--170.Google ScholarGoogle Scholar
  85. Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Exper. 29, 13, 1149--1171. Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163, 4, 845--848.Google ScholarGoogle Scholar
  87. Lowrance, R. and Wagner, R. 1975. An extension of the string-to-string correction problem. J. ACM 22, 2, 177--183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Maly, K. 1976. Compressed tries. Commun. ACM 19, 7, 409--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Manber, U. and Myers, G. 1990. Suffix arrays: a new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'90). Society for Industrial and Applied Mathematics, Philadephia, PA, 319--327. Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. Manber, U. and Wu, S. 1994a. An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 50, 4, 191--197. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Manber, U. and Wu, S. 1994b. GLIMPSE: a tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference (WTEC'94). USENIX, Berkeley, CA, 4--4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Melichar, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition. IEEE, Los Alamitos, CA, 256--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Micó, M. L., Oncina, J., and Vidal-Ruiz, E. 1994. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn. Lett. 15, 1, 9--17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. Mihov, S. and Schulz, K. U. 2004. Fast approximate string search in large dictionaries. Comput. Ling. 30, 4, 451--477. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. Mochizuki, H. and Hayashi, Y. 2000. An efficient retrieval algorithm of compound words using extendible hashing. Int. J. Comput. Process. Oriental Lang. 13, 1, 15--33.Google ScholarGoogle ScholarCross RefCross Ref
  97. Moore, T. and Edelman, B. 2010. Measuring the perpetrators and funders of typosquatting. In Proceedings of the 14th International Conference on Financial Cryptography and Data Security. Springer, Berlin. Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. Mor, M. and Fraenkel, A. S. 1982. A hash code method for detecting and correcting spelling errors. Commun. ACM 25, 12, 935--938. Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. Morrison, D. R. 1968. PATRICIA—practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4, 514--534. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching. Springer, Berlin, 75--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. Myers, E. 1994. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345--374.Google ScholarGoogle Scholar
  102. Myers, E. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Navarro, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 95--111.Google ScholarGoogle Scholar
  104. Navarro, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 112--124.Google ScholarGoogle Scholar
  105. Navarro, G. 2001a. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Navarro, G. 2001b. NR-grep: a fast and flexible pattern matching tool. Software Pract. Exp. 31, 13, 1265--1312. Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. Navarro, G. and Baeza-Yates, R. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1, 2.Google ScholarGoogle Scholar
  108. Navarro, G. and Baeza-Yates, R. 2000. A hybrid indexing method for approximate string matching. J. Discrete Algorithms, 1, 1, 205--239.Google ScholarGoogle Scholar
  109. Navarro, G., Baeza-Yates, R., Sutinen, E., and Tarhio, J. 2001. Indexing methods for approximate string matching. IEEE Data Engin. Bull. 24, 4, 19--27.Google ScholarGoogle Scholar
  110. Navarro, G., Paredes, R., and Chávez, E. 2002. t-Spanners as a data structure for metric space searching. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE'02), Springer, Berlin, 195--200. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. J. Exp. Algorithmics, 5, 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. Navarro, G. and Salmela, L. 2009. Indexing variable length substrings for exact and approximate matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Springer, Berlin, 214--221. Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. Navarro, G., Sutinen, E., and Tarhio, J. 2005. Indexing text with approximate q-grams. J. Discrete Algorithms 3, 2-4, 157--175.Google ScholarGoogle ScholarCross RefCross Ref
  114. Needleman, S. B. and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 3, 443--453.Google ScholarGoogle ScholarCross RefCross Ref
  115. Ng, K. and Zue, V. W. 2000. Subword-based approaches for spoken document retrieval. Speech Commun. 32, 3, 157--186. Google ScholarGoogle ScholarDigital LibraryDigital Library
  116. Nievergelt, J., Hinterberger, H., and Sevcik, K. C. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38--71. Google ScholarGoogle ScholarDigital LibraryDigital Library
  117. Nilsson, S. and Karlsson, G. 1999. IP-address lookup using LC-tries. IEEE J. Sel. Areas Commun. 17, 6, 1083--1092. Google ScholarGoogle ScholarDigital LibraryDigital Library
  118. Orenstein, J. 1982. Multidimensional tries used for associative searching. Inf. Process. Lett. 14, 4, 150--157.Google ScholarGoogle ScholarCross RefCross Ref
  119. Owolabi, O. 1996. Dictionary organizations for efficient similarity retrieval. J. Syst. Software 34, 2, 127--132. Google ScholarGoogle ScholarDigital LibraryDigital Library
  120. Owolabi, O. and McGregor, D. R. 1988. Fast approximate string matching. Software Pract. Exp. 18, 4, 387--393. Google ScholarGoogle ScholarDigital LibraryDigital Library
  121. Ozturk, O. and Ferhatosmanoglu, H. 2003. Effective indexing and filtering for similarity search in large biosequence databases. In Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE'03). IEEE, Los Alamitos, CA 359. Google ScholarGoogle ScholarDigital LibraryDigital Library
  122. Pagh, R. and Rodler, F. 2001. Cuckoo hashing. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA '01). Springer, Berlin, 121--133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  123. Peterson, J. L. 1980. Computer programs for detecting and correcting spelling errors. Commun. ACM 23, 12, 676--687. Google ScholarGoogle ScholarDigital LibraryDigital Library
  124. Peterson, J. L. 1986. A note on undetected typing errors. Commun. ACM 29, 7, 633--637. Google ScholarGoogle ScholarDigital LibraryDigital Library
  125. Raman, R., Raman, V., and Satti, S. R. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3, 4, 43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Riseman, E. M. and Hanson, A. R. 1974. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comput. 23, 5, 480--493. Google ScholarGoogle ScholarDigital LibraryDigital Library
  127. Robinson, J. 1981. The K-D-B-Tree: A search structure for large multidimensional dynamic indexes. In Proceedings of 1981 ACM SIGMOD International Conference on Management of Data. ACM, New York, 10--18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  128. Roussopoulos, N. and Leifker, D. 1985. Direct spatial search on pictorial databases using packed R-trees. SIGMOD Rec. 14, 4, 17--31. Google ScholarGoogle ScholarDigital LibraryDigital Library
  129. Russo, L. and Oliveira, A. 2005. An efficient algorithm for generating super condensed neighborhoods. In Proceedings of the 16th Annual Combinatorial Pattern Matching Symposium (CPM'05). Springer, Berlin, 104--115. Google ScholarGoogle ScholarDigital LibraryDigital Library
  130. Russo, L. M. S., Navarro, G., and Oliveira, A. L. 2008. Fully-compressed suffix trees. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (LATIN'08), Springer, Berlin, 362--373. Google ScholarGoogle ScholarDigital LibraryDigital Library
  131. Russo, L. M. S., Navarro, G., Oliveira, A. L., and Morales, P. 2009. Approximate string matching with compressed indexes. Algorithms 2, 3, 1105--1136.Google ScholarGoogle ScholarCross RefCross Ref
  132. Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comp. Sys. 41, 4, 589--607. Google ScholarGoogle ScholarDigital LibraryDigital Library
  133. Sakoe, H. and Chiba, S. 1971. A dynamic programming approach to continuous speech recognition. In Proceedings of the 7th International Congress on Acoustics. 65--68.Google ScholarGoogle Scholar
  134. Samet, H. 1995. Spatial data structures. In Modern Database Systems in The Object Model, Interoperability and Beyond. Addison-Wesley, Upper Saddle River, NJ, 361--385. Google ScholarGoogle ScholarDigital LibraryDigital Library
  135. Samet, H. 2005. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  136. Sankoff, D. 2000. The early introduction of dynamic programming into computational biology. Bioinformatics 16, 1, 41--47.Google ScholarGoogle ScholarCross RefCross Ref
  137. Schek, H. J. 1978. The reference string indexing method. In Proceedings of the 2nd Conference of the European Cooperation in Informatics. Springer-Verlag, Berlin, 432--459. Google ScholarGoogle ScholarDigital LibraryDigital Library
  138. Scholer, F., Williams, H. E., Yiannis, J., and Zobel, J. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in information Retrieval. ACM, New York, 222--229. Google ScholarGoogle ScholarDigital LibraryDigital Library
  139. Schuegraf, E. J. 1973. Selection of equifrequent work fragments for information retrieval. Inf. Storage Retrieval 9, 12, 697--711.Google ScholarGoogle ScholarCross RefCross Ref
  140. Sellers, P. 1974. On the computation of evolutionary distances. SIAM J. Appl. Math. 26, 787--793.Google ScholarGoogle ScholarCross RefCross Ref
  141. Sellis, T. K., Roussopoulos, N., and Faloutsos, C. 1987. The R<sup>+</sup>-Tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB'87). Morgan Kaufmann, San Francisco, CA, 507--518. Google ScholarGoogle ScholarDigital LibraryDigital Library
  142. Siegler, M., Witbrock, M. J., Slattery, S. T., Seymore, K., Jones, R. E., and Hauptmann, A. G. 1997. Experiments in Spoken Document Retrieval at CMU. In Proceedings of the 6th Text REtrieval Conference. National Institute of Standard and Technology, Gaithersburg, MD, 291--302.Google ScholarGoogle Scholar
  143. Skut, W. 2004. Incremental construction of minimal acyclic sequential transducers from unsorted data. In Proceedings of the 20th International Conference on Computational Linguistics (COLING'04). Association for Computational Linguistics, Stroudsburg, PA, 15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  144. Sung, W.-K. 2008. Indexed approximate string matching. In Encyclopedia of Algorithms. 408--410.Google ScholarGoogle Scholar
  145. Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). Springer-Verlag, Berlin, 50--63. Google ScholarGoogle ScholarDigital LibraryDigital Library
  146. Uhlmann, J. 1991. Satisfying general proximity similarity queries with metric trees. Inf. Process. Lett. 40, 175--179.Google ScholarGoogle ScholarCross RefCross Ref
  147. Ukkonen, E. 1985a. Algorithms for approximate string matching. Inf. Control 64, 1-3, 100--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  148. Ukkonen, E. 1985b. Finding approximate patterns in strings. J. Algorithms 6, 1, 132--137.Google ScholarGoogle ScholarCross RefCross Ref
  149. Ukkonen, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, Berlin, 228--242. Google ScholarGoogle ScholarDigital LibraryDigital Library
  150. Ukkonen, E. 1995. On-line construction of suffix trees. Algorithmica 14, 249--260.Google ScholarGoogle ScholarDigital LibraryDigital Library
  151. Velichko, V. and Zagoruyko, N. 1970. Automatic recognition of 200 words. Int. J. Man-Machine Stud. 2, 3, 223--234.Google ScholarGoogle ScholarCross RefCross Ref
  152. Veronis, J. 1988. Computerized correction of phonographic errors. Computers and the Humanities 22, 1, 43--56.Google ScholarGoogle ScholarCross RefCross Ref
  153. Vidal, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognit. Lett. 4, 3, 145--147. Google ScholarGoogle ScholarDigital LibraryDigital Library
  154. Vintsyuk, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 1, 52--57.Google ScholarGoogle ScholarCross RefCross Ref
  155. Wagner, R. A. and Fischer, M. J. 1974. The string-to-string correction problem. J. ACM 21, 1, 168--173. Google ScholarGoogle ScholarDigital LibraryDigital Library
  156. Wang, B., Xie, L., and Wang, G. 2009. A two-tire index structure for approximate string matching with block moves. In Proceedings of the 2nd International Workshop on Managing Data Quality in Collaborative Information Systems and 1st International Workshop on Data and Process Provenance. Springer, Berlin, 197--211. Google ScholarGoogle ScholarDigital LibraryDigital Library
  157. Weber, R., Schek, H. J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 194--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  158. Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings the 14th IEEE Symposium on Switching and Automata Theory. IEEE, Los Alamitos, CA, 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  159. Wilbur, W. J., Kim, W., and Xie, N. 2006. Spelling correction in the PubMed search engine. Information Retrieval 9, 5, 543--564. Google ScholarGoogle ScholarDigital LibraryDigital Library
  160. Willet, P. 1979. Document retrieval experiments using indexing vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of index terms. J. Doc. 35, 4, 296--305.Google ScholarGoogle ScholarCross RefCross Ref
  161. Woodland, P., Odell, J., Valtchev, V., and Young, S. 1994. Large vocabulary continuous speech recognition using HTK. In Proceedings of the IEEE International Conference on Acoustic, Speech, and Signal Processing. IEEE, Los Alamitos, CA, 3, 125--128.Google ScholarGoogle Scholar
  162. Wu, S. and Manber, U. 1992a. Agrep -- a fast approximate matching tool. In Proceedings of the USENIX Winter Technical Conference. USENIX, Berkeley, CA, 153--162.Google ScholarGoogle Scholar
  163. Wu, S. and Manber, U. 1992b. Fast text searching allowing errors. Commun. ACM 35, 10, 83--91. Google ScholarGoogle ScholarDigital LibraryDigital Library
  164. Wu, S., Manber, U., and Myers, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50--67.Google ScholarGoogle ScholarDigital LibraryDigital Library
  165. Zaliznyak, M. 2006. Telefonica's dispute over an accented domain in Chile. Multilingual Search. Sep 30, 2006. http://www.multilingual-search.com/telefonica%C2%B4s-dispute-over-an-accented-domain-in-chile/30/09/2006.Google ScholarGoogle Scholar
  166. Zamora, E., Pollock, J., and Zamora, A. 1981. The use of trigram analysis for spelling error detection. Inf. Process. Manage. 17, 6, 305--316.Google ScholarGoogle ScholarCross RefCross Ref
  167. Zezula, P., Amato, G., Dohnal, V., and Batko, M. 2005. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag, Berlin. Google ScholarGoogle Scholar
  168. Zobel, J. and Dart, P. 1995. Finding approximate matches in large lexicons. Software Pract. Exp. 25, 2, 331--345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  169. Zobel, J. and Dart, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 166--172. Google ScholarGoogle ScholarDigital LibraryDigital Library
  170. Zobel, J., Moffat, A., and Ramamohanarao, K. 1998. Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23, 4, 453--490. Google ScholarGoogle ScholarDigital LibraryDigital Library
  171. Zobel, J., Moffat, A., and Sacks-davis, R. 1993. Searching large lexicons for partially specified terms using compressed inverted files. In Proceedings of the 19th international Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 290--301. Google ScholarGoogle ScholarDigital LibraryDigital Library
  172. Zue, V., Glass, J., Phillips, M., and Seneff, S. 1989. The MIT SUMMIT speech recognition system: a progress report. In Proceedings of the Workshop on Speech and Natural Language (HLT '89). Association for Computational Linguistics, Morristown, NJ, 179--189. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Indexing methods for approximate dictionary searching: Comparative analysis

                              Recommendations

                              Reviews

                              Jerzy W. Jaromczyk

                              Anyone who has ever searched for a word (pattern) in a file or checked documents for spelling errors appreciates efficient algorithms that can quickly find relevant occurrences of the pattern, even if it is inexact or incomplete. This paper surveys a class of such algorithms: indexing methods for approximate searching. The paper is divided into sections that discuss fundamental concepts and definitions, describe the methods and algorithms, and provide a comparative analysis of the experimental results. The bibliography lists more than 180 references, with publication times spanning decades, from the 1960s to 2010. In spite of the voluminous material, thanks to the taxonomy introduced by the author, the presentation is clear and well organized. A rich appendix includes additional material, such as proofs of selected theorems. Numerous tables and figures augment the presentation. Furthermore, the main observations and key points are concisely itemized as lists, such as the "bird view" (Boytsov's term) of key conclusions from the experimental analysis of the search methods. The paper thoroughly describes intensive experiments that include detailed investigations of results that, on the surface, may appear to be counterintuitive. For example, with FB-tries, retrieval times for longer patterns are better than for short ones; Boytsov conducted additional experiments to understand this phenomenon. This paper would be of interest not only to computer scientists (for whom numerous open questions are listed in the conclusion section), but also to researchers and practitioners from a variety of areas, who want to select the best search methods for such tasks as searching in dictionaries of DNA sequences. Online Computing Reviews Service

                              Access critical reviews of Computing literature here

                              Become a reviewer for Computing Reviews.

                              Comments

                              Login options

                              Check if you have access through your login credentials or your institution to get full access on this article.

                              Sign in

                              Full Access

                              PDF Format

                              View or Download as a PDF file.

                              PDF

                              eReader

                              View online with eReader.

                              eReader