Abstract
The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k. Benchmark results are presented for the practically important cases of k=1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article is concluded with a discussion of benchmark results and directions for future research.
Supplemental Material
Available for Download
Supplemental materials file 1 of 4
README file for the supplemental material
Supplemental materials file 4 of 4
Supplemental materials file 3 of 4
Supplemental materials file 2 of 4
- Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N., and Rodeh, M. 2000. Text indexing and dictionary matching with one error. J. Algorithms 37, 2, 309--325. Google ScholarDigital Library
- Angell, R. C., Freund, G. E., and Willett, P. 1983. Automatic spelling correction using a trigram similarity measure. Inf. Process. Manage. 19, 4, 443--453.Google ScholarCross Ref
- Baeza-Yates, R., Cunto, W., Manber, U., and Wu, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the 5th Annual Symposium on Pattern Matching. Springer, Berlin, 198--212. Google ScholarDigital Library
- Baeza-Yates, R. and Gonnet, G. 1992. A new approach to text searching. Commun. ACM 35, 10, 74--82. Google ScholarDigital Library
- Baeza-Yates, R. and Navarro, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the String Processing and Information Retrieval: A South American Symposium (SPIRE'98). IEEE, Los Alamitos, CA, 14--22.Google Scholar
- Baeza-Yates, R. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.Google ScholarCross Ref
- Baeza-Yates, R. and Salinger, A. 2005. Experimental analysis of a fast intersection algorithm for sorted sequences. In Proceedings of the String Processing and Information Retrieval (SPIRE'05). Springer, Berlin, 13--24. Google ScholarDigital Library
- Baeza-Yates, R. A. and Gonnet, G. H. 1990. All-against-all sequence matching: Preliminary version. http://www.dcc.uchile.cl/~rbaeza/cv/reports.html.Google Scholar
- Baeza-Yates, R. A. and Gonnet, G. H. 1999. A fast algorithm on average for all-against-all sequence matching. In Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware (SPIRE'99). IEEE, Los Alamitos, CA. 16--23. Google ScholarDigital Library
- Barbay, J., López-Ortiz, A., and Lu, T. 2006. Faster adaptive set intersections for text searching. In Proceedings of the 5th International Workshop on Experimental Algorithms (WEA'06). Springer, Berlin, 146--157. Google ScholarDigital Library
- Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B. 1990. The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of 1990 ACM SIGMOD International Conference on Management of Data. ACM, New York, 322--331. Google ScholarDigital Library
- Belazzougui, D. 2009. Faster and space-optimal edit distance 1 dictionary. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Springer, Berlin, 154--167.Google ScholarCross Ref
- Benoit, D., Demaine, E. D., Munro, J. I., Raman, R., Raman, V., and Rao, S. S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.Google ScholarDigital Library
- Bentley, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509--517. Google ScholarDigital Library
- Berchtold, S., Keim, D. A., and Kriegel, H. P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 28--39. Google ScholarDigital Library
- Berman, A. 1994. A new data structure for fast approximate matching. Tech. rep. 1994-03-02, Department of Computer Science and Engineering, University of Washington.Google Scholar
- Bieganski, P., Riedl, J., Cartis, J., and Retzel, E. 1994. Generalized suffix trees for biological sequence data: applications and implementation. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences. IEEE, Los Alamitos, CA, 35--44.Google Scholar
- Böhm, C., Berchtold, S., and Keim, D. A. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373. Google ScholarDigital Library
- Boitsov, L. 2004. Classification and experimental comparison of modern dictionary fuzzy search algorithms (in Russian). In Proceedings of the 6th Russian Conference on Digital Libraries (RCDL'04).Google Scholar
- Botelho, F., Pagh, R., and Ziviani, N. 2007. Simple and space-efficient minimal perfect hash functions. In Proceedings of the 10th International Workshop on Algorithms and Data Structures. Springer, Berlin, 139--150. Google ScholarDigital Library
- Brill, E. and Moore, R. C. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, 286--293. Google ScholarDigital Library
- Burkhard, W. and Keller, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230--236. Google ScholarDigital Library
- Burkhardt, S. and Kärkkäinen, J. 2002. One-gapped q-gram filter for Levenshtein distance. In Proceedings of the 13th Symposium on Combinatorial Pattern Matching (CPM'02). Springer, Berlin, 225--234. Google ScholarDigital Library
- Cánovas, R. and Navarro, G. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Springer, Berlin, 94--105. Google ScholarDigital Library
- Carterette, B. and Can, F. 2005. Comparing inverted files and signature files for searching in a large lexicon. Inf. Process. Manage. 41, 3, 613--633. Google ScholarDigital Library
- Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., and Wong, S.-S. 2006. Compressed indexes for approximate string matching. In Proceedings of the 14th Annual European Symposium (ESA'06). Springer-Verlag, Berlin, 208--219. Google ScholarDigital Library
- Chávez, E., Marroquín, J. L., and Navarro, G. 2001. Fixed Queries Array: A fast and economical data structure for proximity searching. Multimedia Tools Appl. 14, 2, 113--135. Google ScholarDigital Library
- Chávez, E. and Navarro, G. 2003. Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces. Inf. Process. Lett. 85, 1, 39--46. Google ScholarDigital Library
- Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquin, J. L. 2001. Searching in metric spaces. ACM Comput. Surv. 33, 3, 273--321. Google ScholarDigital Library
- Cho, J. and Rajagopalan, S. 2002. A fast regular expression indexing engine. In Proceedings of the 18th International Conference on Data Engineering. IEEE, Los Alamitos, CA, 419--430. Google ScholarDigital Library
- Claude, F., Fariña, A., and Navarro, G. 2009. Re-Pair compression of inverted lists. CoRR abs/0911.3318.Google Scholar
- Coelho, L. and Oliveira, A. 2006. Dotted Suffix Trees a structure for approximate text indexing. In Proceedings of the 13th International Conference on String Processing and Information Retrieval (SPIRE'06). Springer, Berlin, 329--336. Google ScholarDigital Library
- Cole, R., Gottlieb, L.-A., and Lewenstein, M. 2004. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC'04). ACM, New York, 91--100. Google ScholarDigital Library
- Collins, P. 2009. Definately* the most misspelled word in the english language (*it should be definitely). The Daily Record. June 15, 2009. http://www.dailyrecord.co.uk/news/editors-choice/2009/06/15/definately-the-most-misspelled-word-in-the-english-language-it-should-be-definitely-86908-21441847/.Google Scholar
- Covington, M. A. 1996. An algorithm to align words for historical comparison. Comput. Ling. 22, 4, 481--496. Google ScholarDigital Library
- Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E. 2000. Incremental construction of minimal acyclic finite-state automata. Comput. Ling. 26, 1, 3--16. Google ScholarDigital Library
- Damerau, F. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3, 171--176. Google ScholarDigital Library
- D'Amore, R. J. and Mah, C. P. 1985. One-time complete indexing of text: theory and practice. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'85). ACM, New York, 155--164. Google ScholarDigital Library
- Demaine, E. D., López-Ortiz, A., and Munro, J. I. 2001. Experiments on adaptive set intersections for text retrieval systems. In Revised Papers from the 3rd International Workshop on Algorithm Engineering and Experimentation (ALENEX'01). Springer-Verlag, Berlin, 91--104. Google ScholarDigital Library
- Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput. 23, 4, 738--761. Google ScholarDigital Library
- Dömölki, B. 1968. A universal compiler system based on production rules. BIT Numer. Math. 8, 262--275.Google ScholarCross Ref
- Doster, W. 1977. Contextual postprocessing system for cooperation with a multiple-choice character-recognition system. IEEE Trans. Comput. 26, 1090--1101. Google ScholarDigital Library
- Du, M. W. and Chang, S. C. 1994. An approach to designing very fast approximate string matching algorithms. IEEE Trans. Knowl. Data Eng. 6, 4, 620--633. Google ScholarDigital Library
- Elias, P. 1974. Efficient storage and retrieval by content and address of static files. J. ACM 21, 2, 246--260. Google ScholarDigital Library
- Faloutsos, C. 1996. Searching Multimedia Databases by Content. Kluwer Academic Publisher, Dordrecht, The Netherlands. Google ScholarDigital Library
- Ferragina, P., Muthukrishnan, S., and de Berg, M. 1999. Multi-method dispatching: a geometric approach with applications to string matching problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC'99). ACM, New York, 483--491. Google ScholarDigital Library
- Ferragina, P. and Venturini, R. 2007. Compressed permuterm index. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'07). ACM, New York, 535--542. Google ScholarDigital Library
- Figueroa, K. and Fredriksson, K. 2007. Simple space-time trade-offs for AESA. In Proceedings of the 6th International Workshop on Experimental Algorithms (WEA'07). Springer, Berlin, 229--241. Google ScholarDigital Library
- Fischer, J., Mäkinen, V., and Navarro, G. 2008. An(other) entropy-bounded compressed suffix tree. In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM'08). Springer-Verlag, Berlin, 152--165. Google ScholarDigital Library
- Ford, G., Hause, S., Le, D. X., and Thoma, G. R. 2001. Pattern matching techniques for correcting low confidence OCR words in a known context. In Proceedings of the 8th SPIE International Conference on Document Recognition and Retrieval. SPIE, Bellingham, WA, 241--249.Google Scholar
- Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a sparse table with o(1) worst case access time. J. ACM 31, 3, 538--544. Google ScholarDigital Library
- Fredriksson, K. 2007. Engineering efficient metric indexes. Pattern Recogn. Lett. 28, 1, 75--84. Google ScholarDigital Library
- Friedman, J. H., Bentley, J. L., and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3, 3, 209--226. Google ScholarDigital Library
- Gaede, V. and Günther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170--231. Google ScholarDigital Library
- Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Softw., Pract. Exper. 33, 11, 1035--1049.Google ScholarCross Ref
- Giladi, E., Walker, M. G., Wang, J. Z., and Volkmuth, W. 2002. SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size. Bioinformatics 18, 6, 873--877.Google ScholarCross Ref
- Glass, J. R. 2003. A probabilistic framework for segment-based speech recognition. Comput. Speech Lang. 17, 2-3, 137 -- 152.Google ScholarCross Ref
- Gollapudi, S. and Panigrahy, R. 2006. A dictionary for approximate string search and longest prefix search. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06). ACM, New York, 768--775. Google ScholarDigital Library
- Gorin, R. E. 1971. SPELL: Spelling check and correction program. http://pdp-10.trailing-edge.com/decuslib10-03/01/43,50270/spell.doc.html.Google Scholar
- Gravano, L., Ipeirotis, P., Jagadish, H. V., Koudas, N., Muthukrishnan, S., Pietarinen, L., and Srivastava, D. 2001. Using q-grams in a DBMS for approximate string processing. IEEE Data Eng. Bull. 24, 4, 28--34.Google Scholar
- Grossi, R. and Luccio, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113--120.Google ScholarCross Ref
- Grossi, R. and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407. Google ScholarDigital Library
- Gusfield, D. 1999. Algorithms on Strings, Trees, and Sequences -- Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK. Google ScholarDigital Library
- Guttman, A. 1984. R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, 47--57. Google ScholarDigital Library
- Gwehenberger, G. 1968. Anwendung einer binären verweiskettenmethode beim aufbau von listen. Elektronische Rechenanlagen 10, 223--226.Google Scholar
- Hall, P. and Dowling, G. 1980. Approximate string matching. ACM Computing Surveys 12, 4, 381--402. Google ScholarDigital Library
- Hellerstein, J. and Pfeffer, A. 1994. The RD-tree: An index structure for sets. Tech. rep. 1252, University of Wisconsin at Madison.Google Scholar
- Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 301, 13--30.Google ScholarCross Ref
- Hyyrö, H. 2003a. Bit-parallel approximate string matching algorithms with transposition. In Proceedings of the 10th International Symposium on String Processing. Springer-Verlag, Berlin, 95--107.Google ScholarCross Ref
- Hyyrö, H. 2003b. Practical methods for approximate string matching. Ph.D. thesis, Department of Computer Science, University of Tampere, Finland.Google Scholar
- Hyyrö, H. 2005. Bit-parallel approximate string matching algorithms with transposition. J. Discrete Algorithms 3, 2-4, 215--229.Google ScholarCross Ref
- James, E. B. and Partridge, D. P. 1973. Adaptive correction of program statements. Commun. ACM 16, 1, 27--37. Google ScholarDigital Library
- Jeong, I.-S., Park, K.-W., Kang, S.-H., and Lim, H.-S. 2010. An efficient similarity search based on indexing in large dna databases. Comput. Biol. Chem. 34, 2, 131--136. Google ScholarDigital Library
- Jokinen, P., Tarhio, J., and Ukkonen, E. 1996. A comparison of approximate string matching algorithms. Software Pract. Exp. 26, 12, 1439--1458. Google ScholarDigital Library
- Jokinen, P. and Ukkonen, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 16th International Symposium on Mathematical Foundations of Computer Science. Springer, Berlin, 240--248.Google Scholar
- Kahveci, T., Ljosa, V., and Singh, A. K. 2004. Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics 20, 13, 2122--2134. Google ScholarDigital Library
- Kahveci, T. and Singh, A. 2001. An efficient index structure for string databases. In Proceedings of the 27th International Conference on Very Large Databases. Morgan Kaufmann, San Francisco, CA, 351--360. Google ScholarDigital Library
- Klovstad, J. and Mondshein, L. 1975. The CASPERS linguistic analysis system. IEEE Trans. Acoust. Speech Signal Process. 23, 118--123.Google ScholarCross Ref
- Knuth, D. 1973. The Art of Computer Programming. Sorting and Searching, 1st ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.Google Scholar
- Knuth, D. 1997. The Art of Computer Programming. Sorting and Searching, 2d ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.Google Scholar
- Kondrak, G. 2003. Phonetic alignment and similarity. Computers and the Humanities 37, 3, 273--291.Google ScholarCross Ref
- Kuenning, G., Gorin, R. E., Willisson, P., Buehring, W., and Stevens, K. 1988. International spell: a fast screen-oriented spelling checker. http://www.lasr.cs.ucla.edu/geoff/ispell.html.Google Scholar
- Kukich, K. 1992. Technique for automatically correcting words in text. ACM Comput. Surv. 24, 2, 377--439. Google ScholarDigital Library
- Kurtz, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 156--170.Google Scholar
- Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Exper. 29, 13, 1149--1171. Google ScholarDigital Library
- Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163, 4, 845--848.Google Scholar
- Lowrance, R. and Wagner, R. 1975. An extension of the string-to-string correction problem. J. ACM 22, 2, 177--183. Google ScholarDigital Library
- Maly, K. 1976. Compressed tries. Commun. ACM 19, 7, 409--415. Google ScholarDigital Library
- Manber, U. and Myers, G. 1990. Suffix arrays: a new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'90). Society for Industrial and Applied Mathematics, Philadephia, PA, 319--327. Google ScholarDigital Library
- Manber, U. and Wu, S. 1994a. An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 50, 4, 191--197. Google ScholarDigital Library
- Manber, U. and Wu, S. 1994b. GLIMPSE: a tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference (WTEC'94). USENIX, Berkeley, CA, 4--4. Google ScholarDigital Library
- McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272. Google ScholarDigital Library
- Melichar, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition. IEEE, Los Alamitos, CA, 256--260. Google ScholarDigital Library
- Micó, M. L., Oncina, J., and Vidal-Ruiz, E. 1994. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn. Lett. 15, 1, 9--17. Google ScholarDigital Library
- Mihov, S. and Schulz, K. U. 2004. Fast approximate string search in large dictionaries. Comput. Ling. 30, 4, 451--477. Google ScholarDigital Library
- Mochizuki, H. and Hayashi, Y. 2000. An efficient retrieval algorithm of compound words using extendible hashing. Int. J. Comput. Process. Oriental Lang. 13, 1, 15--33.Google ScholarCross Ref
- Moore, T. and Edelman, B. 2010. Measuring the perpetrators and funders of typosquatting. In Proceedings of the 14th International Conference on Financial Cryptography and Data Security. Springer, Berlin. Google ScholarDigital Library
- Mor, M. and Fraenkel, A. S. 1982. A hash code method for detecting and correcting spelling errors. Commun. ACM 25, 12, 935--938. Google ScholarDigital Library
- Morrison, D. R. 1968. PATRICIA—practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4, 514--534. Google ScholarDigital Library
- Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching. Springer, Berlin, 75--86. Google ScholarDigital Library
- Myers, E. 1994. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345--374.Google Scholar
- Myers, E. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395--415. Google ScholarDigital Library
- Navarro, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 95--111.Google Scholar
- Navarro, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 112--124.Google Scholar
- Navarro, G. 2001a. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88. Google ScholarDigital Library
- Navarro, G. 2001b. NR-grep: a fast and flexible pattern matching tool. Software Pract. Exp. 31, 13, 1265--1312. Google ScholarDigital Library
- Navarro, G. and Baeza-Yates, R. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1, 2.Google Scholar
- Navarro, G. and Baeza-Yates, R. 2000. A hybrid indexing method for approximate string matching. J. Discrete Algorithms, 1, 1, 205--239.Google Scholar
- Navarro, G., Baeza-Yates, R., Sutinen, E., and Tarhio, J. 2001. Indexing methods for approximate string matching. IEEE Data Engin. Bull. 24, 4, 19--27.Google Scholar
- Navarro, G., Paredes, R., and Chávez, E. 2002. t-Spanners as a data structure for metric space searching. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE'02), Springer, Berlin, 195--200. Google ScholarDigital Library
- Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. J. Exp. Algorithmics, 5, 4. Google ScholarDigital Library
- Navarro, G. and Salmela, L. 2009. Indexing variable length substrings for exact and approximate matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Springer, Berlin, 214--221. Google ScholarDigital Library
- Navarro, G., Sutinen, E., and Tarhio, J. 2005. Indexing text with approximate q-grams. J. Discrete Algorithms 3, 2-4, 157--175.Google ScholarCross Ref
- Needleman, S. B. and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 3, 443--453.Google ScholarCross Ref
- Ng, K. and Zue, V. W. 2000. Subword-based approaches for spoken document retrieval. Speech Commun. 32, 3, 157--186. Google ScholarDigital Library
- Nievergelt, J., Hinterberger, H., and Sevcik, K. C. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38--71. Google ScholarDigital Library
- Nilsson, S. and Karlsson, G. 1999. IP-address lookup using LC-tries. IEEE J. Sel. Areas Commun. 17, 6, 1083--1092. Google ScholarDigital Library
- Orenstein, J. 1982. Multidimensional tries used for associative searching. Inf. Process. Lett. 14, 4, 150--157.Google ScholarCross Ref
- Owolabi, O. 1996. Dictionary organizations for efficient similarity retrieval. J. Syst. Software 34, 2, 127--132. Google ScholarDigital Library
- Owolabi, O. and McGregor, D. R. 1988. Fast approximate string matching. Software Pract. Exp. 18, 4, 387--393. Google ScholarDigital Library
- Ozturk, O. and Ferhatosmanoglu, H. 2003. Effective indexing and filtering for similarity search in large biosequence databases. In Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE'03). IEEE, Los Alamitos, CA 359. Google ScholarDigital Library
- Pagh, R. and Rodler, F. 2001. Cuckoo hashing. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA '01). Springer, Berlin, 121--133. Google ScholarDigital Library
- Peterson, J. L. 1980. Computer programs for detecting and correcting spelling errors. Commun. ACM 23, 12, 676--687. Google ScholarDigital Library
- Peterson, J. L. 1986. A note on undetected typing errors. Commun. ACM 29, 7, 633--637. Google ScholarDigital Library
- Raman, R., Raman, V., and Satti, S. R. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3, 4, 43. Google ScholarDigital Library
- Riseman, E. M. and Hanson, A. R. 1974. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comput. 23, 5, 480--493. Google ScholarDigital Library
- Robinson, J. 1981. The K-D-B-Tree: A search structure for large multidimensional dynamic indexes. In Proceedings of 1981 ACM SIGMOD International Conference on Management of Data. ACM, New York, 10--18. Google ScholarDigital Library
- Roussopoulos, N. and Leifker, D. 1985. Direct spatial search on pictorial databases using packed R-trees. SIGMOD Rec. 14, 4, 17--31. Google ScholarDigital Library
- Russo, L. and Oliveira, A. 2005. An efficient algorithm for generating super condensed neighborhoods. In Proceedings of the 16th Annual Combinatorial Pattern Matching Symposium (CPM'05). Springer, Berlin, 104--115. Google ScholarDigital Library
- Russo, L. M. S., Navarro, G., and Oliveira, A. L. 2008. Fully-compressed suffix trees. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (LATIN'08), Springer, Berlin, 362--373. Google ScholarDigital Library
- Russo, L. M. S., Navarro, G., Oliveira, A. L., and Morales, P. 2009. Approximate string matching with compressed indexes. Algorithms 2, 3, 1105--1136.Google ScholarCross Ref
- Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comp. Sys. 41, 4, 589--607. Google ScholarDigital Library
- Sakoe, H. and Chiba, S. 1971. A dynamic programming approach to continuous speech recognition. In Proceedings of the 7th International Congress on Acoustics. 65--68.Google Scholar
- Samet, H. 1995. Spatial data structures. In Modern Database Systems in The Object Model, Interoperability and Beyond. Addison-Wesley, Upper Saddle River, NJ, 361--385. Google ScholarDigital Library
- Samet, H. 2005. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco, CA. Google ScholarDigital Library
- Sankoff, D. 2000. The early introduction of dynamic programming into computational biology. Bioinformatics 16, 1, 41--47.Google ScholarCross Ref
- Schek, H. J. 1978. The reference string indexing method. In Proceedings of the 2nd Conference of the European Cooperation in Informatics. Springer-Verlag, Berlin, 432--459. Google ScholarDigital Library
- Scholer, F., Williams, H. E., Yiannis, J., and Zobel, J. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in information Retrieval. ACM, New York, 222--229. Google ScholarDigital Library
- Schuegraf, E. J. 1973. Selection of equifrequent work fragments for information retrieval. Inf. Storage Retrieval 9, 12, 697--711.Google ScholarCross Ref
- Sellers, P. 1974. On the computation of evolutionary distances. SIAM J. Appl. Math. 26, 787--793.Google ScholarCross Ref
- Sellis, T. K., Roussopoulos, N., and Faloutsos, C. 1987. The R<sup>+</sup>-Tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB'87). Morgan Kaufmann, San Francisco, CA, 507--518. Google ScholarDigital Library
- Siegler, M., Witbrock, M. J., Slattery, S. T., Seymore, K., Jones, R. E., and Hauptmann, A. G. 1997. Experiments in Spoken Document Retrieval at CMU. In Proceedings of the 6th Text REtrieval Conference. National Institute of Standard and Technology, Gaithersburg, MD, 291--302.Google Scholar
- Skut, W. 2004. Incremental construction of minimal acyclic sequential transducers from unsorted data. In Proceedings of the 20th International Conference on Computational Linguistics (COLING'04). Association for Computational Linguistics, Stroudsburg, PA, 15. Google ScholarDigital Library
- Sung, W.-K. 2008. Indexed approximate string matching. In Encyclopedia of Algorithms. 408--410.Google Scholar
- Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). Springer-Verlag, Berlin, 50--63. Google ScholarDigital Library
- Uhlmann, J. 1991. Satisfying general proximity similarity queries with metric trees. Inf. Process. Lett. 40, 175--179.Google ScholarCross Ref
- Ukkonen, E. 1985a. Algorithms for approximate string matching. Inf. Control 64, 1-3, 100--118. Google ScholarDigital Library
- Ukkonen, E. 1985b. Finding approximate patterns in strings. J. Algorithms 6, 1, 132--137.Google ScholarCross Ref
- Ukkonen, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, Berlin, 228--242. Google ScholarDigital Library
- Ukkonen, E. 1995. On-line construction of suffix trees. Algorithmica 14, 249--260.Google ScholarDigital Library
- Velichko, V. and Zagoruyko, N. 1970. Automatic recognition of 200 words. Int. J. Man-Machine Stud. 2, 3, 223--234.Google ScholarCross Ref
- Veronis, J. 1988. Computerized correction of phonographic errors. Computers and the Humanities 22, 1, 43--56.Google ScholarCross Ref
- Vidal, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognit. Lett. 4, 3, 145--147. Google ScholarDigital Library
- Vintsyuk, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 1, 52--57.Google ScholarCross Ref
- Wagner, R. A. and Fischer, M. J. 1974. The string-to-string correction problem. J. ACM 21, 1, 168--173. Google ScholarDigital Library
- Wang, B., Xie, L., and Wang, G. 2009. A two-tire index structure for approximate string matching with block moves. In Proceedings of the 2nd International Workshop on Managing Data Quality in Collaborative Information Systems and 1st International Workshop on Data and Process Provenance. Springer, Berlin, 197--211. Google ScholarDigital Library
- Weber, R., Schek, H. J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 194--205. Google ScholarDigital Library
- Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings the 14th IEEE Symposium on Switching and Automata Theory. IEEE, Los Alamitos, CA, 1--11. Google ScholarDigital Library
- Wilbur, W. J., Kim, W., and Xie, N. 2006. Spelling correction in the PubMed search engine. Information Retrieval 9, 5, 543--564. Google ScholarDigital Library
- Willet, P. 1979. Document retrieval experiments using indexing vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of index terms. J. Doc. 35, 4, 296--305.Google ScholarCross Ref
- Woodland, P., Odell, J., Valtchev, V., and Young, S. 1994. Large vocabulary continuous speech recognition using HTK. In Proceedings of the IEEE International Conference on Acoustic, Speech, and Signal Processing. IEEE, Los Alamitos, CA, 3, 125--128.Google Scholar
- Wu, S. and Manber, U. 1992a. Agrep -- a fast approximate matching tool. In Proceedings of the USENIX Winter Technical Conference. USENIX, Berkeley, CA, 153--162.Google Scholar
- Wu, S. and Manber, U. 1992b. Fast text searching allowing errors. Commun. ACM 35, 10, 83--91. Google ScholarDigital Library
- Wu, S., Manber, U., and Myers, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50--67.Google ScholarDigital Library
- Zaliznyak, M. 2006. Telefonica's dispute over an accented domain in Chile. Multilingual Search. Sep 30, 2006. http://www.multilingual-search.com/telefonica%C2%B4s-dispute-over-an-accented-domain-in-chile/30/09/2006.Google Scholar
- Zamora, E., Pollock, J., and Zamora, A. 1981. The use of trigram analysis for spelling error detection. Inf. Process. Manage. 17, 6, 305--316.Google ScholarCross Ref
- Zezula, P., Amato, G., Dohnal, V., and Batko, M. 2005. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag, Berlin. Google Scholar
- Zobel, J. and Dart, P. 1995. Finding approximate matches in large lexicons. Software Pract. Exp. 25, 2, 331--345. Google ScholarDigital Library
- Zobel, J. and Dart, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 166--172. Google ScholarDigital Library
- Zobel, J., Moffat, A., and Ramamohanarao, K. 1998. Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23, 4, 453--490. Google ScholarDigital Library
- Zobel, J., Moffat, A., and Sacks-davis, R. 1993. Searching large lexicons for partially specified terms using compressed inverted files. In Proceedings of the 19th international Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 290--301. Google ScholarDigital Library
- Zue, V., Glass, J., Phillips, M., and Seneff, S. 1989. The MIT SUMMIT speech recognition system: a progress report. In Proceedings of the Workshop on Speech and Natural Language (HLT '89). Association for Computational Linguistics, Morristown, NJ, 179--189. Google ScholarDigital Library
Index Terms
Indexing methods for approximate dictionary searching: Comparative analysis
Recommendations
A guided tour to approximate string matching
We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online ...
Text indexing with errors
In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be ...
Dictionary matching: review of the aho-corasick algorithm and vision for large dictionaries
ICIST '18: Proceedings of the 8th International Conference on Information Systems and TechnologiesPattern-matching techniques have recently been applied to network security applications such as intrusion detection, virus protection, and spam filters. The widely used the Aho-Corasick algorithm can simultaneously match multiple patterns while ...
Comments