2013 | OriginalPaper | Book chapter

3. Learning to Match Names Across Languages

Written by: Inderjeet Mani, Alex Yeh, Sherri Condon

Published in: Multi-source, Multilingual Information Extraction and Summarization

Publisher: Springer Berlin Heidelberg

Abstract

We report on research into matching names written in different scripts across languages. We explore two trainable approaches based on comparing pronunciations. The first, a cross-lingual approach, uses an automatic name-matching program that exploits rules derived from phonological comparisons of the two languages carried out by humans. The second, a monolingual approach, relies only on automatic comparison of the phonological representations of each pair of names. Alignments produced by each approach are fed to a machine learning algorithm. Results show that, with the monolingual approach, machine-learning-based comparison of person names in English and Chinese achieves an F-measure of over 97.0.
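The step of aligning two phonological representations and passing the alignment to a learner can be illustrated with a minimal sketch. It assumes plain unit-cost edit distance over phoneme lists, which is a stand-in and not the alignment program used in the chapter; the function names `align` and `alignment_features` and the three ratio features are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Tuple

def align(a: List[str], b: List[str]) -> Tuple[int, List[Tuple[str, str]]]:
    """Unit-cost edit-distance alignment of two phoneme sequences.

    Returns (distance, aligned pairs); '-' marks a skipped segment.
    Assumption: real systems use phonetically weighted costs instead.
    """
    n, m = len(a), len(b)
    # dist[i][j] = cost of aligning a[:i] with b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else 1
        ):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((a[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", b[j - 1]))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs

def alignment_features(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Turn an alignment into numeric features a classifier could consume."""
    matches = sum(1 for x, y in pairs if x == y)
    skips = sum(1 for x, y in pairs if "-" in (x, y))
    subs = len(pairs) - matches - skips
    total = len(pairs) or 1  # avoid division by zero for empty input
    return {
        "match_ratio": matches / total,
        "sub_ratio": subs / total,
        "skip_ratio": skips / total,
    }

# Hypothetical phoneme sequences, for illustration only.
distance, pairs = align(["k", "a", "t"], ["k", "t"])
features = alignment_features(pairs)
```

Feature vectors of this kind, labelled as match or non-match, would then serve as training examples for whichever learning algorithm is chosen.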

Footnotes
4
For the MALINE row in Table 3.3, the ALINE documentation explains the notation as follows: “every phonetic symbol is represented by a single lowercase letter followed by zero or more uppercase letters. The initial lowercase letter is the base letter most similar to the sound represented by the phonetic symbol. The remaining uppercase letters stand for the feature modifiers which alter the sound defined by the base letter. By default, the output contains the alignments together with overall similarity scores. The aligned subsequences are delimited by ‘|’ signs. The ‘<’ sign signifies that the previous phonetic segment has been aligned with two segments in the other sequence, a case of compression/expansion. The ‘–’ sign denotes a “skip”, a case of insertion/deletion.”
 
5
The Predictive Accuracy was computed with exactly half the test examples being positive.
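
As a reminder of how the two measures mentioned here relate, the following minimal sketch computes precision, recall, F-measure and predictive accuracy from confusion counts. The function name `scores` and the counts in the example are hypothetical and are not taken from the chapter; the example merely uses a test set in which exactly half the examples are positive.

```python
from typing import Dict

def scores(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary-classification measures from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Illustrative counts only: 50 positive and 50 negative test examples.
example = scores(tp=45, fp=5, fn=5, tn=45)
```

The values returned lie between 0 and 1; multiplying by 100 gives the percentage scale on which a figure such as 97.0 is reported.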
 
Metadata
Title
Learning to Match Names Across Languages
Authors
Inderjeet Mani
Alex Yeh
Sherri Condon
Copyright year
2013
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-642-28569-1_3