2013 | OriginalPaper | Book chapter

3. Learning to Match Names Across Languages

Authors: Inderjeet Mani, Alex Yeh, Sherri Condon

Published in: Multi-source, Multilingual Information Extraction and Summarization

Publisher: Springer Berlin Heidelberg

Abstract

We report on research on matching names written in different scripts across languages. We explore two trainable approaches based on comparing pronunciations. The first, a cross-lingual approach, uses an automatic name-matching program that exploits rules derived from phonological comparisons of the two languages carried out by humans. The second, a monolingual approach, relies only on automatic comparison of the phonological representations of each pair. Alignments produced by each approach are fed to a machine learning algorithm. Results show that, with the monolingual approach, machine-learning-based comparison of person names in English and Chinese achieves an F-measure of over 97.0.
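The idea of comparing pronunciations rather than spellings can be illustrated with a minimal sketch. It assumes both names have already been converted to phoneme sequences, and it swaps in a plain Levenshtein distance over phoneme symbols in place of the alignment methods used in the chapter. The function names and the phoneme sequences are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where each unit is a phoneme symbol, not a character."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, pa in enumerate(a, start=1):
        curr = [i]  # deleting the first i phonemes of a
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Distance normalized to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - phoneme_edit_distance(a, b) / longest


# Hypothetical phoneme sequences for an English name and its Chinese rendering.
english = ["k", "l", "i", "n", "t", "o", "n"]
chinese = ["k", "e", "l", "i", "n", "d", "u", "n"]
score = phoneme_similarity(english, chinese)
```

A score like this could serve as one input feature to a classifier that decides whether two names match; a feature-based phonetic aligner would additionally give partial credit to similar sounds such as `t` and `d`, which this sketch treats as a full mismatch.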

www.informatica.com/solutions/identity_resolution_solution/Pages/index.aspx.

www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34.

webdocs.cs.ualberta.ca/~kondrak/aline1.1.zip.

For the MALINE row in Table 3.3, the ALINE documentation explains the notation as follows: “every phonetic symbol is represented by a single lowercase letter followed by zero or more uppercase letters. The initial lowercase letter is the base letter most similar to the sound represented by the phonetic symbol. The remaining uppercase letters stand for the feature modifiers which alter the sound defined by the base letter. By default, the output contains the alignments together with overall similarity scores. The aligned subsequences are delimited by ‘|’ signs. The ‘<’ sign signifies that the previous phonetic segment has been aligned with two segments in the other sequence, a case of compression/expansion. The ‘–’ sign denotes a “skip”, a case of insertion/deletion.”
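The notation quoted above can be read mechanically. The following is a minimal sketch of such a reader, not part of ALINE itself: it assumes that tokens inside the `|` delimiters are separated by whitespace, and the sample row is made up for illustration rather than taken from actual ALINE output. The names `parse_symbol` and `parse_aligned_row` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# One phonetic symbol: a lowercase base letter plus zero or more
# uppercase feature modifiers, as described in the quoted documentation.
SYMBOL = re.compile(r"[a-z][A-Z]*")
SKIP_SIGNS = {"-", "–"}  # accept both the ASCII hyphen and the en dash

Token = Tuple[str, Optional[Tuple[str, str]]]


def parse_symbol(token: str) -> Tuple[str, str]:
    """Split a symbol such as 'tS' into (base letter, modifiers)."""
    if not SYMBOL.fullmatch(token):
        raise ValueError(f"not a phonetic symbol: {token!r}")
    return token[0], token[1:]


def parse_aligned_row(row: str) -> List[Token]:
    """Classify the whitespace-separated tokens between the outer '|' signs."""
    start, end = row.index("|"), row.rindex("|")
    if start == end:
        raise ValueError("expected two '|' delimiters")
    tokens: List[Token] = []
    for tok in row[start + 1:end].split():
        if tok == "<":
            tokens.append(("expand", None))  # previous segment spans two
        elif tok in SKIP_SIGNS:
            tokens.append(("skip", None))    # insertion or deletion
        else:
            tokens.append(("symbol", parse_symbol(tok)))
    return tokens


# Made-up row for illustration only.
parsed = parse_aligned_row("| k l iN - tS < |")
```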

The Predictive Accuracy was computed with exactly half the test examples being positive.
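To make the role of the balanced test set concrete, here is a small sketch of the standard definitions of accuracy and balanced F-measure computed from confusion counts. The counts are invented for illustration and are not results from the chapter.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all test pairs that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts on a balanced test set: 100 positive and 100 negative pairs.
acc = accuracy(tp=90, fp=20, tn=80, fn=10)
f1 = f_measure(tp=90, fp=20, fn=10)

# With half the examples positive, labeling every pair a match
# gives an accuracy of exactly 0.5, which is the natural baseline.
baseline = accuracy(tp=100, fp=100, tn=0, fn=0)
```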

sourceforge.net/projects/carafe.

projects.ldc.upenn.edu/LCTL/.

Title: Learning to Match Names Across Languages
Authors: Inderjeet Mani, Alex Yeh, Sherri Condon
Publisher: Springer Berlin Heidelberg
Book: Multi-source, Multilingual Information Extraction and Summarization
Print ISBN: 978-3-642-28568-4
Electronic ISBN: 978-3-642-28569-1
Copyright year: 2013
DOI: https://doi.org/10.1007/978-3-642-28569-1_3