Discover Computing

Published: 01.06.2009

On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages

Authors: Jakub Piskorski, Karol Wieloch, Marcin Sydow

Published in: Discover Computing | Issue 3/2009

Abstract

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various NLP techniques for person name disambiguation in large web document collections has been reported, where the main focus was on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of the aforementioned techniques. Furthermore, we carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of on-line news articles, revealed that achieving lemmatization accuracy figures greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations were focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem for other highly inflectional languages with similar phenomena.

There are seven cases, two numbers and five genders.

The declension of such surnames depends on the local tradition and can sometimes be identical to the pattern used for common nouns.

This metric was used as an inner metric in the recursive metrics described later in this section since, as a stand-alone metric, it cannot accurately match multi-token strings.

Pairs in which the inflected form is identical to the base form have been excluded from the experiments, since finding an answer in such cases is straightforward.

The Jaccard coefficient is a statistic used for comparing the similarity and diversity of two sets. It is calculated as the size of the intersection of the sets divided by the size of the union of the two sets, i.e., \(Jaccard(A,B) = \frac{|A \cap B|}{|A \cup B|}\).
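The definition above can be illustrated with a minimal Python sketch. The helper `char_bigrams` and the choice of character bigrams as set elements are assumptions made for this illustration only; the note above does not state which sets are compared.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def char_bigrams(s):
    """Hypothetical helper: the set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


# A base form and an inflected form of a Polish surname, compared on bigrams.
score = jaccard(char_bigrams("kowalski"), char_bigrams("kowalskiego"))
print(score)  # 7 shared bigrams out of 10 distinct ones -> 0.7
```

With this representation, an inflected form that only adds a suffix keeps all bigrams of the base form, so the score drops only by the share of new suffix bigrams.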

Point-wise mutual information (PMI) is a measure of association used in information theory and statistics. The PMI of a pair of outcomes x and y belonging to discrete random variables quantifies the discrepancy between the probability of their coincidence given their joint distribution versus the probability of their coincidence given only their individual distributions and assuming independence. Formally, \(PMI(x,y) = \log_{2}\frac{p(x,y)}{p(x)\cdot p(y)}\)
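As a small worked illustration of the formula, the following Python sketch computes PMI directly from given probabilities. The function name `pmi` and the sample probabilities are assumptions chosen for the example and are not taken from the article.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Point-wise mutual information log2(p(x,y) / (p(x) * p(y)))."""
    if p_xy <= 0 or p_x <= 0 or p_y <= 0:
        # The logarithm is undefined for zero or negative probabilities.
        raise ValueError("all probabilities must be strictly positive")
    return math.log2(p_xy / (p_x * p_y))


# Independent outcomes: the joint probability equals the product, so PMI is 0.
print(pmi(0.25, 0.5, 0.5))  # 0.0

# Outcomes that always co-occur: the joint probability exceeds the product.
print(pmi(0.5, 0.5, 0.5))   # 1.0
```

A positive value indicates that the two outcomes co-occur more often than independence would predict, a negative value that they co-occur less often.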

Title: On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages
Authors: Jakub Piskorski, Karol Wieloch, Marcin Sydow
Publication date: 01.06.2009
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 3/2009
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-008-9085-5
