Published in: Discover Computing 3/2009

01.06.2009

On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages

Authors: Jakub Piskorski, Karol Wieloch, Marcin Sydow

Abstract

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work has been reported on applying various NLP techniques to person name disambiguation in large web document collections, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of these techniques. We also carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of online news articles, revealed that achieving lemmatization accuracy greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem in other highly inflectional languages with similar phenomena.

Footnotes
1
There are seven cases, two numbers and five genders.
 
2
The declension of such surnames depends on local tradition and can sometimes be identical to the pattern used for common nouns.
 
3
This metric was used as an inner metric in the recursive metrics described later in this section, since as a stand-alone metric it cannot accurately match multi-token strings.
 
4
Pairs in which the inflected form is identical to the base form have been excluded from the experiments, since in such cases finding the answer is straightforward.
 
5
The Jaccard coefficient is a statistic used for comparing the similarity and diversity of two sets. It is calculated as the size of the intersection of the sets divided by the size of the union of the two sets, i.e., \(Jaccard(A,B) = \frac{|A \cap B|}{|A \cup B|}\).
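As a rough illustration of the formula above, the following Python sketch computes the coefficient for two small sets. The function name `jaccard`, the sample context words, and the convention of returning 1.0 for two empty sets are assumptions made for this example and are not taken from the article.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical context words surrounding two mentions of a name.
context_a = ["premier", "rząd", "minister"]
context_b = ["premier", "rząd", "sejm", "ustawa"]

# Two shared words out of five distinct words overall.
score = jaccard(context_a, context_b)
print(score)
```

Converting both inputs to sets first means that repeated words in a context do not affect the result, which matches the set-based definition in the footnote.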
 
6
Point-wise mutual information (PMI) is a measure of association used in information theory and statistics. The PMI of a pair of outcomes x and y belonging to discrete random variables quantifies the discrepancy between the probability of their coincidence given their joint distribution and the probability of their coincidence given only their individual distributions, assuming independence. Formally, \(PMI(x,y) = \log_{2}\frac{p(x,y)}{p(x)\cdot p(y)}\).
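The definition can be sketched in Python as follows. The helper names `pmi` and `pmi_from_counts`, the estimation of probabilities as relative frequencies, and the toy counts are assumptions for illustration only; the article does not specify how the probabilities were estimated.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Point-wise mutual information in bits: log2(p(x,y) / (p(x) * p(y)))."""
    if p_xy <= 0 or p_x <= 0 or p_y <= 0:
        # The logarithm is undefined for zero or negative probabilities.
        raise ValueError("all probabilities must be positive")
    return math.log2(p_xy / (p_x * p_y))


def pmi_from_counts(n_xy, n_x, n_y, n_total):
    """Estimate PMI from co-occurrence counts using relative frequencies."""
    return pmi(n_xy / n_total, n_x / n_total, n_y / n_total)


# Toy counts (hypothetical): x occurs in 2 of 8 contexts, y in 4 of 8,
# and they occur together in 2 of 8.
value = pmi_from_counts(n_xy=2, n_x=2, n_y=4, n_total=8)
print(value)
```

A value of 0 corresponds to independence (the joint probability equals the product of the marginals), positive values indicate that the outcomes co-occur more often than independence would predict, and negative values indicate the opposite.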
 
Metadata
Title
On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages
Authors
Jakub Piskorski
Karol Wieloch
Marcin Sydow
Publication date
01.06.2009
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2009
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-008-9085-5
