
Robust hybrid name disambiguation framework for large databases

Scientometrics

Abstract

In many databases, such as scientific bibliography databases, the name attribute is the most commonly chosen identifier for entities. However, names are often ambiguous and not always unique, which causes problems in many fields. Name disambiguation is a non-trivial data management task that aims to distinguish different entities that share the same name. It is particularly difficult for large databases such as digital libraries, where only limited information is available to identify an author. In digital libraries, ambiguous author names arise when multiple authors share the same name or when the same person appears under different name variations. Most previous work on this problem employs hierarchical clustering based on information inside the citation records, e.g. co-authors and publication titles. In this paper, we propose a robust hybrid name disambiguation framework that is applicable to digital libraries and can also be easily extended to other applications based on different data sources. We propose a web page genre identification component that identifies the genre of a web page, e.g. whether the page is a personal homepage. In addition, we propose a re-clustering model based on multidimensional scaling that can further improve name disambiguation performance. We evaluated our approach on known corpora, and the favorable experimental results indicate that the proposed framework is feasible.
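To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of hierarchical (single-link) clustering over citation records using co-author and title-word overlap. It is not the framework proposed in the paper: the record layout, the equal weighting of the two features, the Jaccard measure, the threshold value, and all names and data below are assumptions chosen for illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def record_similarity(r1, r2):
    """Equal-weight average of co-author and title-word overlap (assumed weighting)."""
    coauthor_sim = jaccard(r1["coauthors"], r2["coauthors"])
    title_sim = jaccard(r1["title_words"], r2["title_words"])
    return 0.5 * (coauthor_sim + title_sim)


def single_link_clusters(records, threshold):
    """Agglomerative single-link clustering with a similarity cut-off.

    Starts with one cluster per record and repeatedly merges the two
    clusters whose most similar pair of records scores highest, stopping
    once that score falls below `threshold`.
    """
    clusters = [[i] for i in range(len(records))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            # Single link: cluster similarity is that of the closest pair.
            sim = max(
                record_similarity(records[a], records[b])
                for a in ci
                for b in cj
            )
            if sim > best_sim:
                best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair  # i < j, so deleting index j leaves index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical citation records that all carry the same ambiguous author name.
records = [
    {"coauthors": {"A. Smith", "B. Jones"},
     "title_words": {"name", "disambiguation", "clustering"}},
    {"coauthors": {"A. Smith"},
     "title_words": {"name", "disambiguation", "web"}},
    {"coauthors": {"C. Lee"},
     "title_words": {"protein", "folding"}},
]

clusters = single_link_clusters(records, threshold=0.3)
print(clusters)
```

Each resulting cluster is a list of record indices assumed to belong to one real-world author. A quadratic pairwise scan like this is only practical for small groups of records sharing a name; the threshold controls how aggressively records are merged.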


Notes

  1. http://www.informatik.uni-trier.de/~ley/db/.

  2. http://clusty.com/.

  3. http://www.powerset.com.

  4. http://en.wikipedia.org/wiki/Wiki.

  5. http://neuroph.sourceforge.net/.

  6. http://www.cs.waikato.ac.nz/ml/weka/.

  7. http://code.google.com/apis/ajaxsearch/.


Author information


Corresponding author

Correspondence to Jia Zhu.


About this article

Cite this article

Zhu, J., Yang, Y., Xie, Q. et al. Robust hybrid name disambiguation framework for large databases. Scientometrics 98, 2255–2274 (2014). https://doi.org/10.1007/s11192-013-1151-0


  • DOI: https://doi.org/10.1007/s11192-013-1151-0
