Robust hybrid name disambiguation framework for large databases

Zhu, Jia; Yang, Yi; Xie, Qing; Wang, Liwei; Hassan, Saeed-Ul

doi:10.1007/s11192-013-1151-0


Published: 26 October 2013

Volume 98, pages 2255–2274, (2014)

Scientometrics


Abstract

In many databases, such as scientific bibliography databases, the name attribute is the most commonly chosen identifier for entities. However, names are often ambiguous and not always unique, which causes problems in many fields. Name disambiguation is a non-trivial data management task that aims to distinguish different entities sharing the same name. It is particularly difficult in large databases such as digital libraries, where only limited information is available to identify authors. In digital libraries, ambiguous author names arise when multiple authors share the same name or when one person appears under different name variations. Most previous work on this problem employs hierarchical clustering based on information inside the citation records, e.g. co-authors and publication titles. In this paper, we propose a robust hybrid name disambiguation framework that is applicable to digital libraries and can also be easily extended to other applications based on different data sources. We propose a web page genre identification component to identify the genre of a web page, e.g. whether the page is a personal homepage. In addition, we propose a re-clustering model based on multidimensional scaling that can further improve name disambiguation performance. We evaluated our approach on known corpora, and the favorable experimental results indicate that the proposed framework is feasible.
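As a rough illustration of the hierarchical clustering baseline that the abstract attributes to previous work, the sketch below groups citation records by co-author and title overlap using single-link agglomerative clustering. It is not the framework proposed in the paper: the record fields, the Jaccard similarity, the weights `w_coauthor` and `w_title`, the `threshold` value and the sample records are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def record_similarity(r1, r2, w_coauthor=0.7, w_title=0.3):
    """Weighted overlap of co-authors and title words (weights are assumed)."""
    co = jaccard(set(r1["coauthors"]), set(r2["coauthors"]))
    ti = jaccard(set(r1["title"].lower().split()),
                 set(r2["title"].lower().split()))
    return w_coauthor * co + w_title * ti


def single_link_clusters(records, threshold=0.2):
    """Repeatedly merge the two clusters whose closest members are most
    similar, stopping when no pair reaches the (assumed) threshold."""
    clusters = [{i} for i in range(len(records))]
    while True:
        best, best_pair = threshold, None
        for (x, cx), (y, cy) in combinations(enumerate(clusters), 2):
            # Single link: cluster similarity is that of the closest pair.
            sim = max(record_similarity(records[i], records[j])
                      for i in cx for j in cy)
            if sim >= best:
                best, best_pair = sim, (x, y)
        if best_pair is None:
            return sorted(sorted(c) for c in clusters)
        x, y = best_pair
        clusters[x] |= clusters[y]
        del clusters[y]


# Hypothetical records that all carry the same ambiguous author name.
records = [
    {"title": "Spectral clustering for citation data",
     "coauthors": ["A. Kumar", "B. Lee"]},
    {"title": "Clustering citation records at scale",
     "coauthors": ["B. Lee", "C. Diaz"]},
    {"title": "Protein folding with neural networks",
     "coauthors": ["D. Osei"]},
]
clusters = single_link_clusters(records)
print(clusters)
```

Each resulting cluster is read as one real-world author. Here the first two records share a co-author and two title words, so they are merged, while the third stays separate.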



Author information

Authors and Affiliations

School of Computer Science, South China Normal University, Guangzhou, China
Jia Zhu
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Yi Yang
Division of CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Qing Xie
Wuhan University, Wuhan, China
Liwei Wang
COMSATS Institute of Information Technology, Lahore, Pakistan
Saeed-Ul Hassan


Corresponding author

Correspondence to Jia Zhu.


About this article

Cite this article

Zhu, J., Yang, Y., Xie, Q. et al. Robust hybrid name disambiguation framework for large databases. Scientometrics 98, 2255–2274 (2014). https://doi.org/10.1007/s11192-013-1151-0


Received: 09 September 2013
Published: 26 October 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s11192-013-1151-0
