Author name disambiguation using a graph model with node splitting and merging based on bibliographic information

Scientometrics

Abstract

Author ambiguity arises mainly in two situations: when several different authors write their names in the same way, generally known as the namesake problem, and when the name of one author is written in many different ways, referred to as the heteronymous name problem. These author ambiguity problems have long been an obstacle to efficient information retrieval in digital libraries, causing incorrect identification of authors and impeding correct classification of their publications. Distinguishing such authors is a nontrivial task, especially when very little information about them is available. In this paper, we propose a graph-based approach to author name disambiguation, in which a graph model is constructed from co-author relations and author ambiguity is resolved by graph operations, namely vertex (or node) splitting and merging based on co-authorship. In our framework, called a Graph Framework for Author Disambiguation (GFAD), the namesake problem is solved by splitting an author vertex involved in multiple cycles of co-authorship, and the heteronymous name problem is handled by merging multiple author vertices having similar names if those vertices are connected to a common vertex. Experiments were carried out on the real DBLP and Arnetminer collections, and the performance of GFAD was compared with that of three representative unsupervised author name disambiguation systems. The results confirm that GFAD achieves better overall performance on representative evaluation metrics. As an additional contribution, we have released the refined DBLP collection to the public to support a performance benchmark for future author disambiguation systems.
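
The two graph operations named in the abstract can be illustrated with a minimal Python sketch. It is not GFAD's actual procedure, which this page does not spell out. It assumes that (a) co-authors of a vertex who remain connected to each other once that vertex is removed lie on a shared co-authorship cycle and so belong to the same real person, and (b) name similarity is measured with `difflib.SequenceMatcher` against a hypothetical threshold of 0.8. The function names, the toy papers, and the `#1`/`#2` clone labels are all invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def build_graph(papers):
    """Build an undirected co-authorship graph: name -> set of co-author names."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set()).update(x for x in authors if x != a)
    return graph


def split_namesake(graph, v):
    """Split v into one clone per group of co-authors that stay connected
    to each other without passing through v (assumption: such a group
    shares co-authorship cycles through v)."""
    neighbours = set(graph[v])
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first search that never visits v
            node = stack.pop()
            for nxt in graph[node]:
                if nxt != v and nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        seen |= component
        groups.append(component & neighbours)
    if len(groups) < 2:
        return graph  # a single group: nothing to split
    new_graph = {u: set(nbrs) - {v} for u, nbrs in graph.items() if u != v}
    for i, group in enumerate(groups, start=1):
        clone = f"{v}#{i}"  # hypothetical label for the i-th split vertex
        new_graph[clone] = set(group)
        for u in group:
            new_graph[u].add(clone)
    return new_graph


def merge_heteronyms(graph, threshold=0.8):
    """Merge similarly named vertices that share at least one neighbour.
    The threshold and the similarity measure are assumptions."""
    merged = {u: set(nbrs) for u, nbrs in graph.items()}
    for a, b in combinations(sorted(graph), 2):
        if a not in merged or b not in merged:
            continue  # one of the two was already absorbed
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold and merged[a] & merged[b]:
            for u in merged.pop(b):  # redirect b's edges to a
                merged[u].discard(b)
                if u != a:
                    merged[u].add(a)
                    merged[a].add(u)
    return merged


# Toy data (invented): "Wei Wang" has two unrelated co-author groups,
# and "Jon Smith" / "John Smith" share the co-author "Eve Park".
papers = [
    ["Wei Wang", "Ann Lee", "Bo Chen"],
    ["Wei Wang", "Carl Diaz", "Dana Roy"],
    ["Jon Smith", "Eve Park"],
    ["John Smith", "Eve Park"],
]
graph = build_graph(papers)
split = split_namesake(graph, "Wei Wang")
merged = merge_heteronyms(graph)
```

In this toy run, `split` holds two separate "Wei Wang" vertices, one per co-author group, and `merged` keeps a single Smith vertex. Keeping the alphabetically first spelling is an arbitrary choice made for the sketch. A real system would also need a guard against splitting one author who simply works with two unrelated groups.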

Notes

  1. http://dblp.uni-trier.de/.

  2. http://citeseer.ist.psu.edu/.

  3. http://www.ncbi.nlm.nih.gov/pubmed.

  4. http://www.lbd.dcc.ufmg.br/bdbcomp/.

  5. http://arnetminer.org/.

  6. By a citation record, we mean a set of bibliographic attributes containing the author names, paper title, and publication venue of a particular publication.

  7. GFAD also relies on the paper title in addition to co-authorship, but the title is used only at the outlier removal step, if necessary, to meet the specific objectives of the system.

  8. http://meta.wikimedia.org/wiki/WikiAuthors.

  9. http://www.paritycomputing.com/web/index.html.

  10. http://info.scival.com/experts.

  11. Isolated vertices can arise during the graph construction process and/or after the namesake resolution process.

  12. To maximize the possibility of selecting different name variations denoting the same person, while minimizing the chance of judging similar names denoting different persons as the same person, suitable threshold values must first be determined manually. We therefore determined the threshold value empirically by experimenting with 200 randomly collected name pairs, comprising 100 pairs of name variations and 100 pairs of similar names.

  13. To measure and analyze the ratios of the occurrence frequencies of the three failure cases, we randomly selected 18 ambiguous groups from the Arnetminer collection.

  14. http://ieeexplore.ieee.org/Xplore/home.jsp.

  15. http://www.paritycomputing.com/web/index.html.

Author information

Corresponding author

Correspondence to Jungsun Kim.

About this article

Cite this article

Shin, D., Kim, T., Choi, J. et al. Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics 100, 15–50 (2014). https://doi.org/10.1007/s11192-014-1289-4

  • DOI: https://doi.org/10.1007/s11192-014-1289-4
