Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation

Scientometrics

Abstract

The disambiguation of author names is an important and challenging task in bibliometrics. We propose an approach that relies on an external source of information for selecting and validating clusters of publications identified through an unsupervised author name disambiguation method. The application of the proposed approach to a random sample of Italian scholars shows encouraging results, with an overall precision, recall, and F-measure of over 96%. The proposed approach can serve as a starting point for a large-scale census of publication portfolios for bibliometric analyses at the level of individual researchers.
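To make the reported metrics concrete, the sketch below computes precision, recall, and F-measure for a single author by comparing the publications assigned by a disambiguation method with a manually verified set. It uses only the standard set-based definitions. The function name and the publication identifiers are hypothetical, and this preview does not state how the authors aggregate scores across researchers, so the code should not be read as their evaluation protocol.

```python
def precision_recall_f1(assigned, truth):
    """Set-based precision, recall and F-measure for one author's portfolio.

    assigned: publication IDs attributed to the author by the method
    truth:    publication IDs that really belong to the author
    """
    assigned, truth = set(assigned), set(truth)
    true_positives = len(assigned & truth)
    # Precision: share of assigned publications that are correct.
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: share of the author's true publications that were found.
    recall = true_positives / len(truth) if truth else 0.0
    # F-measure: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical publication IDs, for illustration only.
assigned = {"p1", "p2", "p3", "p4"}   # "p4" is a false positive
truth = {"p1", "p2", "p3", "p5"}      # "p5" is a false negative
p, r, f = precision_recall_f1(assigned, truth)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

A false positive (a homonym's paper wrongly assigned) lowers precision, while a false negative (a missed paper) lowers recall; the F-measure penalises both.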

Notes

  1. http://cercauniversita.cineca.it/php5/docenti/cerca.php, last accessed 20/09/2019.

  2. The complete list is accessible at attiministeriali.miur.it/userfiles/115.htm, last accessed 20/09/2019.

  3. Note that this filtering stage differs from that in the original DGA method: in the original DGA method, author-identity pairs are filtered; in the proposed approach, complete clusters are filtered.

  4. The percent sign (%) wildcard retrieves any name that starts with the text preceding the sign.

  5. In fact, these publications are also ignored by the DGA algorithm, which is applied only to articles indexed in the Italian National Citation Report.

  6. The reader may wonder why a “manual validation” is performed in an approach proposed for “large scale” author name disambiguation. As discussed in more detail below, this scenario is presented only to clarify the trade-off between its costs and benefits and those of the less costly alternative scenarios, which involve no manual validation.

  7. All Italian university research staff hold an ORCID identifier, following the IRIDE project launched in 2014 by the MIUR.

  8. We have excluded documents published in a year in which the relevant author was not a tenured professor in the Italian academic system.

  9. Baseline 1 is a simple method often performed by scholars in practice. Given the high share of potential homonyms (29% as shown in Table 5), we expect a low level of precision when applying such a method. Baseline 2 should solve most homonym cases but could lead to a low level of recall due to an increased number of false negatives.
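Note 4 describes a prefix search with the percent sign wildcard. The following is a minimal sketch of that behaviour, assuming an SQL-style LIKE operator; this preview does not say which database or query interface the authors used. The table, column, and author names are invented for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table of author name strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (name TEXT)")
conn.executemany(
    "INSERT INTO authors (name) VALUES (?)",
    [
        ("Rossi, M",),
        ("Rossi, Mario",),
        ("Rossi, Maria",),
        ("Rossini, M",),
        ("Russo, M",),
    ],
)

# '%' matches any sequence of characters (including none), so the pattern
# retrieves every name that starts with the text preceding the sign.
rows = conn.execute(
    "SELECT name FROM authors WHERE name LIKE ? ORDER BY name",
    ("Rossi, M%",),
).fetchall()
matches = [name for (name,) in rows]
print(matches)
conn.close()
```

The pattern retrieves both the initial-only and the full-forename variants of the same surname, which is also why such a query can pull in homonyms that a later disambiguation step has to separate.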

References

  • Abdulhayoglu, M. A., & Thijs, B. (2017). Use of ResearchGate and Google CSE for author name disambiguation. Scientometrics, 111(3), 1965–1985.

  • Aksnes, D. W. (2008). When different persons have an identical author name. How frequent are homonyms? Journal of the American Society for Information Science and Technology, 59(5), 838–841.

  • Backes, T. (2018). Effective unsupervised author disambiguation with relative frequencies. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries (pp. 203–212). New York, NY: ACM.

  • Caron, E., & van Eck, N. J. (2014). Large scale author name disambiguation using rule-based scoring and clustering. In E. Noyons (Ed.), 19th international conference on science and technology indicators. “Context counts: Pathways to master big data and little data” (pp. 79–86). Leiden: CWTS-Leiden University.

  • Chinchilla-Rodríguez, Z., Bu, Y., Robinson-García, N., Costas, R., & Sugimoto, C. R. (2018a). Travel bans and scientific mobility: Utility of asymmetry and affinity indexes to inform science policy. Scientometrics, 116(1), 569–590.

  • Chinchilla-Rodríguez, Z., Miao, L., Murray, D., Robinson-García, N., Costas, R., & Sugimoto, C. R. (2018b). A global comparison of scientific mobility and collaboration according to national scientific capacities. Frontiers in Research Metrics and Analytics, 3, 17.

  • Cornell, L. L. (1982). Duplication of Japanese names: A problem in citations and bibliographies. Journal of the American Society for Information Science and Technology, 33(2), 102–104.

  • Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9), 1853–1870.

  • Culotta, A., Kanani, P., Hall, R., Wick, M., & McCallum, A. (2007). Author disambiguation using error-driven machine learning with a ranking loss function. In Proceedings of the 6th international workshop on information integration on the web (IIWeb 2007) (pp. 32–37). Menlo Park, CA: AAAI Press.

  • D’Angelo, C. A., Giuffrida, C., & Abramo, G. (2011). A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments. Journal of the American Society for Information Science and Technology, 62(2), 257–269.

  • Enserink, M. (2009). Are you ready to become a number? Science, 323(5922), 1662–1664.

  • Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2012). A brief survey of automatic methods for author name disambiguation. ACM SIGMOD Record, 41(2), 15–26.

  • Ferreira, A. A., Veloso, A., Gonçalves, M. A., & Laender, A. H. F. (2010). Effective self-training author name disambiguation in scholarly digital libraries. In Proceedings of the 2010 ACM/IEEE joint conference on digital libraries (pp. 39–48). New York, NY: ACM.

  • Haak, L. L., Fenner, M., Paglione, L., Pentz, E., & Ratner, H. (2012). ORCID: A system to uniquely identify researchers. Learned Publishing, 25(4), 259–264.

  • Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the 4th ACM/IEEE-CS joint conference on digital libraries (JCDL 2004) (pp. 296–305). New York, NY: ACM.

  • Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries (JCDL 2005) (pp. 334–343). New York, NY: ACM.

  • Harman, G. (2000). Allocating research infrastructure grants in post-binary higher education systems: British and Australian approaches. Journal of Higher Education Policy and Management, 22(2), 111–126.

  • Hicks, D. (2009). Evolving regimes of multi-university research evaluation. Higher Education, 57(4), 393–404.

  • Hjørland, B. (2010). The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology, 61(2), 217–237.

  • Huang, J., Ertekin, S., & Giles, C. (2006). Efficient name disambiguation for large-scale databases. In Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases (PKDD 2006) (pp. 536–544). Berlin: Springer.

  • Huang, S., Yang, B., Yan, S., & Rousseau, R. (2014). Institution name disambiguation for research assessment. Scientometrics, 99(3), 823–838.

  • Kanani, P., McCallum, A., & Pal, C. (2007). Improving author coreference by resource-bounded information gathering from the web. In Proceedings of the 20th international joint conference on artificial intelligence (pp. 429–434). San Francisco, CA: Morgan Kaufmann Publishers Inc.

  • Kang, I.-S., Na, S.-H., Lee, S., Jung, H., Kim, P., Sung, W.-K., et al. (2009). On co-authorship for author disambiguation. Information Processing and Management, 45(1), 84–97.

  • Kawashima, H., & Tomizawa, H. (2015). Accuracy evaluation of Scopus author ID based on the largest funding database in Japan. Scientometrics, 103(3), 1061–1071.

  • Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886.

  • Kim, J., & Kim, J. (2019). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24298.

  • Kim, J., Kim, J., & Owen-Smith, J. (2019). Generating automatically labeled data for author name disambiguation: An iterative clustering method. Scientometrics, 118(1), 253–280.

  • Larivière, V., & Costas, R. (2016). How many is too many? On the relationship between research productivity and impact. PLoS ONE, 11(9), e0162709.

  • Larivière, V., Desrochers, N., Macaluso, B., Mongeon, P., Paul-Hus, A., & Sugimoto, C. R. (2016). Contributorship and division of labor in knowledge production. Social Studies of Science, 46(3), 417–435.

  • Levin, M., Krawczyk, S., Bethard, S., & Jurafsky, D. (2012). Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology, 63(5), 1030–1047.

  • Liu, W., Doǧan, R. I., Kim, S., Comeau, D. C., Kim, W., Yeganova, L., et al. (2014). Author name disambiguation for PubMed. Journal of the Association for Information Science and Technology, 65(4), 765–781.

  • Mazov, N. A., & Gureev, V. N. (2014). The role of unique identifiers in bibliographic information systems. Scientific and Technical Information Processing, 41(3), 206–210.

  • Morillo, F., Santabárbara, I., & Aparicio, J. (2013). The automatic normalisation challenge: Detailed addresses identification. Scientometrics, 95(3), 953–966.

  • Müller, M., Reitz, F., & Roy, N. (2017). Data sets for author name disambiguation: An empirical analysis and a new resource. Scientometrics, 111(3), 1467–1500.

  • On, B., Lee, D., Kang, J., & Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries (JCDL 2005) (pp. 344–353). New York, NY: ACM.

  • Palmblad, M., & Van Eck, N. J. (2018). Bibliometric analyses reveal patterns of collaboration between ASMS members. Journal of the American Society for Mass Spectrometry, 29(3), 447–454.

  • Pereira, D. A., Ribeiro-Neto, B. A., Ziviani, N., Laender, A. H. F., Gonçalves, M. A., & Ferreira, A. A. (2009). Using web information for author name disambiguation. In Proceedings of the 2009 ACM/IEEE joint conference on digital libraries (pp. 49–58). New York, NY: ACM.

  • Robinson-Garcia, N., Sugimoto, C. R., Murray, D., Yegros-Yegros, A., Larivière, V., & Costas, R. (2019). The many faces of mobility: Using bibliometric data to measure the movement of scientists. Journal of Informetrics, 13(1), 50–63.

  • Ruiz-Castillo, J., & Costas, R. (2014). The skewness of scientific productivity. Journal of Informetrics, 8(4), 917–934.

  • Schulz, J. (2016). Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses. Scientometrics, 107(3), 1283–1298.

  • Schulz, C., Mazloumian, A., Petersen, A. M., Penner, O., & Helbing, D. (2014). Exploiting citation networks for large-scale author name disambiguation. EPJ Data Science, 3(1), 11.

  • Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 43, 1–43.

  • Soler, J. (2007). Separating the articles of authors with the same name. Scientometrics, 72(2), 281–290.

  • Song, Y., Huang, J., Councill, I. G., Li, J., & Giles, C. L. (2007). Efficient topic-based unsupervised name disambiguation. In Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries (JCDL 2007) (pp. 342–351). New York, NY: ACM.

  • Song, M., Kim, E. H. J., & Kim, H. J. (2015). Exploring author name disambiguation on PubMed-scale. Journal of Informetrics, 9(4), 924–941.

  • Strotmann, A., & Zhao, D. Z. (2012). Author name disambiguation: What difference does it make in author-based citation analysis? Journal of the American Society for Information Science and Technology, 63(9), 1820–1833.

  • Sugimoto, C. R., Robinson-García, N., Murray, D. S., Yegros-Yegros, A., Costas, R., & Larivière, V. (2017). Scientists have most impact when they’re free to move. Nature, 550(7674), 29–31.

  • Sun, X., Kaur, J., Possamai, L., & Menczer, F. (2013). Ambiguous author query detection using crowdsourced digital library annotations. Information Processing and Management, 49(2), 454–464.

  • Tekles, A., & Bornmann, L. (2019). Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. arXiv:1904.12746.

  • Tijssen, R. J. W., & Yegros, A. (2017). Brexit: UK universities and European industry (Correspondence). Nature, 544(7648), 35.

  • Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. In Proceedings of the 2009 ACM/IEEE joint conference on digital libraries (pp. 39–48). New York, NY: ACM.

  • Veloso, A., Ferreira, A. A., Gonçalves, M. A., Laender, A. H., & Meira, W., Jr. (2012). Cost-effective on-demand associative author name disambiguation. Information Processing and Management, 48(4), 680–697.

  • Yang, K.-H., Peng, H.-T., Jiang, J.-Y., Lee, H.-M., & Ho, J.-M. (2008). Author name disambiguation for citations using topic and web correlation. In Proceedings of the 12th European conference on research and advanced technology for digital libraries (pp. 185–196). Berlin: Springer.

  • Youtie, J., Carley, S., Porter, A. L., & Shapira, P. (2017). Tracking researchers and their outputs: New insights from ORCIDs. Scientometrics, 113(1), 437–453.

Author information

Corresponding author

Correspondence to Ciriaco Andrea D’Angelo.

About this article

Cite this article

D’Angelo, C.A., van Eck, N.J. Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation. Scientometrics 123, 883–907 (2020). https://doi.org/10.1007/s11192-020-03410-y

  • DOI: https://doi.org/10.1007/s11192-020-03410-y
