Abstract
The disambiguation of author names is an important and challenging task in bibliometrics. We propose an approach that relies on an external source of information for selecting and validating clusters of publications identified through an unsupervised author name disambiguation method. Applying the proposed approach to a random sample of Italian scholars yields encouraging results, with overall precision, recall, and F-measure all above 96%. The proposed approach can serve as a starting point for a large-scale census of publication portfolios for bibliometric analyses at the level of individual researchers.
Notes
http://cercauniversita.cineca.it/php5/docenti/cerca.php, last accessed 20/09/2019.
The complete list is accessible at attiministeriali.miur.it/userfiles/115.htm, last accessed 20/09/2019.
Note that this filtering stage differs from that of the original DGA method: there, author-identity pairs are filtered, whereas in the proposed approach complete clusters are filtered.
The percent sign (%) wildcard retrieves any name starting with the text that precedes the sign.
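The wildcard behaviour can be illustrated with a minimal sketch. It assumes SQL LIKE-style semantics in which a single trailing % matches any sequence of characters, so the test reduces to a case-insensitive prefix match. The function name and the sample name strings are hypothetical and are not taken from the paper.

```python
def matches_wildcard(pattern: str, name: str) -> bool:
    """Return True if `name` matches `pattern`.

    Assumption: only a single trailing '%' is supported (meaning "any suffix"),
    and matching is case-insensitive.
    """
    pattern, name = pattern.casefold(), name.casefold()
    if pattern.endswith("%"):
        # Everything before the '%' must be a prefix of the name.
        return name.startswith(pattern[:-1])
    # Without a wildcard, require an exact (case-insensitive) match.
    return name == pattern


# Hypothetical author-name strings, for illustration only.
candidates = ["Rossi, M", "Rossi, Mario", "Rossi, Maria", "Rossini, M", "Russo, M"]
hits = [n for n in candidates if matches_wildcard("Rossi, M%", n)]
print(hits)
```

A query of this kind is deliberately permissive: it retrieves every forename variant that shares the prefix, which is why homonyms have to be resolved at a later stage.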
In fact, these publications are also ignored by the DGA algorithm, which is applied only to articles indexed in the Italian National Citation Report.
The reader may wonder why a “manual validation” is performed in an approach proposed for “large-scale” author name disambiguation. As discussed below, this scenario is included only to assess the trade-off between its costs and benefits and those of the less costly alternative scenarios, which involve no manual validation.
All Italian university research staff hold an ORCID identifier, following the IRIDE project launched in 2014 by the MIUR.
We have excluded documents published in a year in which the relevant author was not a tenured professor in the Italian academic system.
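The exclusion rule can be sketched as a simple year filter. This is only an illustration under stated assumptions: the `Publication` record, the `keep_tenured_years` helper, and the sample years are hypothetical, and the set of tenured years is assumed to be known from an external source.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Publication:
    """Hypothetical minimal publication record."""
    title: str
    year: int


def keep_tenured_years(pubs: Iterable[Publication], tenured_years: Set[int]) -> List[Publication]:
    """Keep only publications whose year falls in the author's tenured years."""
    return [p for p in pubs if p.year in tenured_years]


# Hypothetical author, tenured from 2012 to 2014 inclusive.
tenured_years = set(range(2012, 2015))
portfolio = [
    Publication("Paper A", 2010),
    Publication("Paper B", 2012),
    Publication("Paper C", 2014),
    Publication("Paper D", 2016),
]
kept = keep_tenured_years(portfolio, tenured_years)
print([p.title for p in kept])
```

Representing tenure as a set of years, rather than a single start and end date, also covers careers with gaps.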
Baseline 1 is a simple method often used by scholars in practice. Given the high share of potential homonyms (29%, as shown in Table 5), we expect a low level of precision when applying such a method. Baseline 2 should resolve most homonym cases but could lead to a low level of recall due to an increased number of false negatives.
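The expected trade-off between the two baselines can be made concrete with a small sketch. It assumes the F-measure is the balanced harmonic mean of precision and recall (F1). The counts below are invented for illustration and are not results from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from true positives, false positives
    and false negatives. Undefined ratios are returned as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: a permissive name-only match admits many homonyms
# (false positives), lowering precision.
permissive = precision_recall_f1(tp=90, fp=60, fn=10)

# Invented counts: a stricter match removes most homonyms but misses
# genuine publications (false negatives), lowering recall.
strict = precision_recall_f1(tp=60, fp=5, fn=40)

for label, (p, r, f) in (("permissive", permissive), ("strict", strict)):
    print(f"{label}: precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

With these invented counts the two strategies reach a similar F1 by different routes, which is why precision and recall are worth reporting separately.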
References
Abdulhayoglu, M. A., & Thijs, B. (2017). Use of ResearchGate and Google CSE for author name disambiguation. Scientometrics, 111(3), 1965–1985.
Aksnes, D. W. (2008). When different persons have an identical author name. How frequent are homonyms? Journal of the American Society for Information Science and Technology, 59(5), 838–841.
Backes, T. (2018). Effective unsupervised author disambiguation with relative frequencies. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries (pp. 203–212). New York, NY: ACM.
Caron, E., & van Eck, N. J. (2014). Large scale author name disambiguation using rule-based scoring and clustering. In E. Noyons (Ed.), 19th international conference on science and technology indicators. “Context counts: Pathways to master big data and little data” (pp. 79–86). Leiden: CWTS-Leiden University.
Chinchilla-Rodríguez, Z., Bu, Y., Robinson-García, N., Costas, R., & Sugimoto, C. R. (2018a). Travel bans and scientific mobility: Utility of asymmetry and affinity indexes to inform science policy. Scientometrics, 116(1), 569–590.
Chinchilla-Rodríguez, Z., Miao, L., Murray, D., Robinson-García, N., Costas, R., & Sugimoto, C. R. (2018b). A global comparison of scientific mobility and collaboration according to national scientific capacities. Frontiers in Research Metrics and Analytics, 3, 17.
Cornell, L. L. (1982). Duplication of Japanese names: A problem in citations and bibliographies. Journal of the American Society for Information Science and Technology, 33(2), 102–104.
Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9), 1853–1870.
Culotta, A., Kanani, P., Hall, R., Wick, M., & McCallum, A. (2007). Author disambiguation using error-driven machine learning with a ranking loss function. In Proceedings of the 6th international workshop on information integration on the web (IIWeb 2007) (pp. 32–37). Menlo Park, CA: AAAI Press.
D’Angelo, C. A., Giuffrida, C., & Abramo, G. (2011). A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments. Journal of the American Society for Information Science and Technology, 62(2), 257–269.
Enserink, M. (2009). Are you ready to become a number? Science, 323(5922), 1662–1664.
Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2012). A brief survey of automatic methods for author name disambiguation. ACM SIGMOD Record, 41(2), 15–26.
Ferreira, A. A., Veloso, A., Gonçalves, M. A., & Laender, A. H. F. (2010). Effective self-training author name disambiguation in scholarly digital libraries. In Proceedings of the 2010 ACM/IEEE joint conference on digital libraries (pp. 39–48). New York, NY: ACM.
Haak, L. L., Fenner, M., Paglione, L., Pentz, E., & Ratner, H. (2012). ORCID: A system to uniquely identify researchers. Learned Publishing, 25(4), 259–264.
Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the 4th ACM/IEEE-CS joint conference on digital libraries (JCDL 2004) (pp. 296–305). New York, NY: ACM.
Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries (JCDL 2005) (pp. 334–343). New York, NY: ACM.
Harman, G. (2000). Allocating research infrastructure grants in post-binary higher education systems: British and Australian approaches. Journal of Higher Education Policy and Management, 22(2), 111–126.
Hicks, D. (2009). Evolving regimes of multi-university research evaluation. Higher Education, 57(4), 393–404.
Hjørland, B. (2010). The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology, 61(2), 217–237.
Huang, J., Ertekin, S., & Giles, C. (2006). Efficient name disambiguation for large-scale databases. In Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases (PKDD 2006) (pp. 536–544). Berlin: Springer.
Huang, S., Yang, B., Yan, S., & Rousseau, R. (2014). Institution name disambiguation for research assessment. Scientometrics, 99(3), 823–838.
Kanani, P., McCallum, A., & Pal, C. (2007). Improving author coreference by resource-bounded information gathering from the web. In Proceedings of the 20th international joint conference on artificial intelligence (pp. 429–434). San Francisco, CA: Morgan Kaufmann Publishers Inc.
Kang, I.-S., Na, S.-H., Lee, S., Jung, H., Kim, P., Sung, W.-K., et al. (2009). On co-authorship for author disambiguation. Information Processing and Management, 45(1), 84–97.
Kawashima, H., & Tomizawa, H. (2015). Accuracy evaluation of Scopus author ID based on the largest funding database in Japan. Scientometrics, 103(3), 1061–1071.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886.
Kim, J., & Kim, J. (2019). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24298.
Kim, J., Kim, J., & Owen-Smith, J. (2019). Generating automatically labeled data for author name disambiguation: An iterative clustering method. Scientometrics, 118(1), 253–280.
Larivière, V., & Costas, R. (2016). How many is too many? On the relationship between research productivity and impact. PLoS ONE, 11(9), e0162709.
Larivière, V., Desrochers, N., Macaluso, B., Mongeon, P., Paul-Hus, A., & Sugimoto, C. R. (2016). Contributorship and division of labor in knowledge production. Social Studies of Science, 46(3), 417–435.
Levin, M., Krawczyk, S., Bethard, S., & Jurafsky, D. (2012). Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology, 63(5), 1030–1047.
Liu, W., Doğan, R. I., Kim, S., Comeau, D. C., Kim, W., Yeganova, L., et al. (2014). Author name disambiguation for PubMed. Journal of the Association for Information Science and Technology, 65(4), 765–781.
Mazov, N. A., & Gureev, V. N. (2014). The role of unique identifiers in bibliographic information systems. Scientific and Technical Information Processing, 41(3), 206–210.
Morillo, F., Santabárbara, I., & Aparicio, J. (2013). The automatic normalisation challenge: Detailed addresses identification. Scientometrics, 95(3), 953–966.
Müller, M., Reitz, F., & Roy, N. (2017). Data sets for author name disambiguation: An empirical analysis and a new resource. Scientometrics, 111(3), 1467–1500.
On, B., Lee, D., Kang, J., & Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries (JCDL 2005) (pp. 344–353). New York, NY: ACM.
Palmblad, M., & Van Eck, N. J. (2018). Bibliometric analyses reveal patterns of collaboration between ASMS members. Journal of the American Society for Mass Spectrometry, 29(3), 447–454.
Pereira, D. A., Ribeiro-Neto, B. A., Ziviani, N., Laender, A. H. F., Gonçalves, M. A., & Ferreira, A. A. (2009). Using web information for author name disambiguation. In Proceedings of the 2009 ACM/IEEE joint conference on digital libraries (pp. 49–58). New York, NY: ACM.
Robinson-Garcia, N., Sugimoto, C. R., Murray, D., Yegros-Yegros, A., Larivière, V., & Costas, R. (2019). The many faces of mobility: Using bibliometric data to measure the movement of scientists. Journal of Informetrics, 13(1), 50–63.
Ruiz-Castillo, J., & Costas, R. (2014). The skewness of scientific productivity. Journal of Informetrics, 8(4), 917–934.
Schulz, J. (2016). Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses. Scientometrics, 107(3), 1283–1298.
Schulz, C., Mazloumian, A., Petersen, A. M., Penner, O., & Helbing, D. (2014). Exploiting citation networks for large-scale author name disambiguation. EPJ Data Science, 3(1), 11.
Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 43, 1–43.
Soler, J. (2007). Separating the articles of authors with the same name. Scientometrics, 72(2), 281–290.
Song, Y., Huang, J., Councill, I. G., Li, J., & Giles, C. L. (2007). Efficient topic-based unsupervised name disambiguation. In Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries (JCDL 2007) (pp. 342–351). New York, NY: ACM.
Song, M., Kim, E. H. J., & Kim, H. J. (2015). Exploring author name disambiguation on PubMed-scale. Journal of Informetrics, 9(4), 924–941.
Strotmann, A., & Zhao, D. Z. (2012). Author name disambiguation: What difference does it make in author-based citation analysis? Journal of the American Society for Information Science and Technology, 63(9), 1820–1833.
Sugimoto, C. R., Robinson-García, N., Murray, D. S., Yegros-Yegros, A., Costas, R., & Larivière, V. (2017). Scientists have most impact when they’re free to move. Nature, 550(7674), 29–31.
Sun, X., Kaur, J., Possamai, L., & Menczer, F. (2013). Ambiguous author query detection using crowdsourced digital library annotations. Information Processing and Management, 49(2), 454–464.
Tekles, A., & Bornmann, L. (2019). Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. arXiv:1904.12746.
Tijssen, R. J. W., & Yegros, A. (2017). Brexit: UK universities and European industry (Correspondence). Nature, 544(7648), 35.
Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. In Proceedings of the 2009 ACM/IEEE joint conference on digital libraries (pp. 39–48). New York, NY: ACM.
Veloso, A., Ferreira, A. A., Gonçalves, M. A., Laender, A. H., & Meira, W., Jr. (2012). Cost-effective on-demand associative author name disambiguation. Information Processing and Management, 48(4), 680–697.
Yang, K.-H., Peng, H.-T., Jiang, J.-Y., Lee, H.-M., & Ho, J.-M. (2008). Author name disambiguation for citations using topic and web correlation. In Proceedings of the 12th European conference on research and advanced technology for digital libraries (pp. 185–196). Berlin: Springer.
Youtie, J., Carley, S., Porter, A. L., & Shapira, P. (2017). Tracking researchers and their outputs: New insights from ORCIDs. Scientometrics, 113(1), 437–453.
Cite this article
D’Angelo, C.A., van Eck, N.J. Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation. Scientometrics 123, 883–907 (2020). https://doi.org/10.1007/s11192-020-03410-y
DOI: https://doi.org/10.1007/s11192-020-03410-y