Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation

D’Angelo, Ciriaco Andrea; van Eck, Nees Jan

doi:10.1007/s11192-020-03410-y

Published: 07 March 2020

Scientometrics, Volume 123, pages 883–907 (2020)

Abstract

The disambiguation of author names is an important and challenging task in bibliometrics. We propose an approach that relies on an external source of information for selecting and validating clusters of publications identified through an unsupervised author name disambiguation method. Applying the proposed approach to a random sample of Italian scholars yields encouraging results, with overall precision, recall, and F-measure all above 96%. The proposed approach can serve as a starting point for a large-scale census of publication portfolios for bibliometric analyses at the level of individual researchers.
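
For readers unfamiliar with the three measures reported in the abstract, the following minimal sketch computes them for a single researcher under the standard set-based definitions. The function name, the publication identifiers, and the per-author evaluation shown here are illustrative assumptions; the preview does not state the exact evaluation protocol used in the article.

```python
def precision_recall_f(assigned, truth):
    """Standard precision, recall and F-measure for one author's publication set.

    assigned: publication IDs attributed to the author by a disambiguation method
    truth:    publication IDs that actually belong to the author
    """
    assigned, truth = set(assigned), set(truth)
    true_pos = len(assigned & truth)  # correctly attributed publications
    precision = true_pos / len(assigned) if assigned else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical publication IDs, for illustration only.
assigned = {"p1", "p2", "p3", "p4"}   # p4 is a false positive (homonym)
truth = {"p1", "p2", "p3", "p5"}      # p5 is a false negative (missed paper)
p, r, f = precision_recall_f(assigned, truth)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

A false positive (a homonym's paper wrongly attributed) lowers precision, while a false negative (a missed paper) lowers recall; the F-measure is their harmonic mean.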

Notes

http://cercauniversita.cineca.it/php5/docenti/cerca.php, last accessed 20/09/2019.
The complete list is accessible at attiministeriali.miur.it/userfiles/115.htm, last accessed 20/09/2019.
Note that this filtering stage differs from that in the original DGA method: in the original DGA method, author-identity pairs are filtered; in the proposed approach, complete clusters are filtered.
The percent sign (%) wildcard retrieves any name starting with the text that precedes the sign.
In fact, these publications are also ignored by the DGA algorithm, which is applied only to articles indexed in the Italian National Citation Report.
The reader may wonder why a “manual validation” is performed in an approach proposed for “large scale” author name disambiguation. As discussed in more detail below, this scenario is presented only to illustrate the trade-off between its costs and benefits and those of the less costly alternative scenarios, which involve no manual validation.
All Italian university research staff hold an ORCID identifier, following the IRIDE project launched in 2014 by the MIUR.
We have excluded documents published in a year in which the relevant author was not a tenured professor in the Italian academic system.
Baseline 1 is a simple method often used by scholars in practice. Given the high share of potential homonyms (29%, as shown in Table 5), we expect a low level of precision when applying such a method. Baseline 2 should resolve most homonym cases but could lead to a low level of recall due to an increased number of false negatives.
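
The percent-sign wildcard mentioned in the notes above behaves like a prefix match in an SQL `LIKE` pattern. The sketch below illustrates that behaviour with Python's built-in `sqlite3` module. The table name, column name, and author names are hypothetical, and the use of SQL is itself an assumption: the notes do not say which database system or query the authors used.

```python
import sqlite3


def names_with_prefix(conn, prefix):
    """Return all author names that start with `prefix`, using a LIKE pattern.

    Note: '%' or '_' inside `prefix` would also act as wildcards.
    """
    rows = conn.execute(
        "SELECT name FROM authors WHERE name LIKE ? ORDER BY name",
        (prefix + "%",),  # trailing % matches any (possibly empty) suffix
    )
    return [name for (name,) in rows]


# Hypothetical in-memory table of author name strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (name TEXT)")
conn.executemany(
    "INSERT INTO authors (name) VALUES (?)",
    [("ROSSI M",), ("ROSSI MARIO",), ("ROSSINI G",), ("BIANCHI M",)],
)

matches = names_with_prefix(conn, "ROSSI M")
print(matches)
```

A shorter prefix such as `"ROSSI"` would also retrieve `ROSSINI G`, which shows how a broad wildcard query can pull in potential homonyms that a later filtering step has to remove.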

Author information

Authors and Affiliations

Department of Engineering and Management, University of Rome “Tor Vergata”, Rome, Italy
Ciriaco Andrea D’Angelo
Centre for Science and Technology Studies, Leiden University, Leiden, The Netherlands
Nees Jan van Eck

Corresponding author

Correspondence to Ciriaco Andrea D’Angelo.

About this article

Cite this article

D’Angelo, C.A., van Eck, N.J. Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation. Scientometrics 123, 883–907 (2020). https://doi.org/10.1007/s11192-020-03410-y

Received: 25 September 2019
Published: 07 March 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11192-020-03410-y
