Abstract
Link prediction in collaboration networks is often solved by identifying structural properties of existing nodes that are disconnected at one point in time, and that share a link later on. The maximally possible recall rate or upper bound of this approach’s success is capped by the proportion of links that are formed among existing nodes embedded in these properties. Consequently, sustained links as well as links that involve one or two new network participants are typically not predicted. The purpose of this study is to highlight formational constraints that need to be considered to increase the practical value of link prediction methods targeted at collaboration networks. In this study, we identify the distribution of basic link formation types based on four large-scale, over-time collaboration networks, showing that, roughly speaking, 25% of links represent continued collaborations, 25% of links are new collaborations between existing authors, and 50% are formed between an existing author and a new network member. This implies that for collaboration networks, increasing the accuracy of computational link prediction solutions may not be a reasonable goal when the ratio of collaboration links that are eligible for the classic link prediction process is low.
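The link formation types described above can be illustrated with a minimal sketch. It assumes that a collaboration network is given as a list of undirected author pairs without self-loops, one list for the past period and one for the present period. The function names, category labels, and toy data are hypothetical and are not taken from the study's own code; the bound shown is simply the share of present links that are new links between authors already in the past network.

```python
from collections import Counter


def classify_links(past_edges, present_edges):
    """Count present-period links by their relation to the past network."""
    past = {frozenset(e) for e in past_edges}
    past_nodes = {node for edge in past for node in edge}
    counts = Counter()
    for edge in {frozenset(e) for e in present_edges}:
        known = sum(1 for node in edge if node in past_nodes)
        if edge in past:
            counts["continued"] += 1             # same pair collaborated before
        elif known == 2:
            counts["new_between_existing"] += 1  # classic link prediction target
        elif known == 1:
            counts["existing_with_new"] += 1     # one author is a newcomer
        else:
            counts["both_new"] += 1              # both authors are newcomers
    return counts


def recall_upper_bound(counts):
    """Share of present links that a past-nodes-only predictor could recover."""
    total = sum(counts.values())
    return counts["new_between_existing"] / total if total else 0.0


# Toy data (hypothetical authors a..f).
past = [("a", "b"), ("b", "c")]
present = [("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")]
counts = classify_links(past, present)
bound = recall_upper_bound(counts)
print(dict(counts), bound)
```

In this toy case only one of the four present links is a new link between existing authors, so even a perfect predictor of that type would recover a quarter of all present links.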
Notes
In this study, ‘collaboration’ means ‘coauthorship’ in a research paper and these two terms are used interchangeably.
http://dblp.org/xml/release/; for this study, we downloaded the April 2015 release.
A list of 392 journals was obtained from the Thomson Reuters Journal Citation Report 2012 for the category “Computer Science”. We retrieved records of papers published in these journals from DBLP.
http://journals.aps.org/datasets; for this study, we obtained the APS 2014 release version under the permission of the American Physical Society.
Mark E. J. Newman at the University of Michigan Department of Physics kindly provided the disambiguation code.
http://scholar.ndsl.kr/index.do; for this study, we obtained the KISTI 2016 version under a research agreement with the Korea Institute for Science and Technology Information.
This demonstrates why varying past–present network time frames matters for this study. The idea of using different past–present network periods was suggested by one of the reviewers of this paper.
In some fields, such as natural language processing, recall and precision are often inversely related and therefore an average score such as the F metric (e.g., harmonic mean of precision and recall) is calculated.
For a detailed explanation of the Degree Product predictor, see “Appendix”.
This does not mean that all preferential attachment models are designed to explain power-law obeying networks. However, many studies on preferential attachment have attempted to model power-law obeying networks.
Many studies on power-law distribution in collaboration networks have fitted distribution tails (i.e., distribution of certain x values and above) to power-law slopes to assess the performance of proposed network generation models. Several studies have divided a degree distribution into two parts (below and above a certain x value) and fit them separately to different power-law slopes (e.g., Wagner and Leydesdorff 2005). A few others have tested power-law distributions with cut-offs (below certain x value) (e.g., Newman 2001b).
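The note above on recall and precision mentions the F metric as a harmonic mean of the two. A minimal sketch of that calculation for link prediction follows, assuming predicted and actual links are given as undirected pairs; the function name and the toy data are hypothetical and serve only as an illustration.

```python
def precision_recall_f1(predicted, actual):
    """Return precision, recall, and their harmonic mean (F1) for link sets."""
    predicted = {frozenset(p) for p in predicted}
    actual = {frozenset(a) for a in actual}
    hits = len(predicted & actual)  # correctly predicted links
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    # The harmonic mean is undefined when both scores are zero; report 0.0.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy data (hypothetical): four predicted links, two links actually formed.
predicted_links = [("a", "c"), ("b", "d"), ("a", "d"), ("c", "e")]
actual_links = [("a", "c"), ("c", "d")]
precision, recall, f1 = precision_recall_f1(predicted_links, actual_links)
print(precision, recall, f1)
```

Here one of four predictions is correct (precision 0.25) and one of two actual links is found (recall 0.5), so the harmonic mean is one third.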
References
Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211–230. https://doi.org/10.1016/S0378-8733(03)00009-1.
Barabási, A. L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and Its Applications, 311(3–4), 590–614. https://doi.org/10.1016/s0378-4371(02)00736-7.
Braun, T., Glänzel, W., & Schubert, A. (2001). Publication and cooperation patterns of the authors of neuroscience journals. Scientometrics, 51(3), 499–510. https://doi.org/10.1023/A:1019643002560.
Cabanac, G., Hubert, G., & Milard, B. (2015). Academic careers in Computer Science: Continuance and transience of lifetime co-authorships. Scientometrics, 102(1), 135–150. https://doi.org/10.1007/s11192-014-1426-0.
Chen, D.-B., Xiao, R., & Zeng, A. (2014). Predicting the evolution of spreading on complex networks. Scientific Reports. https://doi.org/10.1038/srep06108
Chen, H., Li, X., & Huang, Z. (2005). Link prediction approach to collaborative filtering. Paper presented at the proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries (JCDL ‘05).
Choudhury, N., & Uddin, S. (2017). Mining actor-level structural and neighborhood evolution for link prediction in dynamic networks. Paper presented at the Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017, Sydney, Australia.
Choudhury, N., & Uddin, S. (2018). Evolutionary community mining for link prediction in dynamic networks. Paper presented at the complex networks & their applications VI, Lyon, France.
Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661–703. https://doi.org/10.1137/070710111.
Fegley, B. D., & Torvik, V. I. (2013). Has large-scale named-entity network analysis been resting on a flawed assumption? PLoS ONE, 8(7), 1–16. https://doi.org/10.1371/journal.pone.0070299.
Guns, R. (2014). Link prediction. In Measuring scholarly impact (pp. 35–55). Springer.
Guns, R., & Rousseau, R. (2014). Recommending research collaborations using link prediction and random forest classifiers. Scientometrics, 101(2), 1461–1473. https://doi.org/10.1007/s11192-013-1228-9.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886. https://doi.org/10.1007/s11192-018-2824-5.
Kim, J., & Diesner, J. (2015). The effect of data pre-processing on understanding the evolution of collaboration networks. Journal of Informetrics, 9(1), 226–236. https://doi.org/10.1016/j.joi.2015.01.002.
Kim, J., & Diesner, J. (2016). Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. Journal of the Association for Information Science and Technology, 67(6), 1446–1461.
Kim, J., & Diesner, J. (2017). Over-time measurement of triadic closure in coauthorship networks. Social Network Analysis and Mining, 7(1), 1–12. https://doi.org/10.1007/s13278-017-0428-3.
Kim, J., Tao, L., Lee, S.-H., & Diesner, J. (2016). Evolution and structure of scientific co-publishing network in Korea between 1948–2011. Scientometrics, 107(1), 27–41. https://doi.org/10.1007/s11192-016-1878-5.
Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the precision and recall of author-ity among NIH-funded scientists. PLoS ONE, 11(7), e0158731.
Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019–1031. https://doi.org/10.1002/asi.20591.
Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and Its Applications, 390(6), 1150–1170.
Martin, T., Ball, B., Karrer, B., & Newman, M. E. J. (2013). Coauthorship and citation patterns in the Physical Review. Physical Review E, 88(1), 012814. https://doi.org/10.1103/physreve.88.012814.
Milojević, S. (2010). Modes of collaboration in modern science: Beyond power laws and preferential attachment. Journal of the American Society for Information Science and Technology, 61(7), 1410–1423. https://doi.org/10.1002/asi.21331.
Mohdeb, D., Boubetra, A., & Charikhi, M. (2016). Tie persistence in academic social networks. Informatica, 40(3), 353.
Mollenhorst, G., Volker, B., & Flap, H. (2011). Shared contexts and triadic closure in core discussion networks. Social Networks, 33(4), 292–302. https://doi.org/10.1016/j.socnet.2011.09.001.
Newman, D., Karimi, S., & Cavedon, L. (2009). Using topic models to interpret MEDLINE’s medical subject headings. In A. Nicholson, & X. Li (Eds.), AI 2009: Advances in artificial intelligence (Vol. 5866, pp. 270–279). Berlin, Heidelberg: Springer.
Newman, M. E. J. (2001a). Clustering and preferential attachment in growing networks. Physical Review E. https://doi.org/10.1103/physreve.64.025102.
Newman, M. E. J. (2001b). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 404–409. https://doi.org/10.1073/pnas.021544898.
Pennock, D. M., Flake, G. W., Lawrence, S., Glover, E. J., & Giles, C. L. (2002). Winners don’t take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences of the United States of America, 99(8), 5207–5211. https://doi.org/10.1073/pnas.032085699.
Perc, M. (2014). The Matthew effect in empirical data. Journal of The Royal Society Interface. https://doi.org/10.1098/rsif.2014.0378.
Price, D., & Gürsey, S. (1976). Studies in scientometrics. 1. Transience and continuance in scientific authorship. Paper presented at the international forum on information and documentation.
Reitz, F., & Hoffmann, O. (2011). Did they notice? A case-study on the community contribution to data quality in DBLP. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), Research and advanced technology for digital libraries, TPDL 2011 (Vol. 6966, pp. 204–215). Berlin: Springer.
Resnick, P., & Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3), 56–58.
Schubert, A., & Glänzel, W. (1991). Publication dynamics: Models and indicators. Scientometrics, 20(1), 317–331. https://doi.org/10.1007/BF02018161.
Taskar, B., Wong, M. F., Abbeel, P., & Koller, D. (2003). Link prediction in relational data. Paper presented at the advances in neural information processing systems.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304.
Wagner, C. S., & Leydesdorff, L. (2005). Network structure, self-organization, and the growth of international collaboration in science. Research Policy, 34(10), 1608–1618. https://doi.org/10.1016/j.respol.2005.08.002.
Yan, E., & Guns, R. (2014). Predicting and recommending collaborations: An author-, institution-, and country-level analysis. Journal of Informetrics, 8(2), 295–309. https://doi.org/10.1016/j.joi.2014.01.008.
Acknowledgements
This work is supported, in part, by Korea Institute of Science and Technology Information (KISTI). We would like to thank Vetle Torvik (University of Illinois at Urbana-Champaign), the American Physical Society, DBLP, and KISTI for providing datasets. We are also grateful to Mark E. J. Newman (University of Michigan) for providing code for disambiguating author names in APS data and Raf Guns (University of Antwerp) for comments on link prediction processes in LinkPred.
Appendix
Degree Product: Barabási et al. (2002) showed that if links in a network are formed based on preferential attachment, the probability that two nodes form a link is proportional to the product of the degrees of those two nodes. This product is frequently used to predict link formation among nodes present in both past and present networks. In the following equation, S(x, y) is the prediction score for a pair of nodes x and y, and Γ(x) is the set of nodes connected to x.
S(x, y) = |Γ(x)| · |Γ(y)|
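As an illustration of the Degree Product predictor, the sketch below scores every currently unlinked pair of nodes in a past network by the product of their degrees, that is, S(x, y) = |Γ(x)| · |Γ(y)|. It assumes an unweighted, undirected edge list without self-loops; the function name and the toy network are hypothetical, and this is not the LinkPred implementation.

```python
from itertools import combinations


def degree_product_scores(edges):
    """Score each unlinked node pair by the product of the two node degrees."""
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)
    scores = {}
    for x, y in combinations(sorted(neighbors), 2):
        if y not in neighbors[x]:  # only pairs that are not yet linked
            scores[(x, y)] = len(neighbors[x]) * len(neighbors[y])
    return scores


# Toy past network (hypothetical authors a..e).
past_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
scores = degree_product_scores(past_edges)
# Rank candidate links from highest to lowest score, ties broken by name.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```

Only nodes that already appear in the past network receive a score, which is why links involving new network members fall outside what such a predictor can recover.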
Kim, J., Diesner, J. Formational bounds of link prediction in collaboration networks. Scientometrics 119, 687–706 (2019). https://doi.org/10.1007/s11192-019-03055-6
DOI: https://doi.org/10.1007/s11192-019-03055-6