Formational bounds of link prediction in collaboration networks

Scientometrics

Abstract

Link prediction in collaboration networks is often solved by identifying structural properties of existing nodes that are disconnected at one point in time and that share a link later on. The maximum possible recall, or upper bound, of this approach’s success is capped by the proportion of links that are formed among existing nodes embedded in these properties. Consequently, sustained links, as well as links that involve one or two new network participants, are typically not predicted. The purpose of this study is to highlight formational constraints that need to be considered to increase the practical value of link prediction methods targeted at collaboration networks. We identify the distribution of basic link formation types in four large-scale, over-time collaboration networks and show that, roughly speaking, 25% of links represent continued collaborations, 25% are new collaborations between existing authors, and 50% are formed between an existing author and a new network member. This implies that, for collaboration networks, increasing the accuracy of computational link prediction solutions may not be a reasonable goal when the ratio of collaboration links that are eligible for the classic link prediction process is low.
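The link formation types described in the abstract can be illustrated with a minimal sketch. It assumes that each network is given as a list of undirected coauthor pairs without self-loops, and that the recall upper bound is read as the share of present-period links that are new links between authors already present in the past period. The function names, category labels, and toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def classify_links(past_edges, present_edges):
    """Count present-period links by formation type relative to the past period."""
    # Links are undirected, so store each pair as a frozenset: (a, b) == (b, a).
    past = {frozenset(e) for e in past_edges}
    past_nodes = set().union(*past) if past else set()

    counts = Counter()
    for edge in {frozenset(e) for e in present_edges}:
        n_existing = sum(node in past_nodes for node in edge)
        if edge in past:
            counts["continued"] += 1             # link already existed in the past
        elif n_existing == 2:
            counts["new_between_existing"] += 1  # the classic prediction target
        elif n_existing == 1:
            counts["existing_with_new"] += 1     # one endpoint is a newcomer
        else:
            counts["new_with_new"] += 1          # both endpoints are newcomers
    return counts


def recall_upper_bound(counts):
    """Share of present links that a predictor limited to existing,
    previously unlinked node pairs could recover at best (assumed reading)."""
    total = sum(counts.values())
    return counts["new_between_existing"] / total if total else 0.0


# Hypothetical toy data, chosen to mirror the rough 25/25/50 split.
past = [("A", "B"), ("B", "C"), ("C", "D")]
present = [("A", "B"), ("A", "C"), ("C", "E"), ("D", "F")]

counts = classify_links(past, present)
print(dict(counts))
print(recall_upper_bound(counts))
```

In this toy case one of four present links is a new link between existing authors, so under the stated assumption no predictor restricted to such pairs could recall more than a quarter of all present links.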

Notes

  1. In this study, ‘collaboration’ means ‘coauthorship’ in a research paper and these two terms are used interchangeably.

  2. https://www.nlm.nih.gov/bsd/licensee/medpmmenu.html.

  3. https://databank.illinois.edu/datasets/IDB-4222651.

  4. http://dblp.org/xml/release/; for this study, we downloaded the April 2015 release.

  5. A list of 392 journals was obtained from the Thomson Reuters Journal Citation Reports 2012 for the category “Computer Science”. We retrieved records of the papers published in these journals from DBLP.

  6. http://journals.aps.org/datasets; for this study, we obtained the APS 2014 release version under the permission of the American Physical Society.

  7. Mark E. J. Newman at the University of Michigan Department of Physics kindly provided the disambiguation code.

  8. http://scholar.ndsl.kr/index.do; for this study, we obtained the KISTI 2016 version under a research agreement with the Korea Institute for Science and Technology Information.

  9. This demonstrates why varying past–present network time frames matters for this study. The idea of using different past–present network periods was suggested by one of the reviewers of this paper.

  10. In some fields, such as natural language processing, recall and precision are often inversely related and therefore an average score such as the F metric (e.g., harmonic mean of precision and recall) is calculated.

  11. For a detailed explanation of the Degree Product predictor, see “Appendix”.

  12. This does not mean that all preferential attachment models are designed to explain power-law obeying networks. However, many studies on preferential attachment have attempted to model power-law obeying networks.

  13. https://cran.r-project.org/web/packages/poweRlaw/index.html.

  14. Many studies on power-law distributions in collaboration networks have fitted distribution tails (i.e., the distribution of values at or above a certain x value) to power-law slopes to assess the performance of proposed network generation models. Several studies have divided a degree distribution into two parts (below and above a certain x value) and fitted them separately to different power-law slopes (e.g., Wagner and Leydesdorff 2005). A few others have tested power-law distributions with cut-offs (below a certain x value) (e.g., Newman 2001b).

Acknowledgements

This work is supported, in part, by the Korea Institute of Science and Technology Information (KISTI). We would like to thank Vetle Torvik (University of Illinois at Urbana-Champaign), the American Physical Society, DBLP, and KISTI for providing datasets. We are also grateful to Mark E. J. Newman (University of Michigan) for providing code for disambiguating author names in the APS data and to Raf Guns (University of Antwerp) for comments on link prediction processes in LinkPred.

Author information

Corresponding author

Correspondence to Jinseok Kim.

Appendix

Degree Product: Barabási et al. (2002) showed that if links in a network are formed through preferential attachment, the probability that two nodes form a link is proportional to the product of the degrees of those two nodes. This product is frequently used to predict link formation among nodes present in both the past and present networks. In the following equation, S(x, y) is the prediction score for a pair of nodes x and y, and Γ(x) is the set of nodes connected to x, so |Γ(x)| is the degree of x.

$$S(x, y) = \left| \Gamma(x) \right| \times \left| \Gamma(y) \right| \tag{2}$$
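As a rough illustration of Eq. (2), the sketch below scores every currently unlinked pair of nodes in a small undirected network by the product of their degrees. The function name and the toy edge list are hypothetical; this is not the implementation used in the study or in LinkPred, only a minimal example of the formula under the assumption of a simple graph without self-loops.

```python
from itertools import combinations


def degree_product_scores(edges):
    """Return {(x, y): |Γ(x)| * |Γ(y)|} for every unlinked node pair."""
    # Build Γ(x): the set of neighbors of each node in the past network.
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)

    linked = {frozenset(e) for e in edges}
    scores = {}
    # Only pairs of existing nodes without a link are prediction candidates.
    for x, y in combinations(sorted(neighbors), 2):
        if frozenset((x, y)) not in linked:
            scores[(x, y)] = len(neighbors[x]) * len(neighbors[y])
    return scores


# Hypothetical past network.
past_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
scores = degree_product_scores(past_edges)

# Rank candidate pairs from the highest to the lowest score.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, score)
```

Pairs with higher scores are treated as more likely to link in the present network. Note that the candidate set contains only nodes that already appear in the past network, which is the constraint the study examines.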

Cite this article

Kim, J., Diesner, J. Formational bounds of link prediction in collaboration networks. Scientometrics 119, 687–706 (2019). https://doi.org/10.1007/s11192-019-03055-6

  • DOI: https://doi.org/10.1007/s11192-019-03055-6
