1 Introduction
The advancement of modern science has increased both the complexity of scientific problems and the cost of scientific instruments, resulting in the emergence of big science [1–5]. The accompanying accumulation of knowledge has made it almost impossible for a single scientist to possess the comprehensive expertise required for a scientific project, a phenomenon known as the burden of knowledge [6]. Scientists have therefore increasingly formed teams to address these challenges [3, 4, 7, 8]. Previous research has demonstrated that teams dominate knowledge creation in contemporary science, operating across institutional and national boundaries [8–10]. Collaboration networks have thus become a powerful tool for studying team structures and scientific collaborations [7].
The past two decades have witnessed numerous studies on the properties of collaboration networks, suggesting that they exhibit scale-free, small-world, and assortative properties as well as strong community structures [7, 11–13]. Recent studies have expanded the scope of collaboration networks from binary to weighted [14, 15], temporal [16–18] and multilayer networks [19]. The availability of large-scale bibliometric datasets and quantitative tools enables the study of the relationship between collaboration network structure and scientific performance. From the macroscopic point of view, previous studies showed that macroscopic network properties significantly affect scientists’ academic performance, including productivity and citation impact [20–28]. From the individual paper’s point of view, empirical studies explored microscopic team formation, examining the association between team diversity, team structures and paper citation, novelty, disruption and multidisciplinarity [9, 28–39]. However, existing studies mainly constructed collaboration networks at a dyadic level, potentially overlooking valuable information, as scientific collaboration is now dominated by group interactions beyond the dyadic level.
In recent years, researchers have made substantial progress in network science and computational topology, leading to the emergence of higher-order representations that capture multi-agent relationships beyond conventional dyadic interactions. Notable examples include simplicial complexes [40, 41] and hypergraphs [42, 43], which have been widely applied in analyzing various types of networks across social systems [44], neuroscience [45, 46], ecology [47, 48], and other biological systems [49, 50]. Although similar frameworks have been used in the science of science [51–55], to the best of our knowledge, there is limited research exploring the association between higher-order properties and individual scientific productivity. Prior research has demonstrated that higher-order holes play necessary roles in biological systems, especially in brain function [56, 57]. This points to a promising direction for the study of collaboration systems, namely investigating how these higher-order characteristics affect scientific outcomes, and it calls for translating the original co-authorship data into structures that preserve group interactions. Additionally, existing studies have drawn conclusions from specific scientific domains, raising questions regarding the generalizability of the findings.
In this paper, we fill this gap by leveraging the Microsoft Academic Graph (MAG), a large-scale scholarly dataset. We use a simplicial complex framework to construct local collaboration networks for a cohort of more than 3.7 million scientists. Our primary objective is to investigate the association between higher-order structural properties of local collaboration networks and scientists’ productivity. Specifically, we examine two key higher-order characteristics: the 0th Betti number (\(\beta _{0}\)), representing the number of disconnected components, and the 1st Betti number (\(\beta _{1}\)), indicating the presence of higher-order loops. There are three key findings. First, we find an inverted U-shaped relationship between the number of disconnected components and individual productivity. Second, we observe that the presence of higher-order loops within local co-authorship networks is positively associated with scientists’ productivity, suggesting underlying forces related to group interactions. Third, the uncovered relationships hold across major scientific domains, indicating that our results generalize well. This study makes two contributions. First, we use a simplicial complex approach to depict scientific collaboration networks, which helps to capture group interactions and higher-order structural properties that cannot be obtained in the conventional dyadic view. Second, our work encompasses scientists from diverse scientific disciplines, offering insights that extend beyond specific scientific domains. These results may help us better understand individual careers and have policy implications for nurturing scientists towards high academic performance.
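To make the two quantities concrete, the following is a minimal sketch of how \(\beta _{0}\) and \(\beta _{1}\) can be computed for a small simplicial complex from the ranks of its boundary maps. It is an illustration only and not the pipeline used in this paper: the function names (`faces`, `rank_gf2`, `boundary_rank`, `betti_0_1`) are hypothetical, the complex is given as a list of maximal simplices, and coefficients are taken in the two-element field GF(2) for simplicity, which is an assumption.

```python
from itertools import combinations

def faces(maximal_simplices, dim):
    """All dim-dimensional faces of the complex, as sorted vertex tuples."""
    out = set()
    for s in maximal_simplices:
        out.update(combinations(sorted(s), dim + 1))
    return sorted(out)

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are stored as int bitmasks."""
    basis = {}  # leading-bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]  # cancel the leading bit and keep reducing
    return len(basis)

def boundary_rank(maximal_simplices, dim):
    """Rank of the boundary map from dim-simplices to (dim-1)-simplices."""
    lower = {f: i for i, f in enumerate(faces(maximal_simplices, dim - 1))}
    rows = []
    for s in faces(maximal_simplices, dim):
        mask = 0
        for f in combinations(s, dim):  # the (dim-1)-faces of s
            mask |= 1 << lower[f]
        rows.append(mask)
    return rank_gf2(rows)

def betti_0_1(maximal_simplices):
    """Return (beta_0, beta_1) with GF(2) coefficients (an assumption)."""
    n0 = len(faces(maximal_simplices, 0))
    n1 = len(faces(maximal_simplices, 1))
    r1 = boundary_rank(maximal_simplices, 1)
    r2 = boundary_rank(maximal_simplices, 2)
    return n0 - r1, n1 - r1 - r2

# Three collaborators who only ever co-author in pairs form an unfilled loop.
hollow = [("a", "b"), ("b", "c"), ("a", "c")]
# The same three collaborators on one joint paper fill the loop.
filled = [("a", "b", "c")]
```

Under these assumptions, `betti_0_1(hollow)` gives one component and one loop, while `betti_0_1(filled)` gives one component and no loop, which is the distinction a purely dyadic network cannot express.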
3 Data
In this paper, we leverage the Microsoft Academic Graph dataset (MAG), which comprises more than 260 million digital publications spanning from 1800 to 2021. MAG offers comprehensive information regarding each publication, including publication year, scientific field(s), and author name(s). It has emerged as a pivotal data source for research on individual careers [62–68]. To distinguish author identities, MAG combines machine learning algorithms that leverage publication records with web search engines that access public information such as personal websites and public curricula vitae [69]. Recent studies have established a gold standard dataset for author name disambiguation based on ORCID, finding that MAG author IDs achieve 81.87% accuracy, a 78.13% F1 score, and 98.49% precision, which supports the reliability of MAG’s author identification methods [34, 70].
In this study, we focus on journal articles and conference papers published prior to 2011. Our analysis includes papers with scientific field information as well as venue information, resulting in a dataset of 56,895,201 papers. Furthermore, we focus on scientists who published at least 5 papers and no more than 500 papers during their entire career. This approach helps us mitigate potential errors related to author name disambiguation within MAG, including author under-conflation, where an author’s publication count is erroneously lower than the actual number, and over-conflation, where additional publication records are wrongly assigned to an author. It also reduces the influence of outliers, such as authors with very few or exceptionally many publications. This selection criterion aligns with recent research practices [38, 60]. Moreover, we exclude scientists who have collaborated with more than 36 distinct partners in any given year. The reason for this exclusion is the considerable computational cost of higher-order network analyses. In particular, the computation of homology requires enumerating all possible combinations of simplices, with computational complexity growing exponentially with the dimension of the simplicial complex [54]. This threshold balances the need for accuracy against the constraints of available computational resources. Additionally, we focus on scientists who published their first paper later than 1960 in order to reduce the noise arising from the relatively small number of publications before that year.
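The selection rules above can be summarized in a short sketch. The `Career` record and its field names are hypothetical and introduced only for illustration; the thresholds are the ones stated in the text. The helper `max_simplices` gives a simple upper bound (every non-empty subset of the collaborators) that illustrates why the number of candidate simplices grows so quickly; it is not the complexity analysis of [54].

```python
from dataclasses import dataclass

@dataclass
class Career:
    """Hypothetical per-scientist summary; field names are illustrative."""
    n_papers: int                  # papers over the whole career
    first_pub_year: int            # year of the first publication
    max_yearly_collaborators: int  # most distinct co-authors in any single year

def keep_scientist(c: Career) -> bool:
    """Apply the sample restrictions described in the text."""
    return (
        5 <= c.n_papers <= 500                # at least 5 and at most 500 papers
        and c.max_yearly_collaborators <= 36  # no year with more than 36 partners
        and c.first_pub_year > 1960           # first paper later than 1960
    )

def max_simplices(n_vertices: int) -> int:
    """Upper bound on simplices: all non-empty subsets of the vertices."""
    return 2 ** n_vertices - 1

print(keep_scientist(Career(42, 1987, 12)))
print(max_simplices(36))
```

With 36 collaborators the bound is already \(2^{36}-1\), roughly \(6.9 \times 10^{10}\) subsets, and each additional collaborator doubles it.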
Our final sample comprises a total of 3,785,807 scientists. For each scientist, we construct yearly local collaboration networks by considering interactions among collaborators (see details in Methodology), resulting in a total of 27,786,774 scientist-year observations up to 2011 (see the data frame of “scientist-year observations” in the Appendix, Table A1). Note that scientists with a publication history of fewer than 3 years were excluded to ensure a consistent number of samples in the regression analysis of the panel data.
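As a rough illustration of what a scientist-year observation might be built from, the sketch below groups a focal scientist’s papers by year and records each paper’s remaining co-authors as one group. The Methodology section is not reproduced here, so this grouping rule is an assumption and not a description of the construction actually used; the function name `yearly_local_simplices` and the `(year, authors)` record layout are likewise hypothetical.

```python
from collections import defaultdict

def yearly_local_simplices(focal, papers):
    """Map year -> set of co-author groups (frozensets) around `focal`.

    `papers` is an iterable of (year, authors) pairs. Assumption: each paper
    contributes one group made of its authors other than the focal scientist.
    """
    by_year = defaultdict(set)
    for year, authors in papers:
        if focal not in authors:
            continue  # paper does not involve the focal scientist
        group = frozenset(authors) - {focal}
        if group:  # solo-authored papers contribute no collaborators
            by_year[year].add(group)
    return dict(by_year)

papers = [
    (2001, ["A", "B", "C"]),
    (2001, ["A", "D"]),
    (2002, ["A"]),
    (2002, ["B", "C"]),
]
local = yearly_local_simplices("A", papers)
```

Each year’s set of groups could then be treated as the maximal simplices of that year’s local complex, from which quantities such as \(\beta _{0}\) and \(\beta _{1}\) are computed.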
6 Conclusions
In an era where scientific knowledge creation is dominated by collaborative teams, it is important to examine the higher-order structures inherent in scientific collaboration networks. The conventional approach, which adopts a dyadic perspective to construct local collaboration networks, may overlook valuable information about group interactions. Leveraging a dataset of over 56 million research articles from 1960 to 2011 from the Microsoft Academic Graph, we explore the link between the higher-order structural features of local collaboration networks and scientific productivity, and we assess the generalizability of the findings across a diverse set of scientific domains. Throughout our analysis, a noteworthy trend becomes apparent: both the number of disconnected components and the prevalence of higher-order holes increase consistently over time. The fraction of local networks featuring higher-order holes reached 11% in 2011. This rise may be attributed to the expansion of the scientific community during this period. While higher-order holes are evident in various domains, with domains such as medicine and biology sharing common features, triadic closure remains the prevailing characteristic of scientific collaboration networks.
Furthermore, our investigation reveals an inverted U-shaped association between the number of disconnected components in local collaboration networks and scientific productivity. These results partly speak to the strength of weak ties theory [81], which suggests that individuals spanning structural holes in social networks can gain significant advantages in accessing new opportunities, fostering innovation [82], and enhancing their overall performance [83]. Previous research, largely rooted in macroscopic collaboration networks, has consistently demonstrated the advantages reaped by scientists who span structural holes. These benefits include more publications, more citations, and a higher likelihood of contributing novel research [20, 25, 60]. However, such studies have rarely examined scientists’ local networks. Structural holes [84, 85], which foster diversity within local collaboration networks, are likely to play a pivotal role [86], and one would expect them to confer significant productivity advantages on scientists. It is plausible that structural diversity acts as a catalyst for resource sharing and knowledge transmission, allowing scientists to draw on a spectrum of expertise, diverse ideas, and even the lessons learned from failure across a heterogeneous pool of collaborators [87–91]. These diverse local collaboration structures equip scientists to acquire a wide array of skills, which ultimately bolsters their productivity. This interpretation aligns with prior findings that novel and multidisciplinary research flourishes within newly formed teams [38]. Our research reinforces this perspective by showing a positive correlation between the number of disconnected components within local collaboration networks and scientific productivity, up to a certain threshold. These empirical results are consistent with the theories of structural holes and weak ties.
This study also reveals that once the number of disconnected components exceeds a certain threshold, its correlation with productivity becomes negative. This finding prompts us to consider the potential underlying forces. In scientific collaborations, where the advantages of structural holes and disconnected team members are evident, effective communication and coordination between individuals remain critical [92, 93]. A key facilitator in this regard is familiarity among collaborators. Earlier research highlighted the benefits of strong ties between scientists, often referred to as “super-ties,” and their substantial contributions to productivity and citations [94]. Furthermore, diverse structures within local collaboration networks can have the unintended consequence of slowing down the assimilation of ideas, leading to lower consensus and, in some cases, potential conflicts [32, 95, 96]. For example, international collaborations tend to produce less novel papers [32], and remote collaborations show a negative association with disruptive research [39]. Similarly, Liu et al. found an inverted U-shaped relationship between team freshness and citations using paper-level data [34].
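One common way to formalize such an inverted U-shaped association is a quadratic specification. The form below is a generic sketch: the symbols \(y_{it}\) (productivity of scientist \(i\) in year \(t\)), \(X_{it}\) (control variables), and the coefficients \(a\), \(b_{1}\), \(b_{2}\), \(\gamma\) are illustrative assumptions and are not taken from the regression models reported in this paper.

\[
y_{it} = a + b_{1}\,\beta _{0,it} + b_{2}\,\beta _{0,it}^{2} + \gamma ^{\top } X_{it} + \varepsilon _{it}, \qquad b_{1} > 0,\; b_{2} < 0 .
\]

Under this form, the association with \(\beta _{0}\) is positive below the turning point and negative above it, where the turning point follows from setting the derivative with respect to \(\beta _{0}\) to zero:

\[
\frac{\partial y_{it}}{\partial \beta _{0,it}} = b_{1} + 2 b_{2}\,\beta _{0,it} = 0 \quad \Longrightarrow \quad \beta _{0}^{*} = -\frac{b_{1}}{2 b_{2}} .
\]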
This study makes a further observation: the presence of higher-order loops within local collaboration networks is positively correlated with productivity in scientific careers. These higher-order loops shed light on the interplay among multiple agents that goes beyond typical dyadic interactions. For instance, the phenomenon of complex contagion, where an influence requires the involvement of more than two individuals, may exhibit unique characteristics. As highlighted by Iacopini et al. [97], “the simplicial model of contagion is able to capture the basic mechanisms and effects of higher-order interactions in social contagion processes.” In scientific collaboration, researchers engage in discussions, knowledge diffusion, and the adoption of innovative ideas, and describing these interactions through the lens of higher-order networks can provide valuable insights. This raises questions about how resources and knowledge are transmitted within these higher-order loops, and about the underlying forces driving the positive correlation between higher-order loops and scientific performance. These open questions point to promising directions for future research.
In conclusion, these results remain consistent across a spectrum of scientific domains, highlighting their generalizability. This work contributes to the understanding of higher-order collaboration networks by examining the roles of higher-order holes, and it advances our comprehension of how network structures are associated with the scientific performance of researchers. Of particular significance is the inverted U-shaped relationship between the number of disconnected components within local collaboration networks and productivity, which offers a nuanced view of the interplay between structural complexity and scientific output. Additionally, our work encompasses scientists from diverse fields, so its insights extend beyond specific scientific domains. Our findings have important policy implications for nurturing scientific personnel and accelerating innovative breakthroughs. Scientists need to carefully consider the structure of their collaboration networks: a well-balanced local co-authorship network, with a moderate number of disconnected or loosely connected groups, is associated with high productivity.
This study has several limitations. First, we use publication data to describe collaboration patterns, but collaborative work does not always result in written outputs, and the presence of ghost authors, who contribute to research but are not acknowledged as authors, cannot be ruled out [34, 98, 99]. This may introduce biases in our findings and limit the generalizability of our results to all forms of scientific collaboration. Second, we gauge scientific productivity using the number of publications. However, the number of publications alone may not fully capture scientists’ scientific performance [100]. Prior research proposed various indicators to measure the quality of academic outputs, such as citations [101], novelty indicators [102, 103] aligning with Schumpeter’s innovation economics that “innovation combines components in a new way” [104], the disruption index [59, 105], as well as other metrics capturing interdisciplinarity [106]. It would thus be interesting to understand the effect of higher-order structures on scientists’ academic performance while taking into account the quality of their work. Third, although we control for possible confounding variables, our study is still correlational in nature and does not establish causal relationships. Despite these limitations, our study offers valuable insights into the relationship between higher-order structural properties and scientific outcomes, contributing to a growing body of literature in the science of science and data science.
Further research is needed to systematically investigate the mechanisms underlying these associations between higher-order properties and productivity. What factors prompt scientists with higher-order structures to publish significantly more papers than their counterparts without higher-order structures? In an era of big science, the number of publications and citations grows tremendously each year. Future work could therefore examine how the effect of higher-order structures on scientific achievements evolves over time, which may help disentangle the effect of the growth of science from that of higher-order structures. Finally, future research could go beyond scientific productivity and explore how higher-order structures affect knowledge recombination, originality and interdisciplinarity.