Introduction
Related work
Information adoption and diffusion
Big data analysis
Present work
numpy, pandas, tensorflow) that is imported into a Python module through an import statement. For example, the random submodule of numpy can be imported using a command such as from numpy import random or import numpy.random as rnd. There are dozens of variations on Python import statements. We assume that each import statement is written deliberately and therefore define a library by its full library path as written. Consequently, submodule imports such as import numpy.random are considered distinct from import numpy in the present work. The appeal of focusing on Python libraries is that they provide a clearly defined and rich information unit that is adopted rather than shared. To adopt a library, the user must understand its mechanics and be able to incorporate its functionality to produce new source code.
Methods
Data collection
diff specifying lines that were added, edited, or deleted within each file. A commit also contains a name and email address defined by the user’s local Git configuration. This configuration may or may not be correlated with a user’s GitHub account; we make no attempt to associate the GitHub user account with the Git email or username. Because names may be ambiguous, we uniquely identify a user u by their email address. A single user may contribute to multiple repositories. In total, we collected 170,413 unique Git users. An important caveat, however, is that users may change their Git configurations to alter their email address. It is unclear how often this happens, so some individuals may be represented as multiple users in our dataset.
Library extraction
from numpy import random denotes the library numpy.random, which is considered a separate, albeit closely related, library from numpy.linalg and even the top-level library numpy.
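The extraction rule described here can be illustrated with a minimal sketch. It is not the authors' implementation: the function name libraries_from_added_lines is hypothetical, the input is assumed to be a unified diff in which added lines start with "+", and only single-line import statements are handled. Python's standard ast module does the parsing, so the full library path is taken as written.

```python
import ast


def libraries_from_added_lines(diff_text):
    """Return full library paths imported on added ('+') lines of a unified diff.

    Hypothetical helper: assumes one import statement per diff line.
    """
    libs = []
    for line in diff_text.splitlines():
        # Keep added lines only; skip the '+++ b/file' file header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        code = line[1:].strip()
        if not code.startswith(("import ", "from ")):
            continue
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue  # e.g. an import statement split across several lines
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                # 'import numpy.random as rnd' -> 'numpy.random'
                libs.extend(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                for alias in node.names:
                    if alias.name == "*":
                        libs.append(node.module)
                    else:
                        # 'from numpy import random' -> 'numpy.random'
                        libs.append(f"{node.module}.{alias.name}")
    return libs


example_diff = """+++ b/model.py
+from numpy import random
+import numpy.random as rnd
+import numpy
-import os
 x = 1
"""
print(libraries_from_added_lines(example_diff))
```

Under these assumptions, both spellings of the numpy.random import map to the same library path, while the plain import numpy stays distinct. Relative imports are skipped in this sketch because they refer to code inside the repository rather than to an external library; that choice is an assumption of the example, not something stated in the text.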
Library adoption
+import
\(\ell \), n to r. Conversely, despite pulling code containing m, u did not adopt m in this commit. Figure 2 also demonstrates that the libraries \(\ell \), n, and m were introduced to repositories \(r^\prime \) and \(r^{\prime \prime }\) by users v and x.
Adoption event extraction
Commit features (C) | |
C1 | # libs added by user |
C2 | # libs updated since last commit |
C3 | \(\vert C1 \cap C2\vert \) |
User features (U) | |
U1 | Size of productive vocab \(P_u\) |
U2 | Size of receptive vocab \(R_u\) |
U3 | Time since last commit |
U4 | Intra-commit duration in last 10% of commits |
U5 | # Repos committed |
U6 | # Repos committed in last 10% of commits |
U7 | % commits with added libs |
U8 | % commits with added libs in last 10% of commits |
User–Library pair features (P) | |
P1 | # times user has seen \(\ell \) |
P2 | # times user has seen any library |
P3 | P1/P2 |
P4 | # times user has seen \(\ell \) in last 10% |
P5 | # times user has seen any library in last 10% |
P6 | P4/P5 |
Library features (L) | |
L1 | # commits adding \(\ell \) |
L2 | # users who have committed \(\ell \) |
L3 | # repos containing \(\ell \) |
L4 | Time since last commit of \(\ell \) |
L5 | Avg time between last 10% of commits adding \(\ell \) |
L6 | Avg time between last 10% of commits adding \(\ell \) |
StackOverflow features (S) | |
S1 | # posts containing \(\ell \) |
S2 | # views of posts containing \(\ell \) |
S3 | # posts containing \(\ell \) created in last 30 days |
S4 | # views of posts containing \(\ell \) created in last 30 days |
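As a rough illustration of how the user–library pair features P1 to P6 in the table could be computed, the sketch below counts how often a user has seen a library overall and in a recent window. Several details are assumptions rather than anything stated in the text: the function name pair_features is hypothetical, the input is assumed to be a chronological list with one entry per commit listing the libraries the user saw in that commit, and "last 10%" is read as the final 10% of those commits, rounded up.

```python
import math


def pair_features(seen_per_commit, library, recent_fraction=0.10):
    """Compute P1-P6 for one user-library pair.

    seen_per_commit: chronological list; each item is the list of libraries
    the user saw in one commit (an assumed input format).
    """
    def counts(commits):
        target = sum(libs.count(library) for libs in commits)
        total = sum(len(libs) for libs in commits)
        ratio = target / total if total else 0.0
        return target, total, ratio

    n = len(seen_per_commit)
    # Assumed reading of "last 10%": the final ceil(10% of n) commits.
    n_recent = math.ceil(n * recent_fraction) if n else 0
    recent = seen_per_commit[n - n_recent:]

    p1, p2, p3 = counts(seen_per_commit)
    p4, p5, p6 = counts(recent)
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p4, "P5": p5, "P6": p6}


# Toy history, invented purely for illustration.
history = [["numpy", "pandas"]] * 9 + [["numpy"]]
features = pair_features(history, "numpy")
print(features)
```

The ratios P3 and P6 are set to 0.0 when the user has seen no libraries at all, which is a convention chosen for this example to avoid division by zero.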
Adoption graph
StackOverflow
numpy.random and numpy are both condensed as the top-level library numpy. Only 19% of adopted libraries also appeared in at least one post on StackOverflow.
error and help have almost no adoptions, as we define them, but hundreds of millions of StackOverflow views. Conversely, homeassistant, an open-source home automation library, has hundreds of adoptions but only a dozen views on StackOverflow. We find a small positive correlation (Pearson \(R=0.16\), \(p<0.001\)) between StackOverflow views and adoption commits. However, its overall effect is tempered by the finding that relatively few (22%) adoption events involved a library that appeared on StackOverflow.
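The two operations mentioned in this passage, condensing library paths to their top-level name and measuring a Pearson correlation, can be sketched as follows. The helper names top_level, aggregate_by_top_level, and pearson_r are hypothetical, and the counts in the usage lines are toy values, not data from the study.

```python
import math
from collections import Counter


def top_level(library):
    """Condense a full library path to its top-level name."""
    return library.split(".", 1)[0]


def aggregate_by_top_level(counts):
    """Sum per-library counts after condensing each path to its top level."""
    totals = Counter()
    for library, n in counts.items():
        totals[top_level(library)] += n
    return dict(totals)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Toy adoption counts, invented for illustration only.
adoptions = aggregate_by_top_level({"numpy.random": 2, "numpy": 3, "pandas": 1})
print(adoptions)
```

In practice, the condensed adoption counts would be paired with per-library StackOverflow view counts before calling pearson_r. The significance test reported in the text is not reproduced in this sketch.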
Results and discussion
Behavior change near adoption events
Adoption model
Model data and features
Training and testing methodology
Model tuning
Feature ablation tests and model performance
Main findings
Limitations
Conclusions
imported libraries of Python repositories. This corpus allowed for the identification of library adoptions at the user and repository level.