Abstract
Scientific datasets are increasingly crucial for knowledge accumulation and reproducibility, making it essential to understand how they are used. Although usage information is hard to obtain, features of the publications that first describe a dataset can provide clues. This article associates dataset downloads with the authors’ h-index, institutional prestige, journal ranking, and the references used in the publication that first introduces the dataset. Our analysis draws on tens of thousands of datasets and associated publications from figshare.com. We found that a gradient boosting model outperformed linear regression, random forests, and artificial neural networks. Interpretation of the model suggests that journal ranking is highly predictive of usage, while the authors’ institutional prestige and h-index are less critical. In addition, we found that publications with a long but focused body of references are associated with more dataset downloads. We also show that prediction performance decays rapidly the further into the future we estimate downloads. Finally, we discuss the implications of our work for reproducibility and data policies.