1 Introduction
2 Data
3 Machine-learning approach
3.1 Features
3.1.1 Author features
- Cumulative visibility, \(F^{\mathrm{tot}}\), counts the total pageviews of an author's Wikipedia page up until the book's publication date.
- Longevity, \(t^{F}\), counts the days from the first appearance of the author's Wikipedia page until the book's publication date.
- Normalized cumulative visibility, \(f^{\mathrm{tot}}\), divides the cumulative visibility by the longevity, i.e., \(f^{\mathrm{tot}} = \frac{F^{\mathrm{tot}}}{t^{F}}\).
- Recent visibility, \(F^{\mathrm{rec}}\), counts the total pageviews of the author during the month before the book's publication, capturing the author's momentary popularity around publication time.
- Total sales, \(S^{\mathrm{tot}}\), obtained by querying the author's entire publishing history from Bookscan and summing the sales of her previous books up until the publication date of the predicted book.
- Sales in this genre, \(S^{\mathrm{tot}}_{\mathrm{in}}\), counts the author's previous total sales in the same genre as the predicted book.
- Sales in other genres, \(S^{\mathrm{tot}}_{\mathrm{out}}\), counts the author's previous total sales in other genres.
- Career length, \(t^{p}\), counts the number of days from the date of the author's first book publication until the publication date of the upcoming book.
- Normalized sales, \(s^{\mathrm{tot}}\), normalizes the total sales by the author's career length, i.e., \(s^{\mathrm{tot}} = \frac{S^{\mathrm{tot}}}{t^{p}}\).
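As a concrete illustration, the author features above can be computed from two records: a daily pageview series and a per-book sales history. The data below is hypothetical (the paper's Wikipedia and Bookscan data are not reproduced here), and the variable names mirror the symbols in the list; this is a minimal sketch, not the authors' pipeline.

```python
import pandas as pd

# Hypothetical inputs: daily Wikipedia pageviews for one author and a
# Bookscan-style sales history of her previous books.
pageviews = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-02-10"]),
    "views": [120, 95, 300],
})
sales_history = pd.DataFrame({
    "pub_date": pd.to_datetime(["2010-05-01", "2013-09-15"]),
    "genre": ["fiction", "nonfiction"],
    "sales": [15000, 4200],
})
pub_date = pd.Timestamp("2015-03-01")  # publication date of the predicted book
genre = "fiction"                      # genre of the predicted book

# Visibility features: only activity before the publication date counts.
past = pageviews[pageviews["date"] < pub_date]
F_tot = past["views"].sum()                     # cumulative visibility
t_F = (pub_date - past["date"].min()).days      # longevity (days)
f_tot = F_tot / t_F                             # normalized cumulative visibility
F_rec = past.loc[past["date"] >= pub_date - pd.Timedelta(days=30), "views"].sum()

# Sales features: restricted to books published before the predicted one.
prev = sales_history[sales_history["pub_date"] < pub_date]
S_tot = prev["sales"].sum()                     # total previous sales
S_in = prev.loc[prev["genre"] == genre, "sales"].sum()  # sales in this genre
S_out = S_tot - S_in                            # sales in other genres
t_p = (pub_date - prev["pub_date"].min()).days  # career length (days)
s_tot = S_tot / t_p                             # normalized sales
```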
3.1.2 Book features
3.1.3 Publisher features
3.2 Learning algorithms
3.2.1 Learning to place
3.2.2 Baseline methods
- Linear Regression: We compare the Learning to Place method with linear regression. Most of the features we explored are heavy-tail distributed, as are the one-year sales. Therefore, we take the logarithm of our dependent and independent variables, obtaining the model
$$ \log (PS) \sim \sum_{i} a_{i} \log (f_{i}) + \text{const}, $$
where \(f_{i}\) denotes the ith feature among the studied features.
- K-Nearest Neighbors (KNN): We employ regression based on k-nearest neighbors as an additional baseline model. The target variable is predicted by local interpolation of the targets associated with the nearest neighbors in the training set, using the Euclidean distance metric between instances and the five nearest neighbors (\(k=5\)). The features are preprocessed in the same fashion as in Linear Regression.
- Neural Network: The two baselines above do not capture nonlinear relationships between features, so we use a simple Multilayer Perceptron with one hidden layer of 100 neurons as another baseline. The features are preprocessed in the same fashion as in Linear Regression.
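The three baselines share the same log-transform preprocessing and differ only in the regressor. A minimal sketch of this setup, assuming scikit-learn (the paper does not name its implementation) and synthetic heavy-tailed data in place of the proprietary Bookscan features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: heavy-tailed (lognormal) features
# and heavy-tailed one-year sales tied loosely to the first feature.
X = rng.lognormal(mean=3.0, sigma=1.0, size=(500, 4))
y = (X[:, 0] ** 0.8) * rng.lognormal(0.0, 0.3, size=500)

# Log-transform both sides, as in the log-log regression model above.
X_log, y_log = np.log(X), np.log(y)

baselines = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5, metric="euclidean"),  # k = 5
    "mlp": MLPRegressor(hidden_layer_sizes=(100,),  # one hidden layer, 100 neurons
                        max_iter=2000, random_state=0),
}
predictions = {}
for name, model in baselines.items():
    model.fit(X_log[:400], y_log[:400])
    predictions[name] = model.predict(X_log[400:])  # predictions in log-sales space
```

Exponentiating the predictions recovers sales on the original scale.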
3.3 Model testing
- AUC and ROC: Evaluates the ranking produced by the algorithm against the true ranking. We take the true value of each training instance as a threshold and binarize the predicted and target values with respect to that threshold. From these two binarized lists we compute the true positive rate (TPR) and the false positive rate (FPR). Sweeping over thresholds separating high- and low-sale books yields the ROC (Receiver Operating Characteristic) curve, from which we calculate the AUC (Area Under Curve) score (see Additional file 1).
- High-end RMSE: We calculate the RMSE (Root-Mean-Square Error) for high-selling books, i.e., books in the top 20 percentile, to measure the accuracy of the sales prediction for exactly those books. Since book sales follow a heavy-tailed distribution, we calculate the RMSE based on the log values of the predicted and the actual sales.
4 Results
4.1 Predictions
4.2 Feature importance
4.3 Case studies
5 Robustness analysis
| Category | Fiction Q2 | Fiction Q3 | Fiction Q4 | Nonfiction Q2 | Nonfiction Q3 | Nonfiction Q4 |
|---|---|---|---|---|---|---|
| **AUC** | | | | | | |
| KNN | 0.83 | 0.82 | 0.82 | 0.81 | 0.80 | 0.82 |
| Linear Regression | 0.85 | 0.83 | 0.86 | **0.85** | **0.84** | **0.86** |
| Neural Network | **0.88** | 0.83 | 0.73 | 0.83 | 0.83 | 0.85 |
| Learning to Place | **0.88** | **0.85** | **0.88** | **0.85** | 0.83 | 0.85 |
| **High-end RMSE** | | | | | | |
| KNN | 0.60 | 0.91 | 1.03 | 0.71 | 0.72 | 0.77 |
| Linear Regression | 0.61 | 0.77 | 0.89 | 0.58 | 0.61 | 0.58 |
| Neural Network | 0.44 | 0.81 | 2.83 | 0.71 | 0.57 | 0.63 |
| Learning to Place | **0.42** | **0.71** | **0.45** | **0.49** | **0.51** | **0.62** |