1 Introduction
2 Scope of the survey
3 A brief history of the evolution of popularity prediction methods
4 Types of web content
5 Evaluating the prediction models
5.1 Numerical prediction
5.2 Classification
6 A classification of web content popularity prediction methods
6.1 Single domain
6.1.1 Before publication
6.1.2 After publication
6.2 Cross domain
7 A survey on popularity prediction methods
Class | Methods | Data sets | Benchmark model | Performance/Remarks |
---|---|---|---|---|
Before publication | SVM, Naive Bayes, Bagging, Decision Trees, Regression [72] | Feedzilla | | Shows an accuracy of 84% in predicting the popularity range of a news article. |
Before publication | Random Forests [71] | AD, De Pers, FD, NUjij, Spits, Telegraaf, Trouw, WMR | | Good performance in identifying which articles will receive at least one comment. |
Cumulative growth | Constant growth [29] | Slashdot | | Good performance in predicting the number of comments one day after the publication of an article (MSE = 36%). |
Cumulative growth | Constant scaling [30] | Digg, YouTube | Constant growth, Log-linear | Outperforms the constant growth and the log-linear models in terms of MRSE. |
Cumulative growth | Log-linear [30] | Digg, YouTube | Constant growth, Constant scaling | Outperforms the constant growth and the constant scaling models in terms of MSE. |
Cumulative growth | Survival analysis [65] | DPreview, MySpace | | Using the information received in the first day after the publication it can detect with 80% accuracy which threads will receive more than 100 comments. |
Cumulative growth | Logistic regression [61] | Twitter | | The model can successfully identify which messages will not be retweeted (99% accuracy) and those that will be retweeted more than 10,000 times (98% accuracy). |
Temporal analysis | Multivariate linear regression [33] | YouTube | Constant scaling | An average improvement of 15% in terms of MRSE compared to the constant scaling model. |
Temporal analysis | Reservoir computing [77] | YouTube | Constant scaling | Minor improvement compared to the constant scaling model. |
Temporal analysis | Time series prediction [32] | YouTube | | Designed for frequently-accessed videos. Good performance in predicting the daily number of views. |
Temporal analysis | kSAIT [63] | Twitter | Regression-based methods | Predict the number of tweets using information from the first hour after content publication. An improvement of up to 10% compared to regression-based methods. |
Popularity evolution patterns | Hierarchical clustering [32] | YouTube | | Designed for rarely-accessed videos. The model shows good performance for short-term predictions but significantly larger errors for long-term predictions. |
Popularity evolution patterns | MRBF [33] | YouTube | Constant scaling, Multivariate linear regression | An average improvement of 5% in terms of MRSE compared to multivariate linear regression and 21% compared to constant scaling model. |
Popularity evolution patterns | Temporal-evolution prediction [34] | YouTube, Vimeo, Digg | Log-linear | Significant improvement compared to the log-linear method. The model can be used to predict the temporal evolution of popularity. |
Individual behavior | Social dynamics [81] | Digg | Log-linear | It incorporates information about the design of the web site. Shows an accuracy of 95% in identifying which articles will get on Digg’s front page. |
Individual behavior | Conformer Maverick [67] | JokeBox | Collaborative filtering solutions | Adequate for platforms that rank content based on user votes. Better performance than collaborative filtering solutions. |
Individual behavior | Bayesian networks [64] | Twitter | | MRE of 40% when predicting the total number of tweets using the information received in the first five minutes after publication. |
Cross-domain | Linear regression [36] | IMDb, Twitter, YouTube | | Designed to predict movie ratings using social media signals. The best performance was achieved when using textual features from Twitter and the fraction of likes over dislikes from YouTube. |
Cross-domain | Linear regression [14] | Al Jazeera | | Results show that a model based on social media reactions in the first ten minutes has the same performance as one based on the number of views received in the first three hours. |
Cross-domain | Social transfer [35] | YouTube, Twitter | SVM basic | Shows a 70% accuracy in identifying which videos will receive sudden bursts of popularity (60% improvement over a model that uses only the information available on YouTube). |
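
Several rows of the table compare methods against the constant scaling and log-linear baselines in terms of MRSE. The following is a minimal sketch of how such a comparison could be computed. It assumes that MRSE denotes the mean of squared relative errors and that the two baselines take the forms usually attributed to [30]: a single scaling factor applied to the early count, and a unit-slope linear fit on log-transformed counts. The function names and the toy view counts are hypothetical and are not taken from the surveyed papers.

```python
import math


def mrse(predicted, actual):
    """Mean relative squared error: mean of ((predicted - actual) / actual)^2."""
    return sum(((p - a) / a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def fit_constant_scaling(early, final):
    """Scaling factor alpha that minimises the MRSE of alpha * early against final."""
    ratios = [x / y for x, y in zip(early, final)]
    return sum(ratios) / sum(r * r for r in ratios)


def predict_constant_scaling(alpha, early):
    return [alpha * x for x in early]


def fit_log_linear(early, final):
    """Intercept and residual variance of ln(final) = ln(early) + beta0 + noise."""
    diffs = [math.log(y) - math.log(x) for x, y in zip(early, final)]
    beta0 = sum(diffs) / len(diffs)
    var = sum((d - beta0) ** 2 for d in diffs) / len(diffs)
    return beta0, var


def predict_log_linear(beta0, var, early):
    # exp(var / 2) corrects the bias of transforming back from the log scale
    # (assumes normally distributed residuals on the log scale).
    return [math.exp(math.log(x) + beta0 + var / 2) for x in early]


# Hypothetical toy data: popularity at an early time and at the target time.
train_early = [10, 40, 100, 250]
train_final = [30, 100, 320, 700]
test_early = [20, 80]
test_final = [55, 260]

alpha = fit_constant_scaling(train_early, train_final)
beta0, var = fit_log_linear(train_early, train_final)

print("constant scaling MRSE:",
      mrse(predict_constant_scaling(alpha, test_early), test_final))
print("log-linear MRSE:",
      mrse(predict_log_linear(beta0, var, test_early), test_final))
```

Under these assumptions the constant scaling factor is fitted to minimise the relative error directly, whereas the log-linear fit minimises squared error on the log scale, which is one plausible reason the two baselines are ranked differently depending on the error measure.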
7.1 Single domain
7.1.1 Before publication
7.1.2 After publication
growth profile (we adopt the terminology used in [30]), assumes that, depending on the time of the publication, news stories follow a constant growth that can be described by the following function:
constant scaling model [30]:
log-linear
) expressed as
survival analysis
that allows one to model the time until an event occurs (a typical event is “death”, from which the term survival analysis is derived). While the main use of this method would be to predict the lifetime of a piece of web content, by changing the definition of an event, the method can also be used for popularity prediction tasks. The solution proposed by Lee et al. is to define the event as the moment when a piece of web content reaches a popularity value above a certain threshold. The performance of this method was tested on threads from two online discussion forums, DPreview and MySpace, with popularity expressed as the number of comments per thread. Using different statistics related to the users’ comment arrival rate, the authors show that, by observing user activity in the first day after publication, the method can detect with 80% accuracy the threads that will receive more than 100 comments.
multivariate linear regression
expressed as
constant scaling model. For instance, predicting the popularity of a video one month after its publication using data from the first week shows an average improvement of 14% over the constant scaling model. The main drawback of this algorithm, as mentioned by the authors, is that, for the prediction methods to be effective, additional exploration is needed to decide on the optimal history length and the sampling rate.
Reservoir computing
[76], a novel paradigm in recurrent neural networks, is proposed as a model that could consider more complex interactions between early and late popularity values (between $X_c(t_i)$ and $N_c(t_r)$). More specifically, this technique is used to build a large recurrent neural network that allows one to create and evaluate nonlinear relationships between $X_c(t_i)$ and $N_c(t_r)$ [77]. On a small sample of YouTube videos this model shows a minor improvement over the constant scaling model in predicting the daily number of views based on the observations received in the previous ten days.
time series prediction
model using Autoregressive Moving Average (ARMA). Thus, the popularity of a video on a given day n, $x_c(n)$, can be predicted using the following formula:
kSAIT
(top-k Similar Author-Identical historic Tweets), an algorithm that can predict the popularity of tweets one, two, or three days after publication based on the retweet information received in the first hour [63]. The underlying assumption of this algorithm is that tweets are retweeted in a similar manner depending on the author of the tweet. The prediction algorithm is thus user-specific (there is one prediction function for each user) and uses only users’ retweeting behavior as predictive features: it does not include any information about the content itself or about users’ centrality in the graph of social interactions. Each tweet is described by a set of features (e.g., retweet acceleration, retweet depth) derived from the time series of the retweets published in the first hour after publication by the direct and n-level followers, the publication time of the tweet, and information about the users who retweeted the original tweet. When a new tweet is posted, the algorithm computes the similarity between the tweet and all other tweets published by the same user, selects the top-k most similar tweets, and estimates the popularity of the target tweet as the average popularity of those top-k tweets.
hierarchical clustering
based on the time series of video popularity during 64 days centered on the peak. This strategy reveals that, for videos that are viewed during short periods of time, there are ten common shapes that describe the temporal evolution of most videos. Once these shapes are detected, the prediction task consists in mapping videos to the clusters that best describe their evolution until the prediction moment ($t_i$) and in using the temporal evolution trends of the clusters to deduce future video popularity. On a sample of YouTube videos, this method shows good performance in making short-term predictions (predicting the number of views in the next day) but significantly larger errors in making long-term predictions.
multivariate linear regression
model by proposing a solution that captures the similarity between videos in terms of their temporal evolution patterns [33]. The model assumes that the temporal popularity evolution of a subset of videos is representative of the entire population and could be used to improve the prediction accuracy. More specifically, the prediction model, called multivariate radial basis function (MRBF), is described by the following relationship:
multivariate linear regression and 20% compared to the constant scaling model.
log-linear
model. For example, when using the observations received in the first 24 hours to predict the popularity four hours ahead, this model shows an MRSE of 1% for Digg and 3.5% for Vimeo and YouTube, a significant improvement over the log-linear model, which shows an MRSE of 17% for Digg, 24.2% for Vimeo, and 29.7% for YouTube.
Social dynamics
, the model proposed by Lerman and Hogg, describes the temporal evolution of web content popularity as a stochastic process of user behavior during a browsing session on a social media site [81]. In its original form, the model is designed according to the characteristics of the social bookmarking site Digg: stories can be found in three sections of the site (front, upcoming, and friend list pages), users can express their opinions through votes, and stories are arranged in pages or promoted to different sections of the site based on the dynamics of votes.
Conformer Maverick
, a model used to predict content popularity based on users’ voting profiles [67]. The underlying assumption of the model is that, in the voting process, users can have two behaviors: follow the general opinion of users (the “conformers”) or go against it (the “mavericks”). The profile of a user lies between these two extremes, but in general one trait prevails.
7.2 Cross domain
Social Transfer
, extracts information from Twitter to detect videos that will experience sudden bursts of popularity on YouTube [35]. The model consists of the following steps: extract popular topics from Twitter, associate these topics with YouTube videos, and compare the popularity of videos on Twitter with their popularity on YouTube. A disproportionate share of attention on Twitter compared to YouTube is then used as strong evidence that a video will experience a sudden burst of popularity.
multiple linear regression
that uses as input the following variables: number of Facebook shares, number of tweets and retweets, entropy of tweet vocabulary, and the mean number of followers sharing the articles on Twitter. Using a collection of Al Jazeera news stories, the authors show that a model based on the social media signals received in the first ten minutes after publication achieves the same performance as one based on the number of page views received in the first three hours.
8 What makes web content popular?
9 Predictive features
10 Shaping the future: Applications of web content popularity prediction
log-linear
model could be an effective method for this ranking problem [73]. It is also important to understand how users react to information about the predicted popularity. Even if web content shows a mild resilience to self-fulfilling prophecies [98], the prediction outcome can become a strong form of social influence that inflates or dampens the success of a web item. One solution to this problem is to create a feedback loop that listens to users’ reactions and adjusts the decisions depending on how the audience is responding to the prediction outcome.