Background
-
We build an effective hashtag recommendation system using a proposed hashtag ranking method, Hashtag Frequency-Inverse Hashtag Ubiquity (HF-IHU).
-
We provide scalable Map-Reduce algorithms to construct two fundamental structures, the term-frequency map for hashtags (THFM) and hashtag-frequency map (HFM). These indices are used to support fast HF-IHU calculations.
-
We conduct a nuanced evaluation of HF-IHU over a large Twitter data set. We compare HF-IHU against several popular schemes, including k-nearest neighbors using Cosine similarity, k-popularity, and Naïve Bayes. Our results show that HF-IHU achieves substantially higher recall than the other schemes and is resistant to retweets.
Background
-
Tweet: In Twitter, a tweet is a short message limited to 140 characters posted by a user. To tweet is also used as a verb for posting such messages. Unless an account is private, all tweets are public by default.
-
Follow: In Twitter, subscribing to other users to read their tweets is called following. Users can follow someone without that person’s approval unless that person has enabled tweet protection or has blocked them. Unlike friending in other social networks, following is not mutual.
-
User: A user is identified by a Twitter handle, @user. Users can mention other users by adding their handles to a tweet. When mentioned, a user is notified by the Twitter API, and the mentioning tweet is displayed on the user’s feed.
-
Retweet: When users want to share someone else’s tweet, they can retweet it to their own followers. There are two ways: automatic retweeting and manual retweeting. Automatic retweeting is Twitter’s built-in feature where a tweet is shared verbatim and marked as a retweet. Users can manually retweet by copying the body of a tweet they want to share and pasting it into their tweet box. Since these tweets are not automatically marked as retweets, Twitter instructs users to add the keyword RT and the initial author’s handle in the tweet content. Sometimes, users will add their own thoughts to a manual retweet, which changes the content of the original tweet.
-
Hashtag: A hashtag is a keyword prefixed with # that can be placed anywhere in the body of a tweet to categorize the tweet or to mark words or phrases as keywords related to it. By clicking a hashtag in a tweet, users can view all tweets containing that hashtag. Extremely popular hashtags often become trends.
-
Trend: Twitter displays a list of immediately popular keywords and hashtags on the user’s homepage to help users discover the emerging topics in Twitter, and these keywords are referred to as trends. Trends are user-locality aware, but are not context-sensitive.
Index generation and ranking
Map-Reduce
Frequency map generation algorithm
The output has the form #hashtag:term, where #hashtag is a hashtag that appears in a tweet and term is a term that appears in the same tweet. For example, when the input tweet contains multiple hashtags, such as washington state university #wsuv #cs, the result is six lines: #wsuv:washington, #wsuv:state, #wsuv:university, #cs:washington, #cs:state, #cs:university. The Map function to generate the HFM is shown in Algorithm 1.
Ranking hashtags with HF-IHU
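As an illustration of the Map step described for frequency map generation, the following is a minimal Python sketch, not the authors' Algorithm 1. It emits one #hashtag:term line per (hashtag, term) pair of a tweet. The function names `map_tweet` and `reduce_counts`, the whitespace tokenization, the lowercasing, and the counting Reduce step are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def map_tweet(tweet):
    """Emit one '#hashtag:term' line for every (hashtag, term) pair in a tweet."""
    tokens = tweet.lower().split()  # assumption: whitespace tokenization
    hashtags = [t for t in tokens if HASHTAG.fullmatch(t)]
    terms = [t for t in tokens if not t.startswith("#")]
    return [f"{h}:{t}" for h in hashtags for t in terms]


def reduce_counts(lines):
    """Hypothetical Reduce step: count how often each '#hashtag:term' key occurs."""
    return Counter(lines)


lines = map_tweet("washington state university #wsuv #cs")
for line in lines:
    print(line)
```

For the example tweet in the text, the sketch prints the same six lines, with the pairs for #wsuv first and the pairs for #cs second.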
Experimental evaluation
Tweet corpus
tweets after several attempts to re-fetch them. Most of the missed tweets were returned with HTTP status code 301, 302, or 404. Code 301 means moved permanently and code 302 means moved temporarily; both denote retweets, which the tool does not handle completely. Code 404 means that the requested page was not found, which denotes a deleted tweet. In all, the corpus size was 3 GB.
Characteristic | Value |
---|---|
Downloaded tweets | 8,320,161 |
Cleaned tweets | 8,234,098 |
Tweets containing no hashtags | 7,182,506 |
Tweets containing at least one hashtag | 1,051,592 |
One hashtag per tweet | 827,630 |
Two hashtags per tweet | 145,718 |
More than two hashtags per tweet | 78,244 |
Maximum number of hashtags used in a tweet | 28 |
Experimental setup
-
kNN with Cosine similarity: Given a tweet \(t_1\) in the training set and another tweet \(t_2\) from the test set, this method computes the Cosine similarity:
$$\begin{aligned} \cos (t_1,t_2) = \frac{t_1 \cdot t_2}{\parallel t_1 \parallel \parallel t_2 \parallel }. \end{aligned}$$(3)
For each tweet in the test data, we iterated through all tweets in the training data and computed the Cosine similarity between them. We found the k-Nearest Neighbors \((k=200)\) of the test tweet and used these neighbors to produce a ranked list of recommended hashtags.
-
Naïve Bayes: This method makes recommendations based on the results of a multinomial Naïve Bayes model, which is standard for text documents with large vocabularies and sparse data. In this model, the hashtag ranking depends on the posterior probability of a hashtag \(H_i\) given a tweet composed of a set of terms \(t_j\), each with frequency \(f_{t_j}\):
$$\begin{aligned} P(H_i|t_1, ...,t_n) \propto P(H_i)\prod _jP(t_j|H_i)^{f_{t_j}}. \end{aligned}$$(4)
We use Laplacian smoothing to deal with edge conditions in the conditional probability tables.
-
Overall popularity: This method simply recommends the most frequently occurring (popular) hashtags in the training set for each test tweet. Table 2 shows the top 30 popular hashtags in our data set. This ranking method is not designed to make personalized recommendations, and therefore the same hashtags are recommended for every tweet.
-
User similarity and Tweet similarity: This method was proposed by Kywe et al. [15]. They score candidate hashtags based on a combination of user similarity and tweet similarity, employing TF-IDF as a means of scoring each similarity. A user is represented by the preference weight for each hashtag in the data. The preference weight \(w_{ij}\) of user \(u_j\) for hashtag \(h_i\) is defined by the following formula:
$$\begin{aligned} w_{ij}&= \text{TF}_{ij} \cdot \text{IDF}_i \end{aligned}$$(5)
$$\begin{aligned}&= \frac{\text{Freq}_{ij}}{\text{Max}_j} \cdot \log \left( \frac{N_u}{n_i}\right), \end{aligned}$$(6)
where \(\text{Freq}_{ij}\) is the usage frequency of hashtag \(h_i\) by \(u_j\), \(\text{Max}_j\) is the total number of hashtags used by \(u_j\), \(N_u\) is the total number of users, and \(n_i\) denotes the number of users who used \(h_i\). Similar tweets are retrieved in a similar manner, as shown in the following formula:
$$\begin{aligned} w_{kl}&= \text{TF}_{kl} \cdot \text{IDF}_l\end{aligned}$$(7)
$$\begin{aligned}&= \frac{\text{Freq}_{kl}}{\text{Max}_k} \cdot \log \left( \frac{N_t}{n_l}\right), \end{aligned}$$(8)
where \(\text{Freq}_{kl}\) is the frequency of word \(w_l\) in tweet \(t_k\), \(\text{Max}_k\) is the total number of words used in \(t_k\), \(N_t\) is the total number of tweets, and \(n_l\) denotes the number of tweets in which \(w_l\) appears. To find the top X similar users, HTofUsers(u), the cosine similarity between a target user u and another user \(u_i\) is measured as follows:
$$\begin{aligned} \cos (u,u_i) = \frac{u \cdot u_i}{\parallel u \parallel \parallel u_i \parallel }. \end{aligned}$$(9)
Similarly, to find the top Y similar tweets, HTofTweets(t), the cosine similarity between a target tweet t and another tweet \(t_k\) is measured as follows:
$$\begin{aligned} \cos (t,t_k) = \frac{t \cdot t_k}{\parallel t \parallel \parallel t_k \parallel }. \end{aligned}$$(10)
After finding HTofUsers(u) and HTofTweets(t), the candidate hashtags for the target tweet t posted by user u are obtained by the following formula:
$$\begin{aligned} SuggestedHashtags(u,t) = HTofUsers(u) \cup HTofTweets(t). \end{aligned}$$(11)
The recommendations are ranked by hashtag frequency in SuggestedHashtags(u, t). Since they reported that this method performed best when \(X=5\) and \(Y=50\), we used the same values for these parameters in our experiment.
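The kNN baseline with Cosine similarity from the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: tweets are tokenized on whitespace, hashtags are left out of the term vectors, and the k nearest neighbors vote for their hashtags by simple frequency (the text does not specify how the neighbors are turned into a ranked list). The names `term_vector`, `cosine`, and `recommend` are hypothetical.

```python
import math
from collections import Counter


def term_vector(tweet):
    """Bag-of-words vector of a tweet; hashtags are left out of the vector."""
    return Counter(t for t in tweet.lower().split() if not t.startswith("#"))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(test_tweet, training_tweets, k=200, top_n=10):
    """Rank hashtags by how often they occur among the k nearest training tweets."""
    query = term_vector(test_tweet)
    scored = [(cosine(query, term_vector(tw)), tw) for tw in training_tweets]
    # Keep only neighbors with positive similarity (an assumption of this sketch).
    neighbors = sorted((s for s in scored if s[0] > 0),
                       key=lambda s: s[0], reverse=True)[:k]
    votes = Counter()
    for _, tw in neighbors:
        votes.update(t for t in tw.lower().split() if t.startswith("#"))
    return [h for h, _ in votes.most_common(top_n)]


training = [
    "washington state university #wsuv",
    "washington dc capitol #dc",
    "pizza night #food",
]
print(recommend("washington university campus", training, k=1))
```

Because hashtags are excluded from the vectors, two tweets that differ only in their hashtags receive a similarity of 1.0 in this sketch.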
No. | Hashtag | No. | Hashtag | No. | Hashtag |
---|---|---|---|---|---|
1 | #ff | 11 | #bbb | 21 | #nicovideo |
2 | #egypt | 12 | #news | 22 | #1 |
3 | #jan25 | 13 | #icantdateyou | 23 | #partiu |
4 | #nowplaying | 14 | #fail | 24 | #shoutout |
5 | #np | 15 | #sougofollow | 25 | #music |
6 | #mentionke | 16 | #sotu | 26 | #followme |
7 | #fb | 17 | #rt | 27 | #follow |
8 | #jobs | 18 | #tcot | 28 | #win |
9 | #teamfollowback | 19 | #famouslies | 29 | #nw |
10 | #followmejp | 20 | #improudtosay | 30 | #iphone |
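The overall popularity baseline amounts to counting hashtag occurrences in the training set and returning the most frequent ones for every test tweet. A minimal Python sketch, assuming whitespace tokenization and a hypothetical helper name `popular_hashtags`, is shown below; the toy tweets are invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def popular_hashtags(training_tweets, n=30):
    """Return the n most frequent hashtags in the training tweets."""
    counts = Counter(
        tok
        for tweet in training_tweets
        for tok in tweet.lower().split()
        if tok.startswith("#")
    )
    return [hashtag for hashtag, _ in counts.most_common(n)]


# Toy data, invented for this example.
toy_tweets = ["a #ff", "b #ff #egypt", "c #egypt", "d #ff", "e #np"]
print(popular_hashtags(toy_tweets, n=2))
```

The same list is returned regardless of the test tweet, which is why this baseline cannot personalize its recommendations.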
Evaluation metric: precision and recall
Experimental results
Full corpus
Stratified retweets
washington #wsuv and a test tweet washington #DC. Since hashtags are ignored when Cosine similarity is calculated, the similarity score for these two tweets is 1.0. This training tweet is then considered as a retweet of the test tweet washington, even though it returns #wsuv instead of #DC.
Case study: recommendations for users
@XboxSupport | | @jewishblogger | | @freeprojectinfo | |
---|---|---|---|---|---|
#vuze | \(\bullet\) | #israel | \(\bullet\) | #jobs | \(\bullet\) |
#kinect | \(\bullet\) | #jewish | \(\bullet\) | #freelance | \(\bullet\) |
#egypt | | #obama | | #webdevelopment | \(\bullet\) |
#jan25 | | #israeli | \(\bullet\) | #job | \(\bullet\) |
#jobs | | #telaviv | \(\bullet\) | #egypt | |
#fb | | #synagogue | \(\bullet\) | #design | \(\bullet\) |
#sissyboys | | #gasztro | \(\bullet\) | #jan25 | |
#xbox | \(\bullet\) | #parashat | \(\bullet\) | #fb | |
#ff | | #jan25 | | #seo | \(\bullet\) |
#nowplaying | | #jerusalem | \(\bullet\) | #wordpress | \(\bullet\) |
Hits | 3 | | 8 | | 7 |
#vuze was the only tag that did not have an intuitive semantic value. A cursory search indicates that Vuze is a program that allows users to stream music and videos through devices, such as Xbox consoles, so it was deemed a hit.
@XboxSupport | | @jewishblogger | | @freeprojectinfo | |
---|---|---|---|---|---|
#codysimpsonu... | | #codysimpsonu... | | #codysimpsonu... | |
#backintheday | | #cruzazul | | #cruzazul | |
#nowplaying | | #lastfm | | #lastfm | |
#fb | | #nowplaying | | #nowplaying | |
#np | | #news | | #news | |
#ff | | #followmejp | | #followmejp | |
#mentionke | | #zodiacfacts | | #zodiacfacts | |
#zodiacfacts | | #magistream | | #magistream | |
#bkstage | | #sougofollow | | #sougofollow | |
#codysimpson | | #win | | #win | |
Hits | 0 | | 0 | | 0 |
@XboxSupport | | @jewishblogger | | @freeprojectinfo | |
---|---|---|---|---|---|
#vuze | \(\bullet\) | #gasztro | \(\bullet\) | #jobs | \(\bullet\) |
#kinect | \(\bullet\) | #parashat | \(\bullet\) | #freelance | \(\bullet\) |
#egypt | | #jerusalem | \(\bullet\) | #webdevelopment | \(\bullet\) |
#jan25 | | #egypt | \(\bullet\) | #job | \(\bullet\) |
#jobs | | #holocaust | \(\bullet\) | #egypt | |
#fb | | #judaism | \(\bullet\) | #design | \(\bullet\) |
#sissyboys | | #mentionke | | #jan25 | |
#xbox | \(\bullet\) | #jew | \(\bullet\) | #fb | |
#ff | | #talmud | \(\bullet\) | #seo | \(\bullet\) |
#nowplaying | | #nowplaying | | #wordpress | \(\bullet\) |
Hits | 3 | | 8 | | 7 |
#egypt and #jan25 during the Egypt revolution, for example, many of them still consist of frequently used Twitter terms such as #ff (short for follow-friday), #nowplaying (tagged with songs), and others. Table 6 shows the recommendations when we exclude any of the top 30 most popular hashtags.
@XboxSupport | | @jewishblogger | | @freeprojectinfo | |
---|---|---|---|---|---|
#vuze | \(\bullet\) | #gasztro | \(\bullet\) | #freelance | \(\bullet\) |
#kinect | \(\bullet\) | #parashat | \(\bullet\) | #webdevelopment | \(\bullet\) |
#sissyboys | | #jerusalem | \(\bullet\) | #job | \(\bullet\) |
#xbox | \(\bullet\) | #holocaust | \(\bullet\) | #design | \(\bullet\) |
#xbox360 | \(\bullet\) | #judaism | \(\bullet\) | #seo | \(\bullet\) |
#taddei | | #jew | \(\bullet\) | #wordpress | \(\bullet\) |
#job | | #talmud | \(\bullet\) | #lukewilliamss | |
#5 | | #bethaderej | \(\bullet\) | #html | \(\bullet\) |
#coupon | | #sm | | #css | \(\bullet\) |
#deals | | #orangotag | | #marketing | \(\bullet\) |
Hits | 4 | | 8 | | 9 |
@XboxSupport | | @jewishblogger | | @freeprojectinfo | |
---|---|---|---|---|---|
#codysimpsonust\(\dots\) | | #codysimpsonust\(\dots\) | | #codysimpsonust\(\dots\) | |
#backintheday | | #cruzazul | | #cruzazul | |
#zodiacfacts | | #lastfm | | #lastfm | |
#bkstage | | #zodiacfacts | | #zodiacfacts | |
#codysimpson | | #magistream | | #magistream | |
#thatswhatiwant | | #sougofollow | | #sougofollow | |
#bears | | #ebay | | #win | |
#goodwoman | | #sagittarius | | #ebay | |
#packers | | #codysimpson | | #sagittarius | |
#cruzazul | | #qanow | | #qanow | |
Hits | 0 | | 0 | | 0 |
Performance evaluation
Related work
Keyword extraction
SPORT IN ENGLAND and CHICAGO CUBS can be used for marketing.