1 Introduction
- We propose a new topic model, Bi-Labeled LDA, to model the process by which non-famous users follow famous users and to infer tags for non-famous users effectively. Compared with existing models, it takes the relation between famous users into consideration, incorporating more supervision information into traditional LDA. Bi-Labeled LDA is further improved to address two issues: the strong assumption behind LDA and the high popularity of some topics and famous users.
- A Random Walk model is proposed to rank the tags inferred through Bi-Labeled LDA, adjusting the importance of unpopular tags among famous users.
- We conducted comprehensive experiments on a real dataset and compared the interest tags found by the proposed models with those found by state-of-the-art approaches. We find that the interest tags extracted by our methodology are more accurate than those of the other approaches and generalize better.
2 Related Work
3 Problem Definition
4 Extracting Interest Tags of Famous Users
5 Inferring Tags for Non-famous Users
- Intuition 1 If a non-famous user u follows more users who are famous in topic a than users who are famous in topic b, then u follows a famous user v who is famous in both topics a and b more because of interest in topic a than in topic b. For example, suppose user u follows ten famous users. We count the tags of these famous users and get three tags with counts: entertainment (6), business (1), and food (5). For a particular famous user v with tags entertainment and business, we think user u follows v more because of interest in topic entertainment than in topic business.
- Intuition 2 If a famous user v is followed by more non-famous users with interest in topic a than in topic b, then v is followed by a non-famous user u more due to u’s interest in topic a than in topic b. For example, suppose a famous user v with tags entertainment and business is followed by ten non-famous users. Among these non-famous users, six have interest in topic entertainment, one has interest in business, and five have interest in food. Then, non-famous user u follows v more because of his/her interest in topic entertainment than in topic business.
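The counting argument in Intuition 1 can be illustrated with a small Python sketch. It is only a tally under stated assumptions, not the Bi-Labeled LDA model itself: the function name `follow_reason_weights` and the concrete tag sets are hypothetical, chosen so that the counts match the example above (entertainment 6, business 1, food 5).

```python
from collections import Counter


def follow_reason_weights(followed_tag_sets, target_tags):
    """Split the reason for following one famous user across that user's tags.

    followed_tag_sets: tag sets of all famous users followed by u.
    target_tags: tags of the particular famous user v.
    Returns each tag of v with its share of the counts (hypothetical helper).
    """
    counts = Counter(tag for tags in followed_tag_sets for tag in tags)
    restricted = {tag: counts[tag] for tag in target_tags}
    total = sum(restricted.values())
    if total == 0:
        # No evidence for any of v's tags: fall back to a uniform split.
        return {tag: 1.0 / len(restricted) for tag in restricted}
    return {tag: count / total for tag, count in restricted.items()}


# Ten followed famous users (assumed tag sets) giving the counts in the text:
# entertainment 6, business 1, food 5.
followed = (
    [{"entertainment", "business"}]      # the famous user v
    + [{"entertainment"}] * 4
    + [{"entertainment", "food"}]
    + [{"food"}] * 4
)
weights = follow_reason_weights(followed, {"entertainment", "business"})
print(weights)
```

With these assumed inputs, entertainment receives a larger share than business for user v, which is the ordering Intuition 1 describes. Intuition 2 is the same tally taken over the followers of v instead of the followings of u.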
5.1 Bi-Labeled LDA
Symbol | Description |
---|---|
U | The set of non-famous users |
V, \(V_{i}^{\left( u \right)}\) | The set of famous users and the ith famous user followed by user u |
K | The set of topics (tags) famous users have |
\(Z_{i}\) | The ith topic |
\(N_{u}\) | The number of famous users followed by user u |
\({\rm T}^{\left( v \right)}\) | The famous user v’s tag set |
\(\varLambda^{\left( u \right)}\) | A binary vector to represent u’s candidate tags |
\(\alpha , \beta\) | Dirichlet smoothing parameters of topics and words, respectively |
\(\delta , \gamma\) | Label priors for topics and non-famous users, respectively |
\(\theta^{\left( u \right)}\) | The non-famous user u’s topic distribution |
\(\phi^{\left( k \right)}\) | Topic \(t_{k}\)’s distribution over famous users |
\(\alpha^{\left( u \right)}\) | The Dirichlet smoothing parameters for non-famous user u |
\(\beta^{\left( k \right)}\) | The Dirichlet smoothing parameters for topic k |
5.2 Learning and Inference
Symbol | Meaning |
---|---|
\(\varvec{Z}\) | Defined as \(\varvec{Z} = \left( {\varvec{z}_{1} , \varvec{z}_{2} , \ldots ,\varvec{z}_{\left| U \right|} } \right)\), in which each \(\varvec{z}_{\varvec{u}} = \left( {z_{u,1} ,z_{u,2} , \ldots ,z_{{u,n_{u} }} } \right)\) represents the topic assignments of user u’s followings |
\(\varvec{W}\) | Defined as \(\varvec{W} = \left( {\varvec{w}_{1} , \varvec{w}_{2} , \ldots ,\varvec{w}_{\left| U \right|} } \right)\), in which each \(\varvec{w}_{\varvec{u}} = \left( {w_{u,1} ,w_{u,2} , \ldots ,w_{{u,n_{u} }} } \right)\) represents the \(n_{u}\) famous users whom user u follows |
\(\varvec{z}_{{\varvec{u},\varvec{m}}}\) | The topic assignment of the mth famous user followed by user u |
\(\varvec{w}_{{\varvec{u},\varvec{m}}}\) | The mth famous user followed by user u |
\(\varvec{c}_{{\varvec{k},\varvec{u},\varvec{v}}}\) | The number of associations between a topic \(t_{k}\) and a famous user \(v\) followed by user \(u\) |
\(\varvec{c}_{{\varvec{k},\varvec{u},\varvec{*}}}\) | The number of associations in which u follows a famous user due to topic k. Symbol * denotes a summation over all possible subscript variables and here means all possible famous users followed by u |
\(\varvec{c}_{{\varvec{k},\varvec{u},\varvec{*}}}^{{ - \left( {\varvec{u},\varvec{m}} \right)}}\) | The number of times user u follows a famous user due to topic k excluding the current following behavior that non-famous user \(u\) follows the mth famous user |
- If a non-famous user u follows more users who are famous in aspect x than the ones who are famous in aspect y, then \(c_{x,u,*} > c_{y,u,*} \to \widehat{{\theta_{x}^{\left( u \right)} }} > \widehat{{\theta_{y}^{\left( u \right)} }}\), i.e., u follows a famous user v who is famous both in aspects x and y more because of interest in aspect x than in y.
- If a famous user v is followed by more non-famous users with interest in aspect x than in aspect y, then \(c_{x,*,v} > c_{y,*,v} \to \widehat{{\phi_{v}^{\left( x \right)} }} > \widehat{{\phi_{v}^{\left( y \right)} }}\), i.e., v is followed by a non-famous user u more due to u’s interest in aspect x than in aspect y.
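The two implications above say that larger counts lead to larger estimates. A minimal Python sketch of why this holds, assuming the usual smoothed point estimate from collapsed Gibbs sampling with a single symmetric smoothing value: the function name `smoothed_distribution` and the value 0.1 are assumptions for illustration, and the label-restricted priors \(\alpha^{(u)}\) and \(\beta^{(k)}\) of Bi-Labeled LDA are not reproduced here.

```python
def smoothed_distribution(counts, smoothing):
    """Normalize counts into a distribution after adding a symmetric prior.

    counts: mapping from topic to c_{k,u,*} (or from famous user to c_{k,*,v}).
    smoothing: a single Dirichlet smoothing value (assumed symmetric).
    """
    total = sum(counts.values()) + smoothing * len(counts)
    return {key: (count + smoothing) / total for key, count in counts.items()}


# Counts c_{k,u,*} for one non-famous user u, taken from the running example.
topic_counts = {"entertainment": 6, "business": 1, "food": 5}
theta_hat = smoothed_distribution(topic_counts, smoothing=0.1)
print(theta_hat)
```

Because every entry shares the same denominator and the same additive smoothing, the ordering of the counts carries over to the ordering of the estimates. The same function applied to the counts \(c_{k,*,v}\) for a fixed topic gives the corresponding ordering for \(\widehat{\phi}\).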
5.3 Extension of Bi-Labeled LDA
- User u is interested in some topics in which user v is famous.
- User v is very popular.
- One topic in which user v is famous is very popular.
Symbol | Meaning |
---|---|
\(\varvec{B}\) | Defined as \(\varvec{B} = \left( {\varvec{b}_{1} , \varvec{b}_{2} , \ldots ,\varvec{b}_{\left| U \right|} } \right)\), in which each \(\varvec{b}_{\varvec{u}} = \left( {b_{u,1} ,b_{u,2} , \ldots ,b_{{u,n_{u} }} } \right)\) represents the single topic label of each famous user followed by user u |
\(\varvec{\lambda}_{1}\) | A concentration scalar constructing the global distribution of topics (tags) |
\(\varvec{\lambda}_{2}\) | A concentration scalar constructing the global distribution of famous users |
\(\varvec{\sigma}_{\varvec{k}}\) | The smoothing parameter of topic k in the corpus |
\(\varvec{\tau}_{\varvec{v}}\) | The smoothing parameter of famous user v in the corpus |
\({\varvec{\upzeta}}\) | The prior observation of the topics in the corpus |
\({\varvec{\uppi}}\) | The prior observation of the famous users in the corpus |
\(\varvec{\varepsilon}\) | The Dirichlet smoothing parameter for the single topic labels of famous users |
\(\varvec{\psi}\) | The probability that a famous user is picked to follow through a single topic |
st | The indicator of whether a famous user is picked to follow through a single topic |
\(\varvec{c}_{{\varvec{k},\varvec{u},\varvec{m},\varvec{st}}}\) | The number of associations between a topic \(t_{k}\) and the mth famous user \(v\) followed by a non-famous user \(u\) when the single topic label is st |
5.4 Ranking Tags of Non-famous Users
Symbol | Meaning |
---|---|
\(I_{u}^{\left( x \right)}\) | Non-famous user u’s topic distribution after the xth iteration in the process of Random Walk |
FN(u) | The set of all the non-famous users followed by u |
\(FD\left( f \right)\) | The set of non-famous users who follow user f |
\(\rho\) | The decay factor in the process of interests spreading |
\(p_{fuk}\) | The weight of influence of topic \(t_{k}\) spreading from user f to user u |
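To make the roles of \(I_{u}^{(x)}\), \(FN(u)\), \(\rho\), and \(p_{fuk}\) concrete, the following Python sketch runs a generic decayed propagation over the follow graph. The update rule itself is an assumption, since only the symbols are defined above: each iteration mixes a user's initial distribution (weight \(1 - \rho\)) with the weighted distributions of the users in \(FN(u)\) (weight \(\rho\)) and then renormalizes. The function name `spread_interests` and the uniform weight function are hypothetical.

```python
def spread_interests(initial, followed, weight, rho, iterations):
    """Propagate topic scores along follow edges (assumed update rule).

    initial: {user: {topic: score}}, the starting distributions I_u^(0).
    followed: {user: [users that this user follows]}, playing the role of FN(u).
    weight: function (f, u, k) -> p_fuk, the influence of f on u for topic k.
    rho: decay factor in [0, 1]; iterations: number of propagation rounds.
    """
    topics = {k for dist in initial.values() for k in dist}
    current = {u: dict(dist) for u, dist in initial.items()}
    for _ in range(iterations):
        updated = {}
        for u, base in initial.items():
            scores = {}
            for k in topics:
                incoming = sum(
                    weight(f, u, k) * current[f].get(k, 0.0)
                    for f in followed.get(u, [])
                    if f in current
                )
                scores[k] = (1 - rho) * base.get(k, 0.0) + rho * incoming
            total = sum(scores.values())
            # Renormalize so each user keeps a proper distribution.
            updated[u] = (
                {k: s / total for k, s in scores.items()} if total > 0 else dict(base)
            )
        current = updated
    return current


# Toy graph (hypothetical): user "a" follows user "b".
initial = {
    "a": {"tech": 0.5, "food": 0.5},
    "b": {"tech": 1.0, "food": 0.0},
}
followed = {"a": ["b"]}


def uniform_weight(f, u, k):
    # Assumed weighting: every followed user has equal influence.
    return 1.0 / len(followed[u])


result = spread_interests(initial, followed, uniform_weight, rho=0.5, iterations=1)
print(result)
```

In this toy graph, user "a" shifts toward "tech" after one round because the only user "a" follows is concentrated on that topic, while user "b", who follows nobody, keeps the original distribution.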
6 Experiments
6.1 Dataset
Items | Value |
---|---|
# of non-famous users | 26,478 |
# of famous users | 14,147 |
# of follow relationships | 2,771,580 |
# of tweet vocabulary | 23,385 |
# of tags | 159 |
6.2 Evaluated Approaches
- Labeled LDA-Text-Follow This baseline is the same as Labeled LDA-Text, except that it models both the generative process of user’s tweets and followings at the same time.
- Labeled LDA-Follow This baseline is similar to Labeled LDA-Text, except that it models the generative process of user’s followings instead of user’s tweets, and labels of the top ranked topics instead of tweet words are recommended to users as their final tags.
- Tag-LDA This baseline was proposed to model the generative process of the words and tags of a labeled document at the same time [15]. Due to the large amount of noise in tweets, we model the generative process of hashtags in tweets and famous users’ tags at the same time. We finally recommend the hashtags and famous users’ tags to users. Unlike Labeled LDA, it places no restriction on the topics a document can have.
- Tag-LDA-Follow This baseline is the same as Tag-LDA, except that we replace famous users’ tags with users’ followings and finally recommend hashtags in tweets to users.
6.3 Comparison of Bi-Labeled LDA1 with Bi-Labeled LDA2
6.4 Comparison of Bi-Labeled Walk with Bi-Labeled LDA2
6.5 Comparing with Existing Methods
6.5.1 Comparison of Users with Different Levels of Activeness
6.6 Case Study
Users with their bios | List-Based | Labeled LDA-Text | Tag-LDA-Follow | Labeled LDA-Follow | Bi-Labeled-LDA-RandomWalk |
---|---|---|---|---|---|
wisesumo @CodeSling founder, husband, father of two girls, iOS developer & Jesus follower. My Mantra: Keep Moving Forward | news, media, tech, business, social media, marketing, web, world, stuff, influencer | nba, sport, basketball, nfl, book, tech, online, developer, social media, woman | team, football, code, season, web, app, open, interesting, health, nfl | startup, content, nfl, internet, site, film, star, resource, writer, pro | startup, development, social media, tech, guru, speaker, service, blogger, government, content |
keepsloanweird: Geek, father (2 boys), #EagleScout, #Cubmaster (Pack 289, Circle 10), private pilot, former #PFE, and overall #MSFT junkie turned #InfoSecengineer/admin. #HYDR | news, celeb, media, stuff, tech, geek, entertainment, business, peep, web | news, movie, media, organization, show, tech, resource, industry, developer, sport | bed, hours, run, phone, kids, weekend, car, early, apple, office | video, science, comedian, space, deal, youtuber, musician, game, comedy, journalist | geek, tv, peep, space, science, game, film, video, hollywood, shopping |
leftonred Native New Yorker fixes Computers, Photographs stuff, Drinks Sake, Beer, Wine, Whisky, Spirits & Tea, Creates Origami, Plays Backgammon & Coaches Ping Pong. | news, media, stuff, celeb, entertainment, business, food, art, culture, blogger | event, entertainer, fan, science, pr, food, club, app, comedian, wine | nyc, york, city, park, brooklyn, street, ave, mayor, jersey, train | beer, fashion, wine, influencer, nyc, foody, author, movie, musician, rock | beer, nyc, wine, movie, geek, fm, art, social media, food, peep |
iheni Sino-hippie, Chinese foodie and kickboxer working on accessible UX, mobile and multimedia currently for the Paciello Group, formally BBC. Tweets are my own etc. | development, media, news, tech, stuff, web, celeb, art, design, peep | tech, social media, web design, personality, book, phone, blog, geek, interesting, people, write | web design, mobile, stuff, app, code, bitpage, support, open, phone | resource, uk, love, science, news, world, developer, tv, inspiration, influencer | development, uk, actor, web, tech, radio, startup, education, web, design, science |
steveklein Professor Emeritus/Journalism at George Mason University. Also teach at the University of Mary Washington. I love the Red Wings. I ride Trek. I play TaylorMade. | news, world, media, business, celeb, journalist, sport, stuff, politics, startup | cycling, athlete, tech, nyc, art, uk, agency, health, nfl, basketball | run, bike, running, miles, ride, race, team, marathon, ran, training | cycling, player, film, government, education, event, personality, interest, life, culture | journalist, cycling, journalism, sport, startup, personality, player, brand, education, nfl |
Famous users with their bios | The top 10 topics |
---|---|
Rainn Wilson I am an actor and a writer and I co-created SoulPancake and my son, Walter | Humor, tv, star, fm, hollywood, culture, movie, film, music, peep |
Library of Congress We are the largest library in the world, with millions of books, recordings, photographs, maps and manuscripts in our collections | Book, organization, education, government, stuff, news, media, world, info, tech |
Dave McClure Geeks. Entrepreneurs. Startups. The Internet Revolution, Act II | Startup, peep, speaker, news, influencer, web, tech, industry, guru, finance |
Danah Boyd Internet scholar, social media researcher, youth advocate \| Microsoft Research, Harvard Berkman Center | Education, speaker, influencer, blogger, guru, pr, tech, social media, culture, marketing |
Felicia Day Actress, New Media Geek, Gamer, Misanthrope. I like to keep my Tweets real and not waste people’s time | Folk, family, game, youtuber, video, tv, film, entertainment, peep, media |