1 Introduction
Twitter1 has more than 100 million registered users (Yarow 2011). Millions of users post “tweets” that reveal what they are doing, what is on their mind, or what is currently important to them. According to Evans (2011), the number of tweets per day surpassed 50 million in early 2010. Twitter thus represents a rich data source for text-based information extraction (IE) and information retrieval (IR), going beyond the metadata (e.g., the artist-album-track hierarchy) traditionally offered by services such as last.fm and Netflix, cf. Celma (2008), Koren (2009). The second important question addressed in this work is how to model similarities between the entities of interest. There is a large number of possibilities to construct term vectors from texts/microblogs related to the named entities under consideration (with regard to term selection, term weighting, or normalization, for example). These algorithmic choices, together with the actual similarity measure employed, have a great impact on the accuracy of the similarity estimates between the music or movie entities. The objective of this work is hence to identify well-performing combinations of these choices and to derive general rules for modeling similarities between named entities from microblogs. Performance is measured by an evaluation approach resembling Sanderson and Zobel (2005). More precisely, Mean Average Precision (MAP) scores are computed on genre labels predicted by a k-Nearest Neighbor (kNN) classifier. To reduce the computational complexity of evaluating the otherwise enormous set of algorithmic combinations, results are first computed on a smaller set, and only combinations not statistically significantly different from the top-performing combination are assessed on the larger data sets. In particular, the following algorithmic dimensions are investigated:
- query scheme
- index term set
- term frequency
- inverse document frequency
- normalization with respect to document length
- similarity function
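As a sketch of the scale of this evaluation, the combination grid over these six dimensions can be enumerated as follows. The variant abbreviations follow the notation introduced later in Sect. 3; the enumeration itself and the helper name `all_variants` are illustrative, not part of the paper's evaluation code.

```python
from itertools import product

# Hypothetical enumeration of the evaluated configuration space;
# variant names follow the paper's notation (Sect. 3).
QUERY_SCHEMES = ["QS_A", "QS_M"]
TERM_SETS = ["TS_A", "TS_S", "TS_N", "TS_D", "TS_L", "TS_F"]
NORMS = ["NORM_NO", "NORM_SUM", "NORM_MAX"]
TFS = ["TF_A", "TF_B", "TF_C", "TF_C2", "TF_C3", "TF_D", "TF_E", "TF_F", "TF_G"]
IDFS = ["IDF_A", "IDF_B", "IDF_B2", "IDF_C", "IDF_D", "IDF_E",
        "IDF_F", "IDF_G", "IDF_H", "IDF_I", "IDF_J"]
SIMS = ["SIM_INN", "SIM_COS", "SIM_DIC", "SIM_JAC", "SIM_OVL", "SIM_EUC", "SIM_JEF"]

def all_variants():
    """Yield every combination as a dotted variant name,
    e.g. 'QS_A.TS_A.NORM_NO.TF_C2.IDF_E.SIM_JAC'."""
    for combo in product(QUERY_SCHEMES, TERM_SETS, NORMS, TFS, IDFS, SIMS):
        yield ".".join(combo)

variants = list(all_variants())  # full grid, before removing redundant combinations
```

The full grid has 24,948 cells; the paper reports 23,100 experiments for the small music set after excluding redundant combinations.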
2 Related work
2.1 Text-based similarity measures
Text-based similarity measures have, for instance, been derived from user reviews on Epinions.com.2 In Schedl (2010) the author extracts user posts associated with music artists from the microblogging service Twitter3 and models term profiles using term lists specific to the music domain.
2.2 Microblog mining
Comparing microblog search with Web search, Twitter queries are shorter and more popular than Bing4 queries on average. Furthermore, microblogs are more often searched for people, opinions, and breaking news. In terms of query formulation, reissuing the same query is observed more frequently in microblog search; in Web search, by contrast, modifying and extending a query is very popular.
Java et al. analyzed the network of Twitter users. The authors report that Twitter is most popular in North America, Europe, and Asia (Japan), and that a shared language is an important factor for cross-connections (“followers” and “friends”) across continents. Employing the HITS algorithm (Kleinberg 1999) on the network of “friend” relations, Java et al. further derived user intentions from structural properties. They identified the following categories: information sharing, information seeking, and friendship-wise relationships. Analyzing the content of Twitter posts, the authors distilled the following intentions: daily chatter, conversations, sharing information/URLs, and reporting news.
Kwak et al. analyzed the Twitter network. The authors report a low level of reciprocity: only 22% of the connections between users are bidirectional. The average path length was found to be only four, which is surprisingly small for a network the size of the Twittersphere, especially considering the directional network structure. Moreover, a moderate level of homophily, i.e., a higher likelihood for connections between similar people than between dissimilar people, was discovered when measuring similarity in terms of geographic location and user popularity. In addition, Kwak et al.’s study indicates that information diffusion after the first retweet is very fast.
Cheng et al. (2010) estimate the location of Twitter users based on spatial cues (“local” words) extracted from their tweets’ content. To this end, in a first step, several classifiers are trained to identify words with a strong geospatial meaning. To deal with the sparsity in the distribution of these cues, different smoothing approaches are applied subsequently, e.g., taking into account neighboring cities when constructing the term representation of a city. In an experiment conducted on a set of tweets posted within the USA, Cheng et al.’s approach placed more than half of the users within a 100-mile radius of their correct location.
Other research addresses the ranking of Twitter posts, investigating content-based features extracted from the tweets, an authority score computed for each user, and Twitter-specific statistical features, such as the number of retweets or the number of users that replied to a message containing a tiny URL. These features are shown to improve both recency ranking and relevance ranking in real-time Web search. Another work that aims at improving ranking is Duan et al. (2010), who propose a novel ranking strategy for tweet retrieval. To this end, they investigate different feature sets, including content-based features, Twitter-specific features, and authority scores of users (followers, retweeters, mentioners). Using a learning-to-rank algorithm, the authors found that the best-performing features are authority scores, the length of a tweet, and whether the tweet contains a URL.
3 Modeling the microblog term vector space
The notation used in the following is summarized below.
Symbol | Description
---|---
\({{\mathcal D}}\) | Set of documents
\(N\) | Number of documents
\(f_{d,t}\) | Number of occurrences of term t in document d
\(f_t\) | Number of documents containing term t
\(F_t\) | Total number of occurrences of t in the collection
\({{\mathcal T}_d}\) | Set of distinct terms in document d
\(f_d^m\) | Largest \(f_{d,t}\) of all terms t in d
\(f^m\) | Largest \(f_t\) in the collection
\(r_{d,t}\) | Term frequency (cf. Table 5)
\(w_t\) | Inverse document frequency (cf. Table 6)
\(W_d\) | Document length of d
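To make this notation concrete, the corpus statistics above can be computed from a toy corpus as follows (a minimal sketch; the documents and variable names are illustrative, with documents standing in for the virtual documents of aggregated tweets per named entity):

```python
from collections import Counter

# Toy corpus: each entry stands for a "virtual document" per named entity.
docs = {
    "artist_a": "indie rock guitar rock".split(),
    "artist_b": "jazz saxophone".split(),
}

N = len(docs)                                             # number of documents
f_dt = {d: Counter(terms) for d, terms in docs.items()}   # f_{d,t}
f_t = Counter(t for terms in docs.values() for t in set(terms))  # document frequency f_t
F_t = Counter(t for terms in docs.values() for t in terms)       # collection frequency F_t
W_d = {d: len(terms) for d, terms in docs.items()}        # document length (one simple choice)
f_d_m = {d: max(c.values()) for d, c in f_dt.items()}     # largest f_{d,t} within d
f_m = max(f_t.values())                                   # largest f_t in the collection
```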
3.1 Query scheme
We add domain-specific key terms when querying Twitter, as previous work on Web-based IE (Schedl et al. 2005; Whitman and Lawrence 2002) has shown that adding such terms to a search request generally improves the quality of feature vectors in terms of similarity-based classification accuracy. In Web-based music information research, for example, common additional key words are “music review” or “music genre style”. Taking into account the 140-character limitation of tweets, we decided to include only “music” as an additional query term (QS_M) for the music data sets, or to query without any additional key terms, i.e., using only the artist name (QS_A). For the movie data set, the setting QS_M refers to including the term “movie” in the query. Table 2 summarizes the two query schemes investigated.
Abbr. | Query scheme
---|---
QS_A | “artist name” / “movie name”
QS_M | “artist name”+music / “movie name”+movie
3.2 Index term set
Term set TS_A contains all terms found in the corpus (after casefolding, stopping, and stemming). Set TS_S is the entire term dictionary of SCOWL,5 an aggregation of several spell-checker dictionaries for various English languages and dialects. Set TS_N encompasses all artist names present in the music data set; previous work has shown that the corresponding co-occurrence approach to music artist similarity estimation yields remarkable results (Schedl and Knees 2008; Schedl et al. 2005). Term set TS_D is a manually created dictionary of music-related terms that resembles the one used in Pampalk et al. (2005). It contains, for example, descriptors of genres, instruments, geographic locations, epochs, moods, and musicological terms. Set TS_L represents the 250 most popular tags utilized by users of last.fm. Set TS_F comprises the aggregated data sets for the data types musical genre, musical instrument, and emotion, extracted from Freebase.6
Term sets used to index the music-related Twitter posts:
Abbr./term set | Cardinality | Description
---|---|---
TS_A/all_terms | C224a, QS_A: 38,133; C224a, QS_M: 19,133; C3ka, QS_A: 1,489,459; C3ka, QS_M: 437,014 | All terms (stemmed) that occur in the corpus of the retrieved Twitter posts
TS_S/scowl_dict | 698,812 | All terms that occur in the entire SCOWL dictionary
TS_N/artist_names | 224/3,000 | Names of the artists for which data was retrieved
TS_D/dictionary | 1,398 | Manually created dictionary of musically relevant terms
TS_L/last.fm_toptags | 250 | Overall top-ranked tags returned by last.fm’s Tags.getTopTags function
TS_F/freebase | 3,628 | Music-related terms extracted from Freebase (genres, instruments, emotions)
Term sets used to index the movie-related Twitter posts:
Abbr./term set | Cardinality | Description
---|---|---
TS_A/all_terms | QS_A: 1,843,286/54,378; QS_M: 754,067/29,532 | All terms (stemmed) that occur in the corpus of the retrieved Twitter posts
TS_S/scowl_dict | QS_A: 698,812/28,355; QS_M: 698,812/12,473 | All terms that occur in the entire SCOWL dictionary
TS_D/dictionary | QS_A: 25,527/4,877; QS_M: 25,527/3,569 | Dictionary of filtered IMDb key words
Term sets TS_A and TS_S conceptually equal the corresponding sets used to index music-related tweets. Term set TS_D, in contrast, is a dictionary of movie-related terms, which we extracted from the “key words” provided by IMDb. Since this key word set is considerably noisy, we performed frequency-based filtering, retaining only terms that were assigned to at least 10 but not more than 100 different movies. The former constraint effectively removes noise; the latter discards terms that are unlikely to discriminate well between different categories of movies.
Indexing is performed with Lucene,7 which we extended to represent Twitter posts. The extensions will be made available through the CoMIRVA framework8 (Schedl et al. 2007). When creating the indexes for the different term sets, we commonly employ casefolding and stopping, e.g., Baeza-Yates and Ribeiro-Neto (2011). Stemming, in contrast, is only performed for the term sets for which it seems reasonable, i.e., for TS_A and TS_S.
3.3 TF and IDF: term weighting
Tables 5 and 6 list the term frequency (TF) and inverse document frequency (IDF) formulations investigated.
Abbr. | Description | Formulation
---|---|---
TF_A | Formulation used for binary match, SB = b | \({ r_{d,t} = \left\{ \begin{array}{ll} 1 & \hbox{if }\,\, t\, \in {\mathcal T}_d \\ 0 & \hbox{otherwise} \\ \end{array} \right.}\)
TF_B | Standard formulation, SB = t | \(r_{d,t} = f_{d,t}\)
TF_C | Logarithmic formulation | \(r_{d,t} = 1 + \log_e f_{d,t}\)
TF_C2 | Alternative logarithmic formulation suited for \(f_{d,t} < 1\) | \(r_{d,t} = \log_e (1 + f_{d,t})\)
TF_C3 | Alternative logarithmic formulation as used in the ltc variant | \(r_{d,t} = 1 + \log_2 f_{d,t}\)
TF_D | Normalized formulation | \(r_{d,t} = \frac{f_{d,t}}{f_{d}^{m}}\)
TF_E | | \(r_{d,t} = K + (1-K)\cdot \frac{f_{d,t}}{f_{d}^{m}}\)
TF_F | | \(r_{d,t} = \frac{f_{d,t}}{f_{d,t}+W_d / av_{d\in D}(W_d)}\)
TF_G | Okapi BM25 formulation, according to Robertson et al. (1999), with \(k_1 = 1.2\), \(b = 0.75\) | \(r_{d,t} = \frac{(k_1+1) \cdot f_{d,t}}{f_{d,t} + k_1 \cdot\left[ (1-b) + b \cdot \frac{W_d}{av_{d\in D}(W_d)} \right]}\)
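Several of the TF formulations translate directly into code. The sketch below (illustrative helper names) implements TF_B, TF_C2, TF_D, and the BM25 variant TF_G with the stated parameters k1 = 1.2 and b = 0.75:

```python
import math

def tf_b(f_dt):
    """TF_B: raw term frequency."""
    return f_dt

def tf_c2(f_dt):
    """TF_C2: log_e(1 + f_{d,t}), also defined for f_{d,t} < 1."""
    return math.log(1.0 + f_dt)

def tf_d(f_dt, f_d_max):
    """TF_D: term frequency normalized by the largest TF in the document."""
    return f_dt / f_d_max

def tf_g(f_dt, W_d, avg_W, k1=1.2, b=0.75):
    """TF_G: Okapi BM25 TF component; W_d is the document length,
    avg_W the average document length over the collection."""
    return ((k1 + 1.0) * f_dt) / (f_dt + k1 * ((1.0 - b) + b * W_d / avg_W))
```

For an average-length document (W_d equal to avg_W) and a term occurring once, TF_G evaluates to exactly 1, which is a convenient sanity check for the parameter defaults.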
Abbr. | Description | Formulation
---|---|---
IDF_A | Formulation used for binary match, SB = x | \(w_t = 1\)
IDF_B | Logarithmic formulation, SB = f | \(w_t = \log_e\left( 1 + \frac{N}{f_t} \right)\)
IDF_B2 | Logarithmic formulation used in the ltc variant | \(w_t = \log_e\left(\frac{N}{f_t} \right)\)
IDF_C | Hyperbolic formulation | \(w_t = \frac{1}{f_t}\)
IDF_D | Normalized formulation | \(w_t = \log_e\left( 1 + \frac{f^m}{f_t} \right)\)
IDF_E | Another normalized formulation, SB = p | \(w_t = \log_e \frac{N - f_t}{f_t}\)
The following formulations are based on the term’s noise \(n_t\) and signal \(s_t\), defined as \(n_t = \sum_{d \in \mathcal D_t} \left( - \frac{f_{d,t}}{F_t} \log_2 \frac{f_{d,t}}{F_t} \right)\) and \(s_t = \log_2 (F_t) - n_t\).
Abbr. | Description | Formulation
---|---|---
IDF_F | Signal | \(w_t = s_t\)
IDF_G | Signal-to-noise ratio | \(w_t = \frac{s_t}{n_t}\)
IDF_H | | \(w_t = \left( \max_{t^{\prime} \in \mathcal T} n_{t^{\prime}} \right) - n_t\)
IDF_I | Entropy measure | \(w_t = 1 - \frac{n_t}{\log_2 N}\)
IDF_J | | \(w_t = \log \frac{N-f_t+0.5}{f_t+0.5}\)
Abbr. | Description | Formulation
---|---|---
NORM_NO | No normalization | |
NORM_SUM | Normalize the sum of each virtual document’s term feature vector to 1 | \(\sum_{t \in \mathcal T_{d}} r_{d,t} = 1\)
NORM_MAX | Normalize the maximum of each virtual document’s term feature vector to 1 | \(\max_{t \in \mathcal T_{d}} r_{d,t} = 1\)
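Both normalizations can be sketched for sparse term vectors represented as dicts (illustrative helper names; zero vectors are passed through unchanged):

```python
def norm_sum(vec):
    """NORM_SUM: scale so the components sum to 1."""
    s = sum(vec.values())
    return {t: v / s for t, v in vec.items()} if s else dict(vec)

def norm_max(vec):
    """NORM_MAX: scale so the largest component is 1."""
    m = max(vec.values(), default=0)
    return {t: v / m for t, v in vec.items()} if m else dict(vec)
```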
Abbr. | Description | Formulation
---|---|---
SIM_INN | Inner product | \(S_{d_1,d_2} = \sum_{t \in \mathcal T_{d_1,d_2}} \left( w_{d_1,t} \cdot w_{d_2,t} \right)\)
SIM_COS | Cosine measure | \(S_{d_1,d_2} = \frac{\sum_{t \in \mathcal T_{d_1,d_2}} \left( w_{d_1,t} \cdot w_{d_2,t} \right)}{W_{d_1} \cdot W_{d_2}}\)
SIM_DIC | Dice formulation | \(S_{d_1,d_2} = \frac{2\sum_{t \in \mathcal T_{d_1,d_2}} \left( w_{d_1,t} \cdot w_{d_2,t} \right)}{W_{d_1}^2 + W_{d_2}^2}\)
SIM_JAC | Jaccard formulation | \(S_{d_1,d_2} = \frac{\sum_{t \in \mathcal T_{d_1,d_2}} \left( w_{d_1,t} \cdot w_{d_2,t} \right)}{W_{d_1}^2 + W_{d_2}^2 - \sum_{t \in \mathcal T_{d_1,d_2}} \left( w_{d_1,t} \cdot w_{d_2,t} \right)}\)
SIM_OVL | Overlap formulation | \(S_{d_1,d_2} = \frac{\sum_{t \in \mathcal T_{d_1,d_2}} \left( w_{d_1,t} \cdot w_{d_2,t} \right)}{\min ( W_{d_1}^2, W_{d_2}^2 )}\)
SIM_EUC | Euclidean similarity | \(D_{d_1,d_2} = \sqrt{\sum_{t \in \mathcal T_{d_1,d_2}} \left( w_{d_1,t} - w_{d_2,t} \right)^2}\), \(S_{d_1,d_2} = \left( \max_{d_1^{\prime},d_2^{\prime}} D_{d_1^{\prime},d_2^{\prime}} \right) - D_{d_1,d_2}\)
SIM_JEF | Jeffrey divergence-based similarity | \(D(F,G) = \sum_i \left( f_i \log \frac{f_i}{m_i} + g_i \log \frac{g_i}{m_i} \right)\) with \(m_i = \frac{f_i+g_i}{2}\), \(S_{d_1,d_2} = \left( \max_{d_1^{\prime},d_2^{\prime}} D_{d_1^{\prime},d_2^{\prime}} \right) - D_{d_1,d_2}\)
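For sparse term-weight vectors, the inner product, cosine, and Jaccard formulations can be sketched as follows. This is a minimal sketch in which W_d is taken to be the Euclidean norm of the weight vector; the helper names are illustrative:

```python
import math

def inner(v1, v2):
    """SIM_INN: inner product over the shared terms."""
    common = set(v1) & set(v2)
    return sum(v1[t] * v2[t] for t in common)

def cosine(v1, v2):
    """SIM_COS, with W_d taken as the Euclidean norm of each vector."""
    w1 = math.sqrt(sum(x * x for x in v1.values()))
    w2 = math.sqrt(sum(x * x for x in v2.values()))
    return inner(v1, v2) / (w1 * w2) if w1 and w2 else 0.0

def jaccard(v1, v2):
    """SIM_JAC in its extended (Tanimoto) form for weighted vectors."""
    num = inner(v1, v2)
    den = sum(x * x for x in v1.values()) + sum(x * x for x in v2.values()) - num
    return num / den if den else 0.0
```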
The BM25 formulation stems from the Okapi framework for text-based probabilistic retrieval, cf. Robertson et al. (1995, 1999). This model assumes a priori knowledge of topics from which different queries are derived. Moreover, based on information about which documents are relevant for a specific topic and which are not, the term weighting function can be tuned to the corpus under consideration. Since BM25 is a well-established term ranking method, we included it in the experiments. It has to be noted, however, that we cannot assume categorical a priori knowledge here, neither on the level of single tweets nor on the level of named entities. On the level of tweets, manually classifying hundreds of thousands of posts would be too labor-intensive. On the named entity level, we could group the entities (or, more precisely, the corresponding tweets) according to a genre taxonomy and optimize BM25 correspondingly. However, we believe this is not justifiable for two reasons: First, for arbitrary media repositories, we cannot assume access to genre information. Second, using genre information would bias the results of the genre classification experiments, as the other term weighting measures do not incorporate such a priori knowledge; BM25 would thus be unjustifiably favored. For our experiments, we therefore use the simpler BM25 formulation proposed in Robertson et al. (1999), cf. variants TF_G and IDF_J in Tables 5 and 6, respectively.
3.4 Virtual documents and normalization
The tweets retrieved for a named entity are aggregated into a single virtual document; the normalization variants in Table 7 address the resulting differences in document length, notwithstanding the limitation of individual Twitter posts to 140 characters.
3.5 Similarity function
In addition to the classical vector space similarity measures listed in Table 8, Euclidean similarity (SIM_EUC) and a similarity inferred from the Jeffrey divergence (SIM_JEF) (Lin 1991) were included.
3.6 Notation
In the following, experimental variants are denoted by the scheme <Query Scheme>.<Index Term Set>.<Normalization>.<TF>.<IDF>.<Similarity Measure>. For instance, TF_C.IDF_B.SIM_COS refers to all experiments with term frequency formulation TF_C, inverse document frequency formulation IDF_B, and the cosine similarity function, irrespective of query scheme, index term set, and document normalization.
4 Evaluation
4.1 Data sets
4.1.1 Music artists
The first artist set, denoted C224a, consists of 224 well-known artists and has a uniform genre distribution (14 genres,9 16 artists each). It has been frequently used to evaluate Web-/text-based music information retrieval approaches.10
To build a larger collection, we used last.fm’s API11 to extract the most popular artists for each country of the world, which we then aggregated into a single list of 201,135 unique artist names. Since last.fm’s data is prone to misspellings and other mistakes due to its collaborative, user-generated knowledge base, we cleaned the data set by matching each artist name against the database of the expert-based music information system allmusic.com,12 from which we also extracted genre information. Starting this matching process from the most popular artist found by last.fm and including only artist names that also occur in allmusic.com, we eventually obtained a list of 3,000 music artists. This artist set, denoted C3ka in the following, is publicly available.13 According to allmusic.com, the artists are categorized into 18 distinct genres. The distribution of the genres in C3ka is shown in Table 9. Please note that the editors of allmusic.com use the genre “Rock” to denote a wide range of music; basically everything from Pop to Dark Metal is classified as “Rock”. The genre distribution is therefore considerably unbalanced.
Genre distribution of C3ka:
Genre | Artists | Share (%)
---|---|---
Avantgarde | 8 | 0.267 |
Blues | 11 | 0.367 |
Celtic | 5 | 0.167 |
Classical | 42 | 1.400 |
Country | 24 | 0.800 |
Easy listening | 6 | 0.200 |
Electronica | 149 | 4.967 |
Folk | 24 | 0.800 |
Gospel | 23 | 0.767 |
Jazz | 106 | 3.533 |
Latin | 91 | 3.033 |
Newage | 18 | 0.600 |
Rap | 203 | 6.767 |
Reggae | 29 | 0.967 |
RnB | 101 | 3.367 |
Rock | 2,031 | 67.700 |
Vocal | 30 | 1.000 |
World | 99 | 3.300 |
4.1.2 Movies
The movie data set was compiled from IMDb (Jass 2003). For each of 25 movie genres, we gathered the 50 top-ranked movies. We further added the overall 50 top-ranked movies of each decade, from the 1910s to the 2010s, which adds a further 11 categories. Please note that some movies occur in the top-ranked list for more than one genre, hence the total of 1,008 distinct movie titles. The movie data set will be referred to as C1km in the following, and the movie names are available for download.14
4.2 Acquiring tweets
We used Twitter’s API15 to issue queries according to the schemes indicated in Table 2. Accounting for the time-varying behavior of the search results, and to obtain broad coverage, we queried Twitter from December 2010 to February 2011 and aggregated the posts retrieved over time for each query. The resulting set of tweets per query/named entity is then pre-processed by casefolding and stopping. When using the term sets TS_A and TS_S, stemming is employed additionally.
For set C224a, we achieved a coverage of 100%; for set C3ka, 96.87%, i.e., tweets were available for 2,906 of the 3,000 artists. Coverage for the movie data set C1km was considerably lower (82.8%, or 834 movies), likely because IMDb always lists the full, official movie title, which is often replaced by a shortened version when referring to the movie in a microblog, e.g., “The Fog of War: Eleven Lessons from the Life of Robert S. McNamara”.
In total, for set C224a, 21,336 tweets were gathered for QS_A and 10,867 for QS_M. For set C3ka, 3,161,582 tweets were retrieved for QS_A and 2,972,130 for QS_M. For the movie set C1km, we retrieved 11,684,074 tweets using query scheme QS_A and 4,958,223 tweets using QS_M.
4.3 Experimental setup
Genre information from allmusic.com and IMDb serves as ground truth for the music sets and the movie set, respectively. Although genre taxonomies are often inconsistent and erroneous (Pachet and Cazaly 2000), it has become commonplace to use genre as a proxy for similarity. In principle, a more precise ground truth could be established from human similarity judgments; complete similarity judgments are, however, not publicly available on a large scale, neither for music nor for movies. Hence, we restrict evaluation to the retrieval task of determining the k artists/movies most similar to a given query artist/movie. This task resembles k-nearest-neighbor (kNN) classification, where the class of a seed item is predicted as the most frequent class among the seed’s k most similar items. For the single-label classification problem given by the music data sets, performing kNN is straightforward. However, when dealing with multiple labels/classes per item, as in the case of the movie set, we opted for a strict decision rule: given a seed item with s associated class labels and a number k of nearest neighbors to consider, we accumulate the occurrences of up to s classes among the k neighbors. We then calculate the (proportionate) precision of the top s classes, given by the accumulated counts, against the seed’s s classes, i.e., each of the top s classes among the k nearest neighbors that matches one of the seed’s s classes accounts for a precision score of 1/s. The algorithm used to compute precision@k for the multi-class experiments is illustrated in Algorithm 1.
Due to the large number of algorithmic combinations, we first conducted all experiments on the small set C224a. In a second set of experiments, we then evaluated the remaining variants on the real-world artist set C3ka. On the movie set C1km, all variants were evaluated.
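Algorithm 1 itself is not reproduced here, but the strict decision rule for the multi-label case can be sketched as follows. The function name and data layout are illustrative, and ties between class counts are resolved arbitrarily in this sketch:

```python
from collections import Counter

def strict_precision_at_k(seed_labels, neighbor_labels):
    """Strict multi-label precision: count class occurrences among the k
    nearest neighbors, take the s most frequent classes (s = number of seed
    labels), and score 1/s for each of them that matches a seed label.
    `neighbor_labels` is a list of label sets, one per nearest neighbor."""
    s = len(seed_labels)
    counts = Counter(label for labels in neighbor_labels for label in labels)
    top_s = [label for label, _ in counts.most_common(s)]
    return sum(1 for label in top_s if label in seed_labels) / s
```

For example, a seed labeled {drama, war} whose three neighbors contribute the counts comedy: 2, drama: 1 receives a precision of 0.5, since only one of the two most frequent neighbor classes matches a seed label.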
Excluding redundant combinations, a total of 23,100 single experiments were conducted for set C224a and 11,627 for set C1km. In the first stage of the experiments, only variants that fulfill at least one of the following two conditions are retained:
- there is a relative MAP difference of 10% or less to the top-ranked variant, or
- the t test does not show a significant difference to the top-ranked variant (at the 5% significance level).
For set C224a, the top 577 variants have a relative MAP difference (from the 1st to the respective rank, taking the respective rank as basis) of less than 10%, and the pairwise t test shows no significant difference for the top-ranked 1,809 variants. For the second stage of experimentation, conducted on collection C3ka, we therefore evaluated only these top-ranked 1,809 variants. For the movie set C1km, the corresponding numbers are 2,392 (relative MAP difference) and 5,629 (t test), respectively (Figs. 1, 2).
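The first retention condition can be sketched as a simple filter over per-variant MAP scores (illustrative names; the significance-test branch of the rule is omitted here):

```python
def first_stage_filter(map_scores, rel_threshold=0.10):
    """Retain variants whose relative MAP difference to the top variant is at
    most `rel_threshold`, taking the variant's own score as the basis,
    i.e. (top - score) / score <= rel_threshold."""
    top = max(map_scores.values())
    return {v for v, m in map_scores.items() if (top - m) / m <= rel_threshold}
```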
4.4 Results and discussion
4.4.1 MAP scores
We first look at the results for set C224a. The MAP scores of the 23,100 evaluated variants span a wide range (cf. Fig. 3), with a mean of μ = 37.89 and a standard deviation of σ = 17.16. From Table 10 it can be seen that the highest MAP scores are achieved only when using QS_A, TS_A, and NORM_NO. At the other end of the ranking, QS_M and SIM_OVL dominate the worst-performing variants.
Top- and bottom-ranked variants for C224a:
MAP | Variant
---|---
64.018 | QS_A.TS_A.NORM_NO.TF_C2.IDF_E.SIM_JAC
63.929 | QS_A.TS_A.NORM_NO.TF_C2.IDF_J.SIM_JAC
63.839 | QS_A.TS_A.NORM_NO.TF_C.IDF_E.SIM_JAC
63.810 | QS_A.TS_A.NORM_NO.TF_C2.IDF_E.SIM_COS
63.780 | QS_A.TS_A.NORM_NO.TF_C.IDF_E.SIM_COS
63.780 | QS_A.TS_A.NORM_NO.TF_C2.IDF_B2.SIM_JAC
63.780 | QS_A.TS_A.NORM_NO.TF_C2.IDF_B2.SIM_DIC
63.720 | QS_A.TS_A.NORM_NO.TF_C2.IDF_E.SIM_DIC
63.601 | QS_A.TS_A.NORM_NO.TF_C2.IDF_J.SIM_COS
63.542 | QS_A.TS_A.NORM_NO.TF_C.IDF_J.SIM_JAC
\(\cdots\) | \(\cdots\)
3.482 | QS_M.TS_A.NORM_MAX.TF_G.IDF_G.SIM_OVL
3.452 | QS_M.TS_S.NORM_SUM.TF_B.IDF_F.SIM_OVL
3.423 | QS_M.TS_A.NORM_SUM.TF_C3.IDF_J.SIM_OVL
3.363 | QS_M.TS_S.NORM_MAX.TF_G.IDF_F.SIM_OVL
3.274 | QS_M.TS_A.NORM_SUM.TF_C.IDF_E.SIM_OVL
3.065 | QS_M.TS_A.NORM_SUM.TF_C.IDF_J.SIM_OVL
3.006 | QS_M.TS_A.NORM_MAX.TF_G.IDF_F.SIM_OVL
2.976 | QS_M.TS_S.NORM_MAX.TF_F.IDF_F.SIM_OVL
2.857 | QS_M.TS_A.NORM_MAX.TF_F.IDF_G.SIM_OVL
2.649 | QS_M.TS_A.NORM_MAX.TF_F.IDF_F.SIM_OVL
Table 11 shows the corresponding results for the movie set C1km (considering up to 50 nearest neighbors). Note that these MAP scores are overall lower than the scores for the music collections, with a mean of μ = 23.12 and a standard deviation of σ = 2.61. This lower overall performance is partly due to the higher number of classes and partly due to the stricter decision rule employed in the classification process, cf. Sect. 4.3. The highest ranks are again dominated by query scheme QS_A and term set TS_A, whereas the lowest-ranking variants are dominated by QS_M.TS_S.NORM_SUM.SIM_JEF.
Top- and bottom-ranked variants for C1km:
MAP | Variant
---|---
27.964 | QS_A.TS_A.NORM_SUM.TF_G.IDF_C.SIM_INN
27.962 | QS_A.TS_A.NORM_NO.TF_A.IDF_C.SIM_DIC
27.962 | QS_A.TS_A.NORM_NO.TF_A.IDF_C.SIM_JAC
27.962 | QS_A.TS_A.NORM_MAX.TF_A.IDF_C.SIM_DIC
27.962 | QS_A.TS_A.NORM_MAX.TF_A.IDF_C.SIM_JAC
27.895 | QS_A.TS_A.NORM_NO.TF_E.IDF_C.SIM_DIC
27.895 | QS_A.TS_A.NORM_NO.TF_E.IDF_C.SIM_JAC
27.895 | QS_A.TS_A.NORM_SUM.TF_E.IDF_C.SIM_DIC
27.895 | QS_A.TS_A.NORM_SUM.TF_E.IDF_C.SIM_JAC
27.895 | QS_A.TS_A.NORM_MAX.TF_E.IDF_C.SIM_DIC
\(\cdots\) | \(\cdots\)
17.101 | QS_M.TS_S.NORM_SUM.TF_C3.IDF_H.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_C3.IDF_I.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_D.IDF_F.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_D.IDF_G.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_E.IDF_F.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_E.IDF_G.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_F.IDF_F.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_F.IDF_G.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_G.IDF_F.SIM_JEF
17.101 | QS_M.TS_S.NORM_SUM.TF_G.IDF_G.SIM_JEF
4.4.2 Distribution of specific algorithmic choices
Figure 4 shows the distribution of the specific algorithmic choices for set C224a. Figure 5 shows this distribution among the 1,809 top-ranked variants. Figure 6 shows the top-ranked algorithmic choices for artist set C3ka, and Fig. 7, finally, shows this distribution for the movie data set C1km.
Querying with the entity name alone (QS_A) outperforms adding domain-specific key words. At first glance, this result contrasts with earlier work on Web-based music artist classification (Knees et al. 2008). However, Knees et al. analyze Web pages, not microblogs. It seems that adding any additional key word prunes Twitter’s result set too strongly.
Regarding the index term set, the full corpus vocabulary (TS_A) performs best, for both the music and the movie data sets. It is noteworthy, however, that the good performance of the general term sets (TS_A and TS_S) comes at the price of much higher computational complexity (cf. Tables 3, 4 for term set cardinalities). Hence, when runtime is crucial, the results suggest using other term sets. At first glance, a particularly good choice for the music domain seems to be TS_N, the list of artist names, as it is the set that occurs most frequently among the top-ranked variants (32.5%, or 588 times). However, TS_N yields very unstable results, as will be shown in the subsequent subsection. Another interesting finding is that the music dictionary TS_D, despite its good performance for similarity-based artist clustering using Web pages, cf. Pampalk et al. (2005), first occurs only at rank 1,112. A possible reason is that Twitter users tend to refrain from using an elaborate music-specific vocabulary, even when they tweet about music-related issues.16 For the movie set C1km, in contrast, TS_D represents a good trade-off between computational complexity and accuracy, as it occurs among the top-ranked variants not significantly less often than set TS_S (both about 28%, vs. 44% for TS_A). It seems that a collaboratively assembled dictionary, such as TS_D for the movie domain, outperforms a domain-specific one assembled by experts, such as TS_D for the music domain, provided it is not too small.
Among the term frequency formulations, TF_A occurred least often among the top-ranked variants, regardless of the data set; this variant should thus be avoided. The most frequently occurring formulations, on the other hand, are TF_C2 (15.69% of the top ranks for the music sets) and TF_E (16.80%), the latter being particularly present in the very top ranks for the music data sets. TF_C2 also occurs frequently among the top-ranked variants of the movie set C1km (13.52%), together with TF_D (14.55%), TF_F (13.82%), and TF_G (13.87%).
The worst-performing IDF formulations are IDF_G (0.50% among C3ka’s top ranks), IDF_F (0.66%), and IDF_A (2.54%). However, we were not able to determine a single variant that clearly outperforms all others. The IDF variants most frequently occurring within the top ranks of the music sets are IDF_B2 (13.93%), IDF_J (13.71%), and IDF_E (13.38%). For the movie set C1km, the very same variants perform best (IDF_E with 11.16% occurrence, IDF_J with 11.09%, and IDF_B2 with 9.95%).
Regarding the similarity function, it is not the case that the cosine measure (SIM_COS), the de facto standard measure in IR, generally outperforms the others. It is likely that the key advantage of SIM_COS, document length normalization, plays a minor role here, because tweets are limited to 140 characters, which Twitter users usually exhaust. Further support for this hypothesis is given by the remarkably good performance of the simple inner product (SIM_INN), which does not perform any length normalization. On all three data sets, SIM_INN occurs almost twice as often as SIM_COS among the top-ranked variants (about 32 vs. 16%). Also among the virtual document normalization methods, using no normalization at all (NORM_NO) outperforms the other variants investigated, accounting for 52.24% of the top ranks for the music sets and for 39.94% of the top variants on set C1km. In addition to SIM_INN, the Jeffrey divergence-based similarity SIM_JEF also performed comparably well over all data sets (31.5% for the music sets, 17.77% for C1km).
To verify that transferring the findings from the small set C224a to the real-world set C3ka is valid, we calculated Spearman’s rank-order correlation coefficient (e.g., Sheskin 2004) on the two rankings obtained with the two artist sets. The computation revealed a moderate correlation of 0.37, indicating that the rankings produced by the same algorithmic choices are not largely influenced by factors such as the size of the artist collection or the number of artists per genre.
4.4.3 Average quality and performance variance
Besides average quality, we also investigate whether a specific algorithmic choice (e.g., TS_F) is less sensitive to changes in the other choices (for example, employing a different normalization). Rank and MAP statistics per algorithmic choice for set C224a:
Variant | Rank median | Rank mean | Rank SD | Rank min | Rank max | MAP median | MAP mean | MAP SD | MAP min | MAP max
---|---|---|---|---|---|---|---|---|---|---
QS_A | 8,929 | 10,021 | 7,247.2 | 1 | 23,080 | 48.066 | 40.380 | 18.016 | 3.869 | 64.018
QS_M | 13,340 | 13,080 | 5,633.8 | 1,112 | 23,100 | 39.732 | 35.399 | 15.866 | 2.649 | 57.083
TS_A | 12,421 | 12,257 | 7,869.8 | 1 | 23,100 | 42.158 | 34.325 | 20.659 | 2.649 | 64.018
TS_D | 10,388 | 10,983 | 5,349.6 | 1,112 | 23,052 | 45.774 | 40.693 | 14.028 | 4.851 | 57.083
TS_F | 8,639 | 9,068 | 4,831.1 | 529 | 19,098 | 48.527 | 45.408 | 10.402 | 12.857 | 58.393
TS_L | 10,343 | 9,949 | 5,561.6 | 318 | 23,050 | 45.863 | 43.012 | 12.610 | 4.970 | 59.702
TS_N | 15,660 | 12,880 | 7,901.3 | 221 | 23,058 | 33.214 | 33.181 | 18.853 | 4.702 | 60.595
TS_S | 14,444 | 14,165 | 6,439.0 | 225 | 23,098 | 35.997 | 30.717 | 18.527 | 2.976 | 60.536
TF_A | 13,152 | 12,310 | 5,325.9 | 144 | 22,979 | 40.223 | 38.829 | 12.361 | 5.982 | 61.875
TF_B | 10,034 | 10,505 | 6,373.3 | 90 | 23,092 | 46.399 | 40.625 | 15.846 | 3.452 | 62.321
TF_C | 17,240 | 14,912 | 6,462.9 | 3 | 23,096 | 26.116 | 28.360 | 18.222 | 3.066 | 63.839
TF_C2 | 9,006 | 9,854 | 6,509.2 | 1 | 23,088 | 47.961 | 41.976 | 15.709 | 3.631 | 64.018
TF_C3 | 17,972 | 15,421 | 6,618.3 | 22 | 23,093 | 19.137 | 26.607 | 18.628 | 3.423 | 63.066
TF_D | 9,871 | 10,371 | 6,290.5 | 89 | 23,053 | 46.682 | 40.974 | 15.697 | 4.821 | 62.321
TF_E | 10,587 | 10,981 | 6,110.3 | 15 | 23,063 | 45.417 | 40.524 | 14.978 | 4.435 | 63.184
TF_F | 9,448 | 10,079 | 6,276.5 | 62 | 23,100 | 47.321 | 41.836 | 15.187 | 2.649 | 62.589
TF_G | 9,274 | 10,028 | 6,303.5 | 25 | 23,097 | 47.589 | 41.899 | 15.288 | 3.006 | 62.976
IDF_A | 12,855 | 12,632 | 5,775.5 | 449 | 23,055 | 41.012 | 36.061 | 15.921 | 4.792 | 58.839
IDF_B | 9,773 | 10,475 | 6,681.4 | 29 | 23,000 | 46.860 | 40.374 | 16.573 | 5.804 | 62.946
IDF_B2 | 8,665 | 9,780 | 6,765.0 | 6 | 23,058 | 48.482 | 41.785 | 16.607 | 4.702 | 63.780
IDF_C | 12,782 | 12,409 | 6,020.9 | 87 | 23,027 | 41.220 | 37.195 | 15.609 | 5.446 | 62.351
IDF_D | 9,790 | 10,554 | 6,700.2 | 13 | 23,015 | 46.845 | 40.208 | 16.601 | 5.595 | 63.363
IDF_E | 8,744 | 9,784 | 6,750.8 | 1 | 23,095 | 48.363 | 41.722 | 16.674 | 3.274 | 64.018
IDF_F | 17,108 | 16,061 | 4,884.2 | 459 | 23,100 | 27.173 | 26.772 | 15.586 | 2.649 | 58.750
IDF_G | 17,009 | 15,439 | 5,334.3 | 258 | 23,099 | 28.021 | 28.219 | 16.568 | 2.857 | 60.179
IDF_H | 9,731 | 10,304 | 6,596.3 | 19 | 22,970 | 46.920 | 40.977 | 16.472 | 6.042 | 63.155
IDF_I | 8,889 | 9,887 | 6,637.9 | 12 | 23,038 | 48.125 | 41.664 | 16.435 | 5.298 | 63.393
IDF_J | 8,673 | 9,731 | 6,750.9 | 2 | 23,096 | 48.482 | 41.807 | 16.660 | 3.066 | 63.929
SIM_COS | 9,281 | 10,275 | 6,463.8 | 4 | 23,060 | 47.559 | 41.047 | 16.073 | 4.583 | 63.809
SIM_DIC | 9,201 | 10,127 | 6,471.4 | 7 | 22,910 | 47.679 | 41.323 | 16.013 | 6.339 | 63.780
SIM_EUC | 18,116 | 17,262 | 4,082.2 | 562 | 23,027 | 18.095 | 23.070 | 14.796 | 5.446 | 58.274
SIM_INN | 10,896 | 11,135 | 6,407.8 | 141 | 23,005 | 44.926 | 39.429 | 16.049 | 5.774 | 61.905
SIM_JAC | 9,194 | 10,132 | 6,470.0 | 1 | 22,972 | 47.708 | 41.318 | 16.017 | 6.042 | 64.018
SIM_JEF | 8,235 | 9,299 | 6,820.6 | 87 | 23,058 | 49.137 | 42.635 | 16.488 | 4.702 | 62.351
SIM_OVL | 13,004 | 12,624 | 6,074.8 | 18 | 23,100 | 40.610 | 36.403 | 16.155 | 2.649 | 63.155
NORM_MAX | 11,851 | 11,781 | 6,415.5 | 27 | 23,100 | 43.452 | 37.418 | 16.919 | 2.649 | 62.976
NORM_NO | 9,482 | 9,811 | 5,908.0 | 1 | 23,054 | 47.292 | 43.276 | 13.289 | 4.821 | 64.018
NORM_SUM | 14,978 | 13,277 | 7,219.1 | 15 | 23,096 | 34.241 | 32.300 | 19.258 | 3.066 | 63.184
- QS_A clearly outperforms QS_M in terms of quality, although the results obtained with QS_M are more robust.
- TS_F outperforms all other term sets, in both quality and robustness. This superiority is even more apparent when MAP scores (Fig. 2) rather than ranks (Fig. 1) are used as the quality measure. Interestingly, the term sets TS_A and TS_N do not perform well overall: the results they produce are spread across a wide range of ranks (or MAP values), and their quality is mediocre. Figure 4b reveals the reason for the huge spread of TS_N: even though TS_N is employed in some of the highest-ranked variants, two large clusters of variants employing TS_N also appear towards the very end of the rank-ordered set of experiments (beyond rank 15,000). Its combination with the algorithmic choices QS_M, TF_B, TF_D, TF_E, IDF_A, IDF_H, SIM_INN, or SIM_OVL proves particularly detrimental. Looking at the quality scores of TS_A, one fact stands out: TS_A performs much better in terms of MAP than in terms of rank score. Hence, although the findings presented in Sect. 4.4.2 suggest that TS_A is well suited to yield top results, this seems to hold only in combination with particular other algorithmic choices. TS_A should therefore be used with caution, and only when computational complexity is not an issue and the other algorithmic choices can be controlled (cf. Table 10).
- As for the term frequency, formulations TF_C and TF_C3 perform poorly and are unstable; we strongly recommend against using them. The binary formulation TF_A is the most stable, but performs worse than all variants except the ones just mentioned. Among the remaining, better-performing variants, TF_C2 stands out as yielding particularly good results in terms of both rank score and MAP. Furthermore, TF_F and TF_G perform as well as TF_C2 in terms of MAP and only slightly worse than the top-performing variant in terms of rank score, while being slightly more robust than TF_C2. As an overall recommendation, one should therefore select one of the term frequency formulations TF_C2, TF_F, or TF_G: TF_C2 if top performance is crucial, TF_F or TF_G if stability of the results is more important.
- Variants IDF_A, IDF_C, IDF_F, and IDF_G perform significantly worse than the other formulations of inverse document frequency. As for top-performing choices, IDF_E ranks at the very top according to both MAP and rank scores; IDF_B2 and IDF_J are not significantly inferior.
- Among the similarity functions, SIM_EUC performs markedly worse than all other variants, and SIM_OVL does not perform considerably better. The best results are achieved with SIM_JEF, which at the same time maintains a reasonable level of stability.
- NORM_NO performs best in terms of both quality and robustness, whereas NORM_SUM performs worst in both regards.
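Taken together, these findings favor combinations such as logarithmic term weighting with Jeffrey divergence-based similarity (SIM_JEF) and no length normalization (NORM_NO). The sketch below illustrates such a pipeline; the concrete TF and IDF formulas, the smoothing constant, and the conversion of the divergence into a similarity score are illustrative assumptions in the spirit of the TF_C2/IDF_E/SIM_JEF family, not the exact definitions used in the experiments.

```python
import math

def tf_idf_vector(doc_tokens, df, n_docs, vocab):
    # Logarithmic TF and IDF (one plausible formulation; the paper's
    # exact TF_C2/IDF_E definitions are assumed, not reproduced here).
    vec = []
    for t in vocab:
        tf = math.log1p(doc_tokens.count(t))
        idf = math.log(n_docs / (1 + df.get(t, 0)))
        vec.append(tf * idf)
    return vec

def jeffrey_similarity(u, v, eps=1e-9):
    # Jeffrey (symmetric Kullback-Leibler) divergence between the two
    # vectors, treated as smoothed distributions; mapping the divergence
    # to a similarity via 1 / (1 + divergence) is also an assumption.
    su, sv = sum(u) + eps, sum(v) + eps
    p = [max(x, 0.0) / su + eps for x in u]
    q = [max(x, 0.0) / sv + eps for x in v]
    divergence = sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
    return 1.0 / (1.0 + divergence)
```

An entity whose virtual document matches the query entity's exactly yields similarity 1.0, while largely disjoint vocabularies push the score towards 0.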
4.4.4 Comparison with Web-page-based experiments
The top-ranked variants on the corpus of Web pages are:
- TF_C3.IDF_I.SIM_COS
- TF_C3.IDF_H.SIM_COS
- TF_C2.IDF_I.SIM_COS
- TF_C2.IDF_H.SIM_COS
IDF_I, IDF_H, and IDF_B2 perform best for Web pages, while IDF_B2, IDF_E, and IDF_J yield the highest MAP scores for microblogs. Thus, logarithmic formulations again considerably outperform the other variants for both data sources. Regarding the similarity measure, the top-ranked variants on the corpus of Web pages employ cosine similarity, while for microblogs there is no clear indication that the cosine measure outperforms the others. Furthermore, normalization does not improve results when the corpus consists of tweets; in contrast, when the corpus comprises Web pages, normalization at the level of virtual documents considerably improves the MAP scores. No comparison can be made at the level of term sets, since Schedl et al. (2011) does not consider different dictionaries for indexing. As for the query scheme, QS_M, which adds the term "music" to the sought named entity, clearly outperforms QS_A on the Web page corpus, while the inverse holds on the microblog corpus. Adding domain-specific search terms to the query thus seems counterproductive when searching for microblogs, since it prunes the set of tweets too heavily, whereas it is necessary for filtering unrelated Web pages from the search results.
4.5 Alternative classifiers
QS_A clearly outperforms QS_M. SVM benefits from having access to as much data as possible, i.e., it achieves the highest accuracies when operating on term set TS_A. The Random Forest classifier yields significantly lower accuracies and performs best when using artist names as the term set (TS_N). The rule learner RIPPER performs best on the Freebase set TS_F, probably because the semantic distinction between the terms in this dictionary is the clearest. Performing no normalization proved beneficial also for classifiers other than kNN, although in the case of RF this is more apparent from the top ranks in Fig. 9e than from Table 14. To achieve top performance with the RF classifier, using IDF_F (in addition to QS_A.TS_N) seems more important than employing a particular normalization function. No clear picture emerges, in contrast, when analyzing the impact of the term frequency formulation. Even though the top four performers with SVM employ variants of the TF_C formulation, combinations including several other formulations are found among the top-performing variants as well. Among the top-performing variants in the RF experiments, the simple binary match function TF_A appears surprisingly often; for the decision tree learner, the IDF formulation hence seems to be more important. For RIPPER, variants of TF_C clearly outperform all other choices.
Top- and bottom-ranked variants for SVM:
| Acc. (%) | Variant |
|---|---|
| 71.875 | QS_A.TS_A.NORM_NO.TF_C.IDF_G |
| 71.875 | QS_A.TS_A.NORM_NO.TF_C2.IDF_G |
| 71.429 | QS_A.TS_A.NORM_NO.TF_C3.IDF_G |
| 70.089 | QS_A.TS_A.NORM_NO.TF_C2.IDF_F |
| 69.643 | QS_A.TS_A.NORM_MAX.TF_F.IDF_G |
| 69.196 | QS_A.TS_A.NORM_NO.TF_C.IDF_F |
| 68.750 | QS_A.TS_A.NORM_MAX.TF_B.IDF_F |
| 68.750 | QS_A.TS_A.NORM_MAX.TF_D.IDF_F |
| 68.750 | QS_A.TS_A.NORM_MAX.TF_G.IDF_F |
| 68.750 | QS_A.TS_A.NORM_MAX.TF_G.IDF_G |
| 68.750 | QS_A.TS_A.NORM_NO.TF_D.IDF_F |
| 68.750 | QS_A.TS_A.NORM_SUM.TF_D.IDF_F |
| \(\cdots\) | \(\cdots\) |
| 7.143 | QS_A.TS_S.NORM_MAX.TF_A.IDF_F |
| 7.143 | QS_A.TS_S.NORM_MAX.TF_E.IDF_F |
| 7.143 | QS_A.TS_S.NORM_NO.TF_A.IDF_F |
| 7.143 | QS_A.TS_S.NORM_NO.TF_E.IDF_F |
| 7.143 | QS_A.TS_S.NORM_NO.TF_F.IDF_F |
| 7.143 | QS_A.TS_S.NORM_NO.TF_G.IDF_F |
| 7.143 | QS_A.TS_S.NORM_SUM.TF_A.IDF_F |
| 7.143 | QS_A.TS_S.NORM_SUM.TF_E.IDF_F |
| 6.696 | QS_A.TS_S.NORM_MAX.TF_C.IDF_F |
| 6.696 | QS_A.TS_S.NORM_MAX.TF_C3.IDF_F |
Top- and bottom-ranked variants for RF:
| Acc. (%) | Variant |
|---|---|
| 57.589 | QS_A.TS_N.NORM_MAX.TF_A.IDF_F |
| 57.589 | QS_A.TS_N.NORM_SUM.TF_E.IDF_F |
| 57.589 | QS_A.TS_N.NORM_NO.TF_A.IDF_F |
| 57.589 | QS_A.TS_N.NORM_SUM.TF_A.IDF_F |
| 57.589 | QS_A.TS_N.NORM_MAX.TF_E.IDF_F |
| 57.589 | QS_A.TS_N.NORM_NO.TF_E.IDF_F |
| 57.143 | QS_A.TS_N.NORM_SUM.TF_G.IDF_F |
| 56.696 | QS_A.TS_N.NORM_NO.TF_F.IDF_F |
| 55.357 | QS_A.TS_N.NORM_MAX.TF_C2.IDF_F |
| 54.911 | QS_A.TS_N.NORM_NO.TF_A.IDF_G |
| 54.911 | QS_A.TS_N.NORM_SUM.TF_A.IDF_G |
| 54.911 | QS_A.TS_N.NORM_MAX.TF_A.IDF_G |
| 54.911 | QS_A.TS_N.NORM_MAX.TF_C2.IDF_G |
| 54.911 | QS_A.TS_N.NORM_SUM.TF_C2.IDF_G |
| \(\cdots\) | \(\cdots\) |
| 7.589 | QS_A.TS_S.NORM_MAX.TF_C2.IDF_F |
| 7.143 | QS_A.TS_S.NORM_SUM.TF_A.IDF_F |
| 7.143 | QS_A.TS_S.NORM_MAX.TF_A.IDF_F |
| 7.143 | QS_A.TS_S.NORM_MAX.TF_F.IDF_F |
| 7.143 | QS_A.TS_S.NORM_NO.TF_B.IDF_F |
| 7.143 | QS_A.TS_S.NORM_SUM.TF_C2.IDF_F |
| 7.143 | QS_A.TS_S.NORM_NO.TF_A.IDF_F |
| 7.143 | QS_A.TS_S.NORM_NO.TF_C2.IDF_F |
| 6.697 | QS_A.TS_S.NORM_SUM.TF_G.IDF_F |
| 6.697 | QS_A.TS_S.NORM_SUM.TF_B.IDF_F |
| 6.250 | QS_A.TS_S.NORM_MAX.TF_C.IDF_F |
Top- and bottom-ranked variants for RIPPER:
| Acc. (%) | Variant |
|---|---|
| 58.4821 | QS_A.TS_F.NORM_NO.TF_B.IDF_H |
| 58.4821 | QS_A.TS_F.NORM_NO.TF_C3.IDF_H |
| 58.0357 | QS_A.TS_F.NORM_MAX.TF_C.IDF_C |
| 58.0357 | QS_A.TS_F.NORM_MAX.TF_C.IDF_D |
| 58.0357 | QS_A.TS_F.NORM_MAX.TF_C.IDF_E |
| 58.0357 | QS_A.TS_F.NORM_MAX.TF_C.IDF_J |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_C.IDF_J |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_C2.IDF_I |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_C.IDF_A |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_C2.IDF_A |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_C.IDF_B2 |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_B.IDF_D |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_C2.IDF_D |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_B.IDF_C |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_B.IDF_A |
| 57.5893 | QS_A.TS_F.NORM_NO.TF_C3.IDF_B |
| \(\cdots\) | \(\cdots\) |
| 4.4643 | QS_A.TS_S.NORM_SUM.TF_G.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_NO.TF_E.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_NO.TF_B.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_SUM.TF_C.IDF_F |
| 4.4643 | QS_A.TS_D.NORM_MAX.TF_A.IDF_G |
| 4.4643 | QS_A.TS_S.NORM_NO.TF_G.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_SUM.TF_B.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_NO.TF_F.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_SUM.TF_E.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_SUM.TF_C2.IDF_F |
| 4.4643 | QS_A.TS_D.NORM_SUM.TF_A.IDF_G |
| 4.4643 | QS_A.TS_S.NORM_MAX.TF_G.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_MAX.TF_C2.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_NO.TF_A.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_NO.TF_C2.IDF_F |
| 4.4643 | QS_A.TS_S.NORM_SUM.TF_D.IDF_F |
- QS_A clearly outperforms QS_M with all classifiers.
- TS_A is found frequently among the top ranks in the SVM experiments, but also peaks at the very bottom ranks. The top 600 ranks of the RF experiments are entirely dominated by TS_N, and TS_F performs very well with the RIPPER classifier. The generic but broad vocabularies TS_A and TS_S perform markedly worse when using RF or RIPPER. Rule learners and decision tree learners seem to benefit from a smaller but more well-defined vocabulary, such as TS_N or TS_F.
- It is hard to argue for or against specific choices of the term frequency function. When using an SVM classifier, Fig. 8c might suggest employing TF_E or TF_G, because both are frequently found among the top-ranked variants; however, neither performs consistently well. In particular, TF_E also occupies inferior positions around rank 2,600. For the RF classifier, TF_G likewise seems the most favorable TF formulation. One clear piece of advice is to refrain from TF_A, regardless of the classifier applied: even though binary match performs well in some settings, its peaks at mediocre and lowest ranks by no means suggest its use.
- The rather uniform distribution of the IDF variants among all ranks does not support any specific advice.
- The peaks of the NORM_NO setting, slight when using RF or RIPPER and rather dominant for SVM, correspond well to the observation already made for the kNN experiments using MAP as the performance measure. Due to the special characteristics of tweets, performing document length normalization is not advisable.
| | SVM | RF | RIPPER |
|---|---|---|---|
| SVM | | 0.071 | 0.528 |
| RF | 0.071 | | 0.189 |
| RIPPER | 0.528 | 0.189 | |
5 Conclusions and future work
We investigated Twitter posts for the purpose of similarity estimation between named entities. To this end, we performed tens of thousands of single experiments on three data sets: two related to the music domain and one from the movie domain. Different algorithmic choices regarding query scheme, index term set, length normalization, \(TF\cdot IDF\) formulation, and similarity measure were thoroughly investigated. The main findings can be summarized as follows:
- Restricting the search by domain-specific keywords prunes the resulting set of tweets too heavily. Using only the named entity as query (QS_A) should be favored.
- Top-ranked results are achieved using all terms in the corpus (TS_A), though at high computational cost and with little robustness against small changes in the other algorithmic choices. If computational complexity or robustness is an issue, the results suggest using a domain-specific dictionary as the index term set (TS_F for the music domain or TS_D for the movie domain).
- Normalizing for length does not significantly improve the results, neither when performed on term vectors nor when included in the similarity function. Taking the higher computational cost into account, we therefore recommend refraining from normalization (NORM_NO) and using as the similarity measure, for example, the inner product (SIM_INN) or the Jeffrey divergence-based similarity (SIM_JEF). Both SIM_EUC and SIM_OVL should definitely be avoided.
- The binary match TF formulation TF_A should not be used. The most favorable variants are TF_C2 and TF_E, but TF_F and TF_G do not perform significantly worse, regardless of the data set used.
- Among the IDF formulations, we suggest refraining from IDF_A, IDF_F, and IDF_G, as they performed poorly on all data sets. Better alternatives are the formulations IDF_B2, IDF_E, and IDF_J, which ranked well on all sets.