1 Introduction
We define term scoring as follows: we have a document collection D consisting of one or more documents. Our goal is to generate a list of terms T with, for each \(t \in T\), a score that indicates how relevant t is for describing D. Each t is a candidate term: a sequence of n words, which can be a single-word term or a multi-word term.

What factors determine the success of a term scoring method for keyword extraction?
- What is the influence of the collection size?
- What is the influence of the background collection?
- What is the influence of multi-word phrases?
2 Our approach
N-grams that contain no alphabetic characters ([a-z]) are skipped, and n-grams that contain a stopword or a 1-letter word are skipped. We do not apply filtering for part-of-speech patterns, because it cannot be known in advance which POS patterns are relevant for the collection. For example, for some domains we might only be interested in noun phrases as terms, while for another domain verb phrases are important too.1 Table 1 shows the list of candidate terms extracted for a short example text; a code sketch of this extraction step follows the table.

Example text: Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing
| Candidate terms | Skipped n-grams |
|---|---|
| Information | Is |
| Retrieval | The |
| Activity | Of |
| Obtaining | To |
| Resources | An |
| Relevant | From |
| Need | a |
| Collection | Activity of |
| Information retrieval | Relevant to |
| Obtaining information | Need from |
| Information resources | Collection of |
| Resources relevant | Retrieval is |
| Information need | Resources relevant to |
| Obtaining information resources | Information need from |
| Information resources relevant | Can |
| Searches | Be |
| Based | On |
| Metadata | Or |
| Full-text | Other |
| Content-based | Searches can |
| | Based on |
| | Metadata or |
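A minimal sketch of this extraction step (the stopword list and the maximum n-gram length are illustrative assumptions; the actual settings are not shown here):

```python
import re

# Illustrative stopword subset; the actual list used is not shown here
STOPWORDS = {"is", "the", "of", "to", "an", "from", "a",
             "can", "be", "on", "or", "other"}

def extract_candidates(text, max_n=3):
    """Extract word n-grams up to length max_n as candidate terms.
    Skip n-grams with no alphabetic characters ([a-z]) and n-grams
    containing a stopword or a 1-letter word."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngram = words[i:i + n]
            if not any(re.search(r"[a-z]", w) for w in ngram):
                continue  # no letters at all (e.g. numbers only)
            if any(w in STOPWORDS or len(w) == 1 for w in ngram):
                continue  # contains a stopword or a 1-letter word
            candidates.append(" ".join(ngram))
    return candidates
```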
3 Term scoring
3.1 Term scoring methods
- Term frequency, implemented as the maximum likelihood estimate of the probability of occurrence of a term in the collection; i.e., P(t|D) is estimated as the relative term frequency of t in D: \(tf(t,D) = \frac{count(t,D)}{|D|}\), in which |D| is the size of D (the total number of words in D).
- Informativeness is related to specificity: how much information does a term t provide about D? Most methods for extracting informative terms from a collection use a background collection to determine the informativeness of a term: terms that are much more frequent in D than in a background collection C are the most informative for D. This background collection can be either the collection in which D is included (Hiemstra et al. 2004), or an external collection (Rayson and Garside 2000). An exception is the work by Matsuo and Ishizuka (2004), which instead exploits the top-k most frequent terms in the document as background model.
- Phraseness is a score for how strong (or how 'tight') the combination of words in a multi-word sequence is. Phraseness methods were specifically designed for the extraction of multi-word terms. These methods measure the relevance of a term using the relative frequencies of the term and its component unigrams (Tomokiyo and Hurst 2003), or the frequencies of the longer terms in which a multi-word term is embedded (Frantzi et al. 2000). A sketch of these three scores follows this list.
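A minimal sketch of the three scores, assuming the pointwise Kullback–Leibler formulations of Tomokiyo and Hurst (2003) for informativeness and phraseness; the function names are ours:

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Maximum likelihood estimate P(t|D): the relative frequency of
    term t in the token list of collection D."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def kl_informativeness(p_fg, p_bg):
    """Pointwise KL divergence of the foreground probability of t
    against its background probability: high when t is much more
    frequent in D than in the background collection C."""
    return p_fg * math.log(p_fg / p_bg)

def kl_phraseness(p_phrase, unigram_probs):
    """Pointwise KL divergence of the phrase probability against the
    product of its component unigram probabilities: high when the words
    co-occur far more often than independence would predict."""
    return p_phrase * math.log(p_phrase / math.prod(unigram_probs))
```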
3.2 Methods for scoring the informativeness of terms
3.2.1 Parsimonious language models (PLM)
3.2.2 Kullback–Leibler divergence for informativeness (KLI)
3.2.3 Frequency profiling (FP)
3.2.4 Co-occurrence based \(\chi ^2\) (CB)
3.3 Methods for scoring the phraseness of terms
3.3.1 C-Value
3.3.2 Kullback–Leibler divergence for phraseness (KLP)
3.4 Combining informativeness and phraseness
| Method | Principle | Designed for modelling a... | Section |
|---|---|---|---|
| CB | I | Single document, independent of a collection | 3.2.4 |
| PLM | I | Single document, as part of a collection | 3.2.1 |
| FP | I | Collection, in comparison to another collection | 3.2.3 |
| C-Value | P | Collection, independent of another collection | 3.3.1 |
| KLIP | I & P | Collection, in comparison to a background collection | 3.4 |

Principle: I = informativeness, P = phraseness.
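One plausible reading of the KLIP combination, consistent with the \(\gamma\) settings reported in Sects. 5.2 and 5.3 (where \(\gamma = 0.0\) yields pure phraseness and \(\gamma = 1.0\) pure informativeness), is a linear interpolation of the two scores; the sketch below rests on that assumption:

```python
def klip_score(kli_score, klp_score, gamma):
    """Assumed linear interpolation of informativeness (KLI) and
    phraseness (KLP): gamma = 1.0 weighs only KLI, gamma = 0.0
    only KLP."""
    return gamma * kli_score + (1 - gamma) * klp_score
```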
3.5 Hypotheses: strengths of the term scoring methods
4 Evaluation collections
4.1 Author profiling using a personal scientific document collection
4.1.1 Task
4.1.2 Collection and preprocessing
4.1.3 Evaluation method
4.2 Query term suggestion for news monitoring (QUINN)
4.2.1 Task
4.2.2 Collection and preprocessing
4.2.3 Evaluation method
4.3 Personalized query suggestion
4.3.1 Task
4.3.2 Collection and preprocessing
4.3.3 Evaluation method
4.4 Medical query expansion for patient queries
Three Indri query operators are compared: #1() (treating the string between brackets as a literal phrase), #combine() (treating the string between brackets as a bag of words) and #uwN() (all words between brackets must appear within a window of length N, in any order).10 They find that #uwN is the most powerful operator. In Sect. 5.1.2, we describe our strategy for query expansion with terms from the discharge summary, based on these findings.

4.4.1 Task
4.4.2 Collection and preprocessing
The discharge summaries contain placeholders of the form [** ...**] (e.g. [**MD Number 2860**]), which were added by the track organizers for the purpose of data anonymization. A topic in the CLEF-eHealth task consists of several descriptive fields: title, description, profile and narrative. We use the title field, or the title together with the description, as query. For query construction, all characters that are not alphanumeric, a hyphen or whitespace are removed from the query, and all letters are lowercased. The words in the query are then concatenated into one string and combined using the #combine operator in the Indri query language (a minimal sketch is given below). The result is the baseline query for the topic, which will be expanded with terms from the discharge summary.
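A minimal sketch of this baseline query construction (the function name is ours):

```python
import re

def baseline_indri_query(title, description=None):
    """Build the baseline topic query: drop characters that are not
    alphanumeric, a hyphen or whitespace; lowercase; wrap the remaining
    words in Indri's #combine operator."""
    text = title if description is None else title + " " + description
    text = re.sub(r"[^0-9A-Za-z\s-]", " ", text).lower()
    return "#combine(" + " ".join(text.split()) + ")"

# baseline_indri_query("Esophageal perforation and risk")
# returns '#combine(esophageal perforation and risk)'
```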
4.4.3 Evaluation method
| Collection | Use case | Evaluation |
|---|---|---|
| Personal scientific document collection (English) | Author Profiling using a personal document collection | Intrinsic, using human term judgments |
| News articles, retrieved with Boolean queries (Dutch) | Query term suggestion for news monitoring (QUINN) | Intrinsic, using human term judgments |
| Scientific articles, metadata and books (iSearch), retrieved for domain-specific queries (English) | Personalized Query Suggestion | Intrinsic, using ground truth search terms |
| Discharge summaries (CLEF-eHealth), connected to layman queries (English) | Medical Query Expansion for patient queries | Extrinsic, through retrieval task |
5 Experiments with term scoring methods
- The influence of collection size on the effectiveness of term scoring (5.1.1)
- Comparing methods for small data collections (5.1.2)
- Comparing methods with different background corpora in the Personalized Query Suggestion collection (5.2.1)
- Comparing methods with different background corpora in the QUINN collection (5.2.2)
5.1 What is the influence of the collection size?
| Collection | No. of docs | No. of words |
|---|---|---|
| Author Profiling | 22 docs (avg per user) | 63,938 (avg per user) |
| QUINN | 1031 docs (avg per query) | 64,953 (avg per query) |
| Personalized Query Suggestion | 42 rel docs (avg per topic) | 2250 (avg per topic) |
| Medical query expansion | 1 discharge summary | 609 (avg per topic) |
5.1.1 The influence of collection size on the effectiveness of term scoring
The first two columns (KLP, C-Value) are phraseness methods; the last three (PLM, KLI, FP) are informativeness methods.

| KLP | C-Value | PLM | KLI | FP |
|---|---|---|---|---|
| Entity ranking | Entity ranking | Category | Pages | Pages |
| Ad hoc | Anchor text | Categories | Categories | Categories |
| Anchor text | Ad hoc | Query | Query | Query |
| Test persons | Test persons | Entity | Results | Results |
| et al | Relevance feedback | Pages | Using | Using |
| Word clouds | Language model | Using | Retrieval | Retrieval |
| Relevance feedback | Word clouds | Results | Documents | Documents |
| New york | et al | Retrieval | Topical | Entity |
| Language model | Category information | Documents | Wikipedia | Category |
| Entity ranking topics | Target categories | Information | Topics | Topical |
5.1.2 Comparing methods for small data collections
| Title from CLEF topic: | <title>Esophageal perforation and risk</title> |
|---|---|
| Indri query (topic title): | #combine(esophageal perforation and risk) |
| Top-5 terms from discharge summary added to query: | |
| PLM | mg, patient, day, hospital, tube |
| KLIP | mg, hospital day, ampicillin gentamicin, three times, ampicillin |
| CB | mg, day, patient, patients, hospital |
| FP | mg, ampicillin, hospital day, avonex, baclofen |
| C-Value | hospital day, three times, ampicillin gentamicin, location un, advanced multiple sclerosis |
| Example of expanded Indri query: | #combine(esophageal perforation and risk #weight( 0.024382201790445927 mg 0.01744960633704929 #2(hospital day) 0.016052177097263427 #2(ampicillin gentamicin) 0.013107586537605164 #2(three times) 0.011385981676144982 ampicillin )) |
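A short sketch that reproduces the expanded-query format in the example above; we assume the weights are the terms' (normalized) scores, and multi-word terms are wrapped in Indri's #2() ordered-window operator, as in the example:

```python
def expand_query(baseline_terms, scored_terms):
    """Append a #weight clause with the top-scored expansion terms.
    baseline_terms: the words of the baseline query;
    scored_terms: (term, weight) pairs, e.g. [("mg", 0.024), ...]."""
    weighted = []
    for term, weight in scored_terms:
        words = term.split()
        # multi-word terms become ordered-window expressions: #2(w1 w2)
        indri_term = words[0] if len(words) == 1 else "#2(" + " ".join(words) + ")"
        weighted.append("{} {}".format(weight, indri_term))
    return ("#combine(" + " ".join(baseline_terms) +
            " #weight( " + " ".join(weighted) + " ))")
```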
| | | | |
|---|---|---|---|
| mg | tablet | right | blood pressure |
| sig one | one | mg tablet sig | admission date |
| mg po | sex | tablets | tablet sig |
| patient | sig | po | day |
| mg tablet | discharge | one tablet | tablet sig one |
5.1.3 Discussion: What is the influence of the collection size?
In Sect. 5.1.2, we found that all methods are hindered by small collection sizes (a few hundred words): the absolute frequencies of specific terms are low, and 1 or 2 additional occurrences of a term make a large relative difference.

Hypothesis: We expect that larger collections will lead to better terms for all methods, because the term frequency criterion is harmed by sparseness. In addition, we expect that PLM is best suited for small collections, because the background collection is used for smoothing the (sparse) probabilities for the foreground collection. Although CB was designed for term extraction from small collections without any background corpus, we do expect it to suffer from sparseness, because the co-occurrence frequencies will be low for small collections. We expect KLIP and C-Value to be best suited for larger collections because of the sparseness of multi-word terms. The same holds for FP, which is similar to KLIP, and was developed for corpus profiling.
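To illustrate how PLM uses the background collection for smoothing, here is a sketch of the EM re-estimation described by Hiemstra et al. (2004); the \(\lambda\) parameter is the one tuned in our experiments, and we assume every term has a nonzero background probability:

```python
def parsimonious_lm(tf_doc, p_background, lam=0.01, iterations=50):
    """EM estimation of a parsimonious language model (Hiemstra et al.
    2004). tf_doc maps each term to its frequency in the foreground
    collection; p_background maps each term to its probability in the
    background collection. Terms that are well explained by the
    background model lose probability mass in the document model."""
    total = sum(tf_doc.values())
    p_doc = {t: f / total for t, f in tf_doc.items()}  # MLE initialisation
    for _ in range(iterations):
        # E-step: expected term counts attributed to the document model
        e = {t: tf_doc[t] * lam * p_doc[t] /
                (lam * p_doc[t] + (1 - lam) * p_background[t])
             for t in tf_doc}
        # M-step: renormalise the expected counts into probabilities
        norm = sum(e.values())
        p_doc = {t: c / norm for t, c in e.items()}
    return p_doc
```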
5.2 What is the influence of the background collection?
5.2.1 Comparing methods with different background corpora in the personalized query suggestion collection
| Method | COCA (SD) | iSearch (SD) | P value for the difference |
|---|---|---|---|
| PLM (\(\lambda =0.01\)) | 0.028 (0.050) | 0.042 (0.087) | 0.152 |
| FP | 0.025 (0.043) | 0.040 (0.072) | 0.042 |
| KLIP (\(\gamma =1.0\)) | 0.026 (0.047) | 0.038 (0.069) | 0.076 |
| FP with iSearch | FP with COCA |
|---|---|
| Magnetic | Magnetic |
| Solar | Flux |
| Coronal | Fields |
| Flux | Simulations |
| Magnetic flux | Solar |
| Corona | Coronal |
| Convection | Corona |
| Tube | Heating |
| Magnetic fields | Convection |
| Tubes | Magnetic flux |
5.2.2 Comparing methods with different background corpora in the QUINN collection
Topic: Biodiversiteit 'Biodiversity'

Query: (Biodiversiteit AND (natuur! or rode lijst! or planten or dieren or vogels or vissen or zee! or zeeen or oceaan or oceanen or exoten or uitheemse flora or uitheemse fauna or inheemse planten or inheemse dieren or inheemse flora or inheemse fauna or duurzaamheid or soorten!)) OR otter OR gierzwaluw OR kiekendief OR trekvogel AND NOT vogelgriep OR ...)

| Generic newspaper background corpus | Topic-related background corpus |
|---|---|
| natuur 'nature' | vogelteldag 'bird count day' |
| hectare 'hectare' | spreeuw 'starling' |
| vogelteldag 'bird count day' | getelde vogel 'counted bird' |
| trekvogels 'migrating birds' | vaakst 'most often' |
| spreeuw 'starling' | getelde 'counted' |

Topic: ICT beleid 'ICT policy'

Query: (sms w/4 (gedragscod! OR meldpun!)) OR (overstap! w/p (telefo! OR internet!)) OR telemarket! OR ((telecomwet! OR regule! OR wet OR wetten OR wetg!) AND (internet! OR cookie!)) OR ((veilen OR geveild OR veiling!) w/p frequenti!) OR frequentieveil! OR (marktrapportag! w/s ele?tron! communic!) OR digitale agenda! OR overheidsdata OR ict office OR ecp epn OR logius OR digipoort OR (duurza! w/s ict) OR (energie! w/s ict) OR (declaration w/2 amsterdam) OR (verklaring w/2 amsterdam) OR WCIT OR (world congress w/s allcaps(IT)) OR (SBR AND NOT bouw) OR standard business reporting OR (mobiel w/2 betalen) OR (betalen w/3 (telefoon OR mobiel OR gsm)) OR sggv OR slim geregeld goed verbonden OR (eod AND NOT explosieven!) OR ele?tron! ondernem! OR ele?tron! zaken! OR (Besluit Universele Dienstverlening w/s Eindgebruikersbelangen) OR apps for amsterdam OR apps for holland OR hack de overheid OR (toegang! w/s (web OR internet)) OR qiy OR ioverheid OR iautoriteit OR (crisis! w/2 ICT!) OR (clearinghouse w/s botnet!) or (deltaplan w/s ict)

| Generic newspaper background corpus | Topic-related background corpus |
|---|---|
| rubricering 'classification' | a-film 'A-film' |
| internet 'Internet' | agendapunt 'item on agenda' |
| staden 'Staden' | westrozebeke 'Westrozebeke' |
| datum 'date' | ivm agendapunt 'concerning item on agenda' |
| google 'Google' | moorslede 'Moorslede' |
5.2.3 Discussion: What is the influence of the background collection?
With term extraction for query suggestion in the scientific domain (the Personalized Query Suggestion collection, Sect. 5.2.1), we had relatively small collections (2250 words on average per topic) that are part of the background collection. For this type of collection we would expect PLM to outperform FP and KLIP. The results in terms of Mean Average Precision (Table 8) indicate that PLM is indeed slightly better than the other methods, but these differences are not significant. This is probably due to the strictly defined baseline (a small set of human-formulated query terms). Throughout all experiments we have seen that FP and KLIP perform similarly. We already noted in Sect. 3.2 that the two methods are similar to each other. The asymmetry of the KLIP function explains why its performance is a little better than FP in Fig. 1. This confirms the second part of our hypothesis.

Hypothesis: Three methods use a background collection: PLM, FP and KLIP. Of these, we expect PLM to be best suited for term extraction from a foreground collection (or document) that is naturally part of a larger collection, because the background collection is used for smoothing the probabilities for the foreground collection. FP and KLIP are best suited for term extraction from an independent document collection, in comparison to another collection. KLIP is expected to generate better terms than FP because KLIP's scoring function is asymmetric: it only generates terms that are informative for the foreground collection.
| Collection | Background | Coverage of top-10 terms (%) | Quality (PLM) |
|---|---|---|---|
| Personalized Query Suggestion | iSearch | 100 | 0.042 |
| Personalized Query Suggestion | COCA | 76 | 0.028 |
| QUINN | SONAR | 71 | 11 % |
| QUINN | Topic-specific | 51 | 6 % |
5.3 What is the influence of multi-word phrases?
Author Profiling. Collection of scientific articles authored by one person, who has obtained a PhD in information retrieval. In a short CV, she describes her research topics as "entity ranking, searching in Wikipedia, and generating word/tag clouds."

| KLIP (\(\gamma =0.0\)) | KLIP (\(\gamma =0.3\)) | KLIP (\(\gamma =0.6\)) | KLIP (\(\gamma =0.9\)) |
|---|---|---|---|
| Entity ranking | Categories | Categories | Categories |
| Anchor text | Query | Query | Query |
| Relevance feedback | Documents | Documents | Documents |
| New york | Retrieval | Retrieval | Retrieval |
| Word clouds | Pages | Pages | Pages |

Personalized Query Suggestion for one example topic (009). Information need: "I want information on how to measure dielectric properties on cells, for example in microfluidic systems."

| KLIP (\(\gamma =0.0\)) | KLIP (\(\gamma =0.3\)) | KLIP (\(\gamma =0.6\)) | KLIP (\(\gamma =0.9\)) |
|---|---|---|---|
| Biological cells | Dielectric | Dielectric | Dielectric |
| Alternating current | Biological cells | Cell | Cell |
| Elastomer actuators | Alternating current | Biological cells | Suspensions |
| Spectral representation | Elastomer actuators | Suspensions | Electrorheological |
| Low-frequency sub-dispersion depended | Cell | Electrorheological | Cells |
5.3.1 Discussion: What is the influence of multi-word phrases?
When comparing informativeness methods and phraseness methods for a given collection, two aspects play a role: multi-word terms are often considered to be better terms than single-word terms (see Sect. 5.3), but multi-word terms have lower frequencies than single-word terms (see Sect. 3.1), which makes them sparse in small collections. In the case of a small collection, consisting of one or a few documents, the frequency criterion will select mostly single-word terms. For that reason, KLIP performs better than C-Value. In addition, we saw in Sect. 5.1.1 that KLP without the informativeness criterion also outperforms C-Value. As we pointed out in Sect. 3.3, both methods select terms on the basis of different criteria. In C-Value, the score for a term is discounted if the term is nested in frequent longer terms (e.g. the score for 'surgery clinic' would be discounted because it is embedded in the relatively frequent term 'plastic surgery clinic'). In KLP, on the other hand, the frequency of the term as a whole is compared to the frequencies of the unigrams that it contains; the intuition is that relatively frequent multi-word terms that are composed of relatively low-frequent unigrams (e.g. 'ad hoc', 'new york') are the strongest phrases. We found that the KLP criterion tends to generate better terms than the C-Value criterion. A sketch of the C-Value discounting is given below.

Hypothesis: We expect C-Value and KLIP to give the best results for collections and use cases where multi-word terms are important. CB, PLM and FP are also capable of extracting multi-words, but for these methods the scores of multi-words are expected to be lower than the scores of single-words. On the other hand, C-Value cannot extract single-word terms, which we expect to be a weakness because single-words can also be good terms.
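A hedged sketch of the C-Value criterion (Frantzi et al. 2000) as described above; the helper names and the representation of candidates as space-joined strings are our assumptions:

```python
import math
from collections import Counter

def nests_in(shorter, longer):
    """True if the words of `shorter` occur as a contiguous
    subsequence of the words of `longer`."""
    s, l = shorter.split(), longer.split()
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def c_value(candidate_terms):
    """C-Value for multi-word candidates: a term's frequency is
    discounted by the average frequency of the longer candidate
    terms in which it is nested."""
    freq = Counter(candidate_terms)
    scores = {}
    for term, f in freq.items():
        n_words = len(term.split())
        if n_words < 2:
            continue  # C-Value is only defined for multi-word terms
        longer = [g for g in freq
                  if len(g.split()) > n_words and nests_in(term, g)]
        discount = sum(freq[g] for g in longer) / len(longer) if longer else 0.0
        scores[term] = math.log2(n_words) * (f - discount)
    return scores
```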
- Language. In compounding languages such as Dutch and German, noun compounds are written as a single word, e.g. boottocht 'boat trip'. In English, these compounds are written as separate words. As a result, the proportion of relevant terms that consist of multiple words is higher for English than for a compounding language such as Dutch. For example, the proportion of multi-words in the user-formulated Boolean queries for the Dutch collection QUINN is only 16 %, while the proportions of multi-word phrases in the ground truth term lists for Author Profiling and Personalized Query Suggestion are 50 and 57 % respectively. This implies that (a) we cannot generalize the results in this paper to all languages, and (b) although we recommend tuning the \(\gamma\) parameter for any new collection, this is even more important in the case of a new language.
- Domain. In the scientific domain (in our case the Author Profiling and Personalized Query Suggestion collections), more than half of the user-selected terms are multi-word terms. A method with a phraseness component (KLIP with a low \(\gamma\), or C-Value) is therefore the best choice for collections of scientific English documents.
- Use case and evaluation method. For Author Profiling, multi-word terms are highly important if the profile is meant for human interpretation (such as keywords in a digital library, or on an author profile): human readers prefer multi-word terms because of their descriptiveness. This implies that when terms are meant for human interpretation, a method with a phraseness component (KLIP with a low \(\gamma\), or C-Value) is the best choice. On the other hand, in cases where terms are used as query terms, single-word terms might be more effective, and PLM or FP would be preferable.
6 Conclusion
- larger collections lead to better terms.
- for collections larger than 10,000 words, the best performing method for the task of author profiling is Kullback–Leibler divergence for Informativeness and Phraseness (KLIP) (Tomokiyo and Hurst 2003).
- for modeling smaller collections of up to 5,000 words, the best performing method for the task of author profiling is Parsimonious Language Models (PLM) (Hiemstra et al. 2004). For PLM, we recommend empirically choosing (tuning) the \(\lambda\) parameter for each new combination of foreground and background collections, because the optimal value differs between collections and background corpora.