1 Introduction
2 Related work
3 Graph representation of text
3.1 Proposed weighted co-occurrence graph representation
4 Automatic enrichment of graphs
4.1 Node enrichment
4.2 Edge enrichment
4.3 Example to illustrate node enrichment and edge enrichment
5 Graph kernel-based text classification
5.1 Graph kernels for measuring document similarity
5.2 Graph kernel-based text classification pipeline
6 Experiments and results
- Sentence polarity dataset: This dataset consists of 5331 positive and 5331 negative movie reviews [23].
- Subjectivity dataset: This dataset consists of 5000 subjective and 5000 objective sentences from movie reviews, labelled according to their subjectivity status [22].
- News: This dataset is a collection of 32,602 short text documents, namely news items collected from the RSS feeds of nyt.com, usatoday.com and reuters.com and classified by topic. The topics are sports, business, US, health, sci&tech, world and entertainment. Each document consists of the title, description, link, id, date, source and category of the news item. We use only the description and the category [34].
- Multi-domain sentiment dataset: This dataset consists of 8000 product reviews obtained from amazon.com, covering four product domains: books, dvd, electronics and kitchen [4]. Each domain has 1000 positive and 1000 negative reviews.
- 20 Newsgroups: The 20 Newsgroups dataset contains 20,000 newsgroup documents classified into 20 different categories.
Dataset | Metric | Linear | Cosine | Sorensen | Tanimoto | RBF | Spgk (d=1) | Spgk (d=2) | Spgk (d=3) | Spgk (d=4) | Proposed method |
---|---|---|---|---|---|---|---|---|---|---|---|
Polarity | Precision | 77.15 | 77.15 | 76.65 | 77.36 | 77.09 | 77.13 | 77.18 | 77.43 | 77.77 | 81.47 |
Polarity | Recall | 77.12 | 77.11 | 76.60 | 77.33 | 77.07 | 77.10 | 77.14 | 77.39 | 77.74 | 81.42 |
Polarity | F1 | 77.12 | 77.11 | 76.59 | 77.32 | 77.06 | 77.10 | 77.13 | 77.39 | 77.74 | 81.42 |
Subjectivity | Precision | 90.98 | 91.12 | 90.21 | 90.86 | 91.07 | 90.82 | 91.02 | 90.85 | 90.85 | 92.74 |
Subjectivity | Recall | 90.94 | 91.08 | 90.18 | 90.84 | 91.03 | 90.80 | 91.00 | 90.82 | 90.83 | 92.73 |
Subjectivity | F1 | 90.94 | 91.08 | 90.18 | 90.84 | 91.03 | 90.80 | 91.00 | 90.82 | 90.83 | 92.73 |
Books | Precision | 80.58 | 80.91 | 79.88 | 79.85 | 80.29 | 80.85 | 81.13 | 81.11 | 80.55 | 86.12 |
Books | Recall | 80.44 | 80.77 | 79.69 | 79.74 | 80.09 | 80.74 | 81.09 | 81.04 | 80.49 | 86.04 |
Books | F1 | 80.42 | 80.79 | 79.66 | 79.72 | 80.06 | 80.72 | 81.09 | 81.03 | 80.48 | 86.04 |
Dvd | Precision | 81.70 | 82.61 | 79.67 | 80.59 | 81.88 | 80.63 | 81.86 | 81.30 | 81.34 | 87.40 |
Dvd | Recall | 81.55 | 82.50 | 79.50 | 80.50 | 81.70 | 80.50 | 81.75 | 81.25 | 81.25 | 87.20 |
Dvd | F1 | 81.53 | 82.49 | 79.47 | 80.49 | 81.68 | 80.48 | 81.73 | 81.24 | 81.24 | 87.19 |
Electronics | Precision | 80.72 | 80.25 | 81.07 | 82.36 | 80.29 | 83.07 | 83.38 | 84.13 | 84.05 | 86.01 |
Electronics | Recall | 80.45 | 80.05 | 81.00 | 82.30 | 79.95 | 83.00 | 83.30 | 84.05 | 84.00 | 85.90 |
Electronics | F1 | 80.41 | 80.01 | 80.99 | 82.29 | 79.98 | 82.99 | 83.29 | 84.04 | 83.99 | 85.89 |
Kitchen | Precision | 84.96 | 85.78 | 85.18 | 85.52 | 85.27 | 85.78 | 85.86 | 85.82 | 86.07 | 90.20 |
Kitchen | Recall | 84.90 | 85.70 | 84.95 | 85.35 | 85.20 | 85.70 | 85.75 | 85.70 | 85.95 | 90.10 |
Kitchen | F1 | 84.89 | 85.69 | 84.92 | 85.33 | 85.19 | 85.69 | 85.69 | 85.74 | 85.94 | 90.09 |
Dataset | Metric | Linear | Cosine | Sorensen | Tanimoto | RBF | Spgk (d=1) | Spgk (d=2) | Spgk (d=3) | Spgk (d=4) | Proposed method |
---|---|---|---|---|---|---|---|---|---|---|---|
20NG | Precision | 80.33 | 83.44 | 83.77 | 83.58 | 80.52 | 81.72 | 81.64 | 81.41 | 81.44 | 85.10 |
20NG | Recall | 79.23 | 83.03 | 83.27 | 82.97 | 78.59 | 80.92 | 80.72 | 80.55 | 80.56 | 84.19 |
20NG | F1 | 79.31 | 83.03 | 83.27 | 82.95 | 78.94 | 81.01 | 80.79 | 80.59 | 80.60 | 84.36 |
News | Precision | 82.49 | 82.89 | 81.34 | 81.39 | 82.63 | 80.88 | 80.92 | 80.91 | 81.01 | 84.39 |
News | Recall | 82.44 | 82.83 | 81.29 | 81.40 | 82.55 | 80.85 | 80.89 | 80.90 | 81.00 | 84.30 |
News | F1 | 82.34 | 82.76 | 81.16 | 81.30 | 82.40 | 80.72 | 80.74 | 80.74 | 80.85 | 84.20 |
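The tables above report precision, recall and F1 as percentages. This section does not state how the per-class scores are averaged, so the following is only a minimal sketch that assumes macro-averaging over classes. The function name `macro_precision_recall_f1` and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def macro_precision_recall_f1(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) as percentages.

    Assumption: scores are computed per class and averaged with equal
    class weight; the source does not confirm this averaging scheme.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return (100 * sum(precisions) / n,
            100 * sum(recalls) / n,
            100 * sum(f1s) / n)


# Toy example with two classes (hypothetical labels).
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
precision, recall, f1 = macro_precision_recall_f1(y_true, y_pred)
```

Under macro-averaging the reported F1 is the mean of the per-class F1 scores, so it need not equal the harmonic mean of the reported precision and recall.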
Dataset | Proposed method (w=2): Precision | Proposed method (w=2): Recall | Proposed method (w=2): F1 | Proposed method (w=3): Precision | Proposed method (w=3): Recall | Proposed method (w=3): F1 | Proposed method (w=4): Precision | Proposed method (w=4): Recall | Proposed method (w=4): F1 |
---|---|---|---|---|---|---|---|---|---|
Polarity | 81.47 | 81.43 | 81.42 | 81.38 | 81.35 | 81.34 | 81.15 | 81.11 | 81.11 |
Subjectivity | 92.74 | 92.73 | 92.73 | 92.81 | 92.80 | 92.80 | 92.63 | 92.62 | 92.62 |
Books | 86.12 | 86.04 | 86.04 | 86.24 | 86.14 | 86.13 | 85.75 | 85.64 | 85.63 |
Dvd | 87.40 | 87.20 | 87.19 | 87.54 | 87.35 | 87.34 | 87.05 | 86.90 | 86.89 |
Electronics | 86.01 | 85.90 | 85.89 | 86.82 | 86.70 | 86.69 | 86.63 | 86.50 | 86.49 |
Kitchen | 90.20 | 90.10 | 90.09 | 89.90 | 89.80 | 89.79 | 90.02 | 89.90 | 89.89 |
20NG | 85.10 | 84.19 | 84.36 | 85.17 | 83.91 | 84.18 | 85.15 | 83.54 | 83.88 |
News | 84.39 | 84.30 | 84.20 | 83.87 | 83.76 | 83.65 | 83.72 | 83.58 | 83.44 |
Dataset | Metric | CWK | \(\mathrm{CWK}_{wfs}\) | CMK | \(\mathrm{CMK}_{wfs}\) | Proposed method |
---|---|---|---|---|---|---|
Polarity | Precision | 62.56 | 77.49 | 63.16 | 75.89 | 78.38 |
Polarity | Recall | 62.07 | 77.46 | 62.68 | 75.82 | 78.36 |
Polarity | F1 | 61.69 | 77.46 | 62.33 | 75.81 | 78.35 |
Subjectivity | Precision | 82.83 | 91.43 | 81.73 | 90.10 | 91.46 |
Subjectivity | Recall | 82.75 | 91.40 | 81.60 | 90.10 | 91.45 |
Subjectivity | F1 | 82.74 | 91.40 | 81.58 | 90.10 | 91.45 |
Books | Precision | 66.29 | 72.61 | 72.21 | 76.73 | 81.96 |
Books | Recall | 66.17 | 72.43 | 71.93 | 76.69 | 81.95 |
Books | F1 | 66.11 | 72.37 | 71.85 | 76.69 | 81.95 |
Dvd | Precision | 69.57 | 78.00 | 71.93 | 76.50 | 84.32 |
Dvd | Recall | 69.50 | 78.00 | 71.50 | 76.50 | 84.25 |
Dvd | F1 | 69.47 | 78.00 | 71.36 | 76.50 | 84.24 |
Electronics | Precision | 74.00 | 79.50 | 74.02 | 81.05 | 81.51 |
Electronics | Recall | 74.00 | 79.50 | 74.00 | 81.00 | 81.50 |
Electronics | F1 | 74.00 | 79.50 | 73.99 | 81.00 | 81.50 |
Kitchen | Precision | 75.71 | 83.27 | 82.32 | 83.35 | 91.25 |
Kitchen | Recall | 75.50 | 83.25 | 82.25 | 83.25 | 91.25 |
Kitchen | F1 | 75.45 | 83.25 | 82.24 | 83.24 | 91.25 |
Dataset | Metric | CWK | \(\mathrm{CWK}_{wfs}\) | CMK | \(\mathrm{CMK}_{wfs}\) | Proposed method |
---|---|---|---|---|---|---|
20NG | Precision | 73.84 | 79.76 | 69.29 | 73.78 | 85.01 |
20NG | Recall | 73.56 | 79.19 | 68.98 | 73.52 | 83.94 |
20NG | F1 | 73.55 | 79.12 | 69.00 | 73.49 | 84.15 |
News | Precision | 71.34 | 83.04 | 69.24 | 78.60 | 83.95 |
News | Recall | 71.05 | 83.15 | 68.66 | 78.56 | 84.02 |
News | F1 | 71.04 | 83.06 | 68.27 | 78.42 | 83.89 |
Information considered | Linear | Cosine | Sorensen | Tanimoto | RBF | Spgk | CWK | CMK | Proposed method |
---|---|---|---|---|---|---|---|---|---|
Importance of terms based on class information (Supervised term weight) | No | No | No | No | No | No | Yes | Yes | Yes |
Co-occurrence information | No | No | No | No | No | Yes | No | No | Yes |
Importance of associations | No | No | No | No | No | \(\mathrm{Yes}^{\mathrm{a}}\) | No | No | \(\mathrm{Yes}^{\mathrm{b}}\) |
Incorporation of external knowledge | No | No | No | No | No | No | No | No | Yes |
Semantic similarity of terms and associations | No | No | No | No | No | No | No | No | Yes |
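
The Linear, Cosine, Sorensen, Tanimoto and RBF baselines in the comparison above are vector-space similarity measures over term-weight vectors, which is why they carry no co-occurrence information. The sketch below uses their common textbook definitions for real-valued vectors. The exact formulations and the `gamma` value used in the experiments are not given in this section, so both are assumptions here, and the function names are illustrative.

```python
import math


def dot(x, y):
    """Inner product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def cosine_kernel(x, y):
    denom = math.sqrt(dot(x, x) * dot(y, y))
    return dot(x, y) / denom if denom else 0.0


def sorensen_kernel(x, y):
    # Sorensen (Dice) similarity generalised to real-valued vectors.
    denom = dot(x, x) + dot(y, y)
    return 2 * dot(x, y) / denom if denom else 0.0


def tanimoto_kernel(x, y):
    denom = dot(x, x) + dot(y, y) - dot(x, y)
    return dot(x, y) / denom if denom else 0.0


def rbf_kernel(x, y, gamma=1.0):
    # gamma is a free bandwidth parameter; its default here is an assumption.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two toy documents over a three-term vocabulary (hypothetical weights).
doc_a = [1.0, 0.0, 1.0]
doc_b = [1.0, 1.0, 0.0]
similarities = {
    "linear": linear_kernel(doc_a, doc_b),
    "cosine": cosine_kernel(doc_a, doc_b),
    "sorensen": sorensen_kernel(doc_a, doc_b),
    "tanimoto": tanimoto_kernel(doc_a, doc_b),
    "rbf": rbf_kernel(doc_a, doc_b, gamma=0.5),
}
```

All five measures depend only on which terms the two documents share and on their weights, not on term order or adjacency.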