1 Introduction
2 Related Work
2.1 Handling Rare and Unknown Words
2.2 Word Clustering
3 Methodology
3.1 Clustering Data
3.2 Methods
3.3 Evaluation
4 Results
4.1 Rare and Unknown Word Thresholds
Parser | Terminal type | Parsed | UNK TH. | F-score | POS Acc. | Lex. Red. |
---|---|---|---|---|---|---|
Lorg | orig | tokens | 1 | 71.80 | 90.81 | N/A |
Lorg | orig | tagged | 5 | | | N/A |
Lorg | POS | tokens | 15 | | | 99.61 |
Lorg | POS | tagged | 15 | 74.65 | 99.54 | 99.61 |
Lorg | lemma | tokens | 1 | 71.54 | 90.87 | 27.83 |
Lorg | lemma | tagged | 5 | 77.25 | 99.53 | 27.83 |
Lorg | lemma_pos | tokens | 1 | 73.15 | 93.70 | 18.95 |
Lorg | lemma_pos | tagged | 5 | 77.30 | | 18.95 |
Berkeley | orig | tokens | 5 | 75.10 | 94.04 | N/A |
Berkeley | orig | tagged | 5 | | 99.87 | N/A |
Berkeley | POS | tokens | 15/20 | 74.20 | | 99.61 |
Berkeley | POS | tagged | 15/20 | 74.17 | 99.92 | 99.61 |
Berkeley | lemma | tokens | 5 | 73.56 | 92.89 | 27.83 |
Berkeley | lemma | tagged | 5 | 75.91 | 99.83 | 27.83 |
Berkeley | lemma_pos | tokens | 10 | | 95.97 | 18.95 |
Berkeley | lemma_pos | tagged | 10 | 76.01 | | 18.95 |
4.2 Suffix Results
Token type | Parsed | UNK TH. | F-score | POS Acc. | Lex. Red. |
---|---|---|---|---|---|
raw+orig | tokens | 1 | 75.90 | 93.45 | 59.24 |
raw+orig | tagged | 1 | 78.16 | 99.52 | 59.24 |
raw+unk_suffix0 | tokens | 1 | 75.88 | 93.26 | 93.86 |
raw+unk_suffix0 | tagged | 1 | 78.26 | 99.45 | 93.86 |
raw+unk_suffix1 | tokens | 5 | 76.14 | 94.05 | 93.45 |
raw+unk_suffix1 | tagged | 5 | 78.05 | | 93.45 |
raw+unk_suffix2 | tokens | 5 | | | 91.09 |
raw+unk_suffix2 | tagged | 10 | 78.20 | 99.40 | 91.09 |
raw+unk_suffix3 | tokens | 1 | 76.05 | 93.86 | 86.61 |
raw+unk_suffix3 | tagged | 5 | 78.10 | 99.40 | 86.61 |
raw+unk_suffix4 | tokens | 1 | 76.03 | 93.92 | 80.63 |
raw+unk_suffix4 | tagged | 5 | | 99.49 | 80.63 |
4.3 Cluster and Signature Results
Token type | Parsed | UNK TH. | F-score | POS Acc. | Lex. Red. |
---|---|---|---|---|---|
Craw | tokens | 1 | 76.47 | 94.22 | 93.38 |
Craw | tagged | 5 | | 99.52 | 93.38 |
raw_suffix2 | tokens | 5 | 76.27 | 94.23 | 91.09 |
raw_suffix2 | tagged | 10 | 78.10 | 99.40 | 91.09 |
Craw_suffix2 | tokens | 1 | 76.50 | 94.57 | 89.98 |
Craw_suffix2 | tagged | 1 | 78.17 | 99.40 | 89.98 |
raw_noCC | tokens | 1 | 76.00 | 93.68 | 92.73 |
raw_noCC | tagged | 1 | 78.10 | | 92.73 |
Craw_suffix2_noCC | tokens | 1 | | | 88.86 |
Craw_suffix2_noCC | tagged | 5 | 78.20 | | 88.86 |
Clemma_pos | tokens | 1 | | 96.54 | 93.32 |
Clemma_pos | tagged | 1 | 77.44 | 99.51 | 93.32 |
lemma_pos_suffix2 | tokens | 1 | 76.78 | | 91.63 |
lemma_pos_suffix2 | tagged | 1 | | 99.52 | 91.63 |
Clemma_pos_suffix2 | tokens | 5 | 76.77 | 96.63 | 90.54 |
Clemma_pos_suffix2 | tagged | 5 | 77.46 | | 90.54 |
lemma_pos_noCC | tokens | 1 | 73.67 | 94.08 | 94.04 |
lemma_pos_noCC | tagged | 10 | 77.48 | 99.53 | 94.04 |
Clemma_pos_suffix2_noCC | tokens | 1 | 76.08 | 95.61 | 90.53 |
Clemma_pos_suffix2_noCC | tagged | 5 | 77.45 | 99.53 | 90.53 |
Token type | Parsed | UNK TH. | F-score | POS Acc. | Lex. Red. |
---|---|---|---|---|---|
Craw | tokens | 5 | 75.59 | 93.72 | 93.38 |
Craw | tagged | 5 | 76.89 | 99.76 | 93.38 |
raw_suffix2 | tokens | 5 | 75.28 | 93.82 | 91.09 |
raw_suffix2 | tagged | 10 | 76.50 | 99.84 | 91.09 |
Craw_suffix2 | tokens | 5 | 75.66 | 94.27 | 89.98 |
Craw_suffix2 | tagged | 5 | 76.65 | 99.76 | 89.98 |
raw_noCC | tokens | 1 | 75.23 | 93.29 | 92.73 |
raw_noCC | tagged | 1 | | 99.36 | 92.73 |
Craw_suffix2_noCC | tokens | 1 | | | 88.86 |
Craw_suffix2_noCC | tagged | 10 | 76.60 | | 88.86 |
Clemma_pos | tokens | 5 | 75.76 | 96.27 | 93.32 |
Clemma_pos | tagged | 5 | 75.90 | 99.87 | 93.32 |
lemma_pos_suffix2 | tokens | 5 | 75.64 | 96.46 | 91.63 |
lemma_pos_suffix2 | tagged | 5 | | 99.85 | 91.63 |
Clemma_pos_suffix2 | tokens | 10 | | | 90.54 |
Clemma_pos_suffix2 | tagged | 10 | 75.93 | | 90.54 |
lemma_pos_noCC | tokens | 1 | 72.49 | 93.33 | 94.04 |
lemma_pos_noCC | tagged | 1 | 75.81 | 93.32 | 94.04 |
Clemma_pos_suffix2_noCC | tokens | 1 | 75.00 | 95.23 | 90.53 |
Clemma_pos_suffix2_noCC | tagged | 1 | 75.91 | 99.83 | 90.53 |
UNK type | Count | Top 3 POS categories | ||
---|---|---|---|---|
CUNK_en | 897 | NN (836) | NE (36) | ADJA (15) |
UNK_en | 624 | ADJA (279) | VVINF (134) | VVFIN (89) |
CUNK_er | 429 | NN (332) | NE (72) | ADJA (22) |
CUNK_ng | 255 | NN (231) | NE (23) | ADJA (1) |
CUNK_te | 127 | NN (115) | ADJA (8) | NE (3) |
CUNK_es | 112 | NN (86) | NE (18) | ADJA (7) |
CUNK_rn | 110 | NN (110) | ||
UNK_er | 108 | ADJA (79) | ADJD (18) | NN (7) |
CUNK_in | 106 | NN (69) | NE (37) | |
CUNK_el | 103 | NN (74) | NE (27) | PITA (1) |
UNK type | Count | Top 3 POS categories | ||
---|---|---|---|---|
CUNK_en | 884 | NN (795) | NE (32) | VVPP (6) |
CUNK_er | 515 | NN (351) | NE (123) | ADJA (34) |
UNK_en | 462 | ADJA (185) | VVINF (122) | VVFIN (82) |
CUNK_ng | 265 | NN (253) | NE (10) | FM/ADJD (1) |
CUNK_te | 174 | NN (166) | NE (4) | ADJA (4) |
CUNK_rn | 108 | NN (103) | NE (3) | ADV (2) |
CUNK_ft | 101 | NN (95) | NE (6) | |
UNK_er | 94 | ADJA (68) | ADJD (17) | NN (6) |
CUNK_es | 91 | NN (74) | NE (11) | ADJA (6) |
UNK_te | 89 | VVFIN (49) | ADJA (38) | ADV/NN (1) |
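
The signature labels in the tables above (for example `CUNK_en` versus `UNK_en`) suggest that a rare word is replaced by a marker combining a capitalization flag with its last few characters. The following is a minimal sketch of such a mapping, not the authors' implementation. The function names `unk_signature` and `replace_rare`, the lowercasing of the suffix, and the reading of the threshold as "frequency at or below the threshold" are all assumptions made for illustration.

```python
from collections import Counter


def unk_signature(word, suffix_len=2, mark_caps=True):
    """Build a hypothetical unknown-word signature such as 'CUNK_en'.

    Assumptions: a leading 'C' marks an uppercase first letter, and the
    suffix is the lowercased last `suffix_len` characters of the word.
    """
    prefix = "C" if mark_caps and word[:1].isupper() else ""
    suffix = "_" + word[-suffix_len:].lower() if suffix_len > 0 else ""
    return f"{prefix}UNK{suffix}"


def replace_rare(tokens, threshold=1, suffix_len=2, mark_caps=True):
    """Replace tokens whose frequency is <= threshold (assumed reading
    of the 'UNK TH.' column) with their signature; keep the rest."""
    counts = Counter(tokens)
    return [
        tok if counts[tok] > threshold
        else unk_signature(tok, suffix_len, mark_caps)
        for tok in tokens
    ]


if __name__ == "__main__":
    sentence = ["die", "die", "Zeitung", "gehen"]
    print(replace_rare(sentence, threshold=1, suffix_len=2))
```

Under these assumptions, setting `suffix_len=0` collapses every rare word to a plain `UNK` or `CUNK` marker, and larger values split the unknown class into progressively finer signatures.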
5 Discussion
5.1 External POS Tagger
Token type | System | F-Score | POS Acc. |
---|---|---|---|
Craw_suffix2_noCC | TnT | n/a | 94.43 |
Craw_suffix2_noCC | Lorg w/TnT Tags | 74.62 | 94.28 |
Craw_suffix2_noCC | Berkeley w/TnT Tags | 75.70 | 94.65 |
Clemma_pos | TnT | n/a | 96.66 |
Clemma_pos | Lorg w/TnT Tags | 76.03 | 96.26 |
Clemma_pos | Berkeley w/TnT Tags | 75.56 | 96.26 |
5.2 Number of Clusters
Token type | Cluster size | F-score | POS Acc. |
---|---|---|---|
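Craw_suffix2_noCC | 500 | 76.48 | 94.06 |
Craw_suffix2_noCC | 800 | | 94.64 |
Craw_suffix2_noCC | 1000 | 76.57 | 94.93 |
Craw_suffix2_noCC | 1500 | 76.60 | 95.12 |
Craw_suffix2_noCC | 2000 | 76.45 | |
Clemma_pos | 500 | 76.67 | 95.73 |
Clemma_pos | 800 | 76.78 | 96.37 |
Clemma_pos | 1000 | | 96.54 |
Clemma_pos | 1500 | 76.81 | 96.72 |
Clemma_pos | 2000 | 76.66 | |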