1 Introduction
2 Related Work
3 The SPhrase Model
3.1 SPhrase Context Sampling
4 Methods and Datasets
4.1 Dataset
4.2 Parameter Settings
5 Evaluation
6 Experimental Design
6.1 Intrinsic Evaluation
- CoNLL-2003 English dataset [25]. We extracted the multi-word named entities from this dataset and used them as phrases, 12,999 in total. The maximum phrase length in this dataset is 7, so we restricted the following two datasets to this length as well.
- Wikipedia. From the first 1,000,000 tokens of our Wikipedia training corpus we obtained 16,470 phrases. Because this dataset comes from our training data, we expect good results in this case.
- Bristol [15]. From this dataset we used only the entity list, which yielded 87,209 phrases.
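The extraction of multi-word named entities described for the first dataset can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the tokens carry BIO-style tags (`B-LOC`, `I-LOC`, `O`), and the function name `extract_phrases` and the `max_len` parameter are hypothetical names introduced here.

```python
from typing import List, Sequence, Tuple


def extract_phrases(
    tokens: Sequence[str],
    tags: Sequence[str],
    max_len: int = 7,  # assumed cap, mirroring the maximum phrase length of 7
) -> List[Tuple[str, ...]]:
    """Collect multi-word entity spans from BIO-tagged tokens (assumed format)."""
    phrases: List[Tuple[str, ...]] = []
    current: List[str] = []
    current_type = None

    def flush() -> None:
        # Keep only multi-word entities that respect the length cap.
        if 2 <= len(current) <= max_len:
            phrases.append(tuple(current))

    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and current and ent_type == current_type:
            # Continuation of the entity that is currently open.
            current.append(token)
            continue
        # Any other tag closes the open entity (if there is one).
        flush()
        if prefix in ("B", "I"):
            # Start a new entity; a stray I- tag is treated as a start.
            current, current_type = [token], ent_type
        else:
            current, current_type = [], None
    flush()
    return phrases


tokens = ["New", "York", "City", "is", "in", "the", "United", "States", "says", "John"]
tags = ["B-LOC", "I-LOC", "I-LOC", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-PER"]
print(extract_phrases(tokens, tags))
```

Single-token entities such as "John" are dropped because only multi-word entities are used as phrases.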
A typical analogy line is: known unknown informed uninformed. The analogy task is to predict the final word from the first three using simple vector addition and subtraction of their vector representations. Informally, the task attempts to show how well words follow this vector relationship.

Accuracy, displayed to 3 decimal places.

Category | SPhrase (window 3) | Word2vec (window 3) | SPhrase (window 5) | Word2vec (window 5) | SPhrase (window 10) | Word2vec (window 10) | Count |
---|---|---|---|---|---|---|---|
capital-world | 0.727 | 0.628 | 0.746 | 0.658 | 0.815 | 0.782 | 4524 |
capital-common-countries | 0.872 | 0.848 | 0.941 | 0.856 | 0.976 | 0.941 | 506 |
city-in-state | 0.660 | 0.480 | 0.715 | 0.583 | 0.647 | 0.677 | 2467 |
gram3-comparative | 0.848 | 0.806 | 0.758 | 0.813 | 0.643 | 0.670 | 1332 |
gram2-opposite | 0.223 | 0.220 | 0.220 | 0.222 | 0.206 | 0.204 | 812 |
gram8-plural | 0.755 | 0.736 | 0.715 | 0.744 | 0.641 | 0.727 | 1332 |
gram4-superlative | 0.379 | 0.396 | 0.345 | 0.366 | 0.279 | 0.262 | 1122 |
gram9-plural-verbs | 0.639 | 0.559 | 0.536 | 0.546 | 0.453 | 0.521 | 870 |
gram6-nationality-adjective | 0.846 | 0.784 | 0.838 | 0.815 | 0.854 | 0.853 | 1599 |
family | 0.603 | 0.595 | 0.595 | 0.638 | 0.581 | 0.543 | 506 |
gram7-past-tense | 0.472 | 0.515 | 0.474 | 0.492 | 0.441 | 0.470 | 1560 |
currency | 0.047 | 0.042 | 0.021 | 0.021 | 0.018 | 0.016 | 866 |
gram1-adjective-to-adverb | 0.104 | 0.087 | 0.119 | 0.121 | 0.132 | 0.148 | 992 |
gram5-present-participle | 0.517 | 0.520 | 0.509 | 0.486 | 0.479 | 0.455 | 1056 |
all | 0.601 | 0.545 | 0.597 | 0.565 | 0.581 | 0.587 | 19544 |
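The analogy evaluation behind these accuracies can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script used in the paper: the toy three-dimensional embeddings are invented for the example, and both the use of cosine similarity to pick the nearest word and the exclusion of the three question words from the candidates are assumptions (they follow common practice for this task).

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_fourth(emb: Dict[str, Vector], a: str, b: str, c: str) -> str:
    """Predict d in 'a b c d' as the word closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    # Assumption: the three question words are not allowed as answers.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb: Dict[str, Vector], questions: List[Tuple[str, str, str, str]]) -> float:
    """Fraction of analogy lines whose fourth word is predicted exactly."""
    correct = sum(predict_fourth(emb, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy embeddings (hypothetical values): the last dimension marks "is a capital".
TOY = {
    "Athens": (1.0, 0.0, 1.0),
    "Greece": (1.0, 0.0, 0.0),
    "Baghdad": (0.0, 1.0, 1.0),
    "Iraq": (0.0, 1.0, 0.0),
    "Canberra": (-1.0, 0.0, 1.0),
    "Australia": (-1.0, 0.0, 0.0),
}
QUESTIONS = [
    ("Athens", "Greece", "Baghdad", "Iraq"),
    ("Athens", "Greece", "Canberra", "Australia"),
]
print(predict_fourth(TOY, "Athens", "Greece", "Baghdad"))
print(analogy_accuracy(TOY, QUESTIONS))
```

With real embeddings the candidate set is the whole vocabulary, so a question counts as correct only when the expected fourth word is the single nearest neighbour of the combined vector.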
Accuracy, displayed to 3 decimal places.

Category | SPhrase (window 3) | Word2vec (window 3) | SPhrase (window 5) | Word2vec (window 5) | SPhrase (window 10) | Word2vec (window 10) | Count |
---|---|---|---|---|---|---|---|
capital-world | 0.671 | 0.628 | 0.725 | 0.658 | 0.744 | 0.782 | 4524 |
capital-common-countries | 0.881 | 0.848 | 0.935 | 0.856 | 0.929 | 0.941 | 506 |
city-in-state | 0.653 | 0.480 | 0.645 | 0.583 | 0.652 | 0.677 | 2467 |
gram3-comparative | 0.706 | 0.806 | 0.696 | 0.813 | 0.519 | 0.670 | 1332 |
gram2-opposite | 0.217 | 0.220 | 0.197 | 0.222 | 0.172 | 0.204 | 812 |
gram8-plural | 0.726 | 0.736 | 0.712 | 0.744 | 0.661 | 0.727 | 1332 |
gram4-superlative | 0.273 | 0.396 | 0.298 | 0.366 | 0.269 | 0.262 | 1122 |
gram9-plural-verbs | 0.577 | 0.559 | 0.548 | 0.546 | 0.477 | 0.521 | 870 |
gram6-nationality-adjective | 0.855 | 0.784 | 0.821 | 0.815 | 0.827 | 0.853 | 1599 |
family | 0.569 | 0.595 | 0.553 | 0.638 | 0.502 | 0.543 | 506 |
gram7-past-tense | 0.453 | 0.515 | 0.483 | 0.492 | 0.414 | 0.470 | 1560 |
currency | 0.039 | 0.042 | 0.024 | 0.021 | 0.028 | 0.016 | 866 |
gram1-adjective-to-adverb | 0.130 | 0.087 | 0.173 | 0.121 | 0.168 | 0.148 | 992 |
gram5-present-participle | 0.511 | 0.520 | 0.509 | 0.486 | 0.492 | 0.455 | 1056 |
all | 0.565 | 0.545 | 0.576 | 0.565 | 0.553 | 0.587 | 19544 |
In capital-common-countries, a typical line is: Athens Greece Baghdad Iraq. Typical lines for capital-world and city-in-state are Athens Greece Canberra Australia and Chicago Illinois Houston Texas, respectively.

Accuracy, displayed to 3 decimal places.

Category | SPhrase (window 3) | Word2vec (window 3) | SPhrase (window 5) | Word2vec (window 5) | SPhrase (window 10) | Word2vec (window 10) | Count |
---|---|---|---|---|---|---|---|
capital-world | 0.637 | 0.628 | 0.718 | 0.658 | 0.766 | 0.782 | 4524 |
capital-common-countries | 0.858 | 0.848 | 0.903 | 0.856 | 0.953 | 0.941 | 506 |
city-in-state | 0.664 | 0.480 | 0.623 | 0.583 | 0.663 | 0.677 | 2467 |
gram3-comparative | 0.845 | 0.806 | 0.803 | 0.813 | 0.682 | 0.670 | 1332 |
gram2-opposite | 0.224 | 0.220 | 0.245 | 0.222 | 0.196 | 0.204 | 812 |
gram8-plural | 0.772 | 0.736 | 0.731 | 0.744 | 0.655 | 0.727 | 1332 |
gram4-superlative | 0.373 | 0.396 | 0.392 | 0.366 | 0.257 | 0.262 | 1122 |
gram9-plural-verbs | 0.575 | 0.559 | 0.586 | 0.546 | 0.474 | 0.521 | 870 |
gram6-nationality-adjective | 0.818 | 0.784 | 0.824 | 0.815 | 0.831 | 0.853 | 1599 |
family | 0.615 | 0.595 | 0.581 | 0.638 | 0.595 | 0.543 | 506 |
gram7-past-tense | 0.479 | 0.515 | 0.520 | 0.492 | 0.460 | 0.470 | 1560 |
currency | 0.040 | 0.042 | 0.024 | 0.021 | 0.023 | 0.016 | 866 |
gram1-adjective-to-adverb | 0.090 | 0.087 | 0.127 | 0.121 | 0.172 | 0.148 | 992 |
gram5-present-participle | 0.526 | 0.520 | 0.455 | 0.486 | 0.479 | 0.455 | 1056 |
all | 0.576 | 0.545 | 0.588 | 0.565 | 0.576 | 0.587 | 19544 |
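The "all" row appears to be the count-weighted mean of the per-category accuracies rather than their simple average. The check below is a sketch of that arithmetic, using the window-size-3 SPhrase column and the Count column of the table directly above; the variable and function names are hypothetical, and the inputs are the rounded values as printed.

```python
# (accuracy, count) pairs copied from the window-size-3 SPhrase column above.
ROWS = {
    "capital-world": (0.637, 4524),
    "capital-common-countries": (0.858, 506),
    "city-in-state": (0.664, 2467),
    "gram3-comparative": (0.845, 1332),
    "gram2-opposite": (0.224, 812),
    "gram8-plural": (0.772, 1332),
    "gram4-superlative": (0.373, 1122),
    "gram9-plural-verbs": (0.575, 870),
    "gram6-nationality-adjective": (0.818, 1599),
    "family": (0.615, 506),
    "gram7-past-tense": (0.479, 1560),
    "currency": (0.040, 866),
    "gram1-adjective-to-adverb": (0.090, 992),
    "gram5-present-participle": (0.526, 1056),
}


def weighted_accuracy(rows):
    """Mean accuracy with each category weighted by its question count."""
    total = sum(count for _, count in rows.values())
    return sum(acc * count for acc, count in rows.values()) / total


total_count = sum(count for _, count in ROWS.values())
overall = weighted_accuracy(ROWS)
unweighted = sum(acc for acc, _ in ROWS.values()) / len(ROWS)

print(total_count)
print(round(overall, 3))
print(round(unweighted, 3))
```

The category counts sum to 19544 and the weighted mean rounds to 0.576, both matching the "all" row, whereas the unweighted mean is about 0.537. Large categories such as capital-world therefore dominate the overall figure.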
Model | Conll2003Eng | Wikigold |
---|---|---|
Word2Vec | \(83.82 \pm 0.3831\) | \(55.49 \pm 0.4708\) |
SPhrase | \(\mathbf{88.93 \pm 0.1115}\) | \(\mathbf{66.01 \pm 0.4172}\) |