1 Introduction
2 Inferring variable labels
2.1 Our approach
2.2 Models
Dimensions | |
---|---|
L | A unique set of VLs in training data |
D | The number of ODs |
V | The number of VLs |
W | The number of terms |
Matrices | |
M | A Term-OD matrix (\(W \times D\)) |
R | A VL-OD matrix (\(V \times D\)) |
C | A VL co-occurrence matrix (\(V \times V\)) |
E | A Term-VL matrix (\(W \times V\)) |
Elements | |
\(v_{ij}\) | The number of ith terms occurring in the jth OD |
\(r_{ij}\) | The number of ith VLs occurring in the jth OD |
\(e_{ij}\) | The number of ith terms linked with the jth VL |
\(c_{ij}\) | The number of DJs containing a pair of VLs \(vl_i\) and \(vl_j\) |
Performance measures | |
\({\textit{Pr}}\) | Precision: the fraction of relevant VLs among the retrieved VLs |
\({\textit{Re}}\) | Recall: the fraction of relevant VLs that have been retrieved over the total amount of relevant VLs |
F | F-measure: the harmonic mean of precision (Pr) and recall (Re) |
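As an illustration of the performance measures defined above, the following minimal sketch computes Pr, Re and F for a single test item from a set of inferred VLs and a set of correct VLs. The function name `precision_recall_f`, the example labels, and the use of exact set matching between labels are assumptions made here for illustration, not details taken from the evaluation procedure.

```python
def precision_recall_f(retrieved, relevant):
    """Return (Pr, Re, F) for retrieved VLs scored against the relevant VLs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant VLs that were actually retrieved
    pr = hits / len(retrieved) if retrieved else 0.0
    re = hits / len(relevant) if relevant else 0.0
    # Harmonic mean of Pr and Re; defined as 0 when both are 0.
    f = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f


# Hypothetical example: four inferred VLs, two of which are correct.
inferred = ["Number of births", "Number of deaths", "Fertility", "Population"]
correct = ["Number of births", "Number of deaths", "In-migrants"]
pr, re, f = precision_recall_f(inferred, correct)
```

Here two of the four inferred VLs are relevant and two of the three relevant VLs are found, so Pr = 1/2, Re = 2/3 and F = 4/7.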
2.2.1 Similarity of DJs based on ODs (Model 1)
2.2.2 Co-occurrence of VLs (Model 2)
2.3 Inference process for obtaining VLs
2.3.1 Term-VL matrix E (Model 1)
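One reading consistent with the dimensions in the notation table is that E is the product of the Term-OD matrix M (\(W \times D\)) and the transpose of the VL-OD matrix R (\(V \times D\)), so that \(e_{ij}\) accumulates the count of term i over the ODs in which VL j occurs. The sketch below builds E in that way from toy data and ranks VLs for a new OD by cosine similarity. Both the product form and the cosine measure are assumptions, and all helper names and values are hypothetical.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data (hypothetical): W = 3 terms, D = 2 ODs, V = 2 VLs.
M = [[2, 0],   # term 0 occurs twice in OD 0
     [1, 1],   # term 1 occurs once in each OD
     [0, 3]]   # term 2 occurs three times in OD 1
R = [[1, 0],   # VL 0 occurs in OD 0 only
     [1, 1]]   # VL 1 occurs in both ODs

# Assumed construction: E = M R^T, a W x V Term-VL matrix.
E = matmul(M, transpose(R))

# Term vector of a new OD that contains only term 0.
query = [1, 0, 0]
vl_columns = transpose(E)  # one term vector per VL
scores = [cosine(query, col) for col in vl_columns]
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

With these toy values VL 0 ranks above VL 1, because term 0 accounts for a larger share of the terms linked to VL 0.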
2.3.2 Term-VL matrix \({\textit{EC}}\) (Models 1 and 2)
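The name \({\textit{EC}}\) and the stated dimensions suggest a product of the Term-VL matrix E (\(W \times V\)) with the VL co-occurrence matrix C (\(V \times V\)), which would spread a term's weight onto VLs that co-occur with the VLs it is linked to. The following sketch shows that product on toy values. The product form, the derivation of C from a binary VL-OD matrix, and all numbers are assumptions for illustration only.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy binary VL-OD matrix (hypothetical): V = 2 VLs, D = 2 ODs (one per DJ).
R = [[1, 0],   # VL 0 occurs in DJ 0 only
     [1, 1]]   # VL 1 occurs in both DJs

# With a binary R, (R R^T)[i][j] counts the DJs containing both VL i and VL j.
C = matmul(R, transpose(R))

# Toy Term-VL matrix (hypothetical): W = 3 terms, V = 2 VLs.
E = [[2, 2],
     [1, 2],
     [0, 3]]

# Assumed construction: EC = E C, again a W x V Term-VL matrix.
EC = matmul(E, C)
```

In this toy case term 2, which is linked only to VL 1 in E, receives a nonzero weight for VL 0 in EC because the two VLs co-occur in one DJ.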
2.4 Example
- Example 1: “Data on amount of beer consumed by foreign tourists visiting Japan at a restaurant”
- Example 2: “These data represent the transition of population each year in Japan”
Inferred VL | Similarity |
---|---|
Languages that can be offered | 0.381758 |
Languages understood by foreigners | 0.381758 |
National origin | 0.317921 |
Attractions in Tokyo | 0.277441 |
Number of visits by foreigners | 0.277441 |
Number of visitors | 0.277441 |
Experience with or without activity | 0.272505 |
Attribute of visitors (age) | 0.272505 |
Consumed amount | 0.272505 |
Purchase | 0.272505 |
Inferred VL | Similarity |
---|---|
Languages that can be offered | 0.381758 |
Languages understood by foreigners | 0.381758 |
Satisfaction level of visit | 0.277441 |
Attractions in Tokyo | 0.277441 |
Number of visits by foreigners | 0.277441 |
Number of visitors | 0.272505 |
Experience with or without activity | 0.272505 |
Attribute of visitors (age) | 0.272505 |
Consumed amount | 0.272505 |
Purchase | 0.272505 |
Inferred VL | Similarity |
---|---|
Total population of farmers | 0.313810 |
Total agricultural workforce | 0.313810 |
Number of births | 0.313131 |
Number of deaths | 0.313131 |
Agricultural workforce (male) | 0.312155 |
Agricultural workforce (female) | 0.312155 |
Number of full-time farmers | 0.312155 |
Number of part-time farmers | 0.312155 |
Every 5 years | 0.311423 |
Number of increases and decreases | 0.311423 |
Inferred VL | Similarity |
---|---|
Number of births | 0.349185 |
Number of deaths | 0.349185 |
In-migrants | 0.334844 |
Fatalities | 0.334844 |
Out-migrants | 0.334844 |
Population | 0.321476 |
Number of households | 0.317914 |
Population (male) | 0.317914 |
Population (female) | 0.317914 |
Fertility | 0.317914 |
3 Experimental details
3.1 Purpose
3.2 Training and test data
Number of Data Jackets | 799 |
Average number of terms in each OD | 39.5 |
Average number of VLs in each Data Jacket | 5.34 |
Unique terms in ODs | 1935 |
Total number of VLs | 4160 |
Unique variable labels | 3216 |
Statistic | Public data | Business data |
---|---|---|
Number of Data Jackets | 50 | 50 |
Average number of terms | \(36.7\pm 8.80\) | \(50.7 \pm 43.2\) |
Average number of VLs | \(4.70 \pm 1.71\) | \(6.60 \pm 4.28\) |
Total number of VLs | 398 | 2605 |
Unique VLs | 131 | 1862 |
3.3 Method and evaluation
4 Result and discussion
4.1 Results using public data
Method | F-measure | Precision | Recall |
---|---|---|---|
TSM | \(0.110\pm 0.091\) | \(0.082\pm 0.068\) | \(0.185\pm 0.173\) |
Matrix E | \(0.235\pm 0.178\) | \(0.174\pm 0.131\) | \(0.401\pm 0.331\) |
Matrix \({\textit{EC}}\) | \(0.196\pm 0.183\) | \(0.146\pm 0.133\) | \(0.332\pm 0.337\) |
Term-VL matrix | Mean \({\textit{AS}}\) | Mean \({\overline{{\textit{AS}}}}\) |
---|---|---|
Matrix E | \(0.329\pm 0.113\) | \(0.069\pm 0.014\) |
Matrix \({\textit{EC}}\) | \(0.399\pm 0.095\) | \(0.111\pm 0.016\) |
p-value | ** | ** |
4.2 Results using business data
Method | F-measure | Precision | Recall |
---|---|---|---|
TSM | \(0.039\pm 0.079\) | \(0.024\pm 0.047\) | \(0.153\pm 0.331\) |
Matrix E | \(0.124\pm 0.119\) | \(0.078\pm 0.078\) | \(0.403\pm 0.411\) |
Matrix \({\textit{EC}}\) | \(0.097\pm 0.102\) | \(0.060\pm 0.063\) | \(0.324\pm 0.386\) |
Term-VL matrix | Mean \({\textit{AS}}\) | Mean \({\overline{{\textit{AS}}}}\) |
---|---|---|
Matrix E | \(0.190\pm 0.100\) | \(0.025\pm 0.008\) |
Matrix \({\textit{EC}}\) | \(0.230\pm 0.111\) | \(0.037\pm 0.013\) |
p-value | ** | ** |