Introduction
Contributions
Related work
Information extraction
Evaluation of IE
Domain-specific applications of IE
Human trafficking (HT)
Network science
Technical preliminaries
Network metric | Definition |
---|---|
Order | Number of vertices in the network |
Power-law exponent | Exponent of a power law fitted to the degree distribution |
Number of connected components | Number of maximal connected subgraphs |
Clustering coefficient | Average, over all vertices, of the fraction of pairs of a vertex's neighbors that are themselves connected |
Degree correlation | Pearson correlation between the degrees of the endpoints of an edge (assortativity) |
Order (of largest connected component) | Number of vertices in the largest connected component |
Algebraic connectivity | Second-smallest eigenvalue of the graph Laplacian (computed over the largest connected component) |
Vertex connectivity | Minimum number of vertices whose removal disconnects the graph |
Edge connectivity | Minimum number of edges whose removal disconnects the graph |
Diameter | Longest shortest path between any pair of connected vertices |
Average shortest path length | Mean shortest-path length over all connected vertex pairs |
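Most of these metrics reduce to standard graph traversals. Below is a minimal, stdlib-only Python sketch of three of them (order, number of connected components, and average clustering coefficient) on a hypothetical toy graph; it is illustrative only and does not use the actual datasets:

```python
from collections import deque
from itertools import combinations

def connected_components(adj):
    """Return the connected components of an undirected graph (adjacency dict)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

def clustering_coefficient(adj):
    """Average local clustering coefficient; degree-<2 vertices count as 0."""
    coeffs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count edges among u's neighbors (closed triads through u).
        links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Toy graph: one triangle plus a separate two-node component.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y"}, "y": {"x"},
}
order = len(adj)                                # 5
n_components = len(connected_components(adj))  # 2
cc = clustering_coefficient(adj)               # (1 + 1 + 1 + 0 + 0) / 5 = 0.6
```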
Empirical studies
Data
Dataset | Explanation |
---|---|
Set1 | Ground truth obtained from independent, highly-tuned extractors |
Set2 | Result of precision-favoring extractions |
Set3 | Result of recall-favoring extractions |
Set4 | Set1 ∩ Set2 |
Set5 | Set1 ∩ Set3 |
Set6 | Set1 ∪ Set2 |
Set7 | Set1 ∪ Set3 |
Set8 | Set1 + Random Noise |
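The derived datasets and the precision/recall/F-score rows of the tables below can be reproduced mechanically. A sketch, treating each extraction as an (ad, value) pair; the toy data here are hypothetical:

```python
# Hypothetical per-advertisement extractions as (ad_id, value) pairs.
set1 = {(1, "boston"), (2, "miami"), (3, "austin")}   # ground truth (Set1)
set2 = {(1, "boston"), (3, "austin"), (3, "reno")}    # a precision-favoring run (Set2)

set4 = set1 & set2   # Set4 = Set1 ∩ Set2
set6 = set1 | set2   # Set6 = Set1 ∪ Set2

def prf(pred, truth):
    """Precision, recall and F-score of a predicted extraction set."""
    tp = len(pred & truth)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean
    return p, r, f

p, r, f = prf(set2, set1)   # each is 2/3 on this toy data
```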
Ground-truths
Extraction datasets
Metric | Set1 | Set2 | Set3 | Set4 | Set5 | Set6 | Set7 | Set8 |
---|---|---|---|---|---|---|---|---|
No. of unique extractions | 328 | 1,685 | 1,489 | 253 | 253 | 1,760 | 1,563 | 1,865 |
No. of extractions per advertisement | 1.0 | 3.4720 | 17.1507 | 0.7862 | 0.7818 | 3.6858 | 17.3689 | 3.4722 |
No. of ads with no extractions | 0 | 64 | 65 | 2,465 | 2,516 | 0 | 0 | 178 |
No. of ads with at least 1 extraction | 11,530 | 11,466 | 11,465 | 9,065 | 9,014 | 11,530 | 11,530 | 11,352 |
Precision | 1.0 | 0.2264 | 0.0456 | 1.0 | 1.0 | 0.2713 | 0.0576 | 0.2264 |
Recall | 1.0 | 0.7862 | 0.7818 | 0.7862 | 0.7818 | 1.0 | 1.0 | 0.7862 |
F-score | 1.0 | 0.3516 | 0.0861 | 0.8803 | 0.8775 | 0.4268 | 0.1089 | 0.3516 |
Metric | Set1 | Set2 | Set3 | Set4 | Set5 | Set6 | Set7 | Set8 |
---|---|---|---|---|---|---|---|---|
No. of unique extractions | 925 | 631 | 900 | 468 | 594 | 1,076 | 1,218 | 1,169 |
No. of extractions per advertisement | 0.7563 | 0.6271 | 2.7060 | 0.3794 | 0.5140 | 1.0039 | 2.9484 | 0.6269 |
No. of ads with no extractions | 4,708 | 5,937 | 532 | 7,607 | 6,536 | 3,613 | 328 | 6,362 |
No. of ads with at least 1 extraction | 6,822 | 5,593 | 10,998 | 3,923 | 4,994 | 7,917 | 11,202 | 5,168 |
Precision | 1.0 | 0.6050 | 0.1899 | 1.0 | 1.0 | 0.7533 | 0.2565 | 0.6050 |
Recall | 1.0 | 0.5016 | 0.6796 | 0.5016 | 0.6796 | 1.0 | 1.0 | 0.5015 |
F-score | 1.0 | 0.5485 | 0.2969 | 0.6681 | 0.8092 | 0.8593 | 0.4083 | 0.5484 |
Metric | Set1 | Set2 | Set3 | Set4 | Set5 | Set6 | Set7 | Set8 |
---|---|---|---|---|---|---|---|---|
No. of unique extractions | 2,298 | 1,756 | 3,668 | 1,697 | 1,716 | 2,355 | 4,248 | 2,206 |
No. of extractions per advertisement | 1.0850 | 0.9639 | 1.3004 | 0.9490 | 0.9649 | 1.0999 | 1.4206 | 0.9638 |
No. of ads with no extractions | 194 | 651 | 367 | 708 | 531 | 181 | 132 | 1,623 |
No. of ads with at least 1 extraction | 11,336 | 10,879 | 11,163 | 10,822 | 10,999 | 11,349 | 11,398 | 9,907 |
Precision | 1.0 | 0.9845 | 0.7419 | 1.0 | 1.0 | 0.9864 | 0.7638 | 0.9845 |
Recall | 1.0 | 0.8746 | 0.8892 | 0.8746 | 0.6796 | 1.0 | 1.0 | 0.8745 |
F-score | 1.0 | 0.9263 | 0.8089 | 0.9414 | 0.8092 | 0.9931 | 0.8661 | 0.9263 |
Experiments and methods
Results
Research question 1
- First, precision was found to be strongly correlated with several structural metrics, as quantified in Table 6, which records the Pearson correlation coefficients between the 8-point precision vector of the eight datasets in Table 2 and the corresponding values of the single-point structural metrics computed over their respective extraction networks. Some of the correlations seem intuitive, even obvious. For example, the relatively strong negative correlation between order and precision can be explained as follows: since the order corresponds to the set of all entities (for an attribute) extracted over the entire document collection, and since every unique entity has, in practice, some non-zero probability of being noise, networks with a higher order tend to have lower precision. This is especially true when the attribute in question draws its entities from some pre-specified 'universal' set, as is the case for names and cities. In contrast, phones, which are syntactically constrained but admit many more possible unique values, show a weaker (though still quite strong) negative correlation.

Table 6 Pearson correlation coefficients between precision and network metrics

Network metric | City | Name | Phone |
---|---|---|---|
Order | -0.7292 ∗ | -0.8219 ∗ | -0.4537 |
Power-law exponent | -0.6995 | 0.8628 ∗∗ | -0.6169 |
Num. connected components | 0.9741 ∗∗ | 0.9624 ∗∗ | 0.9886 ∗∗ |
Clustering coefficient | 0.7010 | 0.9017 ∗∗ | 0.8362 ∗∗ |
Degree correlation | 0.3245 | 0.2297 | 0.157 |
Order (of largest connected component) | -0.9839 ∗∗ | -0.7776 ∗ | -0.9958 ∗∗ |
Algebraic connectivity | 0.9837 ∗∗ | -0.5925 | 0.7055 |
Vertex connectivity | 0.9838 ∗∗ | 0.5204 | 0.6025 |
Edge connectivity | 0.9834 ∗∗ | 0.5204 | 0.7229 ∗ |
Diameter | -0.5471 | 0.8368 ∗∗ | -0.9950 ∗∗ |
Avg. shortest path length | -0.1373 | 0.4637 | -0.9964 ∗∗ |
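The entries of Table 6 are ordinary Pearson coefficients over eight paired observations. A stdlib-only sketch, assuming the first dataset table and the first network-statistics table in this section correspond to the city attribute (the computed value matches the City entry of Table 6 for the number of connected components):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 8-point precision vector (Set1..Set8) and the number of connected
# components of the corresponding extraction networks, both taken from
# the first (city) tables in this section.
precision = [1.0, 0.2264, 0.0456, 1.0, 1.0, 0.2713, 0.0576, 0.2264]
components = [290, 11, 2, 227, 227, 7, 2, 1]

r = pearson(precision, components)   # ≈ 0.9741, matching Table 6 (City)
```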
- Second, more interestingly, the 'erroneous' edges in less precise extraction networks tend to serve as 'weak ties' that collapse two or more connected components into a single component, reducing the number of connected components. In other words, less precise edges tend to straddle components (a rough definition of what constitutes a weak tie in network science). This suggests a potential line of attack for cleaning up noisy extractions: hierarchical or agglomerative clustering algorithms (Murtagh and Legendre 2014) may be able to detect such weak ties (e.g., by iteratively breaking up connected components into clusters using mechanisms such as betweenness centrality to assign edge weights). The empirical utility of such methods is an important agenda that we will pursue in future work.
- Third, precision is positively correlated with the clustering coefficient of the extraction network, though the correlation is not as strong as that between precision and the number of connected components. This implies that cleaner extraction sets yield fewer, but more tightly knit, groups in the underlying extraction network than noisier extraction sets do. In other words, in aggregate, the incorrect extraction edges tend to contribute to non-transitivity, since the clustering coefficient reflects the number of triadic closures (indicating high transitivity) in the network. On average, therefore, given two links n1−n2 and n1−n3, all else being equal, a third link n2−n3 introduced by a real-world extraction system is more likely to be correct than in the absence of either n1−n2 or n1−n3 (or both). This suggests another potential line of attack for cleaning up noisy extractions (or for selecting a system expected to have high precision without access to ground truth) by systematically making use of global information.
- Fourth, precision is also positively correlated with the vertex and edge connectivity of the largest connected component. Adding incorrect extraction edges lowers the connectivity of the largest connected component relative to the original, more tightly connected component. This offers a finer-grained 'check' on a system's precision than the previous observation, which only compares the sizes of the largest connected components and therefore yields only a coarse-grained classification of whether one extraction set is more precise than another.
- Interestingly, in contrast with precision, recall is not strongly correlated with any of the metrics described above, whether positively or negatively. Table 7 shows the correlations between recall and the metrics discussed above in the context of precision.

Table 7 Pearson correlation coefficients between recall and network metrics

Network metric | City | Name | Phone |
---|---|---|---|
Order | 0.4816 | 0.5794 | 0.3993 |
Power-law exponent | -0.0628 | -0.1790 | 0.3877 |
Num. connected components | 0.0228 | -0.2304 | -0.1901 |
Clustering coefficient | 0.1563 | -0.1438 | -0.4472 |
Degree correlation | -0.0443 | -0.4175 | -0.6218 |
Order (of largest connected component) | 0.0752 | 0.4372 | 0.2662 |
Algebraic connectivity | -0.0650 | 0.0182 | 0.1217 |
Vertex connectivity | -0.0648 | -0.4629 | 0.3248 |
Edge connectivity | -0.0660 | -0.4629 | -0.0360 |
Diameter | -0.1662 | -0.5824 | 0.2573 |
Avg. shortest path length | -0.0277 | -0.7176 ∗ | 0.2519 |
- Finally, because F-score is the harmonic mean of precision and recall, it was unsurprisingly found to be positively correlated with the number of connected components, and also with the clustering coefficient, of the network. The correlations were not as strong as those of precision (Table 8).

Table 8 Pearson correlation coefficients between F-score and network metrics

Network metric | City | Name | Phone |
---|---|---|---|
Order | -0.6284 | -0.5268 | -0.0769 |
Power-law exponent | -0.5918 | 0.7635 ∗ | -0.1704 |
Num. connected components | 0.9484 ∗∗ | 0.8211 ∗ | 0.6347 |
Clustering coefficient | 0.6395 | 0.7352 ∗ | 0.3150 |
Degree correlation | 0.2693 | 0.0863 | -0.3799 |
Order (of largest connected component) | -0.9401 ∗∗ | -0.3578 | -0.5885 |
Algebraic connectivity | 0.9413 ∗∗ | -0.7014 | 0.6583 |
Vertex connectivity | 0.9415 ∗∗ | 0.3980 | 0.7384 ∗ |
Edge connectivity | 0.9407 ∗∗ | 0.3980 | 0.5411 |
Diameter | -0.4459 | 0.3831 | -0.5926 |
Avg. shortest path length | -0.3948 | -0.1305 | -0.5984 |
Random vs. non-random noise
Metric | Set1 | Set2 | Set3 | Set4 | Set5 | Set6 | Set7 | Set8 |
---|---|---|---|---|---|---|---|---|
Order | 11492 | 11460 | 11465 | 9039 | 8988 | 11525 | 11530 | 11352 |
Power-law exponent | 0.6431 | 2.2155 | 1.3639 | 0.5924 | 0.5889 | 2.3504 | 1.3709 | 3.1598 |
Num. connected components | 290 | 11 | 2 | 227 | 227 | 7 | 2 | 1 |
Clustering coefficient | 0.9944 | 0.8387 | 0.8377 | 0.9947 | 0.9947 | 0.8367 | 0.8367 | 0.5493 |
Degree correlation | 1.0 | 0.5641 | 1.0 | 1.0 | 0.5542 | 0.5542 | 0.6397 | 0.6275 |
Order (of largest connected component) | 533 | 11430 | 11463 | 533 | 513 | 11510 | 11528 | 11352 |
Algebraic connectivity | 532.9999 | 0.8757 | 0.4989 | 533.0 | 513.0 | 0.7767 | FAIL | 7.8806 |
Vertex connectivity | 532 | 1 | 3 | 532 | 512 | 1 | FAIL | 8 |
Edge connectivity | 532 | 1 | 3 | 532 | 512 | 1 | FAIL | 8 |
Diameter | 1 | 7 | 5 | 1 | 1 | 7 | FAIL | 4 |
Avg. shortest path length | 1 | 3.4945 | 1.9510 | 1 | 1 | 2.6403 | FAIL | 2.3603 |
Metric | Set1 | Set2 | Set3 | Set4 | Set5 | Set6 | Set7 | Set8 |
---|---|---|---|---|---|---|---|---|
Order | 6603 | 5429 | 10987 | 3776 | 4833 | 7716 | 11185 | 5095 |
Power-law exponent | 2.7054 | 1.5843 | 0.8963 | 2.4881 | 2.5301 | 2.1872 | 0.8970 | 2.8074 |
Num. connected components | 195 | 153 | 3 | 193 | 201 | 147 | 4 | 138 |
Clustering coefficient | 0.9273 | 0.9162 | 0.8439 | 0.9440 | 0.9355 | 0.8986 | 0.8410 | 0.8502 |
Degree correlation | 0.7681 | 0.7612 | 0.9803 | 0.7651 | 0.8203 | 0.6552 | 0.3280 | 0.6340 |
Order (of largest connected component) | 4859 | 4376 | 8406 | 1829 | 2892 | 6842 | 11005 | 4622 |
Algebraic connectivity | 0.0595 | 0.0842 | 0.9967 | 0.0123 | 0.0294 | 0.1280 | 0.9989 | 0.0738 |
Vertex connectivity | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Edge connectivity | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Diameter | 11 | 11 | 5 | 17 | 15 | 9 | 4 | 13 |
Avg. shortest path length | 0 | 4.0801 | 1.7173 | 5.7881 | 5.3785 | 3.8888 | 5.1463 | 4.7180 |
Metric | Set1 | Set2 | Set3 | Set4 | Set5 | Set6 | Set7 | Set8 |
---|---|---|---|---|---|---|---|---|
Order | 10286 | 9863 | 10252 | 9817 | 9993 | 10313 | 10488 | 8989 |
Power-law exponent | 0.9887 | 1.5696 | 1.5910 | 0.9975 | 1.0085 | 1.5389 | 1.5800 | 1.3138 |
Num. connected components | 656 | 610 | 350 | 619 | 633 | 650 | 368 | 589 |
Clustering coefficient | 0.9476 | 0.9547 | 0.9314 | 0.9563 | 0.9551 | 0.9463 | 0.9273 | 0.9370 |
Degree correlation | 0.9816 | 0.9779 | 0.9803 | 0.9869 | 0.9865 | 0.9730 | 0.9788 | 0.9780 |
Order (of largest connected component) | 189 | 189 | 8406 | 189 | 189 | 189 | 8636 | 199 |
Algebraic connectivity | 2.9177 | 2.9730 | 0.0046 | 1.9555 | 1.9570 | 3.9630 | 0.0047 | 0.2295 |
Vertex connectivity | 4 | 3 | 1 | 2 | 2 | 4 | 1 | 1 |
Edge connectivity | 3 | 3 | 1 | 3 | 3 | 4 | 1 | 1 |
Diameter | 3 | 3 | 19 | 3 | 3 | 3 | 19 | 5 |
Avg. shortest path length | 1.6318 | 1.5709 | 6.9405 | 1.6518 | 1.6417 | 1.5537 | 6.8760 | 1.9739 |