Introduction
-
Weak holistic semantic inference: Distribution-based approaches usually rely on distributional statistics and pairwise similarity information, and graph-based approaches often depend on clustering algorithms; both may lose the holistic semantics shared among entities.
-
Unsatisfactory robustness: Manual/semimanual and pattern-based approaches achieve relatively high precision but low recall, whereas distribution-based approaches achieve relatively high recall but low precision.
-
Error propagation: Graph-based and two-step approaches often suffer from error propagation, because the external resources used by graph-based approaches and the output of the first stage of two-step approaches are not entirely correct; these errors then propagate into subsequent processing.
-
Lack of labeled training datasets: Distribution-based and two-step approaches usually require labeled training datasets to train a detection model. However, the labeled Chinese entity synonym set training datasets are not always available and are expensive to develop.
-
We propose a bilateral-context-based Siamese network classifier to detect Chinese synonyms.
-
We propose a filtering-strategy-based set expansion algorithm to expand Chinese entity synonym sets.
-
Two Chinese real-world entity synonym set expansion datasets are constructed. The datasets and the source code of our approach are available at https://github.com/huangsubin/CNSynSetE.
Related works
Pattern-based approaches
Distribution-based approaches
Graph-based approaches
Two-step approaches
Discussion
Materials and methods
Definitions and problem statement
-
Synonym. Synonyms are strings or words that have the same or almost the same meaning in a language [15]. Synonyms are ubiquitous in all human natural languages. For example, “USA” and “United States” refer to the same country; “Abuse” and “Maltreatment” mean cruel or inhumane treatment.
-
Entity synonym set. An entity synonym set denotes a group of strings or words that represents an identical or similar entity in a language. For example, \(\{\)“The United Kingdom”, “Britain”, “U.K.”\(\}\) is an entity synonym set, because the strings in the set denote the same country: “United Kingdom of Great Britain and Northern Ireland”.
-
Problem statement. Given a Chinese text corpus C and a vocabulary V generated from C, the objective of this study is to expand Chinese entity synonym sets from V based on clues (e.g., bilateral context and filtering features) mined from C and V. Entity synonymy is transitive (provided that entities are not polysemous words) and symmetric. Transitive: \((a \overset{\text {syn}}{\rightarrow } b \wedge b \overset{\text {syn}}{\rightarrow } c) \Rightarrow (a \overset{\text {syn}}{\rightarrow } c)\). Symmetric: \((a \overset{\text {syn}}{\rightarrow } b) \Rightarrow (b \overset{\text {syn}}{\rightarrow } a)\). Here, a, b, and c are strings or words in V, and \(a \overset{\text {syn}}{\rightarrow } b\) denotes that a and b are synonymous. Therefore, in an entity synonym set, all entities are synonymous with each other.
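To make the symmetry and transitive closure concrete, here is a minimal Python sketch (our illustration, not part of the proposed approach) that merges pairwise synonym judgments into entity synonym sets with a union-find structure; the example pairs are hypothetical:

```python
from collections import defaultdict

class UnionFind:
    """Union-find over strings; merging all synonym pairs yields the
    symmetric, transitive closure, i.e., the entity synonym sets."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Hypothetical pairwise judgments a -syn-> b; union(a, b) == union(b, a),
# so symmetry holds by construction, and chaining unions gives transitivity.
pairs = [("USA", "United States"), ("United States", "U.S."), ("Britain", "U.K.")]
uf = UnionFind()
for a, b in pairs:
    uf.union(a, b)

synonym_sets = defaultdict(set)
for term in list(uf.parent):
    synonym_sets[uf.find(term)].add(term)
print(list(synonym_sets.values()))
# e.g., [{'USA', 'United States', 'U.S.'}, {'Britain', 'U.K.'}]
```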
Overview of framework
-
Distant supervision knowledge acquisition: This component entails obtaining the Chinese entity vocabulary from the Chinese knowledge base and acquiring entity synonym set datasets from Chinese web corpora using the Chinese encyclopedia as training supervision signals.
-
Bilateral-context-based Siamese network classifier: A classifier is built to determine whether a new input Chinese entity should be inserted into an existing Chinese entity synonym set. The classifier contains a Siamese network with entity bilateral context and is thus able to learn richer synonymy features.
-
Entity synonym set expansion algorithm: A filtering-strategy-based set expansion algorithm is designed to expand Chinese entity synonym sets. The algorithm is combined with the bilateral-context-based Siamese network classifier and entity expansion filtering strategy to improve the performance of the Chinese entity synonym set expansion task.
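Read together, the three components form a classify-then-expand loop. The following is a minimal, runnable Python sketch of that loop under stated assumptions: the Siamese classifier is replaced by a hash-based stub, the filtering strategy is omitted, and the threshold name `kappa` mirrors the classifier threshold \(\kappa\) from the parameter settings. It is an illustration, not the authors' algorithm:

```python
def set_instance_score(entity_set, candidate):
    """Stub standing in for the bilateral-context-based Siamese network
    classifier; a real implementation would return the probability that
    `candidate` belongs to `entity_set`."""
    return (hash((frozenset(entity_set), candidate)) % 1000) / 1000.0

def expand_sets(vocabulary, kappa=0.7):
    """Greedy expansion sketch: each entity joins the best-scoring existing
    set if the score clears threshold kappa, else starts a new singleton."""
    sets = []
    for entity in vocabulary:
        best, best_score = None, 0.0
        for s in sets:
            score = set_instance_score(s, entity)
            if score > best_score:
                best, best_score = s, score
        if best is not None and best_score >= kappa:
            best.add(entity)
        else:
            sets.append({entity})
    return sets

# Groupings here are arbitrary because the classifier is a stub.
print(expand_sets(["公鸡", "雄鸡", "午饭", "午餐", "中饭"]))
```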
Distant supervision knowledge acquisition
Bilateral-context-based Siamese network classifier
Bilateral-context-level attention
-
First, context-level attention transforms \(S=\{t_{1},\ldots ,t_{n}\}\) and \(C=\{c_{1},\ldots ,c_{l}\}\) into embedding sets \(e_{s}=\{e_{t1},\ldots ,e_{tn}\}\) and \(e_{c}=\{e_{c1},\ldots ,e_{cl}\}\), respectively.
-
Second, for each context embedding \(e_{ck} \in e_{c}\), context-level attention calculates the alignment weights \(\alpha =\{\alpha _{c1},\ldots ,\alpha _{ck},\ldots ,\alpha _{cl}\}\), given by

$$\begin{aligned} \alpha _{ck}=\frac{\sum _{j=1}^{n}e_{ck} \cdot e_{tj}}{\sum _{i=1}^{l}\sum _{j=1}^{n}e_{ci} \cdot e_{tj}}. \end{aligned} \quad (5)$$
-
Third, the outputs of context-level attention are context features \(\text {Atten}=\{a_{1},\ldots ,a_{k},\ldots ,a_{l}\}\), where \(a_{k}=e_{ck} \cdot \alpha _{ck}\).
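A small NumPy sketch of these three steps (our illustration; the embedding dimensions are assumptions). Note that Eq. (5) normalizes the raw dot-product sums directly:

```python
import numpy as np

def context_level_attention(e_s, e_c):
    """Context-level attention following Eq. (5).
    e_s: (n, d) embeddings of the set {t_1, ..., t_n}.
    e_c: (l, d) embeddings of the contexts {c_1, ..., c_l}.
    Returns the (l, d) context features a_k = e_ck * alpha_ck."""
    scores = e_c @ e_s.T                 # (l, n): dot products e_ck . e_tj
    numer = scores.sum(axis=1)           # per-context numerator of Eq. (5)
    alpha = numer / numer.sum()          # normalize by the double sum
    return e_c * alpha[:, None]          # Atten = {a_1, ..., a_l}

rng = np.random.default_rng(0)
e_s = rng.normal(size=(3, 8))            # n = 3 set entities, d = 8
e_c = rng.normal(size=(5, 8))            # l = 5 contexts
print(context_level_attention(e_s, e_c).shape)  # (5, 8)
```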
Bilateral-context-based Siamese network
-
First, given an input set \(S=(t_{1}, \ldots ,t_{m})\), where S is either the original set \((t_{1}, \ldots ,t_{n})\) or the augmented set \((t_{1}, \ldots ,t_{n},t)\) so that m equals n or \(n+1\), the embedding feature extractor represents \((t_{1}, \ldots ,t_{m})\) as embeddings \((e_{1}, \ldots ,e_{m})\) via the embedding lookup table.
-
Second, embeddings \((e_{1}, \ldots , e_{m})\) are input into neural network \(\theta _{1}(\cdot )\), which has a two-layer fully connected structure and transforms them into m hidden representations \(H_{1}=(\theta _{1}(e_{1}), \ldots ,\theta _{1}(e_{m}))\).
-
Third, a summation operation aggregates \(H_{1}\) into the hidden representation \(H_{2}=\sum _{i=1}^{m}\theta _{1}(e_{i})\); the summation makes the representation independent of the order of the set elements.
-
Fourth, \(H_{2}\) is input into neural network \(\theta _{2}(\cdot )\), which has a three-layer fully connected structure and transforms it into the hidden representation \(H_{3}=\theta _{2}(H_{2})\).
-
The attended context features are processed by a parallel branch. First, context features \(\text {Atten}=\{a_{1},\ldots ,a_{k},\ldots ,a_{l}\}\) are input into neural network \(\bar{\theta _{1}}(\cdot )\), which has a two-layer fully connected structure and transforms them into l hidden representations \(\bar{H_{1}}=(\bar{\theta _{1}}(a_{1}), \ldots , \bar{\theta _{1}}(a_{l}))\).
-
Second, a summation operation aggregates \(\bar{H_{1}}\) into the hidden representation \(\bar{H_{2}}=\sum _{i=1}^{l}\bar{\theta _{1}}(a_{i})\).
-
Third, hidden representation \(\bar{H_{2}}\) is input into neural network \(\bar{\theta _{2}}(\cdot )\), which has a three-layer fully connected structure and transforms it into the hidden representation \(\bar{H_{3}}=\bar{\theta _{2}}(\bar{H_{2}})\).
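Both branches share the same shape: a per-element network, sum pooling, and a post-pooling network. A compact PyTorch sketch is given below; the layer widths and ReLU activations are our assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SetBranch(nn.Module):
    """One Siamese branch: per-element network theta_1, sum pooling, then
    theta_2; the summation makes the output permutation invariant."""
    def __init__(self, dim_in, dim_hid, dim_out):
        super().__init__()
        self.theta1 = nn.Sequential(          # two-layer fully connected
            nn.Linear(dim_in, dim_hid), nn.ReLU(),
            nn.Linear(dim_hid, dim_hid), nn.ReLU())
        self.theta2 = nn.Sequential(          # three-layer fully connected
            nn.Linear(dim_hid, dim_hid), nn.ReLU(),
            nn.Linear(dim_hid, dim_hid), nn.ReLU(),
            nn.Linear(dim_hid, dim_out))

    def forward(self, x):                     # x: (m, dim_in) set elements
        h1 = self.theta1(x)                   # H1: (m, dim_hid)
        h2 = h1.sum(dim=0)                    # H2: sum over set elements
        return self.theta2(h2)                # H3: (dim_out,)

entity_branch = SetBranch(128, 256, 64)       # theta_1, theta_2
context_branch = SetBranch(128, 256, 64)      # theta_1-bar, theta_2-bar
h3 = entity_branch(torch.randn(5, 128))       # m = 5 entity embeddings
h3_bar = context_branch(torch.randn(7, 128))  # l = 7 attended contexts
print(h3.shape, h3_bar.shape)                 # torch.Size([64]) twice
```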
Permutation-invariance-based loss function
Entity synonym set expansion algorithm
Entity expansion filtering strategy
Set expansion algorithm
| Description | BDSynSetTra | SGSynSetTra |
|---|---|---|
| # of entities for training | 33,404 | 4,748 |
| # of synonym sets for training | 16,742 | 2,305 |
| # of entities for testing | 3,861 | 577 |
| # of synonym sets for testing | 1,182 | 255 |
Experiments
Experimental settings
Datasets
-
BDSynSetTra is created from Baidu Encyclopedia articles. The CN-DBpedia knowledge base and the HanLP tool are used to process and link the entities in the Baidu Encyclopedia articles. In this dataset, 33,404 entities and 16,742 synonym sets are used for training, and 3,861 entities and 1,182 synonym sets are used for testing.
-
SGSynSetTra is created from the SogouCA corpus. The CN-DBpedia knowledge base and the HanLP tool are used to process and link the entities in SogouCA. In this dataset, 4,748 entities and 2,305 synonym sets are used for training, and 577 entities and 255 synonym sets are used for testing.
Benchmark methods for comparison
-
K-means. The K-means clustering algorithm is used to discover Chinese entity synonym sets from the Chinese entity vocabulary built from the datasets. We predefine a suitable cluster number K for each dataset. The inputs of the K-means algorithm are the entity embeddings, and the outputs are the clustered Chinese entity synonym sets (see the clustering sketch after this list).
-
Birch. Birch is a hierarchical clustering algorithm. We predefine a suitable cluster number K for each dataset. The inputs of the Birch algorithm are the entity embeddings, and the outputs are the clustered Chinese entity synonym sets.
-
SVM. SVM is a supervised approach. First, the approach trains a support vector machine (SVM) classifier to predict Chinese synonym set-instance pairs. Next, the trained SVM classifier is used to expand Chinese entity synonym sets from the datasets.
-
BPNN. BPNN is a supervised approach. First, the approach trains a backpropagation neural network (BPNN) classifier to predict Chinese synonym set-instance pairs. Next, the trained BPNN classifier is used to expand Chinese entity synonym sets from the datasets.
-
SynSetMine. SynSetMine [15] is a supervised approach. First, the approach trains a Chinese set-instance classifier with embedding and post transformers to predict Chinese synonym set-instance pairs. Next, the approach uses a set generation algorithm to expand Chinese entity synonym sets from the entity vocabulary built from the datasets.
-
AutoECES. AutoECES [32] is a supervised approach. First, the approach trains a triplet network classifier to predict Chinese synonym set-instance pairs. Next, the trained triplet network classifier is used to expand Chinese entity synonym sets from the datasets.
-
CNSynSetE. CNSynSetE is our proposed approach. A bilateral-context-based Siamese network classifier is first designed to predict Chinese synonym set-instance pairs. Next, the approach uses an expansion algorithm with the entity expansion filtering strategy to expand Chinese entity synonym sets from the datasets.
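For the two clustering baselines mentioned above, the setup maps directly onto scikit-learn; a minimal sketch with random vectors standing in for the real entity embeddings and a hypothetical K:

```python
import numpy as np
from sklearn.cluster import KMeans, Birch

rng = np.random.default_rng(42)
entity_embeddings = rng.normal(size=(1000, 100))  # stand-in embeddings
K = 400  # hypothetical; the paper predefines a suitable K per dataset

kmeans_labels = KMeans(n_clusters=K, n_init=10).fit_predict(entity_embeddings)
birch_labels = Birch(n_clusters=K).fit_predict(entity_embeddings)
# Entities that share a cluster label form one predicted synonym set.
```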
Parameter settings
| Parameter | BDSynSetTra | SGSynSetTra |
|---|---|---|
| Learning rate | 0.00005 | 0.00005 |
| Dropout rate | 0.4 | 0.4 |
| Negative sample size k | 50 | 50 |
| \(\mu \) | 0.3 | 0.2 |
| \(\delta \) | 0.7 | 0.1 |
| \(\kappa \) | 0.7 | 0.1 |
| \(\lambda \) | 0.2 | 0.6 |
|  | BDSynSetTra |  |  | SGSynSetTra |  |  |
|---|---|---|---|---|---|---|
| Approach | FMI (%) | ARI (%) | NMI (%) | FMI (%) | ARI (%) | NMI (%) |
| K-means | 3.44 | 0.73 | 81.50 | 34.48 | 28.73 | 92.44 |
| Birch | 4.96 | 1.44 | 83.97 | 44.56 | 39.93 | 94.36 |
| SVM | 9.10 | 5.80 | 84.25 | 17.79 | 11.49 | 84.11 |
| BPNN | 37.61 | 32.79 | 92.73 | 51.97 | 51.51 | 94.91 |
| SynSetMine | 48.82 | 48.64 | 96.06 | 73.23 | 73.04 | 97.24 |
| AutoECES | 52.96 | 52.73 | 96.11 | 74.07 | 73.87 | 97.27 |
| SynonymNet | 54.45 | 54.19 | 96.34 | 74.52 | 74.14 | 97.36 |
| CNSynSetE | 60.95 | 60.83 | 96.79 | 81.38 | 81.15 | 98.21 |
Metrics
-
FMI. FMI is usually used to compute the similarity between two given clusterings. It is calculated as follows:

$$\begin{aligned} \text {FMI}=\frac{\text {TP}}{\sqrt{(\text {FP}+\text {TP}) \cdot (\text {FN}+\text {TP})}}, \end{aligned} \quad (15)$$

where TP denotes the number of element pairs that belong to identical clusters in both the true labels and the prediction labels, FP denotes the number of element pairs that belong to identical clusters in the prediction labels but not in the true labels, and FN denotes the number of element pairs that belong to identical clusters in the true labels but not in the prediction labels.
-
ARI. ARI is another similarity metric, computed from the Rand index (RI):

$$\begin{aligned} \text {RI}=\frac{\text {TP}+\text {TN}}{N}, \end{aligned} \quad (16)$$

where TN denotes the number of element pairs that are assigned to different clusters in both the true labels and the prediction labels, and N is the total number of element pairs. ARI is then

$$\begin{aligned} \text {ARI}=\frac{\text {RI}-E(\text {RI})}{\max (\text {RI})-E(\text {RI})}, \end{aligned} \quad (17)$$

where \(E(\text {RI})\) is the expected RI under random assignment.
-
NMI. NMI is computed using mutual information (MI) and information entropy (IE). It is calculated as follows:

$$\begin{aligned} \text {NMI}(A,B)=\frac{I(A,B)}{\sqrt{H(A) \cdot H(B)}}, \end{aligned} \quad (18)$$

where H(A) is the IE of A and I(A, B) is the MI between A and B.
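All three metrics are available in scikit-learn; the snippet below evaluates a pair of hypothetical labelings (passing average_method="geometric" matches the geometric-mean normalization of Eq. (18)):

```python
from sklearn.metrics import (adjusted_rand_score, fowlkes_mallows_score,
                             normalized_mutual_info_score)

# Hypothetical cluster assignments: gold vs. predicted set membership.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]

nmi = normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method="geometric")         # Eq. (18)
print(f"FMI: {fowlkes_mallows_score(true_labels, pred_labels):.4f}")   # Eq. (15)
print(f"ARI: {adjusted_rand_score(true_labels, pred_labels):.4f}")     # Eq. (17)
print(f"NMI: {nmi:.4f}")
```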
Experimental results
Chinese entity synonym set expansion performance analysis
|  | BDSynSetTra |  |  | SGSynSetTra |  |  |
|---|---|---|---|---|---|---|
| Approach | P (%) | R (%) | F1 (%) | P (%) | R (%) | F1 (%) |
| SVM | 3.33 | 24.86 | 5.88 | 6.76 | 46.80 | 11.82 |
| BPNN | 22.07 | 64.09 | 32.84 | 58.55 | 46.13 | 51.60 |
| SynSetMine | 45.05 | 52.90 | 48.66 | 68.96 | 77.78 | 73.10 |
| AutoECES | 50.34 | 47.35 | 49.26 | 65.34 | 79.37 | 71.67 |
| SynonymNet | 59.71 | 49.65 | 54.21 | 67.88 | 81.82 | 74.20 |
| CNSynSetE | 57.46 | 64.66 | 60.85 | 87.21 | 75.95 | 81.19 |
Chinese entity synonym set-instance classifier performance analysis
|  | BDSynSetTra |  |  | SGSynSetTra |  |  |
|---|---|---|---|---|---|---|
| Approach | P (%) | R (%) | F1 (%) | P (%) | R (%) | F1 (%) |
| SVM | 9.64 | 87.31 | 17.36 | 79.96 | 75.61 | 77.72 |
| BPNN | 98.10 | 87.46 | 92.48 | 87.50 | 34.15 | 49.12 |
| SynSetMine | 96.99 | 80.46 | 87.95 | 98.16 | 79.92 | 88.11 |
| AutoECES | 95.97 | 80.34 | 87.46 | 97.34 | 80.16 | 87.92 |
| SynonymNet | 98.44 | 84.08 | 90.70 | 97.91 | 87.80 | 92.58 |
| CNSynSetE | 98.70 | 88.60 | 93.38 | 98.12 | 87.99 | 92.78 |
|  | BDSynSetTra |  | SGSynSetTra |  |
|---|---|---|---|---|
| Approach | AUC (%) | MAP (%) | AUC (%) | MAP (%) |
| SVM | 61.44 | 57.60 | 89.42 | 81.37 |
| BPNN | 99.71 | 99.09 | 94.62 | 85.99 |
| SynSetMine | 99.78 | 98.79 | 99.72 | 98.91 |
| AutoECES | 99.38 | 97.82 | 99.06 | 97.40 |
| SynonymNet | 99.85 | 99.29 | 99.89 | 99.35 |
| CNSynSetE | 99.91 | 99.61 | 99.92 | 99.60 |
|  | Training |  | Prediction |  |
|---|---|---|---|---|
| Approach | BDSynSetTra | SGSynSetTra | BDSynSetTra | SGSynSetTra |
| K-means | – | – | 8.86 s | 1.55 s |
| Birch | – | – | 9.74 s | 2.16 s |
| SVM | 1.2 h | 6 min | 6.46 s | 1.38 s |
| BPNN | 8 h | 49 min | 5.12 s | 1.14 s |
| SynSetMine | 9.1 h | 51 min | 5.46 s | 1.21 s |
| AutoECES | 9.3 h | 57 min | 7.28 s | 1.57 s |
| SynonymNet | 10.2 h | 1.1 h | 7.17 s | 1.45 s |
| CNSynSetE | 9.4 h | 53 min | 6.29 s | 1.28 s |
Hyperparameter analysis
-
\(\mu \) analysis. \(\mu \) is the adjustment parameter for bilateral context features. We fix hyperparameters \(\delta \), \(\kappa \), and \(\lambda \) at their optimal values and assign hyperparameter \(\mu \) a value between 0.1 and 0.9. As shown in Fig. 8a, the FMI values of CNSynSetE are stable on both the BDSynSetTra and SGSynSetTra datasets. In particular, CNSynSetE obtains its highest FMI value at \(\mu =0.3\) on BDSynSetTra and at \(\mu =0.2\) on SGSynSetTra.
-
\(\delta \) analysis. \(\delta \) is the adjustment parameter for similarity and domain filtering. We fix hyperparameters \(\mu \), \(\kappa \), and \(\lambda \) at their optimal values and assign hyperparameter \(\delta \) a value between 0.1 and 0.9. It is evident from Fig. 8b that the FMI values of CNSynSetE decrease with an increase in \(\delta \) on the SGSynSetTra dataset, whereas on the BDSynSetTra dataset they first increase and then decrease. CNSynSetE obtains its highest FMI value at \(\delta =0.7\) on BDSynSetTra and at \(\delta =0.1\) on SGSynSetTra.
-
\(\kappa \) analysis. \(\kappa \) is the threshold for the bilateral-context-based Siamese network classifier. We fix hyperparameters \(\mu \), \(\delta \), and \(\lambda \) at their optimal values and assign hyperparameter \(\kappa \) a value between 0.1 and 0.9. It is evident from Fig. 8c that the FMI values of CNSynSetE decrease with an increase in \(\kappa \) on the SGSynSetTra dataset, whereas on the BDSynSetTra dataset they first increase and then decrease. CNSynSetE obtains its highest FMI value at \(\kappa =0.7\) on BDSynSetTra and at \(\kappa =0.1\) on SGSynSetTra.
-
\(\lambda \) analysis. \(\lambda \) is the threshold for the entity expansion filtering strategy. We fix hyperparameters \(\mu \), \(\delta \), and \(\kappa \) at their optimal values and assign hyperparameter \(\lambda \) a value between 0.1 and 0.9. It is evident from Fig. 8d that the FMI values of CNSynSetE first increase and then decrease with an increase in \(\lambda \) on both the BDSynSetTra and SGSynSetTra datasets. CNSynSetE obtains its highest FMI value at \(\lambda =0.2\) on BDSynSetTra and at \(\lambda =0.6\) on SGSynSetTra.
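The sweep protocol above (fix three hyperparameters at their optima, vary the fourth over 0.1 to 0.9) is easy to script. A schematic Python sketch with a toy function standing in for training and evaluating CNSynSetE:

```python
def evaluate_fmi(mu, delta, kappa, lam):
    """Toy stand-in for 'train CNSynSetE and report FMI'; it peaks at the
    BDSynSetTra optima purely so the sweep below produces sensible output."""
    return 1.0 - abs(mu - 0.3) - abs(delta - 0.7) - abs(kappa - 0.7) - abs(lam - 0.2)

optima = {"mu": 0.3, "delta": 0.7, "kappa": 0.7, "lam": 0.2}  # BDSynSetTra
grid = [round(0.1 * i, 1) for i in range(1, 10)]              # 0.1 .. 0.9

for name in optima:                    # vary one hyperparameter at a time
    trial = dict(optima)
    curve = []
    for value in grid:
        trial[name] = value
        curve.append((value, evaluate_fmi(**trial)))
    best_value, best_fmi = max(curve, key=lambda vc: vc[1])
    print(f"{name}: best at {best_value} (FMI proxy {best_fmi:.2f})")
```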
Effect of negative sample size
Time consumption
|  | BDSynSetTra |  |  | SGSynSetTra |  |  |
|---|---|---|---|---|---|---|
| Approach | FMI (%) | ARI (%) | NMI (%) | FMI (%) | ARI (%) | NMI (%) |
| No-BiContext | 51.74 | 51.55 | 96.18 | 77.44 | 77.39 | 97.67 |
| No-FiltStrategy | 56.22 | 55.65 | 96.18 | 74.15 | 73.38 | 97.34 |
| No-SimFiltering | 2.66 | 0.14 | 94.94 | 32.82 | 23.01 | 94.34 |
| No-DomFiltering | 56.25 | 55.68 | 96.19 | 78.24 | 77.73 | 97.89 |
| SynSetMine | 48.82 | 48.64 | 96.06 | 73.23 | 73.04 | 97.24 |
| AutoECES | 52.96 | 52.73 | 96.11 | 74.07 | 73.87 | 97.27 |
| SynonymNet | 54.45 | 54.19 | 96.34 | 74.52 | 74.14 | 97.36 |
| CNSynSetE | 60.95 | 60.83 | 96.79 | 81.38 | 81.15 | 98.21 |
Ablation study
-
No-BiContext. No-BiContext is an ablation approach that does not use bilateral context information. First, this approach uses a Siamese network classifier (without bilateral context) to predict synonym set-instance pairs. Next, it uses an expansion algorithm with the similarity and domain filtering strategies to expand Chinese entity synonym sets.
-
No-FiltStrategy. No-FiltStrategy is an ablation approach that does not use any filtering strategy. First, this approach uses a bilateral-context-based Siamese network classifier to predict synonym set-instance pairs. Next, it uses an expansion algorithm without the similarity and domain filtering strategies to expand Chinese entity synonym sets.
-
No-SimFiltering. No-SimFiltering is an ablation approach that does not use the similarity filtering strategy. First, this approach uses a bilateral-context-based Siamese network classifier to predict synonym set-instance pairs. Next, it uses an expansion algorithm with only a domain filtering strategy to expand Chinese entity synonym sets.
-
No-DomFiltering. No-DomFiltering is an ablation approach that does not use the domain filtering strategy. First, this approach uses a bilateral-context-based Siamese network classifier to predict synonym set-instance pairs. Next, it uses an expansion algorithm with only a similarity filtering strategy to expand Chinese entity synonym sets.
Case study
BDSynSetTra:

| Output | O | G | Output | O | G |
|---|---|---|---|---|---|
| {公鸡,雄鸡} | 1 | 1 | {第一手资料,原始资料} | 1 | 1 |
| {安搏律定,阿普林定,茚满丙二胺,安室律定,茚丙胺} | 1 | 1 | {婚外情,外遇,出轨} | 1 | 1 |
| {向日葵,太阳花,向阳花} | 1 | 1 | {银鱼,面条鱼,鲥鱼,凤尾鱼} | 1 | 0 |
| {假钱,假钞,假币} | 1 | 1 | {赤脚,光脚,赤足} | 1 | 1 |
| {兵马俑,秦兵马俑,秦俑,马踏飞燕,铜奔马} | 1 | 0 | {龙眼干,桂圆干,山楂糕} | 1 | 0 |

SGSynSetTra:

| Output | O | G | Output | O | G |
|---|---|---|---|---|---|
| {堵车,交通拥堵,交通堵塞,塞车} | 1 | 1 | {脚癣,足癣,香港脚,脚气} | 1 | 1 |
| {午饭,午餐,中饭} | 1 | 1 | {服务器,伺服器} | 1 | 1 |
| {哈尔滨工业大学,哈工大} | 1 | 1 | {元宵节,中秋节,灯节,团圆节} | 1 | 0 |
| {妊娠期,怀孕期} | 1 | 1 | {宗谱,家谱,族谱} | 1 | 1 |
| {储值卡,消费卡,积分卡} | 1 | 0 | {猎狗,猎犬} | 1 | 1 |
-
On the one hand, the semantic information in a Chinese synonym set becomes more complex as the size of the set increases, which prevents the proposed approach from capturing enough synonymy information to predict large Chinese entity synonym sets correctly.
-
On the other hand, some entities are so similar in semantics that our approach cannot determine whether they hold a synonymy relation or merely a related-to relation. Discriminating between synonymy and related-to relations remains difficult for Chinese entities because of the scale and diversity of the entity vocabulary.