Abstract

Data heterogeneity is a major obstacle to resource sharing on the Semantic Web (SW), and ontologies are regarded as a solution to this problem. However, since different ontologies are constructed and maintained independently, the heterogeneity problem also exists between ontologies. Ontology matching identifies the semantic correspondences between entities in different ontologies, which is an effective method to address the ontology heterogeneity problem. Due to huge memory consumption and long runtime, the performance of existing ontology matching techniques requires further improvement. In this work, an extended compact genetic algorithm-based ontology entity matching technique (ECGA-OEM) is proposed, which uses both a compact encoding mechanism and a linkage learning approach to match ontologies efficiently. The compact encoding mechanism does not need to store and maintain the whole population in memory during the evolutionary process, and linkage learning protects the chromosome’s building blocks, which reduces the algorithm’s runtime and ensures the alignment’s quality. In the experiment, ECGA-OEM is compared with the participants of the Ontology Alignment Evaluation Initiative (OAEI) and state-of-the-art ontology matching techniques, and the experimental results show that ECGA-OEM is both effective and efficient.

1. Introduction

The Semantic Web (SW) was proposed by Tim Berners-Lee in 1998; it enables intelligent applications to understand a word’s meaning at the semantic level. Ontologies are a solution to the issue of data heterogeneity on the SW, since an ontology can establish consensus on the meaning of a concept in a given field and provide abundant domain knowledge and semantic vocabulary for the interaction between application systems. However, due to the SW’s decentralized nature, different ontologies may define the same concept differently, which leads to the issue of ontology heterogeneity [1]. Ontology matching is regarded as an effective method to address this issue, and swarm intelligence algorithm- (SIA-) based ontology matching techniques have achieved good performance in past studies [2], such as the genetic algorithm (GA) [3], particle swarm optimization (PSO) [4], the firefly algorithm (FA) [2], and the artificial bee colony algorithm (ABC) [5]. However, there are two drawbacks in the existing SIA-based approaches: (1) massive time and memory consumption is required, which heavily limits the efficiency of the ontology matching process; (2) an expert in the related field or a reference alignment is required in the process of ontology matching, which is usually not available in real applications. To overcome these drawbacks, an extended compact genetic algorithm-based ontology entity matching technique (ECGA-OEM) is proposed in this work, which uses both a compact encoding mechanism [6, 7] and a linkage learning approach to efficiently match the ontologies. In particular, our contributions are as follows:
(i) A new evaluation metric for the ontology alignment is proposed, which is able to work without the reference alignment and domain experts.
(ii) An optimization model for the ontology entity matching problem is constructed.
(iii) ECGA-OEM is proposed, which uses linkage learning and a compact encoding mechanism to efficiently address the ontology entity matching problem.

The rest of this paper is organized as follows: related works are reviewed in Section 2; ontology, ontology matching, and the similarity measures are presented in Section 3; the ontology entity matching approach through ECGA proposed in this paper is described in Section 4; Section 5 presents the experimental results; and finally, Section 6 draws the conclusion and presents the future work.

2. Related Works

Ontology matching is a complex, time-consuming, and error-prone task, especially when the scale of the ontologies is large. Recently, a number of machine learning (ML) techniques [8–14] have been proposed to automatically determine the ontology alignment. To improve matching efficiency, Araújo et al. [15] presented a matching system based on the parallel computing (PC) technique, and Amin et al. [16] matched ontologies based on cloud computing (CC). At the same time, SIA-based techniques have achieved great performance in the ontology matching domain [1, 2, 17–26].

Generally, ontology matching techniques are classified into two categories: ontology metamatching techniques and ontology entity matching techniques [27]. The former is dedicated to the problem of how to aggregate different similarity measures with appropriate weights, while the latter tries to directly determine the entity correspondence set between two ontologies. The first SIA-based ontology metamatching system is genetics for ontology alignment (GOAL), which aims at optimizing the aggregating weight set for different matchers [3, 28–30]. Memetic algorithms (MAs), which introduce a local search (LS) strategy into evolutionary algorithms (EAs) to improve their local optimization capability, have been proposed to solve the ontology metamatching problem [31]. To overcome the drawback of over-reliance on the reference alignment, Xue et al. presented a partial reference alignment (PRA), in which only a part of the standard reference is used to assess the quality of an alignment [32]. Furthermore, Xue and Wang proposed an innovative metric named unanimous improvement ratio (UIR) to assess the alignment’s quality, in which the reference alignment is not required [33]. Besides, the artificial bee colony (ABC) algorithm has also been adopted to address the ontology metamatching problem, which further improves the solution’s quality [5].

During the matching process, ontology metamatching techniques need to maintain several similarity matrices, which leads to huge memory consumption. For this reason, ontology entity matching techniques, which aim at directly determining the optimal set of matching pairs, have attracted experts’ interest. Genetic algorithm-based ontology matching (GAOM) first regarded a certain set of matching pairs as the optimization objective [34]. MAs have also been utilized to solve the ontology entity matching problem, outperforming GA [35]. Bock et al. [4] used PSO to solve ontology entity matching; in detail, it evaluates the fitness of chromosomes through a certain aggregation strategy over multiple objective functions. Alves et al. [36] argued that the instances contained in an ontology can be used to improve the alignment when knowledge is embedded in them. For this reason, Xue et al. [37] also take instance-level matching into consideration to further improve the quality of the alignment.

3. Preliminaries

3.1. Ontology and Ontology Matching

Definition 1. (ontology). An ontology is a 5-tuple O = (C, P, I, A, An) [33], where C is a nonempty set of classes; P is a nonempty set of properties; I is a (possibly empty) set of individuals that represent the instances of classes in the real world; A is a nonempty set of axioms that are used to check the consistency of ontologies or deduce new information; and An is a set of annotations that provide metadata so that researchers can understand the ontology. In particular, C, P, and I make up the entities of an ontology.

Definition 2. (ontology matching). Ontology matching can be considered as a function A' = f(O1, O2, A0, p, r), where O1 and O2 are the two ontologies to be matched; A0 is an existing initial alignment of O1 and O2; p is a set of parameters, e.g., thresholds, used in the process of ontology matching; and r is a set of external resources, e.g., background knowledge bases and dictionaries, which assist in ontology matching. The process of ontology matching is depicted in Figure 1, where A' is the obtained ontology alignment.
An example of matching two ontologies is presented in Figure 2, where O1 and O2 are the two ontologies to be matched. The strings in the rounded rectangles are the classes, e.g., “Reference,” “Entry,” and “Book.” The black lines between two classes of the same ontology represent their “has a” or “is a” relation in turn; e.g., “Reference” has a “Book” and “Book” is a “Reference,” which means “Reference” is the superclass of “Book” and “Book” is the subclass of “Reference.” There are datatype properties that describe the features of a class; e.g., “Data,” “Title,” and “Human Creator” are properties of “Reference.” The instances of a class are shown in a rectangle; e.g., “Of Natures Obvious Laws & Processes in Vegetation” is an instance of “Article.” The relations of the entities between the two ontologies are linked by lines with double arrowheads, annotated with the symbols “=,” “⊑” (or “⊒”), and “⊥,” which, respectively, mean the equivalence, more specific (or less specific), and disjointness relations; e.g., classes “Entry” and “Book” of O2 are, respectively, the equivalent and the hyponym of class “Reference” of O1, while property “Event” of O1 and property “Year” of O2 are irrelevant.

3.2. Similarity Measures

Before finding reliable entity correspondences, it is necessary to measure to what extent two ontology entities are similar. In the ontology domain, a similarity measure is usually used to calculate two entities’ similarity value. Generally, there are three categories of similarity measures, i.e., syntactic, linguistic, and taxonomy-based measures [33].

3.2.1. Syntactic Measures

Two syntactic measures, i.e., SMOA (string metric for ontology alignment) [37] and Levenshtein [38], are employed in this paper. Given two strings s1 and s2, the SMOA similarity is defined as

sim_SMOA(s1, s2) = comm(s1, s2) − diff(s1, s2) + winkler(s1, s2),

where comm(s1, s2) stands for the common length of s1 and s2, diff(s1, s2) for their different lengths, and winkler(s1, s2) is the improvement to the result yielded by the method proposed by Winkler [39]. The Levenshtein similarity is defined as

sim_Lev(s1, s2) = 1 − lev(s1, s2) / max(|s1|, |s2|),

where |s1| and |s2| are the numbers of letters contained in s1 and s2, respectively, and lev(s1, s2) is the number of letters that need to be modified to transform s1 into s2. The final syntactic similarity is equal to the average of the SMOA and Levenshtein similarities.
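To make the Levenshtein side of the syntactic measure concrete, here is a minimal sketch in Python. The function names are our own; the SMOA part is omitted, since its commonality and difference terms depend on substring bookkeeping not detailed here.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Classic single-row dynamic-programming edit distance."""
    m, n = len(s1), len(s2)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (s1[i - 1] != s2[j - 1]))      # substitution
            prev = cur
    return dp[n]

def levenshtein_similarity(s1: str, s2: str) -> float:
    """Normalize the edit distance into a [0, 1] similarity."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))
```

For instance, "book" and "back" differ in two of four letters, so their Levenshtein similarity is 0.5.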

3.2.2. Linguistic Measures

The linguistic similarity between two strings is computed by considering their semantic relations (such as synonymy and hypernymy), which usually requires a thesaurus or dictionary. In this work, WordNet [23, 40], an electronic lexical database that collects the various meanings of words, is used. Given two words w1 and w2, their linguistic similarity equals:
(i) 1, if w1 and w2 are synonyms in WordNet.
(ii) 0.5, if w1 is the hypernym of w2 or vice versa in WordNet.
(iii) 0, otherwise.
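The three-valued rule above can be illustrated with a toy sketch. The miniature synonym and hypernym tables below stand in for WordNet and are purely illustrative; a real implementation would query WordNet instead.

```python
# Toy stand-ins for WordNet lookups (illustrative only).
SYNONYMS = {("reference", "entry"), ("entry", "reference")}
HYPERNYMS = {("book", "reference")}  # (word, its hypernym)

def linguistic_similarity(w1: str, w2: str) -> float:
    """Three-valued linguistic measure: 1 for synonyms,
    0.5 for hypernym/hyponym pairs, 0 otherwise."""
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2 or (w1, w2) in SYNONYMS:
        return 1.0
    if (w1, w2) in HYPERNYMS or (w2, w1) in HYPERNYMS:
        return 0.5
    return 0.0
```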

3.2.3. Taxonomy-Based Measures

The core idea of taxonomy-based measures is to make full use of the hierarchical relationships in an ontology to determine two entities’ similarity by considering their neighbors’ similarity. In this work, mutual reasoning between class and property (MRCP) is proposed as the taxonomy-based measure, which is shown in Figure 3.

In Figure 3, a circle is a class of the ontology, a triangle is a property of the ontology, and a one-way arrow represents the hierarchical relationship, i.e., the source class is the superclass of the target class; the dividing-line arrow between a class and a property indicates that the property belongs to that class; a bidirectional arrow indicates a high similarity between two entities; and a dashed two-way arrow indicates that the similarity between them is improved after the operation. Subgraph (a) depicts the classes’ similarity gained from their neighbors, i.e., superclass and subclass: when there is a high similarity between two classes, the similarity between their subclasses is supposed to increase; likewise, the similarity of two classes is increased when their subclasses are highly similar. Subgraph (b) shows the properties’ similarity gained from their superproperty and subproperty: the similarity of two properties is improved when their superproperties and subproperties, respectively, are highly similar. Subgraph (c) shows the properties’ similarity gained from the classes to which they belong: when two classes of the two ontologies are highly similar, the similarity between each property of the first class and each property of the second class is improved, for all combinations. Conversely, as depicted in subgraph (d), classes’ similarity is increased due to the same or highly similar properties they share: since the property pairs of the two classes are highly similar, the similarity of the two classes is increased too.

3.2.4. Aggregation Strategy

Three similarity matrices are generated when the three measures have been applied, and they need to be aggregated into one matrix. The final similarity value sim(e1, e2) between two entities e1 and e2 is defined as follows: let sim_s, sim_l, and sim_t be, respectively, the syntactic, linguistic, and taxonomy-based similarity of e1 and e2, and let sim_a be the average of sim_s and sim_l; θ is a given parameter used to filter out the matching pairs with low similarity.
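Since the exact aggregation formula is not spelled out above, the sketch below is one plausible reading under our own assumptions: the lexical average sim_a and the taxonomy-based value sim_t are combined by taking the stronger of the two, and the threshold θ zeroes out weak pairs.

```python
def aggregate(sim_s: float, sim_l: float, sim_t: float,
              theta: float = 0.5) -> float:
    """One plausible aggregation (our assumption, not the paper's
    verbatim formula): take the max of the lexical average and the
    taxonomy-based similarity, then filter by the threshold theta."""
    sim_a = (sim_s + sim_l) / 2.0       # average of syntactic and linguistic
    sim = max(sim_a, sim_t)
    return sim if sim >= theta else 0.0
```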

4. Extended Compact Genetic Algorithm-Based Ontology Entity Matching

GA is an excellent methodology for solving the ontology matching problem due to its inherently parallel search characteristic and good search ability. In our work, an ECGA is proposed to efficiently address the ontology entity matching problem.

4.1. Optimal Model

The optimization model of the ontology entity matching problem is given as follows:

max f(X)
s.t. X = (x1, x2, …, x_|O1|)^T, x_i ∈ {−1, 1, 2, …, |O2|}, i = 1, 2, …, |O1|,

where |O1| and |O2| are, respectively, the cardinalities of the two ontologies O1 and O2, and X encodes the matching pairs. In particular, x_i = −1 means that the i-th entity of O1 has no matching entity in O2. The objective of this work is to maximize f(X); for its details, refer to Section 4.3.3.

4.2. The Framework of ECGA-OEM

ECGA-OEM takes two ontologies to be matched as input and outputs an alignment. The framework of ECGA-OEM is shown in Figure 4, whose critical components are described in the rest of this section. The two ontologies, generally in XML or RDF format, are extracted into two hierarchy schemas in the preprocessing model. The operation of the ECGA optimization model relies on the similarity matrix obtained in the similarity measure model, which has been stated in detail in Section 3.2. Finally, the alignment is generated by the solution generation model. The ECGA optimization model is described in detail in Section 4.3.

4.3. ECGA Optimization Model

The algorithm takes the virtual population’s maximum generation MaxGen = 2000 (normally, the number of iterations needed for ECGA to converge is much smaller than MaxGen), the threshold of 0.7, and the hierarchy schemas of ontology1 and ontology2 as input, and outputs the final alignment. The pseudocode of ECGA is presented in Algorithm 1, where PV and BB are the probability vector (see Section 4.3.1) and the building blocks (see Section 4.3.7), respectively.

Input: the hierarchy schemas of O1 and O2; the aggregated similarity matrix;
Output: the best chromosome;
(1) initialize the probability vector PV;
(2) while PV has not converged and MaxGen is not reached do
(3)  generate chromosomes from PV and fix the building blocks in BB;
(4)  evaluate each chromosome with the fitness function;
(5)  apply the selection operator and the elite strategy;
(6)  update PV according to the selected chromosomes;
(7)  extract the building blocks BB from PV via linkage learning;
(8) end while
4.3.1. Probability Vector Initialization

Different from binary coding, the probability vector (PV) in this work is two-dimensional. The initialized PV and a converged PV are shown in Tables 1 and 2, respectively. The value in a given row and column represents the probability of a match between the corresponding entity in O1 and the corresponding entity in O2; e.g., the cell at the row of entity “Reference” of O1 and the column of entity “Entry” of O2 gives the probability that they match (in particular, the header “−1 (null)” denotes the probability of no matching). The convergence condition is that, on each locus of the PV, the probability of taking one unique code is 1; e.g., in Table 2, the probability that entity “Reference” of O1 matches entity “Entry” of O2 is 1, and so on for the rest.
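A minimal sketch of such a two-dimensional PV follows; the shapes and names are our own assumptions. Row i holds a distribution over the candidate codes for the i-th entity of O1, and the extra last column plays the role of the “−1 (null)” header.

```python
def init_pv(n1: int, n2: int) -> list:
    """Uniform initial PV: n1 rows (entities of O1), n2 + 1 columns
    (entities of O2 plus one trailing '-1 (null)' column)."""
    return [[1.0 / (n2 + 1)] * (n2 + 1) for _ in range(n1)]
```

For example, `init_pv(4, 3)` yields four rows of four equal probabilities, each row summing to 1, mirroring the uniform initialization of Table 1.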

4.3.2. Chromosomes Generation

A certain number of chromosomes are produced in each generation through the PV. An example of a chromosome is given in Figure 5. In particular, subgraph (a) shows the loci of the chromosome and the corresponding codes, and subgraph (b) illustrates decoding the chromosome in subgraph (a); e.g., the fourth entity “Article” of O1 corresponds to the third entity “Article” of O2 because the code of the fourth locus is “3” (“−1 (Null)” indicates no matching).
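Chromosome generation then amounts to sampling one code per locus from the corresponding PV row. The sketch below assumes the row layout from the PV example above (last column encodes −1/null); the function name is our own.

```python
import random

def sample_chromosome(pv: list, rng=random) -> list:
    """Sample one chromosome: for each locus i, draw a code from row i
    of the PV.  Column j < len(row) - 1 means 'entity i of O1 matches
    entity j of O2'; the last column means -1 (no match)."""
    chrom = []
    for row in pv:
        codes = list(range(len(row) - 1)) + [-1]
        chrom.append(rng.choices(codes, weights=row, k=1)[0])
    return chrom
```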

4.3.3. Fitness Function

The fitness function is used to determine which chromosomes in the population better adapt to the environment. In the context of ontology matching, the objective is to find, at convergence, the best chromosome, i.e., the one whose corresponding alignment has the highest quality. The objective function of the optimization model is used as the fitness function of this work; given a chromosome c, it is defined as follows:

f(c) = λ · r(A) + (1 − λ) · p(A),

where A is the alignment determined by c; λ, a fraction in the range [0, 1], is the relative weight of r(A) and p(A), which is set to 0.25 in this paper; and r(A) and p(A) are defined as follows:

r(A) = |A| / min(|O1|, |O2|),   p(A) = (1 / |A|) · Σ_{i=1}^{|A|} sim_i,

where |A| is the cardinality of A; |O1| and |O2| are, respectively, the cardinalities of O1 and O2; and sim_i is the similarity of the i-th matching pair in the alignment A. In particular, r(A) is the ratio of the number of matching pairs found to the smaller entity number of the two ontologies, and p(A) is the average similarity of the matching pairs found, which, respectively, approximate the recall and the precision.
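A hedged sketch of this fitness computation in Python follows; the chromosome encoding (code −1 for no match) and the function names are our assumptions, consistent with the examples elsewhere in this section.

```python
def fitness(chrom: list, sim_matrix: list, lam: float = 0.25) -> float:
    """Weighted sum of an approximate recall (match ratio) and an
    approximate precision (mean similarity of the matched pairs)."""
    matched = [(i, j) for i, j in enumerate(chrom) if j != -1]
    if not matched:
        return 0.0
    n1, n2 = len(sim_matrix), len(sim_matrix[0])
    match_ratio = len(matched) / min(n1, n2)                  # r(A), ~recall
    mean_sim = sum(sim_matrix[i][j] for i, j in matched) / len(matched)  # p(A), ~precision
    return lam * match_ratio + (1 - lam) * mean_sim
```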

4.3.4. Selection Operator

The selection operator selects the best chromosomes in the current population to participate in the next step [41] and to update the PV. Firstly, the chromosomes are sorted in descending order according to their fitness scores. Secondly, the first half of the chromosomes is retained as a temporary population. Finally, using fitness as the weight, roulette selection picks chromosomes from the temporary population for the subsequent operations.
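The two-stage scheme above (truncation to the top half, then fitness-proportional roulette) can be sketched as follows; names are our own.

```python
import random

def select(population: list, fitnesses: list, rng=random):
    """Keep the top half of the population by fitness, then draw one
    chromosome from it with fitness-proportional (roulette) weights."""
    ranked = sorted(zip(population, fitnesses),
                    key=lambda p: p[1], reverse=True)
    top = ranked[: max(1, len(ranked) // 2)]      # truncation step
    chroms, fits = zip(*top)
    return rng.choices(chroms, weights=fits, k=1)[0]  # roulette step
```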

4.3.5. Elite Strategy

The goal of the elite strategy is to keep the historical optimal solution and prevent the fitness of the optimal chromosome from “degenerating” in the process of evolution. In this work, the elite strategy has two steps: the historical optimal solution first competes with the current optimal solution, and the winner becomes the new historical optimal solution; the historical optimal solution then participates in the PV update in each generation.

4.3.6. Probability Vector Updating

An example of updating the PV is presented in Figure 6, where subgraph (a) is a chromosome generated by the initialized PV. The similarity of each locus is derived from the similarity matrix according to the chromosome’s code. It should be noted that the code of the third locus is “0,” which means that the entity “Part” with sequence number “3” in O1 does not match any entity in O2, so its similarity is equal to 1 minus the highest similarity between “Part” and all entities in O2. The PV is updated by the normalized similarity of each locus, which is shown in subgraph (b). The probabilities updated in the PV are shown in bold; e.g., the probability of the matching pair “Reference” and “Entry” changed from “1” to “1.2” since the normalized similarity of the corresponding locus of the chromosome is “0.20.”
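The update step can be sketched as below. The additive-then-renormalize rule is our own reading of the figure’s example ("1" becoming "1.2" after adding a normalized similarity of 0.20), and the row layout (last column = −1/null) follows the PV sketch above.

```python
def update_pv(pv: list, chrom: list, locus_sims: list,
              rate: float = 1.0) -> list:
    """Add each locus's normalized similarity to the probability mass
    of the sampled code, then renormalize the row (our assumption)."""
    total = sum(locus_sims) or 1.0
    for i, code in enumerate(chrom):
        col = code if code != -1 else len(pv[i]) - 1  # last column = -1/null
        pv[i][col] += rate * locus_sims[i] / total
        s = sum(pv[i])
        pv[i] = [v / s for v in pv[i]]                # renormalize the row
    return pv
```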

4.3.7. Linkage Learning

Building blocks are saved through linkage learning, thereby improving the efficiency of the algorithm and the quality of the solution [42]. In a simple GA, linkage learning identifies good loci and protects them so that they are not destroyed by the subsequent crossover and mutation operations; in ECGA, linkage learning keeps good probability distributions so that they are not disturbed in the subsequent update process. A linkage learning approach is proposed in this work, whose details are shown in Figure 7.

For clarity, only column 1 of the original probability vector is selected for narration. In each generation, a low-probability clearing operation is performed: the value of each entry is divided by the maximum value of its row, and the entry is cleared if this ratio is less than a specific value (0.2 in this work); e.g., the bits 0, 2, 3, and −1 in the PV are cleared and marked. A row of the PV is a “good” probability distribution when all but one bit of the row are zero. Linkage learning generates building blocks from the “good” probability distributions, e.g., the building block with index “0” and code “0” enclosed in the rounded rectangle. After that, all the building blocks are directly put into each chromosome (the bold numbers), which reduces both the runtime and the memory consumption.
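The clearing-and-extraction step can be sketched as below, under the same row layout as the PV sketches above (last column = −1/null); the function name and return shape are our own assumptions.

```python
def extract_building_blocks(pv: list, cutoff: float = 0.2) -> dict:
    """Low-probability clearing: in each row, zero out entries whose
    ratio to the row maximum falls below the cutoff.  Rows left with a
    single non-zero entry become fixed building blocks (locus -> code)."""
    blocks = {}
    for i, row in enumerate(pv):
        m = max(row)
        row[:] = [v if m and v / m >= cutoff else 0.0 for v in row]
        nonzero = [j for j, v in enumerate(row) if v > 0.0]
        if len(nonzero) == 1:
            j = nonzero[0]
            blocks[i] = j if j < len(row) - 1 else -1  # last column = -1/null
    return blocks
```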

5. Experiments and Discussion

5.1. Experiment Setup

In the experiment, the Biblio benchmark provided by the Ontology Alignment Evaluation Initiative (OAEI) is used to verify the effectiveness of our approach. Each testing case normally includes two ontologies to be matched and a reference alignment, which serves as a standard to evaluate the quality of the matching results. The testing cases can be classified into five categories, which are briefly described in Table 3.

In this work, our method is compared with the participants of OAEI and with GA- and CGA-based ontology matching techniques. The experimental results are the harmonic mean (H-mean) values over 30 independent runs.

5.2. Alignment Evaluation Metrics according to Reference Alignment

A criterion is needed when evaluating the quality of matching systems. Given an alignment A produced by a matching system and the reference alignment R provided with the testing case, two measures, recall and precision, are employed in this work, defined as follows [22, 23]:

recall = |R ∩ A| / |R|,   precision = |R ∩ A| / |A|,

where |R| and |A| are, respectively, the cardinalities of the matching pairs in the reference alignment and in the alignment produced by the matching system, and |R ∩ A| is the number of matching pairs that exist in both. When recall is 1, all the matching pairs in the reference alignment have been found; when precision is 1, all the matching pairs found are correct.

Both recall and precision are important evaluation measures, and they should be considered at the same time. A weighted harmonic mean of recall and precision, the F-measure, is used in this work, which is presented in the following equation [43]:

F = (precision · recall) / ((1 − α) · precision + α · recall),

where α is the relative weight of recall and precision; it is set to 0.5 in this work, which yields the f1-measure.
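The three metrics above can be computed over sets of matching pairs as follows; the pair representation (tuples of entity indices) is our own choice for illustration.

```python
def evaluate(reference: set, found: set, alpha: float = 0.5):
    """Recall, precision, and weighted-harmonic F-measure over sets of
    matching pairs.  With alpha = 0.5 this is the usual f1-measure."""
    correct = len(reference & found)            # |R ∩ A|
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(found) if found else 0.0
    if precision == 0.0 and recall == 0.0:
        return recall, precision, 0.0
    f = (precision * recall) / ((1 - alpha) * precision + alpha * recall)
    return recall, precision, f
```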

5.3. Comparison with OAEI’s Participants

The participants from OAEI 2016, 2015, and 2014 are selected for comparison with our approach. In particular, if a matching system participated in more than one year, only its latest results are used. The harmonic mean comparison of the participants and ECGA-OEM is shown in Figure 8. The vertical axis represents the different matching systems, and the horizontal axis represents their corresponding scores. In terms of f-measure, ECGA-OEM ranks third, slightly lower than Lily and CroMatch; Wiki has been used as the linguistic measure in Lily and CroMatch, which improves their performance at the expense of efficiency. In addition, ECGA-OEM shows an unparalleled balance between precision and recall, whereas for the other participants one of the two is typically much higher than the other, which is very important when evaluating result quality.

A further f-measure comparison of the OAEI participants and ECGA-OEM over a total of 32 test cases is given. Figure 9 shows the numbers of participants that ECGA-OEM is superior, equal, and inferior to, respectively. The horizontal axis is the set of different test cases, and the vertical axis is the number of matching systems. In the vast majority of test cases, the number of matching systems that ECGA-OEM is superior to is much higher than the number it is inferior to. Only in cases No. 246, No. 247, and No. 254 is the ranking of ECGA-OEM relatively low (refer also to Table 4 for the specific f-measure value in each testing case).

In Table 4, the numbers from 1 to 18 in the first row denote edna, AML, CroMatch, Lily, LogMap, LogMapLt, Xmap, LogMapBio, AML-2014, Gmap, LogMap-C, Mamba, AOT-2014, AOTL, MassMatch, OMReasoner, RSDLWB, and Xmap2, respectively; ECGA-OEM is the matching system proposed by us. The value in each column represents the f-measure score of the matching system in the corresponding case. The f-measures of participants higher than that of ECGA-OEM are shown in bold, and the equal ones are underlined. The “+,” “=,” and “−” in the last column, respectively, indicate the number of matching systems that ECGA-OEM is superior, equal, and inferior to.

5.4. Comparison among GA, CGA, and ECGA

To verify the performance of linkage learning, we compare ECGA with GA and CGA. The detailed f-measure and runtime of the three competitors are, respectively, shown in Tables 5 and 6; all GA, CGA, and ECGA results are the mean values of 30 independent runs. It can be seen that replacing the crossover and mutation operators (GA) with the probability vector (CGA) improves the f-measure and significantly reduces the runtime: the average f-measure is slightly improved, while the average runtime is reduced from 31.975 seconds to 3.540 seconds, i.e., CGA takes only about 11% of GA’s runtime, which largely improves the algorithm’s efficiency. Except for testing cases No. 301 and No. 304, CGA is more stable than GA in terms of both f-measure and runtime, since its standard deviations are smaller. Testing cases No. 301 and No. 304 are representatives of real-world cases with unique heterogeneity, which makes the f-measure produced by CGA decrease slightly. Linkage learning, the technique applied in ECGA, further increases the f-measure score and reduces the average runtime to 1.749 seconds compared with CGA. ECGA also obtains a smaller standard deviation than CGA, which certifies its strong stability. It is worth noticing that the f-measure score of ECGA in testing cases No. 301 and No. 304 is almost the same as that of GA (only 0.003 lower in testing case No. 301, a minor difference that can be ignored), which demonstrates the ability of linkage learning to overcome CGA’s shortcoming.

6. Conclusions

Ontology matching can effectively solve the problem of data heterogeneity by discovering the correspondences between two ontologies’ entities. The compact encoding mechanism shows high efficiency in ontology matching, especially in ontology entity matching, and the linkage learning employed in the proposed ECGA-OEM produces qualified alignments. The experimental results have shown that our approach outperforms the participants of OAEI in terms of f-measure, recall, and precision. The comparison of GA, CGA, and ECGA in terms of f-measure and runtime shows that ECGA-OEM greatly reduces the runtime while maintaining the alignment’s quality.

In the future, we would like to apply linkage learning in other compact SIAs for better results. In addition, matching large-scale ontologies, such as the anatomy and large biomedical tracks in OAEI, is an open challenge in the domain of ontology matching, and we are interested in using the improved ECGA to match these large-scale ontologies.

Data Availability

The data used to support the findings of this study are available upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of Fujian Province (no. 2020J01875) and the National Natural Science Foundation of China (nos. 61773415, 61801527, and 61103143).