Information Content-Based Gene Ontology Functional Similarity Measures: Which One to Use for a Given Biological Data Type?

Gaston K. Mazandu; Nicola J. Mulder

doi:10.1371/journal.pone.0113859

Abstract

The current increase in Gene Ontology (GO) annotations of proteins in the existing genome databases and their use in different analyses have fostered the improvement of several biomedical and biological applications. To integrate this functional data into different analyses, several protein functional similarity measures based on GO term information content (IC) have been proposed and evaluated, especially in the context of annotation-based measures. In the case of topology-based measures, each approach was set with a specific functional similarity measure depending on its conception and applications for which it was designed. However, it is not clear whether a specific functional similarity measure associated with a given approach is the most appropriate, given a biological data set or an application, i.e., achieving the best performance compared to other functional similarity measures for the biological application under consideration. We show that, in general, a specific functional similarity measure often used with a given term IC or term semantic similarity approach is not always the best for different biological data and applications. We have conducted a performance evaluation of a number of different functional similarity measures using different types of biological data in order to infer the best functional similarity measure for each different term IC and semantic similarity approach. The comparisons of different protein functional similarity measures should help researchers choose the most appropriate measure for the biological application under consideration.

Citation: Mazandu GK, Mulder NJ (2014) Information Content-Based Gene Ontology Functional Similarity Measures: Which One to Use for a Given Biological Data Type? PLoS ONE 9(12): e113859. https://doi.org/10.1371/journal.pone.0113859

Editor: Cynthia Gibas, University of North Carolina at Charlotte, United States of America

Received: August 14, 2014; Accepted: October 31, 2014; Published: December 4, 2014

Copyright: © 2014 Mazandu, Mulder. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are within the paper and its Supporting Information files.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The advancement of high-throughput biology technologies has resulted in a large increase in functional data, eliciting the need for relevant tools that help analyze and extract information from these data. The Gene Ontology (GO) [1] is an established standard for the functional annotation of proteins that successfully provides structured and controlled, organism-independent vocabularies to describe gene functions and a well-adapted platform to computationally process data at the functional level [2]. Currently, several proteins are already annotated with GO terms in the existing biological databases [3]–[6], thus enabling protein comparisons on the basis of their GO annotations. Even though a high proportion (more than 98%) of these annotations are inferred electronically (mostly based on transitive mappings from InterPro2GO, SPKW2GO, EC2GO, SPSL2GO, HAMAP2GO and UniPathway2GO), with IEA (Inferred from Electronic Annotation) as the GO evidence code (http://www.geneontology.org/GO.evidence.shtml), these annotations are becoming more and more accurate with an increased level of confidence as the different mappings are manually curated [7].

Several functional similarity measures that quantify similarity between proteins based on their GO annotations have been introduced and successfully applied in many biomedical and biological applications [2], [8]. These measures allow the integration of the biological knowledge contained in the GO structure [9], and have contributed to the improvement of biological analyses [2]. These measures are derived either directly from the GO term information content (IC), a numerical value scoring the description and specificity of a GO term using its position in the GO directed acyclic graph (DAG), or from GO term semantic similarity scores conveying information shared by two GO terms in the GO DAG [8]. It is worth mentioning that several term semantic similarity models have been introduced and a detailed review can be found in [10], [11]. In this study, we focus only on term semantic similarity models that are based on term information content, known as node-based models [8], [11]. In order to quantify the information content (IC) value of a given term, several approaches have also been proposed, each depending on how the concept of ‘specificity’ is conceived in the context of the GO DAG structure. These approaches are partitioned into two main families, namely the annotation-based and topology-based families, and have been largely used to compare GO terms in the GO DAG and proteins at the functional level using their GO annotations.

The annotation family uses GO term statistics in the corpus under consideration. Despite the issue of protein annotation dependence (scores are based on annotation, which may be unbalanced, biased and incomplete), which leads to the shallow annotation problem [10] that affects the semantic similarity scores produced [12], this family has been used in several applications. Several approaches for comparing GO terms have been tested in the context of the GO DAG; the most popular node-based semantic similarity approaches include the Resnik [13], Lin [14] and Jiang & Conrath [15] approaches, which were initially suggested in the context of WordNet and adapted to the GO DAG [16]. Recently, the Nunivers approach [8] has been introduced and different enhancements, such as the Disjunct Common Ancestor (DCA) [17], relevance similarity [18], information coefficient similarity [19] and eXtended GraSM (XGraSM) [8] models, were proposed to improve the existing approaches for GO term comparison. Note that a random walks enhancement [20] was proposed to improve any of the existing similarity measures by modeling inherent uncertainty from the incomplete knowledge of gene annotations and ontology structure. Functional similarity measures include those induced by GO term semantic similarity approaches, namely average (Avg) [16], maximum (Max) [21], average of the best matches (ABM) [2], and best match average (BMA) [9], and those using the GO term information content directly, namely SimGIC [22], SimUI [23], SimUIC and SimDIC [2], [9].

The topology-based family, which only uses the structure of the GO DAG in the computation of the IC values, has been proposed to correct for the effect of annotation dependence and provide an effective way of measuring functional similarity between proteins based on their GO annotations. The earliest type of topology-based family, namely edge- or path-based semantic similarity measures, suffers from a serious drawback of producing uniform scores for terms at the same level of the hierarchy under consideration as these scores are obtained using path lengths between terms [8]. These measures ignore the position characteristics of terms in the hierarchy and a solution based on differently weighting edges was suggested, but failed to completely resolve the problem [9], [11]. In this study, we are only considering the node-based approaches as pointed out previously, which use the concept of IC score to compare the properties of the terms themselves and relations to their ancestors or descendants, and taking into account term position characteristics [9]. These measures are referred to as IC-based approaches and overcome the main issue of edge- or path-based approaches, producing a fixed and well defined IC score for a given GO term, independent of the corpus or source under consideration. Each topology-based approach provides its specific semantic similarity measure for comparing GO terms, and functional similarity measure for scoring protein closeness. However, none of the existing studies has attempted to evaluate the effectiveness of functional similarity measures proposed in the context of the annotation-based approaches when applied to the topology-based approaches. Such a study is important to determine the most appropriate functional similarity measure for each approach given the biological application.

Here, we investigate the behaviour of several different IC-based functional similarity measures suggested in the context of annotation-based and topology-based approaches, using different biological data, including protein-protein interaction networks, protein domain and other functional data. Each measure performs differently for different applications [2] and interprets the DAG structure of the GO differently [8], [9]. Thus, one needs to understand these differences in order to choose an effective measure for analysis of a dataset, which can be cumbersome and tedious for someone who just needs a quick GO semantic similarity measure for their biological question. This suggests that the quantitative comparative study of all existing GO semantic similarity measures and approaches is necessary to enable one to quickly identify the most effective measure, among the several semantic similarity tools available, for their application. This study provides a mapping between a term IC or term semantic similarity approach and its corresponding most ‘appropriate’ functional similarity measure, given a particular biological application.

Materials and Methods

To evaluate the existing IC-based functional similarity measures which have been used in the context of biomedical and bioinformatics applications, we use different functional data, including protein sequence, Pfam domain and enzyme commission (EC) similarity data, human gene expression (microarray) and protein-protein interaction (PPI) datasets. All these data represent some form of ‘grouping’ of proteins that should be functionally related and thus provide useful tests for GO similarity measures. The complete set of GO data and protein-GO term associations were extracted from the GO and GOA databases, respectively, released on the 15th April, 2014. We have considered three topology-based approaches, namely the GO-universal metric proposed by Mazandu and Mulder [9], and the methods of Wang et al. [24] and Zhang et al. [25]. In general, the information content (IC) or semantic value of a given term $t$ is computed as follows:

$$\mathrm{IC}(t) = -\ln p(t) \qquad (1)$$

where $p(t)$ is the relative frequency of occurrence of the term $t$ in the protein annotation dataset under consideration [16], the D-value of $t$ [25], or the topological position characteristic of $t$, in the context of the annotation family, the Zhang et al. approach and the GO-universal approach, respectively. Note that the Zhang et al. model for computing the IC score follows the Seco et al. approach [26] in its conception and it is adapted to the context of the GO DAG. For the Wang et al. method, the IC score of a given term is the sum of the S-value of the term and those of all its ancestors [24]. The term semantic similarity score between GO terms $s$ and $t$ can be retrieved from the following formula [8]:

$$S(s,t) = \frac{f(A_s \cap A_t)}{g(s,t)} \qquad (2)$$

where $A_s$ and $A_t$ denote the sets of ancestors of the terms $s$ and $t$, and $f$ and $g$ are measures of the commonality between and of the description of $s$ and $t$, respectively. Formula 2 is a unified formula of all term semantic similarity models based on IC or semantic values (SV) of terms. Note that other term semantic similarity models that do not use only or directly IC values were proposed. These include the Hybrid Relative Specificity Similarity (HRSS) method [27], which adapts both node- and edge-based concepts, and the Shortest Semantic Differentiation Distance (SSDD), which assesses the distance between terms in the GO DAG in order to measure their semantic similarity score [28]. These methods are beyond the scope of this study.
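As an illustration of the annotation-based case of equation (1), the following minimal sketch computes IC values from a toy corpus and derives the Resnik and Lin term similarity scores from the most informative common ancestor. The DAG, the term names and the protein annotations are hypothetical, and the propagation of each annotation to all ancestors of the annotated term is an assumption of this sketch rather than a detail taken from the text.

```python
import math

# Hypothetical toy DAG (child -> direct parents); these are not real GO terms.
PARENTS = {"root": set(), "A": {"root"}, "B": {"root"}, "C": {"A", "B"}}

def ancestors(term):
    """Return the term together with all of its ancestors in the toy DAG."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def annotation_ic(annotations):
    """IC(t) = -ln p(t), with p(t) the fraction of proteins annotated to t
    directly or through a descendant (assumed propagation rule)."""
    counts = {t: 0 for t in PARENTS}
    for terms in annotations.values():
        for t in set().union(*(ancestors(x) for x in terms)):
            counts[t] += 1
    n = len(annotations)
    # ln(n / c) equals -ln(c / n) and avoids a negative zero for the root.
    return {t: math.log(n / c) for t, c in counts.items() if c > 0}

def mica_ic(s, t, ic):
    """IC of the most informative common ancestor (MICA) of s and t."""
    return max(ic[x] for x in ancestors(s) & ancestors(t))

def resnik(s, t, ic):
    """Resnik similarity: the IC of the MICA."""
    return mica_ic(s, t, ic)

def lin(s, t, ic):
    """Lin similarity: MICA IC normalized by the IC of both terms."""
    denom = ic[s] + ic[t]
    # Guard for two zero-IC terms (e.g. the root), where the ratio is undefined.
    return 2 * mica_ic(s, t, ic) / denom if denom > 0 else 0.0

# Hypothetical corpus of four annotated proteins.
annotations = {"P1": {"C"}, "P2": {"A"}, "P3": {"B"}, "P4": {"A"}}
ic = annotation_ic(annotations)
print({t: round(v, 3) for t, v in ic.items()})
print(resnik("C", "A", ic), lin("C", "A", ic))
```

In this toy corpus the root has an IC of zero and the most specific term receives the highest IC, which is the behaviour expected from a specificity score.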

Measuring protein similarity at the functional level

Several measures have been proposed for estimating functional similarity scores in the context of annotation-based IC approaches to facilitate protein comparisons at the functional level. These functional similarity scores are obtained using statistical measures of closeness, such as average (Avg), maximum (Max), best-match average (BMA) and averaging all the best matches (ABM). The average and maximum measures for two annotated proteins $p$ and $q$ are computed as follows:

$$\mathrm{Avg}(p,q) = \frac{1}{n \times m} \sum_{s \in T_p^X,\; t \in T_q^X} S(s,t) \qquad (3)$$

and

$$\mathrm{Max}(p,q) = \max_{s \in T_p^X,\; t \in T_q^X} S(s,t) \qquad (4)$$

where $T_r^X$ is a set of GO terms in $X$ representing the molecular function (MF), biological process (BP) or cellular component (CC) ontology annotating a given protein $r$, $n = |T_p^X|$ and $m = |T_q^X|$ are the number of GO terms in these sets, and $S(s,t)$ is the semantic similarity score between the terms $s$ and $t$.
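The Avg and Max measures of equations (3) and (4) can be sketched as follows. The term-to-term scores in `TOY_SIM` are hypothetical placeholders standing in for any term semantic similarity approach; only the aggregation step is illustrated.

```python
from itertools import product

# Hypothetical symmetric term-term similarity scores.
TOY_SIM = {
    ("a", "a"): 1.0, ("b", "b"): 1.0, ("c", "c"): 1.0,
    ("a", "b"): 0.2, ("a", "c"): 0.5, ("b", "c"): 0.4,
}

def toy_sim(s, t):
    """Look up the score of a term pair regardless of its order."""
    return TOY_SIM.get((s, t), TOY_SIM.get((t, s)))

def avg_similarity(terms_p, terms_q, sim):
    """Avg: mean of the scores of all term pairs across the two proteins."""
    scores = [sim(s, t) for s, t in product(terms_p, terms_q)]
    return sum(scores) / len(scores)

def max_similarity(terms_p, terms_q, sim):
    """Max: highest score among all term pairs across the two proteins."""
    return max(sim(s, t) for s, t in product(terms_p, terms_q))

terms_p, terms_q = {"a", "b"}, {"a", "c"}
avg_pq = avg_similarity(terms_p, terms_q, toy_sim)
max_pq = max_similarity(terms_p, terms_q, toy_sim)
print(avg_pq, max_pq)
```

Here the four term pairs score 1.0, 0.5, 0.2 and 0.4, so Avg is their mean while Max keeps only the shared term.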

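The ABM [2] for two annotated proteins $p$ and $q$ is the mean of best matches of GO terms of each protein against the other, given by the following formula:

$$\mathrm{ABM}(p,q) = \frac{1}{n+m} \left[ \sum_{s \in T_p^X} S(s, T_q^X) + \sum_{t \in T_q^X} S(t, T_p^X) \right] \qquad (5)$$

with $S(s, T) = \max_{x \in T} S(s,x)$, where $T_p^X$ and $T_q^X$ are the sets of GO terms annotating $p$ and $q$, containing $n$ and $m$ terms, respectively. The Best Match Average (BMA) [2], [9] for two annotated proteins $p$ and $q$ is the mean of the following two values: average of best matches of GO terms annotated to protein $p$ against those annotated to protein $q$, and average of best matches of GO terms annotated to protein $q$ against those annotated to protein $p$, given by the following formula:

$$\mathrm{BMA}(p,q) = \frac{1}{2} \left[ \frac{1}{n} \sum_{s \in T_p^X} S(s, T_q^X) + \frac{1}{m} \sum_{t \in T_q^X} S(t, T_p^X) \right] \qquad (6)$$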
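The difference between ABM and BMA only shows when the two proteins carry different numbers of terms. The sketch below, which uses hypothetical term scores, pools all best matches for ABM and averages the two per-protein means for BMA.

```python
# Hypothetical symmetric term-term similarity scores.
TOY_SIM = {("a", "a"): 1.0, ("a", "b"): 0.2, ("a", "c"): 0.5}

def toy_sim(s, t):
    """Look up the score of a term pair regardless of its order."""
    return TOY_SIM.get((s, t), TOY_SIM.get((t, s)))

def best_matches(terms_a, terms_b, sim):
    """For each term of the first protein, its best score against the second."""
    return [max(sim(s, t) for t in terms_b) for s in terms_a]

def abm(terms_p, terms_q, sim):
    """ABM: mean over the pooled best matches of both proteins."""
    total = sum(best_matches(terms_p, terms_q, sim))
    total += sum(best_matches(terms_q, terms_p, sim))
    return total / (len(terms_p) + len(terms_q))

def bma(terms_p, terms_q, sim):
    """BMA: mean of the two per-protein averages of best matches."""
    mean_p = sum(best_matches(terms_p, terms_q, sim)) / len(terms_p)
    mean_q = sum(best_matches(terms_q, terms_p, sim)) / len(terms_q)
    return 0.5 * (mean_p + mean_q)

terms_p, terms_q = {"a", "b", "c"}, {"a"}
abm_pq = abm(terms_p, terms_q, toy_sim)
bma_pq = bma(terms_p, terms_q, toy_sim)
print(abm_pq, bma_pq)
```

With three terms against one, ABM gives more weight to the protein with more annotations, whereas BMA weights both proteins equally. When both proteins have the same number of terms the two scores coincide.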

Note that the four functional similarity measures above require GO term semantic similarity scores, and are referred to as IC-based non-direct term or term semantic similarity- or pair-wise term-based measures [2]. For the topology-based family, each approach has been suggested with its functional similarity measure. The GO-universal metric [9] uses BMA, and ABM was used in the Wang et al. approach [24]. The Zhang et al. measure [25] is a context dependent approach and authors initially suggested using the approach proposed by Lord et al. [16], which is the Avg scheme for measuring functional similarity scores between proteins.

In the context of the annotation-based family, it has been observed that measuring the semantic similarity of two GO terms based only on the most informative common ancestor terms cannot discern the semantic contributions of the ancestor terms to these two specific terms and thus may negatively impact functional similarity scores. The GraSM and XGraSM approaches have been proposed and shown to perform better than those using only the most informative common ancestors (MICA) strategy [8]. This argument has been confirmed through the performance evaluation of the SimGIC measure suggested by Pesquita et al. [22], which uses a Jaccard index weighted by IC of terms, thus incorporating the features of all ancestors of the terms. The SimGIC measure computes the functional similarity score between two proteins $p$ and $q$ as follows:

$$\mathrm{SimGIC}(p,q) = \frac{\sum_{x \in \mathcal{T}_p^X \cap \mathcal{T}_q^X} \mathrm{IC}(x)}{\sum_{x \in \mathcal{T}_p^X \cup \mathcal{T}_q^X} \mathrm{IC}(x)} \qquad (7)$$

where $\mathrm{IC}(x)$ is the information content value of the term $x$ [8] and $\mathcal{T}_r^X$ a set of GO terms together with their ancestors in $X$ representing the ontology (MF, BP or CC) annotating a given protein $r$.
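A minimal sketch of the IC-weighted Jaccard index behind SimGIC is given below. The IC values and term names are hypothetical, and each protein's term set is assumed to already include the ancestors of its annotated terms.

```python
# Hypothetical IC values for a toy ontology.
TOY_IC = {"root": 0.0, "A": 1.0, "B": 2.0, "C": 3.0}

def sim_gic(terms_p, terms_q, ic):
    """IC-weighted Jaccard index over ancestor-closed term sets."""
    union_ic = sum(ic[t] for t in terms_p | terms_q)
    shared_ic = sum(ic[t] for t in terms_p & terms_q)
    # Two proteins annotated only with zero-IC terms share no information.
    return shared_ic / union_ic if union_ic > 0 else 0.0

# Term sets already extended with their ancestors (assumption of this sketch).
terms_p = {"root", "A", "C"}
terms_q = {"root", "A", "B"}
gic_pq = sim_gic(terms_p, terms_q, TOY_IC)
print(gic_pq)
```

The shared terms carry an IC of 1.0 out of a total of 6.0 in the union, so the score is 1/6 even though two of the four terms are shared: specific unshared terms weigh more than general shared ones.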

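Using the observation above, we proposed two other possible functional similarity schemes [2], [9], using Dice (Czekanowski or Lin-like measure) and universal indexes, referred to as SimDIC and SimUIC, respectively, and given by the following formulae:

$$\mathrm{SimDIC}(p,q) = \frac{2 \sum_{x \in \mathcal{T}_p^X \cap \mathcal{T}_q^X} \mathrm{IC}(x)}{\sum_{x \in \mathcal{T}_p^X} \mathrm{IC}(x) + \sum_{x \in \mathcal{T}_q^X} \mathrm{IC}(x)} \qquad (8)$$

$$\mathrm{SimUIC}(p,q) = \frac{\sum_{x \in \mathcal{T}_p^X \cap \mathcal{T}_q^X} \mathrm{IC}(x)}{\max\left\{ \sum_{x \in \mathcal{T}_p^X} \mathrm{IC}(x),\; \sum_{x \in \mathcal{T}_q^X} \mathrm{IC}(x) \right\}} \qquad (9)$$

where $\mathcal{T}_p^X$ and $\mathcal{T}_q^X$ are the sets of GO terms annotating $p$ and $q$ in the ontology $X$, together with their ancestors.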
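The Dice-like and universal indexes can be sketched in the same way. The IC values and term sets below are hypothetical, and the term sets are assumed to be closed under ancestors.

```python
# Hypothetical IC values for a toy ontology.
TOY_IC = {"root": 0.0, "A": 1.0, "B": 2.0, "C": 3.0}

def total_ic(terms, ic):
    """Sum of the IC values of a set of terms."""
    return sum(ic[t] for t in terms)

def sim_dic(terms_p, terms_q, ic):
    """Dice-like index: twice the shared IC over the summed IC of both sets."""
    denom = total_ic(terms_p, ic) + total_ic(terms_q, ic)
    return 2 * total_ic(terms_p & terms_q, ic) / denom if denom > 0 else 0.0

def sim_uic(terms_p, terms_q, ic):
    """Universal index: shared IC over the larger of the two IC totals."""
    denom = max(total_ic(terms_p, ic), total_ic(terms_q, ic))
    return total_ic(terms_p & terms_q, ic) / denom if denom > 0 else 0.0

# Term sets already extended with their ancestors (assumption of this sketch).
terms_p = {"root", "A", "C"}
terms_q = {"root", "A", "B"}
dic_pq = sim_dic(terms_p, terms_q, TOY_IC)
uic_pq = sim_uic(terms_p, terms_q, TOY_IC)
print(dic_pq, uic_pq)
```

For non-negative IC values the universal index never exceeds the Dice-like index, because the larger of the two totals is at least their mean.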

Note that this study provides the first evaluation of these SimDIC and SimUIC measures and their comparison to other functional similarity measures. Unlike the Avg, Max, ABM and BMA measures, in which semantic similarity between GO terms is required in the computation of functional similarity scores, the SimGIC, SimDIC and SimUIC measures use the IC of terms directly and they are referred to as IC-based direct term measures. Note that there exist other functional similarity models, such as shortest-path graph kernel (spgk) [29], using the intrinsic topology of the GO DAG for directly estimating protein functional similarity scores without computing the IC scores of GO terms or semantic similarity scores between terms. Here, we are only focusing on protein functional similarity models that use the IC of terms.

Assessing different functional similarity measures

We systematically assess different functional similarity measures on different types of functional data, including sequence similarity, Pfam domain and Enzyme Commission (EC) similarity data on a selected set of proteins, and human protein-protein interaction (PPI) and co-expression networks. These datasets represent different types of biological data used to evaluate GO semantic similarity measures [10]. Depending on these biological data, different performance measures are used to elucidate the ‘best’ semantic similarity measure or approach.

Correlation with EC, Pfam and sequence similarity

Generally, the comparison of different semantic similarity measures is performed using Pearson's correlation measures with sequence, Pfam domain and Enzyme Commission (EC) similarity data. This correlation provides an indication of how effective the functional similarity measure is in capturing sequence, Pfam, and EC similarity. This means that a measure with a higher correlation is better, since it captures these similarities well and it is likely to be an unbiased measure. To compare different measures, we ran the Collaborative Evaluation of Semantic Similarity Measures (CESSM) online tool [30] at http://xldb.di.fc.ul.pt/tools/cessm/ for BP and MF using a dataset of selected proteins with known relationships downloaded from the CESSM website.
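For reference, Pearson's correlation between functional similarity scores and a reference similarity (sequence, Pfam or EC) can be sketched as below. The score lists are hypothetical; the CESSM tool performs this comparison on its own benchmark protein pairs.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five protein pairs.
functional_scores = [0.10, 0.35, 0.50, 0.80, 0.95]
sequence_scores = [0.05, 0.30, 0.45, 0.60, 0.90]
r = pearson(functional_scores, sequence_scores)
print(round(r, 3))
```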

Performance evaluation using a PPI network

Different measures were assessed in terms of their ability to capture functional coherence in a human PPI network based on how interacting proteins are functionally related to each other. Human PPI datasets were downloaded from several different PPI databases, including the IntAct, DIP, BIND, MIPS, MINT and BioGRID databases, and integrated into a single network in which only interactions predicted by at least two different approaches and found in the STRING dataset are considered, to reduce the impact of false positives. This produced a human PPI network with 6031 interactions, of which 5366 and 5580 interactions have both interacting partners among the 29844 and 31683 proteins annotated with respect to the GO BP and CC ontologies, respectively. These interaction datasets are available in the supplementary data (see Tables S1, S2 and S3 in File S1) and can also be downloaded from the CBIO website at http://web.cbio.uct.ac.za/ITGOM/funcsimdata.

The sets of these 5366 and 5580 interactions are considered as positive sets, while each negative set consists of the same number of interactions randomly selected among annotated human protein pairs. This is consistent as the chance of randomly selecting a detected PPI is very small (less than 0.0012%). We only considered proteins annotated with BP and CC terms in the network produced since two proteins that interact physically are more likely to be involved in similar biological processes or localized in the same cellular component, but there is no guarantee that they share molecular functions [9]. The classification power of different functional similarity measures was tested using Receiver Operating Characteristic (ROC) curve analysis, in which the true positive rate (sensitivity) is plotted against the false positive rate (1-specificity) and the Area Under the Curve (AUC) is computed. This AUC value is used as a measure of discriminative power and a realistic classifier must have an AUC larger than 0.5.
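The AUC can be read as the probability that a randomly chosen interacting pair receives a higher functional similarity score than a randomly chosen non-interacting pair, with ties counted as one half. The sketch below uses this pairwise formulation in Python; it is not the procedure used in the study, and the scores are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct ranking."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical functional similarity scores.
interacting = [0.9, 0.7, 0.4]       # positive set
random_pairs = [0.6, 0.3, 0.2]      # negative set
auc = roc_auc(interacting, random_pairs)
print(auc)
```

A measure that scores interacting and random pairs identically yields 0.5, which is why values above 0.5 are required for a useful classifier. The quadratic loop is only suitable for small sets; a rank-based computation would be preferable for thousands of pairs.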

Clustering power on a gene expression dataset

We use the human co-expression network retrieved from Bossi et al. [31] and the STRING human network. We retrieved 7228 co-expressed protein pairs, of which a total of 6995 pairs have both proteins found among the 29844 human proteins annotated with BP terms (see Tables S4 and S5 in File S1, or go to http://web.cbio.uct.ac.za/ITGOM/funcsimdata). We are only considering the BP ontology as co-expressed genes are more likely to share common processes and may at least belong to the same pathway or contribute to a similar biological process [32]. We partitioned these co-expressed proteins into different clusters using the Blondel et al. method [33] and the corresponding partition is considered to be a ground truth, i.e., the true partition of the actual co-expression network. Thereafter, the interactions from the co-expression network are weighted using functional similarity scores and proteins are clustered using the same clustering method. We assessed the clustering power of a given functional similarity measure by comparing this clustering result to the ground truth using Normalized Mutual Information and the Rand Index of pairwise cluster memberships [34].

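Let $N$ be the number of proteins in the network with the ground truth ($g$) having $p$ partitions, each with $n_i^g$ proteins, $i = 1, \ldots, p$, and clustering result ($c$) with $q$ partitions, each with $n_j^c$ proteins, $j = 1, \ldots, q$. The entropy of a given clustering ($d$) having $r$ partitions, each with $n_k^d$ proteins, $k = 1, \ldots, r$, is given by:

$$H(d) = -\sum_{k=1}^{r} \frac{n_k^d}{N} \log \frac{n_k^d}{N} \qquad (10)$$

and the mutual information between the two partitions is computed as follows:

$$I(g,c) = \sum_{i=1}^{p} \sum_{j=1}^{q} \frac{n_{ij}}{N} \log \frac{N\, n_{ij}}{n_i^g\, n_j^c} \qquad (11)$$

where $n_{ij}$ is the number of common proteins between the $i$th cluster in the ground truth and the $j$th cluster in the clustering result. This implies that the normalized mutual information is given by:

$$\mathrm{NI}(g,c) = \frac{2\, I(g,c)}{H(g) + H(c)} \qquad (12)$$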
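The entropy and mutual information of two partitions can be computed directly from cluster labels, as in the sketch below. The normalization by the mean of the two entropies, 2I/(H(g)+H(c)), is a common choice and is an assumption here, since other normalizations exist. The cluster labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a partition given as one cluster label per protein."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(truth, result):
    """Mutual information between two partitions of the same proteins."""
    n = len(truth)
    size_g, size_c = Counter(truth), Counter(result)
    joint = Counter(zip(truth, result))  # n_ij for every non-empty overlap
    return sum(
        (n_ij / n) * math.log(n * n_ij / (size_g[i] * size_c[j]))
        for (i, j), n_ij in joint.items()
    )

def normalized_mutual_information(truth, result):
    """NMI with the assumed 2I / (H(g) + H(c)) normalization."""
    denom = entropy(truth) + entropy(result)
    # Two single-cluster partitions are identical and carry no entropy.
    return 2 * mutual_information(truth, result) / denom if denom > 0 else 1.0

# Hypothetical cluster labels for four proteins.
ground_truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]    # same partition, different label names
unrelated = [0, 1, 0, 1]     # partition independent of the ground truth
print(normalized_mutual_information(ground_truth, relabelled))
print(normalized_mutual_information(ground_truth, unrelated))
```

The score depends only on how proteins are grouped, not on the label names, so a relabelled copy of the ground truth scores 1 and an independent partition scores 0.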

Finally, the Rand Index of pairwise cluster memberships is computed as follows:

$$\mathrm{RI}(g,c) = \frac{a + b}{\binom{N}{2}} \qquad (13)$$

where $a$ is the number of pairs of proteins belonging to the same cluster in the ground truth and clustering result, $b$ the number of protein pairs belonging to different clusters in the ground truth and clustering result, and $N$ the number of proteins in the network. The functional similarity measure providing higher normalized mutual information and Rand Index (accuracy) scores is considered to be the ‘best’ one.
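The Rand Index counts the protein pairs on which the two partitions agree, either by grouping the pair together in both or by separating it in both. A minimal sketch with hypothetical cluster labels:

```python
from itertools import combinations

def rand_index(truth, result):
    """Fraction of protein pairs treated consistently by both partitions."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(truth)), 2):
        same_in_truth = truth[i] == truth[j]
        same_in_result = result[i] == result[j]
        if same_in_truth == same_in_result:
            agree += 1
        pairs += 1
    return agree / pairs

# Hypothetical cluster labels for four proteins.
ground_truth = [0, 0, 1, 1]
clustering = [0, 1, 0, 1]
ri = rand_index(ground_truth, clustering)
print(ri)
```

In this example only two of the six pairs are separated in both partitions and none is grouped in both, giving a score of 1/3.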

Results and Discussion

Previous work on semantic similarity measures has suggested that the appropriate use of functional similarity measures depends on the biological applications and different measures perform differently for different applications [2]. Each semantic similarity approach or functional measure was defined for a specific purpose with a specific application in mind, especially in the context of topology-based approaches, where each approach was set with its specific functional similarity measure, depending on its conception and the applications for which it was designed. These applications include protein-protein interaction assessment, protein function prediction and protein clustering, and results were often tested against the expectations of the performance scores. Here, we assess the performance of different measures on different biological applications or data, including EC, Pfam domain and sequence similarity on a selected set of protein pairs, and human PPI and co-expression network or expression data, in order to elucidate the most ‘appropriate’ measures for different approaches and biological applications. The summary of different approaches that are combined to construct 57 different IC-based functional similarity measures used is provided in Table 1. Note that the Jiang and Conrath approach is not used explicitly since it has been shown to be a particular case of the Lin approach [8].

Table 1. Summary of different IC-based functional similarity and term semantic similarity measures.

https://doi.org/10.1371/journal.pone.0113859.t001

Using EC, Pfam and Sequence Similarity data

We used a dataset of proteins with known relationships downloaded from the CESSM online tool. The GO annotations of different proteins in the dataset were retrieved from the GOA-UniProtKB dataset. The CESSM tool has made the comparison of different functional similarity measures using Pearson's correlation measures with sequence, Pfam domain and EC similarity possible. We ran the CESSM online tool and results are shown in Figure 1 for the BP, MF and CC ontologies. Except for the Resnik approach, these results show that in general there is a good correlation between EC, Pfam domain, sequence similarity and functional similarity measures for BP, MF and CC, especially when using measures other than Max and Avg. For EC in particular, the MF ontology tends to display higher levels of correlation. This is unsurprising as EC numbers are very specific for a particular function, so there should be good correlation in MF terms.

Figure 1. Performance evaluation in terms of Pearson's correlation values.

These different Pearson's correlation values with Enzyme Commission (EC), Pfam and Sequence similarity are obtained from the CESSM online tool. For x-axis labels, the prefixes R, N, L, Li, S, X, A, Z, W, and U represent the approaches and stand for Resnik, Nunivers, Lin, Li, Relevance, XGraSM, Annotation-based, Zhang, Wang and GO-universal, respectively. The suffixes GIC, UIC and DIC represent SimGIC, SimUIC and SimDIC measures, respectively. In cases where the prefix X is used, it is immediately followed by the approach prefix. Refer to Tables 2 and 3 for the description of these different measures.

https://doi.org/10.1371/journal.pone.0113859.g001

Recently, it was shown that the normalization model and correction factors have an impact on the performance of functional similarity measures [8]. It is likely that the effect of the normalization factor is a serious drawback of the Resnik approach as this has an impact on its performance and makes it inconsistent with the hierarchy under consideration. This is confirmed by looking at the performance of the Nunivers [8] and Lin [14] approaches (see Table 2), which follow the general pattern, whereas the Resnik approach suggests the Max measure for the MF ontology. In general, BMA and ABM measures provide the best performance and they perform equally in most cases. On the other hand, the use of an efficient correction factor may improve a given approach or measure. If the information coefficient and relevance introduced by Li et al. [19] and Schlicker et al. [18], respectively, which use the IC value of the most informative common ancestor between terms, do not significantly improve the performance of the Lin approach, then one can consider all common informative ancestors in the correction factor to enhance the performance of the approach [8].

Table 2. Pearson's correlation values of different measures.

https://doi.org/10.1371/journal.pone.0113859.t002

As displayed in Figure 1 and Table 2, applying the XGraSM correction factor to the Resnik, Lin and Nunivers approaches significantly improved their performance. Thus, including common informative ancestors in the conception of a semantic similarity measure improves its performance, especially for approaches that include only the feature of child terms in the computation of IC. This is the case for the annotation-based, Zhang et al. and Wang et al. approaches, where the SimGIC measure shows an overall best performance. Note that this is not the case for the GO-universal metric, for which the BMA measure performs better than other measures; BMA also provides better performance for the Wang et al. approach when applied to EC data, even though the Wang et al. approach initially used the ABM measure. It follows that in the context of the annotation-based family, if one chooses to use the IC-based non-direct measures, it is advantageous to use the XGraSM enhancement model, in which case Resnik-BMA shows the overall best performance. The SimUI approach [23] refers to the union-intersection protein similarity measure and it is a particular case of SimGIC assigning equal IC value to all terms in the GO DAG [9]. Even though this assumption is not realistic in the context of the GO DAG, the SimUI measure can still be used as an alternative measure in practice as it shows relatively good performance when applied to these different data.
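The relation between SimUI and SimGIC can be checked numerically: with the same IC value assigned to every term, the IC-weighted Jaccard index reduces to the plain Jaccard index of the two term sets. The term names and the uniform IC value below are hypothetical.

```python
def sim_gic(terms_p, terms_q, ic):
    """IC-weighted Jaccard index over ancestor-closed term sets."""
    union_ic = sum(ic[t] for t in terms_p | terms_q)
    shared_ic = sum(ic[t] for t in terms_p & terms_q)
    return shared_ic / union_ic if union_ic > 0 else 0.0

def sim_ui(terms_p, terms_q):
    """Union-intersection (plain Jaccard) index over ancestor-closed term sets."""
    return len(terms_p & terms_q) / len(terms_p | terms_q)

# Hypothetical term sets, already extended with their ancestors.
terms_p = {"root", "A", "C"}
terms_q = {"root", "A", "B"}

# Assigning the same (arbitrary, non-zero) IC to every term.
uniform_ic = {t: 2.5 for t in terms_p | terms_q}
ui_pq = sim_ui(terms_p, terms_q)
gic_uniform_pq = sim_gic(terms_p, terms_q, uniform_ic)
print(ui_pq, gic_uniform_pq)
```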

Using protein-protein interaction and expression data

We used human PPI and co-expression networks to assess the performance of different functional similarity measures. In the case of the PPI network, we use the AUC values, computed with the ROCR package in the R programming language, as a measure of classification power. The larger the AUC value, the more efficient the functional similarity measure is. For the co-expression network, we computed the NI and RI values as measures of clustering power; the higher these values, the more powerful the functional similarity measure is. Different values found for different measures are shown in Figure 2 and Table 3. These results indicate that, independently of the approaches, the Avg measure, which is the earliest proposal suggested by Lord et al. [16] in the context of IC-based functional similarity, performs better than any other functional similarity measure.

Figure 2. Performance evaluation in terms of clustering power (RI and NI) and Area Under the Curve (AUC) values.

Different x-axis labels are the same as in Fig. 1, where different prefixes and suffixes stand for different term semantic similarity approaches and functional similarity measures.

https://doi.org/10.1371/journal.pone.0113859.g002

Table 3. Area under the curve (AUC), Rand Index (RI) and Normalized Mutual Information (NI) values of different measures.

https://doi.org/10.1371/journal.pone.0113859.t003

It was unexpected to find that the Wang et al. approach performs poorly in terms of AUC values when using the BMA and ABM measures for BP, whereas these measures have shown good performance when used with EC, Pfam domain and sequence similarity data and the authors of this approach initially suggested using the ABM measure. Other approaches show good performance when used with their initial measures even though the Avg measure achieves the best performance. On the other hand, the Max measure performs poorly compared to other measures, independently of the network (PPI or co-expression) and performance measure. This may be due to the fact that the Max measure tends to over-estimate functional similarity scores between proteins, for example by assigning the similarity score of 1 to two proteins sharing at least one GO term, independently of the number of unrelated terms between these proteins.
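The over-estimation by the Max measure can be seen on a small hypothetical case in which two proteins share a single term and all their other terms are unrelated. The term names and the all-or-nothing term similarity below are illustrative assumptions only.

```python
from itertools import product

def exact_match_sim(s, t):
    """Hypothetical term similarity: 1 for identical terms, 0 otherwise."""
    return 1.0 if s == t else 0.0

def max_similarity(terms_p, terms_q, sim):
    """Max: highest score among all term pairs."""
    return max(sim(s, t) for s, t in product(terms_p, terms_q))

def avg_similarity(terms_p, terms_q, sim):
    """Avg: mean score over all term pairs."""
    scores = [sim(s, t) for s, t in product(terms_p, terms_q)]
    return sum(scores) / len(scores)

def bma_similarity(terms_p, terms_q, sim):
    """BMA: mean of the two per-protein averages of best matches."""
    mean_p = sum(max(sim(s, t) for t in terms_q) for s in terms_p) / len(terms_p)
    mean_q = sum(max(sim(t, s) for s in terms_p) for t in terms_q) / len(terms_q)
    return 0.5 * (mean_p + mean_q)

# Two proteins sharing one term among otherwise unrelated annotations.
terms_p = {"t1", "t2", "t3"}
terms_q = {"t1", "t4"}
max_pq = max_similarity(terms_p, terms_q, exact_match_sim)
avg_pq = avg_similarity(terms_p, terms_q, exact_match_sim)
bma_pq = bma_similarity(terms_p, terms_q, exact_match_sim)
print(max_pq, avg_pq, bma_pq)
```

Max reports full similarity from the single shared term, whereas Avg and BMA are lowered by the unrelated terms.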

Table 4 lists functional similarity measures achieving overall ‘best’ performance for different ontologies (MF, CC and BP) given a biological data type. These results indicate that for the CC ontology, the topology-based approaches, namely SimGIC based on Zhang et al. (ZGIC), Wang et al. (WGIC) and GO-universal (UGIC) measures, provide overall best performance in terms of EC, Pfam and sequence similarity, respectively. For MF and BP ontologies, annotation-based approaches, either XGraSM-Resnik BMA (XRBMA) or SimGIC (AGIC), achieve the overall best performance. This suggests that measures achieving overall best performance for EC, Pfam and Sequence Similarity data are those incorporating all informative common ancestors in their scoring systems. However, this is not the case in the context of PPI and co-expression networks where Average based on Resnik (RAvg) and Wang et al. (WAvg) measures achieve the overall best performance. While the Wang et al. approach incorporates ancestor features when modeling term semantic similarity, Resnik is based only on the most informative common ancestor. To provide users with the most appropriate functional similarity measure related to the term information content or term semantic similarity approach they have chosen to use, a summary of the best performing measures for different approaches and different biological data or applications is provided in Table 5.

Table 4. Summary of overall ‘best’ performing measures for different biological data.

https://doi.org/10.1371/journal.pone.0113859.t004

Table 5. Summary of the best performing measures for different applications.

https://doi.org/10.1371/journal.pone.0113859.t005

Finally, note that the good performance of the annotation-based family depends on the corpus under consideration, because these approaches rely on the frequencies of GO term occurrences in the corpus. These annotations may be unevenly distributed across the DAG. This constitutes a serious drawback of these approaches, specifically for organisms with sparse GO annotations, and may negatively affect their performance [9]. Using the whole set of annotations, as done in this study, may solve this problem, but only at the cost of increased running time and complexity for these annotation-based approaches. This is expected to worsen as the number of protein annotations grows daily, since the annotation file must be processed before the IC values can be computed, and this processing takes a long time. This implies that it may be better to use topology-based approaches if one has to choose between the two families.
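The corpus dependence described above can be seen in a small example. This is a minimal sketch, assuming the usual annotation-based definition IC(t) = -ln p(t), where p(t) is the fraction of annotated proteins in the corpus carrying term t or one of its descendants; the function name `annotation_ic`, the protein identifiers and the term labels are hypothetical, and each annotation set is assumed to be already propagated to ancestors.

```python
import math
from collections import Counter


def annotation_ic(annotations):
    """Annotation-based IC for every term observed in a corpus.

    annotations maps a protein identifier to its set of terms, assumed to
    be ancestor-closed so that counts already include descendants.
    """
    counts = Counter(t for terms in annotations.values() for t in set(terms))
    n_proteins = len(annotations)
    return {t: -math.log(c / n_proteins) for t, c in counts.items()}


# Hypothetical corpus of four annotated proteins.
corpus_full = {
    "P1": {"root", "x"},
    "P2": {"root", "x"},
    "P3": {"root", "y"},
    "P4": {"root"},
}
# A sparser corpus in which one annotated protein is missing.
corpus_sparse = {k: corpus_full[k] for k in ("P1", "P3", "P4")}

ic_full = annotation_ic(corpus_full)
ic_sparse = annotation_ic(corpus_sparse)
print(f"IC(x), full corpus:   {ic_full['x']:.3f}")
print(f"IC(x), sparse corpus: {ic_sparse['x']:.3f}")
```

The same term receives a different IC value in the two corpora, and the whole corpus has to be scanned before any IC value is available; a topology-based IC, by construction, needs only the ontology structure.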

Conclusion

Several IC-based GO functional similarity measures have been proposed over recent years and have enabled comparison of proteins at the functional level on the basis of their GO annotations. These measures are being used in different biological and biomedical applications and have largely contributed to the efficient exploitation of the biological knowledge embedded in the GO structure. While annotation-based functional similarity measures have been intensively studied and topology-based measures very often deployed to specific applications, none of the previous studies has attempted to quantitatively perform all-against-all semantic similarity measure comparisons. As a result, there were still gaps in our knowledge on the performance of these measures when applied to different biological data or applications, making the choice of the most ‘appropriate’ measure difficult, especially for someone who just needs a quick GO semantic similarity measure for their biological question. Thus, a comparative study was necessary in order to provide a global assessment of these different semantic similarity measures.

Here, we have carried out a quantitative performance evaluation of several different semantic similarity measures between GO terms for different term IC families or semantic similarity approaches and different biological data. Results indicate that a measure used for a given biological data type was not always the most appropriate, even for the ‘well’ studied family of measures, namely annotation-based measures. In fact, although the SimGIC, BMA or ABM measure was confirmed to be the best measure, in general, when using EC, Pfam domain and sequence similarity data, these measures were not the best for applications related to PPI and co-expression data (e.g., assessing protein-protein interaction or clustering co-expressed proteins), where the Avg measure showed overall best performance. This is also the case for the topology-based approaches where, in general, the initial measure suggested for use does not provide the overall best performance. This study bridges the gap between the large variety of GO semantic similarity measures and their performance in different biological and biomedical applications by comparing different protein functional similarity measures using different biological data. This should help researchers choose the most appropriate measure for their biological application.

Supporting Information

File S1.

Combined file of supporting tables. Table S1: A human protein-protein interaction dataset used to assess the classification power of different functional similarity measures using Receiver Operator Characteristic (ROC) curve analysis. Table S2: A set of human protein-protein interactions with both interacting partners annotated with respect to the GO BP ontology. Table S3: A set of human protein-protein interactions with both interacting partners annotated with respect to the GO CC ontology. Table S4: A human co-expression network used to assess the clustering power of different functional similarity measures using Normalized Mutual Information and Rank Index scores. Table S5: A set of human co-expressed protein pairs among human proteins annotated with BP terms.

https://doi.org/10.1371/journal.pone.0113859.s001

(ZIP)

Author Contributions

Conceived and designed the experiments: NM GM. Performed the experiments: GM. Analyzed the data: NM GM. Contributed reagents/materials/analysis tools: GM. Wrote the paper: GM NM.

References

1. GO-Consortium (2009) The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Research 38:D331–D335.
2. Mazandu GK, Mulder NJ (2013) DaGO-Fun: Tool for Gene Ontology-based functional analysis using term information content measures. BMC Bioinformatics 14:284.
3. UniProt-Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research 38:D142–D148.
4. Flicek P, Aken BL, Ballester B, Beal K, Bragin E, et al. (2010) Ensembl's 10th year. Nucleic Acids Research 38(Database issue): D557–D562.
5. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 37(Database issue): D5–D15.
6. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2009) Genbank. Nucleic Acids Research 37(Database issue): D26–D31.
7. Mazandu GK, Mulder NJ (2012) Using the underlying biological organization of the Mycobacterium tuberculosis functional network for protein function prediction. Infection, Genetics and Evolution 12(5):922–932.
8. Mazandu GK, Mulder NJ (2013) Information content-based Gene Ontology semantic similarity approaches: Toward a unified framework theory. BioMed Research International 2013: Article ID 292063, 11 pages.
9. Mazandu GK, Mulder NJ (2012) A topology-based metric for measuring term similarity in the Gene Ontology. Adv Bioinformatics 2012: Article ID 975783, 17 pages.
10. Guzzi PH, Mina M, Guerra C, Cannataro M (2011) Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinform: 1–17.
11. Pesquita C, Faria D, Falcão AO, Lord P, Couto FM (2009) Semantic similarity in biomedical ontologies. PLoS Comput Biol 5(7):e1000443.
12. Mistry M, Pavlidis P (2008) Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics 9:327.
13. Resnik P (1999) Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11:95–130.
14. Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning. pp.296–304.
15. Jiang JJ, Conrath DW (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th International Conference on Research in Computational Linguistics. pp.19–33.
16. Lord PW, Stevens PW, Brass A, Goble CA (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19(10):1275–1283.
17. Couto F, Silva M, Coutinho P (2007) Measuring semantic similarity between Gene Ontology terms. Data Knowledge Eng 61(1):137–152.
18. Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7:302.
19. Li B, Wang JZ, Feltus FA, Zhou J, Luo F (2010) Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins. ArXiv e-prints: 1001.0958.
20. Yang H, Nepusz T, Paccanaro A (2012) Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics 28(10):1383–1387.
21. Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, et al. (2005) Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) archive 2(4):330–338.
22. Pesquita C, Faria D, Bastos H, Ferreira AEN, Falcão AO, et al. (2008) Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 9(Suppl 5):S4.
23. Gentleman R (2005) Visualizing and Distances Using GO, http://bioconductor.org/packages/2.6/bioc/vignettes/GOstats/inst/doc/GOvis.pdf.
24. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF (2007) A new method to measure the semantic similarity of GO terms. Bioinformatics 23(10):1274–1281.
25. Zhang P, Jinghui Z, Huitao S, Russo JJ, Osborne B, et al. (2006) Gene functional similarity search tool (GFSST). BMC Bioinformatics 7:135.
26. Seco N, Veale T, Hayes J (2004) An intrinsic information content metric for semantic similarity in wordnet. In: ECAI-04. pp. 1089–1090.
27. Wu X, Pang E, Lin K, Pei Z (2013) Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge- and IC-Based Hybrid Method. PLoS ONE 8(5):e66745.
28. Xu Y, Guo M, Shi W, Liu X, Wang C (2013) A novel insight into Gene Ontology semantic similarity. Genomics 101:368–375.
29. Alvarez MA, Qi X, Yan C (2011) A shortest-path graph kernel for estimating gene product semantic similarity. J Biomed Semant 2:3.
30. Pesquita C, Faria D, Pessoa D, Couto FM (2009) CESSM: Collaborative Evaluation of Semantic Similarity Measures. JB2009: Challenges in Bioinformatics 157.
31. Bossi A, Lehner B (2009) Tissue specificity and the human protein interaction network. Molecular Systems Biology 5:260.
32. Mazandu GK, Opap K, Mulder NJ (2011) Contribution of microarray data to the advancement of knowledge on the Mycobacterium tuberculosis interactome: Use of the random partial least squares approach. Infection, Genetics and Evolution 11(4):725–733.
33. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech 10008:1–12.
34. Steinhaeuser K, Chawla NV (2010) Identifying and evaluating community structure in complex networks. Pattern Recognition Letters 31:413–421.