research-article

Towards the assessment of semantic similarity analysis of protein data: main approaches and issues

Published: 01 September 2012

Abstract

Bioinformatics approaches to the study of proteins have led to the introduction of different methodologies and related tools for analysing the many types of protein data, ranging from primary, secondary and tertiary structures to interaction data [1] and functional knowledge.

One of the most advanced tools for encoding and representing functional knowledge in a formal way is the Gene Ontology (GO) [2,3]. It is composed of three ontologies, named Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). Each ontology consists of a set of terms (GO terms) representing different functions, biological processes and cellular components within the cell. GO terms are connected to each other to form a hierarchical graph. Terms representing similar functions are close to each other within this graph.
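
As an illustration of the hierarchical graph described above, the following minimal sketch stores a handful of terms as a directed acyclic graph and collects the ancestors that two terms share. The term names are hypothetical placeholders rather than real GO identifiers, and the `PARENTS` table and helper functions are assumptions made for this example only.

```python
# Toy fragment of a GO-like hierarchy: each term maps to its direct parents.
# Term names are hypothetical placeholders, not real GO identifiers.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic_activity": ["root"],
    "protein_binding": ["binding"],
    "kinase_activity": ["catalytic_activity"],
    "kinase_binding": ["protein_binding"],
}


def ancestors(term, parents=PARENTS):
    """Return the set containing `term` and every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def common_ancestors(a, b, parents=PARENTS):
    """Terms that lie above (or coincide with) both `a` and `b` in the graph."""
    return ancestors(a, parents) & ancestors(b, parents)
```

Two terms that sit close together, such as `kinase_binding` and `protein_binding`, share several specific ancestors, whereas terms from distant branches share only the root.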

Biological molecules are associated with GO terms that represent their functions, biological roles and localization. This process, usually referred to as the annotation process, can be performed under the supervision of an expert or in a fully automated way. Computationally inferred annotations, labelled with the evidence code Inferred from Electronic Annotation (IEA), are not as reliable as experimentally determined annotations. For this reason every annotation is labeled with an Evidence Code (EC) that keeps track of the type of process used to produce the annotation itself. In the April 2010 release of annotations, about 98% of all annotations are IEA annotations [4].
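
A small sketch of how evidence codes can be used in practice: given annotation records that carry an evidence code, one can measure the share of IEA annotations and, if desired, set them aside. The records below are invented for illustration (the identifiers are placeholders), and the tuple layout and function names are assumptions rather than any standard file format or API.

```python
from collections import Counter

# Hypothetical annotation records: (protein, GO term, evidence code).
# IEA marks electronically inferred annotations; IDA and IMP are
# examples of experimental evidence codes.
ANNOTATIONS = [
    ("P1", "GO:0000001", "IEA"),
    ("P1", "GO:0000002", "IDA"),
    ("P2", "GO:0000001", "IEA"),
    ("P3", "GO:0000003", "IMP"),
]


def iea_fraction(annotations):
    """Fraction of annotation records whose evidence code is IEA."""
    counts = Counter(code for _, _, code in annotations)
    return counts["IEA"] / len(annotations)


def without_iea(annotations):
    """Keep only the records that are not electronically inferred."""
    return [record for record in annotations if record[2] != "IEA"]
```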

The term annotation corpus is commonly used to identify all the annotations involving a set of proteins or genes, usually referring to whole proteomes and genomes (e.g. the annotation corpus of yeast). For lack of space we do not further describe the Gene Ontology. Comprehensive reviews have been provided by du Plessis et al. [4] and by Guzzi et al. [5].

The availability of well-formalized functional data has enabled the use of computational methods to analyse genes and proteins from a functional point of view. For example, a set of algorithms, known as functional enrichment algorithms, has been developed to determine the statistical significance of the presence (or the absence) of a GO term in a set of gene products. A detailed review of these algorithms can be found in [4].
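
To make the idea of statistical significance concrete, the sketch below computes an over-representation p-value with the hypergeometric test, one common choice for enrichment analysis. The source does not prescribe a specific test, so this is an assumption; the function name and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated gene products in a
    random sample of size `sample`, drawn without replacement from a
    population of size `population` containing `annotated` products
    that carry the GO term (upper tail of the hypergeometric distribution)."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

A small p-value suggests that the term occurs in the gene set more often than chance alone would explain. In a real analysis many terms are tested at once, so a multiple-testing correction would normally follow.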

An interesting problem is how to express quantitatively the relationships between GO terms. Several measures, referred to as (term) semantic similarity (SS) measures, have been introduced in the last decade. Given two or more GO terms, they try to quantify the similarity of the functional aspects represented by the terms within the cell. By exploiting annotation corpora, semantic similarity measures have been further extended to evaluate the similarity of genes and proteins on the basis of their annotations.
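
The sketch below illustrates both steps on toy data. It uses Resnik's information-content measure for term similarity and a best-match average to lift it to proteins. These are well-known options chosen here for illustration, not measures named by the source; the ontology, the annotation corpus and all function names are hypothetical.

```python
from math import log

# Toy ontology (term -> direct parents) and toy annotation corpus
# (protein -> directly annotated terms). All names are placeholders.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "protein_binding": ["binding"],
    "kinase_activity": ["catalysis"],
}
CORPUS = {
    "P1": {"protein_binding"},
    "P2": {"binding"},
    "P3": {"kinase_activity"},
    "P4": {"protein_binding", "kinase_activity"},
}


def ancestors(term):
    """Return `term` together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content(term):
    """-log of the fraction of proteins annotated with `term` or a descendant.
    Assumes every term is used by at least one protein in the toy corpus."""
    count = sum(
        1 for terms in CORPUS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return -log(count / len(CORPUS))


def resnik(a, b):
    """Information content of the most informative common ancestor."""
    return max(information_content(t) for t in ancestors(a) & ancestors(b))


def protein_similarity(p, q):
    """Best-match average of term similarities between two proteins."""
    tp, tq = CORPUS[p], CORPUS[q]
    best_p = [max(resnik(a, b) for b in tq) for a in tp]
    best_q = [max(resnik(a, b) for a in tp) for b in tq]
    return (sum(best_p) + sum(best_q)) / (len(best_p) + len(best_q))
```

Because the information content depends on how often each term appears in the corpus, the resulting scores change with the annotation corpus used, which is one reason the reliability of such measures needs to be assessed.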

Many different works have focused on the following tasks: (i) the definition of ad-hoc semantic similarity measures tailored to the characteristics of the Gene Ontology; (ii) the definition of measures for comparing genes and proteins; (iii) the introduction of methodologies for the systematic assessment of semantic similarity measures; (iv) the use of semantic similarity measures in many different contexts and applications. Despite its relevance, the application of semantic similarity to the systematic analysis of protein data is still an open research area. There are, in fact, two main questions that have to be addressed: (i) the systematic assessment of SS with respect to other biological features, i.e. to what extent a high or a low value of SS is biologically meaningful; (ii) the reliability of the SS measures themselves, i.e. whether there is any systematic error or bias in the calculation of SS. Both problems matter for the wider adoption of SS measures. For the first, several approaches have been proposed that compare SS measures with a plethora of different biological features, whereas only a few works have dealt with the second in a systematic way [5,6,7].

References

  1. Mario Cannataro, Pietro Hiram Guzzi, and Pierangelo Veltri. Protein-to-protein interactions: Technologies, databases, and algorithms. ACM Comput. Surv. 43(1):1, 2010.
  2. Francisco Azuaje, Haiying Wang, and Olivier Bodenreider. Ontology-driven similarity approaches to supporting gene functional assessment. Proc. of The Eighth Annual Bio-Ontologies Meeting, pp. 9--10, 2005.
  3. M. A. Harris, J. Clark, A. Ireland, J. Lomax, M. Ashburner, R. Foulger, K. Eilbeck, S. Lewis, B. Marshall, C. Mungall, J. Richter, G. M. Rubin, J. A. Blake, C. Bult, M. Dolan, H. Drabkin, J. T. Eppig, D. P. Hill, L. Ni, M. Ringwald, R. Balakrishnan, J. M. Cherry, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. Engel, D. G. Fisk, J. E. Hirschman, E. L. Hong, R. S. Nash, A. Sethuraman, C. L. Theesfeld, D. Botstein, K. Dolinski, B. Feierbach, T. Berardini, S. Mundodi, S. Y. Rhee, R. Apweiler, D. Barrell, E. Camon, E. Dimmer, V. Lee, R. Chisholm, P. Gaudet, W. Kibbe, R. Kishore, E. M. Schwarz, P. Sternberg, M. Gwinn, L. Hannick, J. Wortman, M. Berriman, V. Wood, P. Tonellato, P. Jaiswal, T. Seigfried, and R. White. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32 (Database Issue):258--261, 2004.
  4. Louis du Plessis, Nives Škunca, and Christophe Dessimoz. The what, where, how and why of gene ontology: a primer for bioinformaticians. Briefings in Bioinformatics, 2011.
  5. Pietro Hiram Guzzi, Marco Mina, Concettina Guerra, and Mario Cannataro. Semantic Similarity Measures: Assessment with biological features and Issues. Briefings in Bioinformatics, doi:10.1093/bib/bbr066, 2012.
  6. Da Wei Huang, Brad T. Sherman, and Richard A. Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37(1):1--13, 2009.
  7. Young-Rae Cho, Woochang Hwang, Murali Ramanathan, and Aidong Zhang. Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics, 8:265, 2007.

Published in

ACM SIGBioinformatics Record, Volume 2, Issue 3
September 2012
20 pages
ISSN: 2331-9291
EISSN: 2159-1210
DOI: 10.1145/2384691

Copyright © 2012 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
