Elsevier

Knowledge-Based Systems

Volume 90, December 2015, Pages 23-32
Improving network topology-based protein interactome mapping via collaborative filtering

https://doi.org/10.1016/j.knosys.2015.10.003

Abstract

High-throughput screening (HTS) techniques enable massive identification of protein–protein interactions (PPIs). Nonetheless, observing the full mapping of PPIs remains intractable. Given the acquired PPI data, scalable and inexpensive computational approaches to protein interactome mapping (PIM), which aims at increasing data confidence and predicting new PPIs, are therefore desirable. Network topology-based approaches have proven highly efficient in addressing this issue; yet their performance deteriorates significantly on sparse HTS-PPI networks. This work aims at implementing a highly efficient network topology-based approach to PIM via collaborative filtering (CF), a successful approach to handling sparse matrices in personalized-recommendation. The motivation is that the problems of PIM and personalized-recommendation have similar solution spaces, where the key is to model the relationship among the involved entities based on incomplete information. Integrating the idea of CF into a topology-based approach is therefore expected to improve its performance on sparse HTS-PPI networks. We first model the HTS-PPI data as an incomplete matrix, where each entry describes the interactome weight between the corresponding protein pair. Based on this matrix, we transform the functional similarity weight used in topology-based approaches into the inter-neighborhood similarity (I-Sim) to model the protein–protein relationship. Finally, we apply saturation-based strategies to the I-Sim model to obtain the CF-enhanced topology-based (CFT) approach to PIM.

Introduction

It is desirable to investigate protein–protein interactions (PPIs) in biological processes in order to clarify various biological mechanisms. Owing to the rapid progress of high-throughput screening (HTS) techniques [1], [2], [3], [4], data describing global PPI networks in organisms are accumulating fast; several large-scale HTS-PPI networks have been published for various organisms [5], [6], [7], [8], [9]. With them come unprecedented opportunities and challenges in studying biological events.

In spite of the high efficiency of HTS techniques, it is still intractable to observe the full PPI mapping due to the huge economic and time costs. Therefore, the problem of protein interactome mapping (PIM) arises, whose main task is to analyze the obtained HTS-PPIs for addressing the following two issues [10], [11], [12]:

  • (a) Assessment. Assessing the reliability of the obtained HTS-PPI data and rejecting unreliable HTS-PPIs to reduce their false-positive rate; and

  • (b) Prediction. Predicting unobserved interactomes based on the acquired HTS-PPI data.

Various efforts have been made to deal with the problem of PIM [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. State-of-the-art approaches can be categorized into the following groups:

  • (a) Experiment reproducibility-based approaches. They work by assigning high reliability to putative PPIs observed in multiple independent experiments [13], [14]. The additional experiments can be either HTS or more specific ones [13], [14]. The former enhance the precision of the repeatedly observed HTS-PPI data, while the latter aim at improving the reliability of HTS-PPI data by conducting different kinds of experiments [13], [14].

  • (b) Knowledge-based approaches. They address the problem of PIM based on prior knowledge regarding individual proteins. A representative approach of this kind is based on interologs [15], [16]. Pioneering works indicate that, with sufficient interolog data, a knowledge-based approach can achieve high performance in addressing the problem of PIM [15], [16].

  • (c) Information integration-based approaches [17], [18], [19], [20], [21]. They are based on the fact that biological evidence, e.g., structural and functional annotations, contains rich information regarding protein interactomes. Patil and Nakamura [17] propose filtering HTS-PPI data by integrating the information of genomic features. Dutkowski et al. [18] and Skunca et al. [19] propose that putative PPIs with functional homogeneity (functional similarity) or cellular localization coherence (cellular co-localization) are more reliable than those without. Liu et al. [20] and Troyanskaya et al. [21] propose integrating heterogeneous biological evidence, e.g., gene expression and genome context, into Bayesian network models to improve the quality of HTS-PPI data.

  • (d) Network topology-based approaches [14], [22], [23], [24], [25]. The aforementioned approaches cannot work without auxiliary information, e.g., additional experimental results. Network topology-based approaches, however, have no such limitation. Their main idea is to address the problem of PIM by analyzing the topology of the network corresponding to the available HTS-PPI data [14], [22], [23], [24], [25].

Saito et al. [22] rank the reliability of HTS-PPIs by the interaction generality (IG), which is an indexing metric inferred from the local topology attached to each pair of proteins. Brun et al. [23] employ Czekanowski–Dice distance (CD) to analyze the neighborhood topology of each protein in HTS-PPI networks for classification tasks. Chen et al. [24] propose the interaction reliability by alternative path (IRAP), which measures the topological connections among proteins through exploring path information of HTS-PPI networks. Chua et al. [14], [25] propose the functional similarity weight (FW), which is highly efficient in representing the protein–protein relationship based on the topological information from a given HTS-PPI network. As indicated in [23], [24], FW is generally more efficient than IG, CD and IRAP in addressing the problem of PIM, especially on large-scale HTS-PPI networks.
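As an illustration of how such neighborhood-based metrics operate, the following is a minimal sketch of the Czekanowski–Dice distance on a toy network. It assumes the commonly cited form of the metric, in which each protein's neighborhood includes the protein itself and the distance is the size of the symmetric difference divided by the sum of the sizes of the union and the intersection. The function names and the toy edge list are hypothetical and are not taken from the compared works.

```python
from collections import defaultdict


def build_neighborhoods(edges):
    """Map each protein to its neighbor set, including the protein itself."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].update((u, v))
        nbrs[v].update((u, v))
    return nbrs


def cd_distance(nbrs, u, v):
    """Czekanowski-Dice distance between the neighborhoods of u and v.

    Assumed form: |Nu symmetric-difference Nv| / (|Nu union Nv| + |Nu intersect Nv|).
    A protein absent from the network is treated as having only itself as neighbor.
    """
    nu = nbrs.get(u, {u})
    nv = nbrs.get(v, {v})
    return len(nu ^ nv) / (len(nu | nv) + len(nu & nv))


# Toy binary HTS-PPI network (hypothetical protein labels).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
nbrs = build_neighborhoods(edges)

d_ab = cd_distance(nbrs, "A", "B")  # identical neighborhoods
d_ad = cd_distance(nbrs, "A", "D")  # only C is shared
```

A distance of 0 means the two proteins have identical neighborhoods, and values close to 1 mean they share almost no neighbors. On sparse networks most neighborhoods are small, so such scores rest on very little evidence.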

Topology-based approaches have the advantage of relying purely on HTS-PPI data without requiring any additional information about proteins [14], [22], [23], [24], [25]. However, they share the drawback of low efficiency on sparse HTS-PPI networks [14], [25], [26], [27]. Unfortunately, such cases are very common in real applications [1], [2], [3], [4], [5], [6], [7], [8], [9], [14], [25]. To address this issue, this work aims at implementing a highly efficient network topology-based approach to PIM via collaborative filtering (CF).

CF was initially designed for the problem of personalized-recommendation in recommender systems in the area of e-commerce [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. Recommender systems are vital for online commercial applications with large amounts of data, e.g., online book stores, online theatres, and online telecom services [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. They commonly involve three fundamental kinds of entities, i.e., users, items (e.g., movies and news), and user-item usage history (e.g., scores and comments). The main task is to extract useful patterns reflecting the connections between users and items from the user-item usage history, and then make reliable predictions for possible user-item links according to these patterns [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. Based on these predictions, recommender systems are able to grasp users’ potential preferences hidden in the historical data, generate corresponding recommendations with high accuracy, and drastically improve users’ experience when they use the enhanced applications [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. In a recommender system, due to the large number of items, each user can only touch a tiny fraction of the whole item set, so the observed user-item entries are far fewer than the missing ones. In other words, the problem of personalized-recommendation is characterized by sparsity, and CF-based approaches have proven to be very effective in dealing with it [28], [29], [30], [31], [32], [33], [34], [35], [36], [37].

Through careful investigation of these two problems, i.e., PIM and personalized-recommendation, we find that they share similar solution spaces, where the principle is to model the relationship among the involved entities based on incomplete information. Motivated by this observation, we propose to integrate the idea of CF into network topology-based PIM to achieve high performance. As demonstrated by the experimental results, the proposed approach is able to outperform several sophisticated ones on large, sparse HTS-PPI networks. To the best of the authors’ knowledge, no such effort has been reported in previous work.
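The analogy can be made concrete with a small sketch that stores binary HTS-PPI data as a symmetric incomplete matrix, in the same way a recommender system stores user-item usage history. This is only an illustration under stated assumptions: observed interactions receive the weight 1, unobserved pairs are kept as None (unknown rather than absent), and the function names and protein labels are hypothetical. The exact matrix definition used by the proposed approach is not reproduced here.

```python
def build_interactome_matrix(proteins, observed_pairs):
    """Return a symmetric matrix with 1 for observed PPIs and None elsewhere."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    matrix = [[None] * n for _ in range(n)]
    for u, v in observed_pairs:
        i, j = index[u], index[v]
        matrix[i][j] = matrix[j][i] = 1  # assumed binary weight
    return matrix


def density(matrix):
    """Fraction of distinct protein pairs that have an observed entry."""
    n = len(matrix)
    total = n * (n - 1) // 2
    known = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] is not None
    )
    return known / total if total else 0.0


# Toy example with hypothetical protein labels.
proteins = ["P1", "P2", "P3", "P4", "P5"]
observed = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
matrix = build_interactome_matrix(proteins, observed)
ratio = density(matrix)  # 3 observed pairs out of 10 possible
```

Keeping unobserved pairs as unknown, rather than as zeros, mirrors the CF setting, where a missing user-item entry means that no usage was recorded and not that the user dislikes the item.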

We have validated the performance of the proposed approach on three large, public, real-world datasets, namely the Tong dataset [38], the BIND dataset [39], and the IntAct dataset [40]. Note that our experiments employ Gene Ontology (GO)-based annotations to evaluate the involved algorithms. GO, one of the most important ontologies in the bioinformatics community [41], [42], is a large database containing annotations regarding various characteristics of many proteins. It consists of numerous GO terms, each of which annotates one characteristic of the involved proteins. GO is organized into three ontologies: cellular component, biological process, and molecular function. In our context, a) cellular component indicates where a protein appears as a part of a cell or its extracellular environment, b) molecular function describes the elemental activities of a protein at the molecular level, and c) biological process denotes operations or sets of molecular events related to specific proteins. In our experiments, we employ these three ontologies in GO annotations as the ground-truth labels to measure the performance of the tested algorithms; such validation protocols are commonly adopted in related works [14], [24], [25], [26], [27].
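One simple way to use annotations as ground truth is to count a protein pair as functionally consistent when both proteins share at least one term, and to report the fraction of consistent pairs. The sketch below illustrates this idea only; the exact protocol of the experiments is not given in this section, and the function names, protein labels, and term labels are hypothetical placeholders rather than real GO identifiers.

```python
def shares_term(annotations, u, v):
    """True if proteins u and v have at least one annotation term in common."""
    return bool(annotations.get(u, set()) & annotations.get(v, set()))


def functional_homogeneity(annotations, pairs):
    """Fraction of annotated pairs whose proteins share a term.

    Pairs containing an unannotated protein are skipped (an assumption),
    since they can be neither confirmed nor rejected.
    """
    annotated = [(u, v) for u, v in pairs if u in annotations and v in annotations]
    if not annotated:
        return 0.0
    hits = sum(shares_term(annotations, u, v) for u, v in annotated)
    return hits / len(annotated)


# Toy annotations with placeholder term labels.
annotations = {
    "P1": {"term_a", "term_b"},
    "P2": {"term_b"},
    "P3": {"term_c"},
}
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]
score = functional_homogeneity(annotations, pairs)
```

Under this reading, the score would be computed separately for each of the three ontologies, and a ranking metric that places more consistent pairs at the top would be judged more reliable.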

The rest of this paper is organized as follows. Section 2 gives the preliminaries. Section 3 presents our method. Section 4 presents and discusses the experimental results. Finally, Section 5 concludes this paper.

Section snippets

Preliminaries

Among current approaches to PIM, the topology-based ones are highly efficient and, more importantly, rely purely on the HTS-PPI data. With them, a given HTS-PPI network is modeled as an undirected graph G = (V, E), where each vertex u ∈ V denotes a specific protein, and each edge (u, v) ∈ E denotes an observed HTS-PPI between proteins u and v. Topology-based methods work by exploring the topological structure of G to model the relationships among the involved proteins. Their basis is the widely

Interactome weight matrix

In CF models, the given data is usually modeled into an incomplete matrix, where each known entry is built based on corresponding user-item usage history. With such a matrix, they build patterns reflecting the connections among involved users and items, thereby generating reliable recommendations [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. From this point of view, we adopt the idea of CF to transform the given HTS-PPI data into an incomplete matrix as the data source. We define

Compared algorithms

The objective is to validate the performance of CFT. Therefore, we compare it with other sophisticated topology-based methods which employ IG, CD and FW, respectively. Note that in this work we consider the cases where only binary HTS-PPI data are available; therefore, each compared algorithm is implemented in the corresponding binary version to draw fair comparisons.

Experimental datasets

All experiments are conducted on three publicly available HTS-PPI datasets, whose details are as follows:

  • (I) D1 is the Tong dataset

Conclusions and further studies

Protein–protein interactions (PPIs) are massively identified by high-throughput screening (HTS) techniques; however, HTS-PPI data suffer from low reliability, thereby leading to the problem of protein interactome mapping (PIM). Network topology-based approaches can handle the problem of PIM with high efficiency purely relying on HTS-PPI data, yet their performance deteriorates significantly on sparse HTS-PPI networks. Unfortunately, such cases are very common in real applications. For

Acknowledgments

This research is supported in part by the Young Scientist Foundation of Chongqing under Grant Number cstc2014kjrc-qnrc40005, in part by the National Natural Science Foundation of China under Grant Numbers 61202347, 61472051, 61272194, 61373086 and 61401385, in part by the Postdoctoral Science Funded Project of Chongqing under Grant Number Xm2014043, in part by the Fundamental Research Funds for the Central Universities under Grant Number 106112015CDJXY180005 and

References (43)

  • Uetz P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature (2000)
  • Collins S.R. et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell. Proteomics (2007)
  • Ho Y. et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature (2002)
  • Miller J.P. et al. Large-scale identification of yeast integral membrane protein interactions. Proc. Natl. Acad. Sci. USA (2005)
  • Venkatesan K. et al. An empirical framework for binary interactome mapping. Nat. Methods (2009)
  • Simonis N. et al. Empirically controlled mapping of the Caenorhabditis elegans protein-protein interactome network. Nat. Methods (2009)
  • Yu H.Y. et al. High-quality binary protein interaction map of the yeast interactome network. Science (2008)
  • Giot L. et al. A protein interaction map of Drosophila melanogaster. Science (2003)
  • Braun P. et al. Evidence for network evolution in an Arabidopsis interactome map. Science (2011)
  • Edwards A.M. et al. Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet. (2002)
  • Guimera R. et al. Missing and spurious interactions and the reconstruction of complex networks. Proc. Natl. Acad. Sci. USA (2009)
  • Kelley R. et al. Systematic interpretation of genetic interactions using protein networks. Nat. Biotechnol. (2005)
  • Varjosalo M. et al. Interlaboratory reproducibility of large-scale human protein-complex analysis by standardized AP-MSMS. Nat. Methods (2013)
  • Chua H.N. et al. Increasing the reliability of protein interactomes. Drug Discov. Today (2008)
  • Matthews L.R. et al. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res. (2001)
  • Tarailo M. et al. Synthetic lethal interactions identify phenotypic “interologs” of the spindle assembly checkpoint components. Genetics (2007)
  • Patil A. et al. Filtering high-throughput protein-protein interaction data using a combination of genomic features. BMC Bioinformatics (2005)
  • Dutkowski J. et al. A gene ontology inferred from molecular networks. Nat. Biotechnol. (2013)
  • Skunca N. et al. Quality of computationally inferred gene ontology annotations. PLoS Comput. Biol. (2012)
  • Liu G.M. et al. Assessing and predicting protein interactions using both local and global network topological metrics. Genome Informatics Series (2008)
  • Troyanskaya O.G. et al. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci. USA (2003)