2005 | Book

Grid Computing in Life Science

First International Workshop on Life Science Grid, LSGRID 2004, Kanazawa, Japan, May 31-June 1, 2004, Revised Selected and Invited Papers

Edited by: Akihiko Konagaya, Kenji Satou

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Researchers in the field of life sciences rely increasingly on information technology to extract and manage relevant knowledge. The complex computational and data management needs of life science research make Grid technologies an attractive support solution. However, many important issues must be addressed before the Life Science Grid becomes commonplace.

The 1st International Life Science Grid Workshop (LSGRID 2004) was held in Kanazawa, Japan, May 31–June 1, 2004. This workshop focused on life science applications of grid systems, especially for bionetwork research and systems biology, which require heterogeneous data integration from genome to phenome, mathematical modeling and simulation from molecular to population levels, and high-performance computing including parallel processing, special hardware and grid computing.

Fruitful discussions took place through 18 oral presentations, including a keynote address and five invited talks, and 16 poster and demonstration presentations in the fields of grid infrastructure for life sciences, systems biology, massive data processing, databases and data grids, grid portals and pipelines for functional annotation, parallel and distributed applications, and life science grid projects. The workshop emphasized the practical aspects of grid technologies in terms of improving grid-enabled data/information/knowledge sharing, high-performance computing, and collaborative projects. There was agreement among the participants that the advancement of grid technologies for life science research requires further concerted actions and promotion of grid applications. We therefore concluded the workshop with the announcement of LSGRID 2005.

Table of Contents

Frontmatter

Life Science Grid

Gene Trek in Procaryote Space Powered by a GRID Environment

Abstract

More than 100 microbial genomes have been sequenced since 1995, and thousands of microbial genomes will be sequenced within a decade. This implies that millions of open reading frames (ORFs) will be predicted and will need to be evaluated. We therefore need a high-throughput system to evaluate the predicted ORFs and to understand the functions of genes based on comparative genomics. We established a protocol for the prediction and evaluation of ORFs and applied it to the genome sequences of 124 microbes that were available from the International Nucleotide Sequence Database as of June 2003. Thanks to the GRID environment, we were able to evaluate about 300,000 predicted ORFs based on clustering and horizontal gene transfer analysis. This paper mainly introduces the scheme of the GRID environment applied to comparative genomics.

Hideaki Sugawara

An Integrated System for Distributed Bioinformatics Environment on Grids

Abstract

In this paper, an integrated system called OBIEnv, which has been developed on OBIGrid, is described. In addition to automatic database transfer and deployment, it provides various functionalities for transparent and fault-tolerant processing of bioinformatics tasks on the Grid. A feasibility study on the analysis of horizontal gene transfer was conducted using 119 heterogeneous Linux nodes at 5 different sites, and OBIEnv proved its applicability to practical problems in bioinformatics.

Kenji Satou, Yasuhiko Nakashima, Shin’ichi Tsuji, Xavier Defago, Akihiko Konagaya

Distributed Cell Biology Simulations with E-Cell System

Abstract

Many useful applications of simulation in computational cell biology, e.g. kinetic parameter estimation, Metabolic Control Analysis (MCA), and bifurcation analysis, require a large number of repetitive runs with different input parameters. The heavy requirements imposed by these analysis methods on computational resources have led to an increased interest in parallel and distributed computing technologies.

We have developed a scripting environment that can execute, and where possible, automatically parallelize those mathematical analysis sessions transparently on any of (1) single-processor workstations, (2) Shared-memory Multiprocessor (SMP) servers, (3) workstation clusters, and (4) computational grid environments. This computational framework, E-Cell SessionManager (ESM), is built upon E-Cell System Version 3, a generic software environment for the modeling, simulation, and analysis of whole-cell scale biological systems.

Here we introduce the ESM architecture and provide results from benchmark experiments that addressed two typical computationally intensive biological problems: (1) a parameter estimation session of a small hypothetical pathway and (2) simulations of a stochastic E. coli heat-shock model with different random number seeds to obtain the statistical characteristics of the stochastic fluctuations.
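The pattern of repeating one stochastic run under many random number seeds, then summarizing the fluctuations, can be illustrated with a minimal Python sketch. It does not use the E-Cell or ESM API: `toy_stochastic_run` is a hypothetical stand-in for a real model, `run_ensemble` is an illustrative name, and the thread pool merely stands in for whatever dispatcher (SMP, cluster, or grid) would distribute the jobs in practice.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def toy_stochastic_run(seed: int, steps: int = 1000) -> float:
    """Hypothetical stand-in for one stochastic simulation run.

    Simulates a simple birth-death process and returns the final copy number.
    """
    rng = random.Random(seed)  # independent, reproducible stream per run
    count = 50
    for _ in range(steps):
        if rng.random() < 0.5:
            count += 1      # birth event
        elif count > 0:
            count -= 1      # death event (copy number never goes negative)
    return float(count)


def run_ensemble(seeds, max_workers: int = 4) -> dict:
    """Run one independent job per seed and summarize the fluctuations."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the seeds in the results
        results = list(pool.map(toy_stochastic_run, seeds))
    return {
        "runs": len(results),
        "mean": statistics.mean(results),
        "stdev": statistics.stdev(results),
        "results": results,
    }


if __name__ == "__main__":
    summary = run_ensemble(range(100))
    print(summary["runs"], summary["mean"], summary["stdev"])
```

Because each run depends only on its seed, the jobs are independent and can be dispatched in any order, which is what makes this kind of session easy to parallelize.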

Masahiro Sugimoto, Kouichi Takahashi, Tomoya Kitayama, Daiki Ito, Masaru Tomita

The Architectural Design of High-Throughput BLAST Services on OBIGrid

Abstract

OBIGrid provides high-throughput GRIDBLAST services (OBIGbs) for researchers who need to deal with many BLAST query sequences at one time by exploiting both distributed processing and parallel processing. A new application-oriented grid framework has been introduced to split a BLAST query into independent sub-queries and to execute the sub-queries on remote personal computers and PC clusters connected by a virtual private network (VPN) over the Internet. The framework consists of five functional units: query splitter, job dispatcher, task manager, result collector and result formatter. They enable us to develop a cooperative GRIDBLAST system between a server and heterogeneous remote worker nodes, which consist of various computer architectures, different BLAST implementations and different job schedulers operated under local resource management policies. The OBIGbs can execute 29,941 PSI-BLAST query sequences in 8.31 hours when using 230 CPUs in total and can return a 1.37-gigabyte result file.
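As a rough illustration of two of the five units named above, the query splitter and the result collector, the following Python sketch splits a multi-sequence FASTA query into independent sub-queries and merges per-sub-query results back into the original order. This is not the OBIGbs implementation: the function names, the round-robin splitting policy, and the dictionary-based result format are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_query(records: List[Record], n_subqueries: int) -> List[List[Record]]:
    """Distribute records round-robin over at most n_subqueries sub-queries."""
    if n_subqueries < 1:
        raise ValueError("n_subqueries must be at least 1")
    buckets: List[List[Record]] = [[] for _ in range(n_subqueries)]
    for i, record in enumerate(records):
        buckets[i % n_subqueries].append(record)
    return [b for b in buckets if b]  # drop empty sub-queries


def collect_results(sub_results: List[Dict[str, object]],
                    original_order: List[str]) -> List[Tuple[str, object]]:
    """Merge per-sub-query results (header -> result) into query order."""
    merged: Dict[str, object] = {}
    for part in sub_results:
        merged.update(part)
    return [(header, merged[header]) for header in original_order]
```

In a real deployment, a job dispatcher would send each sub-query to a worker running BLAST; here any function that maps a list of records to a `header -> result` dictionary can play that role.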

Fumikazu Konishi, Akihiko Konagaya

Heterogeneous Database Federation Using Grid Technology for Drug Discovery Process

Abstract

The rapid progress of biotechnology provides an increasing number of life science databases. These databases have been operated and managed individually on the Internet. Under these circumstances, an infrastructure is needed that allows the information contained in these databases to be shared and research collaboration to be conducted. Grid technology is an emerging technology for seamless and loose integration of diverse resources distributed on the Internet. In order to achieve federation of the heterogeneous databases, we have developed a system for supporting a drug discovery process using Globus Toolkit 3/OGSA-DAI. As an essential part of the system, we introduce a protein-compound interaction search based on metadata bridging protein and compound information with their interaction types, such as inhibitor, agonist, and antagonist. The effectiveness of our system is demonstrated by searching for the candidate compounds interacting with the glucocorticoid receptor protein.

Yukako Tohsato, Takahiro Kosaka, Susumu Date, Shinji Shimojo, Hideo Matsuda

Grid Portal Interface for Interactive Use and Monitoring of High-Throughput Proteome Annotation

Abstract

High-throughput proteome annotation refers to the activity of extracting information from all proteins in a particular organism using bioinformatics software on a high-performance computing platform such as the grid. The Encyclopedia of Life (EOL) project [1] aims to catalog all proteins in all species for public benefit using an Integrative Genome Annotation Pipeline [2] (iGAP). The intrinsic complexity of the pipeline makes iGAP an ideal life sciences application to drive grid software development. It is a flagship application for the TeraGrid project [3]. The deployment of iGAP on the grid using grid middleware and mediator software has been described previously [4]. The heterogeneous and distributed computing environment on the grid requires an interactive user interface where jobs may be submitted and monitored. Here we describe our international collaborative effort in creating a grid portal solution for grid computing in life sciences under the auspices of PRAGMA [5]. Specifically, the development of GridMonitor for interactive monitoring of the iGAP workflow and the use of a GridSpeed [6] generated iGAP application portal are described. The portal solution was part of the EOL demonstration at Supercomputing 2003 (SC’03) [7], where resources from 6 institutions on 5 continents were utilized to annotate more than 36,000 proteins from various species. It is a testimony to the necessity and expediency of international collaboration in an increasingly global grid computational environment to advance life sciences research.

Atif Shahab, Danny Chuon, Toyotaro Suzumura, Wilfred W. Li, Robert W. Byrnes, Kouji Tanaka, Larry Ang, Satoshi Matsuoka, Philip E. Bourne, Mark A. Miller, Peter W. Arzberger

Grid Workflow Software for a High-Throughput Proteome Annotation Pipeline

Abstract

The goal of the Encyclopedia of Life (EOL) Project is to predict structural information for all proteins, in all organisms. This calculation presents challenges both in the scale of the computational resources required (approximately 1.8 million CPU hours) and in data and workflow management. While tools are available that solve some subsets of these problems, it was necessary for us to build software to integrate and manage the overall Grid application execution. In this paper, we present this workflow system, detail its components, and report on the performance of our initial prototype implementation for runs over a large-scale Grid platform during the SC’03 conference.

Adam Birnbaum, James Hayes, Wilfred W. Li, Mark A. Miller, Peter W. Arzberger, Philip E. Bourne, Henri Casanova

Genome-Wide Functional Annotation Environment for Thermus thermophilus in OBIGrid

Abstract

We developed OBITco (Open BioInfomatics Thermus thermophilus Cyber Outlet) for gene annotation of the T. thermophilus HB8 strain. To provide system services for a large number of researchers in the project, we adopted Web-based technology and a high-level user authentication system with three functions: a rollback function, a hierarchical representative function, and easy and systematic annotation. The robust and secure network connection protects the confidential information within the project, so researchers can easily access real-time information on DNA sequences, ORF annotations or homology search results. T. thermophilus HB8 possesses 2,195 ORFs, 1,156 intergenic regions, 47 putative tRNA regions, and 6 rRNA regions. BLAST against the nr/nt database and InterProScan for all ORFs were used to obtain homology hit records. The system provides an ORF viewer to show basic information on ORFs and database homology hit records. Researchers can update the annotation information of an ORF with a simple operation, and the new annotation is then applied to the central database in real time. The latest information can be utilized for lab experiments such as functional analysis, network analysis and structural analysis. The system can also be utilized as a data storage and exchange place for the researchers' everyday experiments.

Akinobu Fukuzaki, Takeshi Nagashima, Kaori Ide, Fumikazu Konishi, Mariko Hatakeyama, Shigeyuki Yokoyama, Seiki Kuramitsu, Akihiko Konagaya

Parallel Artificial Intelligence Hybrid Framework for Protein Classification

Abstract

Proteins are classified into families based on structural or functional similarities. Artificial intelligence methods such as Hidden Markov Models, Neural Networks and Fuzzy Logic have been used individually in the field of bioinformatics for tasks such as protein classification and microarray data analysis. We integrate these three methods into a protein classification system for the purpose of drug target identification. Through integration, the strengths of each method can be harnessed as one, and their weaknesses compensated for. Artificial intelligence methods are more flexible than traditional multiple alignment methods and hence offer greater problem-solving potential.

Martin Chew Wooi Keat, Rosni Abdullah, Rosalina Abdul Salam

Parallelization of Phylogenetic Tree Inference Using Grid Technologies

Abstract

The maximum likelihood method is considered one of the most reliable methods for phylogenetic tree inference. However, as the number of species increases, the approach quickly loses its applicability because of the explosive, exponential growth in the number of trees that need to be considered. An earlier work by one of the authors [3] demonstrated that decomposing the trees into fragments called splits, calculating the individual likelihood of each (small) split, and combining them yields a very close approximation of the true maximum likelihood value while significantly reducing the computational cost. However, the cost was still significant for a practical number of species. To solve this problem, we further extend the algorithm so that it can be effectively parallelized in a Grid environment using Grid middleware such as Ninf and Jojo, and we also apply combinatorial optimization techniques. Combined, these changes achieved a speedup of over 64 times relative to our previous results on a testbed of 16 nodes, with favorable speedup characteristics.
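The reported figures allow a small back-of-the-envelope check. Assuming, for illustration only, that parallelization by itself yields at most a linear speedup equal to the number of nodes (superlinear effects such as caching are ignored), the remainder of a combined speedup must come from the algorithmic changes. The helper below is a hypothetical name introduced for this sketch, not something from the paper.

```python
def algorithmic_factor_lower_bound(total_speedup: float, nodes: int) -> float:
    """Lower bound on the non-parallel share of a combined speedup.

    Assumes parallelism alone contributes at most a factor of `nodes`
    (linear scaling), so the rest must come from algorithmic improvements.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return total_speedup / nodes


if __name__ == "__main__":
    # Figures from the abstract: over 64x speedup on a 16-node testbed.
    print(algorithmic_factor_lower_bound(64, 16))  # 4.0
```

Under that assumption, a combined speedup of over 64 on 16 nodes implies that the combinatorial optimization techniques account for a factor of more than 4 on their own.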

Yo Yamamoto, Hidemoto Nakada, Hidetoshi Shimodaira, Satoshi Matsuoka

EMASGRID: An NBBnet Grid Initiative for a Bioinformatics and Computational Biology Services Infrastructure in Malaysia

Abstract

The plethora of bioinformatics tools currently available to the biologist, and the diversity of problems that these applications were designed to solve, have necessitated a look at providing a single environment that can serve as an interface to these many applications. At the same time, this environment should also be able to function as an infrastructure resource with adequate computational capacity to handle the data volume currently available from numerous genomics projects. We discuss the setting up of a national-level infrastructure initiative that utilizes grid computing technology to serve geographically distributed users in Malaysia. This infrastructure was designed to provide access to high-performance computational resources made available by Sun Microsystems' Sun Grid Engine (SGE) using different interfaces that access a pipeline of bioinformatics software. The underlying computer systems, such as operating systems and high-performance computing applications, were bypassed with the creation of an application layer of bioinformatics tools (BioGrappler) or by accessing the compute resources through the Grid Engine Portal (BioBox).

Mohd Firdaus Raih, Mohd Yunus Sharum, Raja Murzaferi Raja Moktar, Mohd Noor Mat Isa, Ng Lip Kian, Nor Muhammad Mahadi, Rahmah Mohamed

Development of a Grid Infrastructure for Functional Genomics

Abstract

The BRIDGES project is incrementally developing and exploring database integration over six geographically distributed research sites within the framework of a Wellcome Trust biomedical research project (the Cardiovascular Functional Genomics project) to provide a sophisticated infrastructure for bioinformaticians. Grid technologies are being used to facilitate this integration. Key issues to be investigated in BRIDGES are data integration and data federation, security, user friendliness, access to large-scale computational facilities and incorporation of existing bioinformatics software solutions, both for visualisation and for analysis of genomic data sets. This paper outlines the initial experiences in applying Grid technologies and the on-going designs put forward to address these issues.

Richard Sinnott, Micha Bayer, Derek Houghton, David Berry, Magnus Ferrier

Building a Biodiversity GRID

Abstract

In the BiodiversityWorld project we are building a GRID to support scientific biodiversity-related research. The requirements associated with such a GRID are somewhat different from other GRIDs, and this has influenced the architecture that we have developed. In this paper we outline these requirements, most notably the need to interoperate over a diverse set of legacy databases and applications in an environment that supports effective resource discovery and use of these resources in complex workflows. Our architecture provides an invocation model that is usable over a wide range of resource types and underlying GRID middleware. However, there is a trade-off between the flexibility provided by our architecture and its performance. We discuss how this affects the inclusion of computationally intensive applications and applications that are highly interactive; we also consider the broader issue of interoperation with other GRIDs.

Andrew C. Jones, Richard J. White, W. Alex Gray, Frank A. Bisby, Neil Caithness, Nick Pittas, Xuebiao Xu, Tim Sutton, Nick J. Fiddian, Alastair Culham, Malcolm Scoble, Paul Williams, Oliver Bromley, Peter Brewer, Chris Yesson, Shonil Bhagwat

Mega Process Genetic Algorithm Using Grid MP

Abstract

In this study, a new Genetic Algorithm (GA) using the Tabu · Local Search mechanism is proposed. The GA described in this paper is considered a Mega Process GA, which has an effective mechanism to use massive numbers of processors, i.e., Mega Processors, in large-scale computing systems. Our proposed method has a GA-specific database that holds information on the searched space and performs a local search in the space that has not yet been searched. Such mechanisms enable us to quantify the proportion of the space that has been searched during the search. Using this information, the searched space can be expanded linearly as the number of computing resources increases, and an exhaustive search is guaranteed given unlimited computation. The proposed GA was applied to numerical test functions and to the energy minimization problems of protein tertiary structures. The latter problem was run in a heterogeneous distributed computing environment, which was built with Grid MP produced by United Devices Inc.

Yoshiko Hanada, Tomoyuki Hiroyasu, Mitsunori Miki, Yuko Okamoto

“Gridifying” an Evolutionary Algorithm for Inference of Genetic Networks Using the Improved GOGA Framework and Its Performance Evaluation on OBI Grid

Abstract

This paper presents a genetic algorithm running on a grid computing environment for inference of genetic networks. In bioinformatics, inference of genetic networks is one of the most important problems, in which mutual interactions among genes are estimated by using gene-expression time-course data. Network-Structure-Search Evolutionary Algorithm (NSS-EA) is a promising inference method for genetic networks that employs the S-system as a model of a genetic network and a genetic algorithm (GA) as a search engine. In this paper, we propose an implementation of NSS-EA running on a multi-PC-cluster grid computing environment where multiple PC clusters are connected over the Internet. We “Gridify” NSS-EA by using a framework for the development of GAs running on a multi-PC-cluster grid environment, named Grid-Oriented Genetic Algorithm Framework (GOGA Framework). We examined whether the “Gridified” NSS-EA works correctly and evaluated its performance on Open Bioinformatics Grid (OBIGrid) in Japan.

Hiroaki Imade, Naoaki Mizuguchi, Isao Ono, Norihiko Ono, Masahiro Okamoto

Backmatter

Title: Grid Computing in Life Science
Edited by: Akihiko Konagaya, Kenji Satou
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-32251-1
Print ISBN: 978-3-540-25208-5
DOI: https://doi.org/10.1007/b106923