2018 | Book

Scalable Big Data Analytics for Protein Bioinformatics

Efficient Computational Solutions for Protein Structures

About this book

This book focuses on proteins and their structures. The text describes various scalable solutions for protein structure similarity searching, carried out at the main representation levels, and for prediction of 3D structures of proteins. Emphasis is placed on techniques that can be used to accelerate similarity searches and protein structure modeling processes.
The content of the book is divided into four parts. The first part provides background information on proteins and their representation levels, including a formal model of a 3D protein structure used in computational processes, and a brief overview of the technologies used in the solutions presented in the book. The second part of the book discusses Cloud services that are utilized in the development of scalable and reliable cloud applications for 3D protein structure similarity searching and protein structure prediction. The third part of the book shows the utilization of scalable Big Data computational frameworks, like Hadoop and Spark, in massive 3D protein structure alignments and identification of intrinsically disordered regions in protein structures. The fourth part of the book focuses on finding 3D protein structure similarities, accelerated with the use of GPUs and the use of multithreading and relational databases for efficient approximate searching on protein secondary structures.
The book introduces advanced techniques and computational architectures that benefit from recent achievements in the field of computing and parallelism. Recent developments in computer science have allowed algorithms previously considered too time-consuming to now be efficiently used for applications in bioinformatics and the life sciences. Given its depth of coverage, the book will be of interest to researchers and software developers working in the fields of structural bioinformatics and biomedical databases.

Table of Contents

Frontmatter

Background

Frontmatter
Chapter 1. Formal Model of 3D Protein Structures for Functional Genomics, Comparative Bioinformatics, and Molecular Modeling
Abstract
Proteins are the main molecules of life. Understanding their structures, functions, mutual interactions, activity in cellular reactions, interactions with drugs, and expression in body cells is key to efficient medical diagnosis, drug production, and treatment of patients. This chapter shows how proteins can be represented in processes performed in scientific fields such as functional genomics, comparative bioinformatics, and molecular modeling. The chapter begins with the general definition of protein spatial structure, which can be treated as a base for deriving other forms of representation. The general definition is then referenced to four representation levels of protein structure: primary, secondary, tertiary, and quaternary structures. This is followed by a short description of protein geometry. Finally, the chapter discusses energy features that can be calculated based on the general description of protein structure. The formal model defined in the chapter will be used in the description of the efficient solutions and algorithms presented in the following chapters of the book.
Dariusz Mrozek
Chapter 2. Technological Roadmap
Abstract
Scientific solutions presented in this book rely on various technologies that emerged in computer science. Some of them emerged recently and are quite new in the bioinformatics field. Others have been widely used for many years in developing efficient and reliable IT systems supporting various forms of business, but are not frequently used in bioinformatics. This chapter provides a technological roadmap for the solutions presented in this book. It covers a brief introduction to the concept of cloud computing, cloud service, and deployment models. It also defines the Big Data challenge and presents the benefits of using multi-threading in scientific computations. It then explains graphics processing units (GPUs) and the CUDA architecture. Finally, it focuses on relational databases and the SQL language used for declarative querying.
Dariusz Mrozek

Cloud Services for Scalable Computations

Frontmatter
Chapter 3. Azure Cloud Services
Abstract
Microsoft Azure Cloud Services support development of scalable and reliable cloud applications that can be used for scientific computing. This chapter provides a brief introduction to the Microsoft Azure cloud platform and its services. It focuses on Azure Cloud Services that allow building a cloud-based application with the use of web roles and worker roles. Finally, it shows a sample application that can be quickly developed on the basis of these two types of roles, and it emphasizes the role of queues in passing messages between components of the built system.
Dariusz Mrozek
Chapter 4. Scaling 3D Protein Structure Similarity Searching with Azure Cloud Services
Abstract
Azure Cloud Services allow building scalable and reliable software applications that perform computations in the Cloud. These applications are built with the use of Web roles and Worker roles that abstract from the Cloud infrastructure. In this chapter, we will see how the Cloud computing architecture and Azure Cloud Services can be utilized to scale out and scale up protein similarity searches with Cloud4PSi, a system developed for the Microsoft Azure public cloud. We will see the architecture of the system, its components, communication flow, and the advantages of using a queue-based model over direct communication between computing units. Results of various experiments confirm that protein structure similarity searching can be successfully scaled on cloud platforms by using computation units of different sizes and by adding more computation units.
Dariusz Mrozek
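The queue-based communication model mentioned in the Chapter 4 abstract can be illustrated with a minimal local sketch. It is not the Cloud4PSi implementation: in-process threads stand in for worker roles, Python's `queue.Queue` stands in for a cloud message queue, and the names `worker_role`, `run_search`, and the toy scoring function are hypothetical.

```python
import queue
import threading


def worker_role(tasks, results):
    """Hypothetical worker: consume messages until a None sentinel arrives."""
    while True:
        message = tasks.get()
        if message is None:
            break
        query, candidate = message
        # Toy score (an assumption, not a real structural measure):
        # number of positions where the two strings agree.
        score = sum(a == b for a, b in zip(query, candidate))
        results.put((candidate, score))


def run_search(query, candidates, n_workers=3):
    """Distribute comparisons through a queue instead of calling workers directly."""
    tasks = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=worker_role, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for candidate in candidates:
        tasks.put((query, candidate))
    for _ in workers:
        tasks.put(None)  # one stop sentinel per worker
    for w in workers:
        w.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    # Best score first; ties broken by name for a stable order.
    return sorted(collected, key=lambda r: (-r[1], r[0]))


if __name__ == "__main__":
    print(run_search("HHEEC", ["HHEEC", "HHHHH", "CCCCC"]))
```

Because the producer only writes to the queue, the number of workers can change without touching the code that submits tasks, which is the kind of decoupling a queue-based design offers over direct calls.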
Chapter 5. Cloud Services for Efficient Ab Initio Predictions of 3D Protein Structures
Abstract
Computational methods for protein structure prediction enable determination of the three-dimensional structure of a protein based solely on its amino acid sequence. However, conventional calculations of protein structure may be time-consuming and may require ample computational resources, especially when carried out with the use of ab initio methods. In this chapter, we will see how Cloud Services may help to solve these problems by scaling the computations in a role-based and queue-based Cloud4PSP system, deployed in the Microsoft Azure cloud. The chapter shows the system architecture, the Cloud4PSP processing model, and results of various scalability tests that speak in favor of the presented architecture.
Dariusz Mrozek

Big Data Analytics in Protein Bioinformatics

Frontmatter
Chapter 6. Foundations of the Hadoop Ecosystem
Abstract
The era of Big Data that we entered several years ago has changed our perception of the type and the volume of data that can be processed, as well as the value of the data. Hadoop and the MapReduce processing model have revolutionized the way we process and analyze data today and how much important and valuable information we can get from the data. At the moment, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. In this chapter, we will briefly describe the Hadoop ecosystem. We will also focus on two elements of the ecosystem: Apache Hadoop and Apache Spark. We will provide details of the MapReduce processing model and the differences between MapReduce 1.0 and MapReduce 2.0. The concepts defined here are important for understanding the complex systems presented in the following chapters of this part of the book.
Dariusz Mrozek
Chapter 7. Hadoop and the MapReduce Processing Model in Massive Structural Alignments Supporting Protein Function Identification
Abstract
Undoubtedly, for a variety of biological data and a variety of scenarios of how these data can be processed and analyzed, Hadoop and the MapReduce processing model bring the potential to make a step forward toward the development of solutions that allow insights into various biological processes to be gained much faster. In this chapter, we will see a MapReduce-based computational solution for efficient mining of similarities in 3D protein structures and for structural superposition. The solution benefits from the Map-only processing pattern, which utilizes only the Map phase of the MapReduce model. We will also see results of performance tests when scaling up nodes of the Hadoop cluster and increasing the degree of parallelism with the intention of improving the efficiency of the computations.
Dariusz Mrozek
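The Map-only pattern named in the Chapter 7 abstract can be sketched outside Hadoop as follows. This is only an illustration under assumptions: a thread pool stands in for map tasks, the names `map_task` and `map_only_job` are hypothetical, and the distance is a plain root-mean-square deviation over paired coordinates without any superposition, not the alignment method used in the book.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def map_task(record):
    """Hypothetical map function: one input record in, one output pair out."""
    query, name, target = record
    n = min(len(query), len(target))  # assumes non-empty coordinate lists
    squared = sum(
        (qx - tx) ** 2 + (qy - ty) ** 2 + (qz - tz) ** 2
        for (qx, qy, qz), (tx, ty, tz) in zip(query, target)
    )
    # Toy distance: RMSD over paired points, with no structural superposition.
    return name, math.sqrt(squared / n)


def map_only_job(query, database, parallelism=2):
    """Run every record through map_task; there is no shuffle or reduce step."""
    records = [(query, name, coords) for name, coords in database.items()]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(pool.map(map_task, records))


if __name__ == "__main__":
    query = [(0, 0, 0), (1, 0, 0)]
    database = {
        "a": [(0, 0, 0), (1, 0, 0)],
        "b": [(0, 0, 3), (1, 0, 3)],
    }
    print(map_only_job(query, database))
```

Each record is processed independently and the mapper output is the final result, so no aggregation across records is needed. That independence is what makes a reduce phase unnecessary in this pattern.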
Chapter 8. Scaling 3D Protein Structure Similarity Searching on Large Hadoop Clusters Located in a Public Cloud
Abstract
For many reasons, protein structures are worth exploring, and this exploration still leaves considerable room for potential applications of its results. 3D protein structure similarity searching is one of the important exploration processes performed in structural bioinformatics. Due to the complexity of 3D protein structures and the exponential growth of protein structures in public repositories, like the Protein Data Bank, the process is time-consuming and requires increased computational resources. In this chapter, we will see how 3D protein structure similarity searching can be accelerated by distributing computations on large Hadoop/HBase (HDInsight) clusters that can be broadly scaled out and up in the Microsoft Azure public cloud. We will see that the utilization of public clouds to perform scientific computations is very beneficial and can be successfully applied when performing time-consuming computations over biological data.
Dariusz Mrozek
Chapter 9. Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud
Abstract
Intrinsically disordered proteins (IDPs) constitute a wide range of molecules that act in cells of living organisms and mediate many protein–protein interactions and many regulatory processes. Computational identification of disordered regions in protein amino acid sequences has thus become an important branch of 3D protein structure prediction and modeling. In this chapter, we will see an IDP meta-predictor that applies an ensemble of primary predictors in order to increase the quality of IDP prediction. We will also see a highly scalable implementation of the meta-predictor on a Spark cluster (Spark-IDPP) that mitigates the problem of the exponentially growing number of protein amino acid sequences in public repositories. Spark-IDPP responds very well to the current needs of IDP prediction by parallelizing computations on a Spark cluster that can be scaled on demand on the Microsoft Azure cloud according to particular requirements for computing power.
Dariusz Mrozek
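The idea of a meta-predictor built from an ensemble of primary predictors, as described in the Chapter 9 abstract, can be sketched as a per-residue vote. The abstract does not state how the primary predictions are combined, so the strict majority vote below is an assumption, and the function name `consensus_disorder` is hypothetical.

```python
def consensus_disorder(predictions):
    """Combine per-residue disorder calls from several primary predictors.

    predictions: list of equal-length lists of booleans, one list per
    primary predictor (True means the residue is predicted disordered).
    Assumption: a residue is called disordered when a strict majority
    of predictors agree. This rule is illustrative only.
    """
    if not predictions:
        raise ValueError("at least one primary prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")
    threshold = len(predictions) / 2
    # zip(*predictions) yields one column of votes per residue.
    return [sum(votes) > threshold for votes in zip(*predictions)]


if __name__ == "__main__":
    primary = [
        [True, True, False, False],
        [True, False, False, True],
        [True, True, False, False],
    ]
    print(consensus_disorder(primary))
```

Since each sequence is scored independently of the others, a function like this could be applied to many sequences in parallel, which is the property a cluster framework exploits.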

Multi-threaded Solutions for Protein Bioinformatics

Frontmatter
Chapter 10. Massively Parallel Searching of 3D Protein Structure Similarities on CUDA-Enabled GPU Devices
Abstract
Finding common molecular substructures in complex 3D protein structures is still challenging. This is especially visible when scanning entire databases containing tens or even hundreds of thousands of protein structures. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) promise to give a high speedup of many time-consuming and computationally demanding processes over their original implementations on CPUs. In this chapter, we will see that massive parallelization of 3D structure similarity searching on many-core CUDA-enabled GPU devices reduces the execution time of the process and allows it to be performed in real time.
Dariusz Mrozek
Chapter 11. Exploration of Protein Secondary Structures in Relational Databases with Multi-threaded PSS-SQL
Abstract
Protein secondary structure reveals important information regarding protein construction and the regular spatial shapes, including alpha-helices, beta-strands, and loops, that a protein's amino acid chain can adopt in some of its regions. The relevance of this information and the scope of its practical applications create a need for its effective storage and processing. In this chapter, we will see how protein secondary structures can be stored in a relational database and processed with the use of PSS-SQL. PSS-SQL is an extension to the SQL language. It allows formulation of queries against a relational database in order to find proteins having secondary structures similar to the structural pattern specified by a user. In the chapter, we will see how this process can be accelerated by a parallel implementation of the alignment using multiple threads working on multi-core CPUs.
Dariusz Mrozek
Backmatter
Metadata
Title
Scalable Big Data Analytics for Protein Bioinformatics
Written by
Dr. Dariusz Mrozek
Copyright year
2018
Electronic ISBN
978-3-319-98839-9
Print ISBN
978-3-319-98838-2
DOI
https://doi.org/10.1007/978-3-319-98839-9