Skip to main content

2015 | Buch

Distributed and Sequential Algorithms for Bioinformatics

insite
SUCHEN

Über dieses Buch

This unique textbook/reference presents unified coverage of bioinformatics topics relating to both biological sequences and biological networks, providing an in-depth analysis of cutting-edge distributed algorithms, as well as of relevant sequential algorithms. In addition to introducing the latest algorithms in this area, more than fifteen new distributed algorithms are also proposed. Topics and features: reviews a range of open challenges in biological sequences and networks; describes in detail both sequential and parallel/distributed algorithms for each problem; suggests approaches for distributed algorithms as possible extensions to sequential algorithms, when the distributed algorithms for the topic are scarce; proposes a number of new distributed algorithms in each chapter, to serve as potential starting points for further research; concludes each chapter with self-test exercises, a summary of the key points, a comparison of the algorithms described, and a literature review.

Inhaltsverzeichnis

Frontmatter
Chapter 1. Introduction
Abstract
We provide an overview of the book in this chapter starting with a brief introduction to biological sequences and the problems encountered in these structures. Comparison of sequences, clustering, discovery of sequence repeats, finding genes, detecting genome rearrangements, and haplotype inference are the fundamental tasks in biological sequences. We then describe biological networks and state the main problems to be handled in these systems. Their analysis, clustering, network motif discovery, network alignment, and phylogenetics are the main issues in these networks. In both cases, distributed algorithms are needed due to the high computation requirements because of the huge sizes of these structures as we discuss. We conclude the chapter with the outline of the book.
K. Erciyes

Background

Frontmatter
Chapter 2. Introduction to Molecular Biology
Abstract
This chapter provides a dense review of molecular biology with focus on the molecules of the cell. The cell is the building block of all living things. We first review the functions of various structures in the cell. The important molecules in the cell are the DNA, RNA, proteins, and mitochondria. DNA is made of small molecules called nucleotides and contains genes which are sequences of nucleotides. These sequences in genes are used to form RNA in the transcription process, and RNA in turn is used to produce amino acid sequences to form proteins in the translation process. Each three nucleotide sequence in RNA called a codon is used to form one amino acid. We also briefly review the biotechnological methods which are cloning, polymerase chain reaction (PCR), and DNA sequencing. Cloning and PCR are used to amplify DNA segments and DNA sequencing is the process of obtaining nucleotide sequence of DNA. In the final part of the chapter, we provide a short survey of commonly used nucleotide and protein databases.
K. Erciyes
Chapter 3. Graphs,  Algorithms,  and  Complexity
Abstract
Graphs are widely used to model biological or other types of networks. In the first part of this chapter, we provide a dense review of graph theory in regard to biological networks. An algorithm is a set of instructions to solve a given problem. We introduce basic algorithmic concepts and methods such as greedy and divide and conquer strategies with emphasis on the commonly used approaches of dynamic programming and graph algorithms in bioinformatics. A number of basic graph algorithms are described in detail and special graph structures are also illustrated. Many problems in bioinformatics cannot be solved in polynomial time and for these tasks, approximation algorithms with proven approximation ratios to optimal solutions are preferred. However, in many cases there are no approximation algorithms known to date and heuristics which are commonsense methods shown to work for most of the input combinations are used. We provide a brief survey of complexity classes and these mentioned methods in the final part.
K. Erciyes
Chapter 4. Parallel and Distributed Computing
Abstract
Our aim in this chapter is to provide a dense overview of parallel and distributed computing as a background of the algorithms we will implement in the rest of the book. Parallel computing involves concurrent execution of a number of tasks on a set of processors. Partitioning the job to be done to processes, distribution of these processes to processors and synchronization among tasks are the key issues to be addressed in a parallel computing system. We review architectures for interconnection networks and models of parallel processing first. Parallel computing may be realized using a shared memory which is used for interprocess communication, or as a distributed memory system in which communications are implemented by message-passing between processes only. In line with the general perception, we termed the software methods of the first as parallel algorithms and the latter as distributed algorithms. Access to shared memory must be controlled and parallel processes must be synchronized as we discuss. We provide a brief review of POSIX threads which can be used for parallel processing. Distributed algorithms synchronize using messages; we show common structures of these algorithms and illustrate the widely used message passing interface standard (MPI) with algorithm examples. We also cover the UNIX operating system constructs for parallel process management and distributed communication using the socket structure.
K. Erciyes

Biological Sequences

Frontmatter
Chapter 5. String Algorithms
Abstract
The two main molecules of the cell, DNA and proteins, can be considered as strings from finite alphabets. In exact string matching, our aim is to search for a pattern in a longer text of symbols. Approximate string matching allows a given number of mismatches of the pattern in the text. We will see there are polynomial time algorithms for both cases. However, the sizes of DNA nucleotide and protein amino acid sequences being large necessitate the use of heuristics for lower time complexity. String matching is a potential area for parallel/distributed algorithm applications, although there are hardly any reported studies. We propose a simple distributed algorithm for this purpose. A suffix tree provides a convenient method of representing suffixes of a string as branches of a tree and these trees are widely used for biological sequence analysis. We describe suffix trees, review methods to construct them and survey a number of reported parallel/distributed approaches to build these structures. A suffix array is a compact representation of the suffixes of a string as we discuss. We sketch a possible implementation of a reported study of distributed suffix array construction.
K. Erciyes
Chapter 6. Sequence Alignment
Abstract
Sequence alignment is the basic and most widely used method of comparing two or more biological sequences. It can be used as the first step of various sequence analysis tasks to quantify the similarities between the sequences. The goal of sequence alignment is to transform one sequence to another with minimum number of modifications. It can be performed in polynomial time between two sequences but heuristics are commonly used to compare multiple sequences. Global alignment is the process of comparing two or more sequences as a whole and is commonly used for homologous sequences. On the other hand, local alignment aims to find similar subsequences between two or more inputs which may be very diverse when considered as a whole. Pairwise alignment algorithms input two sequences and multiple alignment methods work on a set of sequences. In this chapter; we describe algorithms, methods, and tools for these different forms of sequence alignment starting with sequential algorithms and then we detail distributed versions of these algorithms.
K. Erciyes
Chapter 7. Clustering of Biological Sequences
Abstract
Clustering is the task of grouping objects based on some similarity metric. Objects in a cluster should be closer to each other than the rest of the objects under consideration. Our aim in this chapter is to describe and analyze clustering algorithms for biological sequences. Classical algorithms may be used for this purpose; however, even the polynomial time algorithms may be time-consuming for DNA, and protein sequences which have very large sizes. We start this chapter by outlining validation methods of clustering and the classical clustering schemes which are the hierarchical clustering and partitional clustering methods. We then describe clustering methods that target biological sequences and outline graph-based clustering methods which can be used for this purpose. There are a number of distributed algorithms for sequence clustering as we discuss. Four new algorithms and a general method for graph partitioning are also proposed which can be implemented in a distributed memory system conveniently.
K. Erciyes
Chapter 8. Sequence Repeats
Abstract
DNA or protein repeats are recurring subsequences in these molecules. These repeats may be adjacent to each other in which case they are called tandem repeats or they may be dispersed named as sequence motifs. Discovery of such subsequences has various implications such as locating genes as they are frequently found near genes, comparing sequences, or disease analysis as the number of repeats is elevated in certain diseases. Instead of searching for exact repeats, we may be interested in finding approximate repeats, as these are encountered more frequently in experiments than exact ones due to mutations in sequences and erroneous measurements. We may search for repeats in a single sequence or a set of sequences. The detected repeats in the latter case provide also the conserved structures in the set which can be used to infer phylogenetic relationships. Discovery of these structures can be performed by combinatorial and probabilistic algorithms as we describe. Graph-based methods involve building of a k-partite similarity graph among k input sequences and then searching for cliques in this graph. A clique found this way will have a vertex in each partition and represent a common motif in all sequences. The distributed algorithms for this purpose are scarce and we propose two new algorithms to detect repeating sequences which can be easily experimented.
K. Erciyes
Chapter 9. Genome Analysis
Abstract
This chapter provides an aggregation of three problems associated with the coarse analysis of biological sequences at subsequence level: gene finding, genome rearrangement, and haplotype inference. Detecting the location of genes in DNA is needed to analyze them efficiently. Gene finding algorithms for this purpose can be classified as statistical and comparison-based methods. The first scheme searches statistically more frequently appearing sequences such as the start and terminating codons, and sequence repeats in or around the location of the genes. In comparison-based methods, our aim is to compare sequences and find their similar structures assuming genes are more conserved throughout the evolution. Subsequences of a genome undergo mutations and discovery of these chain of mutations provides information about the evolutionary process and also the disease state of an organism as some of these arrangements are assumed to be the causes of certain diseases. We review few algorithms that discover the rearrangement events in a genome. The last problem we study is the extraction of data of a single chromosome from the two chromosome data, called haplotype inference. We describe a simple algorithm based on maximum parsimony and a probabilistic algorithm for this purpose. Unfortunately, there are hardly any distributed algorithms for these three important tasks and we propose a distributed algorithm for genome rearrangement and two distributed algorithms for haplotype inference.
K. Erciyes

Biological Networks

Frontmatter
Chapter 10. Analysis of Biological Networks
Abstract
Biological processes such as the interaction between proteins or metabolic reactions can be represented by networks which can be modeled by graphs. Biological networks are present in the cell and outside the cell. Our aim in this chapter is to first introduce the networks in the cell and analyze them as graphs. Centrality analysis provides information about the important nodes and edges in biological networks and we describe algorithms to find various centrality measures. The main problems to investigate in the graph structure of a biological network are the module detection, discovery of recurrent subgraphs called network motifs and aligning two or more networks as we discuss. We will see these networks have interesting features such as small-world, scale-free properties which are not found in random networks. All of these problems are discussed in detail in the rest of this part of the book.
K. Erciyes
Chapter 11. Cluster Discovery in Biological Networks
Abstract
Biological networks can be represented by graphs and detection of modules in such networks is then reduced to finding clusters in graphs. Clustering in graphs is basically detecting dense regions with higher number of edges between the nodes in the clusters than to the rest of the graph. Clustering is NP-hard in the general case and the huge size of biological networks make this process more difficult. We start this chapter by defining parameters for the validation of the cluster quality. We then classify and describe clustering algorithms for biological networks. Although the general graph clustering algorithms can be used for biological networks, we present algorithms that are frequently used for networks in the cell. There are a number of reported distributed clustering algorithms as we describe and we also propose two new algorithms for this purpose.
K. Erciyes
Chapter 12. Network Motif Search
Abstract
Network motifs are recurring subgraph patterns in biological or other networks. A network motif is assumed to have a specific function associated with it, and these structures are considered as the building blocks of biological networks. Discovery of such motifs is important as it gives insight to the functioning of a network and also provides an alternative means to compare and analyze two or more biological networks. If two networks have common motifs, we can assume they perform some similar basic functions. Finding similarity also helps to find phylogenetic relationships between organisms. Detecting network motifs of a given size is a computationally hard task as the number of possible subgraphs grows exponentially with the size, and also this problem is closely related to the graph isomorphism problem which is not tractable either in the general case or as the subgraph isomorphism problem. Motif discovery involves three subtasks: finding subgraphs of a given size, dividing these into equal isomorphic classes and evaluating the statistical significance of the discovered subgraphs. The last step is usually implemented by generating many random graphs similar to the original one and comparing the frequencies of the subgraphs in the original graph and in these graphs. In this chapter, we describe motif discovery problem, show several sequential, and few distributed existing algorithms. We also provide a general framework for distributed processing of the subtasks involved.
K. Erciyes
Chapter 13. Network Alignment
Abstract
Biological networks are dynamic and the measurements can contain errors which results in two networks of the same origin to look different. Alignment of two or more networks is the process of searching for similarities between them. This task requires mapping nodes of one network to another by associating some similarity measure with the aim of pairing nodes that have the highest affinity. Network alignment is needed to compare biological networks to find their phylogenetic relationships and to find approximately conserved subnetworks in them. Since this problem is intractable, various heuristic algorithms are proposed. In the graph domain, the alignment operation can be accomplished by forming a weighted complete k-partite graph with partitions from each network and edge weights showing the similarity. The problem is then reduced to maximal weighted complete k-partite matching problem which has been investigated thoroughly. In this chapter, we first state the relation of the problem to graph isomorphism and the weighted complete k-partite matching and show the metrics for the alignment quality. We then review few sequential algorithms for this problem. Distributed alignment algorithms are scarce and we propose a new algorithm that uses k-partite matching property.
K. Erciyes
Chapter 14. Phylogenetics
Abstract
Phylogenetics is the study of evolutionary relationships among organisms. Sequence alignment is commonly performed as the first step of phylogenetics to determine the similarities of DNA and protein sequences. Searching these relationships is needed to analyze diseases, predict the genetic structures of pathogens, and also to classify the organisms. We pursue an evolutionary structure called the phylogenetic tree with leaves representing the living organisms called taxa and the intermediate nodes as the hypothetical ancestors. We start this chapter by stating the terms of phylogenetic terminology. We then describe various methods of constructing phylogenetic trees and propose a simple distributed algorithm to construct such trees. In the maximum parsimony problem, we are given the evolutionary tree and a number of taxa, and our aim is to label nodes of this tree that explains data with minimum number of mutations. We review a number of algorithms for this purpose and introduce a new algorithm for distributed parsimony implementation. After reviewing the probabilistic maximum likelihood method of constructing phylogenetic trees, we provide a brief review of distributed approaches that use maximum likelihood. Phylogenetic networks are more general as they exhibit events such as horizontal gene transfer and recombination which cannot be represented by the evolutionary trees. We discuss these networks briefly to conclude this chapter.
K. Erciyes
Chapter 15. Epilogue
Abstract
We overview the current challenges and future directions of bioinformatics in this chapter. The current major challenges as we perceive are big data analysis, disease analysis, and bioinformatics education. Management of big data requires specialized middleware, special methods such as data mining, and also parallel/distributed algorithms which is the main topic of this book. Disease analysis in search of cures and drug therapies is probably one of the grand challenges in bioinformatics as well as various other disciplines. Bioinformatics education may also require revisions to meet the current needs of the researchers in this dynamic field. We then provide a detailed description of challenges which is in fact is a dense review of the book. The main future directions of research in bioinformatics, as we see it, will probably be along the same lines of investigation into efficient big data analysis methods including parallel/distributed algorithm implementations; and disease analysis together with more personal medication applications.
K. Erciyes
Backmatter
Metadaten
Titel
Distributed and Sequential Algorithms for Bioinformatics
verfasst von
Kayhan Erciyes
Copyright-Jahr
2015
Electronic ISBN
978-3-319-24966-7
Print ISBN
978-3-319-24964-3
DOI
https://doi.org/10.1007/978-3-319-24966-7

Premium Partner