
About this book

The 14 contributed chapters in this book survey the most recent developments in high-performance algorithms for NGS data, offering fundamental insights and technical information specifically on indexing, compression and storage; error correction; alignment; and assembly.

The book will be of value to researchers, practitioners and students engaged with bioinformatics, computer science, mathematics, statistics and life sciences.

Table of contents

Frontmatter

Indexing, Compression, and Storage of NGS Data

Frontmatter

Chapter 1. Algorithms for Indexing Highly Similar DNA Sequences

The amount of available digital data is growing at an extraordinary pace. This is the case for DNA sequences produced by new high-throughput Next Generation Sequencing (NGS) technologies. Hence, it is possible to sequence the genomes of several organisms, and one project (http://www.1000genomes.org) now provides about 2500 individual human genomes, each a sequence of more than three billion characters (A, C, G, T).

Nadia Ben Nsira, Thierry Lecroq, Mourad Elloumi

Chapter 2. Full-Text Indexes for High-Throughput Sequencing

Recent advances in High-Throughput Sequencing demand novel algorithms that work on efficient data structures specifically designed for the analysis of large volumes of sequence data. This chapter describes such data structures, called full-text indexes, which represent all substrings (or substrings up to a certain length) contained in a given text (or text collection).

David Weese, Enrico Siragusa
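
To make the idea of indexing "substrings up to a certain length" concrete, here is a minimal sketch of a q-gram index in Python. It is only an illustration and is not taken from the chapter; the function names `build_qgram_index` and `find_occurrences` are hypothetical, and a dictionary-based index like this one ignores the memory concerns that practical indexes have to address.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every length-q substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return dict(index)


def find_occurrences(text, index, q, pattern):
    """Return all start positions of pattern, using the index as a filter.

    The first q characters of the pattern select candidate positions,
    which are then verified against the text.
    """
    if len(pattern) < q:
        raise ValueError("pattern must be at least q characters long")
    candidates = index.get(pattern[:q], [])
    return [i for i in candidates if text.startswith(pattern, i)]


text = "GATTACAGATTA"
index = build_qgram_index(text, 3)
hits = find_occurrences(text, index, 3, "GATTA")
```

The index lookup narrows the search to positions that share the pattern's first q characters, so only those candidates need to be compared against the text.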

Chapter 3. Searching and Indexing Circular Patterns

Circular DNA sequences can be found in viruses, as plasmids in archaea and bacteria, and in the mitochondria and plastids of eukaryotic cells. Hence, circular sequence comparison finds applications in several biological contexts (Barton et al., Experimental algorithms. Lecture notes in computer science, vol 9125, pp 247–258, 2015; Barton et al., Algorithms Mol Biol 9(9):2014; Uliel et al., Protein Eng 14(8):533–542, 2001). This motivates the design of efficient algorithms (Barton et al., Language and automata theory and applications. Lecture notes in computer science, vol 8977, pp 85–96. Springer, Berlin, 2015) and data structures (Hon et al., Combinatorial pattern matching. Lecture notes in computer science, vol 7922, pp 142–152. Springer, Berlin/Heidelberg, 2013) that are devoted to the specific comparison of circular sequences, as they can be relevant in the analysis of organisms with such structure (Grossi et al., Proceedings of algorithms in bioinformatics - 15th international workshop, WABI 2015, Atlanta, GA, Sept 10–12, 2015. Lecture notes in computer science, vol 9289, pp 203–216. Springer, Berlin, 2015; Gusfield, Algorithms on strings, trees, and sequences - computer science and computational biology. Cambridge University Press, Cambridge, 1997).

Costas S. Iliopoulos, Solon P. Pissis, M. Sohel Rahman

Chapter 4. De Novo NGS Data Compression

This chapter deals with the compression of genomic data without reference genomes. It presents various techniques which have been specifically developed to compress sequencing data in lossless or lossy modes. The chapter also provides an evaluation of different NGS data compressor tools.

Gaetan Benoit, Claire Lemaitre, Guillaume Rizk, Erwan Drezen, Dominique Lavenier

Chapter 5. Cloud Storage-Management Techniques for NGS Data

Current scientific advancements in both computer and biological sciences are bringing new opportunities to interdisciplinary research topics. On the one hand, computers and cloud software tools for big-data analytics are being developed rapidly, increasing processing capability from terabyte data sets to petabytes and beyond. On the other hand, advances in molecular biology experiments are producing huge amounts of data related to genome and RNA sequences, protein and metabolite abundance, protein–protein interactions, gene expression, and so on. In most cases, biological data form big, versatile, complex networks.

Evangelos Theodoridis

Error Correction in NGS Data

Frontmatter

Chapter 6. Probabilistic Models for Error Correction of Nonuniform Sequencing Data

Sequencing error correction has become an important step in the analysis of next-generation sequencing (NGS) datasets, improving data quality for downstream applications. In this chapter, we discuss different formulations of sequencing read error correction that are based on probabilistic models able to handle datasets with nonuniform read coverage. Nonuniform coverage is common in several applications of NGS, including small RNA and messenger RNA sequencing, metagenomics, metatranscriptomics, and single-cell sequencing, a topic of rising importance in the community. We review popular formulations based on the Hamming graph of k-mers found in sequencing reads and introduce a more complete formulation that can also handle insertion and deletion errors. A particular focus will be on a more general formulation using hidden Markov models that can correct indel errors.

Marcel H. Schulz, Ziv Bar-Joseph
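
As a rough illustration of the Hamming graph mentioned in this abstract, the Python sketch below collects the k-mers of a set of reads and links any two k-mers that differ in at most one position. It is an assumption-laden toy, not the chapter's method: the function names are hypothetical, the pairwise comparison is quadratic in the number of k-mers, and no probabilistic model is applied to the resulting graph.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmers, max_dist=1):
    """Adjacency sets linking k-mers within max_dist mismatches of each other."""
    nodes = sorted(kmers)
    edges = {kmer: set() for kmer in nodes}
    for a, b in combinations(nodes, 2):
        if hamming_distance(a, b) <= max_dist:
            edges[a].add(b)
            edges[b].add(a)
    return edges


reads = ["ACGT", "ACGA"]
counts = count_kmers(reads, 3)
graph = hamming_graph(counts)
```

In such a graph, a rare k-mer adjacent to a frequent one is a natural candidate for a substitution error, which is the intuition that count-based correction models build on.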

Chapter 7. DNA-Seq Error Correction Based on Substring Indices

Next-Generation Sequencing (NGS) has revolutionized genomics. NGS technologies produce millions of sequencing reads of a few hundred bases in length. In the following, we focus on NGS reads produced by genome sequencing of a clonal cell population, which has important applications like the de novo genome assembly of previously unknown genomes, for example, recently mutated parasites (Mellmann et al., PLoS ONE 6(7):e22751, 2011) or newly sequenced genomes (Locke et al., Nature 469:529–533, 2011).

David Weese, Marcel H. Schulz, Hugues Richard

Chapter 8. Error Correction in Methylation Profiling From NGS Bisulfite Protocols

Whole genome bisulfite sequencing (WGBS) has emerged as the primary technique for DNA methylation studies because of its great potential in terms of speed, specificity, and the capability of addressing new biological questions such as non-CpG context methylation or hemimethylation. However, despite the improvement that WGBS represents, processing and analyzing the resulting datasets is not as straightforward as in other methylation assays, and special care should be taken to obtain reliable results. As far as we know, no extensive review exists of the error sources that can bias methylation level measurement and of the different algorithms that have been proposed to deal with them. Therefore, in this chapter all known WGBS error sources will be extensively reviewed and critically evaluated in order to suggest a couple of best practices to deal with all sources of bias in WGBS assays.

Guillermo Barturen, José L. Oliver, Michael Hackenberg

Alignment of NGS Data

Frontmatter

Chapter 9. Comparative Assessment of Alignment Algorithms for NGS Data: Features, Considerations, Implementations, and Future

Because massively parallel sequencing relies on shorter reads, the algorithms developed for alignment have been crucial to the widespread adoption of Next-Generation Sequencing (NGS). There has been great progress in the development of a variety of algorithms for different purposes. Researchers are now able to use sensitive and efficient alignment algorithms for a wide variety of applications, including genome-wide variation studies [1], quantitative RNA-seq expression analyses [2], the study of secondary RNA structure [3], microRNA discovery [4], identification of protein-binding sites using ChIP-sequencing [5], recognizing histone modification patterns for epigenetic studies [6], simultaneous alignment of multiple genomes for comparative genomics [7], and the assembly of de novo genomes and transcriptomes [8]. In clinical settings, alignment to reference genomes has led to rapid pathogen discovery [9], identification of causative mutations for rare genetic diseases [10–12], detection of chromosomal abnormalities in tumor genomes [13], and many other advances which similarly depend on rapid and cost-effective genome-wide sequencing.

Carol Shen, Tony Shen, Jimmy Lin

Chapter 10. CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment

Next generation sequencing (NGS) technologies have enabled cheap, large-scale, and high-throughput production of short DNA sequence reads and have thereby promoted the explosive growth of data volume. Unfortunately, the produced reads are short and prone to errors incurred during sequencing cycles. Both the large data volume and the sequencing errors have complicated the mapping of NGS reads onto the reference genome and have motivated the development of various aligners for very short reads, typically less than 100 base pairs (bps) in length. As read length continues to increase, propelled by advances in NGS technologies, these longer reads tend to have higher sequencing error rates and more true mutations (including substitutions, insertions, or deletions) relative to the genome. These new characteristics render inefficient those aligners that are optimized for very short reads and support only ungapped alignments or gapped alignments with a very limited number of gaps (typically one gap), and thereby call for new aligners that support fully gapped alignment. In this chapter, we present the CUSHAW software suite for NGS read alignment, which is open-source and consists of three individual aligners: CUSHAW, CUSHAW2, and CUSHAW3. This suite offers parallel and efficient NGS read alignment to large genomes, such as the human genome, by harnessing multi-core CPUs or compute unified device architecture (CUDA)-enabled graphics processing units (GPUs). Moreover, it can align both base-space and color-space reads and is consistently shown to be one of the best alignment tools in our performance evaluations.

Yongchao Liu, Bertil Schmidt

Chapter 11. String-Matching and Alignment Algorithms for Finding Motifs in NGS Data

The development of high-throughput Next Generation Sequencing (NGS) technologies allows the low-cost extraction of an extremely large amount of biological sequences in the form of reads, i.e., short fragments of an organism’s genome. The advent of NGS poses new issues for computer scientists and bioinformaticians, leading to the design of algorithms for aligning and merging the reads in order to obtain an efficient and effective reconstruction of the genome. In this chapter, we focus on methods that can quickly and precisely establish whether two reads are similar or not and that allow the analysis of biological sequences extracted with NGS technologies. In particular, the most widespread string-matching, alignment-based, and alignment-free algorithms are summarized and discussed.

Giulia Fiscon, Emanuel Weitschek

Assembly of NGS Data

Frontmatter

Chapter 12. The Contig Assembly Problem and Its Algorithmic Solutions

DNA sequencing, assuming no prior knowledge of the target DNA fragment, may be roughly described as a succession of two steps. The first uses some sequencing technology to output, for a given DNA fragment (not necessarily a whole genome), a collection of possibly overlapping sequences (called reads) representing small parts of the initial DNA fragment. The second aims at recovering the sequence of the entire DNA fragment by assembling the reads.

Géraldine Jean, Andreea Radulescu, Irena Rusu

Chapter 13. An Efficient Approach to Merging Paired-End Reads and Incorporation of Uncertainties

Next-Generation Sequencing (NGS) technologies have reshaped the landscape of life sciences. The massive amount of data generated by NGS is rapidly transforming biological research from traditional wet-lab work into a data-intensive analytical discipline (Koboldt et al., Cell 155(1):27–38, 2013). The Illumina “sequencing by synthesis” technique (Mardis, Annu Rev Genomics Hum Genet 9:387–402, 2008) is one of the most popular and widely used NGS technologies.

Tomáš Flouri, Jiajie Zhang, Lucas Czech, Kassian Kobert, Alexandros Stamatakis

Chapter 14. Assembly-Free Techniques for NGS Data

Sequencing technologies have undergone a considerable evolution in recent decades; the first expensive machines (appearing in the late 1970s) have now been replaced by cheaper and more effective ones. At the same time, data processing has evolved to face the new challenges and problems posed by the new types of sequencing records. In this first section, we briefly outline how this evolution of sequencing technologies unfolded and which new challenges each new generation posed.

Matteo Comin, Michele Schimd