1 Introduction
2 Contributions
-
We present an analysis of multiple applications in different subfields of biological sequence analysis. We categorize those applications with computation stages and data flow, then identify the kernel functions which consumes a dominant portion of overall execution time. Then we summarize the similarity of operations and datapath of target applications, enabling the use of a reconfigurable architecture to suite all the requirements.
-
We describe a reconfigurable processing element (PE) which is designed to facilitate both simplification and scalability with integer operations. An array of PEs forming a CGRA as a processing core can be reconfigured to accelerate target applications, while shared buffers are efficiently interconnected to maximize throughput and minimize congestion.
-
We introduce a memory access policy to maximize the throughput of local DRAM access in sub-processor while devoid bus congestion within a circuit-switching many-core system. Benefit from this policy, 3D-stacked DRAM has been adopted, which provides both high bandwidth and low access latency.
-
We present that our architecture is specially customized for biological sequence analysis problems in terms of both operations and datapath. The compute intensive kernel functions of target biological sequence analysis applications can be well mapped to our architecture.
-
We present results from prototyping stripped down implementation of our design on a commodity FPGA, representing the performance of a single processing core of our architecture. Then we scale the results to give a plausible estimation of our design with many-core and 3D stacked architecture. Based on conservative area estimation and simulation results, we show that our design achieves the highest performance and greatest energy efficiency comparing with all other known accelerators.
3 Biological Sequence Analysis Applications
3.1 Pairwise Sequence Alignment
-
For a matching event between segment i from sequence a and segment j from sequence b, which means that i and j are identical character symbol, the matching score \(M_{i,j}\) is given by the maximum alignment score of segment pair at \((i-1\), \(j-1)\) plus \(S_{i,j}\), where \(S_{i,j}\) is the substitution matrix [38] score with entry of residues i, j .
-
For an insertion event between i and j, which means that j is likely to be an inserted symbol character that different from \(i-1\) and i, the insertion score \(I_{i,j}\) is given by the maximum value between the matching score of segment pair \((i-1, j)\) minus a gap open d and the insertion score of which minus a gap extension e.
-
For a deletion event between i and j, which means that a character symbol in b is likely being deleted and j is the same as \(i-1\), the deletion score \(D_{i,j}\) is given by the maximum value between matching score of segment pair \((i, j-1)\) minus a gap open d and insertion score of which minus a gap extension e.
-
The distance score matrix element at location (i, j) is then filled by the maximum value of corresponding matching, insertion and deletion scores.
3.2 Multiple Sequence Alignment
-
All sequences to be processed are pairwise aligned with each other in order to give corresponding similarity distance score matrices, representing the divergence of each pair of sequences. This stage is performed with deterministic method using distance score matrix filling of local sequence alignment.
-
A phylogenetic-like guided tree is calculated from the distance score matrices with its leaves representing sequences and sub sequence groups. The branch lengths reflect sequence divergences along with tree topology.
-
All the sequences are progressively aligned according to the branching order of the guide tree, from leaf to root. Profiles are often created for tree branches instead of aligning two subgroups of sequences directly, but the alignment processes of profile-wise, profile-sequence and sequence-wise are all equivalent to pairwise sequence alignment. The tree topology indicates the order of pairwise sequence alignment for left and right node/branch of each ancestor tree node.
3.3 Database Search
-
For each query sequence, all the k-mers and their close similarities are obtained and tested against a user-defined threshold. Those passed strings are called seed strings. The seed length k is normally 3–5 for protein and 11 for DNA/RNA sequences. All the seeds are used to build a bloom filter [40]. The original database or its partitions is inputted to test against the bloom filter, in order to get a series of candidate matching locations including fake-positives. For each candidate matching location, all the seeds are evaluated, which produces the exact seed-location matching pairs.
-
For each seed-location matching pair, the seed is stepwise extended from both ends towards the query sequence, where pairwise distance scores are generated between seed extension and corresponding extension of matched database string. The extension does not stop until the accumulated total score reaches user-defined threshold.
-
Subsequences with extension distance scores higher than user-defined threshold will be used in the third stage for evaluation. Pairwise local sequence alignment is used to verify the similarity between query sequence and database at the suspected regions that indicated by two nearby extensions.
3.4 Short Read Sequence Mapping
-
For each read, the head part of each is obtained as seed string and properly hashed.
-
Prefix part of each hashed seed is used to access the pointer table, and a valid hit would give an address range pointing to the large hash table. Suffix part of the hashed seed is then compared with a series hash values retrieved from the large hash table respectively. Each equivalent indicates an exact seed matching between the query sequence and the database, called candidate matching location.
-
Pairwise local sequence alignment is executed between the complete short read and the reference sequence at each candidate matching location for verification.
-
For each read, the head part is obtained as seed to query against the FM-index table. Sequentially evaluating every character of the seed, two pointers top and bottom are updated that specify the range of positions of the verified pattern in the suffix array. When the two pointers are equal or if top is less than bottom, the search is terminated indicating target seed does not exist in the reference sequence. After processing the last character of the seed without termination, the two pointers indicate a number of candidate matching locations where the seed string are located in reference sequence.
-
Pairwise local sequence alignment is then executed between the read and original sequence model at each candidate matching location for verification.
3.5 Application Analysis
Target applications | Execution stages |
---|---|
Pairwise sequence alignment | Distance score matrix filling |
Backtracking | |
Multiple sequence alignment | Distance score matrix filling |
Guided tree construction | |
Pairwise local sequence alignment | |
Database searching | Seed generation |
Seed matching\(^{\mathrm{a}}\)
| |
Seed extension\(^{\mathrm{b}}\)
| |
Pairwise local sequence alignment | |
Short read sequence mapping (hash-index) | Seed hashing |
2-hit string matching | |
Pairwise local sequence alignment | |
Short read sequence mapping (FM-index) | Backward searching |
Pairwise local sequence alignment |
4 CGRA for Biological Sequence Analysis
4.1 Reconfigurable Processing Element
-
To address the varying additions and comparisons of the distance score matrix filling stage for pairwise sequence alignment algorithms.
-
To address the varying comparisons of the backtracking stage for pairwise sequence alignment algorithms.
-
To address the varying comparisons of the seed matching and seed extension stages for database searching.
-
To address the varying comparisons of various string matching stages for short read sequence mapping applications.
-
To be deployed as an array of arbitrary size to massively parallelize the computations.
4.2 CGRA Interconnection
4.3 Input and Output Buffers
5 3D-Stacked Many-Core Architecture
5.1 Sub-Processor with Integrated Memory Interface
5.2 Circuit-Switching Many-Core Processor
5.3 3D-Stacked DRAM with Scatter/Gather Read
-
The target row and bank is activated during the Activate phase
-
The column address is selected while a Read command is sent to select specific column of the entire row, and a latency of several clock cycles is required before the data reaches the DQ port.
-
The precharge phase is needed only if the current row must be deactivated; as the precharge command recharge the consumed energy in this row, this bank is ready for a new read or write request on any row.
6 Application Mappings
6.1 Distance Score Matrix Filling
6.2 Backtracking
6.3 Seed Matchings and Seed extension
-
During data preparation, the original database is hashed according to hash functions and indexed to hash table. A pointer table is created using all the possible seeds as access entry, which stores the accessing range of the database hash table that corresponding to each seed.
-
The seeds are used to access the pointer table, and a valid pointer table value indicates a series of exact seed-location matching pairs in the original database.
6.4 2-Hit String Matching
6.5 Backward Searching
6.6 Further Improvements
7 Experimental Results
7.1 Test Environments
7.2 Performance Scaling
7.3 Test Results
7.4 Short Read Mapping Sensitivity
SRR867061 | Single end | Paired end | |
---|---|---|---|
Sensitivity (%) | Sensitivity (%) | Confidence (%) | |
BFAST | 88.85 | 90.77 | 86.65 |
3D-BFAST | 88.87 | 90.81 | 86.65 |
BWA | 90.15 | 96.53 | 96.49 |
BarraCUDA | 91.02 | 96.61 | 96.55 |
3D-BWA | 90.64 | 96.60 | 96.52 |