1. Introduction
- Compared to traditional methods (PEG, ACE), the SA-based constructor takes both decoding performance and hardware efficiency into consideration during the construction process.
- Compared to existing work [11], the ACO-based scheduler covers both layer and element permutation and maps the problem to a double-layer TSP, yielding a complete solution and a better pipelining schedule.
- Compared to existing works, the GPU-based evaluator is the first to implement the semi-parallel layered architecture on GPU. The obtained net throughput is close to the highest reported [12] (about 25 Mbps), while the proposed scheme has higher precision and better BER performance. Further, we port the whole encoding and decoding system, rather than a single decoder, onto the GPU.
- Compared to existing FPGA or ASIC implementations [14-16], the proposed multi-mode high-throughput decoder not only supports multiple modes with completely on-the-fly configuration, but also keeps the performance loss within 0.2 dB of floating-point precision with 20 iterations, and achieves a stable net throughput of 721.58 Mbps at code rate 1/2 and 20 iterations. With the early-stopping scheme, a net throughput of 1.2 Gbps is further achieved on a Stratix III FPGA.
2. Background
2.1. LDPC codes and Tanner graph
2.2. The BP algorithm and effect of cycle
2.3. Decoder architecture and memory conflict
3. The ACO-based pipelining scheduler
3.1. Problem formulation
- Layer permutation: We can assign which layer is processed first and which next. If two layers i, j have 1s at entirely different positions, i.e., no k, l exist such that h_{i,k} = h_{j,l}, they tend to be assigned as adjacent layers with no conflict.
- Element permutation: Within a layer, we can assign which element is processed first and which next. If two adjacent layers i, j still conflict, i.e., h_{i,k} = h_{j,l} for some k, l, we can assign element k to be first in layer i and element l to be last in layer j. In this way, we increase the time interval between the conflicting elements k and l.
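As a concrete illustration of the conflict notion behind both permutations, a minimal sketch (our own toy representation, not the paper's code), where each layer is given by the block-column positions of its 1s:

```python
# Sketch: detecting memory conflicts between two layers of a QC-LDPC
# base matrix.  A layer is represented by the list of block-column
# indices holding 1s; two layers conflict when they share a column,
# since the later layer would read an LLR that the earlier one is
# still updating in the pipeline.

def conflicts(layer_i, layer_j):
    """Return the set of block columns shared by the two layers."""
    return set(layer_i) & set(layer_j)

layer_i = [0, 3, 7, 11]   # hypothetical positions of 1s in layer i
layer_j = [2, 5, 9, 12]   # no shared column: conflict-free neighbor
layer_k = [3, 6, 9, 14]   # shares column 3 with layer i

print(conflicts(layer_i, layer_j))  # empty set: safe adjacent layers
print(conflicts(layer_i, layer_k))  # column 3 conflicts: permute elements or insert idle clocks
```

Conflict-free pairs are preferred as adjacent layers; for the remaining conflicts, element permutation pushes the shared columns apart in time.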
3.2. The double-layered TSP
- Layer permutation layer: In this layer we deal only with layer permutation. We define the "distance" (also "cost") between layers i and j as the minimum number of idle clocks inserted before the processing of layer j. If several conflicting position pairs exist, i.e., h_{i,k} = h_{j,l} for multiple pairs (k, l), we take the maximum over them. Thus in this layer the distance matrix is defined by (6), and the target function remains the same as (5).
- Element permutation layer: In this layer we inherit the layer permutation result and map the element permutation of each layer to an independent TSP. In the TSP for layer i, we fix the schedules of the prior layer p (λ_p = λ_i − 1) and the next layer q (λ_q = λ_i + 1), and only tune the elements of layer i. We define the "distance" d_{k,l} as the change in the number of idle clocks if element k is assigned to position l, i.e., μ_{i,k} = l. Note that element k can conflict with layer p or q, and d_{k,l} varies with the conflict case, as given by (7). Since the largest d_{k,l} becomes the bottleneck of element permutation, the target function changes to the max form (8).
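To make the inter-layer distance concrete, a sketch under an assumed simple pipeline model; the true Eq. (6)/(7) depend on the hardware pipeline, and all names below are ours:

```python
# Illustrative sketch of the inter-layer "distance" (in the spirit of
# Eq. (6)); the pipeline model here is an assumption.  pos_i / pos_j map
# each block column to its processing slot inside layer i / j, D is the
# assumed pipeline depth in clocks, and len_i is the element count of
# layer i.

def layer_distance(pos_i, pos_j, D, len_i):
    shared = set(pos_i) & set(pos_j)          # conflicting block columns
    # The gap (in clocks) between a conflicting pair's start times is
    # (len_i - slot_in_i) + slot_in_j; idle clocks must fill up to D,
    # and the worst pair dominates.
    idles = [D - ((len_i - pos_i[c]) + pos_j[c]) for c in shared]
    return max([0] + idles)                   # 0 when nothing stalls

pos_i = {0: 0, 3: 1, 7: 2, 11: 3}             # column -> slot in layer i
pos_j = {3: 0, 6: 1, 9: 2, 14: 3}             # column 3 conflicts early
print(layer_distance(pos_i, pos_j, D=6, len_i=4))   # idle clocks required
```

Element permutation then amounts to choosing the slot maps so that the worst-case term, and hence this distance, shrinks.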
3.3. The ACO-based algorithm
4. The SA-based code constructor
4.1. Problem formulation
- High performance: the code should have high coding gain and good BER/BLER performance, including an early waterfall region, a low error floor, and anti-fading ability. This is strongly related to large girth, a large ACE spectrum, few trapping sets, etc.
- High efficiency: the implementation of the encoder and decoder should have moderate complexity and high throughput. This is strongly related to the QC-RA type, a high degree of parallelism, a short decoding pipeline, few memory conflicts, etc.
4.2. The double-stage SA framework
4.3. Details of the algorithm
- sample_temperature is the temperature sampling function, decreasing with k. It can take an exponential form αe^(−βk).
- prob is the acceptance probability function of the new search point h_new. If h_new is better (E_new < E), it returns 1; otherwise it decreases with E_new − E and increases with t. It can take an exponential form αe^(−β(E_new−E)/t).
- perf_energy is the performance energy function. It evaluates the performance-related factors of the matrix h and gives a lower energy for better performance. Typically, we can count the number c_l of length-l cycles, then compute a total cost ∑_l w_l·c_l, where w_l is the cost weight of a length-l cycle, decreasing with l.
- effi_energy is the efficiency energy function, similar to perf_energy except that it gives a lower energy for higher efficiency. Typically, we can count the number c_l of gap-l memory conflicts, then compute a total cost ∑_l w_l·c_l, where w_l is the cost weight of a gap-l conflict, decreasing with l.
- perf_neighbor searches for a neighbor of h in the matrix space when aiming at performance, based on minor changes of h. For QC-LDPC, we can define three atomic operations on the base matrix H_b as follows.
  - Horizontal swap: For chosen rows i, j and columns k, l, swap the values of H_b(i,k) and H_b(i,l), then swap the values of H_b(j,k) and H_b(j,l).
  - Vertical swap: For chosen rows i, j and columns k, l, swap the values of H_b(i,k) and H_b(j,k), then swap the values of H_b(i,l) and H_b(j,l).
  - Permutation change: Change the permutation factor of a chosen element H_b(i,k).
  For a higher temperature t, we allow the neighbor searching process to explore a wider space. This is done by performing the atomic operations more times.
- effi_neighbor searches for a neighbor of h in the matrix space when aiming at efficiency. This is similar to perf_neighbor; however, we typically remove the permutation change operation, as it does nothing to help reduce conflicts.
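The helpers above can be assembled into a standard SA loop. The following is a minimal sketch with toy energy and neighbor functions of our own; it is not the paper's double-stage framework, only the generic skeleton the named functions plug into:

```python
# Minimal simulated-annealing skeleton matching the helper names above.
# The energy/neighbor stand-ins at the bottom are toys, not the paper's
# perf_energy / perf_neighbor.
import math
import random

def sample_temperature(k, alpha=1.0, beta=0.01):
    """Exponential cooling schedule, decreasing with step k."""
    return alpha * math.exp(-beta * k)

def prob(E, E_new, t, alpha=1.0, beta=1.0):
    """Acceptance probability: 1 for improvements, else decays with (E_new - E)/t."""
    if E_new < E:
        return 1.0
    return alpha * math.exp(-beta * (E_new - E) / t)

def anneal(h0, energy, neighbor, steps=5000):
    """Generic SA loop over any state space h."""
    h, E = h0, energy(h0)
    for k in range(steps):
        t = sample_temperature(k)
        h_new = neighbor(h, t)
        E_new = energy(h_new)
        if random.random() < prob(E, E_new, t):
            h, E = h_new, E_new
    return h, E

# Toy stand-in: minimize (x - 3)^2 over integers with a unit-step neighbor.
best, E = anneal(0, lambda x: (x - 3) ** 2,
                 lambda x, t: x + random.choice([-1, 1]))
print(best, E)   # settles near x = 3
```

In the paper's setting, h is the base matrix, energy is perf_energy or effi_energy, and neighbor applies the atomic operations more times at higher t.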
5. The GPU-based performance evaluator
5.1. Motivation and architecture
5.2. Algorithm and procedure
5.3. Details and instructions
- Ensure "coalesced access" when reading or writing global memory, or the operation will be auto-serialized. In our algorithm, adjacent threads should access adjacent L(Q_i) and L(r_{ji}).
- Shared memory and registers are fast yet limited resources, and their use should be carefully planned. In our algorithm, we store L(Q_i) in shared memory and L(r_{ji}) in registers due to the limited resources.
- Make sure all the P × Q cores are running. This calls for careful assignment of the limited resources (warps, shared memory, registers). In our case, we limit registers per thread to 16 and threads per block to 128; otherwise some of the Q cores on each MP will "starve" and be disabled.
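The last point can be sanity-checked with back-of-the-envelope occupancy arithmetic. The per-MP resource figures below are typical GT200-class numbers assumed by us, not taken from the paper:

```python
# Rough occupancy check: how many thread blocks stay resident on one
# multiprocessor given per-thread registers, block size, and shared
# memory per block.  The per-MP capacities (16384 registers, 16 KB
# shared memory, 1024 threads) are our assumed GT200-class figures.

def blocks_per_mp(regs_per_thread, threads_per_block, smem_per_block,
                  regs_per_mp=16384, smem_per_mp=16384, max_threads=1024):
    """Resident blocks per MP: the tightest of the three resource limits."""
    by_regs = regs_per_mp // (regs_per_thread * threads_per_block)
    by_smem = smem_per_mp // smem_per_block
    by_threads = max_threads // threads_per_block
    return min(by_regs, by_smem, by_threads)

# 16 registers/thread and 128 threads/block keep the MP fully loaded:
print(blocks_per_mp(16, 128, 2048))
# Doubling register use halves the register-limited residency,
# which is how cores end up "starving":
print(blocks_per_mp(32, 128, 2048))
```

The design choice in the bullet (16 registers, 128 threads) is exactly the point where none of the three limits becomes the bottleneck under these assumed capacities.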
6. Hardware implementation schemes
6.1. Top-level hardware architecture
6.2. The reconfigurable switch network
- Full cyclic-shift: The output is a cyclic shift of all S inputs, i.e., x_c, x_{c+1}, ..., x_S, x_1, x_2, ..., x_{c−1}, where 1 ≤ c ≤ S.
- Partial cyclic-shift: The output is a cyclic shift of the first p inputs, while the other signals can be in arbitrary order, i.e., x_c, x_{c+1}, ..., x_p, x_1, x_2, ..., x_{c−1}, x_*, ..., x_*, where 1 ≤ c < p < S, and x_* can be any signal from x_{p+1} to x_S.
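The two output patterns can be illustrated with a short sketch (function names and the chosen tail ordering are ours; the hardware may emit any tail order in the partial case):

```python
# Illustration of the two switch-network output patterns, using 1-based
# shift amount c as in the text.

def full_cyclic_shift(x, c):
    """Cyclic shift of all S inputs: x_c, ..., x_S, x_1, ..., x_{c-1}."""
    return x[c - 1:] + x[:c - 1]

def partial_cyclic_shift(x, c, p):
    """First p inputs cyclically shifted; the remaining S - p inputs may
    follow in any order (here: their original order)."""
    return x[c - 1:p] + x[:c - 1] + x[p:]

x = list(range(1, 9))                 # S = 8 toy inputs x_1 .. x_8
print(full_cyclic_shift(x, 3))        # x_3 .. x_8, x_1, x_2
print(partial_cyclic_shift(x, 2, 5))  # x_2 .. x_5, x_1, then x_6 .. x_8
```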
| Scale | W = 9, S = 256, 15 stages, 128 × 15 MUX |
| --- | --- |
| Support | p = 64:8:256, 1 ≤ c ≤ p |
| Resource | 20866 LE, 388 memory bits |
| Clock | 100 MHz on Cyclone II, 240 MHz on Stratix III |
| Delay | Two clocks for control signals, four clocks for output |
6.3. The offset-threshold decoding scheme
6.4. The split-row MMSA core
6.5. The early-stopping scheme
6.6. The multi-block scheme
7. Numerical simulation
| | Candidate code | WiMAX code |
| --- | --- | --- |
| Cycles: length 6/8 | 0/55 | 5/150 |
| Conflicts: gap 1/2/3 | 0/3/9 | 5/11/15 |
| Pipeline occupancy | Before ACO: 76/88; After ACO: 76/81 | Only layer permu.: 76/96 |
| | GPU (ours) | CPU | GPU [12] |
| --- | --- | --- | --- |
| Platform | NV. GTX260 | Intel Core2 Quad | NV. 8800GTX |
| Clock frequency | 1.24 GHz | 2.66 GHz | 1.35 GHz |
| Decoding method | Semi-parallel LMMSA | Semi-parallel LMMSA | Full-parallel BP |
| Blocks × threads | 216 × 96 | 1 | 128 × 256 |
| Net throughput | 24.5 Mbps | 50 Kbps | 25 Mbps |
| Precision | Floating-point | Floating-point | 8-bit fixed-point |
8. The multi-mode high-throughput decoder
| FPGA platform | Altera Stratix III EP3SL340F1517C2 |
| --- | --- |
| Decoding scheme | Layered offset-threshold MSA |
| Modes supported | 9 × 3 = 27 modes |
| Code length | N = 1536:768:6144 (z_f = 64:32:256) |
| Code rate | R = 1/2, 2/3, 3/4 (H_b: 12 × 24, 8 × 24, 6 × 24) |
| Iteration number | iter = 1–31, 20 recommended |
| Resource usage | 149,976 LE; 3,157,136 memory bits |
| BER performance | gap ≤ 0.2 dB vs. 20-iteration floating-point LMMSA |
| Clock setup | 225.58 MHz |
| Stable net throughput | 721.58 Mbps (z_f = 256, R = 1/2, iter = 20) |
| Max. net throughput | 1.2 Gbps (early-stopping, iter = 12 on average) |