
## About this book

This book constitutes the refereed proceedings of the 8th International Symposium on Reconfigurable Computing: Architectures, Tools and Applications, ARC 2012, held in Hong Kong, China, in March 2012. The 35 revised papers presented, consisting of 25 full papers and 10 poster papers, were carefully reviewed and selected from 44 submissions. The topics covered are applied RC design methods and tools, applied RC architectures, applied RC applications, and critical issues in applied RC.

## Table of contents

### Automating Reconfiguration Chain Generation for SRL-Based Run-Time Reconfiguration

Run-time reconfiguration (RTR) of FPGAs is mainly done using the configuration interface. However, for a certain group of designs, RTR using the shift register functionality of the LUTs is a much faster alternative than conventional RTR using the ICAP. This method requires the creation of reconfiguration chains connecting the run-time reconfigurable LUTs (SRL). In this paper, we develop and evaluate a method to generate these reconfiguration chains in an automated way so that their influence on the RTR design is minimised and the reconfiguration time is optimised. We do this by solving a constrained multiple travelling salesman problem (mTSP) based on the placement information of the run-time reconfigurable LUTs. An algorithm based on simulated annealing was developed to solve this new constrained mTSP. We show that using the proposed method, reconfiguration chains can be added with minimal influence on the clock frequency of the original design.

Karel Heyse, Brahim Al Farisi, Karel Bruneel, Dirk Stroobandt

### Architecture-Aware Reconfiguration-Centric Floorplanning for Partial Reconfiguration

Partial reconfiguration (PR) has enabled the adoption of FPGAs in state-of-the-art adaptive applications. Current PR tools require the designer to perform manual floorplanning, which requires knowledge of the physical architecture of FPGAs and an understanding of how to floorplan for optimal performance and area. This has led to PR remaining a specialist skill and made it less attractive to high-level system designers. In this paper we introduce a technique that can be incorporated into the existing tool flow and removes the need for manual floorplanning of PR designs. It takes into account the overheads caused by PR as well as the architecture of the latest FPGAs. This results in a floorplan that is efficient for PR systems, where reconfiguration time and area should be minimised.

Kizheppatt Vipin, Suhaib A. Fahmy

### Domain-Specific Language and Compiler for Stencil Computation on FPGA-Based Systolic Computational-Memory Array

This paper presents a domain-specific language for stencil computation (DSLSC) and its compiler for our FPGA-based systolic computational-memory array (SCMA). In DSLSC, we can program stencil computations by describing their mathematical form instead of writing an explicit, hand-optimized procedure. The compiler automatically parallelizes stencil computations for the processing elements (PEs) of the SCMA, and schedules multiply-and-add operations for the PEs, taking into account the data-reference delay via a local memory or the communication FIFOs between PEs. For arbitrary grid sizes of 2D Jacobi compilation with 3x3 and 5x5 stencils, the compiler achieves high PE utilization, 85.6 % and 92.18 %, which is close to the 87.5 % and 93.75 % of the ideal cases, respectively.

Wang Luzhou, Kentaro Sano, Satoru Yamamoto

### Exploiting Both Pipelining and Data Parallelism with SIMD Reconfigurable Architecture

Reconfigurable architectures (RAs), which provide extremely high energy efficiency for certain application domains, have one problem: current mapping algorithms for them do not scale well with the number of cores. One approach to this problem is to use the SIMD (Single Instruction, Multiple Data) paradigm. However, SIMD can complicate the mapping problem by adding another dimension, i.e., iteration mapping, to the already inter-dependent problems of data mapping and operation mapping, and can significantly affect performance through memory bank conflicts. In this paper we introduce the SIMD reconfigurable architecture, which allows for SIMD mapping at multiple levels of granularity, and investigate ways to minimize bank conflicts in a SIMD reconfigurable architecture with the related sub-problems taken into consideration. We further present data tiling and evaluate a conflict-free scheduling algorithm as a way to eliminate bank conflicts for a certain class of iteration and data mapping.

Yongjoo Kim, Jongeun Lee, Jinyong Lee, Toan X. Mai, Ingoo Heo, Yunheung Paek

### Table-Based Division by Small Integer Constants

Computing cores to be implemented on FPGAs may involve divisions by small integer constants in fixed or floating point. This article presents a family of architectures addressing this need. They are derived from a simple recurrence whose body can be implemented very efficiently as a look-up table that matches the hardware resources of the target FPGA. For instance, division of a 32-bit integer by the constant 3 may be implemented by a combinatorial circuit of 48 LUT6 on a Virtex-5. Other options are studied, including iterative implementations, and architectures based on embedded memory blocks. This technique also computes the remainder. An efficient implementation of the correctly rounded division of a floating-point constant by such a small integer is also presented.

Florent de Dinechin, Laurent-Stéphane Didier
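The recurrence mentioned in the abstract can be illustrated in software. The following is a minimal sketch, not the authors' implementation: it assumes the dividend is split into radix-2^k digits and processed from the most significant digit down, with one table lookup per digit that maps (incoming remainder, digit) to (quotient digit, outgoing remainder). The function names `build_table` and `divide_by_constant` and the choice k = 4 are illustrative assumptions. For d = 3 the remainder fits in 2 bits, so a 4-bit digit gives a 6-input table, which is one plausible reading of how the table "matches" a LUT6.

```python
def build_table(d, k):
    """Table for one recurrence step (hypothetical helper).

    Maps (remainder_in, digit) -> (quotient_digit, remainder_out),
    where remainder_in < d and digit < 2**k.
    """
    return {
        (r, x): divmod((r << k) + x, d)
        for r in range(d)
        for x in range(1 << k)
    }


def divide_by_constant(x, d, k, n_bits):
    """Divide an n_bits-wide unsigned integer x by the small constant d.

    Processes x digit by digit (radix 2**k), most significant digit first,
    using only table lookups. Returns (quotient, remainder).
    """
    table = build_table(d, k)
    n_digits = -(-n_bits // k)          # ceiling division
    mask = (1 << k) - 1
    quotient, remainder = 0, 0
    for i in reversed(range(n_digits)):
        digit = (x >> (i * k)) & mask
        q_digit, remainder = table[(remainder, digit)]
        quotient = (quotient << k) | q_digit
    return quotient, remainder


print(divide_by_constant(100, 3, 4, 32))  # → (33, 1)
```

Because each step consumes the remainder of the previous one, the remainder of the whole division falls out of the last lookup for free, which is consistent with the abstract's remark that the technique also computes the remainder.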

### Heterogeneous Systems for Energy Efficient Scientific Computing

This paper introduces a novel approach for exploring heterogeneous computing engines which include GPUs and FPGAs as accelerators. Our goal is to systematically automate finding solutions for such engines that maximize energy efficiency while meeting requirements in throughput and in resource constraints. The proposed approach, based on a linear programming model, enables optimization of system throughput and energy efficiency, and analysis of energy efficiency sensitivity and power consumption issues. It can be used in evaluating current and future computing hardware and interfaces to identify appropriate combinations. A heterogeneous system containing a CPU, a GPU and an FPGA with a PCI Express interface is studied based on the High Performance Linpack application. Results indicate that such a heterogeneous computing system is able to provide energy-efficient solutions to scientific computing with various performance demands. The improvement of system energy efficiency is more sensitive to some of the system components, for example in the studied system concurrently improving the energy efficiency of the interface and the GPU by 10 times could lead to over 10 times improvement of the system energy efficiency.

Qiang Liu, Wayne Luk

### The Q2 Profiling Framework: Driving Application Mapping for Heterogeneous Reconfigurable Platforms

Heterogeneous multicore architectures pose specific challenges regarding their programmability and they require smart mapping schemes to make efficient use of different processing elements. Various criteria can drive this mapping, such as computational intensity, memory requirements, and area consumption. In order to facilitate this complex mapping task, there is a clear need for tools that investigate the use of such critical resources, like memory and hardware area. For this purpose, we developed the Q2 profiling framework. It consists of two main parts: an advanced memory access profiling toolset, which provides detailed information on the runtime memory access patterns of an application and a statistical modeling component, which makes hardware area predictions early in the design phase based on software metrics. These tools are integrated using a partitioning methodology. We demonstrate the effectiveness of our framework using three applications in our experiments. One application is further detailed in a case study to illustrate the use of our methodology. Experimental results show application speedup of up to 2.92×.

### PPMC: A Programmable Pattern Based Memory Controller

One of the main challenges in the design of hardware accelerators is the efficient access of data from the external memory. Improving and optimizing the functionality of the memory controller between the external memory and the accelerators is therefore critical. In this paper, we advance toward this goal by proposing PPMC, the Programmable Pattern-based Memory Controller. This controller supports scatter-gather and strided 1D, 2D and 3D accesses with programmable tiling. Compared to existing solutions, the proposed system provides better performance, simplifies programming access patterns and eases software integration by interfacing to high-level programming languages. In addition, the controller offers an interface for automating domain decomposition via tiling. We implemented and tested PPMC on a Xilinx ML505 evaluation board using a MicroBlaze soft-core as the host processor. The evaluation uses six memory intensive application kernels: Laplacian solver, FIR, FFT, Thresholding, Matrix Multiplication, and 3D-Stencil. The results show that the PPMC-enhanced system achieves at least 10x speed-ups for 1D, 2D and 3D memory accesses as compared to a non-PPMC based setup.

### A Run-Time Task Migration Scheme for an Adjustable Issue-Slots Multi-core Processor

In this paper, we present a run-time task migration scheme for an adjustable/reconfigurable issue-slots very long instruction word (VLIW) multi-core processor. The processor has four 2-issue ρ-VEX VLIW cores that can be merged together to form larger issue-width cores. With a task migration scheme, a code running on a core can be shifted to a larger or a smaller issue-width core for increasing the performance or reducing the power consumption of the whole system, respectively. All the cores can be utilized in an efficient manner, as a core needed for a specific job can be freed at run-time by shifting its running code to another core. The task migration scheme is realized with the implementation of interrupts on the ρ-VEX cores. The design is implemented in a Xilinx Virtex-6 FPGA. With different benchmarks, we demonstrate that migrating a task running on a smaller issue-width core to a larger issue-width core at run-time results in a considerable performance gain (up to 3.6x). Similarly, gating off one, two, three, or four cores can reduce the dynamic power consumption of the whole system by 24%, 42%, 61%, or 81%, respectively.

Fakhar Anjam, Quan Kong, Roel Seedorf, Stephan Wong

### Boosting Single Thread Performance in Mobile Processors via Reconfigurable Acceleration

Mobile processors, a subclass of embedded processors, are increasingly employing multicore designs to improve performance. This often requires sacrificing resources in each CPU, degrading single thread performance which is still important according to Amdahl’s law. The traditional technique for efficiently boosting serial performance in embedded processors, dedicated hardware acceleration, is unsuitable for modern mobile processors because of the heterogeneity and the diversity of applications they run. This paper proposes ‘general purpose’ accelerators, reconfigured on an application-by-application basis, as a means of increasing single thread performance. These accelerators are placed within the datapath of CPUs and support dynamic compilation. This paper presents the design of an architecture with such accelerators and evaluates the cost/performance implications of the design.

Geoffrey Ndu, Jim Garside

### Complexity Analysis of Finite Field Digit Serial Multipliers on FPGAs

This paper presents the complexity analysis of digit serial finite field multipliers over $GF(2^m)$ on FPGAs. Instead of discussing the complexity by using AND and XOR gates as primitives, we present the complexity analysis directly based on FPGA primitives, e.g., Look-Up-Tables (LUTs). Given digit size $d$, the number of LUTs and the level of LUT delay are estimated. The previous ASIC based complexity analysis shows the optimum digit size (for Area-Time-Product) is $2^l - 1$. We show in this work that the optimum digit sizes are different on FPGAs. They are those digits $d$ which satisfy $\lceil \frac{m}{d-1} \rceil \neq \lceil \frac{m}{d} \rceil$. We also validate our analysis with experimental results on $GF(2^{163})$ and $GF(2^{233})$.

Gang Zhou, Li Li, Harald Michalik
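The digit-size condition $\lceil \frac{m}{d-1} \rceil \neq \lceil \frac{m}{d} \rceil$ is easy to enumerate. The sketch below is only an illustration of that inequality; the helper names and the upper bound `d_max` are assumptions, and the reading that $\lceil m/d \rceil$ counts the iterations of a digit-serial multiplier (so that a digit size is only interesting when it saves at least one iteration over $d-1$) is an interpretation rather than something the abstract states.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def candidate_digit_sizes(m, d_max):
    """Digit sizes d in [2, d_max] with ceil(m/(d-1)) != ceil(m/d).

    Hypothetical helper: it only evaluates the inequality from the
    abstract and says nothing about LUT counts or delays.
    """
    return [
        d for d in range(2, d_max + 1)
        if ceil_div(m, d - 1) != ceil_div(m, d)
    ]


print(candidate_digit_sizes(8, 8))  # → [2, 3, 4, 8]
```

For the field sizes named in the abstract, `candidate_digit_sizes(163, 16)` keeps every digit size from 2 to 15 and drops 16, because 15 and 16 both give 11 under the ceiling.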

### ScalableCore System: A Scalable Many-Core Simulator by Employing over 100 FPGAs

An FPGA-based processor prototyping system can simulate processor behavior quickly and enables longer simulations that yield useful evaluation information. In this paper we present ScalableCore system 3.3, an FPGA-based simulator of NoC-based tile architectures that employs multiple Xilinx Spartan-6 FPGAs. Two key techniques enable the system to achieve scalable simulation speed by using a number of FPGAs that corresponds to the target number of processor cores. We evaluated the behavior of a processor consisting of 100 cores and a mesh NoC using the system. The simulation is 129 times faster than a software-based simulator running on a standard computer with a Core i7 processor.

Shinya Takamaeda-Yamazaki, Shintaro Sano, Yoshito Sakaguchi, Naoki Fujieda, Kenji Kise

### Scalable Memory Hierarchies for Embedded Manycore Systems

As the size of FPGA devices grows following Moore’s law, it becomes possible to put a complete manycore system onto a single FPGA chip. The centralized memory hierarchy of typical embedded systems, in which both data and instructions are stored in the off-chip global memory, introduces a bus contention problem as the number of processing cores increases. In this work, we present our exploration into how distributed multi-tiered memory hierarchies can affect the scalability of manycore systems. We use Xilinx Virtex FPGA devices as the testing platforms and buses as the interconnect. Several variants of the centralized memory hierarchy and the distributed memory hierarchy are compared by running various benchmarks, including matrix multiplication, IDEA encryption and 3D FFT. The results demonstrate the good scalability of the distributed memory hierarchy for systems of up to 32 MicroBlaze processors, a limit imposed by the FPGA resources of the Virtex-6LX240T device.

Sen Ma, Miaoqing Huang, Eugene Cartwright, David Andrews

### Triple Module Redundancy of a Laser Array Driver Circuit for Optically Reconfigurable Gate Arrays

Demand is increasing daily for a robust field programmable gate array that can operate in a radiation-rich space environment, such as on spacecraft, satellites, and space stations. Optically reconfigurable gate arrays (ORGAs) are under development as robust field programmable gate arrays. Their holographic memories can generate correct configuration contexts at any time, even if up to 20 % of the holographic memory data are damaged. However, the effect of soft errors on the laser array of ORGA devices has not been discussed so far. Therefore, this paper first proposes a method to detect an unexpected configuration procedure caused by a soft error in the laser array driver circuit of conventional ORGA architectures, and to recover from such a procedure. It then proposes a new robust laser array driver circuit, applicable to any ORGA architecture, that removes the unexpected configuration procedure altogether.

Takahiro Watanabe, Minoru Watanabe

### A Routing Architecture for FPGAs with Dual-VT Switch Box and Logic Clusters

In this paper, we present a novel routing architecture for FPGAs with dual-$V_T$ LUT and switch box architectures. The use of reverse back bias (RBB) is one strategy for mitigating leakage power, a critical issue as process technologies shrink relentlessly towards sub-nano proportions. FPGAs with the ability to adjust fabric $V_T$ at configuration time offer leakage power reduction without sacrificing circuit speed. Most of the related works today investigate dual-$V_T$ optimizations at the logic cluster level; Altera’s Stratix-III/IV line of FPGAs already demonstrate the feasibility of a similar architecture. In this work, we present a further advancement to the dual-$V_T$ architecture - the switch box, and a routing architecture that demonstrates the effectiveness of this true dual-$V_T$ fabric architecture. Our switch box advancement alone yields an average of 17.44% in leakage power savings, and with the full EDA flow an average 29.65% in total power savings is observed.

Wei Ting Loke, Yajun Ha

### Multi-level Customisation Framework for Curve Based Monte Carlo Financial Simulations

One of the main challenges when accelerating financial applications using reconfigurable hardware is the management of design complexity. This paper proposes a multi-level customisation framework for automatic generation of complex yet highly efficient curve based financial Monte Carlo simulators on reconfigurable hardware. By identifying multiple levels of functional specialisations and the optimal data format for the Monte Carlo simulation, we allow different levels of programmability in our framework to retain good performance and support multiple applications. Designs targeting a Virtex-6 SX475T FPGA generated by our framework are about 40 times faster than single-core software implementations on an i7-870 quad-core CPU at 2.93 GHz; they are over 10 times faster and 20 times more energy efficient than 4-core implementations on the same i7-870 quad-core CPU, and are over three times more energy efficient and 36% faster than a highly optimised implementation on an NVIDIA Tesla C2070 GPU at 1.15 GHz. In addition, our framework is platform independent and can be extended to support CPU and GPU applications.

Qiwei Jin, Diwei Dong, Anson H. T. Tse, Gary C. T. Chow, David B. Thomas, Wayne Luk, Stephen Weston

### A Low-Cost and High-Performance Virus Scanning Engine Using a Binary CAM Emulator and an MPU

This paper shows a virus scanning engine using two-stage matching. In the first stage, a binary CAM emulator quickly detects a part of the virus pattern, while in the second stage, the MPU detects the full length of the virus pattern. The binary CAM emulator is realized by four index generation units (IGUs). The proposed system uses four off chip SRAMs and a small FPGA. Thus, the cost and the power consumption are lower than the TCAM-based system. The system loaded 1,290,617 ClamAV virus patterns. As for the area and throughput, this system outperforms existing FPGA-based implementations.

Hiroki Nakahara, Tsutomu Sasao, Munehiro Matsuura

### Cost Effective Implementation of Flux Limiter Functions Using Partial Reconfiguration

Computational Fluid Dynamics (CFD) is a common design tool in the aerospace industry. UPACS, a CFD package, is convenient for users because a customized simulator can be built just by selecting the required functions. Its drawback is computation speed, which is hard to improve with clusters because of its complex memory access patterns. FPGA-based accelerators are promising candidates for an economical solution. However, UPACS as a whole is too large to be implemented on a small number of FPGAs. For a cost-efficient implementation, this paper proposes partial reconfiguration, which dynamically reconfigures only the required functions. The MUSCL algorithm, which is used frequently in UPACS, is selected as the target. Partial reconfiguration is applied to the flux limiter functions (FLFs) in MUSCL. Four FLFs are implemented for Turbulence MUSCL (TMUSCL) and eight for Convection MUSCL (CMUSCL). All FLFs are developed independently and separated from the top MUSCL module. At start-up, only the required FLFs are selected and deployed to the system without interfering with the other modules. This implementation reduces resource utilization by 44% to 63% and total power consumption by 33%. Configuration is 34 times faster than with full reconfiguration. All implemented functions achieve a speed-up of at least 17 times over the software implementation.

Mohamad Sofian Abu Talip, Takayuki Akamine, Yasunori Osana, Naoyuki Fujita, Hideharu Amano

### Parallel Tempering MCMC Acceleration Using Reconfigurable Hardware

Markov Chain Monte Carlo (MCMC) is a family of algorithms which is used to draw samples from arbitrary probability distributions in order to estimate - otherwise intractable - integrals. When the distribution is complex, simple MCMC becomes inefficient and advanced variations are employed. This paper proposes a novel FPGA architecture to accelerate Parallel Tempering, a computationally expensive, popular MCMC method, which is designed to sample from multimodal distributions. The proposed architecture can be used to sample from any distribution. Moreover, the work demonstrates that MCMC is robust to reductions in the arithmetic precision used to evaluate the sampling distribution and this robustness is exploited to improve the FPGA’s performance. A 1072x speedup compared to software and a 3.84x speedup compared to a GPGPU implementation are achieved when performing Bayesian inference for a mixture model without any compromise on the quality of results, opening the way for the handling of previously intractable problems.

Grigorios Mingas, Christos-Savvas Bouganis
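For readers unfamiliar with Parallel Tempering, the algorithm itself can be sketched in a few lines of software. This is a generic textbook-style illustration, not the FPGA architecture of the paper: the bimodal target density, the temperature ladder, the proposal step size and all function names are assumptions chosen for the example. Several chains run Metropolis updates on tempered versions of the target, and neighbouring chains occasionally swap states so that the coldest chain can move between modes.

```python
import math
import random


def log_target(x):
    """Unnormalised log-density of an equal mixture of N(-3, 1) and N(3, 1).

    Hypothetical example target; evaluated with a log-sum-exp for stability.
    """
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def parallel_tempering(n_steps, temperatures, step_size=1.0, seed=0):
    """Return n_steps samples from the chain at temperatures[0].

    Assumes temperatures[0] == 1.0 (the untempered target) and at least
    two temperatures in increasing order.
    """
    rng = random.Random(seed)
    states = [0.0] * len(temperatures)
    samples = []
    for _ in range(n_steps):
        # Local random-walk Metropolis update in every chain.
        for i, temp in enumerate(temperatures):
            proposal = states[i] + rng.gauss(0.0, step_size)
            log_alpha = (log_target(proposal) - log_target(states[i])) / temp
            if rng.random() < math.exp(min(0.0, log_alpha)):
                states[i] = proposal
        # Propose swapping the states of one random pair of adjacent chains.
        i = rng.randrange(len(temperatures) - 1)
        j = i + 1
        log_alpha = (1.0 / temperatures[i] - 1.0 / temperatures[j]) * (
            log_target(states[j]) - log_target(states[i])
        )
        if rng.random() < math.exp(min(0.0, log_alpha)):
            states[i], states[j] = states[j], states[i]
        samples.append(states[0])
    return samples


samples = parallel_tempering(5000, [1.0, 2.0, 4.0, 8.0])
print(len(samples), sum(samples) / len(samples))
```

The inner loop over chains is independent per chain, which is the kind of parallelism a hardware implementation could exploit; the sketch runs it sequentially for clarity.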

### A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem

Iterative numerical algorithms with high memory bandwidth requirements but medium-size data sets (matrix size ~ a few 100s) are highly appropriate for FPGA acceleration. This paper presents a streaming architecture comprising floating-point operators coupled with high-bandwidth on-chip memories for the Lanczos method, an iterative algorithm for symmetric eigenvalues computation. We show the Lanczos method can be specialized only for extremal eigenvalues computation and present an architecture which can achieve a sustained single precision floating-point performance of 175 GFLOPs on Virtex6-SX475T for a dense matrix of size 335×335. We perform a quantitative comparison with the parallel implementations of the Lanczos method using optimized Intel MKL and CUBLAS libraries for multi-core and GPU respectively. We find that for a range of matrices the FPGA implementation outperforms both multi-core and GPU; a speed up of 8.2-27.3× (13.4× geo. mean) over an Intel Xeon X5650 and 26.2-116× (52.8× geo. mean) over an Nvidia C2050 when FPGA is solving a single eigenvalue problem whereas a speed up of 41-520× (103× geo.mean) and 131-2220× (408× geo.mean) respectively when it is solving multiple eigenvalue problems.

Abid Rafique, Nachiket Kapre, George A. Constantinides

### Optimising Performance of Quadrature Methods with Reduced Precision

This paper presents a generic precision optimisation methodology for quadrature computation targeting reconfigurable hardware to maximise performance at a given error tolerance level. The proposed methodology optimises performance by trading off integration grid density against the mantissa size of floating-point operators. The optimisation provides the number of integration points and the mantissa size that maximise throughput while meeting a given error tolerance requirement. Three case studies show that the proposed reduced precision designs on a Virtex-6 SX475T FPGA are up to 6 times faster than comparable FPGA designs with double precision arithmetic. They are up to 15.1 times faster and 234.9 times more energy efficient than an i7-870 quad-core CPU, and are 1.2 times faster and 42.2 times more energy efficient than a Tesla C2070 GPU.

Anson H. T. Tse, Gary C. T. Chow, Qiwei Jin, David B. Thomas, Wayne Luk

### Teaching Hardware/Software Codesign on a Reconfigurable Computing Platform

This paper reports on a practically oriented undergraduate course in Hardware/Software Codesign which uses an FPGA-based reconfigurable computing platform with a soft processor for analyzing and evaluating hardware/software trade-offs. The Altium Designer design flow was chosen for the practical lab exercises because it smoothly integrates HDL-based FPGA design with Embedded Programming. Furthermore, a “C to hardware” compiler makes it possible to quickly migrate functionality from software to hardware. A complete hardware/software system was emulated on the Altium NanoBoard 3000XN. The board was also used for group projects ranging from image processing to digital audio and video processing.

Markus Weinhardt

### Securely Sealing Multi-FPGA Systems

The importance of hardware security of electronic systems is rapidly increasing due to (1) the increasing reliance of mass-produced and mission-critical systems on embedded electronics, and (2) the ever-growing supply chains that disentangle chip designers and manufacturers from OEMs. Our work shows how to dramatically reduce vulnerability to Trojan-horse injection and in-field component replacement. We propose methods to verify the authenticity and integrity of an FPGA configuration during startup and at runtime. We also developed efficient protocols for electronic sealing of a multi-FPGA system, which automatically enforces the system configuration detected upon power-up and bans further modifications.

Tim Güneysu, Igor Markov, André Weimerskirch

### FPGA Paranoia: Testing Numerical Properties of FPGA Floating Point IP-Cores

In the early days of computing, hardware platforms were developed independently and created their own conventions for floating point to suit their underlying hardware architecture, but this meant computer programmers had to understand these conventions when designing their algorithms, and adapt their algorithms when porting to new platforms. As a result, the IEEE-754-1985 standard was created to simplify design for computer programmers by ensuring that the same software will obtain the same results across all hardware platforms. While most computers largely adhere to the standard, sometimes corner cases can be missed. Paranoia is a test suite written by William Kahan in 1983, designed to discover obvious flaws in non-compliant floating point arithmetic. The Paranoia test suite continues to show errors and inconsistencies in modern computers and compiler libraries, and has recently found similar flaws in GPUs [1]. FPGAs have historically been used to create custom hardware designs, with a focus on performance for an application specific design, meaning such portability has not been an issue. However, transistor scaling has led to FPGAs with the potential for high floating point performance, and as such FPGA-based accelerators are increasingly adopting standard single or double precision cores for high-performance computing applications. This paper therefore presents a framework that allows FPGA IP-cores to be tested against the Paranoia benchmark, so that FPGA IP-cores can be subjected to the same rigorous testing as their CPU equivalents. In this paper, we discuss this effort and provide compliance results for the main vendor and open source core generators.

Xuan You Tan, David Boland, George Constantinides

### High Performance Reconfigurable Architecture for Double Precision Floating Point Division

Floating point arithmetic (FPA) is a crucial domain for hardware acceleration and is widely used across a vast range of applications. Division is a very demanding FPA operation in terms of complexity, area requirement and performance. This paper presents an efficient FPGA implementation of double-precision floating point division on the Virtex-2pro FPGA platform, chosen for ease of comparison with prior work. The proposed method is based on binomial expansion and uses look-up tables and partial block multipliers (PBM). Compared with previously reported work, the proposed design occupies a smaller area (in terms of the number of slices, the number of multipliers and the BRAM usage) with higher performance and lower latency. Using over 5 million unique random test cases, our results show that the proposed design gives an average error of less than 0.5 ULP (unit in the last place) and a maximum error of 2 ULP without using any rounding scheme. Rounding can also be added to the design to restore some accuracy at a slight cost in area.

Manish Kumar Jaiswal, Ray C. C. Cheung

### A Modular-Based Assembly Framework for Autonomous Reconfigurable Systems

The configurable systems community has recognized the value of FPGAs in adaptable and scalable autonomous systems. While the underlying hardware framework for supporting run-time reconfiguration has existed for years, very few FPGA applications have benefited from it. This is likely due to the reconfiguration model provided by the vendors, and several alternative modes of assembly have therefore been suggested, such as tile-based assembly and modular-based assembly. This paper proposes a framework based on modular-based assembly. The framework builds on TORC, an open-source C++ infrastructure and tool set for reconfigurable computing. A GNU Radio generated ZigBee demodulator is implemented using the proposed solution.

Tannous Frangieh, Richard Stroop, Peter Athanas, Teresa Cervero

### Constructing Cluster of Simple FPGA Boards for Cryptologic Computations

In this paper, we propose an FPGA cluster infrastructure, which can be utilized in implementing cryptanalytic attacks and accelerating cryptographic operations. The cluster can be formed using simple and inexpensive, off-the-shelf FPGA boards featuring an FPGA device, local storage, CPLD, and network connection. Forming the cluster is simple and no effort for the hardware development is needed except for the hardware design for the actual computation. Using a softcore processor on FPGA, we are able to configure FPGA devices dynamically and change their configuration on the fly from a remote computer. The softcore on FPGA can execute relatively complicated programs for mundane tasks unworthy of FPGA resources. Finally, we propose and implement a fast and efficient dynamic configuration switch technique that is shown to be useful especially in cryptanalytic applications. Our infrastructure provides a cost-effective alternative for formerly proposed cryptanalytic engines based on FPGA devices.

Yarkin Doröz, Erkay Savaş

### Reconfigurable Multicore Architecture for Dynamic Processor Reallocation

One of the challenges of multicore design is providing data quickly to all the processor cores running on a system. Recent proposals of hybrid and reconfigurable interconnect architectures try to take advantage of data locality to a certain extent by grouping processors that work on the same data. In this paper, we propose migrating processors instead of data to take advantage of data locality. This is realized by implementing a reconfigurable interconnect that allows reassignment of processor cores to different routers at runtime. We present the proposed architecture in detail, show a segmented hardware implementation of it, and discuss experimental results using the PARSEC benchmark that show its performance gains. Our results show a gain in average L2 access time of up to 24% for the proposed architecture compared to a hybrid architecture without reconfiguration. Finally, we present area and performance data based on a detailed Verilog model and synthesis of the proposed architecture.

Annie Avakian, Natwar Agrawal, Ranga Vemuri

### Efficient Communication for FPGA Clusters

Efficient communication between nodes is critical for achieving high performance in a computer cluster. Based on a dedicated inter-accelerator network, we enhance this communication with advanced networking functions, such as broadcasting and priority routing. This work enables decoupling user applications from physical network implementations, improving overall communication efficiency and modularity. A performance model is introduced taking into account application and platform specific parameters. Experiments are performed for various network configurations and application patterns. The results show up to a 55% reduction of communication time when employing our approach.

Stewart Denholm, Kuen Hung Tsoi, Peter Pietzuch, Wayne Luk

### Performance Analysis of Reconfigurable Processors Using MVA Analysis

Collaboration of Reconfigurable processing elements in Grid Computing (CRGC) promises to provide both flexibility and performance for the computationally intensive tasks found in large applications. Reconfigurable computing provides much more flexibility than Application-Specific Integrated Circuits (ASICs) and much more performance than General-Purpose Processors (GPPs). GPPs, reconfigurable elements (REs), and hybrid elements (which integrate GPPs and REs) are the main processing elements in CRGC. In this paper, we propose closed queuing models for grid networks that incorporate these three kinds of processing element. We examine two different models, one with feedback and one without. The performance metrics are the average response time and throughput. Mean Value Analysis (MVA) is used to compute these performance measures analytically, and the proposed models are validated by comparing the analytical results with simulation using OMNeT++. The comparison of the simulation and analytical results suggests that the total average error is less than 1.4% for the models with feedback and less than 1.8% for the models without feedback.
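As a rough illustration of the analytical technique named in this abstract, the following is a minimal sketch of textbook exact MVA for a single-class closed queuing network. It is not the authors' model: the function name `mva`, the three stations, and the service demand values are hypothetical and chosen only to show how response time and throughput follow from the recursion.

```python
def mva(service_demands, think_time, n_jobs):
    """Exact single-class Mean Value Analysis for a closed queuing network.

    service_demands: mean service demand D_k (seconds) at each queueing station.
    think_time: delay (seconds) a job spends outside the stations; 0 for none.
    n_jobs: number of jobs N circulating in the network (N >= 1).
    Returns (throughput, response_time, queue_lengths) at population N.
    """
    queue = [0.0] * len(service_demands)  # mean queue lengths with 0 jobs
    throughput = response = 0.0
    for n in range(1, n_jobs + 1):
        # Arrival theorem: an arriving job sees the network as it would be
        # in steady state with n - 1 jobs.
        residence = [d * (1.0 + q) for d, q in zip(service_demands, queue)]
        response = sum(residence)
        # Little's law applied to the whole network.
        throughput = n / (think_time + response)
        # Little's law applied to each station.
        queue = [throughput * r for r in residence]
    return throughput, response, queue


# Hypothetical demands (seconds) for a GPP, an RE, and a hybrid element.
demands = [0.05, 0.02, 0.03]
throughput, response, queue = mva(demands, think_time=0.0, n_jobs=5)
print(f"throughput={throughput:.3f} jobs/s, response={response:.3f} s")
```

In a sketch like this, one plausible way to represent feedback is to fold it into the service demand: a station that a job re-enters with probability p is visited 1/(1 - p) times on average, so its demand would be its service time multiplied by that visit ratio. Whether the paper models feedback this way is an assumption.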

### PDPR: Fine-Grained Placement for Dynamic Partially Reconfigurable FPGAs

Dynamic Partial Reconfiguration (DPR) enhances conventional FPGA applications by providing additional benefits. However, manual floorplanning is arbitrary and placement is limited by local search, so combining the two stages into a single global optimization is a promising approach. In this paper, we propose PDPR, a novel approach for DPR FPGAs that aims to perform floorplanning and placement in one step. Experimental results show that our approach improves total wire length by 32.8%, reconfiguration cost by 48.5%, and congestion by 36.9%.

Ruining He, Guoqiang Liang, Yuchun Ma, Yu Wang, Jinian Bian

### A Connection Router for the Dynamic Reconfiguration of FPGAs

Dynamic Circuit Specialization (DCS) is a new FPGA CAD tool flow that uses Run-Time Reconfiguration to automatically specialize the FPGA configuration for a whole range of specific data values. DCS implementations are a factor 5 faster and need a factor 8 fewer LUTs compared to conventional implementations. We propose a novel routing algorithm for reconfigurable routing, called the Connection router. In contrast to TRoute, another reconfiguration-aware router, our new router is fully automated and far more scalable.

Elias Vansteenkiste, Karel Bruneel, Dirk Stroobandt

### R-NoC: An Efficient Packet-Switched Reconfigurable Networks-on-Chip

Networks-on-Chip (NoC) architectures have been proposed to replace the classical bus and point-to-point global interconnections for the next generation of multi-core systems-on-chip. However, the one-to-one (unicast) NoC communication paradigm is not efficient for one-to-many (multicast) communication requests, and the address-based packet routing method lacks the capability to arrange routing globally for overall communication performance. To address these problems, we propose a Reconfigurable NoC (R-NoC) architecture. The novelty of the R-NoC is that a structured virtual routing path can be established through the reconfiguration of routers, so that packets are delivered quickly along the pre-configured routing path. Load balancing for overall communication performance can be achieved through the global arrangement of routing paths. In addition, a custom network topology is proposed for a specific set of applications to reduce area and power costs. Software simulations show that the structured data path approach significantly improves multicast performance compared with the traditional multiple-unicast approach.

Hongbing Fan, Yue-Ang Chen, Yu-Liang Wu

### Novel Arithmetic Architecture for High Performance Implementation of SHA-3 Finalist Keccak on FPGA Platforms

We propose a high-speed architecture for Keccak that uses Look-Up Table (LUT) resources on FPGAs to minimize the area of the Keccak data path and to reduce critical path lengths. This approach allows us to design the Keccak data path with minimum resources and higher clock frequencies. We report our results in terms of chip area consumption, throughput, and throughput per area. At this time, the design presented in this work has the highest throughput of any design for the SHA-3 candidates, achieving 13.67 Gbps for Keccak-256 on Virtex 6. This can enable line-rate hashing on 10 Gbps network interfaces.

Kashif Latif, M. Muzaffar Rao, Athar Mahboob, Arshad Aziz

### CRAIS: A Crossbar Based Adaptive Interconnection Scheme

This paper proposes a crossbar-based on-chip adaptive interconnection scheme named CRAIS. CRAIS uses a crossbar to connect processors and IP cores in an MPSoC. The interconnect topology of CRAIS can be dynamically reconfigured during execution. Empirical results on an FPGA prototype demonstrate that CRAIS runs correctly with affordable hardware cost.

Chao Wang, Xi Li, Xuehai Zhou, Xiaojing Feng

### Backmatter
