main-content

This book constitutes the proceedings of the 16th International Symposium on Applied Reconfigurable Computing, ARC 2020, held in Toledo, Spain, in April 2020.

The 18 full papers and 11 poster presentations presented in this volume were carefully reviewed and selected from 40 submissions. The papers are organized in the following topical sections: design methods & tools; design space exploration & estimation techniques; high-level synthesis; architectures; applications.

### Improving Performance Estimation for FPGA-Based Accelerators for Convolutional Neural Networks

Abstract
Field-programmable gate array (FPGA) based accelerators are being widely used for acceleration of convolutional neural networks (CNNs) due to their potential in improving the performance and reconfigurability for specific application instances. To determine the optimal configuration of an FPGA-based accelerator, it is necessary to explore the design space and an accurate performance prediction plays an important role during the exploration. This work introduces a novel method for fast and accurate estimation of latency based on a Gaussian process parametrised by an analytic approximation and coupled with runtime data. The experiments conducted on three different CNNs on an FPGA-based accelerator on Intel Arria 10 GX 1150 demonstrated a 30.7% improvement in accuracy with respect to the mean absolute error in comparison to a standard analytic method in leave-one-out cross-validation.
Martin Ferianc, Hongxiang Fan, Ringo S. W. Chu, Jakub Stano, Wayne Luk

### Judiciously Spreading Approximation Among Arithmetic Components with Top-Down Inexact Hardware Design

Abstract
Approximate logic synthesis is emerging as a promising avenue towards the development of efficient and high performance digital designs. Indeed, effective methodologies for the inexact simplification of arithmetic circuits have been introduced in recent years. Nonetheless, strategies enabling the integration of multiple approximate components to realise complex approximate hardware modules, able to maximise gains while controlling ensuing Quality-of-Service degradations, are still in their infancy. Against this backdrop, we herein describe a methodology to automatically distribute the error leeway assigned to a hardware design among its constituent operators. Our strategy is able to identify high-quality trade-offs among resource requirements, performance and exactness in digital implementations, across applications belonging to different domains, and without restrictions on the type and bit-width of their approximable arithmetic components.
Giovanni Ansaloni, Ilaria Scarabottolo, Laura Pozzi

### Optimising Operator Sets for Analytical Database Processing on FPGAs

Abstract
The high throughput and partial reconfiguration capabilities of modern FPGAs make them an attractive hardware platform for query processing in analytical database systems using overlay architectures. The design of existing systems is often solely based on hardware characteristics and thus does not account for all requirements of the application. In this paper, we identify two design issues impeding system integration of low-level database operators for runtime-reconfigurable overlay architectures on FPGAs: First, the granularity of operator sets within each processing pipeline; Second, the mapping of query (sub-)graphs to complex hardware operators. We solve these issues by modeling them as variants of the subgraph isomorphism problem. Via optimised operator fusion guided by a heuristic we reduce the number of required reconfigurable regions between 30% and 85% for relevant TPC-H database benchmark queries. This increase in area efficiency is achieved without performance penalties. In 86% of iterations of the operator fusion process, the proposed heuristic finds optimal candidates, which is 3.6$$\times$$ more often than for a naive greedy approach.
Anna Drewes, Jan Moritz Joseph, Bala Gurumurthy, David Broneske, Gunter Saake, Thilo Pionteck

### Automated Toolchain for Enhanced Productivity in Reconfigurable Multi-accelerator Systems

Abstract
Ease-of-use and faster implementation times are key challenges that the community has to face to extend the use of FPGAs to non-hardware experts. In this paper, these challenges are tackled by integrating ARTICo3 and IMPRESS tools to provide the users with a transparent way to build reconfigurable multi-accelerator systems. ARTICo3 is an integrated framework that provides an automated toolchain to generate a hardware-based processing architecture to transparently manage custom-made accelerators at runtime. IMPRESS is a reconfiguration tool for building highly-flexible reconfigurable systems. The integration of both tools results in an efficient reconfigurable design flow that decouples the implementation of reconfigurable accelerators from the implementation of an ARTICo3 static architecture that transparently distributes data to the accelerators. This static architecture is generated only once and reused in consecutive kernel implementations. This way, the user only needs to design the accelerators that are automatically implemented using interfaces compatible with the static architecture. From the user point of view, the reconfigurable fabric is a set of slots where accelerators can be transparently offloaded to decrease the workload on the processor. The integration of ARTICo3 and IMPRESS also allows building relocatable accelerators, thus reducing the overall memory footprint required for the partial bitstreams. Moreover, model-based design of accelerators using Simulink has also been included as an additional option for users with no hardware background to further simplify the use of reconfigurable systems. Experimental results show that the implementation time is improved by up to 2.96$$\times$$ for a 4-slot reconfigurable system implementation with a memory footprint reduction of 4.54$$\times$$.
Alberto Ortiz, Rafael Zamacola, Alfonso Rodríguez, Andrés Otero, Eduardo de la Torre

### Chisel Usecase: Designing General Matrix Multiply for FPGA

Abstract
To ease developers work in an industry where FPGA usage is constantly growing, we propose an alternative methodology for architecture design. Targeting FPGA boards, we aim at comparing implementations on multiple criteria. We implement it as a tool flow based on Chisel, taking advantage of high level functionalities to ease circuit design, evolution and reutilization, improving designers productivity.
We target a Xilinx VC709 board and propose a case study on General Matrix Multiply implementation using this flow, which demonstrates its usability with performances comparable to the state of the art, as well as the genericity one can benefit from when designing an application-specific accelerator. We show that we were able to generate, simulate and synthesize 80 different architectures in less than 24 h, allowing different trade-offs to be quickly and easily studied, from the most performant to the less costly, to easily comply with integration constraints.
Bruno Ferres, Olivier Muller, Frédéric Rousseau

### Cycle-Accurate Debugging of Embedded Designs Using Recurrent Neural Networks

Abstract
This research work presents a methodology for debugging embedded designs by using recurrent neural networks. In this methodology, a cycle-accurate lossless debugging system with unlimited trace window is used for debugging. The lossless trace resembles a time data series. A recurrent neural network trained either through a golden reference or from the actual time series can be used to predict the incoming debugging data. A bug can be easily isolated based upon the discrepancy between the received and the predicted time series. This allows to draw conclusions to speed up the debugging process.
Habib ul Hasan Khan, Ariel Podlubne, Gökhan Akgün, Diana Göhringer

### Soft-Error Analysis of Self-reconfiguration Controllers for Safety Critical Dynamically Reconfigurable FPGAs

Abstract
Reconfigurable SRAM-based Field Programmable Gate Arrays are increasingly deployed in the aerospace applications, due to their enhanced flexibility, high performance and run-time reconfiguration capabilities. The possibility to adapt on-the-fly the circuit functionality is made possible by the Internal Configuration Access Port (ICAP) that can be managed from the application through a dedicated controller. This feature enables the deployment of new optimized reconfigurable architectures for computationally intensive and fault-tolerant applications. In this context, a promising architecture is the Dynamically Reconfigurable Processing Module (DRPM), an FPGA-based modular system where the content of each reconfigurable module can be rewritten, overwritten or erased to perform performance optimization and functional modification at run-time. However, when these systems are adopted in avionic and space applications, SRAM configuration sensitivity to radiation induced soft-errors should be addressed. In this work, we evaluate the soft-error sensitivity of upsets in the configuration memory of two implementations of the ICAP controller within a DRPM system. We performed a radiation test campaign and a selective fault injection of upsets on the ICAP controller configuration memory to mimic the radiation profiles. The comparative analysis showed meaningful guidelines on the implementations of self-reconfigurable systems for the aerospace domain: the controller with distributed memory results the 28% more tolerant to low radiation environment compared to the integrated memory version, which in return results the 25% more robust considering radiation particles with higher energies.
Ludovica Bozzoli, Luca Sterpone

### SysIDLib: A High-Level Synthesis FPGA Library for Online System Identification

Abstract
Model accuracy is the most important step towards efficient control design. Various system identification techniques exist which are used to estimate model parameters. However, these techniques have their merits and demerits which need to be considered before selecting a particular system identification technique. In this paper, various system identification techniques as the Kalman filter (EKF), recursive least square (RLS) and least mean square (LMS) filters are used to estimate the parameters of linear (DC motor) and nonlinear systems (inverted pendulum and adaptive polynomial models). FPGAs are widely used for rapid prototyping, real-time and high computationally demanding applications. Therefore, a real-time FPGA-in the loop architecture has been used for evaluating each identification algorithm of the SysIDLib library. The identification algorithms are evaluated regarding the convergence rate, accuracy and resource utilization performed on a system-onchip (SoC). The results have shown that the RLS algorithm estimated approximately the parameter values of a nonlinear system. However, it requires up to 17% less lookup-tables, 5.5% less flip-flops and 14% less DSPs compared to EKF with accurate results on the programmable logic (PL).
Gökhan Akgün, Habib ul Hasan Khan, Marawan Hebaish, Mahmoud Elshimy, Mohamed A. Abd El Ghany, Diana Göhringer

### Optimal and Greedy Heuristic Approaches for Scheduling and Mapping of Hardware Tasks to Reconfigurable Computing Devices

Abstract
Executing real-time tasks on dynamically reconfigurable FPGAs requires us to solve the challenges of scheduling and placement. In the past, many approaches have been presented to address these challenges. Still, most of them rely on idealized assumptions about the reconfigurability of FPGAs and the capabilities of commercial tool flows. In our work, we aim at solving these problems leveraging a practically useful 2D slot-based FPGA area model. We present optimal approaches for reconfigurable slot creation, hardware task assignment, and placement creation. We quantitatively compare optimal and heuristics algorithms through simulation experiments and show that the heuristics are rather close to the optimal techniques in terms of solution quality, in particular for reconfigurable slot creation and hardware task assignment. Further, we also derive an indication for the amount of fragmentation of the FPGA surface that is inherent to our 2D area model.
Zakarya Guettatfi, Paul Kaufmann, Marco Platzner

### Accuracy, Training Time and Hardware Efficiency Trade-Offs for Quantized Neural Networks on FPGAs

Abstract
Neural networks have proven a successful AI approach in many application areas. Some neural network deployments require low inference latency and lower power requirements to be useful e.g. autonomous vehicles and smart drones. Whilst FPGAs meet these requirements, hardware needs of neural networks to execute often exceed FPGA resources.
Emerging industry led frameworks aim to solve this problem by compressing the topology and precision of neural networks, eliminating computations that require memory for execution. Compressing neural networks inevitably comes at the cost of reduced inference accuracy.
This paper uses Xilinx’s FINN framework to systematically evaluate the trade-off between precision, inference accuracy, training time and hardware resources of 64 quantized neural networks that perform MNIST character recognition.
We identify sweet spots around 3 bit precision in the quantization design space after training with 40 epochs, minimising both hardware resources and accuracy loss. With enough training, using 2 bit weights achieves almost the same inference accuracy as 3–8 bit weights.
Pascal Bacchus, Robert Stewart, Ekaterina Komendantskaya

### Accelerating a Classic 3D Video Game on Heterogeneous Reconfigurable MPSoCs

Abstract
Heterogeneous Reconfigurable MPSoCs, coupling microprocessors with Programmable Logic, are becoming extremely important in High-Performance Embedded Computing domain where energy consumption is a key factor to be considered by every designer. However, efficient hardware/software co-design still requires experience and a big effort: finding an optimal solution and an acceptable trade-off between performance and energy may require several tests and it is strongly platform-dependent. To this respect, a Dataflow-based method is used in this work for exploring different hardware/software configurations (number of hardware accelerators and FPGA frequency). As a use case, the acceleration of a well-known 3D video game (DOOM) is presented. The method offers rapid trade-off analysis in terms of non-functional parameters such as computing performance or power/energy measurements.
Extensive experimental results show that is possible to speed up the game and, at the same time, reduce the energy consumption of the whole platform. A custom Linux-based Operating System for Zynq Ultrascale+ was created, including a GPU driver to support a graphical interface on an HDMI screen and drivers to manage custom hardware accelerators on the FPGA side.
The best solution to save up to 63% of energy corresponds to the use of four parallel hardware accelerators, where a function speed up of x3.6 and an application speed up of x2 (in line with Amdahl’s law) is obtained.
Additionally, a set of Pareto optimal solutions are reported in the results section.
Leonardo Suriano, David Lima, Eduardo de la Torre

### Cross-layer CNN Approximations for Hardware Implementation

Abstract
Convolution Neural Networks (CNNs) are widely used for image classification and object detection applications. The deployment of these architectures in embedded applications is a great challenge. This challenge arises from CNNs’ high computation complexity that is required to be implemented on platforms with limited hardware resources like FPGA. Since these applications are inherently error-resilient, approximate computing (AC) offers an interesting trade-off between resource utilization and accuracy. In this paper, we study the impact on CNN performances when several approximation techniques are applied simultaneously. We focus on two of the widely used approximation techniques, namely quantization and pruning. Our experimental results showed that for CNN networks of different parameter sizes and 3% loss in accuracy, we can obtain up to 27.9%–47.2% reduction in computation complexity in terms of FLOPs for CIFAR-10 and MNIST datasets.
Karim M. A. Ali, Ihsen Alouani, Abdessamad Ait El Cadi, Hamza Ouarnoughi, Smail Niar

### Technique for Vendor and Device Agnostic Hardware Area-Time Estimation

Abstract
This work proposes a novel technique for hardware area-time estimation of applications on FPGA. The application C code is first converted to the target independent LLVM IR prior to wrapping the basic blocks as functions using a LLVM transformation pass. The LegUp tool’s ‘LLVM IR functions to RTL modules’ conversion is carried out to facilitate RTL synthesis using the Altera Quartus tools. In order to support FPGAs other than Altera, the soft IP cores generated by LegUp were replaced as generic RTL components. Further, additional modules have been incorporated to support floating point operations. This approach, has made it possible to support FPGAs from other vendors with high area-time estimation accuracy. The proposed technique relies on the free versions of the vendor tools and LegUp. Moreover, the approach does not necessitate time consuming post synthesis steps such as Place & Route and Bit Stream Generation in order to obtain reasonably accurate area estimation measures.
Deshya Wijesundera, Kushagra Shah, Kisaru Liyanage, Alok Prakash, Thambipillai Srikanthan, Thilina Perera

### Resource Efficient Dynamic Voltage and Frequency Scaling on Xilinx FPGAs

Abstract
As FPGA devices become increasingly ubiquitous, the need for energy-conscious implementations for battery-powered devices arises. These new energy constraints have to be met in addition to the well-known area, latency and throughput requirements. Furthermore, the power dissipation of such systems is usually considered as a hardware problem. However, it can be solved effectively through hardware and software implementations of power-saving techniques. One generic energy-saving technique that does not require retroactive alteration of an HW/SW-design is dynamic voltage and frequency scaling (DVFS) which adjusts the power consumption and performance of an embedded device at run-time based on its workload and operating conditions. This work investigates the power monitoring and scaling capabilities of Xilinx Zynq-7000 SoCs and UltraScale+ MPSoCs. A real-time operating system (RTOS) manages the resources of an application, the voltage/frequency scaling and the power monitoring with its preemptive scheduling policies. Furthermore, the frequency is scaled without using additional hardware resources on the programmable logic from the processing system. The methodology can easily be used for changing the processor frequency at run-time. As a case study, we apply our technique to find energy-optimal voltage and frequency pairs for an image processing application designed using the open-source high-level synthesis library HiFlipVX. The proposed frequency scaling architecture requires up to 20% less flip-flops and look-up tables as compared to the same design with clocking wizard on the programmable logic.
Gökhan Akgün, Lester Kalms, Diana Göhringer

### RISC-V Based MPSoC Design Exploration for FPGAs: Area, Power and Performance

Abstract
Modern image processing applications, like object detection or image segmentation, require high computation and have high memory requirements. For ASIC-/FPGA-based architectures, hardware accelerators are a promising solution, but they lack flexibility and programmability. To fulfill flexibility, computational and memory intensive characteristics of these applications in embedded systems, we propose a modular and flexible RISC-V based MPSoC architecture on Xilinx Zynq Ultrascale+ MPSoC. The proposed architecture can be ported to other Xilinx FPGAs. Two neural networks (Lenet-5 and Cifar-10 example) were used as test applications to evaluate the proposed MPSoC architectures. To increase the performance and efficiency, different optimization techniques were adapted on the MPSoC and results were evaluated. 16-bit fixed-point parameters were used to have a compression of 50% in data size and algorithms were parallelized and mapped on the proposed MPSoC to achieve higher performance. A 4x parallelization of a NN algorithm on the proposed MPSoC resulted in 3.96x speed up and consumed 3.61x less energy as compared to a single soft-core processor setup.

### A Modular Software Library for Effective High Level Synthesis of Convolutional Neural Networks

Abstract
Convolutional Neural Networks (CNNs) have applications in many valuable domains such as object detection for autonomous cars and security using facial recognition. This vast field of application usually places strict non-functional requirements such as resource-efficient implementations on the hardware devices, while at the same time requiring flexibility. In response, this work presents a C++-based software library of reusable modules to build arbitrary CNNs that support High-Level-Synthesis to be implemented as FPGA hardware accelerators for the inference process. Our work demonstrates how parametrization and modularization of basic building blocks of a CNN enable easier customization of the hardware to match the software model. This project also works with low-precision parameters throughout the CNN to provide a more resource-efficient implementation.
Hector Gerardo Munoz Hernandez, Safdar Mahmood, Marcelo Brandalero, Michael Hübner

### HLS-Based Acceleration Framework for Deep Convolutional Neural Networks

Abstract
Deep Neural Networks (DNNs) have been successfully applied in many fields. Considering performance, flexibility, and energy efficiency, Field Programmable Gate Array (FPGA) based accelerator for DNNs is a promising solution. The existing frameworks however lack the possibility of reusability and friendliness to design a new network with minimum efforts. Modern high-level synthesis (HLS) tools greatly reduce the turnaround time of designing and implementing complex FPGA-based accelerators. This paper presents a framework for hardware accelerator for DNNs using high level specification. A novel architecture is introduced that maximizes data reuse and external memory bandwidth. This framework allows to generate a scalable HLS code for a given pre-trained model that can be mapped to different FPGA platforms. Various HLS compiler optimizations have been applied to the code to produce efficient implementation and high resource utilization. The framework achieves a peak performance of 23 frames per second for SqueezeNet on Xilinx Alveo u250 board.
Ashish Misra, Volodymyr Kindratenko

### FPGA-Based Computational Fluid Dynamics Simulation Architecture via High-Level Synthesis Design Method

Abstract
Today’s High-Performance Computing (HPC) systems often use GPUs as dedicated hardware accelerators to meet the computation requirements of applications such as neural networks, genetic decoding, and hydrodynamic simulations. Meanwhile, FPGAs have also been considered as alternative suitable hardware accelerators due to their advancing computational capabilities and low power consumption. Moreover, the developments of High-Level Synthesis (HLS) allow users to generate FPGA designs directly from mainstream languages, e.g., C, C++, and OpenCL. However, writing efficient high-level programs with good performance is still a time-consuming task, and the lack of knowledge about FPGA architecture can lead to poor scalability and portability. In this paper, we propose an architecture design for Computational Fluid Dynamics (CFD) simulations based on the HLS method. Our design can adjust the performance by utilizing the parallelism inside both temporal and spatial domains of CFD simulations. We also discuss the data reuse buffer optimization choices while considering the potability of HLS codes. A performance model is introduced to guide the design space exploration under the constraints of available resources on FPGA. We evaluate our design via a Xilinx VCU1525 FPGA board and compare the results with other state-of-the-art studies. Experiment results show that VCU1525 can achieve 629.6 GFLOP/s in D2Q9 LBM-BGK model and the design and optimization methods can be used for developing various CFD applications.
Changdao Du, Iman Firmansyah, Yoshiki Yamaguchi

### High-Level Synthesis in Implementing and Benchmarking Number Theoretic Transform in Lattice-Based Post-Quantum Cryptography Using Software/Hardware Codesign

Abstract
Compared to traditional hardware development methodologies, High-Level Synthesis (HLS) offers a faster time-to-market and lower design cost at the expense of implementation efficiency. Although Software/Hardware Codesign has been used in many areas, its usability for benchmarking of candidates in cryptographic competitions has been largely unexplored. This paper provides a comparison of the HLS- and RTL-based design methodologies when applied to the hardware design of the Number Theoretic Transform (NTT) – a core arithmetic function of lattice-based Post-Quantum Cryptography (PQC). As a next step, we apply Software/Hardware Codesign approach to the implementation of three PQC schemes based on NTT. Then, we integrate our HLS implementation into the Xilinx SDSoC environment. We demonstrate that an overhead of SDSoC compared to traditional Bare Metal approach is acceptable. This paper also shows that an HLS implementation obtained by modeling a block diagram is typically much better than an implementation obtained by using design space exploration. We conclude that the HLS/SDSoC and RTL/Bare Metal approaches generate comparable results.
Duc Tri Nguyen, Viet B. Dang, Kris Gaj

### Exploring fpga Optimizations to Compute Sparse Numerical Linear Algebra Kernels

Abstract
The solution of sparse triangular linear systems (sptrsv) is the bottleneck of many numerical methods. Thus, it is crucial to count with efficient implementations of such kernel, at least for commonly used platforms. In this sense, Field–Programmable Gate Arrays (FPGAs) have evolved greatly in the last years, entering the HPC hardware ecosystem largely due to their superior energy–efficiency relative to more established accelerators. Up until recently, the design for FPGAs implied the use of low–level Hardware Description Languages (HDL) such as VHDL or Verilog. Nowadays, manufacturers are making a large effort to adopt High–Level Synthesis languages like C/C++ or OpenCL, but the gap between their performance and that of HDLs is not yet fully studied. This work focuses on the performance offered by FPGAs to compute the sptrsv using OpenCL. For this purpose, we implement different parallel variants of this kernel and experimentally evaluate several setups, varying among others the work–group size, the number of compute units, the unroll–factor and the vectorization–factor.
Federico Favaro, Ernesto Dufrechou, Pablo Ezzatti, Juan P. Oliver

### A CGRA Definition Framework for Dataflow Applications

Abstract
Executing complex scientific applications on Coarse Grain Reconfigurable Arrays (CGRAs) promises execution time and/or energy consumption reduction compared to software execution or even customized hardware solutions. The compute core of CGRA architectures is a cell that typically consists of simple and generic hardware units, such as ALUs, simple processors, or even custom logic tailored to an application’s specific characteristics. However generality in the cell contents, while convenient for serving multiple applications, comes at the cost of execution acceleration and energy consumption.
This work proposes a novel Mixed-CGRA Definition Framework (MC-DeF) targeting a Mixed-CGRA architecture that leverages the advantages of CGRAs by utilizing a customized cell-array, and FPGAs by utilizing a separate LUT array used for adaptability. Our framework employs a custom cell structure and functionality definition phase to create highly customized application/domain specific CGRA designs. This is achieved through the use of cost functions that use metrics such a resource usage, connectivity overhead, chip area occupied, i.a., and user-defined threshold values. Thus, the framework aids the user in creating suitable designs based on the application’s needs and/or design restrictions, energy and/or area constraints.
We evaluate our framework using three applications: Hayashi-Yoshida, Mutual Information and Transfer Entropy and present fully functional, FPGA-based implementations of these applications to demonstrate the validity of our framework. Comparisons with related work show that MC-DeF performs favourably in terms of processing throughput - even when compared with much larger designs, uses fewer resources than most of the compared architectures, while utilizing better the underlying architecture recording the second best efficiency (LUT/GOPs) rating.
George Charitopoulos, Dionisios N. Pnevmatikatos

### Implementing CNNs Using a Linear Array of Full Mesh CGRAs

Abstract
This paper presents an implementation of a Convolutional Neural Network (CNN) algorithm using a linear array of full mesh dynamically and partially reconfigurable Coarse Grained Reconfigurable Arrays (CGRAs). Accelerating CNNs using GPUs and FPGAs is more common and there are few works that address the topic of CNN acceleration using CGRAs. Using CGRAs can bring size and power advantages compared to GPUs and FPGAs. The contribution of this paper is to study the performance of full mesh dynamically and partially reconfigurable CGRAs for CNN acceleration. The CGRA used is an improved version of the previously published Versat CGRA, adding multi CGRA core support and pre-silicon configurability. The results show that the proposed CGRA is as easy to program as the original full mesh Versat CGRA, and that its performance and power consumption scale linearly with the number of instances.
Valter Mário, João D. Lopes, Mário Véstias, José T. de Sousa

### A Block-Based Systolic Array on an HBM2 FPGA for DNA Sequence Alignment

Abstract
Revealing the optimal local similarity between a pair of genomic sequences is one of the most fundamental issues in bioinformatics. The Smith-Waterman algorithm is a method that was developed for that specific purpose. With the continuous advances in the computer field, this method becomes widely used to an extent where it expanded its reach to cover a broad range of applications, even in areas such as network packet inspections and pattern matching. This algorithm is based on Dynamic Programming and is guaranteed to find the optimal local sequence alignment between two base pairs. The computational complexity is O(mn), where m and n are defined as the number of the elements of a query and a database sequence, respectively. Researchers have investigated several manners to accelerate the calculation using CPU, GPU, Cell B.E., and FPGA. Most of them have proposed a data-reuse approach because the Smith-Waterman algorithm has rather high “bytes per operation”; in other words, the Smith-Waterman algorithm requires large memory bandwidth. In this paper, we try to minimize the impact of the memory bandwidth bottleneck through the implementation of a block-based systolic array approach that maximizes the usage of memory banks in HBM2 (High Bandwidth Memory). The proposed approach demonstrates a higher performance in terms of GCUPS (Giga Cell Update Per Second) compared to one of the best cases reported in previous works, and also achieves a significant improvement in power efficiency. For example, our implementation could reach 429.39 GCUPS while achieving a power efficiency of 7.68 GCUPS/W. With a different configuration, it could reach 316.73 GCUPS while hitting a peak power efficiency of 8.86 GCUPS/W.

### Comparison of Direct and Indirect Networks for High-Performance FPGA Clusters

Abstract
As field programmable gate arrays (FPGAs) become a favorable choice in exploring new computing architectures for the post-Moore era, a flexible network architecture for scalable FPGA clusters becomes increasingly important in high performance computing (HPC). In this paper, we introduce a scalable platform of indirectly-connected FPGAs, where its Ethernet-switching network allows flexibly customized inter-FPGA connectivity. However, for certain applications such as in stream computing, it is necessary to establish a connection-oriented datapath with backpressure between FPGAs. Due to the lack of physical backpressure channel in the network, we utilized our existing credit-based network protocol with flow control to provide receiver FPGA awareness and tailored it to minimize overall communication overhead for the proposed framework. To know its performance characteristics, we implemented necessary data transfer hardware on Intel Arria 10 FPGAs, modeled and obtained its communication performance, and compared it to a direct network. Results show that our proposed indirect framework achieves approximately 3% higher effective network bandwidth than our existing direct inter-FPGA network, which demonstrates good performance and scalability for large HPC applications.
Antoniette Mondigo, Tomohiro Ueno, Kentaro Sano, Hiroyuki Takizawa

### A Parameterisable FPGA-Tailored Architecture for YOLOv3-Tiny

Abstract
Object detection is the task of detecting the position of objects in an image or video as well as their corresponding class. The current state of the art approach that achieves the highest performance (i.e. fps) without significant penalty in accuracy of detection is the YOLO framework, and more specifically its latest version YOLOv3. When embedded systems are targeted for deployment, YOLOv3-tiny, a lightweight version of YOLOv3, is usually adopted. The presented work is the first to implement a parameterised FPGA-tailored architecture specifically for YOLOv3-tiny. The architecture is optimised for latency-sensitive applications, and is able to be deployed in low-end devices with stringent resource constraints. Experiments demonstrate that when a low-end FPGA device is targeted, the proposed architecture achieves a 290x improvement in latency, compared to the hard core processor of the device, achieving at the same time a reduction in mAP of 2.5 pp (30.9% vs 33.4%) compared to the original model. The presented work opens the way for low-latency object detection on low-end FPGA devices.
Zhewen Yu, Christos-Savvas Bouganis

### Hardware/Algorithm Co-optimization for Fully-Parallelized Compact Decision Tree Ensembles on FPGAs

Abstract
Decision tree ensembles, such as random forests, are well-known classification and regression methods with high accuracy and robustness, especially for categorical data that combines multiple weak learners called decision trees. We propose an architecture/algorithm co-design method for implementing fully parallelized fast decision tree ensembles on FPGAs. The method first produces compact and almost equivalent representations of original input decision trees by threshold compaction. For each input feature, comparisons with similar thresholds are merged into fewer variations, so the number of comparisons is reduced. The decision tree with merged thresholds is perfectly extracted as hard-wired logic for the highest throughput. In this study, we developed a prototype hardware synthesis compiler that generates a Verilog hardware description language (HDL) description from a compressed representation. The experiment successfully demonstrates that the proposed method reduces the sizes of generated hardware without accuracy degradation.
Taiga Ikeda, Kento Sakurada, Atsuyoshi Nakamura, Masato Motomura, Shinya Takamaeda-Yamazaki

### StocNoC: Accelerating Stochastic Models Through Reconfigurable Network on Chip Architectures

Abstract
Spreading dynamics of many real-world processes lean heavily on the topological characteristics of the underlying contact network. With the rapid temporal and spatial evolution of complex inter-connected networks, microscopic modeling and stochastic simulation of individual-based interactions have become challenging in both, time and state space. Driven by the surge to reduce the time complexity associated with system behavior analysis over different network structures, we propose a network-on-chip (NoC) based FPGA solution called StocNoC. The proof of concept is supported by the design, implementation and evaluation of the classical heterogeneous susceptible-infected-susceptible (SIS) epidemic model on a scalable NoC. The steady-state results from the proposed implementation for the fractions of susceptible and infected nodes are shown to be comparable to those acquired from software simulations, but in a significantly shorter time period. Analogous to network information diffusion, implementation of the SIS model and its variants will be beneficial to foresee possible epidemic outbreaks earlier in time and expedite control decisions.
Arshyn Zhanbolatov, Kizheppatt Vipin, Aresh Dadlani, Dmitriy Fedorov

### Implementation of FM-Index Based Pattern Search on a Multi-FPGA System

Abstract
Pattern matching is a versatile task which has a variety of applications including genome sequencing as a major application. During the analysis, short read mapping technique is used where short DNA sequences are mapped relative to a known reference sequence. This paper discusses the use of reconfigurable hardware to accelerate the short read mapping problem. The proposed design is based on the FM-index algorithm. Although several pattern matching techniques are available, FM-index based pattern search is perfectly suitable for genome sequencing due to the fastest mapping from known indices. In order to make use of inherent parallelism, a multi-FPGA system called Flow-in-Cloud (FiC) is used. FiC consists of multiple boards, mounting middle scale Xilinx’s FPGAs and SDRAMs, which are tightly coupled with high speed serial links. By distributing the input data transfer with I/O ring network and broadcasting I-Table, C-Table and Suffix-Array with the board-to-board interconnection network, about 10 times performance improvement was achieved when compared to the software implementation. Since the proposed method is scalable to the number of boards, we can obtain the required performance by increasing the number of boards.
M. M. Imdad Ullah, Akram Ben Ahmed, Hideharu Amano

### Reconfigurable Accelerator for On-Board SAR Imaging Using the Backprojection Algorithm

Abstract
Synthetic Aperture Radar is a form of radar widely used to extract information about the surface of the target. The transformation of the signals into an image is based on DSP algorithms that perform intensive but repetitive computation over the signal data. Traditionally, an aircraft or satellite acquires the radar data streams and sends it to be processed on a data center to produce images faster. However, there are novel applications demanding on-board signal processing to generate images. This paper presents a novel implementation for an on-board embedded SoC of an accelerator for the Backprojection algorithm, which is the reference algorithm for producing images of SAR sensors. The methodology used is based on a HW/SW design partition, where the most time consuming computations are implemented in hardware. The accelerator was specified in HLS, which allows to reuse the code from the original implementation of the algorithm in software. The accelerator performs the computations using floating-point arithmetic to produce the same output as the original algorithm. The target SoC device is a Zynq 7020 from Xilinx which has a dual-core ARM-A9 processor along with a reconfigurable fabric which is used to implement the hardware accelerator. The proposed systems outperformed the software-only implementation in 7.7$$\times$$ while preserving the quality of the image by adopting the same floating-point representations from the original software implementation.
Rui P. Duarte, Helena Cruz, Horácio Neto