
2021 | Book

Applied Reconfigurable Computing. Architectures, Tools, and Applications

17th International Symposium, ARC 2021, Virtual Event, June 29–30, 2021, Proceedings

Edited by: Steven Derrien, Frank Hannig, Pedro C. Diniz, Daniel Chillet

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this Book

This book constitutes the proceedings of the 17th International Symposium on Applied Reconfigurable Computing, ARC 2021, held as a virtual event in June 2021. The 14 full papers and 11 short papers in this volume were carefully reviewed and selected from 40 submissions. The contributions cover a broad spectrum of applications of reconfigurable computing, ranging from driving assistance, acceleration of data and graph processing, and computer security to the socially relevant topic of supporting the early diagnosis of Covid infectious diseases.

Table of Contents

Frontmatter

Applications

Frontmatter
Fast Approximation of the Top-k Items in Data Streams Using a Reconfigurable Accelerator
Abstract
This paper presents a novel method for finding the top-k items in data streams using a reconfigurable accelerator. The accelerator extracts an approximate list of the most frequently occurring items in an input stream, which is scanned only once, without the need for random access. The accelerator is based on a hardware architecture that implements the well-known probabilistic sampling algorithm by mapping its main processing stages to two custom systolic arrays. The proposed architecture is the first hardware implementation of this algorithm and shows better scalability than architectures based on other stream algorithms. When implemented on an Intel Arria 10 FPGA (10AX115N2F45E1SG), 50% of the FPGA chip is sufficient for more than 3000 Processing Elements (PEs). Experimental results on both synthetic and real input datasets show very good accuracy and significant throughput gains compared to existing solutions. With achieved throughputs exceeding 300 million items/s, we report average speedups of 20x compared to typical software implementations, 1.5x compared to GPU-accelerated implementations, and 1.8x compared to the fastest FPGA implementation.
Ali Ebrahim, Jalal Khalifat
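For context, the single-pass, bounded-state computation such an accelerator parallelizes can be sketched in a few lines of software. The following is a minimal counter-based top-k approximation in the Misra-Gries/Space-Saving family; it illustrates the algorithm class, not the paper's exact probabilistic sampling scheme or its systolic mapping.

```cpp
// Minimal single-pass top-k sketch in the Misra-Gries/Space-Saving family.
// Illustrative only: shows the kind of one-scan, no-random-access computation
// that stream accelerators parallelize.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

std::unordered_map<uint32_t, uint64_t> top_k_approx(
    const std::vector<uint32_t>& stream, std::size_t k) {
  std::unordered_map<uint32_t, uint64_t> counters;  // at most k entries
  for (uint32_t item : stream) {
    auto it = counters.find(item);
    if (it != counters.end()) {
      ++it->second;                       // already tracked: bump count
    } else if (counters.size() < k) {
      counters.emplace(item, 1);          // free slot: start tracking
    } else {
      // Space-Saving step: evict the minimum counter, inherit its count.
      auto min_it = counters.begin();
      for (auto jt = counters.begin(); jt != counters.end(); ++jt)
        if (jt->second < min_it->second) min_it = jt;
      uint64_t inherited = min_it->second + 1;
      counters.erase(min_it);
      counters.emplace(item, inherited);
    }
  }
  return counters;  // approximate top-k items, counts may overestimate
}

int main() {
  std::vector<uint32_t> stream = {1, 2, 1, 3, 1, 2, 4, 1, 2, 5};
  for (const auto& [item, count] : top_k_approx(stream, 3))
    std::cout << item << " ~" << count << "\n";
}
```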
Exploiting 3D Memory for Accelerated In-Network Processing of Hash Joins in Distributed Databases
Abstract
The computing potential of programmable switches with multi-Tbit/s throughput is of increasing interest to the research community and industry alike. Such systems have already been employed in a wide spectrum of applications, including statistics gathering, in-network consensus protocols, or application data caching. Despite their high throughput, most architectures for programmable switches have practical limitations, e.g., with regard to stateful operations.
FPGAs, on the other hand, can be used to flexibly realize switch architectures for far more complex processing operations. Recently, FPGAs have become available that feature 3D memory, such as HBM stacks, tightly integrated with their logic element fabrics. In this paper, we examine the impact of exploiting such HBM to accelerate an inter-server join operation at the switch level between the servers of a distributed database system. As the hash-join algorithm used for high performance needs to maintain a large state, it would overtax the capabilities of conventional software-programmable switches.
The paper shows that across eight 10G Ethernet ports, the single HBM-FPGA in our prototype can not only keep up with the demands of over 60 Gbit/s of network throughput, but also beats distributed-join implementations that do not exploit in-network processing.
Johannes Wirth, Jaco A. Hofmann, Lasse Thostrup, Andreas Koch, Carsten Binnig
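As background, the stateful operation being offloaded is the classic two-phase hash join. The sketch below shows the build and probe phases in plain C++; in the paper's setting, the build table would live in the FPGA's HBM and tuples would stream in from the Ethernet ports, but the logic is the same in spirit.

```cpp
// Plain two-phase hash join (build + probe). Sketch only: real in-network
// processing streams tuples from the network and keeps the build table in HBM.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

std::vector<std::pair<Tuple, Tuple>> hash_join(
    const std::vector<Tuple>& build_side, const std::vector<Tuple>& probe_side) {
  // Build phase: hash the (smaller) relation into a table. This table is the
  // large state that overtaxes conventional programmable switches.
  std::unordered_multimap<uint64_t, Tuple> table;
  table.reserve(build_side.size());
  for (const Tuple& t : build_side) table.emplace(t.key, t);

  // Probe phase: stream the other relation and emit matching pairs.
  std::vector<std::pair<Tuple, Tuple>> result;
  for (const Tuple& p : probe_side) {
    auto [lo, hi] = table.equal_range(p.key);
    for (auto it = lo; it != hi; ++it) result.push_back({it->second, p});
  }
  return result;
}

int main() {
  std::vector<Tuple> r = {{1, 10}, {2, 20}}, s = {{2, 200}, {3, 300}, {2, 201}};
  std::cout << hash_join(r, s).size() << " matches\n";  // prints: 2 matches
}
```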

Design Tools

Frontmatter
Evaluation of Different Manual Placement Strategies to Ensure Uniformity of the V-FPGA
Abstract
Virtual FPGA (V-FPGA) architectures are useful both as early prototyping testbeds for custom FPGA architectures and as a means to enable advanced features which may not be available on a given host FPGA. V-FPGAs use standard FPGA synthesis and placement tools, and as a result the maximum application frequency is largely determined by the synthesis of the V-FPGA onto the host FPGA. Minimal net delays in the virtual layer are crucial for applications, but due to increased routing congestion, these delays are often significantly worse for larger designs than for smaller ones. To counter this effect, we investigate three different placement strategies with varying amounts of manual intervention. Taking the regularity of the V-FPGA architecture into account, a regular placement of tiles can lead to a 37% improvement in the achievable clock frequency. In addition, the uniformity of the measured net delays is increased by 39%, which makes the implementation of user applications more reproducible. As a trade-off, these manual placement strategies increase the area usage of the virtual layer by up to 16%.
Johannes Pfau, Peter Wagih Zaki, Jürgen Becker
Timing Optimization for Virtual FPGA Configurations
Abstract
Fine-grained reconfigurable FPGA overlays, usually called virtual FPGAs, suffer from virtualization costs regarding area requirements and timing performance. Decreasing the area costs of such virtual FPGAs has been the focus of several research efforts over the past years, but adapting the (virtual) timing suffers from the contradiction of having to optimize properties with strong physical ties in an environment that is specifically designed to abstract them away.
This paper explores several methods to optimize the maximum operating frequency of the virtual FPGA ZUMA and its guest circuits despite this conflict, using two complementary approaches: fine-tuned physical overlay design optimization through floorplanning in Xilinx's design suite Vivado, and delay-optimized virtual synthesis in the VTR tool flow. In our experimental results with virtual benchmark circuits, we improve the operating frequency of a 3×3 and a 5×5 ZUMA architecture by up to 41% and 65%, respectively, for individual benchmarks, and by 23% and 31% on average. Our results would also scale accordingly should future research uncover new potential to reduce the area cost further.
Linus Witschen, Tobias Wiersema, Masood Raeisi Nafchi, Arne Bockhorn, Marco Platzner
Hardware Based Loop Optimization for CGRA Architectures
Abstract
With the increasing demand for high-performance computing in application domains with stringent power budgets, coarse-grained reconfigurable array (CGRA) architectures have become a popular choice among researchers and manufacturers. Loops are the hot spots of kernels running on CGRAs, and hence several techniques have been devised to optimize loop execution. However, works in this direction are predominantly software-based solutions. This paper addresses the optimization opportunities at a deeper level and introduces a hardware-based loop control mechanism that can support arbitrarily nested loops up to four levels deep. The major contributions of this work are a lightweight Hardware Loop Block (HLB) for CGRAs that eliminates the control-instruction overhead of loops, and an acyclic graph transformation that removes loop branches from the application CDFG. When tested on a set of kernels chosen from various application domains, the design achieves a maximum speed-up of 1.9× and an average of 1.5× against the conventional approach. The total number of instructions executed is reduced to half for almost all the kernels, with an area and power consumption overhead of only 2.6% and 0.8%, respectively.
Chilankamol Sunny, Satyajit Das, Kevin J. M. Martin, Philippe Coussy
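The control overhead the HLB removes is easiest to see in a software model: a chain of counters advances up to four nested loop indices without any per-iteration branch instructions in the kernel itself. The class below is an assumed behavioural model for illustration, not the paper's hardware design.

```cpp
// Software model of a four-level hardware loop counter chain: nested indices
// advance like a ripple counter, so the kernel body needs no loop branches.
// Assumed semantics for illustration; the HLB itself is a hardware unit.
#include <array>
#include <cstdint>
#include <iostream>

struct LoopBlock4 {
  std::array<uint32_t, 4> bound;  // trip counts, innermost first
  std::array<uint32_t, 4> idx{};  // current indices
  bool done = false;

  // Increment the innermost counter; carry outward on wrap-around.
  void step() {
    for (int level = 0; level < 4; ++level) {
      if (++idx[level] < bound[level]) return;  // no carry: step complete
      idx[level] = 0;                           // carry into next level
    }
    done = true;  // outermost counter wrapped: iteration space exhausted
  }
};

int main() {
  LoopBlock4 hlb{{2, 3, 1, 2}};  // 2*3*1*2 = 12 iterations
  uint32_t visits = 0;
  for (; !hlb.done; hlb.step()) ++visits;  // kernel body would run here
  std::cout << visits << " iterations\n";  // prints: 12 iterations
}
```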
Supporting On-Chip Dynamic Parallelism for Task-Based Hardware Accelerators
Abstract
The open-source hardware/software framework TaPaSCo aims to make reconfigurable computing on FPGAs more accessible to non-experts. To this end, it provides an easily usable task-based programming abstraction and combines this with powerful tool support to automatically implement the individual hardware accelerators and integrate them into usable systems-on-chip. Currently, TaPaSCo relies on the host to manage task parallelism and perform the actual task launches. However, for more expressive parallel programming patterns, such as pipelines of task farms, the round trips from the hardware accelerators back to the host for launching child tasks quickly add up, especially when exploiting data-dependent execution times. The major contribution of this work is the addition of on-chip task scheduling and launching capabilities to TaPaSCo. This not only enables low-latency dynamic task parallelism, but also encompasses the efficient on-chip exchange of parameter values and task results between parent and child accelerator tasks. Our solution is able to handle recursive task structures and is shown to achieve latency reductions of over 35x compared to prior approaches.
Carsten Heinz, Andreas Koch
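The pattern being moved on-chip is dynamic task parallelism, with parent tasks launching children and collecting their results. The generic sketch below uses std::async purely as a stand-in to show the recursive launch-and-join structure; it is not TaPaSCo's API.

```cpp
// Generic illustration of recursive dynamic task parallelism. In the paper's
// design, the child launch and result exchange happen between accelerators
// on the FPGA instead of via host round trips; std::async is only a stand-in.
#include <future>
#include <iostream>

long fib_task(int n) {
  if (n < 2) return n;
  // Parent task spawns a child task and keeps computing in parallel.
  auto child = std::async(std::launch::async, fib_task, n - 1);
  long right = fib_task(n - 2);
  return child.get() + right;  // join: collect the child's result
}

int main() { std::cout << fib_task(10) << "\n"; }  // prints 55
```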
Combining Design Space Exploration with Task Scheduling of Moldable Streaming Tasks on Reconfigurable Platforms
Abstract
Design space exploration can be used to find a power-efficient architectural design for a given application, such as the best-suited configuration of a heterogeneous system built from soft cores of different types, given area and throughput constraints. We show how to integrate design space exploration into a static scheduling algorithm for a streaming task graph application with parallelizable tasks, and solve the resulting combined optimization problem with an integer linear program (ILP). We demonstrate the improvements achieved by our strategy with ARM big and LITTLE soft cores and synthetic task graphs.
Jörg Keller, Sebastian Litzinger, Christoph Kessler
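A combined DSE-plus-scheduling problem of this kind can be stated compactly as an ILP. The formulation below is schematic: the symbols (core counts n_c, per-core power P_c and area A_c, task execution times τ, assignment binaries x) are illustrative assumptions, not the authors' exact model.

```latex
% Schematic sketch of a combined DSE + scheduling ILP (illustrative symbols).
% n_c       : number of instantiated soft cores of type c (integer)
% x_{t,c,w} : task t runs on core type c with parallelism width w (binary)
\begin{align*}
\min\; & \sum_{c} n_c P_c \;+\; \sum_{t,c,w} x_{t,c,w}\, E_{t,c,w} \\
\text{s.t. } & \sum_{c,w} x_{t,c,w} = 1 \quad \forall t
  && \text{each task mapped exactly once}\\
& \sum_{c} n_c A_c \le A_{\max}
  && \text{FPGA area budget}\\
& \sum_{t,w} w\, \tau_{t,c,w}\, x_{t,c,w} \le n_c\, T \quad \forall c
  && \text{core-time capacity per period } T\\
& x_{t,c,w} \in \{0,1\},\; n_c \in \mathbb{Z}_{\ge 0}
\end{align*}
```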
Task-Based Programming Models for Heterogeneous Recurrent Workloads
Abstract
This paper proposes the extension of task-based programming models with concepts for recurrent workloads. The proposal introduces new clauses in the OmpSs task directive to efficiently model recurrent workloads; the clauses define the task period and/or the number of task body repetitions. Although the new clauses are suitable for any device, their support has been implemented using the capabilities of FPGA devices in embedded systems. These heterogeneous systems are common in industrial applications, which often run recurrent workloads. The evaluation shows a large improvement in application programmability, saving lines of code and increasing code readability. It also shows efficient management of recurrent tasks when executed on FPGA devices, which can support tasks that are an order of magnitude finer-grained. All these improvements suit the needs of cyber-physical heterogeneous systems, which are frequently used in industrial environments to run recurrent workloads.
Jaume Bosch, Miquel Vidal, Antonio Filgueras, Daniel Jiménez-González, Carlos Álvarez, Xavier Martorell, Eduard Ayguadé
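Based only on the abstract's description, the extension might look roughly as follows. The clause spellings period(...) and num_repetitions(...) are guesses for illustration and should not be read as the paper's verified syntax.

```cpp
// Hypothetical sketch only: the clause names period() and num_repetitions()
// are guessed from the abstract, not verified against the paper or OmpSs docs.
#include <cstdio>

int main() {
  int sample = 0;
  // Assumed syntax: re-run the task body 100 times, once every 10,000 us.
  #pragma omp task inout(sample) period(10000) num_repetitions(100)
  {
    ++sample;                          // recurrent body, e.g. a control-loop step
    std::printf("sample %d\n", sample);
  }
  #pragma omp taskwait
  return 0;
}
```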

Architecture

Frontmatter
Multi-layered NoCs with Adaptive Routing for Mixed Criticality Systems
Abstract
Multiple applications of different criticality are increasingly being executed on the same System-on-Chip (SoC) platform to reduce resource consumption. Communication resources like the Network-on-Chip (NoC) on such platforms can be shared by these applications. Performance can be improved if the NoC is able to adapt at runtime to the requirements of different applications. An important challenge here is guaranteeing Quality of Service (QoS) for critical applications while improving the overall performance of critical and non-critical applications. In this paper, we address this challenge by proposing a multi-layered hierarchical NoC which utilizes an adaptive routing algorithm. The proposed routing algorithm determines, at runtime, nodes connecting higher layers which have a shorter hop count to the destination nodes. Depending on the criticality of the applications, packets are forwarded on shorter or longer hop paths. An adaptive congestion avoidance feature is integrated. Without congestion awareness, the proposed algorithm, which utilizes multiple layers, achieves up to a 38% decrease in latency compared to the popular XY routing, and with congestion awareness up to a 56% decrease. When comparing algorithms which use different layers, the algorithm with congestion awareness achieves up to a 29% decrease in latency and up to a 16% increase in throughput.
Nidhi Anantharajaiah, Zhe Zhang, Juergen Becker
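A toy decision function conveys the idea of criticality- and congestion-aware layer selection: take the express path through a higher layer only if it is actually shorter and its gateway is not congested. The names and policy below are illustrative assumptions, not the paper's algorithm.

```cpp
// Illustrative route selection for a two-layer NoC: critical packets may take
// a shorter express path through a higher layer, while non-critical traffic
// stays on the base mesh; a congestion flag vetoes the express hop.
#include <cstdlib>
#include <iostream>

struct Node { int x, y; };

int hops_xy(Node a, Node b) { return std::abs(a.x - b.x) + std::abs(a.y - b.y); }

// Express path: base-mesh hops to the gateway, one layer hop, then onward.
int hops_express(Node src, Node dst, Node gw_src, Node gw_dst) {
  return hops_xy(src, gw_src) + 1 + hops_xy(gw_dst, dst);
}

bool take_express(Node src, Node dst, Node gw_src, Node gw_dst,
                  bool critical, bool gateway_congested) {
  if (!critical || gateway_congested) return false;  // avoid contended gateway
  return hops_express(src, dst, gw_src, gw_dst) < hops_xy(src, dst);
}

int main() {
  Node src{0, 0}, dst{7, 7}, gw_src{1, 0}, gw_dst{7, 6};
  std::cout << (take_express(src, dst, gw_src, gw_dst, true, false)
                    ? "express layer\n" : "base mesh XY\n");
}
```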
PDU Normalizer Engine for Heterogeneous In-Vehicle Networks in Automotive Gateways
Abstract
In this work, the authors propose the concept of Protocol Data Unit (PDU) normalization for heterogeneous In-Vehicle Networks (IVN) in automotive gateways (GW). Through the development of the so-called PDU Normalizer Engine (PDUNE), it is possible to create a novel protocol-agnostic frame abstraction layer for PDU and signal gatewaying functions. It normalizes the format of frames of any kind (e.g. CAN, LIN, FlexRay or Ethernet) arriving at the GW ingress ports. That is, the PDUNE transforms the ingress frames into new refactored frames that are independent of their original network protocol, at an early stage before they are processed across the different stages of the GW controller on the way to the egress ports, thus optimizing not only the processing itself but also the resources and latencies involved. The hardware (HW) implementation of the PDUNE exploits Software Defined Networking (SDN) architectural concepts by decomposing each ingress frame into two streams: a data frame moving across the data plane, and an instruction frame carrying all the necessary metadata that, in parallel and synchronously with the data frame, evolves through the different processing stages of the GW controller, executed directly in HW from the control plane. The PDUNE has been synthesized as a coarse-grain configurable HW accelerator (HWA), or co-processor, attachable to the system CPU of the GW controller, aimed at contributing towards future automotive zonal GW solutions targeting heterogeneous IVNs with stringent real-time routing and tunneling constraints.
Angela Gonzalez Mariño, Francesc Fons, Li Ming, Juan Manuel Moreno Arostegui
StreamGrid - An AXI-Stream-Compliant Overlay Architecture
Abstract
FPGAs are part of the modern data centre, where they are used as hardware accelerators that can speed up applications and adapt dynamically to current compute requirements. Overlay architectures provide a flexible system which enables the hardware accelerator to adapt its applications by exchanging (sub-)functions at run-time. Such overlay architectures usually consist of multiple run-time reconfigurable tiles; multiple tiles can be connected to form an application-specific accelerator. In this paper, we present an AXI-Stream-compliant overlay architecture called StreamGrid, with an advanced multi-stream routing architecture, memory (DDR4, HBM) access for the application, and a configuration and monitoring system. Furthermore, the impact of buffering strategies, grid size, and the data width of the AXI-Stream interface is explored in terms of resource utilization and achievable clock frequency. The fastest configuration of the overlay architecture reaches a maximum clock frequency of 752 MHz on a Xilinx Alveo U280 FPGA card. Furthermore, a case study of a database query engine is evaluated and compared to a static design with the same functionality. The raw execution performance is comparable for both designs, but the setup time is drastically reduced from tens of minutes to less than 3 ms, efficiently enabling hardware-accelerated queries.
Christopher Blochwitz, León Philipp, Mladen Berekovic, Thilo Pionteck

Security

Frontmatter
Increasing Side-Channel Resistance by Netlist Randomization and FPGA-Based Reconfiguration
Abstract
Modern FPGAs are equipped with the possibility of Partial Reconfiguration (PR), which along with other benefits can be used to enhance the security of cryptographic implementations. This feature requires the development of alternative designs to be exchanged during run-time. In this work, we propose dynamically alterable circuits by exploring netlist randomization, which can be utilized with PR as a countermeasure against physical attacks, in particular side-channel attacks. The proposed approach involves modification of an AES implementation at the netlist level in order to create circuit variants which are functionally identical but structurally different. In preliminary experiments, power traces of these variants have been shuffled to replicate the effect of partial reconfiguration. With these dynamic circuits, our experimental results show an increase in the resistance against power side-channel attacks by a factor of ~12.6 on a Xilinx Zynq UltraScale+ device.
Ali Asghar, Benjamin Hettwer, Emil Karimov, Daniel Ziener
Moving Target and Implementation Diversity Based Countermeasures Against Side-Channel Attacks
Abstract
Side-channel attacks (SCAs) are among the major threats to embedded systems’ security, where implementation characteristics of cryptographic algorithms are exploited to extract secret parameters. The most common SCAs take advantage of electromagnetic (EM) leakage or power consumption recorded during device operation by placing an EM probe over the chip or measuring the voltage drop across an internal resistor, respectively. In this work, two SCA countermeasures are presented which address these two types of leakage vectors. The first countermeasure supports implementation diversity and moving target defense, while the second one generates random algorithmic noise. These concepts are implemented using the dynamic partial reconfiguration (DPR) feature of modern FPGA devices. Both of the countermeasures are easily scalable, and the effect of scalability on the area overhead and security strength is presented. We evaluate our design by measuring EM emanations from a state-of-the-art System-on-Chip (SoC) with 16 nm production technology. With the most secure variant, we are able to increase the resistance against Correlation Power Analysis (CPA) by a factor of 95 compared to an unprotected AES implementation.
Nadir Khan, Benjamin Hettwer, Jürgen Becker
Clone-Resistant Secured Booting Based on Unknown Hashing Created in Self-Reconfigurable Platform
Abstract
Deploying a physically unclonable trusted anchor is required for securing software running on embedded systems. Common mechanisms combine secure boot with either stored secret keys or keys extracted from a Physical Unclonable Function (PUF). We propose a new secure boot mechanism that is hardware-based, individual to each device, and keyless, to prohibit any unauthorized alteration of the software running on a particular device. Our solution is based on the so-called Secret Unknown Hash (SUH), a self-created, random, secret, unknown, hardwired hash function residing as a permanent digital hardware module in the device's physical layout. It is initiated in the device in an unpredictable, post-manufacturing, single-event process in self-reconfigurable non-volatile SoC FPGAs. In this work, we explain the SUH creation process and its integration for a device-specific secure boot. The SUH is shown to be lightweight when implemented in a sample scenario as a DM-PRESENT-based hash function. A security analysis is also presented, highlighting the entropies of the different proposed sample SUH classes.
Randa Zarrouk, Saleh Mulhem, Wael Adi, Mladen Berekovic

Posters

Frontmatter
Transparent Near-Memory Computing with a Reconfigurable Processor
Abstract
Data-intensive applications like machine learning or big data analysis have increased the demands on memory subsystems. They involve computational kernels whose performance is limited not by algorithmic complexity, but by the large amount of data they need to process. To counteract the growing gap between computing power and memory bandwidth, near-memory processing techniques have been proposed to significantly improve the performance of such applications. In this paper, we leverage a general-purpose processor extended with a reconfigurable framework to execute hardware-accelerated instructions. This framework features a high-bandwidth memory interface to the nearest memory controller, allowing for greatly increased bandwidth compared to the standard system bus. We introduce region-based data processing, which allows operations to be triggered by merely storing data and is especially suitable for large many-core designs. We show two different approaches to trigger the architecture for near-memory operations: one using interrupts for software-assisted processing, and one directly interfacing with the hardware accelerator. Our evaluations show a performance gain of 72% on SAD kernels, with memory performance improved by 48%. Benchmarking AES encryption, we show a speedup of 70%.
Fabian Lesniak, Fabian Kreß, Jürgen Becker
A Dataflow Architecture for Real-Time Full-Search Block Motion Estimation
Abstract
Motion estimation is the cornerstone of the main video compression standards, which are based on reducing the temporal redundancy between consecutive frames. Although the mechanism is simple, the best method, Full Search, uses a brute-force approach which is not suited for real-time applications. This work introduces a high-performance architecture for performing on-the-fly full-search block matching estimation on FPGA devices, which has been modeled in the C++ programming language and synthesized with Vivado HLS for a Xilinx ZC706 prototyping board. The architecture is based on a dataflow datapath and is configurable, enabling fast and easy exploration of the solution space. On-board results achieve a maximum performance of 743 fps, 247 fps and 110 fps for VGA, HD and FHD video resolutions, respectively, for a typical macroblock size of 16 × 16 pixels and a search area of ±16 pixels.
Jesús Barba, Julián Caba, Soledad Escolar, Jose A. De La Torre, Fernando Rincón, Juan C. López
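The brute-force computation that the dataflow architecture pipelines is easy to state in software. Below is a plain, unoptimized reference for full-search SAD block matching of one macroblock; the FPGA design evaluates these candidate positions in parallel rather than sequentially.

```cpp
// Reference full-search block matching: exhaustive sum-of-absolute-differences
// (SAD) over a +/-R search window for one BxB macroblock.
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

struct MotionVector { int dx, dy; uint32_t sad; };

MotionVector full_search(const std::vector<uint8_t>& cur,
                         const std::vector<uint8_t>& ref,
                         int width, int height, int bx, int by,
                         int B = 16, int R = 16) {
  MotionVector best{0, 0, std::numeric_limits<uint32_t>::max()};
  for (int dy = -R; dy <= R; ++dy) {
    for (int dx = -R; dx <= R; ++dx) {
      int rx = bx + dx, ry = by + dy;
      if (rx < 0 || ry < 0 || rx + B > width || ry + B > height) continue;
      uint32_t sad = 0;  // accumulate |cur - ref| over the whole block
      for (int y = 0; y < B; ++y)
        for (int x = 0; x < B; ++x)
          sad += std::abs(int(cur[(by + y) * width + bx + x]) -
                          int(ref[(ry + y) * width + rx + x]));
      if (sad < best.sad) best = {dx, dy, sad};
    }
  }
  return best;
}

int main() {
  int w = 64, h = 64;
  std::vector<uint8_t> cur(w * h, 100), ref(w * h, 100);
  MotionVector mv = full_search(cur, ref, w, h, 16, 16);
  return mv.sad == 0 ? 0 : 1;  // identical frames: best SAD is 0 at (0,0)
}
```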
Providing Tamper-Secure SoC Updates Through Reconfigurable Hardware
Abstract
Remote firmware updates have become the de facto standard to guarantee a secure deployment of often decentrally operated IoT devices. However, the transfer and provision of updates are considered highly security-critical. Immunity requirements, such as the authenticity of the update provider and the integrity and confidentiality of the content, typically loaded from an external cloud server over an untrusted network, are therefore mandatory. This is especially true for FPGA-based programmable System-on-Chip (PSoC) architectures, as they are ideal implementation candidates for products with a long lifetime due to the adaptivity of both their software and hardware configurations. In this paper, we propose a methodology for securely updating PSoC architectures by exploiting the reconfigurable logic of the FPGA. In the proposed approach, the FPGA serves as a secure anchor point by performing the required authenticity and integrity checks before allowing the system update to be installed. In particular, a hardware design called the Trusted Update Unit (TUU) is defined, which is loaded from memory for the duration of an update session to first verify the identity of an external update provider and then, based on this verification, to establish a secure channel for protected data transfers. The proposed approach also secures the confidentiality of cryptographic keys even if the software of the PSoC is compromised, by applying them only as device-intrinsic secrets. Finally, an implementation of the approach on a Xilinx Zynq PSoC is described and evaluated with respect to the design objectives of performance and resource cost.
Franz-Josef Streit, Stefan Wildermann, Michael Pschyklenk, Jürgen Teich
Graviton: A Reconfigurable Memory-Compute Fabric for Data Intensive Applications
Abstract
The rigid organization and distribution of computational and memory resources often limits how well accelerators can cope with changing algorithms and increasing dataset sizes, and how efficiently they use their computational and memory resources. In this work, we leverage a novel computing paradigm and propose a new memory-based reconfigurable fabric, Graviton. We demonstrate the ability to dynamically trade memory for compute and vice versa, and to tune the architecture of the underlying hardware to suit the memory and compute requirements of the application. On a die-to-die basis, Graviton provides up to 47× more on-chip memory capacity than an Alveo U250 SLR with just 1.7% additional area compared to modern FPGAs, and is 28.7× faster, on average, on a range of compute- and data-intensive tasks.
Ashutosh Dhar, Paul Reckamp, Jinjun Xiong, Wen-mei Hwu, Deming Chen
Dynamic Spatial Multiplexing on FPGAs with OpenCL
Abstract
Recent advances in High-Level Synthesis (HLS) allow software developers to offload compute kernels to FPGAs without deep knowledge of low-level hardware description languages. However, this abstraction comes at the cost of control over the bitstream and thus complicates features like partial reconfiguration. We introduce a vendor-agnostic, high-level approach for time and space multiplexing on OpenCL-programmed FPGAs. It dynamically adjusts the FPGA's configuration to provide load balancing between multiple kernels on the same device. Our method uses several configurations, each with a different amount of FPGA resources dedicated to the respective kernel. We introduce a model that decides which configuration is selected based on the projected runtime of the enqueued tasks. Our model and its implementation, Forecast, are demonstrated with an online scheduler on a current high-end FPGA. We find that Forecast makes automatic handling of configurations in HLS applications possible.
Pascal Jungblut, Dieter Kranzlmüller
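The selection model can be illustrated with a toy version: given candidate configurations with per-kernel throughputs and a reconfiguration penalty, pick the configuration that minimizes the projected completion time of the enqueued work. All names and numbers below are hypothetical, not Forecast's actual model.

```cpp
// Toy configuration selection: minimize projected completion time of queued
// work, charging a reconfiguration penalty when switching configurations.
#include <cstddef>
#include <iostream>
#include <vector>

struct Config {
  const char* name;
  std::vector<double> throughput;  // work units per second, per kernel
};

std::size_t pick_config(const std::vector<Config>& configs,
                        const std::vector<double>& queued_work,
                        std::size_t current, double reconfig_seconds) {
  std::size_t best = current;
  double best_t = 1e300;
  for (std::size_t i = 0; i < configs.size(); ++i) {
    double t = (i == current) ? 0.0 : reconfig_seconds;  // switching penalty
    for (std::size_t k = 0; k < queued_work.size(); ++k)
      t += queued_work[k] / configs[i].throughput[k];    // projected runtime
    if (t < best_t) { best_t = t; best = i; }
  }
  return best;
}

int main() {
  std::vector<Config> cfgs = {{"A-heavy", {8.0, 2.0}}, {"B-heavy", {2.0, 8.0}}};
  std::vector<double> queue = {10.0, 80.0};  // mostly kernel-B work enqueued
  std::cout << cfgs[pick_config(cfgs, queue, 0, 0.5)].name << "\n";  // B-heavy
}
```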
Accelerating Convolutional Neural Networks in FPGA-based SoCs using a Soft-Core GPU
Abstract
Field-Programmable Gate Arrays (FPGAs) have increased in complexity over the last few years. Now available in the form of Systems-on-Chip (SoCs) such as Xilinx Zynq or Intel Stratix, they offer users significant flexibility in deciding the best approach to execute a Deep Learning (DL) model: a) on a fixed, hardwired general-purpose processor, or b) using the programmable logic to implement application-specific processing cores. While the latter choice offers the best performance and energy efficiency, the programmable logic's limited size requires advanced strategies for mapping large models onto hardware. In this work, we investigate using a soft-core Graphics Processing Unit (GPU), implemented in the FPGA, to execute different Convolutional Neural Networks (CNNs). We evaluate the performance, area, and energy trade-offs of running each layer a) on an ARM Cortex-A9 with Neon extensions and b) on the soft-core GPU, and find that the GPU overlay can provide a mean acceleration of 5.9× and 1.8× in convolution and max-pooling layers, respectively, compared to the ARM core. Finally, we show the potential of collaborative execution of CNNs on these two platforms together, with an average speedup of 2× and 4.8× compared to using only the ARM core or only the soft GPU, respectively.
Hector Gerardo Munoz Hernandez, Mitko Veleski, Marcelo Brandalero, Michael Hübner
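For reference, the two layer types behind the reported 5.9× and 1.8× speedups are short loop nests. The scalar sketch below shows a "valid" 2D convolution and a non-overlapping 2×2 max-pooling; the soft GPU executes the same arithmetic spread across many threads.

```cpp
// Minimal direct convolution + 2x2 max-pooling, single channel, no padding.
#include <algorithm>
#include <vector>

// 'valid' convolution of an HxW image with a KxK kernel.
std::vector<float> conv2d(const std::vector<float>& img, int H, int W,
                          const std::vector<float>& ker, int K) {
  int OH = H - K + 1, OW = W - K + 1;
  std::vector<float> out(OH * OW, 0.0f);
  for (int oy = 0; oy < OH; ++oy)
    for (int ox = 0; ox < OW; ++ox)
      for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
          out[oy * OW + ox] += img[(oy + ky) * W + ox + kx] * ker[ky * K + kx];
  return out;
}

// Non-overlapping 2x2 max pooling (H and W assumed even).
std::vector<float> maxpool2x2(const std::vector<float>& in, int H, int W) {
  std::vector<float> out((H / 2) * (W / 2));
  for (int y = 0; y < H / 2; ++y)
    for (int x = 0; x < W / 2; ++x)
      out[y * (W / 2) + x] = std::max(
          std::max(in[2 * y * W + 2 * x], in[2 * y * W + 2 * x + 1]),
          std::max(in[(2 * y + 1) * W + 2 * x], in[(2 * y + 1) * W + 2 * x + 1]));
  return out;
}

int main() {
  std::vector<float> img(8 * 8, 1.0f), ker(3 * 3, 1.0f);
  auto c = conv2d(img, 8, 8, ker, 3);   // 6x6 map, every value 9.0
  auto p = maxpool2x2(c, 6, 6);         // 3x3 map
  return p[0] == 9.0f ? 0 : 1;
}
```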
Evaluating the Design Space for Offloading 3D FFT Calculations to an FPGA for High-Performance Computing
Abstract
The 3D Fast Fourier Transformation (3D FFT) is a critical routine in a number of today's HPC workloads. Optimising it for computational performance with the use of FPGAs has been the focus of several studies. However, a systematic study of the viability of the different scenarios in which FPGA-accelerated 3D FFT implementations can be integrated into real-world applications originally implemented on high-end CPUs has been missing. In this paper, we address this with two scenarios for offloading 3D FFT computations to an FPGA and investigate their feasibility in comparison to highly optimised FFTW-based executions on CPU, in terms of computation time and power consumption. In the first scenario, the performance of offloading individual 3D FFTs to the FPGA is found to be limited by the latency and unidirectional bandwidth of PCIe data transfers. This bottleneck is overcome in the second scenario by overlapping data transfers and FPGA computations in a batched mode, reaching performance competitive with the CPU. In both scenarios, projections to next-generation PCIe connections show additional potential for the FPGA, with up to 2x speedup over CPU executions. Furthermore, measurements indicate 3.7x to 4.1x lower average power consumption on the FPGA.
Arjun Ramaswami, Tobias Kenter, Thomas D. Kühne, Christian Plessl
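The second scenario's batched overlap is a classic double-buffering pipeline: while batch i computes on the device, batch i+1 is transferred, hiding PCIe latency. The sketch below mimics that structure with generic threads standing in for the vendor's asynchronous queue API; function names and buffer sizes are illustrative.

```cpp
// Conceptual double buffering: overlap the next batch's transfer with the
// current batch's computation. transfer() and compute() are stand-ins for
// host->FPGA DMA and the on-device 3D FFT, respectively.
#include <functional>
#include <future>
#include <vector>

void transfer(std::vector<float>&) { /* host->FPGA DMA stand-in */ }
void compute(std::vector<float>&)  { /* on-device 3D FFT stand-in */ }

int main() {
  const int batches = 8;
  std::vector<std::vector<float>> bufs(2, std::vector<float>(1 << 12));
  transfer(bufs[0]);  // prime the pipeline with the first batch
  for (int i = 0; i < batches; ++i) {
    auto fft = std::async(std::launch::async, compute, std::ref(bufs[i % 2]));
    if (i + 1 < batches) transfer(bufs[(i + 1) % 2]);  // overlap next transfer
    fft.wait();                                        // batch i finished
  }
}
```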
FPGA Implementation of Custom Floating-Point Logarithm and Division
Abstract
The mathematical operations logarithm and division are widely used in many algorithms, including digital image and signal processing algorithms, and are typically performed by approximate computing through piece-wise polynomial functions. In this paper we present dedicated FPGA architectures for implementing the logarithm and division operations in floating-point arithmetic. The proposed hardware modules are customizable, with the mantissa and exponent fields of the floating-point representation defined as parameters that can be set by the hardware designer. The design flow of the arithmetic blocks allows the generation of a set of custom-precision floating-point architectures, which can result in compact hardware when the numerical computations require less numerical range or precision. The paper also describes bit-width optimization, in which precision analysis and differential evolution (a genetic-algorithm-based method) are applied to reduce the power consumption and resource usage in the FPGA, minimizing the number of flip-flops, lookup tables and DSP blocks according to an accuracy chosen at design time, leading to significant resource savings compared to existing IP cores.
Nelson Campos, Slava Chesnokov, Eran Edirisinghe, Alexis Lluis
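The customary range reduction behind such hardware logarithm units: split x into an exponent e and a mantissa m in [1, 2), so that log2(x) = e + log2(m), then approximate log2(m) with a low-degree polynomial (in hardware, piece-wise segments selected by mantissa bits). The sketch below uses a single quadratic fit with illustrative coefficients, accurate to roughly 0.01; it is not the paper's design.

```cpp
// Range reduction for log2: x = m * 2^e with m in [1,2), then approximate
// log2(m) = log2(1+t) by a quadratic in t. Coefficients are an illustrative
// fit (exact at t = 0 and t = 1, absolute error below about 0.01); a real
// FPGA unit would use fixed-point piece-wise segments for higher accuracy.
#include <cmath>
#include <cstdio>

double log2_approx(float x) {     // requires x > 0
  int e;
  float m = std::frexp(x, &e);    // x = m * 2^e with m in [0.5, 1)
  m *= 2.0f; --e;                 // renormalize so m is in [1, 2)
  float t = m - 1.0f;             // t in [0, 1)
  float p = t * (1.3399f - 0.3399f * t);  // ~log2(1+t)
  return e + p;
}

int main() {
  for (float x : {0.75f, 3.0f, 1000.0f})
    std::printf("x=%8.2f  approx=%9.5f  exact=%9.5f\n",
                x, log2_approx(x), std::log2(x));
}
```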
On the Suitability of Read Only Memory for FPGA-Based CAM Emulation Using Partial Reconfiguration
Abstract
Content-addressable memory (CAM) is a high-speed searching memory that provides the address of the input search key in one clock cycle. Traditional implementations of large CAMs on FPGAs with updatable memory elements are resource-intensive, requiring bigger and more expensive chips. The additional circuitry required for updating the CAM content is a major contributor to this; it also impacts the overall clock performance of the circuit and thus hampers system throughput. To reduce the resource requirement, we investigate implementing CAMs using read only memories (ROMs) and updating their contents through partial reconfiguration (PR). Using a high-speed reconfiguration controller and bitstream compression, the reconfiguration overhead due to PR is offset. The results show improvements of 10% in hardware resources and 200% in speed compared to the state-of-the-art FPGA-based TCAMs.
Muhammad Irfan, Kizheppatt Vipin, Ray C. C. Cheung
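Behaviourally, a CAM answers "at which address is this key stored?" in a single cycle by comparing all entries in parallel. The software model below captures those semantics; in the paper's ROM-based approach, the stored keys are fixed in the bitstream and changed via partial reconfiguration instead of dedicated update circuitry.

```cpp
// Behavioral model of a CAM lookup over ROM contents. In hardware, every
// comparison is a parallel match line feeding a priority encoder (one clock
// cycle); in software we model it as a linear scan.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

std::optional<std::size_t> cam_lookup(const std::vector<uint32_t>& rom_keys,
                                      uint32_t search_key) {
  for (std::size_t addr = 0; addr < rom_keys.size(); ++addr)
    if (rom_keys[addr] == search_key) return addr;
  return std::nullopt;  // no match line asserted
}

int main() {
  std::vector<uint32_t> rom = {0xCAFE, 0xBEEF, 0xF00D};  // fixed until PR
  if (auto addr = cam_lookup(rom, 0xBEEF))
    std::cout << "match at address " << *addr << "\n";   // address 1
}
```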
Domain-Specific Modeling and Optimization for Graph Processing on FPGAs
Abstract
The use of High-Level Synthesis (HLS) tools is on the rise; however, performance modelling research has mainly focused on regular applications with uniform memory access patterns. These performance models fail to accurately capture the performance of graph applications with irregular memory access patterns. This paper presents a domain-specific performance model targeting graph applications synthesized with HLS tools for FPGAs. The performance model utilizes information from the hardware specification, the application's kernel, and the input graph. While the compilation process of HLS tools takes hours, the information required by the performance model can be extracted from the intermediate compilation report, which is generated in seconds. The goal of this work is to provide FPGA users with a performance modelling framework for graph applications, to estimate performance and explore the optimization space. We tested the framework on Intel's new DevCloud platform and achieved speedups of up to 3.4× by applying our framework's recommended optimization strategy, compared to the single-pipeline implementation. The framework recommended the best optimization strategy in 90% of the test cases.
Mohamed W. Hassan, Peter M. Athanas, Yasser Y. Hanafy
Covid4HPC: A Fast and Accurate Solution for Covid Detection in the Cloud Using X-Rays
Abstract
The Covid-19 pandemic has devastated social life and damaged the global economy, with a constantly increasing number of cases and fatalities each day. A popular and cheap screening method is chest X-rays; however, it is impossible for every patient with a respiratory illness to be tested quickly and quarantined in time. Thus, an automated approach is needed, an effort the research community is actively pursuing. Specifically, we introduce a Deep Neural Network topology that can classify chest X-ray images from patients into three classes: Covid-19, viral pneumonia, and normal. Detecting Covid-19 infections on X-rays with high accuracy is crucial and can aid doctors in their medical diagnosis. However, there is an enormous amount of data to process, which takes up time and computing energy. We therefore take a step further and deploy this Neural Network (NN) on a Xilinx cloud FPGA platform, as these devices are proven to be fast and power-efficient. The aim is to provide a medical solution in the cloud for hospitals, in order to facilitate medical diagnosis with accuracy, speed and power efficiency. To the best of our knowledge, this application has not yet been considered for FPGAs, while the accuracy and speed achieved surpass any previously known implementation of NNs for X-ray Covid detection. Specifically, it can classify X-ray images at a rate of 3600 FPS with 96.2% accuracy, with a speedup of 3.1× vs. GPU and 17.6× vs. CPU in performance, and 4.6× vs. GPU and 13.1× vs. CPU in power efficiency.
Dimitrios Danopoulos, Christoforos Kachris, Dimitrios Soudris
Backmatter
Metadata
Title
Applied Reconfigurable Computing. Architectures, Tools, and Applications
Edited by
Steven Derrien
Frank Hannig
Pedro C. Diniz
Daniel Chillet
Copyright Year
2021
Electronic ISBN
978-3-030-79025-7
Print ISBN
978-3-030-79024-0
DOI
https://doi.org/10.1007/978-3-030-79025-7