
2017 | Book

Applied Reconfigurable Computing

13th International Symposium, ARC 2017, Delft, The Netherlands, April 3-7, 2017, Proceedings


About this book

This book constitutes the refereed proceedings of the 13th International Symposium on Applied Reconfigurable Computing, ARC 2017, held in Delft, The Netherlands, in April 2017.
The 17 full papers and 11 short papers presented in this volume were carefully reviewed and selected from 49 submissions. They are organized in topical sections on adaptive architectures, embedded computing and security, simulation and synthesis, design space exploration, fault tolerance, FPGA-based designs, neural networks, and languages and estimation techniques.

Table of contents

Frontmatter

Adaptive Architectures

Frontmatter
Improving the Performance of Adaptive Cache in Reconfigurable VLIW Processor
Abstract
In this paper, we study the impact of cache reconfiguration on cache misses when the issue-width of a VLIW processor is changed. Our investigation pertains to the local temporal effects of cache resizing and to counteracting the negative impact of cache misses at such resizing instances. We propose a novel reconfigurable d-cache framework that can dynamically adapt its least recently used (LRU) replacement policy without much hardware overhead. We demonstrate that our adaptive d-cache ensures smooth cache performance when transitioning from one cache size to another. This approach is orthogonal to future research in cache resizing for such architectures that takes into account the energy consumption and performance of the overall application.
Sensen Hu, Anthony Brandon, Qi Guo, Yizhuo Wang
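To illustrate the mechanism this abstract builds on, here is a minimal software model of an LRU cache whose capacity can be changed at run time, evicting least-recently-used entries on downsizing. It is a hypothetical sketch of the policy, not the paper's hardware design; the class name and access pattern are invented for illustration.

```python
from collections import OrderedDict

class ResizableLRUCache:
    """Software model of an LRU cache whose capacity can change at
    run time; downsizing evicts the least-recently-used entries."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order = LRU order
        self.hits = 0
        self.misses = 0

    def access(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[key] = True
        return False

    def resize(self, new_capacity):
        # When shrinking, evict LRU entries until the new size fits.
        while len(self.entries) > new_capacity:
            self.entries.popitem(last=False)
        self.capacity = new_capacity

cache = ResizableLRUCache(4)
for line in [1, 2, 3, 4, 1, 2]:
    cache.access(line)
cache.resize(2)          # keep only the two most recently used lines
assert list(cache.entries) == [1, 2]
```

A hardware d-cache cannot iterate like this, of course; the point is only that resizing interacts with the replacement order, which is the effect the paper's adaptive policy counteracts.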
LP-P\(^2\)IP: A Low-Power Version of the P\(^2\)IP Architecture Using Partial Reconfiguration
Abstract
Power consumption reduction is crucial for portable equipment and for devices in remote locations whose battery replacement is impracticable. P\(^2\)IP is an architecture targeting real-time embedded image and video processing, which combines runtime reconfigurable processing, low latency and high performance. Being a configurable architecture allows powerful video processing operators (Processing Elements, or PEs) to be combined to build the target application. However, many applications do not require all available PEs. While idle, these PEs still represent a power consumption problem that Partial Reconfiguration can mitigate. To assess the impact on energy consumption, another P\(^2\)IP implementation based on Partial Reconfiguration was developed and tested with three different image processing applications. Measurements were made to analyze energy consumption when executing each of the three applications. Results show that, compared to the original implementation of the architecture, the use of Partial Reconfiguration leads to power savings of up to 45%.
Álvaro Avelino, Valentin Obac, Naim Harb, Carlos Valderrama, Glauberto Albuquerque, Paulo Possa
NIM: An HMC-Based Machine for Neuron Computation
Abstract
Neural network simulation has emerged as a methodology to help solve computational problems by mirroring brain behavior. However, to achieve consistent simulation results, large sets of workloads need to be evaluated. In this work, we present a neural in-memory simulator capable of executing deep learning applications inside 3D-stacked memories. By reducing data movement and including a simple accelerator layer near memory, our system was able to outperform traditional multi-core devices, while reducing overall system energy consumption.
Geraldo F. Oliveira, Paulo C. Santos, Marco A. Z. Alves, Luigi Carro
VLIW-Based FPGA Computation Fabric with Streaming Memory Hierarchy for Medical Imaging Applications
Abstract
In this paper, we present and evaluate an FPGA acceleration fabric that uses VLIW softcores as processing elements, combined with a memory hierarchy that is designed to stream data between intermediate stages of an image processing pipeline. These pipelines are commonplace in medical applications such as X-ray imagers. By using a streaming memory hierarchy, performance is increased by a factor that depends on the number of stages (\(7.5\times \) when using 4 consecutive filters). Using a Xilinx VC707 board, we are able to place up to 75 cores. A platform of 64 cores can be routed at 193 MHz, achieving real-time performance, while keeping 20% resources available for off-board interfacing.
Our VHDL implementation and associated tools (compiler, simulator, etc.) are available for download for the academic community.
Joost Hoozemans, Rolf Heij, Jeroen van Straten, Zaid Al-Ars

Embedded Computing and Security

Frontmatter
Hardware Sandboxing: A Novel Defense Paradigm Against Hardware Trojans in Systems on Chip
Abstract
A novel approach for the mitigation of hardware Trojans in Systems on Chip (SoC) is presented. With the assumption that Trojans can cause harm only when they are activated, the goal is to avoid cumbersome and sometimes destructive pre-fabrication and pre-deployment tests for Trojans in SoCs, by building systems capable of capturing Trojan activation or simply nullifying their effect at run-time to prevent damage to the system. To reach this goal, non-trusted third-party IPs and components off the shelf (COTS) are executed in sandboxes with checkers and virtual resources. While checkers are used to detect run-time activation of Trojans and mitigate potential damage to the system, virtual resources are provided to IPs in the sandbox, thus preventing direct access to physical resources. Our approach was validated with benchmarks from trust-hub.com and a synthetic system-on-FPGA scenario using the same benchmarks. All our results showed 100% Trojan detection and mitigation, with only a minimal increase in resource overhead and no performance decrease.
Christophe Bobda, Joshua Mead, Taylor J. L. Whitaker, Charles Kamhoua, Kevin Kwiat
Rapid Development of Gzip with MaxJ
Abstract
Design productivity is essential for high-performance application development involving accelerators. Low-level hardware description languages such as Verilog and VHDL are widely used to design FPGA accelerators; however, they require significant expertise and considerable design effort. Recent advances in high-level synthesis have brought forward tools that relieve the burden of FPGA application development, but the achieved performance results cannot match designs made using low-level languages. In this paper we compare different FPGA implementations of gzip. All of them implement the same system architecture using different languages. This allows us to compare Verilog, OpenCL and MaxJ design productivity. First, we illustrate several conceptual advantages of the MaxJ language and its platform over OpenCL. Next, using our gzip implementation as an example, we show how an engineer without previous MaxJ experience can quickly develop and optimize a real, complex application. The gzip design in MaxJ presented here took only one man-month to develop and achieved better performance than the related work created in Verilog and OpenCL.
Nils Voss, Tobias Becker, Oskar Mencer, Georgi Gaydadjiev
On the Use of (Non-)Cryptographic Hashes on FPGAs
Abstract
Hash functions are used for numerous applications in computer networking, both on classical CPU-based systems and on dedicated hardware like FPGAs. During system development, hardware implementations require particular attention to take full advantage of performance gains through parallelization when using hashes. For many use cases, such as hash tables or Bloom filters, several independent short hash values for the same input key are needed. Here we consider the question of how to save resources by splitting one large hash value into multiple sub-hashes. We demonstrate that even small flaws in the avalanche effect of a hash function induce significant deviations from a uniform distribution in such sub-hashes, which enables potential denial-of-service attacks. We further evaluate the cryptographic hash SHA3 and non-cryptographic hashes that do not exhibit such weaknesses, in terms of resource usage and latency in an FPGA implementation. The results show that while SHA3 was intended for security applications, it also outperforms the non-cryptographic hashes for other use cases on FPGAs.
Andreas Fiessler, Daniel Loebenberger, Sven Hager, Björn Scheuermann
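The splitting idea discussed in this abstract can be sketched in a few lines: one large digest is cut into several short, independently usable sub-hashes, e.g. to address multiple Bloom filter positions. This is an illustrative sketch only; the key, the number of sub-hashes and their width are invented, and SHA3-256 via Python's `hashlib` stands in for the hardware hash cores the paper studies.

```python
import hashlib

def sub_hashes(key: bytes, k: int, bits: int):
    """Split one large hash digest into k sub-hashes of `bits` bits
    each, e.g. to derive k Bloom filter indices from one hash call."""
    digest = hashlib.sha3_256(key).digest()
    assert k * bits <= 8 * len(digest), "digest too short to split"
    value = int.from_bytes(digest, "big")
    mask = (1 << bits) - 1
    return [(value >> (i * bits)) & mask for i in range(k)]

# Four 16-bit indices for a 65536-slot Bloom filter, from one SHA3 call.
idx = sub_hashes(b"example-read", 4, 16)
assert len(idx) == 4 and all(0 <= i < 2**16 for i in idx)
```

The paper's point is that this splitting is only safe when the underlying hash has a strong avalanche effect; with a weak hash, the sub-hashes deviate from uniformity.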
An FPGA-Based Implementation of a Pipelined FFT Processor for High-Speed Signal Processing Applications
Abstract
In this study, we propose an efficient, 1024 point, pipelined FFT processor based on the radix-2 decimation-in-frequency (R2DIF) algorithm using the single-path delay feedback (SDF) pipelined architecture. The proposed FFT processor is designed as an intellectual property (IP) logic core for easy integration into digital signal processing (DSP) systems. It employs the shift-add method to optimize the multiplication of twiddle factors instead of the dedicated, embedded functional blocks. The proposed design is implemented on a Xilinx Virtex-7 field programmable gate array (FPGA). The experimental results show that the proposed FFT design is more efficient in terms of speed, accuracy and resource utilization as compared to existing designs and hence more suitable for high-speed DSP applications.
Ngoc-Hung Nguyen, Sheraz Ali Khan, Cheol-Hong Kim, Jong-Myon Kim
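The shift-add method mentioned in this abstract replaces a dedicated multiplier with shifts and additions over the set bits of a constant. The sketch below is a hypothetical software analogue, not the paper's circuit; the Q1.8 fixed-point format and the cos(π/4) twiddle coefficient are chosen only for illustration.

```python
def shift_add_mul(x: int, coeff: int) -> int:
    """Multiply x by a constant using only shifts and adds, the
    hardware-style alternative to a dedicated multiplier block."""
    acc = 0
    shift = 0
    while coeff:
        if coeff & 1:
            acc += x << shift   # add x shifted by this bit position
        coeff >>= 1
        shift += 1
    return acc

# Q1.8 fixed-point twiddle factor: cos(pi/4) ~ 0.7071 -> round(0.7071*256) = 181
COS_PI_4_Q8 = 181
x = 1000
product = shift_add_mul(x, COS_PI_4_Q8) >> 8  # rescale back from Q8
assert product == (1000 * 181) >> 8
```

In hardware, the constant is known at synthesis time, so only the set bits cost an adder each, which is why the method saves the embedded DSP blocks.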

Simulation and Synthesis

Frontmatter
Soft Timing Closure for Soft Programmable Logic Cores: The ARGen Approach
Abstract
Reconfigurable cores support post-release updates, which shortens time-to-market while extending circuits’ lifespan. Reconfigurable cores can be provided as hard cores (ASIC) or soft cores (RTL). Soft reconfigurable cores outperform hard reconfigurable cores by preserving the ASIC synthesis flow, at the cost of lower scalability and exacerbated timing closure issues. This article tackles these two issues and introduces the ARGen generator, which produces scalable soft reconfigurable cores. The architectural template relies on injecting flip-flops into the interconnect to favor easy and accurate timing estimation. The cores are compliant with the academic standard place-and-route environment, making ARGen a one-stop shop for whoever needs exploitable soft reconfigurable cores.
Théotime Bollengier, Loïc Lagadec, Mohamad Najem, Jean-Christophe Le Lann, Pierre Guilloux
FPGA Debugging with MATLAB Using a Rule-Based Inference System
Abstract
This paper presents an FPGA debugging methodology using a rule-based inference system. Using this approach, the design stops a device under test (DUT), saves the data to external memory and then starts the DUT again. The saved data is used by MATLAB to debug the system with a rule-based inference system. Normally, a debug system only displays the monitored data, and the decision-making process is left to the user; a rule-based inference system, in contrast, can itself decide on the correct functionality of the system. The main benefits of this technique are no loss of debugging data thanks to an unlimited debug window, no need for HDL simulators for waveform viewing, and shorter debugging time through software-based verification.
Habib Ul Hasan Khan, Diana Göhringer
Hardness Analysis and Instrumentation of Verilog Gate Level Code for FPGA-based Designs
Abstract
Dependability analysis and test approaches are key steps to test and verify system robustness and fault-tolerance capabilities. Owing to the shrinking size of components, it is very difficult to guarantee an acceptable degree of reliability. With their growing computational power and other diverse advantages, FPGAs have become indispensable solutions for embedded applications. However, these systems are also prone to faults and errors. Therefore, testability and dependability analysis are necessary. Both methods require the deliberate introduction of faults into the system under test (SUT). In this paper, a fault injection algorithm is proposed for Verilog gate-level code, which injects faults into the design. A method is also proposed for finding sensitive locations in the SUT. Both methods are implemented in a fault injection tool with a GUI, named RASP-FIT, for ease of use. Benchmark circuits from ISCAS’85 and ISCAS’89 are considered to validate both proposed methods.
Abdul Rafay Khatri, Ali Hayek, Josef Börcsök
A Framework for High Level Simulation and Optimization of Coarse-Grained Reconfigurable Architectures
Abstract
High-level simulation tools are used for optimization and design space exploration of digital circuits for a target Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC) implementation. Compared to ASICs, FPGAs are slower and less power-efficient, but they are programmable, flexible and offer faster prototyping. One reason for the slower performance of FPGAs is their finer granularity, as they operate at bit level. A possible solution is Coarse-Grained Reconfigurable Architectures (CGRAs), which work at word level. A myriad of CGRAs already exist, differing in their architectural parameters. However, CGRA research lacks design automation, since high-level simulation and optimization tools targeted at CGRAs are nearly non-existent. In this paper, we propose a high-level simulation and optimization framework for mesh-based homogeneous CGRAs. As expected, the results show that auto-generated homogeneous CGRAs consume 54% more resources than academic FPGAs, while providing around 63.3% faster mapping time.
Muhammad Adeel Pasha, Umer Farooq, Muhammad Ali, Bilal Siddiqui

Design Space Exploration

Frontmatter
Parameter Sensitivity in Virtual FPGA Architectures
Abstract
Virtual FPGAs add the benefits of increased flexibility and application portability at the bitstream level across any underlying commercial off-the-shelf FPGA, at the expense of additional area and delay overhead. Hence, it becomes a priority to tune the architecture parameters of the virtual layer. Here, the adoption of parameter recommendations intended for physical FPGAs can be misleading, as they are based on transistor-level models. This paper presents an extensive study of architectural parameters and their effects on area and performance, introducing an extended parameterizable virtual FPGA architecture and deriving suitable area and delay models. Furthermore, a design space exploration methodology based on these models is carried out. An analysis of over 1400 benchmark runs with various combinations of cluster and LUT size reveals high parameter sensitivity, with variances up to \(\pm 95.9\%\) in area and \(\pm 78.1\%\) in performance, and a discrepancy with studies on physical FPGAs.
Peter Figuli, Weiqiao Ding, Shalina Figuli, Kostas Siozios, Dimitrios Soudris, Jürgen Becker
Custom Framework for Run-Time Trading Strategies
Abstract
A trading strategy is generally optimised for a given market regime. If it takes too long to switch from one trading strategy to another, then a sub-optimal trading strategy may be adopted. This paper proposes the first FPGA-based framework which supports multiple trend-following trading strategies to obtain accurate market characterisation for various financial market regimes. The framework contains a trading strategy kernel library covering a number of well-known trend-following strategies, such as “triple moving average”. Three types of design are targeted: a static reconfiguration trading strategy (SRTS), a full reconfiguration trading strategy (FRTS), and a partial reconfiguration trading strategy (PRTS). Our approach is evaluated using both synthetic and historical market data. Compared to a fully optimised CPU implementation, the SRTS design achieves 11 times speedup, the FRTS design achieves 2 times speedup, while the PRTS design achieves 7 times speedup. The FRTS and PRTS designs also reduce the amount of resources used on chip by 29% and 15% respectively, when compared to the SRTS design.
Andreea-Ingrid Funie, Liucheng Guo, Xinyu Niu, Wayne Luk, Mark Salmon
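A "triple moving average" rule of the kind named in this abstract compares a short-, a mid- and a long-window average and trades on their ordering. The sketch below is purely illustrative; the window lengths and signal convention are invented and are not the paper's kernel parameters.

```python
def sma(prices, window):
    """Simple moving average over the trailing `window` prices."""
    return sum(prices[-window:]) / window

def triple_ma_signal(prices, short=3, mid=5, long=8):
    """Return +1 (go long), -1 (go short) or 0 (stay flat) when the
    short-, mid- and long-window averages are consistently ordered."""
    if len(prices) < long:
        return 0
    s, m, l = sma(prices, short), sma(prices, mid), sma(prices, long)
    if s > m > l:
        return 1    # uptrend: faster averages sit above slower ones
    if s < m < l:
        return -1   # downtrend
    return 0

rising = list(range(1, 11))              # steadily increasing prices
assert triple_ma_signal(rising) == 1
assert triple_ma_signal(rising[::-1]) == -1
```

Switching between several such rules is cheap in software but, as the paper argues, an FPGA implementation must choose between keeping all kernels resident (SRTS) or reconfiguring fully or partially (FRTS/PRTS).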
Exploring HLS Optimizations for Efficient Stereo Matching Hardware Implementation
Abstract
Nowadays, FPGA technology offers a tremendous number of logic cells on a single chip. Digital design for such huge hardware resources under time-to-market constraints has driven the evolution of High Level Synthesis (HLS) tools. In this work, we explore several HLS optimization steps in order to improve system performance. Different design choices are obtained from our exploration, such that an efficient implementation can be selected based on given system constraints (resource utilization, power consumption, execution time, ...). Our exploration methodology is illustrated through a case study considering a Multi-Window Sum of Absolute Differences stereo matching algorithm. We implemented our design on a Xilinx Zynq ZC706 FPGA evaluation board for gray images of size \(640\times 480\).
Karim M. A. Ali, Rabie Ben Atitallah, Nizar Fakhfakh, Jean-Luc Dekeyser
Architecture Reconfiguration as a Mechanism for Sustainable Performance of Embedded Systems in case of Variations in Available Power
Abstract
The paper presents a method for deriving a high-level power consumption estimation (PCE) model for FPGAs with tile-based architectures. This model can be used by systems with multi-task workloads to support run-time architecture-to-workload adaptation, in order to sustain the performance of critical tasks under depleting power. The approach is based on reconfiguring implementation variants of tasks by estimating their power consumption at run-time using the derived model. This reduces system power consumption by lowering the performance of non-critical tasks while maintaining critical-task performance at the required level, which in turn prolongs system activity for the required period. The paper demonstrates the derivation of a PCE model for a System on Programmable Chip (SoPC) deployed on a Xilinx Zynq XC7Z020 FPGA, and shows how this SoPC adapts to depleting power, sustaining the performance of its critical task for an additional hour.
Dimple Sharma, Victor Dumitriu, Lev Kirischian

Fault Tolerance

Frontmatter
Exploring Performance Overhead Versus Soft Error Detection in Lockstep Dual-Core ARM Cortex-A9 Processor Embedded into Xilinx Zynq APSoC
Abstract
This paper explores the use of dual-core lockstep as a fault-tolerance solution to increase dependability in hard-core processors embedded in APSoCs. As a case study, we designed and implemented a lockstep-based approach to protect the dual-core ARM Cortex-A9 processor embedded in the Zynq-7000 APSoC. Experimental results show the effectiveness of the proposed approach in mitigating around 91% of bit-flips injected into the ARM registers. It is also observed that the performance overhead depends on the application size, the number of checkpoints performed, and the checkpoint and rollback routines.
Ádria Barros de Oliveira, Lucas Antunes Tambara, Fernanda Lima Kastensmidt
Applying TMR in Hardware Accelerators Generated by High-Level Synthesis Design Flow for Mitigating Multiple Bit Upsets in SRAM-Based FPGAs
Abstract
This paper investigates the use of Triple Modular Redundancy (TMR) in hardware accelerator designs described in the C programming language and synthesized by High Level Synthesis (HLS). A setup composed of a soft-core processor and a matrix multiplication design protected by TMR, embedded into an SRAM-based FPGA, was analyzed under accumulated bit-flips in its configuration memory bits. Different configurations using single and multiple input and output workload data streams were tested. Results show that by using coarse-grain TMR with triplicated inputs, voters, and outputs, it is possible to reach 95% reliability while accumulating up to 61 bit-flips, and 99% reliability while accumulating up to 17 bit-flips, in the configuration memory bits. These numbers imply a Mean Time Between Failures (MTBF) for the coarse-grain TMR at ground level from 50% to 70% higher than the MTBF of the unhardened version at the same reliability confidence.
André Flores dos Santos, Lucas Antunes Tambara, Fabio Benevenuti, Jorge Tonfat, Fernanda Lima Kastensmidt
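The voters at the heart of any TMR scheme compute a bitwise majority of the three redundant outputs. As a minimal illustration (a generic majority function, not the paper's HLS design), the classic two-level formulation is:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority voter: each output bit is the majority of the
    corresponding bits of the three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)

# A single-module bit upset is out-voted by the two intact copies.
golden = 0b1011_0110
faulty = golden ^ 0b0000_0100      # one flipped bit in module B
assert tmr_vote(golden, faulty, golden) == golden
```

The same expression maps directly to a small LUT network per bit in hardware, which is why voter overhead is modest compared to the triplicated datapaths.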

FPGA-Based Designs

Frontmatter
FPGA Applications in Unmanned Aerial Vehicles - A Review
Abstract
Most existing Unmanned Aerial Vehicles (UAVs), at different scales, use microcontrollers as their processing engine. In this paper, we provide a broad study on how employing Field Programmable Gate Arrays (FPGAs) alters the development of such UAV systems. This work is organized based on application criticality. After surveying recent products, we review significant research concerning the use of FPGAs in high-level control techniques necessary for navigation, such as path planning, Simultaneous Localization and Mapping (SLAM) and stereo vision, as well as safety-critical low-level tasks such as system stability, state estimation and interfacing with peripherals. In addition, we study the use of FPGAs in mission-critical tasks, including target tracking, communications and obstacle avoidance. Throughout, we review other research papers and compare them in terms such as speed and energy consumption.
Mustapha Bouhali, Farid Shamani, Zine Elabadine Dahmane, Abdelkader Belaidi, Jari Nurmi
Genomic Data Clustering on FPGAs for Compression
Abstract
Current sequencing machine technology generates very large and redundant volumes of genomic data for each biological sample. Today, data and associated metadata are formatted in very large text file assemblies called FASTQ, carrying the information of billions of genome fragments referred to as “reads” and composed of strings of nucleotide bases with lengths in the range of a few tens to a few hundred bases. Compressing such data is definitely required in order to manage the sheer amount of data soon to be generated. Doing so implies finding redundant information in the raw sequences. While most of it can be mapped onto the human reference genome and fits well for compression, about 10% of it usually does not map to any reference [1]. For these orphan sequences, finding redundancy will help compression. This requires clustering the reads, a very time-consuming process. Within this context, this paper presents an FPGA implementation of a clustering algorithm for genomic reads, implemented on Pico Computing EX-700 AC-510 hardware, offering more than a \(1000\times \) speed-up over a CPU implementation while reducing power consumption by a factor of 700.
Enrico Petraglio, Rick Wertenbroek, Flavio Capitao, Nicolas Guex, Christian Iseli, Yann Thoma
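To make the clustering task concrete, here is a tiny greedy sketch: each read joins the first cluster whose representative is within a mismatch budget, otherwise it founds a new cluster. The reads, the Hamming-distance criterion and the greedy strategy are hypothetical illustrations, not the paper's algorithm.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, max_dist):
    """Greedy clustering: each read joins the first cluster whose
    representative is within `max_dist` mismatches, else starts one."""
    clusters = []                       # list of (representative, members)
    for r in reads:
        for rep, members in clusters:
            if hamming(rep, r) <= max_dist:
                members.append(r)
                break
        else:
            clusters.append((r, [r]))
    return clusters

reads = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG"]
clusters = cluster_reads(reads, max_dist=1)
assert len(clusters) == 2
```

Even this naive version is quadratic in the number of reads, which hints at why clustering billions of reads motivates a hardware accelerator.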
A Quantitative Analysis of the Memory Architecture of FPGA-SoCs
Abstract
In recent years, so called FPGA-SoCs have been introduced by Intel (formerly Altera) and Xilinx. These devices combine multi-core processors with programmable logic. This paper analyzes the various memory and communication interconnects found in actual devices, particularly the Zynq-7020 and Zynq-7045 from Xilinx and the Cyclone V SE SoC from Intel. Issues such as different access patterns, cache coherence and full-duplex communication are analyzed, for both generic accesses as well as for a real workload from the field of video coding. Furthermore, the paper shows that by carefully choosing the memory interconnect networks as well as the software interface, high-speed memory access can be achieved for various scenarios.
Matthias Göbel, Ahmed Elhossini, Chi Ching Chi, Mauricio Alvarez-Mesa, Ben Juurlink

Neural Networks

Frontmatter
Optimizing CNN-Based Object Detection Algorithms on Embedded FPGA Platforms
Abstract
Algorithms based on Convolutional Neural Network (CNN) have recently been applied to object detection applications, greatly improving their performance. However, many devices intended for these algorithms have limited computation resources and strict power consumption constraints, and are not suitable for algorithms designed for GPU workstations. This paper presents a novel method to optimise CNN-based object detection algorithms targeting embedded FPGA platforms. Given parameterised CNN hardware modules, an optimisation flow takes network architectures and resource constraints as input, and tunes hardware parameters with algorithm-specific information to explore the design space and achieve high performance. The evaluation shows that our design model accuracy is above 85% and, with optimised configuration, our design can achieve 49.6 times speed-up compared with software implementation.
Ruizhe Zhao, Xinyu Niu, Yajie Wu, Wayne Luk, Qiang Liu
An FPGA Realization of a Deep Convolutional Neural Network Using a Threshold Neuron Pruning
Abstract
A pre-trained deep convolutional neural network (CNN) for an embedded system requires high speed and low power consumption. The front part of a CNN consists of convolutional layers, while the back part consists of fully connected layers. In the convolutional layers, the multiply-accumulate operation is the bottleneck, while in the fully connected layers, memory access is the bottleneck. In this paper, we propose a neuron pruning technique which eliminates most of the weight memory. The weight memory can then be realized by on-chip memory on the FPGA, achieving high-speed memory access. We also propose a sequential-input parallel-output fully connected layer circuit. The experimental results showed that, with neuron pruning, the number of neurons in the fully connected layers of the VGG-11 CNN was reduced by 89.3% while keeping 99% of the accuracy. We implemented the fully connected layers on the Digilent Inc. NetFPGA-1G-CML board. Compared with a CPU (ARM Cortex-A15 processor) and a GPU (Jetson TK1 Kepler), the FPGA was 219.0 times faster than the CPU and 12.5 times faster than the GPU in delay time. The performance per power was also 125.28 times better than the CPU and 17.88 times better than the GPU.
Tomoya Fujii, Simpei Sato, Hiroki Nakahara, Masato Motomura
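Threshold-based neuron pruning of the kind described here drops whole neurons (columns of a fully connected weight matrix) whose weights are negligible, shrinking the weight memory. The following is a hedged toy sketch, with an invented 4×4 weight matrix and threshold; the paper's criterion and scale differ.

```python
def prune_neurons(weights, threshold):
    """Drop neurons (columns) of a fully connected weight matrix whose
    summed absolute incoming weight falls below `threshold`.
    Returns the pruned matrix and the indices of the kept neurons."""
    n_neurons = len(weights[0])
    scores = [sum(abs(row[j]) for row in weights) for j in range(n_neurons)]
    keep = [j for j, s in enumerate(scores) if s >= threshold]
    pruned = [[row[j] for j in keep] for row in weights]
    return pruned, keep

# Toy 4-input layer with 4 neurons; neurons 1 and 3 are negligible.
w = [[ 0.9,  0.001, -0.8,  0.002],
     [-0.7,  0.003,  0.6, -0.001],
     [ 0.5, -0.002, -0.9,  0.004],
     [-0.6,  0.001,  0.7, -0.003]]
pruned, kept = prune_neurons(w, threshold=0.1)
assert kept == [0, 2]              # the two weak neurons were removed
```

Pruning neurons (rather than individual weights) keeps the remaining matrix dense, which is what makes the compact on-chip weight memory and the sequential-input circuit practical.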
Accuracy Evaluation of Long Short Term Memory Network Based Language Model with Fixed-Point Arithmetic
Abstract
Long Short Term Memory (LSTM) network based language models are state-of-the-art techniques in the field of natural language processing. Training LSTM networks is computationally intensive, which naturally invites FPGA acceleration, where fixed-point arithmetic is employed. However, previous studies have focused only on accelerators using certain fixed bit-widths, without thorough accuracy evaluation. The main contribution of this paper is to demonstrate, comprehensively and by experimental evaluation, the effect of bit-width on the LSTM-based language model and on the tanh function approximation. Theoretically, a 12-bit number with a 6-bit fractional part is the best choice balancing accuracy and storage savings. To attain performance similar to the software implementation while fitting the bit-widths of FPGA primitives, we further propose a mixed bit-width solution combining 8-bit and 16-bit numbers. With a clear trade-off in accuracy, our results provide a guide to inform design choices on bit-widths when implementing LSTMs in FPGAs. Additionally, our experiments show that, notably, the scale of the LSTM network does not affect the optimal fixed-point configuration, which indicates that our results are applicable to larger models as well.
Ruochun Jin, Jingfei Jiang, Yong Dou
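The 12-bit format with a 6-bit fractional part discussed above can be modeled in software to see its quantization behavior. This is an illustrative sketch of generic two's-complement fixed-point conversion with saturation; the sample value is arbitrary and not from the paper.

```python
def to_fixed(x: float, frac_bits: int, total_bits: int) -> int:
    """Quantize x to a signed fixed-point integer with `frac_bits`
    fractional bits, saturating at the `total_bits` two's-complement
    range."""
    scaled = round(x * (1 << frac_bits))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))

def from_fixed(q: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

# 12-bit format with 6 fractional bits, as studied in the paper.
q = to_fixed(0.8415, 6, 12)
assert abs(from_fixed(q, 6) - 0.8415) <= 2 ** -7   # error <= half an LSB
```

With 6 fractional bits the resolution is \(2^{-6}\) and the representable range is roughly \([-32, 32)\), which bounds both the rounding error and the clipping of large tanh pre-activations.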
FPGA Implementation of a Short Read Mapping Accelerator
Abstract
Recently, due to the drastically reduced cost of sequencing a human DNA molecule, the demand for next generation DNA sequencing (NGS) has increased significantly. DNA sequencers deliver millions of small fragments (short reads) from random positions of a very large DNA stream. To align these short reads such that the original DNA sequence is determined, various software tools called short read mappers, such as BWA (the Burrows-Wheeler Aligner), are available. Analyzing the massive quantities of sequenced data produced with these software tools requires a very long run-time on general-purpose computing systems, due to the great computational power needed. This work proposes methods to accelerate short read alignment, prototyped on an FPGA. We use a seed-and-compare architecture based on the FM-index method. Pre-calculated data are also used for further performance improvement. A multi-core accelerator based on the proposed methods is implemented on a Xilinx Virtex-6. Our design performs alignment of short reads of length 75 with up to two mismatches. The proposed parallel architecture performs short-read mapping up to 41 and 19 times faster than parallel BWA running on an eight-core AMD FX-9590 and a six-core Intel Core i7-5820K CPU using 8 and 12 threads, respectively.
Mostafa Morshedi, Hamid Noori
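The FM-index method referenced in this abstract locates pattern occurrences by "backward search" over the Burrows-Wheeler transform of the reference. A toy software sketch follows; the naive suffix-array construction and the tiny reference string are for illustration only and bear no resemblance to the paper's hardware pipeline.

```python
def bwt_index(text: str):
    """Build a toy FM-index (C array and occurrence table) for `text`."""
    text += "$"                                   # unique sentinel
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)        # char preceding each suffix
    alphabet = sorted(set(text))
    C, total = {}, 0                              # C[c]: #chars smaller than c
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):                  # occ[c][i]: #c in bwt[:i]
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ

def count_matches(text: str, pattern: str) -> int:
    """Count occurrences of `pattern` in `text` via backward search."""
    C, occ = bwt_index(text)
    lo, hi = 0, len(text) + 1                     # full suffix range
    for c in reversed(pattern):                   # extend right-to-left
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

assert count_matches("ACGTACGTAC", "AC") == 3
assert count_matches("ACGTACGTAC", "GTA") == 2
```

Each pattern character costs only two rank lookups, independent of reference length, which is exactly the memory-bound access pattern that an FPGA with pre-calculated tables can pipeline well.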

Languages and Estimation Techniques

Frontmatter
dfesnippets: An Open-Source Library for Dataflow Acceleration on FPGAs
Abstract
Highly tuned FPGA implementations can achieve significant performance and power efficiency gains over general-purpose hardware. However, limited development productivity has prevented mainstream adoption of FPGAs in many areas, such as High Performance Computing. High-level standard development libraries are increasingly adopted to improve productivity. We propose an approach for performance-critical applications that includes standard library modules, benchmarking facilities and application benchmarks to support a variety of use cases. We implement the proposed approach as an open-source library for a commercially available FPGA system and highlight applications and productivity gains.
Paul Grigoras, Pavel Burovskiy, James Arram, Xinyu Niu, Kit Cheung, Junyi Xie, Wayne Luk
A Machine Learning Methodology for Cache Recommendation
Abstract
Cache memories are an important component of modern processors and consume a large percentage of the processor’s power. The quality of service of these cache memories relies heavily on the memory demands of the software, which means that a program might benefit from a cache configuration that is highly inefficient for another program. Moreover, finding the optimal cache configuration for a given program is not a trivial task and usually involves exhaustive simulation. In this paper, we propose a machine learning-based methodology that, given an unknown application as input, outputs a prediction of the optimal cache configuration for that application, regarding energy consumption and performance. We evaluated our methodology using a large benchmark suite, and our results show a 99.8% precision at predicting the optimal cache configuration for a program. Furthermore, further analysis indicates that 85% of the mispredictions produce only up to a 10% increase in energy consumption compared to the optimum.
Osvaldo Navarro, Jones Mori, Javier Hoffmann, Fabian Stuckmann, Michael Hübner
ArPALib: A Big Number Arithmetic Library for Hardware and Software Implementations. A Case Study for the Miller-Rabin Primality Test
Abstract
In this paper, we present the Arbitrary Precision Arithmetic Library (ArPALib), suitable for algorithms that require integer data representation with an arbitrary bit-width (up to 4096 bits in this study). A unique feature of the library is its suitability for synthesis by HLS (High Level Synthesis) tools, while maintaining full compatibility with the C99 standard. To validate the applicability of ArPALib for FPGA-enhanced SoCs, the Miller-Rabin primality test algorithm is considered as a case study. We also provide a performance analysis of our library in software and hardware applications. The presented results show a speedup of 1.5 for the hardware co-processor over its software counterpart when ArPALib is used.
Jan Macheta, Agnieszka Dąbrowska-Boruch, Paweł Russek, Kazimierz Wiatr
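The Miller-Rabin test used as this paper's case study is a standard probabilistic primality test: write \(n-1 = d \cdot 2^r\) with d odd, then check random bases. A compact reference sketch (arbitrary-width integers come for free in Python; the witness count is an arbitrary choice here):

```python
import random

def miller_rabin(n: int, rounds: int = 20) -> bool:
    """Probabilistic Miller-Rabin primality test. Always returns False
    for composites caught by a witness; a True answer is wrong with
    probability at most 4**-rounds."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):        # cheap trial division first
        if n % p == 0:
            return n == p
    d, r = n - 1, 0                       # write n - 1 = d * 2**r, d odd
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)                  # modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                  # a is a witness: n is composite
    return True

assert miller_rabin((1 << 61) - 1)        # Mersenne prime 2^61 - 1
```

The dominant cost is the modular exponentiation `pow(a, d, n)` on full-width operands, which is precisely the big-number arithmetic that ArPALib provides to both the software and the HLS-generated hardware versions.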
Backmatter
Metadata
Title
Applied Reconfigurable Computing
Edited by
Stephan Wong
Antonio Carlos Beck
Koen Bertels
Luigi Carro
Copyright year
2017
Electronic ISBN
978-3-319-56258-2
Print ISBN
978-3-319-56257-5
DOI
https://doi.org/10.1007/978-3-319-56258-2
