2003 | Book

Field Programmable Logic and Application

13th International Conference, FPL 2003, Lisbon, Portugal, September 1-3, 2003 Proceedings

Edited by: Peter Y. K. Cheung, George A. Constantinides

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This book contains the papers presented at the 13th International Workshop on Field Programmable Logic and Applications (FPL) held on September 1–3, 2003. The conference was hosted by the Institute for Systems and Computer Engineering-Research and Development of Lisbon (INESC-ID) and the Department of Electrical and Computer Engineering of the IST-Technical University of Lisbon, Portugal. The FPL series of conferences was founded in 1991 at Oxford University (UK), and has been held annually since: in Oxford (3 times), Vienna, Prague, Darmstadt, London, Tallinn, Glasgow, Villach, Belfast and Montpellier. It brings together academic researchers, industrial experts, users and newcomers in an informal, welcoming atmosphere that encourages productive exchange of ideas and knowledge between delegates. Exciting advances in field programmable logic show no sign of slowing down. New grounds have been broken in architectures, design techniques, run-time reconfiguration, and applications of field programmable devices in several different areas. Many of these innovations are reported in this volume. The size of FPL conferences has grown significantly over the years. FPL in 2002 saw 214 papers submitted, representing an increase of 83% when compared to the year before. The interest and support for FPL in the programmable logic community continued this year with 216 papers submitted. The technical program was assembled from 90 selected regular papers and 56 posters, resulting in this volume of proceedings. The program also included three invited plenary keynote presentations from LSI Logic, Xilinx and Cadence, and three industrial tutorials from Altera, Mentor Graphics and Dafca.

Table of contents

Frontmatter

Technologies and Trends

Reconfigurable Circuits Using Hybrid Hall Effect Devices

Hybrid Hall effect (HHE) devices are a new class of reconfigurable logic devices that incorporate ferromagnetic elements to deliver non-volatile operation. A single HHE device may be configured on a cycle-by-cycle basis to perform any of four different logical computations (OR, AND, NOR, NAND), and will retain its state indefinitely, even if the power supply is removed from the device. In this paper, we introduce the HHE device and describe a number of reconfigurable circuits based on HHE devices, including reconfigurable logic gates and non-volatile table lookup cells.

Steve Ferrera, Nicholas P. Carter
Gigahertz FPGA by SiGe BiCMOS Technology for Low Power, High Speed Computing with 3-D Memory

This paper presents an improved Xilinx XC6200 FPGA using IBM SiGe BiCMOS technology. The basic cell performance is greatly enhanced by eliminating redundant signal multiplexing procedures. The simulated combinational logic result has a 30% shorter gate delay than the previous design. By adjusting and properly shutting down the CML current, this design can be used in lower-power consumption circuits. The total saved power is 50% of the first SiGe FPGA developed in the same group. Lastly, the FPGA with a 3-D stacked memory concept is described to further reduce the influence of parasitics generated by the memory banks. The circuit area is also reduced to make dense integrated circuits possible.

Chao You, Jong-Ru Guo, Russell P. Kraft, Michael Chu, Robert Heikaus, Okan Erdogan, Peter Curran, Bryan Goda, Kuan Zhou, John F. McDonald

Communications Applications

Implementing an OFDM Receiver on the RaPiD Reconfigurable Architecture

Reconfigurable architectures have been touted as an alternative to ASICs and DSPs for applications that require a combination of high performance and flexibility. However, the use of fine-grained FPGA architectures in embedded platforms is hampered by their very large overhead. This overhead can be reduced substantially by taking advantage of an application domain to specialize the reconfigurable architecture using coarse-grained components and interconnects. This paper describes the design and implementation of an OFDM Receiver using the RaPiD architecture and RaPiD-C programming language. We show a factor of about 6x increase in cost-performance over a DSP implementation and 15x over an FPGA implementation.

Carl Ebeling, Chris Fisher, Guanbin Xing, Manyuan Shen, Hui Liu
Symbol Timing Synchronization in FPGA-Based Software Radios: Application to DVB-S

The design of all-digital symbol timing synchronizers for FPGAs is a complex task. There are several architectures available for VLSI wireless transceivers but porting them to a software defined radio (SDR) platform is not straightforward. In this paper we report a receiver architecture prepared to support demanding protocols such as satellite digital video broadcast (DVB-S). In addition, we report hardware implementation and area utilization estimation. Finally we present implementation results of a DVB-S digital receiver on a Virtex-II Pro FPGA.

Francisco Cardells-Tormo, Javier Valls-Coquillat, Vicenc Almenar-Terre

High Level Design Tools 1

An Algorithm Designer’s Workbench for Platform FPGAs

Growing gate density, availability of embedded multipliers and memory, and integration of traditional processors are some of the key advantages of Platform FPGAs. Such FPGAs are attractive for implementing compute intensive signal processing kernels used in wired as well as wireless mobile devices. However, algorithm design using Platform FPGAs, with energy dissipation as an additional performance metric for mobile devices, poses significant challenges. In this paper, we propose an algorithm designer’s workbench that addresses the above issues. The workbench supports formal modeling of the signal processing kernels, evaluation of latency, energy, and area of a design, and performance tradeoff analysis to facilitate optimization. The workbench includes a high-level estimator for rapid performance estimation and widely used low-level simulators for detailed simulation. Features include a confidence interval based technique for accurate power estimation and facility to store algorithm designs as library of models for reuse. We demonstrate the use of the workbench through design of matrix multiplication algorithm for Xilinx Virtex-II Pro.

Sumit Mohanty, Viktor K. Prasanna
Prototyping for the Concurrent Development of an IEEE 802.11 Wireless LAN Chipset

This paper describes how an FPGA based prototype environment aided the development of two multi-million gate ASICs: an IEEE 802.11 medium access controller and an IEEE 802.11a/b/g physical layer processor. Prototyping the ASICs on a reconfigurable platform enabled concurrent development by the hardware and software teams, and provided a high degree of confidence in the designs. The capabilities of modern FPGAs and their development tools allowed us to easily and quickly retarget the complex ASICs into FPGAs, enabling us to integrate the prototyping effort into our design flow from the start of the project. The effect was to accelerate the development cycle and generate an ASIC which had been through one pass of beta testing before tape-out.

Ludovico de Souza, Philip Ryan, Jason Crawford, Kevin Wong, Greg Zyner, Tom McDermott

Reconfigurable Architectures

ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix

Coarse-grained reconfigurable architectures have advantages over traditional FPGAs in terms of delay, area and configuration time. To execute entire applications, most of them combine an instruction set processor (ISP) and a reconfigurable matrix. However, not much attention is paid to the integration of these two parts, which results in high communication overhead and programming difficulty. To address this problem, we propose a novel architecture with a tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix. The advantages include a simplified programming model, shared resource costs, and reduced communication overhead. To exploit this architecture, our previously developed compiler framework is adapted to the new architecture. The results show that the new architecture has good performance and is very compiler-friendly.

Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins
Inter-processor Connection Reconfiguration Based on Dynamic Look-Ahead Control of Multiple Crossbar Switches

A parallel system architecture for program execution based on the look-ahead dynamic reconfiguration of inter-processor connections is discussed in the paper. The architecture is based on inter-processor connection reconfiguration in multiple crossbar switches that are used for parallel program execution. Programs are structured into sections that use fixed inter-processor connections for communication. The look-ahead dynamic reconfiguration assumes that while some inter-processor connections in crossbar switches are used for current section execution, other connections are configured in advance for execution of further sections. Programs have to be decomposed into sections for given time parameters of reconfiguration control, so as to avoid program execution delays due to connection reconfiguration. Automatic program structuring is proposed based on the analysis of parallel program graphs. The structuring algorithm finds the partition into sections that minimizes the execution time of a program executed with the look-ahead created connections. The program execution time is evaluated by simulated program graph execution with reconfiguration control modeled as an extension of the basic program graph.

Eryk Laskowski, Marek Tudruj
Arbitrating Instructions in an ρμ-Coded CCM

In this paper, the design aspects of instruction arbitration in an ρμ-coded CCM are discussed. Software considerations, architectural solutions, implementation issues and functional testing of an ρμ-code arbiter are presented. A complete design of such an arbiter is proposed and its VHDL code is synthesized for the Virtex-II Pro platform FPGA of Xilinx. The functionality of the unit is verified by simulations. A very low utilization of available reconfigurable resources is achieved after the design is synthesized. Simulations of an MPEG-4 case study suggest considerable performance speed-up in the range of 2.4 to 8.8 versus a pure software PowerPC implementation.

Georgi Kuzmanov, Stamatis Vassiliadis

Cryptographic Applications 1

How Secure Are FPGAs in Cryptographic Applications?

The use of FPGAs for cryptographic applications is highly attractive for a variety of reasons but at the same time there are many open issues related to the general security of FPGAs. This contribution attempts to provide a state-of-the-art description of this topic. First, the advantages of reconfigurable hardware for cryptographic applications are listed. Second, potential security problems of FPGAs are described in detail, followed by a proposal of some countermeasures. Third, a list of open research problems is provided. Even though there have been many contributions dealing with the algorithmic aspects of cryptographic schemes implemented on FPGAs, this contribution appears to be the first comprehensive treatment of system and security aspects.

Thomas Wollinger, Christof Paar
FPGA Implementations of the RC6 Block Cipher

RC6 is a symmetric-key algorithm which encrypts 128-bit plaintext blocks to 128-bit ciphertext blocks. The encryption process involves four operations: integer addition modulo 2^w, bitwise exclusive or of two w-bit words, rotation to the left, and computation of f(X) = (X(2X+1)) mod 2^w, which is the critical arithmetic operation of this block cipher. In this paper, we investigate and compare four implementations of the f(X) operator on Virtex-E and Virtex-II devices. Our experiments show that the choice of an algorithm is strongly related to the target FPGA family. We also describe several architectures of an RC6 processor designed for feedback or non-feedback chaining modes. Our fastest implementation achieves a throughput of 15.2 Gb/s on a Xilinx XC2V3000-6 device.

Jean-Luc Beuchat
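
As a software reference point for the RC6 entry above, the f(X) operator and the left rotation can be sketched in a few lines. This is only an illustrative model, assuming the common word size w = 32; the function names are hypothetical and it says nothing about the four hardware implementations the paper compares.

```python
def rotl(x: int, r: int, w: int = 32) -> int:
    """Rotate the w-bit word x left by r bit positions."""
    mask = (1 << w) - 1
    x &= mask
    r %= w
    return ((x << r) | (x >> (w - r))) & mask


def rc6_f(x: int, w: int = 32) -> int:
    """Quadratic function f(X) = X * (2X + 1) mod 2^w."""
    mask = (1 << w) - 1
    x &= mask
    # Python integers are unbounded, so the reduction mod 2^w is a mask.
    return (x * (2 * x + 1)) & mask


if __name__ == "__main__":
    for value in (1, 2, 0xFFFFFFFF):
        print(hex(value), "->", hex(rc6_f(value)))
```

In hardware the cost of f(X) is dominated by the w-bit multiplication, which is why the abstract calls it the critical arithmetic operation.
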
Very High Speed 17 Gbps SHACAL Encryption Architecture

Very high speed and low area hardware architectures of the SHACAL-1 encryption algorithm are presented in this paper. The SHACAL algorithm was a submission to the New European Schemes for Signatures, Integrity and Encryption (NESSIE) project and it is based on the SHA-1 hash algorithm. To date, there have been no performance metrics published on hardware implementations of this algorithm. A fully pipelined SHACAL-1 encryption architecture is described in this paper and when implemented on a Virtex-II X2V4000 FPGA device, it runs at a throughput of 17 Gbps. A fully pipelined decryption architecture achieves a speed of 13 Gbps when implemented on the same device. In addition, iterative architectures of the algorithm are presented. The SHACAL-1 decryption algorithm is derived and also presented in this paper, since it was not provided in the submission to NESSIE.

Máire McLoone, J. V. McCanny

Place and Route Tools

Track Placement: Orchestrating Routing Structures to Maximize Routability

The design of a routing channel for an FPGA is a complex process requiring a careful balance of flexibility with silicon efficiency. With a growing move towards embedding FPGAs into SoC designs, and the new opportunity to automatically generate FPGA architectures, this problem is even more critical. The design of a routing channel requires determining the number of routing tracks, the length of the wires in those tracks, and the positioning of the breaks between wires on the tracks. This paper focuses on the last problem, the placement of breaks in tracks to maximize overall flexibility. Our optimal algorithm for track placement finds a best solution provided the problem meets a number of restrictions. Our relaxed algorithm is without restrictions, and finds solutions on average within 1.13% of optimal.

Katherine Compton, Scott Hauck
Quark Routing

With inherent problem complexity, ever increasing instance size and ever decreasing layout area, there is a need in physical design for improved heuristics and algorithms. In this investigation, we present a novel routing methodology based on the mechanics of auctions. We demonstrate its efficacy by exhibiting the superior results of our auction-based FPGA router QUARK on the standard benchmark suite.

Sean T. McCulloch, James P. Cohoon
Global Routing for Lookup-Table Based FPGAs Using Genetic Algorithms

In this paper we present experiments concerning the feasibility of using genetic algorithms to efficiently build the global routing in lookup-table based FPGAs. The algorithm is divided into two steps: first, a set of viable routing alternatives is pre-computed for each net, and then the genetic algorithm selects, for each net, the alternative that yields the best overall global routing. Our results are comparable to other available global routers, so we conclude that genetic algorithms can be used to build competitive global routing tools.

Jorge Barreiros, Ernesto Costa

Multi-context FPGAs

Virtualizing Hardware with Multi-context Reconfigurable Arrays

In contrast to processors, current reconfigurable devices totally lack programming models that would allow for device independent compilation and forward compatibility. The key to overcoming these limitations is hardware virtualization. In this paper, we resort to a macro-pipelined execution model to achieve hardware virtualization for data streaming applications. As a hardware implementation we present a hybrid multi-context architecture that attaches a coarse-grained reconfigurable array to a host CPU. A co-simulation framework enables cycle-accurate simulation of the complete architecture. As a case study we map an FIR filter to our virtualized hardware model and evaluate different designs. We discuss the impact of the number of contexts and the feature of context state on the speedup and the CPU load.

Rolf Enzler, Christian Plessl, Marco Platzner
A Dynamically Adaptive Switching Fabric on a Multicontext Reconfigurable Device

A framework for dynamically adaptive hardware mechanisms on multicontext reconfigurable devices is proposed, and as an example, an adaptive switching fabric is implemented on NEC’s novel reconfigurable device DRP (Dynamically Reconfigurable Processor). In this switch, contexts for the full crossbar and alternative hardware modules, which provide larger bandwidth but can treat only a limited pattern of packet inputs, are prepared. Using the quick context switching functionality, the context for the full crossbar is replaced by alternative contexts according to the packet input pattern. Furthermore, if the traffic includes a lot of packets for specific destinations, a set of contexts frequently used in the traffic is gathered inside the chip like a working set stored in a cache. A 4 x 4 mesh network connected with the proposed adaptive switches is simulated, and it appears that the latency between nodes is improved three times when the traffic between neighboring four nodes is dominant.

Hideharu Amano, Akiya Jouraku, Kenichiro Anjo
Reducing the Configuration Loading Time of a Coarse Grain Multicontext Reconfigurable Device

High speed and low cost configuration loading methods for a coarse grain multicontext reconfigurable device DRP (Dynamically Reconfigurable Processor) are proposed and implemented. In these methods, the configuration data is compressed on the host computer before loading, and decoded at the time of loading by circuits implemented on part of the logic. Unlike in conventional reconfigurable devices, the logic for decoder circuits is switched with application circuits immediately after loading in multicontext reconfigurable devices. Thus, the decoder circuit does not occupy chip real estate during execution. Two compression methods, LZSS-ARC and Selective coding, are implemented and evaluated. LZSS-ARC achieves a better compression ratio, while Selective coding can work at the same frequency as the data loading.

Toshiro Kitaoka, Hideharu Amano, Kenichiro Anjo

Cryptographic Applications 2

Design Strategies and Modified Descriptions to Optimize Cipher FPGA Implementations: Fast and Compact Results for DES and Triple-DES

In this paper, we propose a new mathematical DES description that allows us to achieve optimized implementations in terms of the Throughput/Area ratio. First, we get an unrolled DES implementation that works at data rates of 21.3 Gbps (333 MHz), using Virtex-II technology. In this design, the plaintext, the key and the mode (encryption/decryption) can be changed on a cycle-by-cycle basis with no dead cycles. In addition, we also propose sequential DES and triple-DES designs that are currently the most efficient ones in terms of resources used as well as in terms of throughput. Based on our DES and triple-DES results, we also set up conclusions for optimized FPGA design choices and possible improvement of cipher implementations with a modified structure description.

Gaël Rouvroy, François-Xavier Standaert, Jean-Jacques Quisquater, Jean-Didier Legat
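
The two figures quoted in the DES entry above (21.3 Gbps at 333 MHz) are consistent with a fully unrolled pipeline that accepts one 64-bit DES block per clock cycle. The short check below makes that arithmetic explicit; the one-block-per-cycle rate is an assumption inferred from the "no dead cycles" remark, and the helper name is hypothetical.

```python
def pipelined_throughput_gbps(block_bits: int, clock_mhz: float,
                              blocks_per_cycle: float = 1.0) -> float:
    """Steady-state throughput of a pipeline, in Gbit/s.

    bits per block * blocks per cycle * cycles per second, with the clock
    given in MHz (1e6 cycles/s) and the result scaled to 1e9 bit/s.
    """
    return block_bits * blocks_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    # DES operates on 64-bit blocks; 333 MHz is the clock quoted in the abstract.
    print(pipelined_throughput_gbps(64, 333.0))  # about 21.3
```
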
Using Partial Reconfiguration in Cryptographic Applications: An Implementation of the IDEA Algorithm

This paper shows that partial reconfiguration can notably improve the area and throughput of symmetric cryptographic algorithms implemented in FPGAs. In most applications the keys are fixed during a cipher session, so that several blocks, like modular adders or multipliers, can be replaced by their constant-operand equivalents. These counterparts are not only faster, but also use significantly fewer resources. In this approach, the changes in the key are performed through a partial reconfiguration that modifies the constants. The International Data Encryption Algorithm (IDEA) has been selected as a case-study, and JBits has been chosen as the tool for performing the partial reconfiguration. The implementation occupies 87% of a Virtex XCV600 and achieves a throughput of 8.3 GBits/sec.

Ivan Gonzalez, Sergio Lopez-Buedo, Francisco J. Gomez, Javier Martinez
An Implementation Comparison of an IDEA Encryption Cryptosystem on Two General-Purpose Reconfigurable Computers

The combination of traditional microprocessors and Field Programmable Gate Arrays (FPGAs) is developing as a future platform for computationally intensive computing, combining the best aspects of traditional microprocessor front-end development with the reconfigurability of FPGAs for computation-intensive problems. Several prototype PC-FPGA machines have demonstrated significant speedups compared to standalone PC workstations for computationally intensive problems. Cryptographic applications are a clear candidate for this type of platform, due to their computational intensity and long operand lengths. In this paper, we demonstrate an efficient implementation of IDEA encryption, using two of the leading reconfigurable computers available, SRC Computers’ SRC-6E and Star Bridge Systems’ HC-36. We compare the hardware architecture and programming model of these reconfigurable computers, and the implementation of a common IDEA encryption architecture in both platforms. Detailed analyses of FPGA resource utilization for both systems, data transfer and reconfiguration overheads for the SRC system, and a comparison between SRC and a public domain software implementation are given in the paper.

Allen Michalski, Kris Gaj, Tarek El-Ghazawi

Low-Power Issues 1

Data Processing System with Self-reconfigurable Architecture, for Low Cost, Low Power Applications

In this paper, a low cost self-reconfigurable data processing system with a USB interface is presented. A single FPGA performs all processing and controls the multiple configurations without any additional elements, such as a microprocessor, host computer or additional FPGAs. This architecture allows high performance at very low power consumption. In addition, a hierarchical reconfiguration system is used to support a large number of different processing tasks without the penalty in power consumption of a big local configuration memory. Due to its simplicity and low power, this data processing system is especially suitable for portable applications.

Michael G. Lorenz, Luis Mengibar, Luis Entrena, Raul Sánchez-Reillo
Low Power Coarse-Grained Reconfigurable Instruction Set Processor

Current embedded multimedia applications have stringent time and power constraints. Coarse-grained reconfigurable processors have been shown to achieve the required performance. However, there is not much research regarding the power consumption of such processors. In this paper, we present a novel coarse-grained reconfigurable processor and study its power consumption using a power model derived from Wattch. Several processor configurations are evaluated using a set of multimedia applications. Results show that the presented coarse-grained processor can achieve on average 2.5x the performance of a RISC processor with an 18% increase in energy consumption.

Francisco Barat, Murali Jayapala, Tom Vander Aa, Rudy Lauwereins, Geert Deconinck, Henk Corporaal
Encoded-Low Swing Technique for Ultra Low Power Interconnect

We present a novel encoded-low swing technique for ultra low power interconnect. Using this technique and an efficient circuit implementation, we achieve an average of 45.7% improvement in the power-delay product over the schemes utilizing low swing techniques alone, for random bit streams. Also, we obtain an average of 75.8% improvement over the schemes using low power bus encoding alone. We present extensive simulation results, including the driver and receiver circuitry, over a range of capacitive loads, for a general test interconnect circuit and also for an FPGA test interconnect circuit. Analysis of the results proves that as the capacitive load over the interconnect increases, the power-delay product for the proposed technique outperforms the techniques based on either low swing or bus encoding. We also present the signal to noise ratio (SNR) analysis using this technique for a CMOS 0.13μm process and prove that there is an 8.8% improvement in the worst case SNR compared to low swing techniques. This is a consequence of the reduction in the signal switching over the interconnect which leads to lower power supply noise.

Rohini Krishnan, Jose Pineda de Gyvez, Harry J. M. Veendrick

Run-Time Configurations

Building Run-Time Reconfigurable Systems from Tiles

This paper describes a component-based methodology tailored to the design of reconfigurable systems. Systems are constructed from tiles: localised, self contained blocks of reconfigurable logic which adhere to a specified interface. We present a state-based model for managing a hierarchical structure of tiles in a reconfigurable system and show how our approach allows automatic garbage collection techniques to be applied for reclaiming unused FPGA resources.

Gareth Lee, George Milne
Exploiting Redundancy to Speedup Reconfiguration of an FPGA

Reconfigurable logic promises a flexible computing fabric well suited to the low cost, low power, high performance and fast time to market demanded of today’s computing devices. This paper presents an analysis of what exactly occurs when a fine grain FPGA, specifically the Xilinx Virtex, is reconfigured, and proposes a tailorable approach to configuration architecture design trading off silicon area with reconfiguration time. It is shown that less than 3% of the bits contained in a typical Virtex reconfiguration bitstream are different to those already in the configuration memory, and a highly parallelisable compression technique is presented which achieves highly competitive results – 80% compression and better.

Irwin Kennedy
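
The observation in the entry above, that only a small fraction of bits changes between the resident configuration and a new bitstream, can be illustrated with a simple byte-level delta. This is a minimal sketch of the general idea only, not the compression technique proposed in the paper; the function names and the (offset, XOR mask) delta format are assumptions made for illustration.

```python
from typing import List, Tuple


def differing_bit_fraction(old: bytes, new: bytes) -> float:
    """Fraction of bit positions in which two equal-length bitstreams differ."""
    if len(old) != len(new):
        raise ValueError("bitstreams must have the same length")
    if not old:
        return 0.0
    diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(old, new))
    return diff_bits / (8 * len(old))


def sparse_delta(old: bytes, new: bytes) -> List[Tuple[int, int]]:
    """List (byte offset, XOR mask) for every byte that changes."""
    return [(i, a ^ b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


def apply_delta(old: bytes, delta: List[Tuple[int, int]]) -> bytes:
    """Rebuild the new bitstream from the resident one plus the delta."""
    out = bytearray(old)
    for offset, mask in delta:
        out[offset] ^= mask
    return bytes(out)
```

When few bits differ, the delta list is far shorter than the full bitstream, which is the redundancy the title refers to.
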
Run-Time Exchange of Mechatronic Controllers Using Partial Hardware Reconfiguration

We present an efficient technique to implement multi-controller systems using partially reconfigurable hardware (FPGA). The control algorithm is implemented as a dedicated circuit. Partial runtime reconfiguration is used to increase the resource efficiency by keeping just the currently active controller modules on the FPGA while inactive controller modules are stored in an external memory.

Klaus Danne, Christophe Bobda, Heiko Kalte

Cryptographic Applications 3

Efficient Modular-Pipelined AES Implementation in Counter Mode on ALTERA FPGA

This paper describes a high performance single-chip FPGA implementation of the new Advanced Encryption Standard (AES) algorithm dealing with 128-bit data/key blocks and operating in Counter (CTR) mode. Counter mode has proven tight security and enables the simultaneous processing of multiple blocks without losing the advantages of feedback modes. It also allows similar hardware to be used for both encryption and decryption. The proposed architecture is modular. The basic module of the architecture implements a single round of the algorithm with the required expansion hardware and control signals. This gives very high flexibility in choosing the degree of pipelining according to the throughput requirements and hardware limitations, and makes it possible to reach the best compromise between them. The FPGA implementation presented is that of a pipelined single chip Rijndael design which runs at a rate of 10.8 Gbits/sec for full pipelining on an ALTERA APEX-EP20KE platform.

François Charot, Eslam Yahya, Charles Wagner
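
The two properties of counter mode highlighted in the entry above (blocks can be processed independently, and the same datapath serves encryption and decryption) can be shown with a short sketch. Note the loud assumption: `toy_block_function` is a SHA-256-based stand-in used only to keep the example self-contained. It is NOT AES and not secure, and the 8-byte nonce plus 8-byte counter layout is likewise just an illustrative choice.

```python
import hashlib

BLOCK_BYTES = 16  # AES block size: 128 bits


def toy_block_function(key: bytes, block: bytes) -> bytes:
    """Stand-in for the AES forward cipher (assumption: NOT real AES)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def ctr_process(key: bytes, nonce: bytes, data: bytes,
                first_counter: int = 0) -> bytes:
    """Encrypt or decrypt data in counter mode (the operation is identical)."""
    out = bytearray()
    for i in range(0, len(data), BLOCK_BYTES):
        counter = first_counter + i // BLOCK_BYTES
        counter_block = nonce + counter.to_bytes(8, "big")
        # Each keystream block depends only on the counter value,
        # so all blocks could be computed in parallel or in a pipeline.
        keystream = toy_block_function(key, counter_block)
        chunk = data[i:i + BLOCK_BYTES]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)
```

Applying `ctr_process` twice with the same key and nonce returns the original data, and any block can be processed on its own by passing its index as `first_counter`.
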
An FPGA-Based Performance Analysis of the Unrolling, Tiling, and Pipelining of the AES Algorithm

In October 2000 the National Institute of Standards and Technology chose the Rijndael algorithm as the new Advanced Encryption Standard (AES). AES finds wide deployment in a huge variety of products, making efficient implementations a significant priority. In this paper we address the design and the FPGA implementation of a fully key agile AES encryption core with 128-bit keys. We discuss the effectiveness of several design techniques, such as accurate floorplanning, the unrolling, tiling and pipelining transformations (also in the case of feedback modes of operation) to explore the design space. Using these techniques, four architectures with different levels of parallelism, trading off area for performance, are described and their implementations on a Virtex-E FPGA part are presented. The proposed implementations of AES achieve better performance as compared to other blocks in the literature and commercial IP core on the same device.

G. P. Saggese, A. Mazzeo, N. Mazzocca, A. G. M. Strollo
Two Approaches for a Single-Chip FPGA Implementation of an Encryptor/Decryptor AES Core

In this paper we present a single-chip FPGA full encryptor/decryptor core design of the AES algorithm. Our design performs all three processes: encryption, decryption and key scheduling. High performance timing figures are obtained through the use of a pipelined architecture. Moreover, several modifications to the conventional AES algorithm’s formulations have been introduced, thus allowing us to obtain a significant reduction in the total number of computations and the path delay associated with them. In particular, for the implementation of the most costly step of AES, the multiplicative inverse in GF(2^8), two approaches were considered. The first approach uses pre-computed values stored in a lookup table, giving fast execution times of the algorithm at the price of memory requirements. Our second approach computes the multiplicative inverse by using composite field techniques, yielding a reduction in the memory requirements at the cost of an increase in the execution time. The obtained results indicate that both designs are competitive with the fastest complete AES single-chip FPGA core implementations reported to date. Our first approach requires up to 11.8% fewer CLB slices and 21.5% fewer BRAMs, and yields up to 18.5% higher throughput than the fastest comparable implementation reported in the literature.

Nazar A. Saqib, Francisco Rodríguez-Henríquez, Arturo Díaz-Pérez
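
The memory-versus-computation trade-off described in the entry above can be illustrated in software. The sketch below computes the multiplicative inverse in GF(2^8) with the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 and then precomputes a 256-entry table from it. As a stated assumption, the computed path uses plain exponentiation (a^254) for simplicity, not the composite field technique of the paper; the function and table names are hypothetical.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY        # reduce when the degree reaches 8
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Inverse via a^254 (a^255 = 1 for a != 0); 0 maps to 0 as in AES."""
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result


# Lookup-table variant: 256 bytes of memory replace the computation.
INV_TABLE = [gf_inv(a) for a in range(256)]
```

The table lookup is a single memory access, whereas `gf_inv` performs a chain of multiplications; this mirrors the speed-versus-memory trade-off the abstract reports between its two designs.
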

Compilation Tools

Performance and Area Modeling of Complete FPGA Designs in the Presence of Loop Transformations

Selecting which program transformations to apply when mapping computations to FPGA-based architectures leads to prohibitively long design exploration cycles. An alternative is to develop fast, yet accurate, performance and area models to understand the impact and interaction of the transformations. In this paper we present a combined analytical performance and area modeling for complete FPGA designs in the presence of loop transformations. Our approach takes into account the impact of input/output memory bandwidth and memory interface resources, often the limiting factor in the effective implementation of these computations. Our preliminary results reveal that our modeling is very accurate allowing a compiler tool to quickly explore a very large design space resulting in the selection of a feasible high-performance design.

K. R. Shesha Shayee, Joonseok Park, Pedro C. Diniz
Branch Optimisation Techniques for Hardware Compilation

This paper explores using information about program branch probabilities to optimise reconfigurable designs. The basic premise is to promote utilization by dedicating more resources to branches which execute more frequently. A hardware compilation system has been developed for producing designs which are optimised for different branch probabilities. We propose an analytical queueing network performance model to determine the best design from observed branch probability information. The branch optimisation space is characterized in an experimental study for Xilinx Virtex FPGAs of two complex applications: video feature extraction and progressive refinement radiosity. For designs of equal performance, branch-optimised designs require 24% and 27.5% less area. For designs of equal area, branch-optimised designs run up to 3 times faster. Our analytical performance model is shown to be highly accurate with relative error between 0.12 and 1.1 x 10^-4.

Henry Styles, Wayne Luk
A Model for Hardware Realization of Kernel Loops

Hardware realization of kernel loops holds the promise of accelerating the overall application performance and is therefore an important part of the synthesis process. In this paper, we consider two important loop optimization techniques, namely loop unrolling and software pipelining that can impact the performance and cost of the synthesized hardware. We propose a novel model that accounts for various characteristics of a loop, including dependencies, parallelism and resource requirement, as well as certain high level constraints of the implementation platform. Using this model, we are able to deduce the optimal unroll factor and technique for achieving the best performance given a fixed resource budget. The model was verified using a compiler-based FPGA synthesis framework on a number of kernel loops. We believe that our model is general and applicable to other synthesis frameworks, and will help reduce the time for design space exploration.

Jirong Liao, Weng-Fai Wong, Tulika Mitra

Asynchronous Techniques

Programmable Asynchronous Pipeline Arrays

We discuss high-performance programmable asynchronous pipeline arrays (PAPAs). These pipeline arrays are coarse-grain field programmable gate arrays (FPGAs) that realize high data throughput with fine-grain pipelined asynchronous circuits. We show how the PAPA architecture maintains most of the speed and energy benefits of a custom asynchronous design, while also providing post-fabrication logic reconfigurability. We report results for a prototype PAPA design in a 0.25μm CMOS process that has a peak pipeline throughput of 395MHz for asynchronous logic.

John Teifel, Rajit Manohar
Globally Asynchronous Locally Synchronous FPGA Architectures

Globally Asynchronous Locally Synchronous (GALS) Systems have provoked renewed interest over recent years as they have the potential to combine the benefits of asynchronous and synchronous design paradigms. GALS techniques have been applied to ASICs, but not yet to FPGAs. In this paper we propose applying GALS techniques to FPGAs in order to overcome the limitation on timing imposed by slow routing.

Andrew Royal, Peter Y. K. Cheung

Biology-Related Applications

Case Study of a Functional Genomics Application for an FPGA-Based Coprocessor

Although microarrays are already having a tremendous impact on biomedical science, they still present great computational challenges. We examine a particular problem involving the computation of linear regressions on a large number of vector combinations in a high-dimensional parameter space, a problem that was found to be virtually intractable on a PC cluster. We observe that characteristics of this problem map particularly well to FPGAs and confirm this with an implementation that results in a 1000-fold speed-up over a serial implementation. Other contributions involve the data routing structure, the analysis of bit-width allocation, and the handling of missing data. Since this problem is representative of many in functional genomics, part of the overall significance of this work is that it points to a potential new area of applicability for FPGA coprocessors.

Tom Van Court, Martin C. Herbordt, Richard J. Barton
A Smith-Waterman Systolic Cell

With an aim to understand the information encoded by DNA sequences, databases containing large amounts of DNA sequence information are frequently compared and searched for matching or near-matching patterns. This kind of similarity calculation is known as sequence alignment. To date, the most popular algorithms for this operation are heuristic approaches such as BLAST and FASTA which give high speed but low sensitivity, i.e. significant matches may be missed by the searches. Another algorithm, the Smith-Waterman algorithm, is more computationally expensive but achieves higher sensitivity. In this paper, an improved systolic processing element cell for implementing the Smith-Waterman algorithm on a Xilinx Virtex FPGA is presented.

C. W. Yu, K. H. Kwong, K. H. Lee, P. H. W. Leong
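
The Smith-Waterman algorithm named in the abstract above fills a score matrix in which each cell depends only on its left, upper and upper-left neighbours, which is what makes it a natural fit for a systolic array. The following is a minimal software sketch of that recurrence, not the cell design from the paper. The function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, and a simple linear gap penalty is assumed.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous matrix row
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Diagonal move: align a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would require the full matrix or a traceback structure.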

Codesign

Software Decelerators

This paper introduces the notion of a software decelerator, to be used in logic-centric system architectures. Functions are offloaded from logic to a processor, accepting a speed penalty in order to derive overall system benefits in terms of improved resource use (e.g. reduced area or lower power consumption) and/or a more efficient design process. The background rationale for such a strategy is the increasing availability of embedded processors ’for free’ in Platform FPGAs. A detailed case study of the concept is presented, involving the provision of a high-level technology-independent design methodology based upon a finite state machine model. This illustrates easier design and saving of logic resource, with timing performance still meeting necessary requirements.

Eric Keller, Gordon Brebner, Phil James-Roxby
A Unified Codesign Run-Time Environment for the UltraSONIC Reconfigurable Computer

This paper presents a codesign environment for the UltraSONIC reconfigurable computing platform which is designed specifically for real-time video applications. A codesign environment with automatic partitioning and scheduling between a host microprocessor and a number of reconfigurable processors is described. A unified runtime environment for both hardware and software tasks under the control of a task manager is proposed. The practicality of our system is demonstrated with an FFT application.

Theerayod Wiangtong, Peter Y. K. Cheung, Wayne Luk

Reconfigurable Fabrics

Extra-dimensional Island-Style FPGAs

This paper proposes modifications to standard island-style FPGAs that provide interconnect capable of scaling at the same rate as typical netlists, unlike traditionally tiled FPGAs. The proposal uses logical third and fourth dimensions to create increasing wire density for increasing logic capacity. The additional dimensions are mapped to standard two-dimensional silicon. This innovation will increase the longevity of a given cell architecture, and reduce the cost of hardware, CAD tool and Intellectual Property (IP) redesign. In addition, extra-dimensional FPGA architectures provide a conceptual unification of standard FPGAs and time-multiplexed FPGAs.

Herman Schmit
Using Multiplexers for Control and Data in D-Fabrix

This paper describes the use of dynamically controlled multiplexers in the Elixent D-Fabrix Reconfigurable Algorithm Processor (RAP) for both datapath functions and to implement simple logic functions for control circuits.

Tony Stansfield
Heterogeneous Logic Block Architectures for Via-Patterned Programmable Fabrics

ASIC designs are becoming increasingly unaffordable due to rapidly increasing mask costs, greater manufacturing complexity, and the need for several re-spins to meet design constraints. Although FPGAs solve the NRE cost problem, they often fail to achieve the required performance and density. A Via-Patterned Gate Array (VPGA) that combines the regularity and design cost amortization benefits of FPGAs with silicon area and power consumption comparable to ASICs, was presented in [1]. The VPGA fabric consists of a regular interconnect architecture laid on top of an array of patternable logic blocks (PLBs). Customization of the logic and interconnect is done by the placement or removal of vias at a subset of the potential via locations. In this paper, we propose four heterogeneous PLBs for via-patterned fabrics and explore their performance, density and fabric utilization characteristics across several applications. Although this analysis is done in the context of the VPGA fabric, the proposed heterogeneous PLBs and the experimental methodology can be employed for any embedded programmable fabric.

Aneesh Koorapaty, Lawrence Pileggi, Herman Schmit

Image Processing Applications

A Real-Time Visualization System for PIV

Particle image velocimetry (PIV) is a method of imaging and analyzing flow fields. In the PIV method, small windows in an image of the field (time t) are compared with the areas around those windows in another image of the field (time t + Δt), and the part most similar to each window is searched for using a two-dimensional cross-correlation function. The computational complexity of this function is very high, and it cannot be processed in real time by microprocessors. In this paper, we describe a real-time visualization system for the PIV method. In the system, an improved direct computation method is used to reduce the computational complexity. The system consists of only one off-the-shelf Virtex-II FPGA board and a host computer, and calculates the function without reducing data bit-width, which a single recent FPGA makes possible.

Toshihito Fujiwara, Kenji Fujimoto, Tsutomu Maruyama
A Real-Time Stereo Vision System with FPGA

In this paper, we describe a compact stereo vision system which consists of one off-the-shelf FPGA board with one FPGA. This system supports (1) camera calibration for easy use and for simplifying the circuit, and (2) left-right consistency check for reconstructing correct 3-D geometry from the images taken by the cameras. The performance of the system is limited by the calibration (which is, however, a must for practical use) because, owing to the calibration, only one pixel of data can be read in at a time. The performance is nevertheless 20 frames per second when the size of images is 640 x 480, and 80 frames per second when the size of images is 320 x 240, which is fast enough for practical use such as vision systems for autonomous robots. This high performance is made possible by the recent progress of FPGAs and wide memory access to external RAMs (eight memory banks) on the FPGA board.

Yosuke Miyajima, Tsutomu Maruyama
Synthesizing on a Reconfigurable Chip an Autonomous Robot Image Processing System

This paper deals with the implementation, in a high density reconfigurable device, of an entire log-polar image processing system. Log-polar vision reduces the amount of data to be stored and processed, simplifying several vision algorithms and making possible the implementation of a complete processing system on a single chip. This image processing system is especially appropriate for autonomous robotic navigation, since these platforms typically have power consumption, size and weight restrictions. Furthermore, the image processing algorithms involved are time consuming and often have real-time restrictions as well. A reconfigurable approach on a single chip combines hardware performance and software flexibility and appears especially suited to autonomous robotic navigation. The implementation of log-polar image processing algorithms as a pipeline of differential processing stages is a feasible approach, since the chip incorporates enough RAM to store several full log-polar images as intermediate computations. Two different algorithms have been synthesized into the reconfigurable device, showing the chip capabilities.

Jose Antonio Boluda, Fernando Pardo

SAT Techniques

Reconfigurable Hardware SAT Solvers: A Survey of Systems

By adapting to computations that are not so well supported by general-purpose processors, reconfigurable systems achieve significant increases in performance. Such computational systems use high-capacity programmable logic devices and are based on processing units customized to the requirements of a particular application. A great deal of research effort in this area is aimed at accelerating the solution of combinatorial optimization problems. Special attention was given to the Boolean satisfiability (SAT) problem resulting in a considerable number of different architectures being proposed. This paper presents the state-of-the-art in reconfigurable hardware SAT satisfiers. The analysis of existing systems has been performed according to such criteria as reconfiguration modes, the execution model, the programming model, etc.

Iouliia Skliarova, António B. Ferrari
Fault Tolerance Analysis of Distributed Reconfigurable Systems Using SAT-Based Techniques

The ability to migrate tasks from one reconfigurable node to another improves the fault tolerance of distributed reconfigurable systems. The degree of fault tolerance is inherent to the system and can be optimized during system design. Therefore, an efficient way of calculating the degree of fault tolerance is needed. This paper presents an approach based on satisfiability testing (SAT) which addresses the question: How many resources may fail in a distributed reconfigurable system without losing any functionality? We will show by experiment that our new approach can easily be applied to systems of reasonable size, such as those we will find in the future in the fields of body area networks and ambient intelligence.

Rainer Feldmann, Christian Haubelt, Burkhard Monien, Jürgen Teich
Hardware Implementations of Real-Time Reconfigurable WSAT Variants

Local search methods such as WSAT have proven to be successful for solving SAT problems. In this paper, we propose two host-FPGA (Field Programmable Gate Array) co-implementations, which use modified WSAT algorithms to solve SAT problems. Our implementations are reconfigurable in real-time for different problem instances. On an XCV1000 FPGA chip, SAT problems up to 100 variables and 220 clauses can be solved. The first implementation is based on a random strategy and achieves one flip per clock cycle through the use of pipelining. The second uses a greedy heuristic at the expense of FPGA space consumption, which precludes pipelining. Both implementations avoid re-synthesis, placement and routing for different SAT problems, and show improved performance over previously published reconfigurable SAT implementations on FPGAs.

Roland H. C. Yap, Stella Z. Q. Wang, Martin J. Henz
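
To illustrate the kind of local search the abstract above refers to, here is a minimal software sketch of a random-walk SAT search: pick an unsatisfied clause, flip one of its variables, repeat. It is a simplified stand-in, not the modified WSAT algorithms or the hardware pipeline from the paper. The function names, the DIMACS-style clause encoding (signed integers) and the `max_flips` and `seed` parameters are all assumptions made for this example.

```python
import random


def is_satisfied(clause, assignment):
    """A clause holds if at least one literal agrees with the assignment."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def random_walk_sat(clauses, num_vars, max_flips=10_000, seed=0):
    """Return a satisfying assignment (dict var -> bool) or None.

    Hypothetical helper: a pure random-walk variant of local search.
    """
    rng = random.Random(seed)
    assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not is_satisfied(c, assignment)]
        if not unsatisfied:
            return assignment
        # Random strategy: flip a random variable of a random unsatisfied clause.
        var = abs(rng.choice(rng.choice(unsatisfied)))
        assignment[var] = not assignment[var]
    # Check once more, since the last flip may have satisfied the formula.
    if all(is_satisfied(c, assignment) for c in clauses):
        return assignment
    return None
```

A greedy variant would instead choose the variable whose flip leaves the fewest clauses unsatisfied, at the cost of evaluating every candidate flip. Local search is incomplete: a `None` result means no solution was found within the flip budget, not that the formula is unsatisfiable.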

Application-Specific Architectures

Core-Based Reusable Architecture for Slave Circuits with Extensive Data Exchange Requirements

The functionality of many digital circuits depends strongly on high speed data exchange between data source and sink elements. In order to alleviate the main processor’s work, it is usually worthwhile to isolate high speed data exchange from all other control tasks. A generic architecture, based on configurable cores, has been developed for slave circuits controlled by an external host and with extensive data exchange requirements. Design reuse has been improved by means of a software application that helps with configuration and simulation tasks. Two applications implemented on FPGA technology are presented to validate the proposed architecture.

Unai Bidarte, Armando Astarloa, Aitzol Zuloaga, Jaime Jimenez, Iñigo Martínez de Alegría
Time and Energy Efficient Matrix Factorization Using FPGAs

In this paper, new algorithms and architectures for matrix factorization are presented. Two fully-parallel and block-based designs for LU decomposition on configurable devices are proposed. A linear array architecture is employed to minimize the usage of long interconnects, leading to lower energy dissipation. The designs are made scalable by using a fixed I/O bandwidth independent of the problem size. High level models for energy profiling are built and the energy performance of many possible designs is predicted. Through the analysis of design tradeoffs, the block size that minimizes the total energy dissipation is identified. A set of candidate designs was implemented on the Xilinx Virtex-II to verify the estimates. Also, the performance of our designs is compared with that of state-of-the-art DSP based designs and with the performance of designs obtained using a state-of-the-art commercial compilation tool such as Celoxica DK1. Our designs on the FPGAs are significantly more time and energy efficient in both cases.

Seonil Choi, Viktor K. Prasanna
Improving DSP Performance with a Small Amount of Field Programmable Logic

We show a systematic methodology to create DSP + field-programmable logic hybrid architectures by viewing it as a hardware/software codesign problem. This enables an embedded processor architect to evaluate the trade-offs in the increase in die area due to the field programmable logic and the resultant improvement in performance or code size. We demonstrate our methodology with the implementation of a Viterbi decoder. A key result of the paper is that the addition of a field-programmable data alignment unit (FPDAU) between the register-file and the computational blocks provides 15%-22% improvement in the performance of a Viterbi decoder on the state-of-the-art TigerSHARC DSP. The area overhead of the FPDAU is small relative to the DSP die size and does not require any changes to the programming model or the instruction set architecture.

John Oliver, Venkatesh Akella

DSP Applications

Fully Parameterized Discrete Wavelet Packet Transform Architecture Oriented to FPGA

The present paper describes a fully parameterized Discrete Wavelet Packet Transform (DWPT) architecture based on a folded Distributed Arithmetic implementation, which makes it possible to design any kind of wavelet basis. The proposed parameterized architecture allows different CDF wavelet coefficients with variable bit precision (data input and output size, and coefficient length). Moreover, by combining different blocks in cascade, we can expand as many complete stages (wavelet packet levels) as we require. Our architecture needs only two FIR filters to calculate various wavelet stages simultaneously, and specific VIRTEX family resources (SRL16E) have been instantiated to reduce area and increase operating frequency. Finally, a DWPT implementation for CDF(9,7) wavelet coefficients is synthesized on VIRTEX-II 3000-6 FPGA for different precisions.

Guillermo Payá, Marcos M. Peiró, Francisco Ballester, Francisco Mora
An FPGA System for the High Speed Extraction, Normalization and Classification of Moment Descriptors

We propose a new FPGA system for the high speed extraction, normalization and classification of moment descriptors. Moments are extensively used in computer vision, most recently in the MPEG-7 standard for the region shape descriptor. The computational complexity of such methods has been partially addressed by the proposal of custom hardware architectures for the fast computation of moments. However, a complete system for the extraction, normalization and classification of moment descriptors has not yet been suggested. Our system is a hybrid, relying partly on a very fast parallel processing structure and partly on a custom built, low cost, reprogrammable processing unit. Within the latter, we also propose FPGA circuits for low cost double precision floating-point arithmetic. Our system achieves the extraction and classification of invariant descriptors for hundreds or even thousands of intensity or color images per second and is ideal for high speed and/or volume applications.

Stavros Paschalakis, Peter Lee, Miroslaw Bober
Design and Implementation of a Novel FIR Filter Architecture with Boundary Handling on Xilinx VIRTEX FPGAs

This paper presents the design and implementation of a novel architecture for FIR filters on Xilinx Virtex FPGAs. The architecture is particularly useful for handling the problem of filtering at signal boundaries, which occurs in finite length signal processing (e.g. image processing). It cleverly exploits the Shift Register Logic (SRL) component of the Virtex family in order to implement the necessary complex data scheduling, leading to considerable area savings compared to the conventional implementation (based on a hard router), with no speed penalty. Our architecture uses bit parallel arithmetic and is fully scalable and parameterisable. A case study based on the implementation of the standard low-pass filter of the Daubechies-8 wavelet on Xilinx Virtex-E FPGAs is presented.

A. Benkrid, K. Benkrid, D. Crookes

Dynamic Reconfiguration

A Self-reconfiguring Platform

A self-reconfiguring platform is reported that enables an FPGA to dynamically reconfigure itself under the control of an embedded microprocessor. This platform has been implemented on Xilinx Virtex II™ and Virtex II Pro™ devices. The platform’s hardware architecture has been designed to be lightweight. Two APIs (Application Program Interface) are described which abstract the low level configuration interface. The Xilinx Partial Reconfiguration Toolkit (XPART), the higher level of the two APIs, provides methods for reading and modifying select FPGA resources. It also provides support for relocatable partial bitstreams. The presented self-reconfiguring platform enables embedded applications to take advantage of dynamic partial reconfiguration without requiring external circuitry.

Brandon Blodget, Philip James-Roxby, Eric Keller, Scott McMillan, Prasanna Sundararajan
Heuristics for Online Scheduling Real-Time Tasks to Partially Reconfigurable Devices

Partially reconfigurable devices allow tasks to be configured and executed in a true multitasking manner. The main characteristic of mapping tasks to such devices is the strong nexus between scheduling and placement. In this paper, we formulate a new online real-time scheduling problem and present two heuristics, the horizon and the stuffing technique, to tackle it. Simulation experiments evaluate the performance and the runtime efficiency of the schedulers. Finally, we discuss our prototyping work toward an integration of scheduling and placement into an operating system for reconfigurable devices.

Christoph Steiger, Herbert Walder, Marco Platzner
Run-Time Minimization of Reconfiguration Overhead in Dynamically Reconfigurable Systems

Dynamically Reconfigurable Hardware (DRHW) can take advantage of its reconfiguration capability to adapt its performance and its energy consumption at run-time. However, due to the lack of programming support for dynamic task placement on these platforms, little previous work has been presented studying these run-time performance/power trade-offs. To cope with the task placement problem we have adopted an interconnection-network-based DRHW model with specific support for reallocating tasks at run-time. On top of it, we have applied an emerging task concurrency management (TCM) methodology previously applied to multiprocessor platforms. We have identified that the reconfiguration overhead can drastically affect both the system performance and energy consumption. Hence, we have developed two new modules for the TCM run-time scheduler that minimize these effects. The first module reuses previously loaded configurations, whereas the second minimizes the impact of the reconfiguration latency by applying a configuration prefetching technique. With these techniques reconfiguration overhead is reduced by a factor of 4.

Javier Resano, Daniel Mozos, Diederik Verkest, Serge Vernalde, Francky Catthoor

SoC Architectures

Networks on Chip as Hardware Components of an OS for Reconfigurable Systems

In complex reconfigurable SoCs, the dynamism of applications requires an efficient management of the platform. To allow run-time allocation of resources, operating systems and reconfigurable SoC platforms should be developed together. The operating system requires hardware support from the platform to abstract the reconfigurable resources and to provide an efficient communication layer. This paper presents our work on interconnection networks which are used as hardware support for the operating system. We show how multiple networks interface to the reconfigurable resources, allow dynamic task relocation and extend OS-control to the platform. An FPGA implementation of these networks supports the concepts we describe.

T. Marescaux, J-Y. Mignolet, A. Bartic, W. Moffat, D. Verkest, S. Vernalde, R. Lauwereins
A Reconfigurable Platform for Real-Time Embedded Video Image Processing

The increasing ubiquity of embedded digital video capture creates demand for high-throughput, low-power, flexible and adaptable integrated image processing systems. An architecture for a system-on-a-chip solution is proposed, based on reconfigurable computing. The inherent system modularity and the communication infrastructure are targeted at enhancing design productivity and reuse. Power consumption is addressed by a combination of efficient streaming data transfer and reuse mechanisms. It is estimated that the proposed system would be capable of performing up to ten complex image manipulations simultaneously and in real-time on video resolutions up to XVGA.

N. P. Sedcole, P. Y. K. Cheung, G. A. Constantinides, W. Luk

Emulation

Emulation-Based Analysis of Soft Errors in Deep Sub-micron Circuits

The continuous technology scaling makes soft errors a critical issue in deep sub-micron technologies, and techniques that combine efficiency and accuracy are strongly required for assessing their impact. FPGA-based emulation is a promising solution to tackle this problem when large circuits are considered, provided that suitable techniques are available to support time-accurate simulations via emulation. This paper presents a novel technique that embeds time-related information in the topology of the analyzed circuit, allowing the effects of the soft errors known as single event transients (SETs) to be evaluated in large circuits via FPGA-based emulation. The analysis of complex designs thus becomes possible at a very limited cost in terms of CPU time, as shown by the case study described in the paper.

M. Sonza Reorda, M. Violante
HW-Driven Emulation with Automatic Interface Generation

This paper presents an approach to automate the emulation of HW/SW-Systems on an FPGA-board attached to a host. The basic steps in design preparation for the emulation are the generation of the interconnection and the description of the synchronization mechanism between the HW and the SW. While some of the related work considers the generation of the interconnection with some manual interventions, the generation of the synchronization mechanism is left to the user as part of the effort to set up the emulation. We present an approach to generate the interconnection and the synchronization mechanism, which allows a HW-driven communication between the SW and the HW.

M. Çakir, E. Grimpe, W. Nebel

Cache Design

Implementation of HW$im – A Real-Time Configurable Cache Simulator

In this paper, we describe a computer cache memory simulation environment based on a custom board with multiple FPGAs and DRAM DIMMs. This simulation environment is used for future memory hierarchy evaluation of either single or multiple processor systems. With this environment, we are able to perform real-time memory hierarchy studies running real applications. The board contains five Xilinx Virtex™ II-1000 FPGAs and eight SDRAM DIMMs. One of the FPGAs is used to interface with a microprocessor system bus. The other four FPGAs work in parallel to simulate different cache configurations. Each of these four FPGAs interfaces with two SDRAM DIMMs that are used to store the simulated cache. This simulation environment is operational and achieves a system frequency of 133MHz.

Shih-Lien Lu, Konrad Lai
The Bank Nth Chance Replacement Policy for FPGA-Based CAMs

In this paper we describe a method to implement a large, high density, fully associative cache in the Xilinx VirtexE FPGA architecture. The cache is based on a content addressable memory (CAM), with an associated memory to store information for each entry, and a replacement policy for victim selection. This implementation method is motivated by the need to improve the speed of routing of IP packets through Internet routers. To test our methodology, we designed a prototype cache with a 32 bit cache tag for the IP address and 4 bits of associated data for the forwarding information. The number of cache entries and the sizes of the data fields are limited by the area available in the FPGA. However, these sizes are specified as high level design parameters, which makes modifying the design for different cache configurations or larger devices trivial.

Paul Berube, Ashley Zinyk, José Nelson Amaral, Mike MacGregor

Arithmetic 1

Variable Precision Multipliers for FPGA-Based Reconfigurable Computing Systems

This paper describes a new efficient multiplier for FPGA-based variable precision processors. The proposed circuit can adapt itself at run-time to different data wordlengths, avoiding time- and power-consuming reconfiguration. This is made possible by the introduction of purpose-designed auxiliary logic, which enables the new circuit to operate in SIMD fashion and allows high parallelism levels to be guaranteed when operations on lower precisions are executed. The proposed circuit has been characterised using VIRTEX XILINX devices, but it can also be used efficiently in other FPGA families.

Pasquale Corsonello, Stefania Perri, Maria Antonia Iachino, Giuseppe Cocorullo
A New Arithmetic Unit in GF(2^m) for Reconfigurable Hardware Implementation

This paper proposes a new arithmetic unit (AU) in GF(2^m) for reconfigurable hardware implementation such as FPGAs, which overcomes the well-known drawback of reduced flexibility that is associated with traditional ASIC solutions. The proposed AU performs both division and multiplication in GF(2^m). These operations are at the heart of elliptic curve cryptosystems (ECC). Analysis shows that the proposed AU has significantly less area complexity and has roughly the same or lower latency compared with some related circuits. In addition, we show that the proposed architecture preserves a high clock rate for large m (up to 571), when it is implemented on Altera’s EP2A70F1508C-7 FPGA device. Furthermore, the new architecture provides a high flexibility and scalability with respect to the field size m, since it does not restrict the choice of irreducible polynomials and has the features of regularity, modularity, and unidirectional data flow. Therefore, the proposed architecture is well suited for both division and multiplication unit of ECC implemented on FPGAs.

Chang Hoon Kim, Soonhak Kwon, Jong Jin Kim, Chun Pyo Hong
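
For readers unfamiliar with binary-field arithmetic, the multiplication mentioned in the abstract above can be sketched in software as a bit-serial shift-and-add over a polynomial basis. This is a generic textbook method shown under stated assumptions, not the architecture proposed in the paper. The function name is hypothetical, field elements are assumed to be Python integers whose bits are polynomial coefficients, and `poly` is assumed to be an irreducible polynomial that includes its x^m term.

```python
def gf2m_multiply(a, b, m, poly):
    """Multiply a and b in GF(2^m), reducing modulo the polynomial poly.

    Assumes 0 <= a, b < 2**m and that poly has degree exactly m.
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a  # addition in GF(2^m) is bitwise XOR
        b >>= 1
        a <<= 1  # multiply a by x
        if a & (1 << m):
            a ^= poly  # reduce when the degree reaches m
    return result
```

As a usage example, the AES field GF(2^8) with polynomial x^8 + x^4 + x^3 + x + 1 corresponds to `m=8` and `poly=0x11B`, so `gf2m_multiply(0x57, 0x13, 8, 0x11B)` multiplies two bytes in that field. Because `poly` is a parameter, the same loop works for any irreducible polynomial of degree m.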

Biologically Inspired Designs

A Dynamic Routing Algorithm for a Bio-inspired Reconfigurable Circuit

In this paper we present a new dynamic routing algorithm specially implemented for a new electronic tissue called POEtic. This reconfigurable circuit is designed to ease the implementation of bio-inspired systems that bring cellular applications into play. Specifically designed for implementing cellular applications, such as neural networks, this circuit is composed of two main parts: a two-dimensional array of basic elements similar to those found in common commercial FPGAs, and a two-dimensional array of routing units that implement a dynamic routing algorithm which allows the creation of data paths between cells at runtime.

Yann Thoma, Eduardo Sanchez, Juan-Manuel Moreno Arostegui, Gianluca Tempesti
An FPL Bioinspired Visual Encoding System to Stimulate Cortical Neurons in Real-Time

This paper proposes a real-time bioinspired visual encoding system, supported on Field Programmable Logic, for multielectrode stimulation of the visual cortex. This system includes the spatio-temporal preprocessing stage and the generation of time-varying spike patterns to stimulate an array of microelectrodes and can be applied to build a portable visual neuroprosthesis. It requires only a small amount of hardware, which is achieved by taking advantage of the high operating frequency of the FPGAs to share circuits in time. Experimental results show that with the proposed architecture a real-time visual encoding system can be implemented in FPGAs with modest capacity.

Leonel Sousa, Pedro Tomás, Francisco Pelayo, Antonio Martinez, Christian A. Morillas, Samuel Romero

Low-Power Issues 2

Power Analysis of FPGAs: How Practical Is the Attack?

Recent developments in information technologies have made the secure transmission of digital data a critical design point. Large data flows have to be exchanged securely and involve encryption rates that sometimes may require hardware implementations. Reprogrammable devices such as Field Programmable Gate Arrays are highly attractive solutions for hardware implementations of encryption algorithms and several papers underline their growing performance and flexibility for any digital processing application. Although cryptosystem designers frequently assume that secret parameters will be manipulated in closed reliable computing environments, Kocher et al. stressed in 1998 that actual computers and microchips leak information correlated with the data handled. Side-channel attacks based on time, power and electromagnetic measurements were successfully applied to the smart card technology, but we have no knowledge of any attempt to implement them against FPGAs. This paper examines how monitoring power consumption signals might breach FPGA-security. We present first experimental results against FPGA-implementations of cryptographic algorithms in order to confirm that power analysis has to be considered as a serious threat for FPGA security. We also highlight certain features of FPGAs that increase their resistance against side-channel attacks.

François-Xavier Standaert, Loïc van Oldeneel tot Oldenzeel, David Samyde, Jean-Jacques Quisquater
A Power-Scalable Motion Estimation Architecture for Energy Constrained Applications

In the research community wireless devices are fostering many design and development activities. The augmented transmission bandwidth supplied by 3G transmission schemes will soon enable a ubiquitous fruition of multimedia content. This paper proposes a reconfigurable, power-scalable architecture for hybrid video coding, suitable for the mobile environment. The complete FPGA design flow shows very interesting performance in terms of both throughput and power consumption.

Maurizio Martina, Andrea Molino, Federico Quaglio, Fabrizio Vacca

SoC Designs

A Novel Approach for Architectural Models Characterization. An Example through the Systolic Ring

In this article we present a model of coarse grained reconfigurable architecture, dedicated to accelerating data-flow oriented applications. The proliferation of new academic and industrial architectures implies a large variety of solutions for platform-based designers. Thus, efficient metrics to compare and qualify these architectures are more and more necessary. Several metrics, Throughput Density [3][12], Remanence [4] and Operative Density, are then used to perform comparisons on different architectures. Architectures are often customisable and expose several parameters. Therefore, it is crucial to characterize the architectural model according to these parameters. This paper proposes as a case study the Systolic Ring, and gives a set of metrics as functions of the architecture parameters. The methodology illustrated is generic and proved very efficient at highlighting architectural properties such as scalability.

P. Benoit, G. Sassatelli, L. Torres, M. Robert, G. Cambon, D. Demigny
A Generic Architecture for Integrated Smart Transducers

A smart transducer network hosts various nodes with different functionality. Our approach offers the possibility to design different smart transducer nodes as a system-on-a-chip within the same platform. Key elements are a set of code compatible processor cores which can be equipped with several extension modules. Due to the fact that all processor cores are code compatible, programs developed for one node run on all other nodes without any modification. A well-defined interface between processor cores and extension modules ensures that all modules can be used with every processor type. The applicability of the proposed approach is shown by presenting our experiences with the implementation of a smart transducer featuring the processor core and a UART extension module on an FPGA.

Martin Delvai, Ulrike Eisenmann, Wilfried Elmenreich
Customisable Core-Based Architectures for Real-Time Motion Estimation on FPGAs

This paper proposes new core-based architectures for motion estimation that are customisable for different coding parameters and hardware resources. These new cores are derived from an efficient and fully parameterisable 2-D single array systolic structure for full-search block-matching motion estimation and inherit its configurability properties in what concerns the macroblock dimension, the search area and parallelism level. The proposed architectures require significantly fewer hardware resources, by reducing the spatial and pixel resolutions rather than restricting the set of considered candidate motion vectors. Low-cost and low-power regular architectures suitable for field programmable logic implementation are obtained without compromising the quality of the coded video sequences. Experimental results show that despite the significant complexity level presented by motion estimation processors, it is still possible to implement fast and low-cost versions of the original core-based architecture using general purpose FPGA devices.

Nuno Roma, Tiago Dias, Leonel Sousa
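
As a point of reference for the abstract above, full-search block matching can be sketched in a few lines of software. This is only an illustrative sketch, not the authors' systolic architecture: the function names `sad` and `full_search`, the use of the sum of absolute differences as the matching cost, and the parameters `n` (macroblock dimension) and `p` (search range) are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Exhaustively test every displacement in [-p, p] x [-p, p] whose
    candidate block lies inside `ref`; return (best_sad, (dx, dy))."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                continue  # candidate block would fall outside the frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

The two nested displacement loops are what make full search costly: the number of candidate blocks grows with the square of the search range, which is the kind of workload that hardware parallelism is meant to absorb.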

Cellular Applications

A High Speed Computation System for 3D FCHC Lattice Gas Model with FPGA

In this paper, we describe a new computation method for 3D FCHC lattice gas model with FPGA. FCHC lattice gas model is a class of 3D cellular automata and used for simulating fluid dynamics. Many approaches with FPGAs for cellular automata have been researched to date. However, practical three dimensional cellular automata such as an FCHC lattice gas model could not be processed efficiently because they required large size data for each cell and very complex update rules for computing cells. We implemented the new method on an FPGA board with one XC2V6000. The speed gain for FCHC lattice gas model with 128 x 128 x 128 lattice is about 200 times compared with Athlon processor 1800 MHz.

Tomoyoshi Kobori, Tsutomu Maruyama
Implementation of ReCSiP: A ReConfigurable Cell SImulation Platform

A reconfigurable accelerator for cell simulators called "ReCSiP" is proposed. It consists of both reconfigurable hardware and a software platform. For high performance simulation, the numerical solution of kinetic formulas, which requires a large amount of computation, is processed on the reconfigurable hardware. It also provides a programming interface for developing cell simulators. In this paper, a Michaelis-Menten solver is designed and implemented on ReCSiP. The result of preliminary evaluation shows that ReCSiP is 8 times faster than an Intel PentiumIII 1.13GHz when simple metabolic simulations are executed.

Yasunori Osana, Tomonori Fukushima, Hideharu Amano
On the Implementation of a Margolus Neighborhood Cellular Automata on FPGA

Margolus neighborhood is the easiest form of designing Cellular Automata Rules with features such as invertibility or particle conserving. In this paper we propose two different implementations of systems based on this neighborhood: The first one corresponds to a classical RAM-based implementation, while the second, based on concurrent cells, is useful for smaller systems in which time is a critical parameter. This implementation has the feature that the evolution of all the cells in the design is performed in the same clock cycle.

Joaquín Cerdá, Rafael Gadea, Vicente Herrero, Angel Sebastiá
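
To make the Margolus neighborhood mentioned above concrete, the following is a minimal software sketch of one update step. It is not the RAM-based or concurrent-cell hardware design from the paper. The toroidal grid with even side lengths, the function names, and the choice of a block-rotation rule are all assumptions made for illustration; rotation is used because it is both invertible and particle conserving.

```python
def margolus_step(grid, offset, rule):
    """Apply `rule` to every 2x2 block of a toroidal grid with even side
    lengths. `offset` (0 or 1) selects which of the two alternating block
    partitions is used on this step."""
    h, w = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for by in range(offset, h + offset, 2):
        for bx in range(offset, w + offset, 2):
            y0, y1 = by % h, (by + 1) % h
            x0, x1 = bx % w, (bx + 1) % w
            block = (grid[y0][x0], grid[y0][x1], grid[y1][x0], grid[y1][x1])
            a, b, c, d = rule(block)
            new[y0][x0], new[y0][x1] = a, b
            new[y1][x0], new[y1][x1] = c, d
    return new


def rotate_cw(block):
    """Rotate a block (top-left, top-right, bottom-left, bottom-right)
    by 90 degrees clockwise."""
    a, b, c, d = block
    return (c, a, d, b)


def rotate_ccw(block):
    """Inverse of rotate_cw: rotate the block 90 degrees counter-clockwise."""
    a, b, c, d = block
    return (b, d, a, c)
```

Because the blocks of a partition are disjoint, every block can be updated independently, and running the inverse rule over the partitions in reverse order restores the earlier state.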

Arithmetic 2

Fast Modular Division for Application in ECC on Reconfigurable Logic

Elliptic Curve Public Key Cryptosystems are becoming increasingly popular for use in mobile devices and applications where bandwidth and chip area are limited. They provide much higher levels of security per key length than established public key systems such as RSA. The underlying operation of elliptic curve point multiplication requires modular multiplication, division/inversion and addition/subtraction. Division is by far the most costly operation in terms of speed. This paper proposes a new divider architecture and implementation on FPGA for use in an ECC processor.

Alan Daly, William Marnane, Tim Kerins, Emanuel Popovici
Non-uniform Segmentation for Hardware Function Evaluation

This paper presents a method for evaluating functions in hardware based on polynomial approximation with non-uniform segments. The novel use of non-uniform segments enables us to approximate non-linear regions of a function particularly well. The appropriate segment address for a given function can be rapidly calculated in run time by a simple combinational circuit. Scaling factors are used to deal with large polynomial coefficients and to trade precision with range. Our function evaluator is based on first-order polynomials, and is suitable for applications requiring high performance with small area, at the expense of accuracy. The proposed method is illustrated using two functions, $\sqrt{-\ln(x)}$ and $\cos(2\pi x)$, which have been used in Gaussian noise generation.

Dong-U Lee, Wayne Luk, John Villasenor, Peter Y. K. Cheung
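
The idea of first-order approximation over non-uniform segments can be illustrated with a small software model. This is a sketch under stated assumptions, not the scheme from the paper: segment boundaries are placed at powers of two, so that the segment index follows from the binary exponent of the input (roughly what a leading-zero detector would provide in hardware), and each segment uses the chord through its endpoints. The names `build_table`, `segment_index` and `evaluate` are hypothetical.

```python
import math


def build_table(f, n_segments):
    """Segment k covers [2**-(k+2), 2**-(k+1)). For each segment, store the
    slope and intercept of the chord through its two endpoints."""
    table = []
    for k in range(n_segments):
        lo, hi = 2.0 ** -(k + 2), 2.0 ** -(k + 1)
        slope = (f(hi) - f(lo)) / (hi - lo)
        table.append((slope, f(lo) - slope * lo))
    return table


def segment_index(x):
    """Map x to its segment using only the binary exponent of x."""
    _, e = math.frexp(x)  # x = m * 2**e with 0.5 <= m < 1
    return -e - 1


def evaluate(table, x):
    """Evaluate the first-order approximation at x."""
    k = segment_index(x)
    if not 0 <= k < len(table):
        raise ValueError("x lies outside the tabulated range")
    slope, intercept = table[k]
    return slope * x + intercept
```

With this layout the segments become narrower as the input approaches zero, which is where $\sqrt{-\ln(x)}$ changes most rapidly; a uniform segmentation with the same number of entries would spend most of its table on the flatter part of the curve.
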
A Dual-Path Logarithmic Number System Addition/Subtraction Scheme for FPGA

A new architecture for calculating the addition/subtraction function required in a logarithmic number system (LNS) is presented. A substantial logic saving over previous works is illustrated along with similarities with the dual-path floating-point addition method. The new architecture constrains the lookups to be of fractional width and uses shifting to achieve this. Instead of calculating the function $\log_2(1\pm2^{M-K})$ in two lookups the function arithmetic is performed (i.e. the two functions $2^{M-K}$ and $\log_2(\cdot)$, plus a correction function) as this allows logic sharing that maps well to FPGA. Better-than-floating-point (BTFP) accuracy is used to enable a future comparison with floating-point.

Barry Lee, Neil Burgess
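
For readers unfamiliar with LNS arithmetic, the role of the function $\log_2(1\pm2^{M-K})$ can be shown with a short floating-point model. This sketch only reproduces the mathematical identity, not the dual-path architecture or its lookup tables. The names `lns_add` and `lns_sub` are hypothetical, and the exponent is written here as the non-positive difference between the smaller and the larger logarithm, which is an assumption about how the abstract's $M-K$ should be read.

```python
import math


def lns_add(a, b):
    """Given a = log2(A) and b = log2(B) with A, B > 0, return log2(A + B)
    using log2(A + B) = hi + log2(1 + 2**(lo - hi))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


def lns_sub(a, b):
    """Given a = log2(A) and b = log2(B), return log2(|A - B|)
    using log2(|A - B|) = hi + log2(1 - 2**(lo - hi))."""
    hi, lo = max(a, b), min(a, b)
    if hi == lo:
        raise ValueError("A - B is zero, which has no logarithm")
    return hi + math.log2(1.0 - 2.0 ** (lo - hi))
```

Multiplication and division in LNS are plain additions and subtractions of the logarithms, so it is this correction term for addition and subtraction that dominates the hardware cost.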

Fault Analysis

A Modular Reconfigurable Architecture for Efficient Fault Simulation in Digital Circuits

In this paper, a modular reconfigurable architecture for efficient stuck-at fault simulation in digital circuits is described. The architecture is based on a Universal Faulty Gate Block, which models each 2-input gate by a 4-input Look-Up Table (LUT) and a Shift-Register (SR) with 3 stages, and relies on collapsing the stuck-at fault list of the gates using equivalence and dominance relations between faults. An example is presented, the expected performance is estimated and the applicability and limitations of the architecture are discussed.

J. Soares Augusto, C. Beltrán Almeida, H. C. Campos Neto
Evaluation of Testability of Path Delay Faults for User-Configured Programmable Devices

A model of the combinational section of a programmable device suitable for an analysis of testability of delay faults is proposed. All relevant factors that affect the evaluation of testability of path delay faults are identified and their impact on the outcome of the evaluation is discussed. A detailed analysis, supported by quantitative results, focuses on the selection of the set of target faults in terms of a class of logical paths and on the concept of defining testability measures for physical paths rather than for logical paths. Practical guidelines are formulated for the development of a procedure for the evaluation of testability of path delay faults.

Andrzej Krasniewski
Fault Simulation Using Partially Reconfigurable Hardware

This paper presents a fault simulation algorithm that uses efficient partial reconfiguration of FPGAs. The methodology is particularly useful for evaluation of BIST effectiveness, and for applications in which multiple fault injection is mandatory, such as safety-critical applications. A novel fault collapsing methodology is proposed, which efficiently leads to the minimal stuck-at fault list at the look-up-tables’ terminals. Fault injection is performed using local partial reconfiguration with small binary files. Our results on the ISCAS’89 sequential circuit benchmarks show that our methodology can be orders of magnitude faster than software or fully reconfigurable hardware fault simulation.

A. Parreira, J. P. Teixeira, A. Pantelimon, M. B. Santos, J. T. de Sousa
Switch Level Fault Emulation

The switch level, an abstraction level between the gate level and the electrical level, offers many advantages. Switch level simulators can reliably model many important phenomena in CMOS circuits, such as bi-directional signal propagation, charge sharing and variations in driving strength. However, the fault simulation of switch level models is more time-consuming than that of gate level models. This paper presents a method for fast fault emulation of switch level circuits using FPGA chips. In this method, gates model switch level circuits and we can emulate mixed gate-switch level models. By the use of this method, FPGA chips can be used to accelerate fault injection campaigns into switch level models.

Seyed Ghassem Miremadi, Alireza Ejlali

Network Applications

An Extensible, System-On-Programmable-Chip, Content-Aware Internet Firewall

An extensible firewall has been implemented that performs packet filtering, content scanning, and per-flow queuing of Internet packets at Gigabit/second rates. The firewall uses layered protocol wrappers to parse the content of Internet data. Packet payloads are scanned for keywords using parallel regular expression matching circuits. Packet headers are compared to rules specified in Ternary Content Addressable Memories (TCAMs). Per-flow queuing is performed to mitigate the effect of Denial of Service attacks. All packet processing operations were implemented with reconfigurable hardware and fit within a single Xilinx Virtex XCV2000E Field Programmable Gate Array (FPGA). The single-chip firewall has been used to filter Internet SPAM and to guard against several types of network intrusion. Additional features were implemented in extensible hardware modules deployed using run-time reconfiguration.

John W. Lockwood, Christopher Neely, Christopher Zuver, James Moscola, Sarang Dharmapurikar, David Lim
IPsec-Protected Transport of HDTV over IP

Bandwidth-intensive applications compete directly with the operating system’s network stack for CPU cycles. This is particularly true when the stack performs security protocols such as IPsec; the additional load of complex cryptographic transforms overwhelms modern CPUs when data rates exceed 100 Mbps. This paper describes a network-processing accelerator which overcomes these bottlenecks by offloading packet processing and cryptographic transforms to an intelligent interface card. The system achieves sustained 1 Gbps host-to-host bandwidth of encrypted IPsec traffic on commodity CPUs and networks. It appears to the application developer as a normal network interface, because the hardware acceleration is transparent to the user. The system is highly programmable and can support a variety of offload functions. A sample application is described, wherein production-quality HDTV is transported over IP at nearly 900 Mbps, fully secured using IPsec with AES encryption.

Peter Bellows, Jaroslav Flidr, Ladan Gharai, Colin Perkins, Pawel Chodowiec, Kris Gaj
Fast, Large-Scale String Match for a 10Gbps FPGA-Based Network Intrusion Detection System

Intrusion Detection Systems such as Snort scan incoming packets for evidence of security threats. The most computation-intensive part of these systems is a text search against hundreds of patterns, which must be performed at wire speed. FPGAs are particularly well suited for this task and several such systems have been proposed. In this paper we expand on previous work, in order to achieve and exceed a processing bandwidth of 11Gbps. We employ a scalable, low-latency architecture, and use extensive fine-grain pipelining to tackle the fan-out, match, and encode bottlenecks and achieve operating frequencies in excess of 340MHz for fast Virtex devices. To increase throughput, we use multiple comparators and allow for parallel matching of multiple search strings. We evaluate the area and latency cost of our approach and find that the match cost per search pattern character is between 4 and 5 logic cells.

Ioannis Sourdis, Dionisios Pnevmatikatos
Irregular Reconfigurable CAM Structures for Firewall Applications

Hardware packet-filters for firewalls, based on content-addressable memory (CAM), allow packet matching processes to keep pace with network throughputs. However, the size of an FPGA chip may limit the size of a firewall rule set that can be implemented in hardware. We develop two irregular CAM structures for packet-filtering that employ resource sharing methods, with various trade-offs between size and speed. Experiments show that these two structures can reduce hardware resource usage by up to 90% without losing performance.

T. K. Lee, S. Yusuf, W. Luk, M. Sloman, E. Lupu, N. Dulay

High Level Design Tools 2

Compiling for the Molen Programming Paradigm

In this paper we present compiler extensions for the Molen programming paradigm, which is a sequential consistency paradigm for programming custom computing machines (CCM). The compiler supports instruction set extensions and register file extensions. Based on pragma annotations in the application code, it identifies the code fragments implemented on the reconfigurable hardware and automatically maps the application on the target reconfigurable architecture. We also define and implement a mechanism that allows multiple operations to be executed in parallel on the reconfigurable hardware. In a case study, the Molen processor has been evaluated. We considered two popular multimedia benchmarks, mpeg2enc and ijpeg, and some well-known time-consuming operations implemented in the reconfigurable hardware. The total number of executed instructions has been reduced by 72% for mpeg2enc and 35% for the ijpeg encoder, compared to their pure software implementations on a general purpose processor (GPP).

Elena Moscu Panainte, Koen Bertels, Stamatis Vassiliadis
Laura: Leiden Architecture Research and Exploration Tool

At Leiden Embedded Research Center (LERC), we are building a tool chain called Compaan/Laura that allows us to map applications written in Matlab quickly and efficiently onto reconfigurable platforms. In this chain, first the Matlab code is converted automatically to an executable Kahn Process Network (KPN) specification. Then a tool called Laura accepts this specification and transforms it into design implementations described as synthesizable VHDL. In this paper, we present our methodology implemented in the Laura tool, to automatically convert KPNs to synthesizable VHDL code targeted for mapping onto FPGA-based platforms. With the help of Laura, a designer is able either to rapidly prototype signal processing and multimedia applications directly in hardware or to very quickly extract valuable low-level quantitative implementation data such as performance in terms of clock cycles, time delays and silicon area.

Claudiu Zissulescu, Todor Stefanov, Bart Kienhuis, Ed Deprettere
Communication Costs Driven Design Space Exploration for Reconfigurable Architectures

In this paper we propose a design space exploration method targeting reconfigurable architectures that takes place at the algorithmic level and aims to rapidly highlight architectures that present good performance vs. flexibility tradeoffs. The exploration flow is based on a functional model to describe the architectures that the designer wants to compare. The paper mainly focuses on the projection step of our flow and presents an allocation heuristic that is based on communication costs reduction.

Lilian Bossuet, Guy Gogniat, Jean-Luc Philippe
From Algorithm Graph Specification to Automatic Synthesis of FPGA Circuit: A Seamless Flow of Graphs Transformations

Control, signal and image processing applications are complex in terms of algorithms, hardware architectures and real-time/embedded constraints. System level CAD software tools are therefore useful to help the designer prototype and optimize such applications. These tools are often based on design flow methodologies. This paper presents a seamless design flow which transforms a data dependence graph specifying the application into an implementation graph containing both data and control paths. The proposed approach follows a set of rules based on the RTL model and on mechanisms of synchronized data transfers in order to transform automatically the initial algorithmic graph into the implementation graph. This transformation flow is part of the extension of our AAA (Algorithm-Architecture Adequation) rapid prototyping methodology to support the optimized implementation of real-time applications on reconfigurable circuits. It has been implemented in SynDEx, a system level CAD software tool that supports AAA.

Linda Kaouane, Mohamed Akil, Yves Sorel, Thierry Grandpierre

Technologies and Trends (Posters)

Adaptive Real-Time Systems and the FPAA

Adaptive real-time systems are often considered by design engineers to be a solely digital preserve. However, in recent years analogue technologies such as the Field Programmable Analogue Array (FPAA) have been developed to support real-time reconfiguration. These technologies can offer an analogue platform for a vast number of applications requiring adaptive processing and form the subject of this paper. Although in terms of their technology and software/algorithmic support FPAAs are still in their relative infancy compared with digital techniques, we aim to show their advantages and how present devices can be exploited in real-time applications.

Stuart Colsell, Reuben Edwards
Challenges and Successes in Space Based Reconfigurable Computing

For 6 years Los Alamos has developed space-compatible versions of Reconfigurable Computing (RCC). Several such designs are now operational. We describe the key research steps required to make commercial silicon processes amenable to Radiation Tolerant operations, and the limits on algorithms imposed by reliability concerns.

Mark E. Dunham, Michael P. Caffrey, Paul S. Graham
Adaptive Processor: A Dynamically Reconfiguration Technology for Stream Processing

As semiconductor technology advances, the number of reconfigurable units is being increased in order to improve the performance of reconfigurable computing. The array of reconfigurable units can be configured as an application-specific pipelined processing datapath. Configuration overhead then becomes a critical part of the total execution time for systems based on dynamic reconfiguration. In this paper, models of an efficient configuration methodology and of application-specific pipelined stream processing are proposed. An adaptive processor architecture is also proposed and discussed in summary.

Shigeyuki Takano

Applications (Posters)

Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns

This paper presents techniques for designing pattern matching circuits for complex regular expressions, such as those found in network intrusion detection patterns. We have developed a pattern-matching co-processor that supports all the pattern matching functions of the Snort rule language [3]. In order to achieve maximum pattern capacity and throughput, the design focuses on minimizing circuit area while maintaining high clock speed. Using our approach, we are able to store the entire current Snort rule database consisting of over 1,500 rules and 17,000 characters into a single one-million-gate FPGA while comparing all patterns against traffic at gigabit rates.

Christopher R. Clark, David E. Schimmel
FPGAs for High Accuracy Clock Synchronization over Ethernet Networks

This article describes the architecture and implementation of two systems on a programmable chip, which support high accuracy clock synchronization over Ethernet networks. The network interface node on one hand provides all necessary hardware support to be flexibly used in a broad range of applications. The switch add-on on the other hand accounts for the packet delay uncertainties of Ethernet switches and is crucial for high accuracy clock synchronization.

Roland Höller
Project of IPv6 Router with FPGA Hardware Accelerator

This paper deals with a hardware accelerator as a part of the Liberouter project, which is focused on the design and implementation of a PC based IPv6 router. A major part of the Liberouter project is the development of a hardware accelerator: the PCI board called COMBO6 and its FPGA design, which allows most of the network traffic to be processed in hardware.

Jiří Novotný, Otto Fučík, David Antoš
A TCP/IP Based Multi-device Programming Circuit

This paper describes a lightweight Field Programmable Gate Array (FPGA) circuit design that supports the simultaneous programming of multiple devices at different locations throughout the Internet. This task is accomplished by a single TCP/IP socket connection. Packets are routed through a series of devices to be programmed. At each location, a hardware circuit extracts reconfiguration information from the TCP/IP byte stream and programs other devices at that location. A novel feature of the Multi-Device Programmer is that it does not use a microprocessor or even a soft-core processor. All of the TCP/IP protocol processing and packet forwarding operations are handled directly in FPGA logic and state machines. This system is robust against lost and reordered packets, and has been successfully demonstrated in the laboratory.

David V. Schuehler, Harvey Ku, John Lockwood

Tools (Posters)

Design Flow for Efficient FPGA Reconfiguration

In Run Time Reconfiguration (RTR) systems, the amount of reconfiguration is considerable when compared to the circuit changes implemented. This is because reconfiguration is not considered as part of the design flow. This paper presents a method for reconfigurable circuit design by modeling the underlying FPGA reconfigurable circuitry and taking it into consideration in the system design. This is demonstrated for an image processing example on the Xilinx Virtex FPGA.

Richard H. Turner, Roger F. Woods
High-Level Design Tools for FPGA-Based Combinatorial Accelerators

Analysis of different combinatorial search algorithms has shown that they have a set of distinctive features in common. The paper suggests a number of reusable blocks that support these features and provide high-level design of combinatorial accelerators.

Valery Sklyarov, Iouliia Skliarova, Pedro Almeida, Manuel Almeida
Using System Generator to Design a Reconfigurable Video Encryption System

In this paper, we discuss the use of System Generator to design a reconfigurable video encryption system. It includes the design of the AES (Advanced Encryption Standard) and Enigma encryption cores. As a result of using this design flow, we are able to efficiently implement our system and algorithms with a significant improvement on traditional design times, without compromising performance.

Daniel Denning, Neil Harold, Malachy Devlin, James Irvine
MATLAB/Simulink Based Methodology for Rapid-FPGA-Prototyping

The paper is focused on rapid prototyping for FPGA using the high-level environment of MATLAB/Simulink. An approach using combination of the Xilinx System Generator (XSG) and Handel-C is reviewed. A design flow to minimize HDL coding is considered.

Miroslav Líčko, Jan Schier, Milan Tichý, Markus Kühl

FPGA Implementations (Posters)

DIGIMOD: A Tool to Implement FPGA-Based Digital IF and Baseband Modems

This paper presents a software tool to design intermediate frequency and baseband digital transceivers on FPGA. The main characteristic of this tool is that an ad-hoc interpolation or decimation filter chain composed of CIC, polyphase, pulse shaping and matched filters, together with a CORDIC-based or ROM-based mixer, can be selected. The tool allows the software radio designer to develop downconverters and upconverters and, finally, to automatically generate the VHDL code to implement the system on Xilinx FPGAs.

J. Marín-Roig, V. Torres, M. J. Canet, A. Pérez, T. Sansaloni, F. Cardells, F. Angarita, F. Vicedo, V. Almenar, J. Valls
FPGA Implementation of a Maze Routing Accelerator

This paper describes the implementation of the L3 maze routing accelerator in an FPGA. L3 supports fast single-layer and multi-layer routing, preferential routing, and rip-up-and-reroute. A 16 X 16 single-layer and 4 X 4 multi-layer router that can handle 2-16 layers have been implemented in a low-end Xilinx XC2S300E FPGA. Larger arrays are currently under construction.

John A. Nestor
Model Checking Reconfigurable Processor Configurations for Safety Properties

Reconfigurable processors pose unique problems for program safety because of their use of computational approaches that are difficult to integrate into traditional program analyses. The combination of proof-carrying code for verification of standard processor machine code and model-checking for array configurations is explored. This approach is shown to be useful in verifying safety properties including the synchronization of memory accesses by the reconfigurable array and memory access bounds checking.

John Cochran, Deepak Kapur, Darko Stefanovic
A Statistical Analysis Tool for FPLD Architectures

This paper investigates an analysis tool for the routing resources in FPLD architecture design. The developed tool can assess the performance of a given architecture specified by the physical configuration of logic blocks and the switch box topology. Two problems are mainly considered in this paper: given an architecture, the terminal distribution of each switch box is first determined via probabilistic assumptions, then the sizes of the universal switch boxes required for successful routing are evaluated. The estimations are validated by comparing them with the results obtained in a previously published experimental study on FPGA benchmark circuits. Moreover, our result confirms that the universal switch block is a good candidate for FPLD design.

Renqiu Huang, Tommy Cheung, Ted Kok

Video and Image Applications (Posters)

FPGA-Implementation of Signal Processing Algorithms for Video Based Industrial Safety Applications

Conventional protective devices such as light curtains allow a safe but often inconvenient flow of work. Unfortunately, uncomfortable safety devices are often bypassed or simply switched off. Consequently, the design of a video based protective device which avoids inconvenient processing steps is of special interest. The present paper describes favorable combinations of FPGA hardware and algorithms which allow safeguarding of work places if several constraints are met. The methods were originally developed for surveillance of press brakes, but they are easily adaptable to different types of machines or work places.

Jörg Velten, Anton Kummert
Configurable Hardware Architecture for Real-Time Window-Based Image Processing

In this work, a configurable hardware architecture for window-based image operations for real-time applications is presented. The architecture is based on an array of elemental processors under a systolic and pipeline approach to achieve a high rate of processing. A configurable window processor has been developed to cover a broad class of image processing algorithms and operators. The system is modeled in a Hardware Description Language and has been prototyped on an FPGA device. Some implementation and performance results are presented and discussed.

Cesar Torres-Huitzil, Miguel Arias-Estrada
An FPGA-Based Image Connected Component Labeller

This paper describes an FPGA implementation of a Connected Component Labelling algorithm (CCL), developed at Queen’s University Belfast. The algorithm iteratively scans the input image, performing a non-zero maximum neighbourhood operation. It has been coded in the Handel-C language and targeted at a Celoxica RC1000-PP PCI board. The whole design was fully implemented and tested on real hardware in less than 24 man-hours. It uses a Virtex-E FPGA and two banks of off-chip memory. For 1024x1024 input images, the whole circuit consumes 583 FPGA slices and 5 Block RAMs and can run at 72 MHz, leading to a 68 pass/sec performance. The FPGA implementation easily outperforms an equivalent software implementation running on a 1.6 GHz Pentium-IV PC. A 10-fold speed-up has been realised in many instances.

K. Benkrid, S. Sukhsawas, D. Crookes, A. Benkrid
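
The iterative scheme named in the abstract above, repeated scans that apply a non-zero maximum neighbourhood operation, can be sketched in software as follows. This is an illustrative reading rather than the authors' Handel-C design: the unique initial labelling, the 8-connected neighbourhood, the in-place update order and the function name `label_components` are assumptions made for the example.

```python
def label_components(image):
    """Label the connected foreground regions of a binary image.

    Every foreground pixel starts with a unique positive label. Each pass
    replaces every non-zero label with the maximum label found in its 3x3
    neighbourhood; background pixels stay 0. Passes repeat until nothing
    changes, at which point each component carries a single label.
    Returns (labels, number_of_passes)."""
    h, w = len(image), len(image[0])
    labels = [[y * w + x + 1 if image[y][x] else 0 for x in range(w)]
              for y in range(h)]
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for y in range(h):
            for x in range(w):
                if labels[y][x] == 0:
                    continue  # background never receives a label
                best = labels[y][x]
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if labels[ny][nx] > best:
                            best = labels[ny][nx]
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
    return labels, passes
```

The number of passes needed depends on the shape of the components, which is why throughput for this style of algorithm is naturally reported per pass.
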
FPGA Implementation of Adaptive Non-linear Predictors for Video Compression

The paper describes the implementation of a systolic array for a non-linear predictor for image compression. We can implement very large interconnection layers by using large Xilinx and Altera devices with embedded memories and multipliers alongside the projection used in the systolic architecture. These physical and architectural features create a reusable, flexible, and fast method of designing a complete ANN (Artificial Neural Networks) on FPGAs. Our predictor, a MLP (Multilayer Perceptron) with the topology 12-10-1 and with training on the fly, works, both in recall and learning modes, with a throughput of 50 MHz, reaching the necessary speed for real-time training in video applications.

Rafael Gadea-Girones, Agustín Ramirez-Agundis, Joaquín Cerdá-Boluda, Ricardo Colom-Palero

Reconfigurable and Low-Power Systems (Posters)

Reconfigurable Systems in Education

This paper describes methods and tools that have been used for teaching disciplines dedicated to the design of reconfigurable digital systems. It demonstrates students’ projects, disseminates experience in the integration of different disciplines, and gives examples of stimulating student activity. A set of animated tutorials for students that are available on WebCT with a number of practical projects that cover a variety of topics in FPGA-based design can be seen as the most valuable contribution to the area considered.

Valery Sklyarov, Iouliia Skliarova
Data Dependent Circuit Design: A Case Study

Data dependent circuits are logic circuits specialized to specific input data. They are smaller and faster than the original circuits, although they are not reusable and require circuit generation for each input instance. This study examines data dependent designs for subgraph isomorphism problems, and shows that a simple algorithm is faster than an elaborate algorithm. An algorithm that requires many hardware resources consumes an accordingly longer circuit generation time, which outweighs the performance advantage in execution.

Shoji Yamamoto, Shuichi Ichikawa, Hiroshi Yamamoto
Design of a Power Conscious, Customizable CDMA Receiver

2G wireless systems have become widespread. Because of this, the transition to 3G systems can be critical. A possible solution to the interoperability problem can come from the Software Defined Radio paradigm. In this paper, a complete, reconfigurable CDMA receiver implementation on a Xilinx XCV300E FPGA is described.

Maurizio Martina, Andrea Molino, Mario Nicola, Fabrizio Vacca
Power-Efficient Implementations of Multimedia Applications on Reconfigurable Platforms

The power-efficient implementation of motion estimation algorithms on a system comprising an FPGA and an external memory is presented. Low power consumption is achieved by implementing an optimum on-chip memory hierarchy inside the FPGA, and serving the bulk of the required memory transfers from the internal memory hierarchy instead of the external memory. Comparisons among implementations with and without this optimization prove that great power efficiency is achieved while satisfying performance constraints.

K. Tatas, K. Siozios, D. Soudris, A. Thanailakis

Design Techniques (Posters)

A VHDL Library to Analyse Fault Tolerant Techniques

This work presents an initiative to teach the basics of fault tolerance in digital systems design in undergraduate and graduate courses in electrical and computer engineering. The approach is based on a library of characteristic circuits related to fault tolerance techniques, which has been implemented using a Hardware Description Language (VHDL). Thanks to the properties of the design tools associated with these languages, this approach makes it easy: (1) to implement fault-tolerant digital systems; (2) to determine the behaviour of the system when faults are present; (3) to evaluate, in the laboratory, the additional resources and response time linked to any fault tolerance technique.

P. M. Ortigosa, O. López, R. Estrada, I. García, E. M. Garzón
Hardware Design with a Scripting Language

The Python Hardware Description Language (PyHDL) provides a scripting interface to object-oriented hardware design in C++. PyHDL uses the PamDC and PAM-Blox libraries to generate FPGA circuits. The main advantage of scripting languages is a reduction in development time for high-level designs. We propose a two-step approach: first, use scripting to explore effects of composition and parameterisation; second, convert the scripted designs into compiled components for performance. Our results show that, for small designs, our method offers 5 to 7 times improvement in turnaround time. For a large 10x10 matrix vector multiplier, our method offers respectively 365% and 19% improvement in turnaround time over purely scripting and purely compiled methods.

Per Haglund, Oskar Mencer, Wayne Luk, Benjamin Tai
Testable Clock Routing Architecture for Field Programmable Gate Arrays

This paper describes an efficient methodology for testing dedicated clock lines in Field Programmable Gate Arrays (FPGAs). An H-tree based clocking architecture is proposed along with a test scheme. The H-tree architecture provides optimal clock skew characteristics and consumes at least 25% less of the routing resources when compared to conventional clock routing schemes. A testing scheme is proposed that utilizes the partial reconfiguration capabilities of FPGAs, through selective re-programming of the Complex Logic Blocks, to detect and locate faults in the clock lines.

L. Kalyan Kumar, Amol J. Mupid, Aditya S. Ramani, V. Kamakoti

Neural and Biological Applications (Posters)

FPGA Implementation of Multi-layer Perceptrons for Speech Recognition

In this work we present different hardware implementations of a multi-layer perceptron for speech recognition. The designs have been defined using two different abstraction levels: register transfer level (VHDL) and a higher algorithmic-like level (Handel-C). The implementations have been developed and tested on reconfigurable hardware (FPGA) for embedded systems. A study of the cost (silicon area), speed and required computational resources of the two approaches is presented.

E. M. Ortigosa, P. M. Ortigosa, A. Cañas, E. Ros, R. Agís, J. Ortega
FPGA Based High Density Spiking Neural Network Array

Pulsed neural networks can be applied to the design of dense arrays using minimum hardware resources in the interconnection among neurons. Using statistical saturation in pulse frequency coded neurons, a minimum size hardware neuron can be implemented. The proposed neuron is compact enough to be included in large arrays. The presented architecture has additional interesting characteristics like unrestricted topology and scalability. In this paper, the design and implementation of a high density spiking neural array is presented.

Juan M. Xicotencatl, Miguel Arias-Estrada
FPGA-Based Computation of Free-Form Deformations

This paper describes techniques for producing FPGA-based designs that support free-form deformation in medical image processing. The free-form deformation method is based on a B-spline algorithm for modelling three-dimensional deformable objects. We transform the nested loop in this algorithm to eliminate conditional statements, enabling the development of a fully pipelined design. Further optimisations include precalculation of the B-spline model using lookup tables, and deployment of multiple pipelines so that each covers a different image. Our design description, captured in the Handel-C language, is parameterisable at compile time to support a range of image resolutions and output precisions. An implementation on a Xilinx XC2V6000 device at 67MHz has a throughput which is 12.8 times faster than an Athlon based PC at 1400 MHz.

Jun Jiang, Wayne Luk, Daniel Rueckert
FPGA Implementations of Neural Networks – A Survey of a Decade of Progress

The first successful FPGA implementation [1] of artificial neural networks (ANNs) was published a little over a decade ago. It is timely to review the progress that has been made in this research area. This brief survey provides a taxonomy for classifying FPGA implementations of ANNs. Different implementation techniques and design issues are discussed. Future research trends are also presented.

Jihan Zhu, Peter Sutton

Codesign and Embedded Systems (Posters)

FPGA-Based Hardware/Software CoDesign of an Expert System Shell

This paper presents a new method for implementing expert systems based on belief revision concepts in hardware. The expert system’s knowledge base is first automatically translated to an equivalent network representation where nodes are facts and links stand for relationships. Then, changes are propagated throughout the network. The conclusions are extracted after no more changes occur in the state of the nodes. The automatic generation of the hardware network structure is described. Finally, the results obtained in this FPGA-based implementation are compared to those yielded by a Java-based implementation, thus demonstrating the system’s efficiency.

Aurel Neţin, Dumitru Roman, Octavian Creţ, Kalman Pusztai, Lucia Văcariu
Cluster-Driven Hardware/Software Partitioning and Scheduling Approach for a Reconfigurable Computer System

To achieve a good performance when implementing applications in codesign systems, partitioning and scheduling are important steps. In this paper, a two-phase clustering algorithm is introduced as a preprocessing step to an existing hardware/software partitioning and scheduling system. This preprocessing step increases the granularity in the partition design, resulting in a higher degree of parallelism and a better mapping to the reconfigurable resource. This cluster-driven approach shows improvements in both the makespan of the implementation, and the CPU runtime.

Theerayod Wiangtong, Peter Y. K. Cheung, Wayne Luk
Hardware-Software Codesign in Embedded Asymmetric Cryptography Application – A Case Study

This paper presents a case study of a hardware-software codesign of the RSA cipher embedded in reconfigurable hardware. The soft cores of Altera’s Nios RISC processor are used as the basic building blocks of the proposed complete embedded solutions. The effect of moving computationally intensive parts of RSA into one or more optimized, parameterized, scalable Montgomery coprocessors is analyzed and compared with a pure software solution. The impact of the task distribution between the hardware and the software on the occupation of logic resources as well as the speed of the algorithm is demonstrated and generalized.
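
To make the offloaded operation concrete, the following is a minimal software sketch of Montgomery modular multiplication and the RSA-style exponentiation built on it. It illustrates the general technique only. The function names, the choice R = 2^k with k equal to the bit length of the modulus, and the left-to-right square-and-multiply loop are assumptions for the example and say nothing about how the paper's coprocessor is organised. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n and R = 2**k > n."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # n * n_prime == -1 (mod R)
    r2 = (r * r) % n                # used to convert into Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^-1 mod n for 0 <= a, b < n, using shifts and masks only."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    u = (t + m * n) >> k               # exact division by R
    return u - n if u >= n else u      # single conditional subtraction


def mont_pow(base, exp, n):
    """Compute base**exp mod n (n odd) with Montgomery multiplications."""
    k = n.bit_length()
    n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, k, n_prime)  # base in Montgomery form
    acc = mont_mul(1, r2, n, k, n_prime)       # 1 in Montgomery form
    for bit in bin(exp)[2:]:                   # left-to-right square and multiply
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)     # convert back to normal form
```

The inner `mont_mul` replaces division by the modulus with masking and shifting by a power of two, which is the property that makes it attractive for a hardware coprocessor, while the outer exponentiation loop is the kind of control code that can stay in software.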

Martin Šimka, Viktor Fischer, Miloš Drutarovský
On-chip and Off-chip Real-Time Debugging for Remotely-Accessed Embedded Programmable Systems

Embedded programmable systems are becoming common in system designs, resulting in the need for educational institutions to teach advanced embedded systems design and develop debugging competence in students. Remote laboratory experimentation provided as part of a web-based distance learning allows flexible access to on-campus resources free of time or geographical constraints. However, adapting and redeveloping existing software and hardware resources to this purpose is both time consuming and expensive. This paper introduces a remote-access laboratory architecture, which extends current e-learning strategies to provide real-time debugging for embedded programmable systems via the web. An example experiment illustrates the on-chip and off-chip real-time debugging capabilities of the laboratory.

Jim Harkin, Michael Callaghan, Chris Peters, Thomas M. McGinnity, Liam Maguire

Reconfigurable Systems and Architectures (Posters)

Fast Region Labeling on the Reconfigurable Platform ACE-V

This work will revisit the computer vision application of labeling connected regions in images, and compare results achievable on current configurable architectures with previous work both by our group [1] [2] as well as one of the first attempts targeting the pioneering Splash-2 custom computing machine [3].

Christian Schmidt, Andreas Koch
Modified Fuzzy C-Means Clustering Algorithm for Real-Time Applications

The fuzzy approach in image processing is gaining importance every day. This is largely due to the fact that every new application of artificial vision is closer to human vision. This means that tightly knit algorithms are not always a good solution and a more "imprecise" and fuzzy approach is desirable. This paper describes a modified Fuzzy C-Means algorithm intended to be implemented in hardware. The original algorithm was modified to match the desired level of parallelism and speed, and to simplify the hardware implementation.

Jesús Lázaro, Jagoba Arias, José L. Martín, Carlos Cuadrado
Reconfigurable Hybrid Architecture for Web Applications

This paper describes a Reconfigurable Hybrid Architecture for the development, distribution and execution of web applications with high computational requirements. The Architecture is a layered model based on a hybrid device (standard microprocessor and FPGA), for which a component has been designed and implemented as a web browser plug-in. Web applications are divided into two parts: a standard part and a reconfigurable part. The plug-in links the software and hardware applications, implementing an API for the management of and access to the FPGA. A real implementation of the proposed architecture has been developed using Handel-C, the RC1000-PP platform, a compatible Intel CPU, and a Visual C++ ActiveX control plug-in.

David Rodríguez Lozano, Juan M. Sánchez Pérez, Juan A. Gómez Pulido

DSP Applications (Posters)

FPGA Implementation of the Adaptive Lattice Filter

The paper presents the FPGA implementation of a noise canceler with an adaptive RLS-Lattice filter in Xilinx devices. Since this algorithm requires floating-point computations, the Logarithmic Numbering System (LNS) has been used. The pipelined lattice filter macro and input/output conversion routines have been designed. The implementation results are compared with an implementation on a 32-bit IEEE floating-point signal processor.

Antonín Heřmánek, Zdeněk Pohl, Jiří Kadlec
Specifying Control Logic for DSP Applications in FPGAs

New non-HDL programming models for signal processing in FPGAs have focused primarily on building high-performance data paths. Along with the ability to construct sophisticated custom signal processors come increased requirements for creating complex control circuitry. Recent enhancements to System Generator for DSP begin to address this need by providing mechanisms that include co-simulation interfaces to extend Simulink with HDL semantics, automatic compilation from Matlab m-code into Simulink and VHDL, and embedded microcontrollers. In this paper, we describe how such mechanisms can be used in a QAM receiver designed for a CCSDS standard.

J. Ballagh, J. Hwang, H. Ma, B. Milne, N. Shirazi, V. Singh, J. Stroomer
FPGA Processor for Real-Time Optical Flow Computation

In this work an FPGA-based architecture for optical flow computation in real time is presented. The architecture is based on an algorithm providing a dense and accurate optical flow at an affordable computational cost. The architecture is composed of an array of processors interconnected under a systolic approach. The array of processors is mainly focused on performing matrix operations to speed up the computation of optical flow. The architecture is being prototyped on an FPGA device. Results are presented and discussed.

Selene Maya-Rueda, Miguel Arias-Estrada
A Data Acquisition Reconfigurable Coprocessor for Virtual Instrumentation Applications

Virtual instruments intended for electronic circuit verification arose from the combination of computers supporting advanced graphical interfaces with data acquisition systems providing input/output capabilities. In order to increase the versatility and the operation rate of virtual instruments, we have designed several data acquisition/generation modules based on reconfigurable hardware. In this way, not only the software modules but also the hardware functions are dynamically changed according to the requirements of each specific instrument. The main foundations of the software and hardware levels of reconfigurable virtual instruments are described in this paper. This methodology summarizes our experience in the design of virtual instrumentation platforms oriented to different measurement applications. Finally, a new data acquisition/generation coprocessor based on FPGAs and optimized for the implementation of portable instruments is described.

M. Dolores Valdés, María J. Moure, Camilo Quintáns, Enrique Mandado

Dynamic Reconfiguration (Posters)

Evaluation and Run-Time Optimization of On-chip Communication Structures in Reconfigurable Architectures

With technology improvements, the main bottleneck in terms of performance, power consumption, and design reuse in single chip systems is proving to be generated by the on-chip communication architecture. Benefiting from the non-uniformity of the workload in various signal processing applications, several dynamic power management policies can be envisaged. Nevertheless, the integration of on-line power, performance and information-flow management strategies based on traffic monitoring in (dynamically) reconfigurable templates has yet to be explicitly tackled. The main objective of this work is to define the concept of run-time functional optimization of application specific standard products, and show the importance of integrating such techniques in reconfigurable platforms and especially their communication architectures.

T. Murgan, M. Petrov, A. García Ortiz, R. Ludewig, P. Zipf, T. Hollstein, M. Glesner, B. Oelkrug, J. Brakensiek
A Controlled Data-Path Allocation Model for Dynamic Run-Time Reconfiguration of FPGA Devices

Although methods for dynamic run-time FPGA reconfiguration have been proposed, few address the problems associated with increasing data-path delays due to full or partial reconfiguration. In this paper, a method is proposed that enables specific timing requirements to be maintained within a reconfigurable architecture, by using logic-module partitioning and known-delay interconnection modules. This system allows data-paths of varying widths to be routed effectively between device modules along paths that are fixed in both length and position. Further, the technique may be regarded as extending the Xilinx Modular Design tools methodology to support run-time scenarios.

Dylan Carline, Paul Coulton
Architecture Template and Design Flow to Support Application Parallelism on Reconfigurable Platforms

This paper introduces ReSArT (Reconfigurable Scalable Architecture Template). Based on a suitable design space model, ReSArT is parametrizable, scalable, and able to support all levels of parallelism. To derive architecture instances from the template, a design environment called DEfInE (Design Environment for ReSArT Instance Generation) is used, which integrates some existing academic and industrial tools with ReSArT-specific components, developed as a part of this work. Different architecture instances were tested with a set of 10 benchmark applications as a proof of concept, achieving a maximum degree of parallelism of 30 and an average degree of parallelism of nearly 20 16-bit operations per cycle.

Sergei Sawitzki, Rainer G. Spallek
Efficient Implementation of the Singular Value Decomposition on a Reconfigurable System

We present a new implementation of the singular value decomposition (SVD) on a reconfigurable system built from a Pentium processor and an FPGA board plugged into a PCI slot of the PC. Maximum performance of the SVD is obtained by an efficient distribution of the data and the computation across the FPGA resources. Using the reconfiguration capability of the FPGA helps us implement many operators on the same device.

Christophe Bobda, Klaus Danne, André Linarth

Arithmetic (Posters)

A New Reconfigurable-Oriented Method for Canonical Basis Multiplication over a Class of Finite Fields GF(2^m)

A new method for multiplication in the canonical basis over GF(2^m) generated by an all-one-polynomial (AOP) is introduced. The theoretical complexities of the bit-parallel canonical multiplier constructed using our approach are equal to the smallest ones found in the literature for similar methods, but the multiplier implementation over reconfigurable hardware using our method reduces the area requirements.

José Luis Imaña, Juan Manuel Sánchez
A Study on the Design of Floating-Point Functions in FPGAs

Floating-point operations represent a common task in a variety of applications, but such operations often result in a bottleneck, due to the large number of machine cycles required to compute them. Even though the FPGA community has developed advanced algorithms to improve the speed of FLOPs, floating-point transcendental functions are still underdeveloped. In this paper, we discuss some of the tradeoffs faced when implementing floating-point functions in FPGAs. Techniques such as lookup tables and CORDIC algorithms have been used in the past for the implementation of fixed-point analytic functions. This paper seeks to apply those methods to floating-point functions. The implementation results from different versions of a floating-point sine function are summarized in terms of speed, area, and accuracy to understand the effect of different architectural alternatives.
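
For readers unfamiliar with CORDIC, here is a minimal rotation-mode sketch of a sine computation. It is a generic textbook formulation, not one of the paper's architectures: the function name `cordic_sin`, the default of 24 iterations, and the use of Python floats (where hardware would use shifts and adds on fixed-point or floating-point mantissas) are assumptions for illustration.

```python
import math


def cordic_sin(theta, iterations=24):
    """Approximate sin(theta) for |theta| <= pi/2 with rotation-mode CORDIC."""
    # Small lookup table of elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-divide by the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate towards a residual angle of zero
        # Multiplying by 2^-i corresponds to a right shift in hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y  # x holds the cosine, y the sine
```

The iteration count sets the accuracy and latency, and the angle table sets the memory cost, which mirrors the speed, area, and accuracy tradeoff the abstract refers to.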

Fernando E. Ortiz, John R. Humphrey, James P. Durbano, Dennis W. Prather
Design and Implementation of RNS-Based Adaptive Filters

This paper presents the residue number system (RNS) implementation of reduced-complexity and high-performance adaptive FIR filters on Altera APEX20K field-programmable logic (FPL) devices. Index arithmetic over Galois fields along with the selection of a small-wordwidth modulus set are the keys to attaining low complexity and high throughput. The replacement of a classical modulo adder tree by a binary adder with extended precision followed by a single modulo reduction stage improved area requirements by 10% for a 32-tap FIR filter. A block LMS (BLMS) implementation was preferred for the update of the adaptive FIR filter coefficients. RNS-FPL merged filters demonstrated their superiority when compared to 2C (two’s complement) filters, being about 65% faster and requiring fewer logic elements for most study cases.
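
The basic appeal of RNS arithmetic can be shown in a few lines: a wide operation splits into independent small-wordwidth channels with no carries between them. The sketch below is an illustration under assumptions. The modulus set (7, 11, 13) and the helper names are hypothetical, and it does not cover the index arithmetic over Galois fields or the adder-tree optimisation mentioned in the abstract. It needs Python 3.8 or later for `math.prod` and the modular inverse via `pow`.

```python
from math import prod

MODULI = (7, 11, 13)  # hypothetical pairwise-coprime moduli; dynamic range 1001


def to_rns(x):
    """Encode an integer as its residues modulo each channel modulus."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Channel-wise addition; no carry propagates between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Channel-wise multiplication; each channel only needs a small multiplier."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """Decode with the Chinese Remainder Theorem (valid for 0 <= value < 1001)."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)
    return total % big_m
```

For example, `from_rns(rns_mul(to_rns(23), to_rns(41)))` recovers 943 because the product stays below the dynamic range of 1001. Results beyond that range wrap around, so a real filter has to size its modulus set for the worst-case accumulator value.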

Javier Ramírez, Uwe Meyer-Bäse, Antonio García, Antonio Lloris
Domain-Specific Reconfigurable Array for Distributed Arithmetic

Distributed Arithmetic techniques are widely used to implement Sum-of-Products computations such as calculations found in multimedia applications like FIR filtering and Discrete Cosine Transform. This paper presents a flexible, low-power and high throughput array for implementing distributed arithmetic computations. Flexibility is achieved by using an array of elements arranged in an interconnect mesh similar to those employed in conventional FPGA architectures. We provide results which demonstrate a significant reduction in power consumption in addition to improvements in timing and area over standard FPGA architectures.
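
As background, the following sketch shows the core idea of distributed arithmetic for a sum of products with fixed coefficients: the multipliers are replaced by a lookup table addressed by one bit from each input, followed by shift-and-accumulate. It is a generic illustration, not the array proposed in the paper. The function names are hypothetical and the inputs are assumed to be unsigned for simplicity.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose bit is set in `addr`."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot(lut, inputs, bits):
    """Bit-serial sum of products for unsigned inputs that fit in `bits` bits."""
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one LUT address.
        addr = 0
        for k, x in enumerate(inputs):
            addr |= ((x >> b) & 1) << k
        # The partial sum for this bit plane is weighted by 2^b.
        acc += lut[addr] << b
    return acc
```

With coefficients `[3, -1, 4, 2]` and 4-bit inputs `[5, 7, 2, 9]`, `da_dot(build_da_lut([3, -1, 4, 2]), [5, 7, 2, 9], 4)` gives 34, the same as the direct dot product. The table grows as 2^n in the number of taps, which is one reason practical designs partition long filters into several smaller tables.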

Sami Khawam, Tughrul Arslan, Fred Westall

Design and Implementations 1 (Posters)

Design and Implementation of Priority Queuing Mechanism on FPGA Using Concurrent Periodic EFSMs and Parametric Model Checking

In this paper, we propose a design and implementation method for priority queuing mechanisms on FPGAs. First, we describe the behavior of WFQ (weighted fair queuing) with several parameters in a model called concurrent periodic EFSMs. Then, using a parametric model checking technique, we derive a parameter condition for the concurrent EFSMs to execute their transitions repeatedly without deadlocks in the specified time period under the specified temporal constraints. From the derived parameter condition, we can decide adequate parameter values satisfying the condition, considering the total costs of components. Based on the proposed method, highly reliable and high-performance WFQ circuits for gigabit networks can be synthesized on FPGAs.

Tomoya Kitani, Yoshifumi Takamoto, Isao Naka, Keiichi Yasumoto, Akio Nakata, Teruo Higashino
Custom Tag Computation Circuit for a 10Gbps SCFQ Scheduler

This paper details the architecture and implementation of a tag computation circuit for a Self-Clocked Fair Queuing (SCFQ) Scheduler. The core objective of the presented project is the implementation of a custom accelerator circuit that is optimized to process tag values for terabit router nodes operating at 10 Gbps per link. The system is implemented using FPGA technology and provides extended programmability to adapt the tag computation to a range of custom scheduling schemes.
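
For orientation, the sketch below shows the tag computation as SCFQ is usually defined: an arriving packet's finish tag is its length divided by the flow's weight, added to the larger of the flow's previous finish tag and the tag of the packet currently in service. This is the textbook rule, offered as an assumption about what such a circuit computes. The class name, method names, and the floating-point representation are hypothetical and are not taken from the paper.

```python
class ScfqTagger:
    """Computes SCFQ finish tags; the scheduler serves the smallest tag first."""

    def __init__(self, weights):
        self.weights = dict(weights)  # flow id -> share (weight)
        self.last_finish = {flow: 0.0 for flow in self.weights}
        self.virtual_time = 0.0       # finish tag of the packet in service

    def tag(self, flow, length):
        """Return the finish tag for a packet of `length` arriving on `flow`."""
        start = max(self.last_finish[flow], self.virtual_time)
        finish = start + length / self.weights[flow]
        self.last_finish[flow] = finish
        return finish

    def serve(self, finish_tag):
        """Record that the packet carrying `finish_tag` has entered service."""
        self.virtual_time = finish_tag
```

A short usage example: with weights `{"a": 1.0, "b": 2.0}`, two 100-byte packets arriving on an idle link receive tags 100.0 on flow `a` and 50.0 on flow `b`, so the packet on the more heavily weighted flow is served first. The division and the maximum are the operations a hardware tag unit would have to complete once per packet arrival.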

Brendan McAllister, Sakir Sezer, Ciaran Toal
Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware

One of the most important areas of a network intrusion detection system (NIDS), stateful inspection, is described in this paper. We present a novel reconfigurable hardware architecture implementing the TCP stateful inspection used in NIDSs. The aim is a faster and more efficient network intrusion detection system, as today’s NIDSs are inefficient and can even fail when faced with faster Internet links. The NIDS described is expected to achieve a throughput of 3.0 Gbps.

Shaomeng Li, Jim Tørresen, Oddvar Søråsen
Propose of a Hardware Implementation for Fingerprint Systems

A fingerprint consists of graphical flow-like ridges present on human fingers. Each fingerprint is unique, offering a clear and unambiguous method to identify an individual. The uniqueness of each fingerprint is determined by fine details embedded in its overall structure, named minutiae. Fingerprint classification is CPU-time intensive and is usually implemented in software. This paper presents an alternative way to identify the minutiae in fingerprints, aiming at real-time processing. First, the alternative algorithm is implemented in software (Delphi); then an architecture to be implemented on a configurable device (FPGA) is presented. The performance of this algorithm in hardware and software is analyzed, presenting the time spent within each system block.

Vanderlei Bonato, Rolf Fredi Molz, João Carlos Furtado, Marcos Flôres Ferrão, Fernando G. Moraes

Design and Implementations 2 (Posters)

APPLES: A Full Gate-Timing FPGA-Based Hardware Simulator

Verification of large VLSI digital circuits is primarily accomplished through simulation. In general, there is a trade-off between speed of processing and accuracy. Software simulation tools can be very accurate but are very slow compared to logic accelerators and emulation systems. These latter systems, many of them FPGA based, while two to three orders of magnitude faster than software, deliver inferior timing analysis; in the latter case, and in cycle-based simulation, it is merely equivalent to functional simulation. APPLES (Associative Parallel Processor for Logic Event-driven Simulation) is the first Full Gate-timing Logic Hardware Simulator, implemented in Xilinx Virtex-II technology. APPLES is a true simulator, delivering timing analysis with the accuracy of a software simulator, but has the distinction that processing is executed entirely in hardware devoid of any machine code. This has the potential to permit APPLES to be one to two orders of magnitude faster than equivalent software systems.

Damian Dalton, Vivian Bessler, Jeffery Griffiths, Andrew McCarthy, Abhay Vadher, Rory O’Kane, Rob Quigley, Declan O’Connor
Designing, Scheduling, and Allocating Flexible Arithmetic Components

This paper introduces new scheduling and allocation algorithms for designing with hybrid arithmetic component libraries composed of both operation-specific components and flexible components capable of executing multiple operations. The flexible components are implemented primarily in fixed logic with only small amounts of application-specific reconfigurability, which provides the flexibility needed without the area and performance penalties commonly associated with general-purpose reconfigurable arrays. Results obtained with hybrid library scheduling and allocation on a variety of digital signal processing (DSP) filters reveal that significant area savings are achieved.

Vinu Vijay Kumar, John Lach
UNSHADES-1: An Advanced Tool for In-System Run-Time Hardware Debugging

The aim of a Rapid Prototyping System for electronic circuit design is to obtain a physical model as similar to the final system as the hosting technology allows. Large digital integrated circuits are substituted by complex and advanced Field Programmable Gate Arrays (FPGAs), which emulate the whole circuit functionality. These devices can provide more information than pure circuit emulation: they provide a special scheme to access the device configuration and the execution-time information of the design state registers. This paper describes the UNSHADES-1 system and focuses on the set of software tools that provide easy management of and access to this execution-time information.

M. A. Aguirre, J. N. Tombs, A. Torralba, L. G. Franquelo
Backmatter
Metadata
Title
Field Programmable Logic and Application
Edited by
Peter Y. K. Cheung
George A. Constantinides
Copyright Year
2003
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-45234-8
Print ISBN
978-3-540-40822-2
DOI
https://doi.org/10.1007/b12007