
2005 | Book


Embedded Computer Systems: Architectures, Modeling, and Simulation

5th International Workshop, SAMOS 2005, Samos, Greece, July 18-20, 2005. Proceedings

Edited by: Timo D. Hämäläinen, Andy D. Pimentel, Jarmo Takala, Stamatis Vassiliadis

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

The SAMOS workshop is an international gathering of highly qualified researchers from academia and industry, sharing in a 3-day lively discussion on the quiet and inspiring northern mountainside of the Mediterranean island of Samos. As a tradition, the workshop features workshop presentations in the morning, while after lunch all kinds of informal discussions and nut-cracking gatherings take place. The workshop is unique in the sense that not only solved research problems are presented and discussed but also (partly) unsolved problems and in-depth topical reviews can be unleashed in the scientific arena. Consequently, the workshop provides the participants with an environment where collaboration rather than competition is fostered. The earlier workshops, SAMOS I–IV (2001–2004), were composed only of invited presentations. Due to increasing expressions of interest in the workshop, the Program Committee of SAMOS V decided to open the workshop for all submissions. As a result the SAMOS workshop gained immediate popularity; a total of 114 submitted papers were received for evaluation. The papers came from 24 countries and regions: Austria (1), Belgium (2), Brazil (5), Canada (4), China (12), Cyprus (2), Czech Republic (1), Finland (15), France (6), Germany (8), Greece (5), Hong Kong (2), India (2), Iran (1), Korea (24), The Netherlands (7), Pakistan (1), Poland (2), Spain (2), Sweden (2), Taiwan (1), Turkey (2), UK (2), and USA (5). We are grateful to all of the authors who submitted papers to the workshop.

Table of Contents

Frontmatter

Keynote

Platform Thinking in Embedded Systems

Modern embedded systems are built from microprocessors, domain-specific hardware blocks, means of communication, application-specific sensors and actuators, and a user interface that is as simple as possible and hides the embedded complexity. The design of embedded systems is typically done in an integrated way, with strong dependencies between these building blocks and between different parts of the system. This talk focuses on how platform thinking and engineering can be applied to increasingly complex embedded systems and what impact that will have on design and architectures. Platform engineering in embedded systems may sound contradictory, but in practice it will introduce modularity and stable interfaces. New system-level architectures for hardware, middleware architectures, and certifiable operating system micro-kernels are needed to raise the abstraction level and productivity of design. As an example I will go through the definitions of some modules in a mobile device and the requirements for their interfaces. I will describe the additional design steps, new formal methods and system-level tasks that are needed in the platform approach. Finally, I will review the Advanced Research and Technology for Embedded and Intelligent Systems (ARTEMIS) technology platform in the EU 7th Framework Program, which is bringing together industrial and academic groups to create coherent and integrated European research in the domain of embedded systems.

Bob Iannucci

Reconfigurable System Design and Implementations

Interprocedural Optimization for Dynamic Hardware Configurations

Little research in compiler optimizations has been undertaken to eliminate or diminish the negative influence on performance of the huge reconfiguration latency of the available FPGA platforms. In this paper, we propose an interprocedural optimization that minimizes the number of executed hardware configuration instructions, taking into account constraints such as the "FPGA-area placement conflicts" between the available hardware configurations. The proposed algorithm allows hardware configuration instructions to be anticipated (moved earlier) up to the application's main procedure. The presented results show that our optimization reduces the number of executed hardware configuration instructions by up to 3–5 orders of magnitude.

Elena Moscu Panainte, Koen Bertels, Stamatis Vassiliadis

Reconfigurable Embedded Systems: An Application-Oriented Perspective on Architectures and Design Techniques

Reconfiguration emerged as a key concept to cope with constraints regarding performance, power consumption, design time and costs posed by the growing diversity of application domains. This work gives an overview of several relevant reconfigurable architectures and design techniques developed by the authors in different projects and emphasizes the effective role of reconfigurability in embedded system design.

M. Glesner, H. Hinkelmann, T. Hollstein, L. S. Indrusiak, T. Murgan, A. M. Obeid, M. Petrov, T. Pionteck, P. Zipf

Reconfigurable Multiple Operation Array

In this paper, we investigate the collapsing of eight multi-operand addition related operations into a single, common (3:2) counter array. For this unit we consider multiplication in integer and fractional representations, and the Sum of Absolute Differences (SAD) in unsigned, signed-magnitude and two's complement notation. Furthermore, the unit also incorporates a Multiply-Accumulation unit (MAC) for two's complement notation. The proposed multiple operation unit was constructed around 10 element arrays that can be reduced using well-known counter techniques and that are fed with the necessary data to perform the proposed eight operations. It is estimated that 6/8 of the basic (3:2) counter array is shared by the operations. The results obtained for the presented unit indicate that it is capable of processing a 4x4 SAD macro-block in 36.35 ns and takes 30.43 ns to process the rest of the operations using a VIRTEX II PRO xc2vp100-7ff1696 FPGA device.

Humberto Calderon, Stamatis Vassiliadis
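As a point of reference for the SAD operation named in the abstract above, the following minimal Python sketch computes the sum of absolute differences over a 4x4 block in software. It only illustrates the arithmetic; it does not model the (3:2) counter array or any hardware sharing, and the function name `sad` and the sample blocks are assumptions made for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        # Accumulate |a - b| for every pair of co-located pixels.
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
    return total


# Hypothetical 4x4 macro-blocks (pixel values chosen for illustration only).
current = [
    [10, 12, 14, 16],
    [11, 13, 15, 17],
    [10, 12, 14, 16],
    [11, 13, 15, 17],
]
reference = [
    [9, 12, 15, 16],
    [11, 13, 15, 17],
    [10, 10, 14, 16],
    [11, 13, 15, 20],
]

if __name__ == "__main__":
    print(sad(current, reference))
```

A smaller SAD value means the two blocks are more similar, which is why the operation is widely used for block matching in video coding.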

RAPANUI: Rapid Prototyping for Media Processor Architecture Exploration

This paper describes a new rapid prototyping-based design framework for exploring and validating complex multiprocessor architectures for multimedia applications. The new methodology combines a typical ASIC flow with an FPGA flow focused on rapid prototyping. In order to verify the system architecture exhaustively, a reference model that specifies the hardware implementation is used for validating both the HDL description and the emulated system. Functional coverage, in addition to traditional code coverage, is used to test 100% of the data, control and structural hazards of the system architecture. The reference model is also part of a stand-alone simulation environment. This allows hardware and application development to be supported by a single system model.

Guillermo Payá Vayá, Javier Martín Langerwerf, Peter Pirsch

Data-Driven Regular Reconfigurable Arrays: Design Space Exploration and Mapping

This work presents further enhancements to an environment for exploring coarse grained reconfigurable data-driven array architectures suitable to implement data-stream applications. The environment takes advantage of Java and XML technologies to enable architectural trade-off analysis. The flexibility of the approach to accommodate different topologies and interconnection patterns is shown by a first mapping scheme. Three benchmarks from the DSP scenario, mapped on hexagonal and grid architectures, are used to validate our approach and to establish comparison results.

Ricardo Ferreira, João M. P. Cardoso, Andre Toledo, Horácio C. Neto

Automatic FIR Filter Generation for FPGAs

This paper presents a new tool for the automatic generation of highly parallelized Finite Impulse Response (FIR) filters. In this approach we follow our PARO design methodology. PARO is a design system project for modeling, transformation, optimization, and synthesis of massively parallel VLSI architectures. During the design flow, the FIR filter generator employs the following advanced transformations: (a) hierarchical partitioning, in order to balance the amount of local memory with external communication, and (b) partial localization, to achieve higher throughput and smaller latencies. Furthermore, our filter generator allows for design space exploration to tackle trade-offs in cost and speed. Finally, synthesizable VHDL code is generated and mapped to an FPGA, and the results are compared with a commercial filter generator.

Holger Ruckdeschel, Hritam Dutta, Frank Hannig, Jürgen Teich

Two-Dimensional Fast Cosine Transform for Vector-STA Architectures

A vector algorithm for computing the two-dimensional Discrete Cosine Transform (2D-VDCT) is presented. The formulation of 2D-VDCT is stated under the framework provided by elements of multilinear algebra. This algebraic framework provides not only a formalism for describing the 2D-VDCT, but it also enables the derivation by pure algebraic manipulations of an algorithm that is well suited to be implemented in SIMD-vector signal processors with a scalable level of parallelism. The 2D-VDCT algorithm can be implemented in a matrix oriented language and a suitable compiler generates code for our family of STA (Synchronous Transfer Architecture) vector architectures with different amounts of SIMD-parallelism. We show in this paper how important speedup factors are achieved by this methodology.

J. P. Robelly, A. Lehmann, G. Fettweis

Configurable Computing for High-Security/High-Performance Ambient Systems

This paper stresses why configurable computing is a promising target to guarantee the hardware security of ambient systems. Many works have focused on configurable computing to demonstrate its efficiency but, as far as we know, none have addressed the security issue from system to circuit levels. This paper recalls the main hardware attacks before focusing on the issues involved in building secure systems on configurable computing. Two complementary views are presented to provide a guide for security, and the main issues in making them a reality are discussed. As security at the system and architecture levels is enforced by agility, significant aspects related to that point are presented and illustrated through the AES algorithm. The goal of this paper is to make designers aware that configurable computing is not just a hardware accelerator for security primitives, as most studies have assumed, but a real solution for providing high security and high performance for the whole system.

Guy Gogniat, Wayne Burleson, Lilian Bossuet

FPL-3E: Towards Language Support for Reconfigurable Packet Processing

The FPL-3e packet filtering language incorporates explicit support for reconfigurable hardware into the language. FPL-3e supports not only generic header-based filtering, but also more demanding tasks such as payload scanning and packet replication. By automatically instantiating hardware units (based on a heuristic evaluation) to process the incoming traffic in real-time, the NIC-FLEX network monitoring architecture facilitates very high speed packet processing. Results show that NIC-FLEX can perform complex processing at gigabit speeds. The proposed framework can be used to execute such diverse tasks as load balancing, traffic monitoring, firewalling and intrusion detection directly at the critical high-bandwidth links (e.g., in enterprise gateways).

Mihai Lucian Cristea, Claudiu Zissulescu, Ed Deprettere, Herbert Bos

Processor Architectures, Design and Simulation

Flux Caches: What Are They and Are They Useful?

In this paper, we introduce the concept of flux caches, envisioned to improve processor performance by dynamically changing the cache organization and implementation. Contrary to traditional approaches, processors designed with flux caches do not assume a hardwired cache organization but change their cache "design" on program demand. Consequently, the dynamic behavior of the program (data and instructions) determines the cache hardware design. Experimental results that confirm the potential of flux caches are also presented.

Georgi N. Gaydadjiev, Stamatis Vassiliadis

First-Level Instruction Cache Design for Reducing Dynamic Energy Consumption

Microarchitects should consider energy consumption, together with performance, when designing an instruction cache architecture, especially in embedded processors. This paper proposes a power-aware instruction cache architecture, named Partitioned Instruction Cache (PI-Cache), to reduce dynamic energy consumption in the instruction cache. The proposed PI-Cache is composed of several small sub-caches. When the PI-Cache is accessed, only one sub-cache is accessed by utilizing the locality of applications. In the meantime, the other sub-caches are not accessed, resulting in dynamic energy reduction. The PI-Cache also reduces energy consumption by eliminating the energy consumed in tag matching. Moreover, the performance loss is small, considering the physical cache access time. We evaluated the energy efficiency by running a cycle-accurate simulator, SimpleScalar, with power parameters obtained from CACTI. Simulation results show that the PI-Cache reduces dynamic energy consumption by 42%–59%.

Cheol Hong Kim, Sunghoon Shim, Jong Wook Kwak, Sung Woo Chung, Chu Shik Jhon

A Novel JAVA Processor for Embedded Devices

As a result of its object-oriented (OO) features and the corresponding advantages of security, robustness and platform independence, Java is widely applied in embedded devices. However, in current Java execution engines, whether implemented in software or hardware, the overhead of executing OO related bytecodes is costly and has a great impact on the overall performance of Java applications, especially in embedded devices, where real-time operation and low power consumption are required with limited memory. To solve this problem, a novel Java processor architecture called jHISC is proposed, in which the OO related bytecodes are supported directly in hardware. In jHISC, an object is represented by a hardware-readable data structure, the object context, which makes it possible to implement complex OO related bytecodes at hardware level and to access some fields of an object in parallel to improve the execution speed. jHISC mainly targets J2ME and implements about 93% of bytecodes and 83% of OO related bytecodes directly in hardware, and the OO related operations are executed much faster in jHISC than by software traps.

Yiyu Tan, Chihang Yau, Kaiman Lo, Paklun Mok, Anthony S. Fong

Formal Specification of a Protocol Processor

Ensuring the correctness of the functional and temporal properties of modern network hardware devices is becoming increasingly challenging because of growing complexity and demanding time-to-market requirements. In this paper we address the problem by deriving a TACO protocol processor model in the formal framework of Timed Action Systems. Formal methods offer a prominent approach to specify, design, and verify such devices with the benefits of a rigorous mathematical basis. The derivation demonstrates the capability of preserving correctness when considering an important hardware design decision.

Tomi Westerlund, Juha Plosila

Tuning a Protocol Processor Architecture Towards DSP Operations

In this paper we present an experiment in enhancing our transport triggered protocol processor hardware platform to support DSP applications. Our focus is on integrating support for both application domains into a single processor without loss of performance in either domain. Such a processor could be taken advantage of in applications like Voice-over-IP communication using hand-held devices, where functionality is needed from both domains. As our first step in bridging the gap between the protocol processing and DSP domains we implement support for FIR filtering. We analyze four different architectural instances for implementing FIR filters according to their performance and bus utilisation. We were able to determine that protocol processing and DSP operations can be executed in parallel very efficiently. The implementations were verified with VHDL simulations and synthesis using 0.18 µm CMOS technology.

Jani Paakkulainen, Seppo Virtanen, Jouni Isoaho
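For readers unfamiliar with the FIR filtering mentioned in the abstract above, the textbook direct-form computation is y[n] = Σ h[k]·x[n−k]. The following Python sketch shows that computation only; it is not one of the four architectural instances analyzed in the paper, and the function name `fir_filter` and the example coefficients are assumptions for illustration.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * x[n - k].

    Samples before the start of the input are treated as zero.
    """
    if not taps:
        raise ValueError("at least one filter tap is required")
    # Delay line holding the most recent len(taps) input samples, newest first.
    history = [0.0] * len(taps)
    output = []
    for x in samples:
        # Shift the delay line by one position and insert the new sample.
        history = [x] + history[:-1]
        # One multiply-accumulate per tap.
        output.append(sum(h * s for h, s in zip(taps, history)))
    return output


if __name__ == "__main__":
    # Feeding a unit impulse returns the filter coefficients themselves.
    print(fir_filter([1, 0, 0, 0], [0.5, 0.25, 0.25]))
```

The inner multiply-accumulate loop is the part that a DSP-oriented functional unit would typically accelerate.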

Observations on Power-Efficiency Trends in Mobile Communication Devices

Computing solutions used in mobile communications equipment are essentially the same as those in personal and mainframe computers. The key differences between the implementations are found at the chip level: in mobile devices low leakage silicon technology and lower clock frequency are used. So far, the improvements of the silicon processes in mobile phones have been exploited by software designers to increase functionality and to cut development time, while usage times, and energy efficiency, have been kept at levels that satisfy the customers. In this paper, we explain some of the observed developments.

Olli Silvén, Kari Jyrkkä

CORDIC-Augmented Sandbridge Processor for Channel Equalization

In this paper we analyze an architectural extension for a Sandbridge processor which encompasses a CORDIC functional unit and the associated instructions. Specifically, the first instruction is CFG_CORDIC, which configures the CORDIC unit in one of the rotation and vectoring modes for circular, linear, and hyperbolic coordinate systems. The second instruction is RUN_CORDIC, which launches CORDIC operations into execution. As a case study, we consider channel estimation and correction in Orthogonal Frequency Division Multiplexing (OFDM) demodulation. In particular, we propose a scheme to implement OFDM channel correction within the extended instruction set. Preliminary results indicate a performance improvement over the base instruction set architecture of more than 80% for channel correction, which translates to an improvement of 50% for the entire channel estimation and correction task.

Mihai Sima, John Glossner, Daniel Iancu, Hua Ye, Andrei Iancu, A. Joseph Hoane
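To make the rotation mode in the circular coordinate system more concrete, here is a minimal floating-point Python sketch of the classic CORDIC iteration. It is not the Sandbridge functional unit and does not model CFG_CORDIC or RUN_CORDIC; the function name `cordic_rotate`, the iteration count, and the use of floating point instead of fixed-point shifts are all assumptions made for illustration.

```python
import math


def cordic_rotate(x, y, angle, iterations=24):
    """Rotate the vector (x, y) by `angle` radians with circular CORDIC.

    The basic iteration converges for |angle| up to roughly 1.74 rad.
    In hardware the multiplications by 2**-i would be arithmetic shifts.
    """
    # Elementary rotation angles atan(2^-i) and the accumulated gain.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate towards zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]

    # Compensate for the constant CORDIC gain.
    return x / gain, y / gain


if __name__ == "__main__":
    cos_val, sin_val = cordic_rotate(1.0, 0.0, math.pi / 6)
    print(cos_val, sin_val)
```

Rotating the unit vector (1, 0) yields approximations of the cosine and sine of the angle, which is the kind of operation a phase correction step needs.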

Power-Aware Branch Logic: A Hardware Based Technique for Filtering Access to Branch Logic

In this paper, we propose a power-aware branch logic for high performance embedded processors that filters accesses to the BTB and the branch predictor. To reduce the energy consumed in the BTB and the branch predictor, we present an aggressive hardware-based scheme that reduces the number of accesses to both structures. Moreover, compared with general branch logic, the proposed branch logic has no performance degradation. This scheme reduces the number of accesses to the BTB and the branch predictor by 21%–50% and reduces the energy consumption in the BTB and the branch predictor by 15%–41%.

Sunghoon Shim, Jong Wook Kwak, Cheol Hong Kim, Sung Tae Jhang, Chu Shik Jhon

Exploiting Intra-function Correlation with the Global History Stack

The demand for more computation power in high-end embedded systems has put embedded processors on an evolution track parallel to that of RISC processors. Caches and deeper pipelines are standard features on recent embedded microprocessors. As a result, some of the performance penalties associated with branch instructions in RISC processors are becoming more prevalent in these processors. As is the case in RISC architectures, designers have turned to dynamic branch prediction to alleviate this problem. Global correlating branch predictors take advantage of the influence past branches have on future ones. The conditional branch outcomes are recorded in a global history register (GHR). Based on the hypothesis that most correlation is among intra-function branches, we provide a detailed analysis of the Global History Stack (GHS) in this paper. The GHS saves the global history in the return address stack when a call instruction is executed. Following the subsequent return, the history is restored from the stack. In addition, to preserve the correlation between the callee branches and the caller branches following the call instruction, we save a few of the history bits coming from the end of the callee's execution. We also investigate saving the GHR of a function in the Branch Target Buffer (BTB) when it returns so that it can be restored when that function is called again. Our results show that these techniques improve the accuracy of several global history based prediction schemes by 4% on average. Consequently, performance improvements as high as 13% are attained.

Fei Gao, Suleyman Sair
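The save-and-restore behavior described in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `GlobalHistoryStack`, the parameters `history_bits` and `carry_bits`, and the exact way the callee's last bits are merged into the restored history are hypothetical, and a plain list stands in for the return address stack used in the paper.

```python
class GlobalHistoryStack:
    """Toy model of a global history register (GHR) saved across calls."""

    def __init__(self, history_bits=8, carry_bits=2):
        self.mask = (1 << history_bits) - 1
        self.carry_bits = carry_bits  # callee bits kept after a return
        self.ghr = 0
        self.stack = []  # stands in for the return address stack

    def on_branch(self, taken):
        # Shift the newest conditional branch outcome into the GHR.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

    def on_call(self):
        # Save the caller's history when a call instruction executes.
        self.stack.append(self.ghr)

    def on_return(self):
        if not self.stack:
            return  # unmatched return: leave the GHR unchanged
        # Keep the most recent callee outcomes, then restore the caller's
        # history shifted left to make room for them.
        callee_tail = self.ghr & ((1 << self.carry_bits) - 1)
        caller_history = self.stack.pop()
        self.ghr = ((caller_history << self.carry_bits) | callee_tail) & self.mask


if __name__ == "__main__":
    ghs = GlobalHistoryStack(history_bits=8, carry_bits=2)
    for outcome in (True, False, True):  # caller branches
        ghs.on_branch(outcome)
    ghs.on_call()
    for outcome in (True, True, True, False):  # callee branches
        ghs.on_branch(outcome)
    ghs.on_return()
    print(format(ghs.ghr, "08b"))
```

Setting `carry_bits` to 0 in this sketch restores the caller's history unchanged, which corresponds to the plain save-and-restore variant without the callee's trailing bits.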

Power Efficient Instruction Caches for Embedded Systems

Instruction caches typically consume 27% of the total power in modern high-end embedded systems. We propose a compiler-managed instruction store architecture (K-store) that places the computation-intensive loops in a scratch-pad-like SRAM memory and allocates the remaining instructions to a regular instruction cache. At runtime, execution is switched dynamically between the instructions in the traditional instruction cache and the ones in the K-store by inserting jump instructions. The necessary jump instructions add 0.038% on average to the total dynamic instruction count. We compare the performance and energy consumption of our K-store with those of a conventional instruction cache of equal size. When used in lieu of an 8 KB, 4-way associative instruction cache, K-store provides a 32% reduction in energy and a 7% reduction in execution time. Unlike loop caches, K-store maps the frequent code into a reserved address space and hence can switch between the kernel memory and the instruction cache without any noticeable performance penalty.

Dinesh C. Suresh, Walid A. Najjar, Jun Yang

Micro-architecture Performance Estimation by Formula

An analytical performance model for out-of-order issue superscalar microprocessors is presented. This model quantifies the performance impacts of micro-architecture design options including memory hierarchy, branch prediction, issue width and changes in pipeline depth at all pipeline stages. The model requires a minimal number of cycle-accurate and trace-driven simulations to calibrate and, once calibrated, estimates performance by formula. The model estimates the performance of arbitrary micro-architecture configurations with an average error of 6.4%. During early design stages, when cycle-accurate simulation is prohibitive, an analytical model can provide guidance to designers to increase design quality and reduce design effort. This allows the design of an embedded processor to be rapidly tuned to its application by reducing the cost of exploring the design space.

Lucanus J. Simonson, Lei He

Offline Phase Analysis and Optimization for Multi-configuration Processors

Energy consumption has become a major issue for modern microprocessors. In previous work, several techniques were presented to reduce the overall energy consumption by dynamically adapting various hardware structures. Most approaches however lack the ability to deal efficiently with the huge amount of possible hardware configurations in case of multiple adaptive structures. In this paper, we present a framework that is able to deal with this huge configuration space problem. We first identify phases through profiling and determine the optimal hardware configuration per phase using an efficient offline search algorithm. During program execution, we inspect the phase behavior and adapt the hardware on a per-phase basis. This paper also proposes a new phase classification scheme as well as a phase correspondence metric to quantify the phase similarity between different runs of a program. Using SPEC2000 benchmarks, we show that our adaptive processing framework achieves an energy reduction of 40% on average with an average performance degradation of only 2%.

Frederik Vandeputte, Lieven Eeckhout, Koen De Bosschere

Hardware Cost Estimation for Application-Specific Processor Design

In this paper, a methodology for estimating area, energy consumption and execution time of an application executed on a specified processor is proposed. In addition, a design exploration process to find suitable processor architectures for a specific application is proposed. Cost and performance estimation is an important part of the exploration process. The actual cost estimation is based on predefined characterizations of cost and performance of resources stored in a database. The results show that the method is quick and its accuracy is sufficient for design space exploration.

Teemu Pitkänen, Tommi Rantanen, Andrea Cilio, Jarmo Takala

Ultra Fast Cycle-Accurate Compiled Emulation of Inorder Pipelined Architectures

Emulation of one architecture on another is useful when the architecture is under design, when software must be ported to a new platform or is being developed for systems which are still under development, or for embedded systems that have insufficient resources to support the software development process. Emulation using an interpreter is typically slower than normal execution by up to 3 orders of magnitude. Our approach instead translates the program from the original architecture to another architecture while faithfully preserving its semantics at the lowest level. The emulation speeds are comparable to, and often faster than, programs running on the original architecture. Partial evaluation of architectural features is used to achieve such impressive performance, while permitting accurate statistics collection. Accuracy is at the level of the number of clock cycles spent executing each instruction (hence the description cycle-accurate).

Stefan Farfeleder, Andreas Krall, Nigel Horspool

Generating Stream Based Code from Plain C

The Stream model is a high level Intermediate Representation that can be mapped to a range of parallel architectures. The Stream model has a limited scope because it is aimed at architectures that reduce the control overhead of programmable hardware to improve the overall computing efficiency. While it has its limitations, the performance critical parts of embedded and media applications can often be compiled to this model. The automatic compilation to Stream programs from C code is demonstrated.

Marcel Beemster, Hans van Someren, Liam Fitzpatrick, Ruben van Royen

Fast Real-Time Job Selection with Resource Constraints Under Earliest Deadline First

The Stack Resource Policy (SRP) is a real-time synchronization protocol suitable for embedded systems because of its simplicity. However, if SRP is applied to dynamic priority scheduling, the runtime overhead of job selection algorithms could affect the performance of the system seriously. To solve the problem, a job selection algorithm was proposed that uses a selection tree as a scheduling queue structure. The proposed algorithm selects a job in O(⌈log₂ n⌉) time, resulting in a significant reduction in the run-time overhead of the scheduler. In this paper, the correctness of the job selection algorithm is presented. Also, the job selection algorithm was implemented in a GSM/GPRS handset with an ARM7 processor to see its effectiveness on embedded systems. The experiments performed on the system show that the proposed algorithm can further utilize the processor by reducing the scheduling overhead.

Sangchul Han, Moonju Park, Yookun Cho

A Programming Model for an Embedded Media Processing Architecture

To follow the rapid evolution of media processing algorithms, the latest media processing architecture enhances the execution efficiency of media applications by adding a programmable vision processor and by improving the memory hierarchy, but this complicates programming. In this paper, the features of this architecture are analyzed, the reasons why media applications implemented with a general programming model are inefficient are studied, and the SPUR programming model is proposed. In SPUR, media data and operations are expressed naturally as media streams and corresponding operations. Moreover, an algorithm is divided into a high-level part written in SP-C and a low-level part written in UR-C. Fine-grained data parallelism is exploited explicitly as well. Experimental results show that SPUR provides programmers with a novel, expressive and efficient way of programming, and clearly improves the readability, robustness, development efficiency and object-code quality of media applications.

Dan Zhang, Zeng-Zhi Li, Hong Song, Long Liu

Automatic ADL-Based Assembler Generation for ASIP Programming Support

Systems-on-Chip (SoCs) may be built upon general purpose CPUs or application-specific instruction-set processors (ASIPs). On the one hand, ASIPs allow a tradeoff between flexibility, performance and energy efficiency. On the other hand, since an ASIP is not a standard component, embedded software code generation cannot rely on pre-existent tools. Each ASIP requires a distinct toolkit. To cope with time-to-market pressure, automatic toolkit generation is required. Architecture description languages (ADLs) are the ideal starting point for such automation. This paper presents robust and efficient techniques to automatically generate a couple of tools (assembler and pre-processor) from the ADL description of a given target processor. Tool robustness results from formal techniques based on context-free grammars. Tool efficiency evidence is provided by experiments targeting three CPUs: MIPS, PowerPC 405 and PIC 16F84.

Leonardo Taglietti, Jose O. Carlomagno Filho, Daniel C. Casarotto, Olinto J. V. Furtado, Luiz C. V. dos Santos

Sandbridge Software Tools

We describe the generation of the simulation environment for the Sandbridge Sandblaster multithreaded processor. The processor model is described using the Sandblaster Architecture Description Language (SaDL), which is implemented as Python objects. Specific processor implementations of the simulation environment are generated by calling the Python objects. Using just-in-time compiler technology, we dynamically compile an executing program and processor model to a target platform, providing fast interactive responses with accelerated simulation capability. Using this approach, we simulate up to 100 million instructions per second on a 1 GHz Pentium processor. This allows the system programmer to prototype many applications in real-time within the simulation environment, providing a dramatic increase in productivity and allowing flexible hardware-software trade-offs.

John Glossner, Sean Dorward, Sanjay Jinturkar, Mayan Moudgill, Erdem Hokenek, Michael Schulte, Stamatis Vassiliadis

Architectures and Implementations

A Hardware Accelerator for Controlling Access to Multiple-Unit Resources in Safety/Time-Critical Systems

In multitasking, priority-driven systems, resource access-control protocols such as Priority Ceiling Protocol (PCP) reduce the undesirable effects of resource contention. In general, software implementation of these protocols entails costly computations that can degrade the system performance to unacceptable levels. In this paper, we present the design for a hardware-accelerator to execute the PCP functionality for controlling access to multiple-unit resources and illustrate that the proposed implementation accelerates the execution time by a factor of up to 30.

Philippe Marchand, Purnendu Sinha

Pattern Matching Acceleration for Network Intrusion Detection Systems

Pattern matching is one of the critical parts of Network Intrusion Detection Systems (NIDS), and it is computationally intensive. To handle an increasing number of attack signature patterns, a NIDS requires a multi-pattern matching method that can meet the line speed of packet transfer. The multi-pattern matching method should efficiently handle a large number of patterns with a wide range of pattern lengths, as well as case-insensitive pattern matches. It should also be able to process multiple input characters in parallel. In this paper, we propose a multi-pattern matching hardware accelerator based on the Shift-OR pattern matching algorithm. We evaluate the performance of the pattern matching accelerator under various assumptions. The performance evaluation shows that the pattern matching accelerator can be more than 80 times faster than the fastest software multi-pattern matching method used in Snort, a widely used open-source NIDS.

Sunil Kim
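To illustrate the Shift-OR algorithm named in the abstract above, the following is a minimal software sketch for a single pattern. It is not the paper's hardware design or its multi-pattern extension; the function names `build_masks` and `shift_or_search` and the `case_insensitive` flag are assumptions made for this example.

```python
def build_masks(pattern, case_insensitive=False):
    """Build one bit mask per character: bit i is 0 where pattern[i] matches."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    masks = {}
    for i, ch in enumerate(pattern):
        # For case-insensitive matching, both cases clear the same bit.
        chars = {ch.lower(), ch.upper()} if case_insensitive else {ch}
        for c in chars:
            masks[c] = masks.get(c, all_ones) & ~(1 << i)
    return masks, all_ones


def shift_or_search(text, pattern, case_insensitive=False):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    m = len(pattern)
    masks, all_ones = build_masks(pattern, case_insensitive)
    state = all_ones  # bit i == 0 means pattern[0..i] matches the text ending here
    hits = []
    for pos, ch in enumerate(text):
        # Shift in a 0 (a new candidate match), then OR in the character mask.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_or_search("abracadabra", "abra"))
    print(shift_or_search("GET /etc/PASSWD", "passwd", case_insensitive=True))
```

Each input character costs one shift, one OR, and one table lookup regardless of pattern length, which is the kind of regular, bit-parallel work that maps naturally onto hardware.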

Real-Time Stereo Vision on a Reconfigurable System

Real-time three-dimensional vision would support various applications, including a passive system for collision avoidance. It is a good alternative to active systems, which are subject to interference in noisy environments. In this paper, we investigate the optimization of real-time stereo vision with respect to resource usage. Correlation techniques using a simple sum of absolute differences (SAD) are popular and perform well. However, processing even a small image takes seconds. To provide depth maps at a frame rate of around 30 fps, which typical cameras can provide, hardware acceleration is necessary. Regular structures, linear data flow, and abundant parallelism make the correlation algorithm a good candidate for reconfigurable hardware. We implemented versions of SAD algorithms in VHDL and synthesized them to determine resource requirements and performance. By decomposing a SAD correlator into a column SAD calculator and a row SAD calculator with buffers in between, we showed around 50% savings in resource usage. By altering the shape of correlation windows, we found that a ‘short and wide’ rectangular window reduced storage requirements without sacrificing quality compared to a square one.

SungHwan Lee, Jongsu Yi, JunSeong Kim
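The column/row decomposition mentioned in the abstract above can be sketched in software as follows. This is only one plausible reading of the idea (vertical column sums first, then a sliding horizontal sum over those column sums), not the authors' VHDL design; the helper names `abs_diff`, `sad_direct`, and `sad_decomposed` are assumptions made for this example.

```python
def abs_diff(left, right, d):
    """Per-pixel absolute difference between left and right shifted by disparity d."""
    cols = len(left[0])
    return [[abs(lrow[x] - rrow[x - d]) for x in range(d, cols)]
            for lrow, rrow in zip(left, right)]


def sad_direct(diff, h, w):
    """Reference version: sum every h-by-w window from scratch."""
    rows, cols = len(diff), len(diff[0])
    return [[sum(diff[y + i][x + j] for i in range(h) for j in range(w))
             for x in range(cols - w + 1)]
            for y in range(rows - h + 1)]


def sad_decomposed(diff, h, w):
    """Column stage followed by row stage, reusing partial sums."""
    rows, cols = len(diff), len(diff[0])
    # Column stage: for each window row position, sum h pixels vertically.
    col = [[sum(diff[y + i][x] for i in range(h)) for x in range(cols)]
           for y in range(rows - h + 1)]
    out = []
    for line in col:
        # Row stage: slide a width-w window, adding the entering column sum
        # and subtracting the leaving one.
        acc = sum(line[:w])
        row = [acc]
        for x in range(w, cols):
            acc += line[x] - line[x - w]
            row.append(acc)
        out.append(row)
    return out


if __name__ == "__main__":
    left = [[(3 * y + 5 * x) % 7 for x in range(8)] for y in range(5)]
    right = [[(2 * y + 4 * x) % 5 for x in range(8)] for y in range(5)]
    diff = abs_diff(left, right, d=1)
    print(sad_decomposed(diff, h=2, w=4) == sad_direct(diff, h=2, w=4))
```

In this sketch a window height of `h` determines how many image rows feed the column stage, which is consistent with the abstract's observation that a short, wide window needs less storage than a square one.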

Application of Very Fast Simulated Reannealing (VFSR) to Low Power Design

This paper addresses the problem of optimal supply and threshold voltage selection with device sizing by minimizing power consumption and maximizing battery charge capacitance using Very Fast Simulated Reannealing (VFSR). We assume that multiple supply voltages and multiple threshold voltage devices are available at gate level. Minimizing power consumption does not necessarily maximize battery charge capacitance. This paper achieves this by implementing both objectives in the cost function.

Ali Manzak, Huseyin Goksu

Compressed Swapping for NAND Flash Memory Based Embedded Systems

A swapping algorithm for NAND flash memory based embedded systems is developed by combining data compression and an improved page update method. The developed method allows efficient execution of memory-demanding or multiple applications without requiring a large main memory. It also helps enhance the stability of a NAND flash file system by reducing the number of writes. The update algorithm is based on the CFLRU (Clean First LRU) method and employs additional features such as selective compression and delayed swapping. The WKdm compression algorithm is used for software-based compression, while LZO is used for the hardware-based implementation. The proposed method is implemented on an ARM9 CPU based Linux system, and its performance in the execution of MPEG2 decoder, encoder, and gcc programs is measured and interpreted.

Sangduck Park, Hyunjin Lim, Hoseok Chang, Wonyong Sung
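The basic Clean First LRU policy that the abstract above builds on can be sketched as follows. This covers only plain CFLRU as commonly described (prefer evicting clean pages near the LRU end, because evicting a dirty page costs a flash write); it does not model the paper's selective compression or delayed swapping. The class name `CFLRUCache`, the `window` parameter, and the `flash_writes` counter are assumptions made for this example.

```python
from collections import OrderedDict


class CFLRUCache:
    """Page cache with clean-first eviction inside a window at the LRU end."""

    def __init__(self, capacity, window):
        self.capacity = capacity
        self.window = window          # size of the clean-first region
        self.pages = OrderedDict()    # page id -> dirty flag, LRU entry first
        self.flash_writes = 0         # dirty evictions that must be written back

    def access(self, page, write=False):
        if page in self.pages:
            # Hit: keep the page dirty if it was ever written.
            dirty = self.pages.pop(page) or write
        else:
            if len(self.pages) >= self.capacity:
                self._evict()
            dirty = write
        self.pages[page] = dirty      # reinsert at the MRU end

    def _evict(self):
        # Look for a clean page among the `window` least recently used pages.
        for page, dirty in list(self.pages.items())[:self.window]:
            if not dirty:
                del self.pages[page]
                return page
        # No clean candidate: fall back to plain LRU and pay for a write.
        victim, _ = self.pages.popitem(last=False)
        self.flash_writes += 1
        return victim


if __name__ == "__main__":
    cache = CFLRUCache(capacity=3, window=2)
    cache.access(1, write=True)
    cache.access(2)
    cache.access(3)
    cache.access(4)   # evicts clean page 2 instead of dirty LRU page 1
    print(list(cache.pages), cache.flash_writes)
```

A larger `window` saves more flash writes but evicts clean pages that may still be in use, so in practice the window size is a tuning parameter.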

A Radix-8 Multiplier Design and Its Extension for Efficient Implementation of Imaging Algorithms

In our previous work, general principles to develop efficient architectures for matrix-vector arithmetics and video/image processing were proposed based on high-radix (4, 8, or 16) multiplier extensions. In this work, we propose a radix-8 multiplier design and its extension to Multifunctional Architecture for Video and Image Processing (MAVIP). MAVIP may operate either as a programmable unit with DSP-specific operations such as multiplication, multiply-accumulate, parallel addition or as one or another HWA such as matrix-vector multiplier, FIR filter, or sum-of-absolute-difference accelerator. Simulations indicate that being a small device, MAVIP has competitive performance in video coding.

David Guevorkian, Petri Liuha, Aki Launiainen, Konsta Punkka, Ville Lappalainen

A Scalable Embedded JPEG2000 Architecture

It takes more than a good tool to shorten the time-to-market window: the scalability of a design also plays an important role in rapid prototyping if it needs to satisfy various demands. The design of JPEG2000 is such a case. As the latest compression standard for still images, JPEG2000 is well tuned for diverse applications, which places different throughput requirements on its constituent blocks. In this paper, a scalable embedded JPEG2000 encoder architecture is presented and prototyped on a Xilinx FPGA. The system-level design presents dynamic profiling outcomes, proving the necessity of designing for scalability.

Chunhui Zhang, Yun Long, Fadi Kurdahi

A Routing Paradigm with Novel Resources Estimation and Routability Models for X-Architecture Based Physical Design

The number of transistors on a single chip has been growing in line with Moore’s Law. To cope with dense chip design for VLSI systems, a new routing paradigm, called X-Architecture, is introduced. In this paper, we present novel resources estimation and routability models for standard cell global routing in X-Architecture. By using these models, we route the chip with a compensation-based convergent approach, called COCO, in which a random sub-tree growing (RSG) heuristic is used to construct and refine routing trees within several iterations. The router has been implemented and tested on MCNC and ISPD’98 benchmarks and some industrial circuits. The experimental results are compared with two typical existing routers (labyrinth and SSTT). They indicate that our router can reduce the total wire length and overflow by more than 10% and 80% on average, respectively.

Yu Hu, Tong Jing, Xianlong Hong, Xiaodong Hu, Guiying Yan

Benchmarking Mesh and Hierarchical Bus Networks in System-on-Chip Context

A simulation-based comparison scheme for on-chip communication networks is presented. Performance of the network depends heavily on the application, and therefore several test cases are required. In this paper, a generic synthesizable 2-dimensional mesh and a hierarchical bus, which is an extended version of a single bus, are benchmarked in a SoC context with five parameterizable test cases. The results show that the hierarchical bus offers a good trade-off between performance and area. In the presented test cases, a 2-dimensional mesh offers a speedup of 1.1x – 3.3x over the hierarchical bus, but its area overhead is 2.3x – 3.4x, which is larger than the performance improvement.

Erno Salminen, Tero Kangas, Jouni Riihimäki, Vesa Lahtinen, Kimmo Kuusilinna, Timo D. Hämäläinen

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor

High-end microprocessors achieve their performance as a result of adding more features and therefore increasing their complexity. In this paper we present DDM-CMP, a Chip-Multiprocessor using the Data-Driven Multithreading execution model.

As a proof-of-concept we present a DDM-CMP configuration with the same hardware budget as a high-end processor. In that budget we implement four simpler CPUs, the TSUs, and the interconnection network. An estimation of DDM-CMP performance for the execution of SPLASH-2 kernels shows that, for the same clock frequency, DDM-CMP achieves a speedup of 2.6 to 7.6 compared to the high-end processor. A lower frequency configuration, which is more power-efficient, still achieves high speedup (1.1 to 3.3). These encouraging results lead us to believe that the proposed architecture has a significant benefit over traditional designs.

Kyriakos Stavrou, Paraskevas Evripidou, Pedro Trancoso

System Level Design, Modeling and Simulation

Modeling NoC Architectures by Means of Deterministic and Stochastic Petri Nets

The design of appropriate communication architectures for complex Systems-on-Chip (SoC) is a challenging task. One promising approach to this problem is the Network-on-Chip (NoC). Recently, the application of deterministic and stochastic Petri nets (DSPNs) to model on-chip communication has proven to be an attractive method to evaluate and explore different communication aspects. In this contribution, the modeling of basic NoC communication scenarios featuring different processor cores, network topologies, and communication schemes is presented. To provide a test bed for the verification of modeling results, a state-of-the-art FPGA platform has been utilized. This platform allows a soft-core processor network to be instantiated and adapted in terms of communication network topologies and communication schemes. It will be shown that DSPN modeling yields good prediction results at low modeling effort. Different DSPN modeling aspects in terms of accuracy and computational effort are discussed.

H. Blume, T. von Sydow, D. Becker, T. G. Noll

High Abstraction Level Design and Implementation Framework for Wireless Sensor Networks

The diversity of applications, scarce resources, and large scale set demanding requirements for Wireless Sensor Networks (WSN). A general-purpose WSN cannot fulfill all of these requirements, so application-specific WSNs need to be developed. We present a novel WIreless SEnsor NEtwork Simulator (WISENES) framework for rapid design, simulation, evaluation, and implementation of both single nodes and large WSNs. A new WSN design starts from a high-level Specification and Description Language (SDL) model, which is simulated and implemented on a prototype through code generation. One of the novel features is the back-annotation of measured values from physical prototypes to the SDL model. The scalability and performance of WISENES have been evaluated with TUTWSN, a new and very energy-efficient WSN. The results show only a 6.7 percent difference between modeled and measured TUTWSN prototype energy consumption. Thus, WISENES hastens the development of WSN protocols and their evaluation in large networks.

Mauri Kuorilehto, Mikko Kohvakka, Marko Hännikäinen, Timo D. Hämäläinen

The ODYSSEY Tool-Set for System-Level Synthesis of Object-Oriented Models

We describe implementation of design automation tools that we have developed to automate system-level design using our ODYSSEY methodology, which advocates object-oriented (OO) modeling of the embedded system and ASIP-based implementation of it. Two flows are automated: one synthesizes an ASIP from a given C++ class library, and the other one compiles a given C++ application to run on the ASIP that corresponds to the class library used in the application. This corresponds, respectively, to hardware- and software-generation for the embedded system while hardware-software interface is also automatically synthesized. This implementation also demonstrates three other advantages: firstly, the tool is capable of synthesizing polymorphism that, to the best of our knowledge, is unique among other C++ synthesizers; secondly, the tools generate an executable co-simulation model for the ASIP hardware and its software, and hence, enable early validation of the hardware-software system before full elaboration; and finally, error-prone language transformations are avoided by choosing C++ for application modeling and SystemC for ASIP implementation.

Maziar Goudarzi, Shaahin Hessabi

Design and Implementation of a WLAN Terminal Using UML 2.0 Based Design Flow

This paper presents a UML 2.0 based design flow for real-time embedded systems. The flow starts with UML 2.0 application, architecture, and mapping models for our TUTWLAN terminal with its medium access control protocol. As a result, the hardware/software implementation on an Altera Excalibur FPGA is achieved. The implementation utilizes the eCos real-time operating system and hardware accelerators for time-critical protocol functions. The design flow is prototyped in practice, showing rapid UML 2.0 application model modification, real-time protocol processing in an image transfer application, and execution monitoring.

Petri Kukkala, Marko Hännikäinen, Timo D. Hämäläinen

Rapid Implementation and Optimisation of DSP Systems on SoPC Heterogeneous Platforms

The emergence of programmable logic devices as processing platforms for digital signal processing applications poses challenges concerning rapid implementation and high-level optimization of algorithms on these platforms. This paper describes Abhainn, a rapid implementation methodology and toolsuite for translating an algorithmic expression of the system to a working implementation on a heterogeneous multiprocessor/field programmable gate array platform, or a standalone system on programmable chip solution. Two particular focuses for Abhainn are the automated but configurable realisation of inter-processor communication fabrics, and the establishment of novel dedicated hardware component design methodologies allowing algorithm-level transformation for system optimization. This paper outlines the approaches employed in both of these instances.

J. McAllister, R. Woods, D. Reilly, S. Fischaber, R. Hasson

DVB-DSNG Modem High Level Synthesis in an Optimized Latency Insensitive System Context

This paper presents our contribution, a synchronization processor, to a SoC design methodology based on the theory of latency insensitive systems (LIS) of Carloni et al. This methodology 1) promotes intensive reuse of pre-developed IPs, 2) segments inter-IP interconnects with relay stations to break critical paths, and 3) makes IPs robust to data stream irregularities by encapsulating them in a synchronization wrapper. Our contribution consists of encapsulating IPs in a new wrapper model containing a synchronization processor whose speed and area are optimized and whose synthesizability is guaranteed. The main benefit of our approach is that it preserves the local performance of the IPs when they are encapsulated. This approach is part of the RNRT ALIPTA project, which targets design automation of intensive digital signal processing systems with GAUT [1], a high-level synthesis tool.

P. Bomel, N. Abdelli, E. Martin, A. -M. Fouilliart, E. Boutillon, P. Kajfasz

SystemQ: A Queuing-Based Approach to Architecture Performance Evaluation with SystemC

Platform architectures for modern embedded systems are increasingly heterogeneous and parallel. Early design decisions, such as the allocation of hardware resources and the partitioning of functionality onto architecture building blocks, become even more complex and important for the resulting design quality. To effectively support designers during the concept phase we base our design flow SystemQ on queuing systems. We show how by starting with a performance model the system’s behavior and structure can be refined systematically. SystemQ is implemented in SystemC and seamlessly supports the refinement of SystemQ models down to established transaction and RT levels. Compared with existing approaches, SystemQ’s formalism exposes transaction scheduling as one key aspect of the system’s performance and allows the modeling of time and resource workload-dependent behavior. A case study underpins the usefulness of SystemQ’s approach by evaluating a network access platform at three refinement levels.

Sören Sonntag, Matthias Gries, Christian Sauer

Moving Up to the Modeling Level for the Transformation of Data Structures in Embedded Multimedia Applications

Traditional design and optimization techniques for embedded devices apply local transformations of source code to maximize performance and minimize power consumption. Unfortunately, such transformations cannot adequately deal with the highly dynamic nature of today’s multimedia applications, as they do not exploit application-specific knowledge. We decided to go one step back in the design process. Starting from the original UML (Unified Modeling Language) model of the source code, we transform the UML model first before refining it into executable code. In this paper we present (i) the transformation of various UML models, (ii) a fast technique for the estimation of the high-level cost parameters that steer our transformations, and (iii) experiments based on three case studies (a Snake game, a Tetris game, and a 3D rendering engine) that show that our transformations can yield multi-fold improvements in memory footprint and/or execution time with respect to the original model.

Marijn Temmerman, Edgar G. Daylight, Francky Catthoor, Serge Demeyer, Tom Dhaene

A Case for Visualization-Integrated System-Level Design Space Exploration

Design space exploration plays an essential role in the system-level design of embedded systems. It is imperative therefore to have efficient and effective exploration tools in the early stages of design, where the design space is largest. System-level simulation frameworks that aim for early design space exploration create large volumes of simulation data in exploring alternative architectural solutions. Interpreting and drawing conclusions from these copious simulation results can be extremely cumbersome. In other domains that also struggle with interpreting large volumes of data, such as scientific computing, data visualization is an invaluable tool. Such visualization is often domain specific and has not become widely used in evaluating the results of computer architecture simulations. Surprisingly little research has been undertaken in the dynamic use of visualization to guide architectural design space exploration. In this paper, we plead for the study and development of generic methods and techniques for run-time visualization of system-level computer architecture simulations. We further explain that these techniques must be scalable and interactive, allowing designers to better explore complex (embedded system) architectures.

Andy D. Pimentel

Mixed Virtual/Real Prototypes for Incremental System Design – A Proof of Concept

Design automation has continually moved towards higher system levels. In recent years it has become possible to model and simulate whole heterogeneous systems, containing hardware as well as complex software components, described on different abstraction levels, with a correct prediction of function and timing. The remaining problem, however, is to transform such a virtual prototype into the final real prototype. This transformation is usually not feasible in a single step. Intermediate versions consist of real as well as virtual subsystems. This paper explores the possibility of a step-wise transformation process (incremental system design) leading to the requirement to combine real subsystems with simulated ones (mixed virtual/real prototypes). The paper discusses the necessary real-time prerequisites in terms of simulation method, programming language, RTOS and the interface between real and virtual subsystems to realize this goal with today’s computing platforms.

Stefan Eilers, C. Müller-Schloer

Backmatter

Title: Embedded Computer Systems: Architectures, Modeling, and Simulation
Edited by: Timo D. Hämäläinen
Andy D. Pimentel
Jarmo Takala
Stamatis Vassiliadis
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-31664-0
Print ISBN: 978-3-540-26969-4
DOI: https://doi.org/10.1007/b138322