
2019 | Book

Transactions on High-Performance Embedded Architectures and Compilers V


About this book

Transactions on HiPEAC aims at the timely dissemination of research contributions in computer architecture and compilation methods for high-performance embedded computer systems. Recognizing the convergence of embedded and general-purpose computer systems, this journal publishes original research on systems targeted at specific computing tasks as well as systems with broad application bases. The scope of the journal therefore covers all aspects of computer architecture, code generation and compiler optimization methods of interest to researchers and practitioners designing future embedded systems.
This fifth issue contains extended versions of papers by the best-paper-award candidates of IC-SAMOS 2009 and the SAMOS 2009 Workshop, two co-located events of the 9th International Symposium on Systems, Architectures, Modeling and Simulation (SAMOS 2009), held in Samos, Greece, in 2009. The seven papers included in this volume were carefully reviewed and selected. They cover research on embedded processor hardware/software design and integration, and present challenging research trends.

Table of Contents

Frontmatter
Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards
Abstract
In the last decade, there has been dramatic growth in research and development of massively parallel commodity graphics hardware, both in academia and in industry. Graphics card architectures provide an optimal platform for the parallel execution of many number-crunching loop programs from fields like image processing or linear algebra. However, it is hard to map such algorithms efficiently to the graphics hardware, even with detailed insight into the architecture. This paper presents a multiresolution image processing algorithm, shows how to map this class of algorithms efficiently to graphics hardware, and introduces double buffering concepts to hide memory transfers. Furthermore, the impact of the execution configuration is illustrated, and a method is proposed to determine the best configuration offline. Using CUDA as the programming model, it is demonstrated that the image processing algorithm is significantly accelerated: a speedup of more than 145× is achieved on NVIDIA’s Tesla C1060 compared to a parallelized implementation on a Xeon Quad Core. For deployment in a streaming application with a steady flow of new incoming data, it is shown that double buffering reduces the memory transfer overhead to the graphics card by a factor of six.
Richard Membarth, Hritam Dutta, Frank Hannig, Jürgen Teich
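The double buffering mentioned above is a standard overlap technique; the abstract itself shows no code, so the following is only a minimal C++ sketch of the pattern using the CUDA runtime API, with a hypothetical process_frame stand-in for the authors' multiresolution kernel. The point is that the transfer for one frame proceeds on one stream while the kernel for the previous frame runs on the other.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Stand-in for launching the image-processing kernel on stream `s`
// (the actual multiresolution kernel is not part of this overview).
static void process_frame(float* d_in, float* d_out, std::size_t n, cudaStream_t s) {
    (void)d_in; (void)d_out; (void)n; (void)s;
}

// Double buffering: while the kernel for frame f runs on one stream,
// the transfer for frame f+1 proceeds on the other. Host buffers must
// be pinned (cudaHostAlloc) for the copies to be truly asynchronous.
void stream_frames(float** h_in, float** h_out, int frames, std::size_t bytes) {
    float* d_in[2];
    float* d_out[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_in[i], bytes);
        cudaMalloc(&d_out[i], bytes);
        cudaStreamCreate(&stream[i]);
    }
    for (int f = 0; f < frames; ++f) {
        int b = f & 1;                     // alternate buffers and streams
        cudaStreamSynchronize(stream[b]);  // frame f-2 is done with this buffer
        cudaMemcpyAsync(d_in[b], h_in[f], bytes,
                        cudaMemcpyHostToDevice, stream[b]);
        process_frame(d_in[b], d_out[b], bytes / sizeof(float), stream[b]);
        cudaMemcpyAsync(h_out[f], d_out[b], bytes,
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_in[i]);
        cudaFree(d_out[i]);
    }
}
```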
Programmable and Scalable Architecture for Graphics Processing Units
Abstract
Graphics processing is an application area with a high level of parallelism at both the data level and the task level. Therefore, graphics processing units (GPUs) are often implemented as multiprocessing systems with high-performance floating-point processing and application-specific hardware stages to maximize graphics throughput.
In this paper we evaluate the suitability of Transport Triggered Architectures (TTA) as a basis for implementing GPUs. TTA improves scalability over traditional VLIW-style architectures, making it interesting for computationally intensive applications. We show that TTA provides high floating-point processing performance while allowing more programming freedom than vector processors.
Finally, one of the main features of the presented TTA-based GPU design is its fully programmable architecture, making it a suitable target for the general-purpose GPU computing APIs that have become popular in recent years.
Carlos S. de La Lama, Pekka Jääskeläinen, Heikki Kultala, Jarmo Takala
Circular Buffers with Multiple Overlapping Windows for Cyclic Task Graphs
Abstract
Multimedia applications process streams of values and can often be represented as task graphs. For performance reasons, these task graphs are executed on multiprocessor systems. Inter-task communication is performed via buffers, where the order in which values are written into a buffer can differ from the order in which they are read. Some existing approaches perform inter-task communication via first-in-first-out buffers combined with reordering tasks, and require applications with affine index expressions. In our previous work, we used circular buffers with non-overlapping read and write windows, so that no reordering task is required. However, these windows can cause deadlock for cyclic task graphs.
In this paper, we introduce circular buffers with multiple overlapping windows that do not delay the release of locations and therefore do not introduce deadlock for cyclic task graphs. Buffers with multiple overlapping read and write windows are attractive because they avoid having to select the buffer from which a value is read or into which a value is written, which significantly simplifies the extraction of a task graph from a sequential application. They are also attractive because a buffer capacity equal to the array size is sufficient for deadlock-free execution, so no global analysis is needed to compute sufficient buffer capacities. Our case study presents two applications that require these buffers.
Tjerk Bijlsma, Marco J. G. Bekooij, Gerard J. M. Smit
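To make the windowing idea concrete, here is a simplified, single-threaded C++ sketch of a circular buffer with one read window and one write window. The names and granularity are illustrative only; the paper's contribution, multiple overlapping windows, generalizes this so that several windows may cover the same locations.

```cpp
#include <cstddef>
#include <vector>

// Simplified, single-threaded sketch: one read window and one write
// window on a circular buffer. Locations inside a window may be accessed
// in any order; releasing window locations hands them to the other side.
class WindowedBuffer {
public:
    explicit WindowedBuffer(std::size_t capacity) : data_(capacity) {}

    // Writer side: is there room for a write window of `len` locations?
    bool can_write(std::size_t len) const { return data_.size() - count_ >= len; }
    // Random-access store at `offset` inside the current write window.
    void write(std::size_t offset, int v) { data_[(wpos_ + offset) % data_.size()] = v; }
    // Release the first `len` locations of the write window to the reader.
    void release_write(std::size_t len) { wpos_ = (wpos_ + len) % data_.size(); count_ += len; }

    // Reader side, symmetric to the writer.
    bool can_read(std::size_t len) const { return count_ >= len; }
    int read(std::size_t offset) const { return data_[(rpos_ + offset) % data_.size()]; }
    void release_read(std::size_t len) { rpos_ = (rpos_ + len) % data_.size(); count_ -= len; }

private:
    std::vector<int> data_;
    std::size_t wpos_ = 0, rpos_ = 0, count_ = 0;  // window bases and fill count
};
```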
A Hardware-Accelerated Estimation-Based Power Profiling Unit - Enabling Early Power-Aware Embedded Software Design and On-Chip Power Management
Abstract
The power consumption of battery-powered and energy-scavenging devices has become a major design metric for embedded systems. Increasingly complex software applications and rising demands on operating times under restricted power budgets are the main drivers of power-aware system design and power management techniques. In this work, a hardware-accelerated, estimation-based power profiling unit delivering real-time power information has been developed. Power consumption feedback allows the designer to perform real-time power analysis of embedded systems. Power-saving potential as well as power-critical events can be identified in much less time than with power simulations. Hence, the designer can take countermeasures at early design stages, which enhances development efficiency and decreases time-to-market. Moreover, this work forms the basis for estimation-based on-chip power management, leveraging the power information to adapt system frequency and supply voltage in order to enhance the power efficiency of embedded systems. Power estimation accuracies achieved for a deep sub-micron smart-card controller are above 90% compared to gate-level simulations.
Andreas Genser, Christian Bachmann, Christian Steger, Reinhold Weiss, Josef Haid
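The profiling unit itself is hardware, but the underlying idea of estimation-based power profiling, weighting per-interval event counts with pre-characterized energy coefficients, can be sketched in a few lines of C++. The event set and coefficients below are placeholders, not values from the paper.

```cpp
#include <array>

// Sketch of estimation-based power profiling: per-interval event counts
// are weighted with pre-characterized per-event energy coefficients.
constexpr int kNumEvents = 3;  // e.g. instructions, memory accesses, idle cycles
constexpr std::array<double, kNumEvents> kEnergyPerEventPJ = {12.0, 45.0, 1.5};

// Estimated average power over an interval of `cycles` cycles at `freq_hz`.
double estimate_power_watts(const std::array<unsigned long, kNumEvents>& counts,
                            unsigned long cycles, double freq_hz) {
    double energy_pj = 0.0;
    for (int i = 0; i < kNumEvents; ++i)
        energy_pj += kEnergyPerEventPJ[i] * static_cast<double>(counts[i]);
    const double interval_s = static_cast<double>(cycles) / freq_hz;
    return (energy_pj * 1e-12) / interval_s;  // watts = joules / second
}
```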
The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors
Abstract
Stream programming is a promising step towards portable, efficient, correct use of parallelism. A stream program is built from kernels that communicate only through point-to-point streams. The stream compiler maps a portable stream program onto the target, automatically sizing communications buffers and applying optimizing transformations such as blocking, task fission and task fusion.
This paper presents the Abstract Streaming Machine (ASM), the machine description and performance model used by the ACOTES stream compiler. We explain how the parameters of the ASM and the ASM coarse-grain simulator are used by the partitioning and queue length assignment phases of the ACOTES compiler. Our experiments on the Cell Broadband Engine show that the predictions from the ASM have a maximum relative error of 15% across our benchmarks.
Paul M. Carpenter, Alex Ramirez, Eduard Ayguade
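The ASM's exact cost model is not reproduced in this overview; the generic C++ sketch below only illustrates the kind of coarse-grain estimate such a model produces, namely that a stream pipeline's steady-state throughput is bounded by its slowest stage.

```cpp
#include <algorithm>
#include <vector>

// Generic illustration (not the ASM's actual model): in steady state,
// a stream pipeline's throughput is bounded by its slowest stage,
// counting both computation and the communication charged to it.
struct Stage {
    double compute_cycles_per_item;
    double comm_cycles_per_item;
};

double bottleneck_cycles_per_item(const std::vector<Stage>& pipeline) {
    double worst = 0.0;
    for (const Stage& s : pipeline)
        worst = std::max(worst, s.compute_cycles_per_item + s.comm_cycles_per_item);
    return worst;  // a partitioner would choose the mapping minimizing this
}
```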
Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability
Abstract
We present the hardware design and implementation of a local memory system for individual processors inside future chip multiprocessors (CMPs). Our memory system supports both implicit communication via caches and explicit communication via directly accessible local (“scratchpad”) memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks that lie near each processor, so that portions of them operate as second-level (local) cache while the rest operate as scratchpad. We also strive to merge the communication subsystems required by the cache and the scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor interacts with the NI at user level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters and queues. We describe the NI design, the hardware cost, and the latencies of our FPGA-based prototype, which integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller. One-way, end-to-end, user-level communication completes within about 20 clock cycles for short transfer sizes.
George Kalokerinos, Vassilis Papaefstathiou, George Nikiforos, Stamatis Kavadias, Xiaojun Yang, Dionisios Pnevmatikatos, Manolis Katevenis
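As an illustration of user-level command areas, the following C++ sketch posts a hypothetical RDMA descriptor to a memory-mapped command area; the field layout and trigger convention are invented for illustration and are not the paper's actual NI interface.

```cpp
#include <cstdint>

// Hypothetical RDMA command descriptor; the layout and the convention
// that writing the opcode last triggers the NI are assumptions.
struct RdmaCommand {
    std::uint64_t dst_addr;
    std::uint64_t src_addr;
    std::uint32_t size_bytes;
    std::uint32_t opcode;  // e.g. 1 = remote write
};

// Post a command by filling a memory-mapped command area in scratchpad.
void post_rdma(volatile RdmaCommand* cmd_area,
               std::uint64_t dst, std::uint64_t src, std::uint32_t size) {
    cmd_area->dst_addr   = dst;
    cmd_area->src_addr   = src;
    cmd_area->size_bytes = size;
    cmd_area->opcode     = 1;  // written last: the NI starts the transfer
}
```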
A Dynamic Reconfigurable Super-VLIW Architecture for a Fault Tolerant Nanoscale Design
Abstract
Nanotechnologies will enable very high integration, at the limits of silicon or even beyond, opening a new scenario. However, the fault rate, predicted to range from 1% up to 20% of all devices, could compromise the future of these technologies. This work proposes Super-VLIW, a fault-tolerant reconfigurable architecture that tolerates the high fault rates expected in future technologies. The architecture consists of a reconfigurable unit tightly coupled to a MIPS processor. The reconfigurable unit is composed of a binary translation unit, a configuration cache, a reconfigurable coarse-grained array of heterogeneous functional units, and an interconnection network. Reconfiguration is done at run time by translating the binary code, and no recompilation is needed. The interconnection network is based on a set of multistage networks, which provide fault-tolerant communication between any pair of functional units and to/from the MIPS register file. This work proposes a mechanism to dynamically allocate the available units to ensure parallel execution of basic operations, performing placement and routing in a single step, which allows the correct interconnection of units even under very high fault rates. Moreover, the proposed architecture can scale to future nanotechnologies even under a 15% fault rate.
Ricardo Ferreira, Cristoferson Bueno, Marcone Laure, Monica Pereira, Luigi Carro
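As a rough illustration of fault-aware allocation, the C++ sketch below places each operation on the first idle, non-faulty functional unit of the required type; the paper performs routing through the multistage networks in the same step, which this sketch omits.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Sketch of fault-aware allocation: place an operation on the first
// idle, non-faulty functional unit of the required type.
struct FunctionalUnit {
    int type;     // kind of operation the unit executes
    bool faulty;  // marked by the fault map
};

std::optional<std::size_t> allocate(const std::vector<FunctionalUnit>& array,
                                    std::vector<bool>& busy, int op_type) {
    for (std::size_t i = 0; i < array.size(); ++i) {
        if (array[i].type == op_type && !array[i].faulty && !busy[i]) {
            busy[i] = true;   // reserve the unit for this operation
            return i;
        }
    }
    return std::nullopt;  // no fault-free unit of this type is available
}
```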
Backmatter
Metadata
Title
Transactions on High-Performance Embedded Architectures and Compilers V
Editors
Cristina Silvano
Koen Bertels
Michael Schulte
Copyright Year
2019
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-662-58834-5
Print ISBN
978-3-662-58833-8
DOI
https://doi.org/10.1007/978-3-662-58834-5
