2012 | Book

Architecture of Computing Systems – ARCS 2012

25th International Conference, Munich, Germany, February 28 - March 2, 2012. Proceedings

Editors: Andreas Herkersdorf, Kay Römer, Uwe Brinkschulte

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 25th International Conference on Architecture of Computing Systems, ARCS 2012, held in Munich, Germany, in February/March 2012. The 20 revised full papers presented in 7 technical sessions were carefully reviewed and selected from 65 submissions. The papers are organized in topical sections on robustness and fault tolerance, power-aware processing, parallel processing, processor cores, optimization, and communication and memory.

Frontmatter

Robustness and Fault Tolerance

Classification-Based Improvement of Application Robustness and Quality of Service in Probabilistic Computer Systems

Abstract

Future semiconductors no longer guarantee permanent deterministic operation. They are expected to show probabilistic behavior due to lowered voltages and shrinking structures.

Compared to radiation-induced errors, probabilistic systems face increased error frequencies leading to unexpected bit-flips. Approaches like probabilistic CMOS provide methods to control error distributions which reduce the error probability in more significant bits. However, instructions handling control flow or pointers still expect deterministic operation, thus requiring a classification to identify these instructions.

We apply our transient error classification to probabilistic circuits using differing voltage distributions. Static analysis ensures that probabilistic effects only affect unreliable operations which accept a certain level of impreciseness, and that errors in probabilistic components will never propagate to critical operations.

To evaluate, we analyze robustness and quality-of-service of an H.264 video decoder. Using classification results, we map unreliable arithmetic operations onto probabilistic components of a simulated ARM-based architecture, while the remaining operations use deterministic components.

Andreas Heinig, Vincent J. Mooney, Florian Schmoll, Peter Marwedel, Krishna Palem, Michael Engel

A Case Study on Error Resilient Architectures for Wireless Communication

Abstract

Reliability is the next big challenge if CMOS scaling is to continue. To face this challenge, cross-layer approaches become mandatory. In this paper we present a dynamic error detection and correction flow for wireless communication. We demonstrate this flow on a flexible state-of-the-art decoder, i.e., an HSPA/LTE channel decoder. A profound analysis of the impact of timing and soft errors on the system behavior is presented. Dynamic techniques utilizing higher layers of communication systems to compensate for these errors are proposed. This approach results in very low overhead for error resilience.

Christian Brehm, Matthias May, Christina Gimmler, Norbert Wehn

Using Dynamic Task Level Redundancy for OpenMP Fault Tolerance

Abstract

Obtaining fault-tolerant applications and systems is one of today’s most important research topics. Fault tolerance is becoming more and more essential in shared-memory parallel programs and in multi/many-core architectures due to the decreasing size of transistors and the growing number of failures. Very few techniques for fault-tolerant OpenMP programs have been studied. These few works are based on checkpoint and recovery, and on static thread-level redundancy techniques. However, these approaches may exhibit scalability issues when the number of cores increases or when an unbalanced workload exists. To overcome these issues, we present in this paper a dynamic task-level redundancy technique for fault-tolerant OpenMP applications. Our method is based on dynamically applying Triple Modular Redundancy to OpenMP tasks through a dedicated runtime and on applying majority voting to guarantee correct results. Our flexible fault-tolerant OpenMP approach has been evaluated for performance and fault coverage, and it showed small overhead with a good error detection and recovery rate.

Oussama Tahan, Mohamed Shawky
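
The triple modular redundancy with majority voting mentioned in the abstract above can be sketched in a few lines. This is only an illustration of the general technique, not the authors' OpenMP runtime: the names run_with_tmr and make_flaky_task are hypothetical, the replicas run sequentially rather than as parallel tasks, and results are assumed to be hashable.

```python
from collections import Counter
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def run_with_tmr(task: Callable[[], T], replicas: int = 3) -> T:
    """Run `task` `replicas` times and return the majority result.

    Hypothetical helper: a real runtime would schedule the replicas
    as parallel tasks instead of calling them one after another.
    """
    results = [task() for _ in range(replicas)]
    value, votes = Counter(results).most_common(1)[0]
    # A strict majority is required; otherwise the fault is detected
    # but cannot be corrected by voting.
    if votes <= replicas // 2:
        raise RuntimeError("no majority among replicas")
    return value


def make_flaky_task(outputs: Iterable[T]) -> Callable[[], T]:
    """Return a task that yields the given outputs in order (simulated faults)."""
    it = iter(outputs)
    return lambda: next(it)


if __name__ == "__main__":
    # One of three replicas returns a corrupted value; voting masks it.
    print(run_with_tmr(make_flaky_task([42, 7, 42])))
```

With three replicas, a single faulty result is outvoted; if all three disagree, the sketch raises an error instead of guessing.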

Power Aware Processing

A Very Fast and Quasi-accurate Power-State-Based System-Level Power Modeling Methodology

Abstract

In this paper, we propose a novel system-level power modeling methodology that allows for very fast joint power-performance evaluation at specification phase. This methodology adopts approximately-timed task-accurate performance models and augments them with power-state-based power models to enable efficient simulation. A flexible method is also proposed to model complex dynamic power management policies so that their effects can be evaluated. We validate the accuracy of our methodology by comparing simulation results with measurements on a real mobile phone platform. Experimental results show that the simulated power profile matches very well with the measurements and it only takes about 100 ms to simulate a 20 ms GSM paging burst use case.

Yang Xu, Rafael Rosales, Bo Wang, Martin Streubühr, Ralph Hasholzner, Christian Haubelt, Jürgen Teich
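
As a rough illustration of what a power-state-based power model computes, the sketch below sums energy over a trace of (state, duration) segments. It is a generic example under stated assumptions, not the methodology of the paper: the state names, the power values in STATE_POWER_MW, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-state power draw in milliwatts (illustrative values only).
STATE_POWER_MW: Dict[str, float] = {"sleep": 1.0, "idle": 10.0, "active": 100.0}

Trace = List[Tuple[str, float]]  # (power state, duration in milliseconds)


def energy_uj(trace: Trace, power_mw: Dict[str, float]) -> float:
    """Total energy in microjoules: mW * ms = uJ for each segment."""
    return sum(power_mw[state] * duration_ms for state, duration_ms in trace)


def average_power_mw(trace: Trace, power_mw: Dict[str, float]) -> float:
    """Average power over the whole trace in milliwatts."""
    total_ms = sum(duration_ms for _, duration_ms in trace)
    return energy_uj(trace, power_mw) / total_ms


if __name__ == "__main__":
    trace = [("sleep", 10.0), ("active", 5.0), ("idle", 5.0)]
    print(energy_uj(trace, STATE_POWER_MW), average_power_mw(trace, STATE_POWER_MW))
```

Because only state changes need to be tracked, such a model can be evaluated alongside a coarse performance simulation at very low cost.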

Static Task Mapping for Tiled Chip Multiprocessors with Multiple Voltage Islands

Abstract

The complexity of large Chip Multiprocessors (CMP) makes design reuse a practical approach to reduce the manufacturing and design cost of high-performance systems. This paper proposes techniques for static task mapping onto general-purpose CMPs with multiple pre-defined voltage islands for power management. The CMPs are assumed to contain different classes of processing elements with multiple voltage/frequency execution modes to better cover a large range of applications. Task mapping is performed with awareness of both on-chip and off-chip memory traffic, and communication constraints such as the link and memory bandwidth. A novel mapping approach based on Extremal Optimization is proposed for large-scale CMPs. This new combinatorial optimization method has delivered very good results in quality and computational cost when compared to the classical simulated annealing.

Nikita Nikitin, Jordi Cortadella

An Architecture for Power Management in Automotive Systems

Abstract

This paper presents an architectural model for power management in automotive systems. It is based on recent advances in cyber-physical and cybernetic control systems. Building on a previous model of power management, formal interactions within a hierarchical structure are characterized. In the architecture, strategic decisions allow coordinated adjustment of power management plans as well as local autonomy at subsystem scope.

Andreas Barthels, Joachim Fröschl, Hans-Ulrich Michel, Uwe Baumgarten

Parallel Processing

Invasive MPI on Intel’s Single-Chip Cloud Computer

Abstract

The Single-chip Cloud Computer (SCC) from Intel Labs is an experimental CPU that integrates 48 cores. As its name suggests, it is a distributed memory system on a chip. In typical configurations, the available memory is divided equally across the cores. Message passing is supported by means of an on-die Message Passing Buffer (MPB). The memory organization and hardware features of the SCC make it an interesting platform for evaluating parallel programming models. In this work, an MPI implementation is optimized and extended to support the invasive programming model; the invasive model’s main idea is to allow for resource aware programming. The result is a library that provides resource awareness through extensions to MPI, while keeping its features and compatibility.

Isaías A. Comprés Ureña, Michael Riepen, Michael Konow, Michael Gerndt

A Low-Overhead Heuristic for Mixed Workload Resource Partitioning in Cluster-Based Architectures

Abstract

The execution of multiple multimedia applications on a modern Multi-Processor System-on-Chip (MPSoC) raises the need for a Run-Time Management (RTM) layer to match hardware and application needs. This paper proposes a novel model for the run-time resource allocation problem, taking into account both architectural and application standpoints. Our model considers clustered and non-clustered resources, migration and reconfiguration overheads, quality of service (QoS) and application priorities. A near-optimal solution is computed focusing on spatial and computational constraints. Experiments reveal that our first implementation is able to manage tens of applications with an overhead of only a few milliseconds and a memory footprint of less than one hundred KB, making it suitable for use on real systems.

Davide Zoni, Patrick Bellasi, William Fornaciari

Deterministic Execution Model on COTS Hardware

Abstract

In order to be able to use multicore COTS hardware in critical systems, we put forward a time-oriented execution model and provide a general framework for programming and analysing a multicore compliant with the execution model.

Frédéric Boniol, Hugues Cassé, Eric Noulard, Claire Pagetti

Processor Cores

Design Principles for Synthesizable Processor Cores

Abstract

As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput on FPGA-based processor cores: first, superpipelining enables higher-frequency system clocks, and second, predicated instructions circumvent costly pipeline stalls due to branches. To evaluate their effects, we develop Tinuso, a processor architecture optimized for FPGA implementation. We demonstrate through the use of micro-benchmarks that our principles guide the design of a processor core that improves performance by an average of 38% over a similar Xilinx MicroBlaze configuration.

Pascal Schleuniger, Sally A. McKee, Sven Karlsson

HPC Performance Domains on Multi-core Processors with Virtualization

Abstract

As the number of cores increases in multi-core processors, more applications execute at the same time. In this paper we present a simple and non-intrusive approach that guarantees performance isolation for High Performance Applications. This is achieved using virtualization by creating multiple virtual machines on the same processor, which can be seen as different Performance Domains. While previously this technique has been explored for increasing utilization, in this work we exploit it for improving performance of multiple co-executing applications. For the purpose of this work we have studied two different virtualization approaches: (i) conventional hosted virtualization and (ii) bare-metal virtualization. To study the feasibility of this technique, we analyze the performance of applications when executing within a virtual machine. The isolation properties provided by both virtualization methods offer performance predictability for the executed applications. Our experimental results show that the performance overhead of executing on a virtualized environment is not significant, with the bare-metal virtualization resulting in an overhead of only 3%. Most importantly, virtualization is able to eliminate in some cases the negative effects of co-execution interference, thus applications running on virtual machines may achieve a better performance than running natively on the system.

Panayiotis Petrides, George Nicolaides, Pedro Trancoso

A Generic and Non-intrusive Profiling Methodology for SystemC Multi-core Platform Simulation Models

Abstract

The efficient programming of today's multi-core platforms has become an increasingly difficult task due to the growing complexity of the overall system. In particular, the lack of an integrated HW/SW co-analysis methodology that allows exploring the behavior of programming models, the runtime system and the virtual platform model of the multi-core system leads to the need for new developments in the field of HW/SW co-design tools. In order to support HW/SW co-design, we present a simulation-based tracing and profiling methodology for multi-core platforms, following a generic and non-intrusive approach that supports easy adaptability, fast applicability and accurate performance measures.

Jens Brandenburg, Benno Stabernack

Optimization

Dynamic Task-Scheduling and Resource Management for GPU Accelerators in Medical Imaging

Abstract

For medical imaging applications, a timely execution of tasks is essential. Hence, when running multiple applications on the same system, scheduling with the capability of task preemption and prioritization becomes mandatory. Using GPUs as accelerators in this domain imposes new challenges, since the GPU’s common FIFO scheduling does not support task prioritization and preemption. As a remedy, this paper investigates the employment of resource management and scheduling techniques for applications from the medical domain for GPU accelerators. A scheduler supporting both priority-based and LDF scheduling is added to the system such that high-priority tasks can interrupt tasks already enqueued for execution. The scheduler is capable of utilizing multiple GPUs in a system to minimize the average response time of applications. Moreover, it supports simultaneous execution of multiple tasks to hide data transfer latencies. We show that the scheduler interrupts scheduled and already enqueued applications to fulfill the timing requirements of high-priority dynamic tasks.

Richard Membarth, Jan-Hugo Lupp, Frank Hannig, Jürgen Teich, Mario Körner, Wieland Eckert

An Approach for Performance Estimation of Hybrid Systems with FPGAs and GPUs as Coprocessors

Abstract

This paper presents an approach for modeling the achievable speed-ups of FPGAs (Field Programmable Gate Arrays) or GPUs (Graphics Processing Units) as coprocessors in hybrid computing systems. The underlying computation model assumes that the coprocessors are separate devices and that their input and output data are transferred from and into the system’s memory. The model considers all overheads involved when (sub-)tasks are performed on a coprocessor instead of the CPU. By means of a sample application, the validity of the model is checked against measured values. In addition, the theoretical maximum speed-ups of two hybrid systems compared to an optimal single-core CPU implementation are approximated. Using a penalty factor P_SEQ as a measure of the degree to which a program cannot be fully parallelized due to data dependencies, a system with an Nvidia GTX 285 GPU achieves a speed-up of 2.7 times P_SEQ, while for a single node of a Cray XD1 with a Xilinx Virtex4 LX160 the speed-up is about 1 times P_SEQ.

Volker Hampel, Thilo Pionteck, Erik Maehle
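
To make the role of transfer overheads in the abstract above concrete, here is a minimal sketch of a generic offload speed-up estimate. It is not the model from the paper, and it does not include the P_SEQ penalty factor: the function offload_speedup, its parameters, and the example numbers are all assumptions chosen for illustration.

```python
def offload_speedup(
    t_cpu: float,
    t_kernel: float,
    bytes_in: float,
    bytes_out: float,
    bandwidth: float,
    t_setup: float = 0.0,
) -> float:
    """Estimate the speed-up of offloading a task to a separate coprocessor.

    t_cpu      time of the task on the CPU alone (seconds)
    t_kernel   compute time on the coprocessor (seconds)
    bytes_in   input data copied from system memory to the device
    bytes_out  result data copied back to system memory
    bandwidth  link bandwidth in bytes per second (assumed symmetric)
    t_setup    fixed per-call overhead such as driver or launch latency
    """
    t_transfer = (bytes_in + bytes_out) / bandwidth
    t_offload = t_setup + t_transfer + t_kernel
    return t_cpu / t_offload


if __name__ == "__main__":
    # The kernel alone is 10x faster than the CPU, but copying the data
    # costs as much time as the computation itself.
    print(offload_speedup(t_cpu=1.0, t_kernel=0.1,
                          bytes_in=2e8, bytes_out=2e8, bandwidth=4e9))
```

In this example the data transfers halve the achievable gain, which is why a model that accounts for them can differ sharply from a comparison of raw compute times.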

Work Stealing Strategies for Parallel Stream Processing in Soft Real-Time Systems

Abstract

Work stealing has proven to be an efficient technique for scheduling parallel computations. In its basic form, however, work stealing is not suitable for real-time applications, since the latency of a task is hardly predictable. In this paper, we propose a number of variants and extensions of work stealing suitable for stream processing applications. Such applications are frequently encountered in embedded systems, which often have to obey real-time constraints. Moreover, we give bounds on the maximum latency for certain stealing strategies. Our experimental results show a significant reduction of the latency using these strategies.

Sebastian Mattheis, Tobias Schuele, Andreas Raabe, Thomas Henties, Urs Gleim

Design Space Exploration of Hybrid Ultra Low Power Branch Predictors

Abstract

Modern branch predictors are often too large and power hungry to be a viable option for small, embedded processors where die space, power consumption and performance are all at a premium. With embedded processors, the large cache structures required for high-performance branch prediction can easily take up more die space than the rest of the processor combined. When coupled with the large leakage energies, which are set to be an increasing issue as technologies advance to 45 nm and beyond, it can often appear appealing to not use a dynamic branch predictor at all. This paper seeks to find a way of using an ultra-small branch predictor in a hybrid predictor configuration suitable for an embedded processor. We introduce a novel bias parameter to the consideration of when to execute branches statically or dynamically, further exploring the performance vs energy trade-off. We present a solution that reduces dynamic branch predictor aliasing, improves performance and requires a minimum of extra die space. The results presented relate die space requirements, energy use and performance impacts. We look at how best to optimise this balance in a way that is usually not considered, and on a lower bit budget than has previously been presented. The EEMBC 1.1 benchmark suite [1] was used to explore the energy vs performance trade-off boundary, taking averages of the results across 31 different benchmarks. We evaluate 5 traditional branch predictor configurations and 36 novel ultra-small hybrid branch predictors through the use of 9 sets of our novel bias values, combining GShare dynamic predictions with profiled backwards taken forwards not-taken (BTFN) / backwards not-taken forwards taken (BNFT) static predictions. The results demonstrate that the use of a static-dynamic hybrid is not only beneficial but necessary for very small predictors to produce a positive effect on the cycle count and overall energy use of the processor.

Through the use of our novel bias parameter we explore the performance vs energy trade-off and show that through a small (0.1 seconds at 500 MHz or 0.35%) reduction in peak performance (total runtime in the region of 28.35 seconds) for a given architecture we can gain substantial dynamic energy savings from reduced dynamic predictor accesses (removing up to an additional 16.5%, or 53 million, of the traditional hybrid predictor accesses). Our best performing architecture showed an average improvement in run time of 2 seconds (6.7%) over a static BTFN baseline (total runtime 30.46 s), at the cost of only an additional 0.01 mm² (or 1%) die space.

Matthew Bielby, Miles Gould, Nigel Topham
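
For readers unfamiliar with the two building blocks named in the abstract above, the sketch below shows a textbook GShare dynamic predictor and a static BTFN rule. It is a generic illustration only: the table size, the initial counter value, and the class and function names are assumptions, and the paper's bias parameter and hybrid selection logic are not modeled here.

```python
class GShare:
    """GShare: PC XOR global history indexes a table of 2-bit counters."""

    def __init__(self, index_bits: int = 4) -> None:
        self.mask = (1 << index_bits) - 1
        # Counters range 0..3; values >= 2 mean "predict taken".
        self.table = [1] * (1 << index_bits)  # start weakly not-taken
        self.history = 0  # global branch history register

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def btfn_predict(pc: int, target: int) -> bool:
    """Static BTFN: backward branches are predicted taken, forward not taken."""
    return target < pc


if __name__ == "__main__":
    predictor = GShare()
    for _ in range(8):  # train on a loop branch that is always taken
        predictor.update(0x40, True)
    print(predictor.predict(0x40), btfn_predict(pc=0x40, target=0x20))
```

The static rule needs no storage at all, which is what makes combining it with a very small dynamic table attractive when die space and energy are tight.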

Communication and Memory

New Memory Organizations for 3D DRAM and PCMs

Abstract

The memory wall (the gap between processing and storage speeds) remains a concern to computer systems designers. Caches have played a key role in hiding the performance gap by keeping recently accessed information in fast memories closer to the processor. Multi- and many-core systems are placing severe demands on caches, exacerbating the performance disparity between memory and processors. New memory technologies, including 3D stacked DRAMs, solid state disks (SSDs) such as those built using flash technologies, and phase change memories (PCM), may alleviate the problem: 3D DRAMs and SSDs present lower latencies than conventional off-chip DRAMs and magnetic disk drives. However, these technologies force us to rethink how address spaces should be organized into pages and how virtual addresses should be translated into physical pages. In this paper, we present some preliminary ideas in this connection, and evaluate these new organizations using SPEC CPU2006 benchmarks.

Ademola Fawibe, Jared Sherman, Krishna Kavi, Mike Ignatowski, David Mayhew

Vertical Link On/Off Control Methods for Wireless 3-D NoCs

Abstract

Low-power techniques are proposed for wireless three-dimensional Networks-on-Chip (wireless 3-D NoCs), in which routers on the same chip are connected with metal wires while those on different chips are connected wirelessly using inductive coupling. To save power in the vertical link, the clock and power supplies to the transmitter are stopped when its utilization is within a specified range. Meanwhile, the whole wireless vertical link is shut down when the utilization is lower than the threshold. In order to maintain performance, on-demand activation is used in this paper. As soon as a flit arrives, the dormant data transmitter or the whole vertical link is activated immediately without any further judgement. Full-system many-core simulations using power parameters derived from a real chip implementation show that the proposed low-power techniques reduce the power consumption by 23.4%-29.3%, while the performance overhead is less than 2.4%.

Hao Zhang, Hiroki Matsutani, Yasuhiro Take, Tadahiro Kuroda, Hideharu Amano
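
The utilization-based control described in the abstract above can be pictured as a small decision function. The sketch below is one possible reading, not the authors' implementation: the threshold values, the assumption that the shutdown threshold is the lower bound of the range, and the names LinkMode, select_mode and on_flit_arrival are all hypothetical.

```python
from enum import Enum


class LinkMode(Enum):
    ACTIVE = "active"            # transmitter and receiver powered
    TX_OFF = "transmitter_off"   # clock and power to the transmitter stopped
    LINK_OFF = "link_off"        # whole vertical link shut down


def select_mode(utilization: float, low: float = 0.05, high: float = 0.20) -> LinkMode:
    """Pick a power mode from measured link utilization (0.0 to 1.0).

    `low` and `high` are illustrative thresholds, not values from the paper.
    """
    if utilization < low:
        return LinkMode.LINK_OFF
    if utilization <= high:
        return LinkMode.TX_OFF
    return LinkMode.ACTIVE


def on_flit_arrival(mode: LinkMode) -> LinkMode:
    """On-demand activation: any arriving flit wakes the link unconditionally."""
    return LinkMode.ACTIVE


if __name__ == "__main__":
    mode = select_mode(0.01)
    print(mode, on_flit_arrival(mode))
```

Waking the link without re-evaluating utilization keeps the added latency small, at the cost of occasionally powering up for a single flit.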

SADmote: A Robust and Cost-Effective Device for Environmental Monitoring

Abstract

Time to deployment for wireless sensor networks could be reduced by using commercial sensor nodes. However, this may lead to suboptimal flexibility, power consumption and cost of the system. Our pilot deployment for precision agriculture and fruit growing research showed similar conclusions and outlined the design decisions leading to SADmote: a new sensor node for environmental monitoring. It was evaluated both in the lab and field, showing improved energy consumption over commercial solutions such as Tmote Sky and Waspmote.

Atis Elsts, Rihards Balass, Janis Judvaitis, Reinholds Zviedris, Girts Strazdins, Artis Mednis, Leo Selavo

Streamlined Network-on-Chip for Multicore Embedded Architectures

Abstract

MPSoCs are becoming complex systems incorporating a large number of compute cores as well as various accelerators and application specific units. To handle the communication in MPSoCs, the Network-on-Chip (NoC) concept has been proposed as a versatile and scalable solution. The cost of the communication subsystem may have a major impact on the overall cost of the SoC; hence the need for careful evaluation of NoC design alternatives. Deflection routing, characterized by router simplicity and minimal resources, is an attractive design alternative but is generally viewed as suitable only for NoC with low and medium traffic. In this paper, we propose prioritization and buffering algorithms which improve deflection routing performance to the point it becomes attractive in heavily loaded NoC as well.

Gadi Oxman, Shlomo Weiss, Yitzhak (Tsahi) Birk

Backmatter

Title: Architecture of Computing Systems – ARCS 2012
Editors: Andreas Herkersdorf, Kay Römer, Uwe Brinkschulte
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-28293-5
Print ISBN: 978-3-642-28292-8
DOI: https://doi.org/10.1007/978-3-642-28293-5