
2012 | Book

Computer Architecture

ISCA 2010 International Workshops A4MMC, AMAS-BT, EAMA, WEED, WIOSCA, Saint-Malo, France, June 19-23, 2010, Revised Selected Papers

Edited by: Ana Lucia Varbanescu, Anca Molnos, Rob van Nieuwpoort

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the thoroughly refereed post-conference proceedings of the workshops held at the 37th International Symposium on Computer Architecture, ISCA 2010, in Saint-Malo, France, in June 2010. The 28 revised full papers presented were carefully reviewed and selected from the presentations given at five of these workshops. The papers address topics ranging from novel memory architectures to emerging application design and performance analysis, and stem from the following workshops: A4MMC, applications for multi- and many-cores; AMAS-BT, the 3rd workshop on architectural and micro-architectural support for binary translation; EAMA, the 3rd workshop on emerging applications and many-core architectures; WEED, the 2nd workshop on energy-efficient design; and WIOSCA, the annual workshop on the interaction between operating systems and computer architecture.

Table of Contents

Frontmatter

A4MMC: Applications for Multi- and Many-Cores

Accelerating Agent-Based Ecosystem Models Using the Cell Broadband Engine
Abstract
This paper investigates how the parallel streaming capabilities of the Cell Broadband Engine can be used to speed up a class of agent-based plankton models generated from a domain-specific model compiler called the Virtual Ecology Workbench (VEW). We show that excellent speed-ups over a conventional x86 platform can be achieved for the agent update loop. We also show that scalability of the application as a whole is limited by the need to perform particle management, which splits and merges agents in order to keep the global agent count within specified bounds. Furthermore, we identify the size of the PPE L2 cache as the main hardware limitation for this process and give an indication of how to perform the required searches more efficiently.
Michael Lange, Tony Field
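Particle management of the kind this abstract describes, splitting and merging agents to keep the population within bounds, can be sketched as follows. The split/merge rules below (halve the heaviest agent, combine the two lightest) are illustrative assumptions, not the VEW implementation:

```python
# Illustrative particle management: keep the agent count within
# [lo, hi] by splitting heavy agents and merging light ones.
# Each agent carries a weight: the number of plankton individuals
# it represents. Total weight is conserved by both operations.

def manage(agents: list[float], lo: int, hi: int) -> list[float]:
    agents = sorted(agents)
    # Too few agents: split the heaviest into two halves.
    while len(agents) < lo:
        w = agents.pop()
        agents += [w / 2, w / 2]
        agents.sort()
    # Too many agents: merge the two lightest into one.
    while len(agents) > hi:
        a, b = agents.pop(0), agents.pop(0)
        agents.append(a + b)
        agents.sort()
    return agents

print(manage([8.0], lo=3, hi=4))        # [2.0, 2.0, 4.0]
print(manage([1.0] * 6, lo=2, hi=4))    # [1.0, 1.0, 2.0, 2.0]
```

The search for merge candidates over a large agent population is exactly the step the paper identifies as limited by the PPE L2 cache size.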
Performance Impact of Task Mapping on the Cell BE Multicore Processor
Abstract
Current multicores present themselves to programmers as symmetric, with a bus as the communication medium, but are known to be non-symmetric because their interconnect is more complex than a bus. We report on our experiments to map a simple application with communication in a ring onto the SPEs of a Cell BE processor such that performance is optimized. We find that low-level tricks for static mapping do not necessarily achieve optimal performance. Furthermore, we ran exhaustive mapping experiments and observed that (1) performance variations can be significant between consecutive runs, and (2) performance forecasts based on intuitive interconnect behavior models are far from accurate, even for a simple communication pattern.
Jörg Keller, Ana Lucia Varbanescu
Parallelization Strategy for CELL TV
Abstract
Consumer electronics devices are moving forward to utilize multi-core processors. We have developed a series of unique applications for TV based on the Cell Broadband Engine™ (Cell/B.E.). This paper introduces applications made possible by the capabilities of the multi-core processor, and shares the strategy we took to exploit its potential.
Motohiro Takayama, Ryuji Sakai
Towards User Transparent Parallel Multimedia Computing on GPU-Clusters
Abstract
The research area of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia archives and data streams. To satisfy the increasing computational demands of MMCA problems, the use of High Performance Computing (HPC) techniques is essential. As most MMCA researchers are not HPC experts, there is an urgent need for ‘familiar’ programming models and tools that are both easy to use and efficient.
Today, several user transparent library-based parallelization tools exist that aim to satisfy both these requirements. In general, such tools focus on data parallel execution on traditional compute clusters. As yet, however, none of these tools incorporates the use of many-core processors (e.g., GPUs). While traditional clusters are now being transformed into GPU-clusters, programming complexity vastly increases — and the need for easy and efficient programming models is as urgent as ever.
This paper presents our first steps in the direction of obtaining a user transparent programming model for data parallel and hierarchical multimedia computing on GPU-clusters. The model is obtained by extending an existing user transparent parallel programming system (applicable to traditional compute clusters) with a set of CUDA compute kernels. We show our model to be capable of obtaining orders-of-magnitude speed improvements, without requiring any additional effort from the application programmer.
Ben van Werkhoven, Jason Maassen, Frank J. Seinstra
Implementing a GPU Programming Model on a Non-GPU Accelerator Architecture
Abstract
Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel. We demonstrate performance improvements with these optimizations over naïve translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.
Stephen M. Kofsky, Daniel R. Johnson, John A. Stratton, Wen-mei W. Hwu, Sanjay J. Patel, Steven S. Lumetta
On the Use of Small 2D Convolutions on GPUs
Abstract
Computing many small 2D convolutions using FFTs is a basis for a large number of applications in many domains in science and engineering, among them electromagnetic diffraction modeling in physics. The GPU architecture seems to be a suitable architecture to accelerate these convolutions, but reaching high application performance requires substantial development time and non-portable optimizations. In this work, we present the techniques, performance results and considerations to accelerate small 2D convolutions using CUDA, and compare performance to a multi-threaded CPU implementation. To improve programmability and performance of applications that make heavy use of small convolutions, we argue that two improvements to software and hardware are needed: FFT libraries must be extended with a single convolution function and communication bandwidth between CPU and GPU needs to be drastically improved.
Shams A. H. Al Umairy, Alexander S. van Amesfoort, Irwan D. Setija, Martijn C. van Beurden, Henk J. Sips
Can Manycores Support the Memory Requirements of Scientific Applications?
Abstract
Manycores are very effective in scaling parallel computational performance. However, it is not clear if current memory technologies can scale to support such highly parallel processors.
In this paper, we examine the memory bandwidth and footprint required by a number of high-performance scientific applications. We find that such applications require a per-core memory bandwidth of ~300 MB/s and have a memory footprint of roughly 300 MB per core.
When comparing these requirements with the limitations of state-of-the-art DRAM technology, we project that in the scientific domain, current memory technologies will likely scale well to support more than ~100 cores on a single chip, but may become a performance bottleneck for manycores consisting of more than 200 cores.
Milan Pavlovic, Yoav Etsion, Alex Ramirez
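The abstract's projection can be reproduced with back-of-the-envelope arithmetic. The 300 MB/s per-core figure is the paper's measurement; the aggregate DRAM bandwidth values below are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope check of the core-count projection.
# Assumption: a chip-level DRAM bandwidth somewhere in the
# 30-60 GB/s range (illustrative of DDR3-era memory systems).
PER_CORE_BW_MB_S = 300  # measured per-core requirement (from abstract)

def max_cores(chip_bw_gb_s: float, per_core_mb_s: float = PER_CORE_BW_MB_S) -> int:
    """Cores supportable before memory bandwidth becomes the bottleneck."""
    return int(chip_bw_gb_s * 1000 / per_core_mb_s)

print(max_cores(30))   # 100 cores at an assumed 30 GB/s
print(max_cores(60))   # 200 cores at an assumed 60 GB/s
```

Under these assumptions the supportable core count lands in the 100-200 range, matching the window the abstract identifies.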
Parallelizing an Index Generator for Desktop Search
Abstract
Experience with the parallelization of an index generator for desktop search is presented. Several configurations of the index generator are compared on three different Intel platforms with 4, 8, and 32 cores. The optimal configurations for these platforms are not intuitive and are markedly different for the three platforms. For finding the optimal configuration, detailed measurements and experimentation were necessary. Several recommendations for parallel software design are derived from this study.
David J. Meder, Walter F. Tichy

AMAS-BT: 3rd Workshop on Architectural and Micro-Architectural Support for Binary Translation

Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks
Abstract
The world needs special-purpose accelerators to meet future constraints on computation and power consumption. Choosing appropriate accelerator architectures is a key challenge. In this work, we present a pintool designed to help evaluate the potential benefit of accelerating a particular function. Our tool gathers cross-procedural data usage patterns, including implicit dependencies not captured by arguments and return values. We then use this data to characterize the limits that on-chip communication and storage systems impose on hardware procedural acceleration. Through an understanding of the bottlenecks in future accelerator-based systems, we will focus future research on the most performance-critical regions of the design. Accelerator designers will also find our tool useful for selecting which regions of their application to accelerate.
Martha A. Kim, Stephen A. Edwards
Trace Execution Automata in Dynamic Binary Translation
Abstract
Program performance can be dynamically improved by optimizing its frequent execution traces. Once traces are collected, they can be analyzed and optimized based on the dynamic information derived from the program’s previous runs. The ability to record traces is thus central to any dynamic binary translation system. Recording traces, as well as loading them for use in different runs, requires code replication to represent the trace. This paper presents a novel technique which records execution traces by using an automaton called TEA (Trace Execution Automata). Contrary to other approaches, TEA stores traces implicitly, without the need to replicate execution code. TEA can also be used to simulate trace execution in a separate environment, to store profile information about the generated traces, as well as to instrument optimized versions of the traces. In our experiments, we showed that TEA decreases the memory needed to represent the traces (nearly 80% savings).
João Porto, Guido Araujo, Edson Borin, Youfeng Wu
ISAMAP: Instruction Mapping Driven by Dynamic Binary Translation
Abstract
Dynamic Binary Translation (DBT) techniques have been largely used in the migration of legacy code and in the transparent execution of programs across different architectures. They have also been used in dynamic optimizing compilers to collect runtime information so as to improve code quality. In many cases, the DBT translation mechanism misses important low-level mapping opportunities available at the source/target ISAs. Hot code performance has been shown to be central to overall program performance, as different instruction mappings can account for high performance gains. Hence, DBT techniques that provide efficient instruction mapping at the ISA level have the potential to considerably improve performance. This paper proposes ISAMAP, a flexible instruction mapping driven by dynamic binary translation. Its mapping mechanism provides a fast translation between ISAs under an easy-to-use description. In its current state, ISAMAP is capable of translating 32-bit PowerPC code to 32-bit x86 and of performing local optimizations on the resulting x86 code. Our experimental results show that ISAMAP executes PowerPC code on an x86 host faster than the processor emulator QEMU, achieving speedups of up to 3.16x for SPEC CPU2000 programs.
Maxwell Souza, Daniel Nicácio, Guido Araújo

EAMA: 3rd Workshop for Emerging Applications and Many-Core Architectures

Parallelization of Particle Filter Algorithms
Abstract
This paper presents the parallelization of the particle filter algorithm in a single-target video tracking application. We demonstrate the process by which we parallelized the particle filter algorithm, beginning with a MATLAB implementation. The final CUDA program provided approximately a 71x speedup over the initial MATLAB implementation.
Matthew A. Goodrum, Michael J. Trotter, Alla Aksel, Scott T. Acton, Kevin Skadron
What Kinds of Applications Can Benefit from Transactional Memory?
Abstract
We discuss the current state of research on transactional memory, with a focus on what characteristics of applications may make them more or less likely to be able to benefit from using transactional memory, and how this issue can and should influence ongoing and future research.
Mark Moir, Dan Nussbaum
Characteristics of Workloads Using the Pipeline Programming Model
Abstract
Pipeline parallel programming is a frequently used model to program applications on multiprocessors. Despite its popularity, there is a lack of studies of the characteristics of such workloads. This paper gives an overview of the pipeline model and its typical implementations for multiprocessors. We present implementation choices and analyze their impact on the program. We furthermore show that workloads that use the pipeline model have their own unique characteristics that should be considered when selecting a set of benchmarks. Such information can be beneficial for program developers as well as for computer architects who want to understand the behavior of applications.
Christian Bienia, Kai Li

WEED: 2nd Workshop on Energy Efficient Design

The Search for Energy-Efficient Building Blocks for the Data Center
Abstract
This paper conducts a survey of several small clusters of machines in search of the most energy-efficient data center building block targeting data-intensive computing. We first evaluate the performance and power of single machines from the embedded, mobile, desktop, and server spaces. From this group, we narrow our choices to three system types. We build five-node homogeneous clusters of each type and run Dryad, a distributed execution engine, with a collection of data-intensive workloads to measure the energy consumption per task on each cluster. For this collection of data-intensive workloads, our high-end mobile-class system was, on average, 80% more energy-efficient than a cluster with embedded processors and at least 300% more energy-efficient than a cluster with low-power server processors.
Laura Keys, Suzanne Rivoire, John D. Davis
KnightShift: Shifting the I/O Burden in Datacenters to Management Processor for Energy Efficiency
Abstract
Data center energy costs are a growing concern. Many datacenters use a direct-attached-storage architecture where data is distributed across disks attached to several servers. In this organization, even if a server is not utilized it cannot be turned off, since each server carries a fraction of the permanent state needed to complete a request. Operating servers at low utilization is very inefficient due to the lack of energy proportionality. In this research we propose to use the out-of-band management processor, typically used for remotely managing a server, to satisfy I/O requests from a remote server. By handling requests with limited processing needs, the management processor takes the load off the primary server, thereby allowing the primary server to sleep when not actively being used; we call this approach KnightShift. We describe how existing management processors can be modified to handle the KnightShift responsibility. We use several production datacenter traces to evaluate the energy impact of KnightShift and show that energy consumption can be reduced by 2.6X by allowing management processors to handle only those requests that demand less than 5% of the primary CPU utilization.
Sabyasachi Ghosh, Mark Redekopp, Murali Annavaram
Guarded Power Gating in a Multi-core Setting
Abstract
Power gating is an increasingly important actuation knob in chip-level dynamic power management. In a multi-core setting, a key design issue in this context is determining the right balance between gating at the unit level (within a core) and at the core level. Another issue is how to architect the predictive control associated with such gating, in order to ensure maximal power savings at minimal performance loss. We use an abstract, analytical modeling framework to understand and discuss the fundamental tradeoffs in such a design. We consider plausible ranges of software/hardware control latencies and workload characteristics to understand when and where it makes sense to disable one or both of the gating mechanisms (i.e., intra- and inter-core). The overall goal of this research is to devise predictive power gating algorithms in a multi-core setting, with built-in “guard” mechanisms to prevent negative outcomes: e.g., a net increase in power consumption or an unacceptable level of performance loss.
Niti Madan, Alper Buyuktosunoglu, Pradip Bose, Murali Annavaram
Using Partial Tag Comparison in Low-Power Snoop-Based Chip Multiprocessors
Abstract
In this work we introduce power optimizations relying on partial tag comparison (PTC) in snoop-based chip multiprocessors. Our optimizations rely on the observation that detecting tag mismatches in a snoop-based chip multiprocessor does not require aggressively processing the entire tag. In fact, a high percentage of cache mismatches can be detected by utilizing a small but highly informative subset of the tag bits.
Based on this, we introduce a source-based snoop filtering mechanism referred to as S-PTC. In S-PTC possible remote tag mismatches are detected prior to sending the request. We reduce power as S-PTC prevents sending unnecessary snoops and avoids unessential tag lookups at the end-points. Furthermore, S-PTC improves performance as a result of early cache miss detection.
S-PTC improves average performance by 2.9% to 3.5% across different configurations for the SPLASH-2 benchmarks used in this study. Our solutions reduce snoop request bandwidth by 78.5% to 81.9% and average tag-array dynamic power by about 52%.
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
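The key observation behind PTC — that a small, informative subset of tag bits suffices to detect most mismatches before any full tag lookup — can be sketched as follows. The bit width and the choice of low-order tag bits are illustrative assumptions, not the paper's design:

```python
# Sketch of partial tag comparison (PTC): compare only a few tag
# bits first. A partial mismatch guarantees a full mismatch, so
# filtered-out entries never need the (power-hungry) full compare.
PARTIAL_BITS = 4  # assumed width of the partial tag

def partial_tag(tag: int) -> int:
    """Extract the low-order PARTIAL_BITS bits of a tag."""
    return tag & ((1 << PARTIAL_BITS) - 1)

def snoop_filter(request_tag: int, cached_tags: list[int]) -> list[int]:
    """Return only the cached tags that survive the partial compare
    and therefore still require a full tag comparison."""
    p = partial_tag(request_tag)
    return [t for t in cached_tags if partial_tag(t) == p]

# Only the first entry shares the request's low 4 bits, so only it
# proceeds to the full tag lookup.
candidates = snoop_filter(0b1010_0110, [0b1111_0110, 0b0000_0001])
print([bin(t) for t in candidates])
```

A source-based variant like S-PTC runs this filter at the requester, so snoops that cannot match are never sent at all.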
Achieving Power-Efficiency in Clusters without Distributed File System Complexity
Abstract
Power-efficient operation is a desirable property, particularly for large clusters housed in datacenters. Recent work has advocated turning off entire nodes to achieve power-proportionality, but this leads to problems with availability and fault tolerance because of the resulting limits imposed on the replication strategies used by the distributed file systems (DFS) employed in these environments; counter-measures add substantial complexity to DFS designs. To achieve power-efficiency for a cluster without impacting data availability and recovery from failures, while maintaining simplicity in DFS design, our solution exploits cluster nodes that can operate in at least two extreme system-level power states, characterized by minimum vs. maximum power consumption and performance. The paper describes a cluster built with power-efficient node prototypes and presents experimental evaluations to demonstrate power-efficiency.
Hrishikesh Amur, Karsten Schwan
What Computer Architects Need to Know about Memory Throttling
Abstract
Memory throttling is one technique for power and energy management that is currently available in commercial systems, yet has received little attention in the architecture community. This paper provides an overview of memory throttling: how it works, how it affects performance, and how it controls power. We provide measured power and performance data with memory throttling on a commercial blade system, and discuss key issues for power management with memory throttling mechanisms.
Heather Hanson, Karthick Rajamani
Predictive Power Management for Multi-core Processors
Abstract
Predictive power management provides reduced power consumption and increased performance compared to reactive schemes. It effectively reduces the lag between workload phase changes and changes in power adaptations since adaptations can be applied immediately before a program phase change. To this end we present the first analysis of prediction for power management under SYSMark2007. Compared to traditional scientific/computing benchmarks, this workload demonstrates more complex core active and idle behavior. We analyze a table based predictor on a quad-core processor. We present an accurate runtime power model that accounts for fine-grain temperature and voltage variation. By predictively borrowing power from cores, our approach provides an average speedup of 7.3% in SYSMark2007.
William Lloyd Bircher, Lizy John

WIOSCA: 6th Annual Workshop on the Interaction between Operating Systems and Computer Architecture

IOMMU: Strategies for Mitigating the IOTLB Bottleneck
Abstract
The input/output memory management unit (IOMMU) was recently introduced into mainstream computer architecture when both Intel and AMD added IOMMUs to their chip-sets. An IOMMU provides memory protection from I/O devices by enabling system software to control which areas of physical memory an I/O device may access. However, this protection incurs additional direct memory access (DMA) overhead due to the required address resolution and validation.
IOMMUs include an input/output translation lookaside buffer (IOTLB) to speed up address resolution, but every IOTLB cache miss still causes a substantial increase in DMA latency and performance degradation of DMA-intensive workloads. In this paper we first demonstrate the potential negative impact of IOTLB cache misses on workload performance. We then propose both system software and hardware enhancements to reduce the IOTLB miss rate and accelerate address resolution. These enhancements can lead to a reduction of over 60% in IOTLB miss rate for common I/O-intensive workloads.
Nadav Amit, Muli Ben-Yehuda, Ben-Ami Yassour
Improving Server Performance on Multi-cores via Selective Off-Loading of OS Functionality
Abstract
Modern and future server-class processors will incorporate many cores. Some studies have suggested that it may be worthwhile to dedicate some of the many cores for specific tasks such as operating system execution. OS off-loading has two main benefits: improved performance due to better cache utilization and improved power efficiency due to smarter use of heterogeneous cores. However, OS off-loading is a complex process that involves balancing the overheads of off-loading against the potential benefit, which is unknown while making the off-loading decision. In prior work, OS off-loading has been implemented by first profiling system call behavior and then manually instrumenting some OS routines (out of hundreds) to support off-loading. We propose a hardware-based mechanism to help automate the off-load decision-making process, and provide high quality dynamic decisions via performance feedback. Our mechanism dynamically estimates the off-load requirements of the application and relies on a run-length predictor for the upcoming OS system call invocation. The resulting hardware based off-loading policy yields a throughput improvement of up to 18% over a baseline without off-loading, 13% over a static software based policy, and 23% over a dynamic software based policy.
David Nellans, Kshitij Sudan, Erik Brunvand, Rajeev Balasubramonian
Performance Characteristics of Explicit Superpage Support
Abstract
Many modern processors support more than one page size. In the 1990s the larger pages, called superpages, were identified as one means of reducing the time spent servicing Translation Lookaside Buffer (TLB) misses by increasing TLB reach. Transparent usage of superpages has seen limited support due to architectural limitations, the cost of monitoring and implementing promotion/demotion, the uncertainty of whether superpages will be a performance boost, and the decreasing cost of TLB misses due to hardware innovations. As significant modifications are required to transparently support superpages, the perception is that the cost of transparency will exceed the benefits for real workloads.
This paper describes a mechanism by which processes can explicitly request that memory be backed by superpages; the mechanism is cross-platform, incurs no measurable cost, and is suitable for use in a general operating system. Because it does not impact base-page performance, it establishes a baseline metric against which alternative superpage implementations can be compared. A reservation scheme for superpages is used at mmap() time that guarantees faults without depending on pre-faulting, the fragmentation state of the system, or demotion strategies. We describe how to back different regions of memory using explicit superpage support without application modification, and present an evaluation of an implementation running a range of workloads.
Mel Gorman, Patrick Healy
Interfacing Operating Systems and Polymorphic Computing Platforms Based on the MOLEN Programming Paradigm
Abstract
The MOLEN programming paradigm was proposed to offer general, function-like execution of the computation-intensive parts of programs on the reconfigurable fabric of polymorphic computing platforms. Within the MOLEN programming paradigm, the MOLEN SET and EXECUTE primitives are employed to map an arbitrary function onto the reconfigurable hardware. However, in their current form these instructions are intended for a single-application execution scenario. In this paper, we extend the semantics of MOLEN SET and EXECUTE to a more generalized approach that supports multi-application, multitasking scenarios. The new SET and EXECUTE are APIs added to the operating system runtime. We use these APIs to abstract the concept of a task from its actual implementation. Our experiments show that the proposed approach has negligible overhead over the overall application execution.
Mojtaba Sabeghi, Koen Bertels
Extrinsic and Intrinsic Text Cloning
Abstract
Text cloning occurs when a processor stores the same text in its shared caches multiple times. There are several causes of text cloning, and we classify them as either extrinsic or intrinsic.
Extrinsic text cloning can happen due to user and software practices, or middleware policies, which result in multiple copies of a binary being made and executed concurrently on the same processor.
Intrinsic text cloning can happen when an instruction cache is virtually indexed/virtually tagged. A simultaneous multithreaded processor that employs such a cache will map different processes of the same binary to different instruction-cache space because of their distinct process identifiers.
Text cloning can be wasteful to performance, especially for simultaneous multithreaded processors, because concurrent processes compete for cache space to store the same instruction blocks.
Experimental results on simultaneous multithreaded processors indicate that the performance overhead of this type of undesirable cloning is significant.
Marios Kleanthous, Yiannakis Sazeides, Marios D. Dikaiakos
A Case for Coordinated Resource Management in Heterogeneous Multicore Platforms
Abstract
Recent advances in multi- and many-core architectures include increased hardware-level parallelism (i.e., core counts) and the emergence of platform-level heterogeneity. System software managing these platforms is typically comprised of multiple independent resource managers (e.g., drivers and specialized runtimes) customized for heterogeneous vs. general purpose platform elements. This independence, however, can cause performance degradation for an application that spans diverse cores and resource managers, unless managers coordinate with each other to better service application needs. This paper first presents examples that demonstrate the need for coordination among multiple resource managers on heterogeneous multicore platforms. It then presents useful coordination schemes for a platform coupling an IXP network processor with x86 cores and running web and multimedia applications. Experimental evidence of performance gains achieved through coordinated management motivates a case for standard coordination mechanisms and interfaces for future heterogeneous many-core systems.
Priyanka Tembey, Ada Gavrilovska, Karsten Schwan
Topology-Aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors
Abstract
Power limitations and complexity constraints demand modular designs, such as chip multiprocessors (CMPs) and systems-on-chip (SOCs). Today’s CMPs feature up to a hundred discrete cores, with greater levels of integration anticipated in the future. Supporting effective on-chip resource sharing for cloud computing and server consolidation necessitates CMP-level quality-of-service (QOS) for performance isolation, service guarantees, and security. This work takes a topology-aware approach to on-chip QOS. We propose to segregate shared resources into dedicated, QOS-enabled regions of the chip. We then eliminate QOS-related hardware and its associated overheads from the rest of the die via a combination of topology and operating system support. We evaluate several topologies for the QOS-enabled regions, including a new organization called Destination Partitioned Subnets (DPS), which uses a light-weight dedicated network for each destination node. DPS matches or bests other topologies with comparable bisection bandwidth in performance, area- and energy-efficiency, fairness, and preemption resilience.
Boris Grot, Stephen W. Keckler, Onur Mutlu
Backmatter
Metadata
Title
Computer Architecture
Edited by
Ana Lucia Varbanescu
Anca Molnos
Rob van Nieuwpoort
Copyright year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-24322-6
Print ISBN
978-3-642-24321-9
DOI
https://doi.org/10.1007/978-3-642-24322-6