
2010 | Book

High Performance Embedded Architectures and Compilers

5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010. Proceedings

Edited by: Yale N. Patt, Pierfrancesco Foglia, Evelyn Duesterwald, Paolo Faraboschi, Xavier Martorell

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 5th International Conference on High Performance Embedded Architectures and Compilers, HiPEAC 2010, held in Pisa, Italy, in January 2010. The 23 revised full papers presented together with the abstracts of 2 invited keynote addresses were carefully reviewed and selected from 94 submissions. The papers are organized in topical sections on architectural support for concurrency; compilation and runtime systems; reconfigurable and customized architectures; multicore efficiency, reliability, and power; memory organization and optimization; and programming and analysis of accelerators.

Table of Contents

Frontmatter

Invited Program

Embedded Systems as Datacenters
(Keynote)
Abstract
Designing embedded systems the way we did 20 years ago is still alive and well. As expected, with declining costs, embedded systems are appearing in more and more applications. Advances to the state of the art in creating such systems, where memory and processor are precious resources, have continued, and this work is to be applauded. Just as interestingly, however, embedded systems are taking on not only problems that require loosening the memory and processor constraints, but also problems that push us into the domain of datacenter design, where systems thinking about heterogeneous parallel processing, data archiving, mission-critical connectivity, and energy management is a first-order concern. In this talk I will illustrate the latter case with some recent explorations and give some attention to how datacenter thinking can be effectively ushered into this new domain.
Bob Iannucci
Larrabee: A Many-Core Intel Architecture for Visual Computing
(Keynote)
Abstract
This talk will describe a many-core visual computing architecture code-named Larrabee. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. The talk will give an overview of the Larrabee architecture and will cover the LRB vector ISA in detail. We’ll then cover the Larrabee programming model and finally close with how one would target Larrabee for high-performance 3D graphics.
Roger Espasa

Architectural Support for Concurrency

Remote Store Programming
A Memory Model for Embedded Multicore
Abstract
This paper presents remote store programming (RSP), a programming paradigm which combines usability and efficiency through the exploitation of a simple hardware mechanism, the remote store, which can easily be added to existing multicores. The RSP model and its hardware implementation trade a relatively high store latency for a low load latency, because loads are more common than stores and it is easier to tolerate store latency than load latency. This paper demonstrates the performance advantages of remote store programming by comparing it to cache-coherent shared memory (CCSM) for several important embedded benchmarks using the TILEPro64 processor. RSP is shown to be faster than CCSM for all eight benchmarks using 64 cores. For five of the eight benchmarks, RSP is more than 1.5× faster than CCSM. For a 2D FFT implemented on 64 cores, RSP is over 3× faster than CCSM. RSP’s features, performance, and hardware simplicity make it well suited to the embedded processing domain.
Henry Hoffmann, David Wentzlaff, Anant Agarwal
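
To make the paradigm concrete, the sketch below shows what producer/consumer code looks like in a remote-store style. The rsp_* calls are hypothetical placeholders invented for illustration, not the paper's actual API: the consumer exposes a buffer in its core-local memory, the producer obtains a store-only alias to it, and every load stays local.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical runtime entry points (invented for this sketch). */
extern void *rsp_expose(void *local_buf, size_t len);   /* consumer side */
extern void *rsp_attach(int consumer_core, size_t len); /* producer side */
extern void  rsp_fence(void);                           /* order remote stores */

#define N 1024

void producer(int consumer_core) {
    /* Store-only alias into the consumer's core-local memory. */
    uint32_t *out = rsp_attach(consumer_core, N * sizeof(uint32_t));
    for (int i = 0; i < N; i++)
        out[i] = i * i;   /* remote store: high latency, easily pipelined */
    rsp_fence();          /* make all stores visible before signaling */
}

void consumer(uint32_t *local_buf) {
    rsp_expose(local_buf, N * sizeof(uint32_t));
    /* ... synchronize with the producer, then compute ... */
    uint64_t sum = 0;
    for (int i = 0; i < N; i++)
        sum += local_buf[i];   /* every load is local, hence low latency */
    (void)sum;
}
```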
Low-Overhead, High-Speed Multi-core Barrier Synchronization
Abstract
Whereas efficient barrier implementations were once a concern only in high-performance computing, recent trends in core integration make the topic relevant even for general-purpose CMPs. While the nature of CMP applications requires low latency, the cost of low-latency barrier implementations using hardware-based techniques can be prohibitive for CMPs, where die area represents opportunities for throughput and yield. Similarly, whereas traditional multiprocessor barrier implementations were developed primarily for dedicated environments, scheduling and multi-programming on CMPs require more adaptable barrier implementations.
In this paper, we present and evaluate three barrier implementations that are hybrids of software and dedicated hardware barriers and are specifically tailored for CMPs. The implementations leverage the unique characteristics of CMPs and provide low latency comparable to that of dedicated hardware networks at a fraction of the cost. The implementations also support adaptability, enabling efficient multi-programming and dynamic remapping of the barrier network.
John Sartori, Rakesh Kumar
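
For context, here is the classic software-only baseline that hardware-assisted CMP barriers aim to beat on latency: a centralized sense-reversing barrier, written here with C11 atomics. This is a textbook construction, not one of the paper's hybrid designs.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;    /* threads still to arrive in this episode */
    atomic_bool sense;    /* global sense, flipped once per episode */
    int         nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int nthreads) {
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread passes its own local_sense flag, flipped on every call. */
void barrier_wait(barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        /* The spin-until-broadcast below is precisely the step that
         * dedicated barrier hardware accelerates. */
        while (atomic_load(&b->sense) != *local_sense)
            ;
    }
}
```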
Improving Performance by Reducing Aborts in Hardware Transactional Memory
Abstract
The optimistic nature of Transactional Memory (TM) systems can lead to the concurrent execution of transactions that are later found to conflict. Conflicts degrade scalability, and may lead to aborts that increase wasted work, and degrade performance. A promising approach to reducing conflicts at runtime is dynamically, and transparently, reordering the execution of transactions upon discovery of conflicts. This approach has been explored in Software TMs (STMs), but not in Hardware TMs (HTMs). Furthermore, STM implementations of this approach cannot be ported to HTMs easily.
This paper investigates the feasibility of such reordering in HTMs, and presents two designs that are scalable, independent of the on-chip interconnect, require only minor modifications to each core, and add no execution overhead if no conflicts occur. The evaluation takes LogTM-SE as a baseline and considers benchmarks with different levels of contention (transactional conflicts). The results show that the preferred design increases HTM performance by up to 17% when contention is low, 57% when contention is high, and never degrades performance. Finally, the designs are orthogonal to LogTM-SE; they require no modification to cache structures, and continue to support transaction virtualization, open and closed unbounded nesting, paging, thread suspension, and thread migration.
Mohammad Ansari, Behram Khan, Mikel Luján, Christos Kotselidis, Chris Kirkham, Ian Watson
Energy and Throughput Efficient Transactional Memory for Embedded Multicore Systems
Abstract
We propose a new design for an energy-efficient hardware transactional memory (HTM) system for power-aware embedded devices. Prior hardware transactional memory designs proposed a small, fully associative transactional cache at the same level as the L1 cache. We propose an alternative design that unifies the transactional and L1 caches, and provides a small victim cache to reduce the effects of capacity and conflict evictions. We evaluate our new HTM scheme on a variety of benchmarks, both in terms of energy and performance, and show that the victim cache scheme can provide up to a 4X improvement in energy-delay product compared to a traditional HTM scheme that uses a separate transactional cache.
Cesare Ferri, Samantha Wood, Tali Moreshet, Iris Bahar, Maurice Herlihy

Compilation and Runtime Systems

Split Register Allocation: Linear Complexity Without the Performance Penalty
Abstract
Just-in-time compilers are becoming ubiquitous, spurring the design of more efficient algorithms and more elaborate intermediate representations. They rely on continuous, feedback-directed (re-)compilation frameworks to adaptively select a limited set of hot functions for aggressive optimization. To date, (quasi-)linear complexity has remained a driving force in the design of just-in-time optimizers.
This paper describes a split register allocator showing that linear complexity does not imply reduced code quality. We present a split compiler design, where more expensive ahead-of-time analyses guide lightweight just-in-time optimizations. A split register allocator can be very aggressive in its offline stage, producing a semantic summary through bytecode annotations that can be processed by a lightweight online stage. The challenges are fourfold: (sub-)linear-size annotation, linear-time online processing, minimal loss of code quality, and portability of the annotation.
We propose a split register allocator meeting these challenges. A compact annotation derived from an optimal integer linear program (ILP) formulation of register allocation drives a linear-time algorithm near optimality. We study the robustness of this algorithm to variations in the number of physical registers. Our method is implemented in JikesRVM and evaluated on standard benchmarks.
Boubacar Diouf, Albert Cohen, Fabrice Rastello, John Cavazos
Trace-Based Data Layout Optimizations for Multi-core Processors
Abstract
The focus of this paper is on cache-conscious data layout optimizations. Although these optimizations have already been adopted by industrial compilers, they have been shown to be inefficient for multi-process applications on multi-core platforms. Factors such as the asymmetric distribution of processes over hardware resources (cores, CPUs, or hardware threads), along with their temporal migrations, unpredictably influence optimization results. Here we present a new methodology that extends classical data layout optimizations to support multi-core architectures. Based on data trace collection that reflects the actual interleaving of data accesses, this method aims to improve the spatial locality of the data while mitigating potential false sharing. Introducing architectural characteristics into the analysis phase further increases the accuracy of data affinity estimation. A feasibility study of this method, applied to the multi-process web server lighttpd on a Power5 machine, not only showed a performance improvement but also demonstrated its suitability for incorporation into an industrial compiler.
Olga Golovanevsky, Alon Dayan, Ayal Zaks, David Edelsohn
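
As a generic illustration of what a cache-conscious layout optimization produces (not the paper's algorithm), consider hot/cold structure splitting, which packs the fields that are accessed together onto the same cache lines:

```c
/* Before: every element drags cold payload through the cache, and two
 * threads updating adjacent elements may false-share a line. */
struct conn_before {
    int  fd;             /* hot: touched on every event */
    int  state;          /* hot */
    char peer_name[48];  /* cold: touched only on connect and logging */
    long bytes_total;    /* cold */
};

/* After: hot fields are packed densely in their own array, so a cache
 * line holds many useful records; cold fields live in a parallel array. */
struct conn_hot  { int fd; int state; };
struct conn_cold { char peer_name[48]; long bytes_total; };

struct conn_table {
    struct conn_hot  *hot;   /* dense, cache-friendly hot records */
    struct conn_cold *cold;  /* same index, touched rarely */
};
```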
Buffer Sizing for Self-timed Stream Programs on Heterogeneous Distributed Memory Multiprocessors
Abstract
Stream programming is a promising way to expose concurrency to the compiler. A stream program is built from kernels that communicate only via point-to-point streams. The stream compiler statically allocates these kernels to processors, applying blocking, fission and fusion transformations. The compiler determines the sizes of the communication buffers, which affects performance since local memories can be small.
In this paper, we propose a feedback-directed algorithm that determines the size of each communication buffer, based on i) the stream program that has been mapped onto processors, ii) feedback from an earlier execution, and iii) the memory constraints. The algorithm exposes a trade-off between throughput and latency. It is general, in that it applies to stream programs with unstructured stream graphs, and it supports variable execution times and communication rates.
We show results for the StreamIt benchmarks and random graphs. For the StreamIt benchmarks, throughput is optimal after the first iteration. For random graphs with stochastic computation times, throughput is within 3% of optimal after four iterations. Compared with the previous general algorithm, by Basten and Hoogerbrugge, our algorithm has significantly better performance and latency.
Paul M. Carpenter, Alex Ramirez, Eduard Ayguadé
Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
Abstract
Graphics processors are increasingly used in scientific applications due to their high computational power, which comes from hardware with multiple levels of parallelism and a memory hierarchy. Sparse matrix computations frequently arise in scientific applications, for example, when solving PDEs on unstructured grids. However, traditional sparse matrix algorithms are difficult to parallelize efficiently for GPUs due to irregular patterns of memory references. In this paper we present a new storage format for sparse matrices that better exploits locality, has a low memory footprint, and enables automatic specialization for various matrices and future devices via parameter tuning. Experimental evaluation demonstrates significant speedups compared to previously published results.
Alexander Monakov, Anton Lokhmotov, Arutyun Avetisyan
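
For reference, the sketch below is the standard sequential CSR sparse matrix-vector product that specialized GPU formats such as the one proposed here are designed to improve on; the indirect gather through col[] is what produces the irregular memory references mentioned in the abstract.

```c
#include <stddef.h>

/* Reference sequential CSR sparse matrix-vector product, y = A*x.
 * The gather through col[] is the irregular access pattern that makes
 * naive GPU ports inefficient and motivates specialized formats. */
void spmv_csr(size_t nrows,
              const size_t *rowptr,   /* nrows + 1 row start offsets */
              const size_t *col,      /* column index of each nonzero */
              const double *val,      /* value of each nonzero */
              const double *x,
              double *y) {
    for (size_t i = 0; i < nrows; i++) {
        double acc = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            acc += val[k] * x[col[k]];   /* irregular gather from x */
        y[i] = acc;
    }
}
```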

Reconfigurable and Customized Architectures

Virtual Ways: Efficient Coherence for Architecturally Visible Storage in Automatic Instruction Set Extensions
Abstract
Customizable processors augmented with application-specific Instruction Set Extensions (ISEs) have begun to gain traction in recent years. The most effective ISEs include Architecturally Visible Storage (AVS), compiler-controlled memories accessible exclusively to the ISEs. Unfortunately, the use of AVS memories creates a coherence problem with the data cache. A multiprocessor coherence protocol can solve the problem; however, this is an expensive solution when applied in a uniprocessor context. Instead, we can solve the problem by modifying the cache controller so that the AVS memories function as extra ways of the cache with respect to coherence, but are not generally accessible as extra ways under normal software execution. This solution, which we call Virtual Ways, is less costly than a hardware coherence protocol and eliminates coherence messages from the system bus, which reduces energy consumption. Moreover, eliminating these messages makes Virtual Ways significantly more robust to performance degradation when there is a significant disparity in clock frequency between the processor and main memory.
Theo Kluter, Samuel Burri, Philip Brisk, Edoardo Charbon, Paolo Ienne
Accelerating XML Query Matching through Custom Stack Generation on FPGAs
Abstract
Publish-subscribe systems represent the state of the art in information dissemination to multiple users. Such systems have evolved from simple topic-based systems to the current XML-enabled ones, in which users pose complex queries (expressed in XPath) on the structure and content of streaming documents. The parts of the documents that match the user queries are then returned to the users. This paper proposes a novel hardware architecture that exploits the parallelism found in XPath filtering systems. Using an incoming XML stream, parsing and matching against thousands of user profiles are performed simultaneously on a single FPGA, yielding up to three orders of magnitude higher throughput than conventional approaches bound by the sequential nature of software computing. By converting XPath expressions into custom stacks, our architecture is the first to provide full support for all structural XPath constructs, including parent-child and ancestor-descendant relations, while allowing wildcarding and recursion.
Roger Moussalli, Mariam Salloum, Walid Najjar, Vassilis Tsotras
An Application-Aware Load Balancing Strategy for Network Processors
Abstract
This paper presents and compares different load balancing strategies for multi-core network processor (NP) chips. In our FlexPath NP system, packets are differentiated according to application-dependent processing requirements, and optimized processing paths are provisioned for these applications. We derive a novel load balancing mechanism (S&H) by combining two schemes for stateful and stateless network applications in order to achieve better overall system throughput and reduced packet latencies. We show that appropriate QoS for the different application types considered can be achieved under varying NP load conditions, while maintaining an almost uniform utilization of the available processing resources. Even though the investigations focus on the FlexPath NP architecture, the concepts can also be applied to other architectures where the incoming load has to be distributed among several parallel entities within an NP.
Rainer Ohlendorf, Michael Meitinger, Thomas Wild, Andreas Herkersdorf
Memory-Aware Application Mapping on Coarse-Grained Reconfigurable Arrays
Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) are a very promising platform, providing both software programmability and power efficiency of up to 10-100 MOps/mW. However, this cardinal promise of CGRAs critically hinges on the effectiveness of application mapping onto CGRA platforms. While previous solutions have greatly improved computation speed, they have largely ignored the impact of the local memory architecture on the achievable power and performance. This paper motivates the need for memory-aware application mapping for CGRAs, and proposes an effective mapping solution that considers the effects of various memory architecture parameters, including the number of banks, the local memory size, and the communication bandwidth between the local memory and the external main memory. Our proposed solution achieves a 62% reduction in the energy-delay product, which factors into about 47% and 28% reductions in energy consumption and runtime, respectively, compared to memory-unaware mapping for realistic local memory architectures. We also show that our scheme scales across a range of applications and memory parameters.
Yongjoo Kim, Jongeun Lee, Aviral Shrivastava, Jonghee Yoon, Yunheung Paek

Multicore Efficiency, Reliability, and Power

Maestro: Orchestrating Lifetime Reliability in Chip Multiprocessors
Abstract
As CMOS feature sizes venture deep into the nanometer regime, wearout mechanisms including negative-bias temperature instability and time-dependent dielectric breakdown can severely reduce processor operating lifetimes and performance. This paper presents an introspective reliability management system, Maestro, to tackle reliability challenges in future chip multiprocessors (CMPs) head-on. Unlike traditional approaches, Maestro relies on low-level sensors to monitor the CMP as it ages (introspection). Leveraging this real-time assessment of CMP health, runtime heuristics identify wearout-centric job assignments (management). By exploiting the complementary effects of the natural heterogeneity (due to process variation and wearout) that exists in CMPs and the diversity found in system workloads, Maestro composes job schedules that intelligently control the aging process. Monte Carlo experiments show that Maestro significantly enhances lifetime reliability through intelligent wear-leveling, increasing the expected service life of a population of 16-core CMPs by as much as 38% compared to a naive, round-robin scheduler. Furthermore, in the presence of process variation, Maestro’s wearout-centric scheduling outperformed both performance counter and temperature sensor based schedulers, achieving an order of magnitude more improvement in lifetime throughput – the amount of useful work done by a system prior to failure.
Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott Mahlke
Combining Locality Analysis with Online Proactive Job Co-scheduling in Chip Multiprocessors
Abstract
The shared-cache contention on Chip Multiprocessors causes performance degradation to applications and hurts system fairness. Many previously proposed solutions schedule programs according to runtime sampled cache performance to reduce cache contention. The strong dependence on runtime sampling inherently limits the scalability and effectiveness of those techniques. This work explores the combination of program locality analysis with job co-scheduling. The rationale is that program locality analysis typically offers a large-scope view of various facets of an application including data access patterns and cache requirement. That knowledge complements the local behaviors sampled by runtime systems. The combination offers the key to overcoming the limitations of prior co-scheduling techniques.
Specifically, this work develops a lightweight locality model that enables efficient, proactive prediction of the performance of co-running processes, offering the potential for integration into online scheduling systems. Compared to existing multicore scheduling systems, the technique reduces performance degradation by 34% (a 7% performance improvement) and unfairness by 47%. Its proactivity makes it resilient to the scalability issues that constrain the applicability of previous techniques.
Yunlian Jiang, Kai Tian, Xipeng Shen
RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor
Abstract
In order to reduce the register file's peak temperature in an embedded processor, we propose RELOCATE: an architectural solution that redistributes the access pattern to physical registers through a novel register allocation mechanism. RELOCATE regionalizes the register file such that, even though accesses within a region are uniformly distributed, activity levels are spread over the entire register file in a deterministic pattern. It partitions the register file and uses a micro-architectural mechanism to concentrate accesses in a single partition or a subset of partitions. The goal is to keep some partitions unused (idle) and cooling down. The temperature of idle partitions is further reduced by power gating them into a destructive sleep mode to reduce their leakage power. The redistribution mechanism changes the active region periodically to modulate activity within the register file and prevent the active region from heating up excessively. Our approach resulted in an average reduction of 8.3°C in the register file's peak temperature on standard benchmarks.
Houman Homayoun, Aseem Gupta, Alex Veidenbaum, Avesta Sasan (M.A. Makhzan), Fadi Kurdahi, Nikil Dutt
Performance and Power Aware CMP Thread Allocation Modeling
Abstract
We address the problem of performance- and power-efficient thread allocation in a CMP. To that end, based on an analytical model, we introduce a parameterized performance/power metric that can be adjusted according to a preferred tradeoff between performance and power. We introduce an iterative threshold algorithm (ITA) for allocating threads to cores in the case of a single application with symmetric threads, and extend it to a simple and efficient heuristic for the case of multiple applications. We compare the performance/power metric value achieved by ITA with constrained nonlinear optimization, a pattern search algorithm, and a genetic algorithm. ITA outperforms the best of these methods by 9% while consuming on average 0.01% and at most 2.5% of their computational effort.
Yaniv Ben-Itzhak, Israel Cidon, Avinoam Kolodny
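
The abstract does not give the metric's functional form; one common way to parameterize such a performance/power tradeoff (purely illustrative, not necessarily the paper's definition) is a weighted ratio such as

$$
M_\alpha(k) \;=\; \frac{\mathrm{Perf}(k)^{\alpha}}{\mathrm{Power}(k)^{1-\alpha}}, \qquad 0 \le \alpha \le 1,
$$

where $k$ denotes a candidate thread-to-core allocation; $\alpha = 1$ recovers pure performance maximization and $\alpha = 0$ pure power minimization.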

Memory Organization and Optimization

Multi-level Hardware Prefetching Using Low Complexity Delta Correlating Prediction Tables with Partial Matching
Abstract
This paper presents a low-complexity table-based approach to delta correlation prefetching. Our approach uses a table, indexed by the load address, which stores the latest deltas observed. By storing deltas rather than full miss addresses, considerable space is saved while pattern matching becomes easier. The delta history can predict repeating patterns with long periods by using delta correlation. In addition, we propose L1 hoisting, a technique for moving data from the L2 to the L1 using the same underlying table structure, and partial matching, which reduces the spatial resolution in the delta stream to expose more patterns.
We evaluate our prefetching technique using the simulator framework from the Data Prefetching Championship. This allows us to use the original code submitted to the contest to fairly evaluate several alternative prefetching techniques. Our prefetcher increases performance by 87% on average (6.6X max) on SPEC2006.
Marius Grannaes, Magnus Jahre, Lasse Natvig
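
A minimal sketch of the table organization described above, assuming a single already-warmed entry (simplified: the paper's full design adds partial matching, L1 hoisting, and the table lookup and replacement logic):

```c
#include <stdint.h>

#define NDELTAS 8   /* deltas retained per tracked load */
#define NPREF   4   /* max prefetch addresses issued per miss */

typedef struct {
    uint64_t pc;              /* load instruction address (table tag) */
    uint64_t last_addr;       /* most recent miss address */
    int64_t  delta[NDELTAS];  /* circular buffer of recent deltas */
    int      head;            /* next write position */
} dcpt_entry_t;

/* On a miss to `addr` by the tracked load: record the new delta, then
 * look for an earlier occurrence of the two most recent deltas and
 * replay what followed it as prefetch addresses. */
int dcpt_predict(dcpt_entry_t *e, uint64_t addr, uint64_t pref[NPREF]) {
    e->delta[e->head] = (int64_t)(addr - e->last_addr);
    e->last_addr = addr;
    e->head = (e->head + 1) % NDELTAS;

    int64_t d1 = e->delta[(e->head + NDELTAS - 2) % NDELTAS];
    int64_t d2 = e->delta[(e->head + NDELTAS - 1) % NDELTAS];

    int n = 0;
    /* Scan from the oldest pair; stop before the trivial self-match. */
    for (int i = 0; i < NDELTAS - 2 && n == 0; i++) {
        int a = (e->head + i) % NDELTAS;
        int b = (e->head + i + 1) % NDELTAS;
        if (e->delta[a] == d1 && e->delta[b] == d2) {
            uint64_t p = addr;
            for (int j = i + 2; j < NDELTAS && n < NPREF; j++) {
                p += (uint64_t)e->delta[(e->head + j) % NDELTAS];
                pref[n++] = p;   /* predicted future miss address */
            }
        }
    }
    return n;   /* number of prefetches to issue */
}
```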
Scalable Shared-Cache Management by Containing Thrashing Workloads
Abstract
Multi-core processors with shared last-level caches are vulnerable to performance inefficiencies and fairness issues when the cache is not carefully managed between the multiple cores. Cache partitioning is an effective method for isolating poorly-interacting threads from each other, but designing a mechanism with simple logic and low area overhead will be important for incorporating such schemes in future embedded multi-core processors. In this work, we identify that major performance problems only arise when one or more “thrashing” applications exist. We propose a simple yet effective Thrasher Caging (TC) cache management scheme that specifically targets these thrashing applications.
Yuejian Xie, Gabriel H. Loh
SRP: Symbiotic Resource Partitioning of the Memory Hierarchy in CMPs
Abstract
Many recent works in the context of Chip Multiprocessors (CMPs) have investigated the need for intelligent shared-cache partitioning, which is believed to reduce pressure on the off-chip bandwidth. Management of the off-chip memory bandwidth to improve system performance and/or mitigate the performance volatility of applications has itself received considerable attention. Coordinated resource management schemes treat the interactions between cache allocation and bandwidth management as a black box, which prevents these schemes from exploiting the intricate inter-relationships between the resource management strategies. In a multiprogrammed scenario, given the limited availability of on-chip cache, it is not feasible to entirely eliminate off-chip accesses. However, it is possible to mitigate the impact of the additional queueing delays associated with the memory controller by preventing multiple applications from exercising the off-chip bandwidth simultaneously. Therefore, from the point of view of improving system performance, it is important to have a symbiotic resource partitioning scheme in which the partitioning of each resource is based on feedback from the partitioning of the other.
Symbiotic resource partitioning (SRP), proposed in this paper, avoids scenarios in which multiple applications exercise the off-chip memory bandwidth simultaneously by appropriately controlling the cache partitioning. To control the cache partitioning, SRP employs an empirical model built around a metric, last-level cache misses per cycle, that represents an application's off-chip memory bandwidth demand; the model captures the impact of cache partitioning on bandwidth demand by expressing this metric as a function of the cache allocated to each application. The model is dynamically updated to account for the phase behavior of the applications. Moreover, SRP is an iterative approach in which each iteration consists of an update to the model, cache partitioning, and bandwidth partitioning, with feedback from bandwidth partitioning that updates the model. Extensive simulations with a full-system simulator and applications from the MiBench benchmark suite show that SRP leads to a significant overall improvement in system performance compared to state-of-the-art cache and bandwidth management schemes.
Shekhar Srikantaiah, Mahmut Kandemir
DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems
Abstract
Chip Multi-Processors (CMPs) commonly share hardware-controlled on-chip units that are unaware that memory requests are issued by independent processors. Consequently, the resources a process receives will vary depending on the behavior of the processes it is co-scheduled with. Resource allocation techniques can avoid this problem if they are provided with an accurate interference estimate. Our Dynamic Interference Estimation Framework (DIEF) provides this service by dynamically estimating the latency a process would experience with exclusive access to all hardware-controlled, shared resources. Since the total interference latency is the sum of the interference latency in each shared unit, the system designer can choose estimation techniques to achieve the desired accuracy/complexity trade-off. In this work, we provide high-accuracy estimation techniques for the on-chip interconnect, shared cache and memory bus. This DIEF implementation has an average relative estimate error between -0.4% and 4.7% and a standard deviation between 2.4% and 5.8%.
Magnus Jahre, Marius Grannaes, Lasse Natvig

Programming and Analysis of Accelerators

Tagged Procedure Calls (TPC): Efficient Runtime Support for Task-Based Parallelism on the Cell Processor
Abstract
Increasing the number of cores in modern CPUs is the main trend for improving system performance. A central challenge is the runtime support that multi-core systems ought to use for sustaining high performance and scalability without disproportionately increasing the effort required of the programmer. In this work we present Tagged Procedure Calls (TPC), a runtime system for supporting task-based programming models on architectures that require explicit data access specification by the programmer. We present the design and implementation of TPC for the Cell processor and examine how the runtime system can support task management functions using on-chip communication only. By minimizing off-chip transactions in the runtime, we achieve sub-microsecond task initiation latency and a minimum null task initiation/completion latency of 385 ns. We evaluate TPC with several kernels and applications, demonstrating that TPC achieves scalable on-chip execution of codes previously parallelized and optimized for shared-memory multiprocessors, can exploit additional fine-grain parallelism in codes previously parallelized at coarse levels of granularity, and performs competitively with existing task-based parallel programming frameworks that statically optimize data layout and task placement.
George Tzenakis, Konstantinos Kapelonis, Michail Alvanos, Konstantinos Koukos, Dimitrios S. Nikolopoulos, Angelos Bilas
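
The abstract does not show the programming interface, but a task-based model with explicit data access specification typically looks like the hypothetical sketch below (all tpc_* names and signatures are invented for illustration, not the TPC API):

```c
#include <stddef.h>

typedef enum { TPC_IN, TPC_OUT, TPC_INOUT } tpc_access_t;

/* Hypothetical runtime entry points (invented names). Each argument is
 * passed as a (pointer, size, access-mode) triple so the runtime knows
 * exactly what to DMA to and from an accelerator's local store. */
extern int  tpc_call(void (*task)(void **args), int nargs, ...);
extern void tpc_wait_all(void);

#define CHUNK 256

/* The task body sees its arguments already staged in local memory. */
void vec_scale(void **args) {
    float *v = args[0];
    float  f = *(float *)args[1];
    for (int i = 0; i < CHUNK; i++)
        v[i] *= f;
}

void run(float *data, size_t n, float factor) {
    /* One task per chunk; the runtime copies each TPC_INOUT argument in
     * before the task starts and back out when it completes. */
    for (size_t off = 0; off + CHUNK <= n; off += CHUNK)
        tpc_call(vec_scale, 2,
                 data + off, CHUNK * sizeof(float), TPC_INOUT,
                 &factor,    sizeof(float),         TPC_IN);
    tpc_wait_all();
}
```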
Analysis of Task Offloading for Accelerators
Abstract
As an answer to the forthcoming heterogeneous multicore and accelerator-based architectures, we have proposed syntactic extensions to C in the form of pragmas, based on OpenMP, that make it easier for programmers to offload parts of their applications to auxiliary processors. Offloaded tasks can be made more profitable by using a simple blocking strategy, and the runtime system is used to better overlap computation and communication while moving data to and from the accelerators.
To prove the feasibility and usefulness of our proposal, we have targeted the IBM Cell architecture. The performance of the whole system has been evaluated using the HPCC STREAM Triad and several NAS benchmarks. We present their evaluation and a detailed performance breakdown at the level of parallel regions, and classify the parallel regions according to their suitability for execution on accelerators. Overall, our performance compares favorably with the results obtained from the IBM compiler for the Cell processor.
Roger Ferrer, Vicenç Beltran, Marc Gonzàlez, Xavier Martorell, Eduard Ayguadé
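
A hedged sketch of the pragma-based offload style described above; the directive spelling is invented for illustration (the paper's actual OpenMP-based extensions are not reproduced in the abstract):

```c
#define N 4096

void saxpy(float a, const float *x, float *y) {
    /* Hypothetical directive: execute the loop as a task on an
     * accelerator, copying x and y in and y back out; a blocking
     * strategy would split the N-element range into smaller chunks. */
    #pragma accel task copy_in(x[0:N], y[0:N], a) copy_out(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}
```

Since unknown pragmas are ignored by conforming C compilers, such annotated code still builds and runs sequentially on a host without the offloading toolchain.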
Offload – Automating Code Migration to Heterogeneous Multicore Systems
Abstract
We present Offload, a programming model for offloading parts of a C++ application to run on accelerator cores in a heterogeneous multicore system. Code to be offloaded is enclosed in an offload scope; all functions called indirectly from an offload scope are compiled for the accelerator cores. Data defined inside/outside an offload scope resides in accelerator/host memory respectively, and code to move data between memory spaces is generated automatically by the compiler. This is achieved by distinguishing between host and accelerator pointers at the type level, and compiling multiple versions of functions based on pointer parameter configurations using automatic call-graph duplication. We discuss solutions to several challenging issues related to call-graph duplication, and present an implementation of Offload for the Cell BE processor, evaluated using a number of benchmarks.
Pete Cooper, Uwe Dolinsky, Alastair F. Donaldson, Andrew Richards, Colin Riley, George Russell
Computer Generation of Efficient Software Viterbi Decoders
Abstract
This paper presents a program generator for fast software Viterbi decoders for arbitrary convolutional codes. The input to the generator is a specification of the code and a single-instruction multiple-data (SIMD) vector length. The output is an optimized C implementation of the decoder that uses explicit Intel SSE vector instructions. At the heart of the generator is a small domain-specific language called VL to express the structure of the forward pass. Vectorization is done by rewriting VL expressions, which a compiler then translates into actual code in addition to performing further optimizations specific to the vector instruction set. Benchmarks show that the generated decoders match the performance of available expert hand-tuned implementations, while spanning the entire space of convolutional codes. An online interface to the generator is provided at www.spiral.net.
Frédéric de Mesmay, Srinivas Chellappa, Franz Franchetti, Markus Püschel
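
To see what the generator's forward pass vectorizes, here is a scalar reference version of the add-compare-select (ACS) butterfly for a standard shift-register convolutional code; branch-metric computation and metric renormalization are omitted for brevity, and this is a generic textbook kernel rather than the generated code.

```c
#include <stdint.h>

#define K       7                  /* constraint length */
#define NSTATES (1 << (K - 1))     /* 64 trellis states */

/* One trellis step. old_m/new_m are path metrics; bm0[s] and bm1[s] are
 * the branch metrics for input bit 0/1 leaving state s, computed
 * elsewhere from the received symbol and the generator polynomials. */
void viterbi_acs(const uint16_t *old_m, uint16_t *new_m,
                 const uint16_t *bm0, const uint16_t *bm1,
                 uint8_t *decision) {
    for (int s = 0; s < NSTATES / 2; s++) {
        int p0 = s, p1 = s + NSTATES / 2;   /* the two predecessor states */
        /* Destination state 2s is reached from p0 or p1 with input bit 0. */
        uint16_t a = old_m[p0] + bm0[p0];
        uint16_t b = old_m[p1] + bm0[p1];
        new_m[2 * s]    = (a <= b) ? a : b;
        decision[2 * s] = (a <= b) ? 0 : 1;   /* which predecessor won */
        /* Destination state 2s+1 is reached with input bit 1. */
        uint16_t c = old_m[p0] + bm1[p0];
        uint16_t d = old_m[p1] + bm1[p1];
        new_m[2 * s + 1]    = (c <= d) ? c : d;
        decision[2 * s + 1] = (c <= d) ? 0 : 1;
    }
}
```

The independence of the NSTATES/2 butterflies is what makes this loop amenable to the SIMD rewriting the paper performs in VL.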
Backmatter
Metadata
Title
High Performance Embedded Architectures and Compilers
Edited by
Yale N. Patt
Pierfrancesco Foglia
Evelyn Duesterwald
Paolo Faraboschi
Xavier Martorell
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-11515-8
Print ISBN
978-3-642-11514-1
DOI
https://doi.org/10.1007/978-3-642-11515-8