Advances in Computer Systems Architecture
11th Asia-Pacific Conference, ACSAC 2006, Shanghai, China, September 6-8, 2006. Proceedings
- 2006
- Book
- Edited by
- Chris Jesshope
- Colin Egan
- Book Series
- Lecture Notes in Computer Science
- Publisher
- Springer Berlin Heidelberg
About This Book
On behalf of all of the people involved in the program selection, the program committee members as well as numerous other reviewers, we are both relieved and pleased to present you with the proceedings of the 2006 Asia-Pacific Computer Systems Architecture Conference (ACSAC 2006), hosted in Shanghai on September 6–8, 2006. This is the 11th in a series of conferences that started life in Australia as the computer architecture component of the Australian Computer Science Week. In 1999 it ventured away from its roots for the first time, and the fourth Australasian Computer Architecture Conference was held in the beautiful City of Sails (Auckland, New Zealand). Perhaps it was the lack of any other computer architecture conference in Asia, or simply the attraction of traveling to the Southern Hemisphere, but the conference became increasingly international over the subsequent three years and also changed its name to include Computer Systems Architecture, better reflecting the scope of the conference, which embraces both architectural and systems issues. In 2003 the conference again ventured offshore to reflect its constituency and has since been held in Japan, in the beautiful city of Aizu-Wakamatsu, followed by Beijing and Singapore. This year it returns to China, and next year it will move to Korea for the first time, where it will be organized by Korea University.
Table of Contents
Frontmatter
The Era of Multi-core Chips - A Fresh Look on Software Challenges
Guang R. Gao
Abstract: In the past few months, the world has witnessed the impressive pace at which microprocessor chip vendors are switching to multi-core chip technology. However, this switch is presenting steep software challenges, both in the migration of application software and in the adaptation of system software. In this talk, we discuss the challenges as well as the opportunities facing software technology in the era of the emerging multi-core chips. We review the software failures and lessons learned during the boom years of parallel computing, in the 1980s and early 1990s, and analyze the issues and challenges today as we once more try to exploit large-scale parallelism on multi-core chips and systems. We then predict the major technology innovations that will be needed to ensure success this time. The talk begins with a discussion based on our own experience over the past decade working with fine-grain multithreading, spanning execution/architecture models, system software technology, and relevant application software studies. We then outline our recent work on software issues for next-generation multi-core chip architectures. We present a case study on mapping OpenMP onto two representative classes of future multi-core architecture models, and discuss several fundamental performance issues facing system software designers.
Streaming Networks for Coordinating Data-Parallel Programs (Position Statement)
Alex Shafarenko
Abstract: A new coordination language for distributed data-parallel programs, called SNet, is presented. The intention of SNet is to introduce advanced structuring techniques into a coordination language: stream processing and various forms of subtyping. The talk will present the organisation of SNet and its major type-inference algorithms, and will briefly discuss the current state of the implementation and possible applications.
Implementations of Square-Root and Exponential Functions for Large FPGAs
Mariusz Bajger, Amos R. Omondi
Abstract: This paper discusses low-error, high-speed evaluation of two elementary functions: square root (which is required by the IEEE-754 standard on computer arithmetic) and exponential (which is common in scientific calculations). The basis of the proposed implementations is piecewise-linear interpolation, but with the constants chosen in a way that minimizes relative error. We show that by placing certain constraints on the errors at three points within each interpolation interval, relative errors are greatly reduced. The implementation targets are large FPGAs that have built-in multipliers, adders, and distributed memory.
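The error-balancing idea behind this approach can be illustrated in software. The sketch below is plain Python, not the paper's FPGA design; the segment count and the rebalancing rule (centering each chord's relative-error band on zero) are a simplification of the paper's three-point constraints. It builds a piecewise-linear table for square root on [1, 4):

```python
import math

def build_table(n_seg, lo=1.0, hi=4.0):
    """Per-segment (a, b) coefficients for approximating sqrt(x) by a*x + b."""
    step = (hi - lo) / n_seg
    table = []
    for i in range(n_seg):
        x0, x1 = lo + i * step, lo + (i + 1) * step
        # Chord through the endpoints; sqrt is concave, so the chord lies below it.
        a = (math.sqrt(x1) - math.sqrt(x0)) / (x1 - x0)
        b = math.sqrt(x0) - a * x0
        # The relative error (a*x + b)/sqrt(x) - 1 is most negative at x = b/a.
        xs = b / a
        e_min = (a * xs + b) / math.sqrt(xs) - 1.0
        # Rescale so the error band [e_min, 0] is roughly centred on zero,
        # halving the worst-case relative error of the plain chord.
        c = 1.0 / (1.0 + e_min / 2.0)
        table.append((a * c, b * c))
    return table

def pl_sqrt(x, table, lo=1.0, hi=4.0):
    """Evaluate the piecewise-linear approximation: one multiply, one add."""
    step = (hi - lo) / len(table)
    a, b = table[min(int((x - lo) / step), len(table) - 1)]
    return a * x + b
```

With eight segments the balanced table keeps the relative error within roughly ±0.2% on [1, 4); a hardware version would index the table with leading mantissa bits and use the FPGA's embedded multipliers and adders.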
Using Branch Prediction Information for Near-Optimal I-Cache Leakage
Sung Woo Chung, Kevin Skadron
Abstract: This paper describes a new on-demand wakeup prediction policy for instruction-cache leakage control that achieves better leakage savings than prior policies while avoiding their performance overheads. The proposed policy reduces leakage energy by more than 92% with less than 0.3% performance overhead on average. The key to this new on-demand policy is the use of branch prediction information for wakeup prediction. Inserting an extra wakeup stage between branch prediction and fetch allows the branch predictor to also serve as a wakeup predictor without any additional hardware; the extra stage thus hides the wakeup penalty without affecting branch prediction accuracy. Though extra pipeline stages typically add to the branch misprediction penalty, in this case the extra wakeup stage on the normal fetch path can be overlapped with misprediction recovery. With such consistently accurate wakeup prediction, all cache lines except the next expected cache line are kept in leakage-saving mode, minimizing leakage energy.
Scientific Computing Applications on the Imagine Stream Processor
Jing Du, Xuejun Yang, Guibin Wang, Fujiang Ao
Abstract: The Imagine processor is designed to address the processor-memory gap through streaming technology. Good performance has been demonstrated on Imagine for most media applications; however, whether scientific computing applications are suited to Imagine remains an open question. In this paper, we study some key issues in mapping scientific computing applications to Imagine, and present experimental results for some representative scientific computing applications on the ISIM simulator for Imagine. By evaluating the experimental results, we isolate the set of scientific computing application characteristics well suited to the Imagine architecture, analyze the performance potential of scientific computing applications on Imagine compared with a conventional processor, and explore optimizations for scientific stream programs.
Enhancing Last-Level Cache Performance by Block Bypassing and Early Miss Determination
Haakon Dybdahl, Per Stenström
Abstract: While bypassing algorithms have been applied to first-level caches, we study for the first time their effectiveness for last-level caches, for which miss penalties are significantly higher and where algorithm complexity is not constrained by the speed of the pipeline. Our algorithm monitors the reuse behavior of blocks that are touched by delinquent loads and re-classifies them on the fly. Blocks classified as bypassed are installed only in the level-1 cache. We leverage the algorithm to send out miss requests early for loads expected to access blocks classified as bypassed; such requests are sent to memory directly, without tag checks at intermediate levels of the cache hierarchy. Overall, we find that we can robustly reduce the miss rate by 23% and improve IPC by 14% on average for memory-bound SPEC2000 applications, without degrading the performance of the other SPEC2000 applications.
A Study of the Performance Potential for Dynamic Instruction Hints Selection
Rao Fu, Jiwei Lu, Antonia Zhai, Wei-Chung Hsu
Abstract: Instruction hints have become an important way to communicate compile-time information to the hardware. They can be generated by the compiler and the post-link optimizer to reduce cache misses, improve branch prediction, and minimize other performance bottlenecks. This paper discusses the different instruction hints available on modern processor architectures and shows their potential performance impact on many benchmark programs. Some hints can be effectively selected at compile time with profile feedback. However, since the same program executable can behave differently on different inputs, and performance bottlenecks may change across micro-architectures, significant performance opportunities can be exploited by selecting instruction hints dynamically.
Reorganizing UNIX for Reliability
Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, Andrew S. Tanenbaum
Abstract: In this paper, we discuss the architecture of a modular UNIX-compatible operating system, MINIX 3, that provides reliability beyond that of most other systems. With nearly the entire operating system running as a set of user-mode servers and drivers atop a minimal kernel, the system is fully compartmentalized. By moving most of the code to unprivileged user-mode processes and restricting the powers of each one, we gain proper fault isolation and limit the damage bugs can do. Moreover, the system has been designed to survive and automatically recover from failures in critical modules, such as device drivers, transparently to applications and without user intervention. We used this new design to develop a highly reliable, open-source, POSIX-conformant member of the UNIX family. The resulting system is freely available and has been downloaded over 75,000 times since its release.
Critical-Task Anticipation Scheduling Algorithm for Heterogeneous and Grid Computing
Ching-Hsien Hsu, Ming-Yuan Own, Kuan-Ching Li
Abstract: The problem of scheduling a weighted directed acyclic graph (DAG) onto a set of heterogeneous processors to minimize completion time has recently been studied. The NP-completeness of the problem has led researchers to propose various heuristic algorithms. In this paper, we present an efficient Critical-task Anticipation (CA) scheduling algorithm for heterogeneous computing systems. The CA scheduling algorithm introduces a new task prioritizing scheme, based on the urgency and importance of tasks, that obtains better schedule lengths than the Heterogeneous Earliest Finish Time (HEFT) algorithm. To evaluate the performance of the proposed algorithm, we developed a simulator containing a parametric graph generator for producing weighted directed acyclic graphs with various characteristics, and implemented the CA algorithm alongside the HEFT scheduling algorithm on it. The CA algorithm is shown to be effective in terms of speedup and easy to implement.
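For context, the HEFT baseline that CA is compared against prioritizes tasks by their upward rank, computed bottom-up over the DAG. A minimal sketch of that prioritization follows (task names and costs are illustrative; the CA algorithm itself replaces this with an urgency-based scheme not shown here):

```python
def upward_rank(succ, w, c):
    """HEFT-style prioritization for a weighted DAG:
    rank_u(t) = w[t] + max over successors s of (c[(t, s)] + rank_u(s)).
    succ: task -> list of successor tasks; w: mean computation cost per task;
    c: mean communication cost per edge. Tasks are scheduled in decreasing rank."""
    memo = {}
    def rank(t):
        if t not in memo:
            memo[t] = w[t] + max(
                (c[(t, s)] + rank(s) for s in succ.get(t, ())), default=0.0)
        return memo[t]
    for t in w:
        rank(t)
    return memo
```

On a four-task diamond DAG (A feeding B and C, both feeding D), the exit task D gets the lowest rank and the entry task A the highest, so A is considered first when assigning tasks to processors.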
Processor Directed Dynamic Page Policy
Dandan Huan, Zusong Li, Weiwu Hu, Zhiyong Liu
Abstract: The widening gap between processor and memory performance makes memory-subsystem design an increasingly important part of computer design. We propose a processor-directed dynamic page policy, derived from an investigation of the memory access patterns of applications. The policy changes the page mode adaptively according to directions from the processor, combining the advantages of the close-page and open-page policies. Whereas the direction information of existing dynamic page policies is based on the history of memory access behavior, the direction information of the processor-directed dynamic page policy is based on future memory access behavior and is therefore more accurate. Furthermore, the processor's memory access requests are scheduled according to the page policy to increase the page hit rate and reduce the page-conflict miss rate. The performance of the SPEC CPU2000 benchmarks improves significantly: IPC improves by 7.1%, 5.9%, and 3.4% on average compared with the close-page, open-page, and conventional dynamic page policies, respectively.
Static WCET Analysis Based Compiler-Directed DVS Energy Optimization in Real-Time Applications
Yi Huizhan, Chen Juan, Yang Xuejun
Abstract: Compiler-directed dynamic voltage scaling (DVS) is an effective low-power technique for real-time applications. With this technique, the compiler inserts voltage-scaling points into a real-time application, and the supply voltage and clock frequency are adjusted at each scaling point according to the relationship between the remaining time and the remaining workload. In this paper, building on the WCET (worst-case execution time) analysis tool HEPTANE and the performance/power simulator Sim-Panalyzer, we present a DVS-enabled simulation environment, RTLPower (Real-Time Low Power), which integrates static WCET estimation, performance/power simulation, automatic insertion of DVS code into a real-time application, and profile-guided energy optimization. Through simulations of several benchmark applications, we show that the DVS technique and the profile-guided optimization technique significantly reduce energy consumption.
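The scaling-point decision described above, matching frequency to the remaining worst-case work and the remaining time, can be sketched as follows. This is a hypothetical helper, not code from RTLPower; a real DVS runtime must also account for voltage-transition latency and the discrete voltage levels paired with each frequency:

```python
def select_frequency(remaining_wcet_cycles, deadline, now, freqs):
    """Pick the lowest available frequency that still completes the remaining
    worst-case work before the deadline; fall back to the maximum if none can.
    Times are in seconds, frequencies in Hz (illustrative units)."""
    slack = deadline - now
    if slack <= 0.0:
        return max(freqs)
    required = remaining_wcet_cycles / slack  # cycles per second needed
    feasible = [f for f in sorted(freqs) if f >= required]
    return feasible[0] if feasible else max(freqs)
```

Lowering frequency lets the supply voltage be lowered too, and since dynamic power scales roughly with f·V², running just fast enough to meet the deadline is where the energy savings come from.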
A Study on Transformation of Self-similar Processes with Arbitrary Marginal Distributions
Hae-Duck J. Jeong, Jong-Suk R. Lee
Abstract: Stochastic discrete-event simulation studies of communication networks often require a mechanism to transform self-similar processes with normal marginal distributions into self-similar processes with arbitrary marginal distributions. The problem of generating a self-similar process with a given marginal distribution and autocorrelation structure is difficult and has not been fully solved. The results presented in this paper provide clear experimental evidence that the autocorrelation function of the input process is not preserved in the output process generated by the inverse cumulative distribution function (ICDF) transformation when the output process has infinite variance. On the other hand, when the output marginal distributions (exponential, gamma, Pareto with α = 20.0, uniform, and Weibull) have finite variances and the ICDF transformation is applied to long-range dependent self-similar processes with normal marginal distributions, the autocorrelation functions of the input process are preserved.
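The ICDF transformation the paper evaluates is simple to state: map each normal sample through the standard normal CDF to a uniform value, then through the target distribution's inverse CDF. A minimal sketch with an exponential target (a finite-variance case, where the paper finds autocorrelation is preserved); the rate parameter is illustrative:

```python
import math
import random

def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def icdf_transform(xs, inv_cdf):
    """u = Phi(x) is uniform on (0, 1); inv_cdf(u) then has the target
    marginal distribution, while the ordering of the series (and hence
    its dependence structure) is untouched."""
    return [inv_cdf(normal_cdf(x)) for x in xs]

# Exponential target with rate lam: F^{-1}(u) = -ln(1 - u) / lam.
lam = 2.0
exp_inv = lambda u: -math.log(1.0 - u) / lam
```

Applied to a long-range dependent Gaussian series (e.g. fractional Gaussian noise), the output has exponential marginals with mean 1/lam; the paper's finding concerns whether the input's autocorrelation function survives this monotone marginal change.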
μTC – An Intermediate Language for Programming Chip Multiprocessors
Chris Jesshope
Abstract: μTC is a language that has been designed for programming chip multiprocessors. More specifically, it has been developed to program chip multiprocessors based on arrays of microthreaded microprocessors, as these processors directly implement the concepts introduced in the language. It is more general than that, however, and is being used in other projects as an interface defining dynamic concurrency. Ideally, a program written in μTC is a dynamic, concurrent control structure over small sequences of code, which in the limit could be a few instructions each. μTC is being used as an intermediate language to capture concurrency from data-parallel languages such as Single-Assignment C, from parallelising compilers for sequential languages such as C, and from concurrent composition languages such as SNet. μTC's advantage over other approaches is that it allows an abstract representation of maximal concurrency in a schedule-independent form. Both SNet and μTC are being used in a European project called AETHER to support all aspects of self-adaptive computation.
Functional Unit Chaining: A Runtime Adaptive Architecture for Reducing Bypass Delays
Lih Wen Koh, Oliver Diessel
Abstract: Bypass delays are expected to grow beyond 1 ns as technology scales. These delays necessitate pipelining of bypass paths at processor frequencies above 1 GHz and thus affect the performance of sequential code sequences. We propose dealing with these delays through a dynamic functional-unit chaining approach. We study the performance benefits of a superscalar, out-of-order processor augmented with a two-by-two array of ALUs interconnected by a fast, partial bypass network. An online profiler guides the automatic configuration of the network to accelerate specific patterns of dependent instructions. A detailed study of benchmark simulations demonstrates that these first steps towards mapping binaries to a small coarse-grained array at runtime can improve instruction throughput by over 18% and 25% when the microarchitecture includes bypass delays of one and two cycles, respectively.
Trace-Based Data Cache Leakage Reduction at Link Time
Lian Li, Jingling Xue
Abstract: This paper investigates the benefits of conducting leakage-energy optimisations for data caches at link time for embedded applications. We introduce an improved algorithm for identifying and constructing the traces in a binary program and present a trace-based optimisation for reducing leakage energy in data caches. Our experimental results using the Mediabench benchmarks show that good leakage-energy savings can be achieved at the cost of small performance and code-size penalties. Furthermore, by varying the granularity of the optimisation regions, a tunable parameter, embedded application programmers can trade off energy savings against these associated costs.
Parallelizing User-Defined and Implicit Reductions Globally on Multiprocessors
Shih-wei Liao
Abstract: Multiprocessors are becoming prevalent in the PC world: major CPU vendors such as Intel and Advanced Micro Devices have migrated to multicore processors. However, this also means that computers will run an application at full speed only if that application is parallelized. To take advantage of more than a fraction of the compute resources on a die, we develop a compiler that parallelizes a common and powerful programming paradigm, namely reduction. Our goal is to exploit the full potential of reductions for efficient execution of applications on multiprocessors, including multicores. Reduction operations are common in streaming applications, financial computing, and the HPC domain; in fact, 9% of all MPI invocations in the NAS Parallel Benchmarks are reduction library calls. Recognizing implicit reductions in Fortran and C is important for parallelization on multiprocessors, and recent languages such as the Brook streaming language and Chapel allow users to specify reduction functions. Our compiler provides a unified framework for processing both implicit and user-defined reductions; both types are propagated and analyzed interprocedurally. Our global algorithm can enhance the scope of user-defined reductions and parallelize coarser-grained reductions. Thanks to this powerful algorithm and representation, we obtain an average speedup of 3 on 4 processors; the speedup is only 1.7 if only intraprocedural scalar reductions are parallelized.
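The chunk-and-combine transformation such a compiler applies to a recognized reduction can be sketched at the source level. This is a hand-written Python analogue, not the paper's compiler output; its correctness relies on the reduction operator being associative with the given identity:

```python
from functools import reduce
from concurrent.futures import ThreadPoolExecutor

def parallel_reduce(op, data, identity, workers=4):
    """Reduce each chunk independently, then combine the partial results.
    This mirrors, in miniature, how a parallelizing compiler splits a
    recognized reduction across processors and merges per-processor partials."""
    if not data:
        return identity
    chunk = (len(data) + workers - 1) // workers
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        partials = list(ex.map(lambda p: reduce(op, p, identity), pieces))
    return reduce(op, partials, identity)
```

The same shape covers user-defined reductions: any associative operator (max, struct-wise sum, argmin) slots in for `op`, which is why recognizing such operators, including interprocedurally, widens what the compiler can parallelize.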
Overload Protection for Commodity Network Appliances
Luke Macpherson
Abstract: Performance degradation under overload is a well-known problem in networked systems. While this problem has been explored extensively in the context of TCP-based web servers, other applications have unique requirements that need to be addressed. In existing admission control systems, the cost of admission control increases with the load on the system. This is acceptable for responsive TCP-based loads, but it is not effective in preventing overload under unresponsive workloads. We present a solution in which the admission control cost is a function of the traffic admitted to the system, allowing our approach to maintain peak throughput under overload. We have implemented our approach in a real system and evaluated its effectiveness in preventing overload for a number of demanding network workloads. We find that our solution is effective in eliminating performance degradation under overload, while having the desirable property of being simple to implement in commodity systems.
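A token bucket with lazy refill is one way to realize the property described, since a rejected arrival costs only a constant-time refill computation and a comparison, with no per-arrival work that grows with offered load. This is an illustrative stand-in, not the paper's mechanism:

```python
import time

class AdmissionController:
    """Token-bucket admission control: tokens are consumed only for admitted
    packets, so the total admission-control work tracks admitted traffic
    rather than offered load. The injectable clock eases testing."""
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens, self.last = float(burst), clock()

    def admit(self):
        now = self.clock()
        # Lazy refill: no timers or background work, just arithmetic on arrival.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Under sustained overload the controller keeps admitting at `rate` packets per second regardless of how fast excess traffic arrives, which is the behaviour needed to hold peak throughput.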
- Electronic ISBN
- 978-3-540-40058-5
- Print ISBN
- 978-3-540-40056-1
- DOI
- https://doi.org/10.1007/11859802