
2009 | Book

Architecture of Computing Systems – ARCS 2009

22nd International Conference, Delft, The Netherlands, March 10-13, 2009. Proceedings

Edited by: Mladen Berekovic, Christian Müller-Schloer, Christian Hochberger, Stephan Wong

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 22nd International Conference on Architecture of Computing Systems, ARCS 2009, held in Delft, The Netherlands, in March 2009. The 21 revised full papers presented together with 3 keynote papers were carefully reviewed and selected from 57 submissions. This year's special focus was on energy awareness. The papers are organized in topical sections on compilation technologies, reconfigurable hardware and applications, massive parallel architectures, organic computing, memory architectures, energy awareness, Java processing, and chip-level multiprocessing.

Table of Contents

Frontmatter

Keynotes

Life on the Treadmill
Abstract
Silicon technology evolution over the last four decades has yielded an exponential increase in integration densities with steady improvements of performance and power consumption at each technology generation. This steady progress has created a sense of entitlement for the riches that future process generations would bring. Today, however, classical process scaling seems to be dead and living up to technology expectations requires continuous innovation at many levels, which comes at steadily progressing implementation and design costs. Solutions to problems need to cut across layers of abstractions and require coordination between software, architecture and circuit features.
Krisztián Flautner
Key Microarchitectural Innovations for Future Microprocessors
Abstract
Microprocessors have experienced tremendous performance improvements generation after generation since their inception. Moore’s law has fueled this evolution and will continue to do so in forthcoming generations. However, future microprocessors are facing new challenges that require innovative approaches to keep delivering improvements comparable to those that we have enjoyed so far. Power dissipation is a main challenge in all segments, from ultra-mobile to high-end servers. Another important challenge is the fact that we have relied on instruction-level parallelism (ILP) as a main lever to improve performance, but after more than 30 years of enhancing ILP techniques we are approaching a point of diminishing returns. In this talk we will discuss these challenges and propose some solutions to tackle them. Multicore is a recently adopted approach in most microprocessors that offers significant advantages in terms of power and exploits a new source of parallelism: thread-level parallelism. In this talk we will discuss the benefits of multicore and also show its limitations. We will also describe some other technologies that we believe are needed to complement the benefits of multicore and together offer a foundation for future microprocessors.
Antonio González
The Challenges of Multicore: Information and Mis-Information
Abstract
Now that we have broken the threshold of one billion transistors on a chip and multi-core has become a reality, a lot of buzz has resulted – from how/why we got here, to what is important, to how we should determine how to effectively use multicore. In this talk, I will examine a number of these new “conventional wisdom” nuggets of information to try to see whether they add value or get in the way. For example: what can we expect multicore to do about reducing power consumption? is ILP dead? should sample benchmarks drive future designs? is hardware sequential? should multicore structures be simple? is abstraction a fundamental good? Hopefully, our examinations will help shed some light on where we go from here.
Yale Patt

Compilation Technologies

Extracting Coarse-Grained Pipelined Parallelism Out of Sequential Applications for Parallel Processor Arrays
Abstract
We present development and runtime support for building application specific data processing pipelines out of sequential code, and for executing them on a general purpose platform that features a reconfigurable Parallel Processor Array (PPA). Our approach is to let the programmer annotate the source of the application to indicate the desired pipeline stages and associated data flow, with little code restructuring. A pre-processor is then used to transform the annotated program into different code segments according to the indicated pipeline structure, generate the corresponding executable code, and produce a bundled application package containing all executables and deployment information for the target platform. There are special mechanisms for setting up the application-specific pipeline structure on the PPA and achieving integrated execution in the context of a general-purpose operating system, enabling the pipelined application to access the usual system peripherals and run concurrently with other conventional programs. To verify our approach, we have built a prototype system using soft processor arrays on an embedded FPGA platform, and transformed a well-known application into a pipelined version that executes successfully on our prototype.
Dimitris Syrivelis, Spyros Lalis
Parallelization Approaches for Hardware Accelerators – Loop Unrolling Versus Loop Partitioning
Abstract
State-of-the-art behavioral synthesis tools barely offer high-level transformations for achieving highly parallelized implementations. If any, they apply loop unrolling to obtain a higher throughput. In this paper, we employ the PARO behavioral synthesis tool, which has the unique ability to perform both loop unrolling and loop partitioning. Loop unrolling replicates the loop kernel and exposes the parallelism for hardware implementation, whereas partitioning tiles the loop program onto a regular array of tightly coupled processing elements. Using the same design tool for both variants enables, for the first time, a quantitative evaluation of the two approaches for reconfigurable architectures with the help of computationally intensive algorithms selected from different benchmarks. Superlinear speedups in terms of throughput are accomplished for the processor array approach. In addition, area and power costs are reduced.
Frank Hannig, Hritam Dutta, Jürgen Teich
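The two transformations contrasted above can be illustrated on a toy elementwise loop. The following Python sketch is purely a software analogue of what the synthesis tool does in hardware — the replicated statements in the unrolled version stand in for parallel datapath operators, and the tiles in the partitioned version stand in for processing elements of the array. The kernel, the unroll factor of 4, and the PE count are illustrative assumptions, not output of PARO.

```python
def original(a, b):
    # Reference loop: elementwise multiply.
    return [a[i] * b[i] for i in range(len(a))]

def unrolled(a, b, factor=4):
    """Loop unrolling: replicate the kernel `factor` times per iteration,
    exposing `factor` independent multiplies (candidate parallel operators)."""
    n = len(a)
    out = [0] * n
    i = 0
    while i + factor <= n:
        # These four statements could map to four parallel multipliers.
        out[i]     = a[i]     * b[i]
        out[i + 1] = a[i + 1] * b[i + 1]
        out[i + 2] = a[i + 2] * b[i + 2]
        out[i + 3] = a[i + 3] * b[i + 3]
        i += factor
    while i < n:  # epilogue for leftover iterations
        out[i] = a[i] * b[i]
        i += 1
    return out

def partitioned(a, b, num_pes=4):
    """Loop partitioning: tile the iteration space into contiguous blocks,
    one per processing element (PE); each PE runs the unchanged kernel."""
    n = len(a)
    out = [0] * n
    tile = (n + num_pes - 1) // num_pes
    for pe in range(num_pes):  # conceptually, these PEs run concurrently
        for i in range(pe * tile, min((pe + 1) * tile, n)):
            out[i] = a[i] * b[i]
    return out
```

Both transformations compute the same result; they differ in whether parallelism comes from replicating operators inside one deep pipeline or from distributing iterations across an array of small, tightly coupled units.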
Evaluating Sampling Based Hotspot Detection
Abstract
In sampling-based hotspot detection, performance engineers sample the running program periodically and record the Instruction Pointer (IP) address at each sample. Empirically, frequently sampled IP addresses are regarded as the hotspots of the program. How well the sampled hotspot IP addresses match the real hotspots of the program is seldom studied. In this paper, we use an instrumentation tool to count how many times the sampled hotspot IP addresses are actually executed, and compare the real execution profile with the sampled one to see how well they match. We define the normalized root mean square error, the sample coverage, and the order deviation to quantify the difference between the real execution and the sampled results. Experiments on the SPEC CPU 2006 benchmarks with various sampling periods are performed to verify the proposed evaluation measurements. Intuitively, sampling accuracy decreases as the sampling period increases. The experimental results reveal that the order deviation reflects this intuitive relation between sampling accuracy and sampling period better than the normalized root mean square error and the sample coverage.
Qiang Wu, Oskar Mencer
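The abstract names three evaluation metrics but does not give their formulas. The Python sketch below shows one plausible formulation of each, computed over execution profiles that map IP addresses to counts; the exact definitions in the paper may differ, so treat these as hypothetical reconstructions.

```python
import math

def normalized_rmse(real, sampled):
    """Root mean square error between two execution profiles
    (dicts mapping IP address -> count), each normalized to
    relative frequencies so total counts do not matter."""
    ips = set(real) | set(sampled)
    r_total = sum(real.values()) or 1
    s_total = sum(sampled.values()) or 1
    sq = sum((real.get(ip, 0) / r_total - sampled.get(ip, 0) / s_total) ** 2
             for ip in ips)
    return math.sqrt(sq / len(ips))

def sample_coverage(real, sampled):
    """Fraction of the real execution weight whose IPs were
    seen at least once in the sampled profile."""
    covered = sum(c for ip, c in real.items() if ip in sampled)
    return covered / (sum(real.values()) or 1)

def order_deviation(real, sampled):
    """Average displacement of each commonly seen IP's hotness rank
    between the real and the sampled profile (one possible reading
    of 'order deviation')."""
    rank_r = {ip: i for i, (ip, _) in
              enumerate(sorted(real.items(), key=lambda kv: -kv[1]))}
    rank_s = {ip: i for i, (ip, _) in
              enumerate(sorted(sampled.items(), key=lambda kv: -kv[1]))}
    common = set(rank_r) & set(rank_s)
    if not common:
        return float('inf')
    return sum(abs(rank_r[ip] - rank_s[ip]) for ip in common) / len(common)
```

For example, a sampled profile that sees the two hottest IPs in the correct order has an order deviation of 0 even if it misses cooler IPs, which the coverage metric then penalizes instead.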

Reconfigurable Hardware and Applications

A Reconfigurable Bloom Filter Architecture for BLASTN
Abstract
Efficient seed-based filtration methods exist for scanning genomic sequence databases. However, current solutions require a significant scan time on traditional computer architectures. These scan time requirements are likely to become even more severe due to the rapid growth in the size of databases. In this paper, we present a new approach to genomic sequence database scanning using reconfigurable field-programmable gate array (FPGA)-based hardware. To derive an efficient mapping onto this type of architecture, we propose a reconfigurable Bloom filter architecture. Our experimental results show that the FPGA implementation achieves an order of magnitude speedup compared to the NCBI BLASTN software running on a general purpose computer.
Yupeng Chen, Bertil Schmidt, Douglas L. Maskell
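As background to the approach above: a Bloom filter answers membership queries on seed k-mers with no false negatives and a tunable false-positive rate, which is what makes it attractive for first-stage filtration in BLASTN-style database scanning. A minimal software sketch follows; the paper implements the hash units in FPGA logic, whereas deriving the k hash positions from a single SHA-256 digest here is an arbitrary software convenience.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter over fixed-length DNA seeds (k-mers)."""
    def __init__(self, m_bits, k_hashes):
        assert k_hashes * 4 <= 32  # 4 digest bytes per hash function
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, seed):
        # Derive k bit positions from one digest (software stand-in
        # for the k independent hardware hash units).
        digest = hashlib.sha256(seed.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, 'big') % self.m

    def add(self, seed):
        for p in self._positions(seed):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, seed):
        # May report false positives, never false negatives.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(seed))

def kmers(sequence, k=11):
    """All length-k substrings; BLASTN's default seed length is 11."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

In a filtration pipeline, the query's k-mers are inserted into the filter and database positions are only passed to the (expensive) exact matching stage when their k-mer tests positive.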
SoCWire: A Robust and Fault Tolerant Network-on-Chip Approach for a Dynamic Reconfigurable System-on-Chip in FPGAs
Abstract
Individual Data Processing Units (DPUs) are commonly used for operational control and specific data processing of scientific space instruments. These instruments have to be suitable for the harsh space environment in terms of, e.g., temperature and radiation. Thus they need to be robust and fault tolerant to achieve adequate reliability. The configurable System-on-Chip (SoC) solution based on FPGAs has successfully demonstrated flexibility and reliability for scientific space applications such as the Venus Express mission. Future space missions demand high-performance on-board processing because of the discrepancy between extremely high data volumes and low downlink channel capacity. Furthermore, in-flight reconfiguration ability and dynamically reconfigurable modules enhance the system with maintenance potential and run-time adaptive functionality. To achieve these advanced design goals, a flexible Network-on-Chip (NoC) is proposed for applications demanding high reliability, such as space missions. The conditions for SRAM-based FPGAs in space are outlined. Additionally, we present our newly developed NoC approach, System-on-Chip Wire (SoCWire), and outline its performance and suitability for robust dynamically reconfigurable systems.
Björn Osterloh, Harald Michalik, Björn Fiethe
A Light-Weight Approach to Dynamical Runtime Linking Supporting Heterogenous, Parallel, and Reconfigurable Architectures
Abstract
When targeting hardware accelerators and reconfigurable processing units, the question of programmability arises, i.e. how different implementations of individual, configuration-specific functions are provided. Conventionally, this is resolved either at compilation time with a specific hardware environment being targeted, by initialization routines at program start, or by decision trees at run-time. Such techniques are, however, hardly applicable to dynamically changing architectures. Furthermore, these approaches show conceptual drawbacks such as requiring access to source code, requiring upfront knowledge of future system configurations, and overloading the code with reconfiguration-related control routines.
We therefore present a low-overhead technique enabling on-demand resolution of individual functions. This technique can be applied in two different manners; we discuss the benefits of the individual implementations and show how both approaches can be used to establish code compatibility between different heterogeneous, reconfigurable, and parallel architectures. Furthermore, we show that both approaches incur only insignificant overhead.
Rainer Buchty, David Kramer, Mario Kicherer, Wolfgang Karl
Ultra-Fast Downloading of Partial Bitstreams through Ethernet
Abstract
In this paper we present an ultra-fast downloading process for partial bitstreams over a standard Ethernet network. These Virtex-based, partially reconfigurable systems use a specific data-link level protocol to communicate with remote bitstream servers. Targeted applications cover portable low-cost communicating equipment, multi-standard software defined radio, automotive embedded electronics, mobile robotics, and even spacecraft, where dynamic reconfiguration of FPGAs reduces the component count and hence the price, the weight, the power consumption, etc. These systems require a local network controller and a very small internal memory to support this specific protocol. Measurements based on real implementations show that our systems can download partial bitstreams twenty times faster than the best known solutions (a sustained rate of 80 Mbit/s over 100 Mbit/s Ethernet), with memory requirements in the range of tens of KB.
Pierre Bomel, Jeremie Crenne, Linfeng Ye, Jean-Philippe Diguet, Guy Gogniat

Massive Parallel Architectures

SCOPE - Sensor Mote Configuration and Operation Enhancement
Abstract
Wireless sensor networks are difficult to manage and control due to their large geographical distribution and the lack of visual feedback. Configuration, debugging, monitoring, and role assignment are only possible with access to the running application.
With SCOPE we developed a generic management framework for such networks. It can be integrated into every TinyOS application to monitor and adjust application values such as configuration variables, sensor readings, or other data. For maximum flexibility, a generic approach was taken to read and set variables via Remote Instance Calls. SCOPE is a daemon application running on every mote, sending the desired data to a Java application on a PC with network access. This enables the system administrator to manage and control every single node by adjusting these values.
This paper describes the architecture and use of the SCOPE framework and compares it with other non-commercial state-of-the-art system management frameworks for wireless sensor networks.
Harun Özturgut, Christian Scholz, Thomas Wieland, Christoph Niedermeier
Generated Horizontal and Vertical Data Parallel GCA Machines for the N-Body Force Calculation
Abstract
The GCA model (Global Cellular Automata) is a massively parallel computation model which is a generalization of the Cellular Automata model. A GCA cell contains data and link information. Using the link information, each cell has dynamic read access to any global cell in the field. The data and link information is updated in every generation. The GCA model is applicable and efficient for a large range of parallel algorithms (sorting, vector reduction, graph algorithms, matrix computations, etc.). In order to describe algorithms for the GCA model, the experimental language GCAL was developed. GCAL programs can be transformed automatically into a data parallel architecture (DPA). For the N-body problem, the paper shows how the force calculation between the masses can be described in GCAL and synthesized into a data parallel architecture. First, the GCAL description of the application is transformed into a Verilog description, which is inserted into a Verilog template describing the general DPA. Then the whole Verilog code is used as input for an FPGA synthesis tool, which generates the application-specific DPA. Two different DPAs are generated, a “horizontal” and a “vertical” DPA. The horizontal DPA uses 17 floating-point operators in each deep pipeline. In contrast, the vertical DPA uses only one floating-point operation at a time out of a set of 6 floating-point operators. Both architectures are compared with respect to resource consumption, time per cell operation, and cost (logic elements × execution time). It turns out that the horizontal DPA is approximately 15 times more cost efficient than the vertical DPA.
Johannes Jendrsczok, Rolf Hoffmann, Thomas Lenck
Hybrid Resource Discovery Mechanism in Ad Hoc Grid Using Structured Overlay
Abstract
Resource management has been an area of research in ad hoc grids for many years. Recently, different research projects have focused on resource management in a centralized, decentralized, or hybrid manner. In this paper, we discuss a microeconomics-based, hybrid resource discovery mechanism. The proposed mechanism focuses on the extension of a structured overlay network to manage the (dis)appearance of matchmakers in the grid and to route messages to the appropriate matchmaker in the ad hoc grid. The mechanism is based on the emergent behavior of the participating nodes and adapts to changes in the ad hoc grid environment. Experiments are executed on PlanetLab to test the scalability and robustness of the proposed mechanism. Simulation results show that our mechanism performs better than previously proposed mechanisms.
Tariq Abdullah, Luc Onana Alima, Vassiliy Sokolov, David Calomme, Koen Bertels

Organic Computing

Marketplace-Oriented Behavior in Semantic Multi-Criteria Decision Making Autonomous Systems
Abstract
Autonomy in Organic Computing systems is supposed to ensure well-functioning engineering systems. Our approach called Semantic Multi-Criteria Decision Making (SeMCDM) brings the decision making process of autonomous units close to the intention of human designers. This paper studies the integration of marketplace-oriented behavior into SeMCDM. It defines distributed and centralized market scenarios, suggests evaluation metrics with consideration of resource-restricted applications, extracts related characteristics of the application environment and presents simulation results. The paper concludes with recommendations about the adequate market scenario in relation to the application environment.
Ghadi Mahmoudi, Christian Müller-Schloer, Jörg Hähner
Self-organized Parallel Cooperation for Solving Optimization Problems
Abstract
This paper is about using a set of self-organized computing resources to perform multi-objective optimization. In the proposed approach, the computing resources are presented to the user as a unified resource, whereas in traditional parallel optimization paradigms the user has to assign tasks to the resources, collect the best available solutions, and deal with failing resources. In our approach, called the self-organized parallel cooperation model, the user only has to specify the preferences and provide the objective functions to the system. The self-organized computing resources deliver the obtained solutions to the user after a certain time. In such a system, fast resources must continue the optimization as long as the overall computing time is not over. However, as the solutions of a multi-objective problem depend on each other (via the domination relation), adding a waiting time to the fast processors would affect the quality of the solutions. This has been studied in a scenario of 100 heterogeneous computing resources in the presence of failures in the system.
Sanaz Mostaghim, Hartmut Schmeck

Memory Architectures

Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture
Abstract
The disparity between microprocessor clock frequencies and memory latency is a primary reason why many demanding applications run well below peak achievable performance. Software-controlled scratchpad memories, such as the Cell local store, attempt to ameliorate this discrepancy by enabling precise control over memory movement; however, scratchpad technology confronts the programmer and compiler with an unfamiliar and difficult programming model. In this work, we present the Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory in order to provide a more effective and practical approach to latency hiding. ViVA requires minimal changes to the core design and could thus be easily integrated with conventional processor cores. To validate our approach, we implemented ViVA on the Mambo cycle-accurate full system simulator, which was carefully calibrated to match the performance of our underlying PowerPC Apple G5 architecture. Results show that ViVA is able to deliver significant performance benefits over scalar techniques for a variety of memory access patterns as well as two important memory-bound compact kernels, corner turn and sparse matrix-vector multiplication — achieving a 2x–13x improvement compared to the scalar version. Overall, our preliminary ViVA exploration points to a promising approach for improving application performance on leading microprocessors with minimal design and complexity costs, in a power efficient manner.
Joseph Gebis, Leonid Oliker, John Shalf, Samuel Williams, Katherine Yelick
An Enhanced DMA Controller in SIMD Processors for Video Applications
Abstract
Although current SIMD processor architectures can improve the processing performance by exploiting the data level parallelism inherent in video applications, an important performance penalty appears when processing data that is not formatted in an amenable way, e.g. unaligned memory access. This paper presents an enhanced DMA controller that performs block-based data transfers and a realignment when accessing a word in an external memory that is not aligned with the natural data memory/bus width boundary. Moreover, the enhanced DMA controller performs a signal extension while accessing data outside a specific region, e.g. a video frame, decreasing the total amount of processing cycles required for a typical video application. Performance improvements of up to 25% can be achieved when running a highly time consuming video encoding task (motion estimation) on a generic VLIW architecture with the enhanced DMA controller compared to a basic block-transfer DMA controller.
Guillermo Payá-Vayá, Javier Martín-Langerwerf, Sören Moch, Peter Pirsch
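The realignment and boundary-extension features described above can be sketched in software. The model below assumes a 4-byte bus width and clamp-to-edge extension at frame borders; both are illustrative choices, not details taken from the paper.

```python
BUS_WIDTH = 4  # bytes per naturally aligned memory word (assumed bus width)

def aligned_read(memory, word_index):
    """Model of a memory that only serves naturally aligned words."""
    base = word_index * BUS_WIDTH
    return memory[base: base + BUS_WIDTH]

def unaligned_read(memory, addr):
    """Realigning transfer: fetch the two aligned words that straddle
    `addr`, then shift/merge to extract the requested unaligned word.
    A basic block-transfer DMA controller would instead have to reject
    or split such an access, costing extra processing cycles."""
    word = addr // BUS_WIDTH
    offset = addr % BUS_WIDTH
    lo = aligned_read(memory, word)
    hi = aligned_read(memory, word + 1)
    return (lo + hi)[offset: offset + BUS_WIDTH]

def edge_extended(frame, x):
    """Boundary extension: accesses outside the frame return the nearest
    edge sample, as needed e.g. by motion estimation at frame borders
    (clamp-to-edge is one possible extension policy)."""
    return frame[min(max(x, 0), len(frame) - 1)]
```

Doing this merge in the DMA controller means the SIMD datapath always receives aligned, in-bounds data and spends no cycles on realignment or border handling itself.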
Cache Controller Design on Ultra Low Leakage Embedded Processors
Abstract
A leakage-efficient cache controller design targeted at ultra low power embedded processors is proposed. The key insight is that a large subset of the circuits is accessed only when cache misses happen. By utilizing a fine-grained run-time power gating technique, such a subset can be dynamically powered off as a power-gated domain. Two simple but effective sleep control policies are proposed to assure the leakage reduction effect, and, to eliminate the impact of the wake-up process, a latency cancellation mechanism is also proposed. Evaluation results show that, in 90 nm CMOS technology, 69% and 64% of leakage power can be reduced for the instruction cache controller and the data cache controller, respectively, without performance degradation.
Zhao Lei, Hui Xu, Naomi Seki, Saito Yoshiki, Yohei Hasegawa, Kimiyoshi Usami, Hideharu Amano

Energy Awareness

Autonomous DVFS on Supply Islands for Energy-Constrained NoC Communication
Abstract
An autonomous-DVFS-enabled supply island architecture on network-on-chip platforms is proposed. This architecture exploits the temporal and spatial variations of network traffic to minimize the communication energy while constraining the latency and supply management overhead. Each island is equipped with an autonomous DVFS mechanism, which tracks the local and nearby network conditions. In quantitative simulations with various types of representative traffic patterns, this approach achieves greater energy efficiency than two other low-energy architectures (typically 10%–27% lower energy). With autonomous supply management at a proper granularity, as demonstrated in this study, the communication energy can be minimized in a scalable manner for many-core NoCs.
Liang Guang, Ethiopia Nigussie, Lauri Koskinen, Hannu Tenhunen
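The abstract does not specify the island-level control rule; a threshold policy with hysteresis driven by local buffer occupancy is one common way such autonomous DVFS decisions are made. The sketch below is hypothetical — the levels, thresholds, and occupancy signal are invented for illustration.

```python
# Hypothetical per-island DVFS policy. Each island independently picks a
# voltage/frequency level from its local router-buffer occupancy, with
# hysteresis (distinct up/down thresholds) to limit level-switch overhead.
LEVELS = [(0.8, 200), (1.0, 400), (1.2, 800)]  # (volts, MHz), illustrative

def next_level(current, occupancy, up=0.75, down=0.25):
    """Raise the V/f level when local buffers fill beyond `up`,
    lower it when they drain below `down`; otherwise hold steady."""
    if occupancy > up and current < len(LEVELS) - 1:
        return current + 1
    if occupancy < down and current > 0:
        return current - 1
    return current
```

Because each island decides from locally observable state only, the scheme needs no global controller, which is what makes it scale to many-core NoCs.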
Energy Management System as an Embedded Service: Saving Energy Consumption of ICT
Abstract
In this paper we present a service approach based on the use of embedded network devices for the energy management of Information and Communication Technology (ICT) infrastructures. The service is fully compatible with other strategies for achieving efficient energy-saving management. These devices are very small, have low consumption, and are specially designed to operate with minimal maintenance; they are exposed under Service Oriented Architecture open standards, more specifically as Web Services. In addition, these embedded services can work individually or in collaboration with other ICT enterprise services, either through conventional systems or by means of other embedded devices. To validate the proposal we have implemented a prototype and designed a test scenario based on the replicated ICT infrastructure supporting the web applications of the Polytechnic University College at the University of Alicante.
Francisco Maciá-Pérez, Diego Marcos-Jorquera, Virgilio Gilart-Iglesias

Java Processing

A Garbage Collection Technique for Embedded Multithreaded Multicore Processors
Abstract
Multicore processors are becoming more and more popular, even in embedded systems. Due to its deeply integrated threading concept, Java is a natural choice for expressing the thread-level parallelism required to exploit the performance potential of a multicore. Accordingly, software developers are familiar with the threading concept, which means that existing threaded applications already fit very well on a multicore processor and are able to utilize its advantages. Nevertheless, a drawback of Java has to be mentioned: the required garbage collection. Especially in multicore environments, the commonly used stop-the-world collectors reach their limits because all cores have to be suspended whenever a single thread requires a garbage collection cycle. Hence, the performance of the other cores is harmed tremendously. In this paper we present a garbage collection technique that runs in parallel to the application threads within a multithreaded multicore without any stop-the-world behavior.
Sascha Uhrig, Theo Ungerer
Empirical Performance Models for Java Workloads
Abstract
Java is widely deployed on a variety of processor architectures. Consequently, an understanding of microarchitecture-level Java performance is critical to optimize current systems and to aid design and development of future processor architectures for Java. Although this is facilitated by a rich set of processor performance counters featured on several contemporary processors, complex processor microarchitecture structures and their interactions make it difficult to relate observed events to overall performance. This, coupled with the complexities associated with running Java over a virtual machine, further aggravates the situation. This paper explores and evaluates the effectiveness of empirical modeling for Java workloads. Our models use statistical regression techniques to relate overall Java system performance to various observed microarchitecture events and their interactions. Multivariate adaptive regression splines effectively capture non-linear and non-monotonic associations between the response and predictor variables. Our models are interpretable, easy to construct, and exhibit high correlation and low errors between predicted and measured performance. Furthermore, empirical models afford additional insights into the characteristics of Java performance, and the use of statistical techniques throughout this study allows us to assign confidence levels to our estimates of performance.
Pradeep Rao, Kazuaki Murakami

Chip-Level Multiprocessing

Performance Matching of Hardware Acceleration Engines for Heterogeneous MPSoC Using Modular Performance Analysis
Abstract
In order to meet demanding challenges of increasing computational requirements and stringent power constraints, there is a gradual trend towards heterogeneous multi-processor system-on-chip (MPSoC) designs integrating application specific acceleration engines. One major problem faced by the design tools for mapping of algorithms onto MPSoC architectures is the dimensioning of system components through performance analysis. In this paper, we propose a fast and accurate methodology for rate matching of statically scheduled acceleration engines using modular performance analysis. Given a set of Pareto-optimal hardware accelerator designs and an input workload behavior, the proposed methodology determines cost efficient hardware accelerators that can handle the workload. A motion JPEG case study illustrates the benefit of coupling high level synthesis tools with performance analysis.
Hritam Dutta, Frank Hannig, Jürgen Teich
Evaluating CMPs and Their Memory Architecture
Abstract
Many-core processor architectures require scalable solutions that reflect the locality and power constraints of future generations of technology. This paper presents a CMP architecture that supports automatic mapping and dynamic scheduling of threads leaving the binary code devoid of any explicit communication. The thrust of this approach is to produce binary code that is divorced from implementation parameters, yet, which still gives good performance over future generations of CMPs. A key component of this abstract processor architecture is the memory system. This paper evaluates the memory architectures, which must maintain performance across a range of targets.
Chris Jesshope, Mike Lankamp, Li Zhang
Backmatter
Metadata
Title
Architecture of Computing Systems – ARCS 2009
Edited by
Mladen Berekovic
Christian Müller-Schloer
Christian Hochberger
Stephan Wong
Copyright year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-00454-4
Print ISBN
978-3-642-00453-7
DOI
https://doi.org/10.1007/978-3-642-00454-4
