main-content

Über dieses Buch

On behalf of the ProgramCommittee, we are pleased to present the proceedings of the 2005 Asia-Paci?c Computer Systems Architecture Conference (ACSAC 2005) held in the beautiful and dynamic country of Singapore. This conference was the tenth in its series, one of the leading forums for sharing the emerging research ?ndings in this ?eld. In consultation with the ACSAC Steering Committee, we selected a - member Program Committee. This Program Committee represented a broad spectrum of research expertise to ensure a good balance of research areas, - stitutions and experience while maintaining the high quality of this conference series. This year’s committee was of the same size as last year but had 19 new faces. We received a total of 173 submissions which is 14% more than last year. Each paper was assigned to at least three and in some cases four ProgramC- mittee members for review. Wherever necessary, the committee members called upon the expertise of their colleagues to ensure the highest possible quality in the reviewing process. As a result, we received 415 reviews from the Program Committee members and their 105 co-reviewers whose names are acknowledged inthe proceedings.Theconferencecommitteeadopteda systematicblind review process to provide a fair assessment of all submissions. In the end, we accepted 65 papers on a broad range of topics giving an acceptance rate of 37.5%. We are grateful to all the Program Committee members and the co-reviewers for their e?orts in completing the reviews within a tight schedule.

Inhaltsverzeichnis

Processor Architecture for Trustworthy Computers

We propose that computer architects need to design more trustworthy computers that protect a user’s information, computations and communications from attacks by malicious adversaries. This is in addition to providing current engineering goals of higher performance, lower cost, lower power consumption and smaller footprint.

Ruby B. Lee

Efficient Voltage Scheduling and Energy-Aware Co-synthesis for Real-Time Embedded Systems

This paper presents an integrated methodology and a tool for system-level low power/energy co-synthesis for real-time embedded systems. Voltage scheduling (VS) is being applied to utilize the inherent slacks in the system. The voltage schedule is generated based on a global view of all tasks’ mapping and their energy profiles. The tool explores the three dimensional design space (performance-power-cost) to find implementations that offer the best trade-off among these design objectives. Unnecessary power dissipation is prevented by refining the allocation/binding in an additional synthesis step. The experimental results show that our approach remarkably improves the efficiency of VS and leads to additional energy savings, especially for applications with stringent delay constraints.

Energy-Effective Instruction Fetch Unit for Wide Issue Processors

Continuing advances in semiconductor technology and demand for higher performance will lead to more powerful, superpipelined and wider issue processors. Instruction caches in such processors will consume a significant fraction of the on-chip energy due to very wide fetch on each cycle. This paper proposes a new energy-effective design of the fetch unit that exploits the fact that not all instructions in a given I-cache fetch line are used due to taken branches. A

is proposed to detect which instructions in an I-cache access will actually be used to avoid fetching any of the other instructions. The solution is evaluated for a 4-, 8- and 16-wide issue processor in 100nm technology. Results show an average improvement in the I-cache Energy-Delay product of 20% for the 8-wide issue processor and 33% for the 16-wide issue processor for the SPEC2000, with no negative impact on performance.

Juan L. Aragón, Alexander V. Veidenbaum

Rule-Based Power-Balanced VLIW Instruction Scheduling with Uncertainty

Power-balanced instruction scheduling for Very Long Instruction Word (VLIW) processors is an optimization problem which requires a good instruction-level power model for the target processor. Conventionally, these power models are deterministic. However, in reality, there will always be some degree of imprecision involved. For power critical applications, it is desirable to find an optimal schedule which makes sure that the effects of these uncertainties could be minimized. The scheduling algorithm has to be computationally efficient in order to be practical for use in compilers. In this paper, we propose a rule based genetic algorithm to efficiently solve the optimization problem of power-balanced VLIW instruction scheduling with uncertainties in the power consumption model. We theoretically prove our rule-based genetic algorithm can produce as good optimal schedules as the existing algorithms proposed for this problem. Furthermore, its computational efficiency is significantly improved.

Shu Xiao, Edmund M. -K. Lai, A. B. Premkumar

An Innovative Instruction Cache for Embedded Processors

In this paper we present a methodology to enable the design of power efficient instruction cache for embedded processors. The proposed technique, which splits the instruction cache into several small sub-caches, utilizes the locality of applications to reduce dynamic energy consumption in the instruction cache. The proposed cache reduces dynamic energy consumption by accessing only one sub-cache when a request comes into the cache. It also reduces dynamic energy consumption by eliminating the energy consumed in tag matching. In addition, we propose the technique to reduce leakage energy consumption in the proposed cache. We evaluate the design using a simulation infrastructure based on SimpleScalar and CACTI. Simulation results show that the proposed cache reduces dynamic energy by 42% – 59% and reduces leakage energy by 70% – 80%.

Cheol Hong Kim, Sung Woo Chung, Chu Shik Jhon

Dynamic Voltage Scaling for Power Aware Fast Fourier Transform (FFT) Processor

In recent years, power dissipation in CMOS circuits has grown exponentially due to the fast technology scaling and the increase in complexity. Supply Voltage scaling is an effective technique to reduce dynamic power dissipation due to the non-linear relationship between dynamic power and Vdd. In other words, Vdd can be scaled freely except with limitation for below threshold voltage operation. The dynamic voltage scaling architecture mainly consists of dc-dc power regulator which is customised to produce variability on the Vdd. The implemented architecture can dynamically vary the Vdd from 300 mV to 1.2V, with initial setup time of 1.5

μ

sec. This paper investigates the effect of DVS on dynamic power dissipation in a Fast Fourier Transform multiplier core. Implementation of DVS on the multiplier blocks has shown 25% of average power reduction. The design was implemented using 0.12

μ

m ST-Microelectronic 6-metal layer CMOS dual- process technology.

David Fitrio, Jugdutt Singh, Aleksandar Stojcevski

Design of an Efficient Multiplier-Less Architecture for Multi-dimensional Convolution

Design of a hardware efficient multiplier-less architecture for the computation of multi-dimensional convolution is presented in this paper. The new architecture performs computations in the logarithmic domain by utilizing novel multiplier-less log

2

and inverse-log

2

modules. An effective data handling strategy is developed in conjunction with the logarithmic modules to eliminate the necessity of multipliers in the architecture. The proposed approach reduces hardware resources significantly compared to other approaches while it still maintains a high degree of accuracy. The architecture is developed as a combined systolic-pipelined design that produces an output in every clock cycle after the initial latency of the system. The architecture is capable of operating with a high speed clock frequency of 99 MHz based on Xilinx’s Virtex II 2v2000ff896-4 FPGA and the throughput of the system is observed as 99 MOPS (million outputs per second).

Ming Z. Zhang, Hau T. Ngo, Vijayan K. Asari

A Pipelined Hardware Architecture for Motion Estimation of H.264/AVC

The variable block size motion estimation (VBSME) presented in the video coding standard H.264/AVC significantly improves coding efficiency, but it requires much more considerable computational complexity than motion estimation using fixed macroblocks. To solve this problem, this paper proposes a pipelined hardware architecture for full-search VBSME aiming for high performance, simple structure, and small controls. Our architecture consists of 1-D arrays with 64 processing elements, an adder tree to produce motion vectors (MVs) for variable block sizes, and comparators to determine the minimum of MVs. This can produce all 41 MVs for variable blocks of one macroblock in the same clock cycles to other conventional 1-D arrays of 64 PEs. In addition, this can be easily controlled by a 2-bit counter. Implementation results show that our architecture can estimate MVs in CIF video sequence at a rate of 106 frames/s for the 32×32 search range.

Su-Jin Lee, Cheong-Ghil Kim, Shin-Dug Kim

Embedded Intelligent Imaging On-Board Small Satellites

Current commercial Earth Observation satellites have very restricted image processing capabilities on-board. They mostly operate according to a ‘store-and forward’ mechanism, where the images are stored on-board after being acquired from the sensors and are downlinked when contact with a ground station occurs. However, in order for disaster monitoring satellite missions to be effective, there is a need for automated and intelligent image processing onboard. In fact, the need for increasing the automation on-board is predicted as one of the main trends for future satellite missions. The main factors that hold back this concept are the limited power and computing resources on-board the spacecraft. This paper reviews existing image processing payloads of earth observing small satellites. An autonomous change detection system is proposed to demonstrate the feasibility of implementing an intelligent system on-board a small satellite. Performance results for the proposed intelligent imaging system are estimated, scaled and compared to existing hardware that are being used in the SSTL DMC satellite platform.

Siti Yuhaniz, Tanya Vladimirova, Martin Sweeting

Architectural Enhancements for Color Image and Video Processing on Embedded Systems

Application-specific extensions of a processor provide an efficient mechanism to meet the growing performance demands of multimedia applications. This paper presents a color-aware instruction set extension (CAX) for embedded multimedia systems that supports vector processing of color image sequences. CAX supports parallel operations on two-packed 16-bit (6:5:5) YCbCr (luminance-chrominance) data in a 32-bit datapath processor, providing greater concurrency and efficiency for color image and video processing. Unlike typical multimedia extensions (e.g., MMX, VIS, and MDMX), CAX harnesses parallelism within the human perceptual YCbCr space, rather than depending solely on generic subword parallelism. Experimental results on an identically configured, dynamically scheduled 4-way superscalar processor indicate that CAX outperforms MDMX (a representative MIPS multimedia extension) in terms of speedup (3.9× with CAX, but only 2.1× with MDMX over the baseline performance) and energy reduction (68% to 83% reduction with CAX, but only 39% to 69% reduction with MDMX over the baseline). More exhaustive simulations are conducted to provide an in-depth analysis of CAX on machines with varying issue widths, ranging from 1 to 16 instructions per cycle. The impact of the CAX plus loop unrolling is also presented.

Jongmyon Kim, D. Scott Wills, Linda M. Wills

A Portable Doppler Device Based on a DSP with High- Performance Spectral Estimation and Output

A low-cost and high-performance portable Doppler blood flow analysis device, which is based on a digital signal processor (DSP) TMS320V549 (Texas Instruments), contains a 240 * 320 LCD color graphic display module (Hantronix) and a thermal printer (Seiko Instruments), is developed in this study. The complex real-time autoregressive (AR) modeling is implemented in this device to estimate the time frequency representation of blood flow signals. Directional Doppler spectrograms are computed directly from the in-phase and quardrature components of the Doppler signal. Sampling frequency can vary among 5kHz, 10kHz, 15kHz, 20kHz and 25kHz to optimize the displaying dynamic range according to the blood velocity. In order to increase the display quality and provide more comprehensive information about the components of the blood flow profile, The Doppler spectrograms can be displayed in real-time on the LCD in 256 colors. They can also be printed out in 13 gray levels from the thermal printer for recording. The Doppler spectrograms computed by the AR modeling are compared with those by the STFT. The results show that this compact, economic, versatile bi-directional and battery- or line- operated Doppler device, which offers the high-performance spectral estimation and output, will be useful at different conditions, including bed-side hospitals and clinical offices.

Yufeng Zhang, Yi Zhou, Jianhua Chen, Xinling Shi, Zhenyu Guo

A Power-Efficient Processor Core for Reactive Embedded Applications

Reactive processors are a version of processors that provide architectural supports for the execution of reactive embedded applications. Even though much work has been done to improve the performance of reactive processors, the issue of optimizing power consumption has not been addressed. In this paper, we propose a new power-efficient processor core for reactive embedded applications. The new processor core (called ReMIC-PA) is implemented by adopting several power consumption optimizations to an existing reactive processor core (ReMIC). Initial benchmarking results show that ReMIC-PA achieves more than 20% power saving for data-dominated embedded applications and more than 50% power saving for control-dominated embedded applications when compared to ReMIC.

Lei Yang, Morteza Biglari-Abhari, Zoran Salcic

A Stream Architecture Supporting Multiple Stream Execution Models

Multimedia devices demands a platform integrated various functional modules and an increasing support of multiple standards. Stream architecture is able to solve the problem. However the applications suited for typical stream architecture are limited. This paper describes MASA (Multiple-morphs Adaptive Stream Architecture) prototype system which supports regular stream and irregular stream by different execution models according to applications’ stream characteristics. The paper first discusses MASA architecture and stream model, and then explores the features and advantages of MASA through mapping a stream applications H.264 to hardware. The result is encouraging. At last, presents the design and implementation on FPGA of MASA’s prototype.

Nan Wu, Mei Wen, Haiyan Li, Li Li, Chunyuan Zhang

The Challenges of Massive On-Chip Concurrency

Moore’s law describes the growth in on-chip transistor density, which doubles every 18 to 24 months and looks set to continue for at least a decade and possibly longer. This growth poses major problems (and provides opportunities) for computer architecture in this time frame. The problems arise from current architectural approaches, which do not scale well and have used clock speed rather than concurrency to increase performance. This, in turn, causes excessive power dissipation and circuit complexity. This paper takes a long-range position on the future of chip multiprocessors, both from the micro-architecture perspective, as well as from a systems perspective. Concurrency will come from many levels, with instruction and loop-level concurrency managed at the micro-architecture and higher levels by the system. Chip-level multiprocessors exploiting massive concurrency we term

Microgrids.

The directions proposed in this paper provide micro-architectural concurrency with full forward compatibility over orders of magnitude of scaling and also the management of on-chip resources (processors etc.) so as to autonomously configure a system for a variety of goals (e.g. low power, high performance, etc.).

Kostas Bousias, Chris Jesshope

FMRPU: Design of Fine-Grain Multi-context Reconfigurable Processing Unit

At present the scale of multimedia and communication systems has become more and more complicated due to their fast developments. In order to handle diverse functions and shorten system development time, the ability to reconfigure system architecture becomes an important and flexible design consideration. In this paper, we propose a novel reconfigurable processing unit, FMRPU, which is a fine-grain with multi-context reconfigurable processing unit targeting at high-throughput and data-parallel applications. It contains 64 reconfigurable logic evel connectivity. According to the simulation results, the longest routing arrays, 16 switch boxes, and connects with each other via three hierarchical-lpath of FMRPU only takes 6.5 ns at 0.35 processes, which is able to construct the required logic circuit efficiently. To compare with same kind devices in dealing with Motion Estimation operations, the performance is raise to 17% and has excellent performance in executing DSP algorithms.

Jih-Ching Chiu, Ren-Bang Lin

Modularized Redundant Parallel Virtual File System

Providing data availability in a high performance computing environment is very important, especially in this data-intensive world. Most clusters either use RAID technology or redundant nodes to ensure data availability. People usually use parallel file systems to increase the throughput of a computing system. However, when a parallel file system is involved in a distributed environment, some mechanisms must be provided to overcome the side-effect of using striping. PVFS is a popular and open source parallel file system in the Linux environment, but it provides no fault tolerance. We propose an idea of using distributed RAID technology to ensure the data availability of using striping. By introducing a parity cache table (PCT), we can improve write performance when updating parity is needed. The evaluation of our MRPVFS (Modularized Redundant Parallel Virtual File System) shows that the read performance of MRPVFS is almost the same as that of the original PVFS. As to the write performance, there only exists a little performance degradation due to generating parities.

Sheng-Kai Hung, Yarsun Hsu

Resource-Driven Optimizations for Transient-Fault Detecting SuperScalar Microarchitectures

Increasing microprocessor vulnerability to soft errors induced by neutron and alpha particle strikes prevents aggressive scaling and integration of transistors in future technologies if left unaddressed. Previously proposed instruction-level redundant execution, as a means of detecting errors, suffers from a severe performance loss due to the resource shortage caused by the large number of redundant instructions injected into the superscalar core. In this paper, we propose to apply three architectural enhancements, namely 1) floating-point unit sharing (FUS), 2) prioritizing primary instructions (PRI), and 3) early retiring of redundant instructions (ERT), that enable transient-fault detecting redundant execution in superscalar microarchitectures with a much smaller performance penalty, while maintaining the original full coverage of soft errors. In addition, our enhancements are compatible with many other proposed techniques, allowing for further performance improvement.

Jie S. Hu, G. M. Link, Johnsy K. John, Shuai Wang, Sotirios G. Ziavras

A Fault-Tolerant Routing Strategy for Fibonacci-Class Cubes

Fibonacci Cubes (

FCs

), together with the enhanced and extended forms, are a family of interconnec tion topologies formed by diluting links from binary hypercube. While they scale up more slowly, they provide more choices of network size. Despite sparser connectivity, they al low efficient emulation of many other topologies. However, there is

no

existing fault-tolerant routing strategy for

FC

s or other node/link diluted cubes. In this paper, we propose a unified fault-tolerant routing strategy for all Fibonacci-class Cubes, tolerating as many faulty components as network

node availability

. The algorithm is livelock free and generates deadlock-free routes, whose length is bounded linearly to network dimensionality. As a component, a generic approach to avoiding immediate cycles is designed which is applicable to a wide range of inter-connection networks, with computational and spatial complexity at

O

(1) and

O

(

n

log

n

) respectively. Finally, the performance of the algorithm is presented and analyzed through software simulation, showing its feasibility.

Zhang Xinhua, Peter K. K. Loh

Embedding of Cycles in the Faulty Hypercube

Let

f

v

(respectively,

f

e

) denote the number of faulty vertices (respectively, edges) in an

n

-dimensional hypercube. In this paper, we show that a fault-free cycle of length of at least 2

n

–2

f

v

can be embedded in an

n

-dimensional hypercube with

f

e

n

– 2 and

f

v

+

f

e

≤ 2

n

– 4. Our result not only improves the previously best known result of Sengupta (1998) where

f

v

> 0 or

f

e

n

– 2 and

f

v

+

f

e

n

– 1 were assumed, but also extends the result of Fu (2003) where only the faulty vertices are considered.

Sun-Yuan Hsieh

Improving the Performance of GCC by Exploiting IA-64 Architectural Features

The IA-64 architecture provides a rich set of features to aid the compiler in exploiting instruction-level parallelism to achieve high performance. Currently, GCC is a widely used open-source compiler for IA-64, but its performance, especially its floating-point performance, is poor compared to that of commercial compilers because it has not fully utilized IA-64 architectural features. Since late 2003 we have been working on improving the performance of GCC on IA-64. This paper reports four improvements on enhancing its floating-point performance, namely alias analysis for FORTRAN (its part for COMMON variables already committed in GCC 4.0.0), general induction variable optimization, loop unrolling and prefetching arrays in loops. These improvements have significantly improved the floating-point performance of GCC on IA-64 as extensively validated using SPECfp2000 and NAS benchmarks.

Canqun Yang, Xuejun Yang, Jingling Xue

An Integrated Partitioning and Scheduling Based Branch Decoupling

Conditional branch induced control hazards cause significant performance loss in modern out-of-order superscalar processors. Dynamic branch prediction techniques help alleviate the penalties associated with conditional branch instructions. However, branches still constitute one of the main hurdles towards achieving higher ILP. Dynamic branch prediction relies on the temporal locality of and spatial correlations between branches. Branch decoupling is yet another mechanism that exploits the innate lead in the branch schedule with respect to the rest of the computation. The compiler is responsible for generating the two maximally decoupled instruction streams: branch stream and program stream. Our earlier work on trace based evaluation of branch decoupling demonstrates a performance advantage of between 12% to 46% over 2-level branch prediction. However, how much of these gains are achievable through static, compiler driven decoupling is not known. This paper answers the question partially. A novel decoupling algorithm that integrates graph bi-partitioning and scheduling, was deployed in the GNU C compiler to generate a two instruction stream executable. These executables were targeted to branch decoupled architecture simulator with superscalar cores for the branch stream and program stream processors. Simulations show an average performance improvement of 7.7% and 5.5% for integer and floating point benchmarks of the SPEC2000 benchmark suite respectively.

Pramod Ramarao, Akhilesh Tyagi

A Register Allocation Framework for Banked Register Files with Access Constraints

Banked register file has been proposed to reduce die area, power consumption, and access time. Some embedded processors, e.g. Intel’s IXP network processors, adopt this organization. However, they expose some access constraints in ISA, which complicates the design of register allocation. In this paper, we present a register allocation framework for banked register files with access constraints for the IXP network processors. Our approach relies on the estimation of the costs and benefits of assigning a virtual register to a specific bank, as well as that of splitting it into multiple banks via copy instructions. We make the decision of bank assignment or live range splitting based on analysis of these costs and benefits. Compared to previous works, our framework can better balance the register pressure among multiple banks and improve the performance of typical network applications.

Feng Zhou, Junchao Zhang, Chengyong Wu, Zhaoqing Zhang

Designing a Concurrent Hardware Garbage Collector for Small Embedded Systems

Today more and more functionality is packed into all kinds of embedded systems, making high-level languages, such as Java, increasingly attractive as implementation languages. However, certain aspects, essential to high-level languages are much harder to address in a low performance, small embedded system than on a desktop computer. One of these aspects is memory management with garbage collection. This paper describes the design process behind a concurrent, garbage collector unit (GCU), a coprocessor to the Java Optimised Processor. The GCU, targeting small embedded real-time applications, implements a mark-compact algorithm, extended with concurrency support, and tuned for improved performance.

Flavius Gruian, Zoran Salcic

Irregular Redistribution Scheduling by Partitioning Messages

Dynamic data redistribution enhances data locality and improves algorithm performance for numerous scientific problems on distributed memory multi-computers systems. Previous results focus on reducing index computational cost, schedule computational cost, and message packing/unpacking cost. In irregular redistribution, however, messages with varying sizes are transmitted in the same communication step. Therefore, the largest sized messages in the same communication step dominate the data transfer time required for this communication step. This work presents an efficient algorithm to partition large messages into multiple small ones and schedules them by using the minimum number of steps without communication contention and, in doing so, reducing the overall redistribution time. When the number of processors or the maximum degree of the redistribution graph increases or the selected size of messages is medium, the proposed algorithm can significantly reduce the overall redistribution time to 52%.

Chang Wu Yu, Ching-Hsien Hsu, Kun-Ming Yu, C. -K. Liang, Chun-I Chen

Making Power-Efficient Data Value Predictions

Power dissipation due to value prediction is being more studied recently. In this paper, a new cost effective data value predictor based on a linear function is introduced. Without the complex two-level structure, the new predictor can still make correct predictions on some patterns that can only be done by the context-based data value predictors. Simulation results show that the new predictor works well with most value predictable instructions. Energy and performance impacts of storing partial tag and common sub-data values in the value predictor are studied. The two methods are found to be good ways to build better cost-performance value predictors. With about 5K bytes, the new data value predictor can obtain 16.5% maximal while 4.6% average performance improvements with the SPEC INT2000 benchmarks.

Yong Xiao, Xingming Zhou, Kun Deng

Speculative Issue Logic

In order to enhance the performance of a computer, most modern processors use superscalar architecture and raise the clock frequency. Superscalar architecture can execute more than one instruction per each cycle. The amount of instruction level parallelism will become more and more important when superscalar issue width is increased. With hardware support, instructions can be speculatively waked up. The more instructions are waked up, the more ILP is exploited, hence IPC is increased. Through speculative aspect can be adopted to wakeup more instructions. But the ILP is still limited by the true data dependency. In this paper we proposed the speculative wakeup logic with value prediction mechanism to overcome the data dependency that will exploit instruction level parallelism. And in order to reduce the recovery frequency, we also propose priority-based select logic with two bit counter. In our experiment, our model will enhance performance by 18.02%.

You-Jan Tsai, Jong-Jiann Shieh

Using Decision Trees to Improve Program-Based and Profile-Based Static Branch Prediction

Improving static branch prediction accuracy is an important problem with various interesting applications. First, several compiler optimizations such as code layout, scheduling, predication, etc. rely on accurate static branch prediction. Second, branches that are statically accurately predictable can be removed from the dynamic branch predictor thereby reducing aliasing. Third, for embedded microprocessors which lack dynamic branch prediction, static branch prediction is the only alternative.

This paper builds on previous work done on evidence-based static branch prediction which uses decision trees to classify branches. We demonstrate how decision trees can be used to improve the Ball and Larus heuristics by optimizing the sequence of applying the heuristics and by discovering two new heuristics, namely one based on the postdomination relationship between the current basic block and its successor and one based on the dependency distance between the branch and its operand defining instruction. Experimental results indicate an increase in the number of instructions per mispredicted branch by 18.5% on average for SPECint95 and SPECint2000. In addition, we show that decision trees can improve profile-based static branch prediction by up to 11.7% by predicting branches that are unseen in the profile runs.

Veerle Desmet, Lieven Eeckhout, Koen De Bosschere

Arithmetic Data Value Speculation

Value speculation is currently widely used in processor designs to increase the overall number of instructions executed per cycle (IPC). Current methods use sophisticated prediction techniques to speculate on the outcome of branches and execute code accordingly. Speculation can be extended to the approximation of arithmetic values. As arithmetic operations are slow to complete in pipelined execution an increase in overall IPC is possible through accurate arithmetic data value speculation. This paper will focus on integer adder units for the purposes of demonstrating arithmetic data value speculation.

Daniel R. Kelly, Braden J. Phillips

Exploiting Thread-Level Speculative Parallelism with Software Value Prediction

Software value prediction

(SVP) is an effective and powerful compilation technique helping to expose thread-level speculative parallelism. With the help of value analysis and profiling, the compiler identifies critical and predictable variable values and generates speculatively parallelized programs with appropriate value prediction and misprediction recovery code. In this paper, we examine this technique in detail, describe a complete and versatile SVP framework and its detailed implementation in a thread-level speculative parallelization compiler, and present our evaluation results with Spec2000Int benchmarks. Our results not only confirm quantitatively that value prediction is essential to thread-level speculative parallelism; they also show that the corresponding value prediction can be achieved efficiently and effectively by software. We also present evaluation results of the overhead associated with software value prediction and the importance of different value predictors in speculative parallel loops in Spec2000Int benchmarks.

Xiao-Feng Li, Chen Yang, Zhao-Hui Du, Tin-fook Ngai

Challenges and Opportunities on Multi-core Microprocessor

As Moore’s Law predicates, the number of transistors on die will continue increase on die in future. It will be a big challenge for computer industry to effectively and e.ciently use these transistors and optimize the microprocessor design under power envelop. Performance is not the only concern. The other design goals such as user friendly interface, easy for maintenance, security and reliability, will become more and more important. There is a clear trend in computer industry that more and more chip will implement multi-core on die.

Jesse Fang

Software-Oriented System-Level Simulation for Design Space Exploration of Reconfigurable Architectures

To efficiently utilize the functionality of dynamic reconfigurable computing systems, it is imperative that a software-oriented approach for modeling complex hardware/software systems be adopted. To achieve this, we need to enhance the simulation environment to understand these dynamic reconfiguration requirements during system-level modeling. We present a flexible simulation model which is able to produce multiple views/contexts for a given problem. We use this model to examine each view for the mapping space, bandwidth, reconfiguration requirements, and configuration patterns of each computational problem.

K. S. Tham, D. L. Maskell

A Switch Wrapper Design for SNA On-Chip-Network

In this paper we present a design of a switch wrapper as a component of SNA (SoC network architecture), which is an efficient on-chip-network compared to a shared bus architecture in a SoC. The SNA uses crossbar routers to provide the increasing demand on communication bandwidth within a single chip. A switch wrapper for SNA is located between a crossbar router and IPs connecting them together. It carries out a mode of routing to assist crossbar routers and executes protocol conversions to provide compatibility in IP reuse. A switch wrapper consists of a direct router, two AHB-SNP converters, two interface sockets and a controller module. We implement it in VHDL. Using ModelSim simulation, we confirm the functionality of the switch wrapper. We synthesize it using a Xilinx Virtex2 device to determine resource requirements: The switch wrapper seems to occupy appropriate spaces, about 900 gates, considering that a single SNA crossbar router costs about 20,000 gates.

Jiho Chang, Jongsu Yi, JunSeong Kim

A Configuration System Architecture Supporting Bit-Stream Compression for FPGAs

This paper presents an investigation and design of an enhanced on-chip configuration memory system that can reduce the time to (re)configure an FPGA. The proposed system accepts configuration data in a compressed form and performs decompression internally. The resulting FPGA can be (re)configured in time proportional to the size of the compressed bit-stream. The compression technique exploits the redundancy present in typical configuration data. An analysis of configurations corresponding to a set of benchmark circuits reveals that data that controls the same types of configurable elements have a common byte that occurs at a significantly higher frequency. This common byte is simply broadcast to all instances of that element. This step is followed by byte updates if required. The new configuration system has modest hardware requirements and was observed to reduce reconfiguration time for the benchmark set by two-thirds on average.

Marco Della Torre, Usama Malik, Oliver Diessel

Biological Sequence Analysis with Hidden Markov Models on an FPGA

Molecular biologists use Hidden Markov Models (HMMs) as a popular tool to statistically describe protein families. This statistical description can then be used for sensitive and selective database scanning, e.g. new protein sequences are compared with a set of HMMs to detect functional similarities. Even though efficient dynamic programming algorithms exist for the problem, the required scanning time is still very high, and because of the rapid database growth finding fast solutions is of high importance to research in this area. In this paper we present how reconfigurable architectures can be used to derive an efficient fine-grained parallelization of the dynamic programming calculation. It is described how this technique leads to significant runtime savings for HMM database scanning on a standard off-the-shelf FPGA.

Jacop Yanto, Timothy F. Oliver, Bertil Schmidt, Douglas L. Maskell

FPGAs for Improved Energy Efficiency in Processor Based Systems

Processor based embedded systems often need to apply simple filter functions to input data. In this paper we examine the relative energy efficiency of microprocessor and Field Programmable Gate Array (FPGA) implementations of these functions. We show that considerable savings can be achieved by adding an FPGA to a microprocessor based systems and propose a strategy to reduce the impact of the excessive leakage energy overhead in low data rate applications.

P. C. Kwan, C. T. Clarke

Morphable Structures for Reconfigurable Instruction Set Processors

This paper presents a novel methodology for instruction set customization of RISPs (Reconfigurable Instruction Set Processors) using morphable structures. A morphable structure consists of a group of hardware operators chained together to implement a restricted set of custom instructions. These structures are implemented on the reconfigurable fabric, and the operators are enabled/disabled on demand. The utilization of a predefined set of morphable structures for instruction set customization dispenses the need for hardware synthesis in design exploration, and avoids run-time reconfiguration while minimizing the reconfigurable area. We will describe the two stages of the methodology for constructing the morphable structures, namely template generation and identification of a maximal unique pattern set from the templates. Our preliminary studies show that 23 predefined morphable structures can sufficiently cater to any application in a set of eight MiBench benchmark applications. In addition, to achieve near-optimal performance, the maximum required number of morphable structures for an application is only 8.

Siew-Kei Lam, Deng Yun, Thambipillai Srikanthan

Implementation of a Hybrid TCP/IP Offload Engine Prototype

Hankook Jang, Sang-Hwa Chung, Soo-Cheol Oh

Matrix-Star Graphs: A New Interconnection Network Based on Matrix Operations

In this paper, we introduce new interconnection networks

matrix-star graphs

MTS

n

1,...,

nk

where a node is represented by

n

1

× ... ×

n

k

matrix and an edge is defined by using matrix operations. A matrix-star graph

MTS

2,n

can be viewed as a generalization of the well-known star graph such as degree, connectivity, scalability, routing, diameter, and broadcasting. Next, we generalize

MTS

2,n

to 2-dimensional and 3-dimensional matrix-star graphs

MTS

k

,

n

,

MTS

k

,

n

,

p

. One of important desirable properties of interconnection networks is network cost which is defined by degree

times

diameter. The star graph, which is one of popular interconnection topologies, has smaller network cost than other networks. Recently introduced network, the macro-star graph has smaller network cost than the star graph. We further improve network cost of the macro-star graph: Comparing a matrix-star graph

$MTS_{k,k,k}(k = \sqrt[3]{n^{2}})$

with

n

2

! nodes to a macro-star graph

MS

(

n

–1,

n

–1) with ((

n

–1)

2

+1)! nodes, network cost of

MTS

k

,

k

,

k

is

O

(

n

2.7

) and that of

MS

(

n

–1,

n

–1) is

O

(

n

3

). It means that a matrix-star graph is better than a star graph and a macro-star graph in terms of network cost.

Hyeong-Ok Lee, Jong-Seok Kim, Kyoung-Wook Park, Jeonghyun Seo, Eunseuk Oh

The Channel Assignment Algorithm on RP(k) Networks

Embedding and channel assignment is a key topic in optical interconnection networks. Based on RP(k) network, a scheme to embed a hypercube into RP(k) is given and the wavelength assignment of realizing the Hypercube communication on RP(k) network is discussed in this paper. By introducing the reverse order of Hypercube, an algorithm to embed the n-Dimension Hypercube into RP(k) is designed, which multiplexes at most max{2,

$\lfloor5 * {N} / 96 \rfloor\}$

wavelengths. An algorithm to embed the n-Dimension Hypercube into the ring network is also proposed, with its congestion equal to

$\lfloor{ N} / 3 + {N} / 12 \rfloor$

. It is a better improvement than the known result, which is equal to

$\lfloor{N} / 3 + { N} / 4 \rfloor$

. The analyses prove that it is easier to realize the Hypercube communication on RP(k) network.

Fang’ai Liu, Xinhua Wang, Liancheng Xu

To present a solution to the problem of address space exhaustion in the Internet, this paper proposes a multi-tier addressing architecture of the Internet (IPEA), illustrates the address space and addressing scheme of IPEA, defines the format of IP packet with extension address, and hierarchical routing algorithm. This paper analyzes and evaluates the performance of IPEA with simulation, and concludes that IPEA has such advantages: first, it provides a solution to the problem of address space exhaustion in the Internet; second, it reduces the routing table length, helps dealing with routing table explosion; third, it demands little change on the IPv4 addressing scheme, easy for transition; finally it makes the autonomous network more manageable by using private address.

Tingrong Lu, Chengcheng Sui, Yushu Ma, Jinsong Zhao, Yongtian Yang

The Star-Pyramid Graph: An Attractive Alternative to the Pyramid

This paper introduces a new class of interconnection networks named

Star-Pyramid

,

SP

(

n

). A star-pyramid of dimension

n

is formed by piling up star graphs of dimensions 1 to

n

in a hierarchy, connecting any node in each

i

-dimensional star, 1<

i

≤ n, to a node in (

i

– 1)-star whose index is reached by removing the

i

symbol from the index of the former node in the

i

-star graph. Having extracted the properties of the new topology, featuring topological properties, a simple routing algorithm and Hamiltonicity then we compare the network properties of the proposed topology and the well-known pyramid topology. We show that the star-pyramid is more fault-tolerant and has less network diameter than its alternative, the pyramid. Finally, we propose a variant of star-pyramid, namely the generic star-pyramid as a topology with better scalability, fault-tolerance, and diameter.

Building a Terabit Router with XD Networks

Direct interconnection network has been widely used in supercomputer systems. Recently, it is considered to be used to build terabit router by the industry. This paper presents a distributed and scalable switching fabric based on a new direct interconnection network. It is a scalable topology and can be expanded in two ways easily, thus minimizing the initial investment of service providers. Its distributed control can offer low hardware complexity. Virtual cut through switching is used to achieve high throughput and low latency. The quality of service is guaranteed by introducing virtual channels based on the concept of DiffServ. The fault tolerant and load-balanced routing algorithm can offer deadlock and livelock freedom. It helps the network continue to work even with faulty parts existed. Finally, the performance of the proposed switching fabric is evaluated under various conditions. The results show it can outperform its counterpart in latency and throughput. It achieves terabit throughput with average latency of 1 us or so.

Huaxi Gu, Zengji Liu, Jungang Yang, Zhiliang Qiu, Guochang Kang

A Real Coded Genetic Algorithm for Data Partitioning and Scheduling in Networks with Arbitrary Processor Release Time

The problem of scheduling divisible loads in distributed computing systems, in presence of processor release time is considered. The objective is to find the optimal sequence of load distribution and the optimal load fractions assigned to each processor in the system such that the processing time of the entire processing load is a minimum. This is a difficult combinatorial optimization problem and hence genetic algorithms approach is presented for its solution.

S. Suresh, V. Mani, S. N. Omkar, H. J. Kim

D3DPR: A Direct3D-Based Large-Scale Display Parallel Rendering System Architecture for Clusters

The current trend in hardware for parallel rendering is to use clusters instead of high-end super computer. We describe a novel parallel rendering system that allows application to render to a large-scale display. Our system, called D3DPR, uses a cluster of PCs with high-performance graphics accelerators to drive an array of projectors. D3DPR consists of two types of logical nodes, Geometry Distributing Node and Geometry Rendering Node. It allows existing Direct3D9 application to run on our parallel system without any modification. The advantage of high-resolution and high-performance can be obtained in our system, especially when the triangle number of the application becomes very large. Moreover, the details of interconnecting network architecture, data distribution, communication and synchronization, etc. are hidden from the users.

Zhen Liu, Jiaoying Shi, Haoyu Peng, Hua Xiong

Determining Optimal Grain Size for Efficient Vector Processing on SIMD Image Processing Architectures

Adaptable silicon area usage within an integrated pixel processing array is a key issue for embedded single instruction, multiple data (SIMD) image processing architectures due to limited chip resources and varying application requirements. In this regard, this paper explores the effects of varying the number of vector (multichannel) pixels mapped to each processing element (VPPE) within a SIMD architecture. The VPPE ratio has a significant impact on the overall area and energy efficiency of the computational array. Moreover, this paper evaluates the impact of our color-aware instruction set (CAX) on each VPPE configuration to identify ideal grain size for a given SIMD system extended with CAX. CAX supports parallel operations on two-packed 16-bit (6:5:5) YCbCr (luminance-chrominance) data in a 32-bit datapath processor, providing greater concurrency and efficiency for vector processing of color image sequences. Experimental results for 3-D vector quantization indicate that high processing performance with the lowest cost is achieved at VPPE = 16 with CAX.

Jongmyon Kim, D. Scott Wills, Linda M. Wills

Partitioning and global scheduling are two approaches for scheduling real-time tasks in multiprocessor environments. Partitioning is the more favored approach, although it is sub-optimal. This is mainly due to the fact that popular uniprocessor real-time scheduling algorithms, such as EDF and RM, can be applied to the partitioning approach with low scheduling overhead. In recent years, much research has been done on global real-time multiprocessor scheduling algorithms based on the concept of “proportionate fairness”. Proportionate fair (Pfair) scheduling [5],[6] is the only known optimal algorithm for scheduling real-time tasks on multiprocessor. However, frequent preemptions caused by the small quantum length for providing optimal scheduling in the Pfair scheduling make it impractical. Deadline Fair Scheduling (DFS) [1] based on Pfair scheduling tried to reduce preemption-related overhead by means of extending quantum length and sharing a quantum among tasks. But extending quantum length causes a mis-estimation problem for eligibility of tasks and a non-work-conserving problem.

In this paper, we propose the Enhanced Deadline Fair Scheduling (E-DFS) algorithm to reduce preemption-related overhead. We show that E-DFS allows us to decrease quantum length by reducing overhead and save wasted CPU time that is caused by preemption-related overhead and miss-estimation of eligibility.

Kyong Jo Jung, Chanik Park

Minimizing Power in Hardware/Software Partitioning

Power efficiency is one of the major considerations in the current hardware/software co-designs. This paper models hardware/ software partitioning as an optimization problem with the objective of minimizing power consumption. An efficient heuristic algorithm running in

O

(

n

log

n

) is proposed by extending the idea of solving the 0-1 knapsack problem. Also, an exact algorithm based on dynamic programming is proposed to produce the optimal solution in

$O(n.\mathcal{A}.\mathcal{E})$

for

n

code fragments under the constraints: hardware area

$\mathcal{A}$

and execution time

$\mathcal{E}$

. Computational results show that the approximate solution produced by the proposed heuristic algorithm is nearly optimal in comparison to the optimal solution produced by the proposed exact algorithm.

Jigang Wu, Thambipillai Srikanthan, Chengbin Yan

Exploring Design Space Using Transaction Level Models

This paper presents THU-SOC, a new methodology and tool dedicated to explore design space by executing full-scale SW application code on the transaction level models of the SoC platform. The SoC platform supports alternative Transaction Level Models (TLMs), bus-functional model and bus-arbitration model, which enables it to cooperate with different levels of hardware descriptions. So, users are only required to provide functional descriptions to construct a whole cycle-accurate system simulation for a broad design space exploration in the architecture design. When the architecture is determined, the high-level descriptions can be replaced by RTL-level descriptions to accomplish the system verification, and the interface between the tool and the descriptions is unmodified. Moreover, THU-SOC integrates some behavior models of necessary components in a SoC system, such as ISS (Instruction-Set Simulator) simulator of CPU, interrupt controller, bus arbiter, memory controller, UART controller, so users can focus themselves on the design of the target component. The tool is written in C++ and supports the PLI (Programming Language Interface), therefore its performance is satisfying and different kinds of hardware description languages, such as System-C, Verilog, VHDL and so on, are supported.

Youhui Zhang, Liu Dong, Gu Yu, Dongsheng Wang

Increasing Embedding Probabilities of RPRPs in RIN Based BIST

In this paper, we propose a new clustered reconfigurable interconnect network (CRIN) BIST to improve the embedding probabilities of random-pattern-resistant-patterns. The proposed method uses a scan-cell reordering technique based on the signal probabilities of given test cubes and specific hardware blocks that increases the embedding probabilities of care bit clustered scan chain test cubes. We have developed a simulated annealing based algorithm that maximizes the embedding probabilities of scan chain test cubes to reorder scan cells, and an iterative algorithm for synthesizing the CRIN hardware. Experimental results demonstrate that the proposed CRIN BIST technique achieves complete fault coverage with lower storage requirement and shorter testing time in comparison with a previous method.

Dong-Sup Song, Sungho Kang

A Practical Test Scheduling Using Network-Based TAM in Network on Chip Architecture

It may be impractical to have TAM for test usage only in NoC because it causes enormous hardware overhead. Therefore, the reuse of on-chip networks for TAM is very attractive and logical. In network-based TAM, an effective test scheduling for built-in cores is also important to minimize the total test time. In this paper, we propose a new efficient test scheduling algorithm for NoC based on the reuse of on-chip networks. Experimental results using some ITC’02 benchmark circuits show the proposed algorithm can reduce the test time by about 5 – 20% compared to previous methods. Consequently, the proposed algorithm can be widely used due to its feasibility and practicality.

Jin-Ho Ahn, Byung In Moon, Sungho Kang

DRIL– A Flexible Architecture for Blowfish Encryption Using Dynamic Reconfiguration, Replication, Inner-Loop Pipelining, Loop Folding Techniques

Blowfish is used in a wide variety of applications, involving large amounts of data, demanding high-speed encryption, flexible key sizes and various encryption modes. Even though the ASIC implementations have better encryption speeds and throughput than FPGA ones, their flexibility is restricted. In blowfish, key generation and encryption stages function disjointedly rendering it suitable for dynamic reconfiguration. Fiestal network of the algorithm is better suited for inner loop pipelining and loop folding. Combining these architectural features with dynamic reconfiguration and replication results in a proficient architecture for blowfish described in this paper as the DRIL architecture. This four-tier architecture, involving both hardware and software designs, focuses on efficient hardware utilization, higher throughput, flexibility and better performance. Further, DRIL is a platform independent architecture. Performance evaluation on the XILINX SPARTAN 2E and VIRTEX 2 devices showed very high utilization of the FPGA device with a throughput of 259.615 x R Mbps on SPARTAN 2E and 515.218 x R Mbps on VIRTEX 2 devices, where R is the replication factor.

T. S. B. Sudarshan, Rahil Abbas Mir, S. Vijayalakshmi

Efficient Architectural Support for Secure Bus-Based Shared Memory Multiprocessor

Tamper-evident and tamper-resistant systems are vital to support applications such as digital right management and certified grid computing. Recently proposed schemes, such as XOM and AEGIS, assume trusting processor state only to build secure systems. Secure execution for shared memory multiprocessor is a challenging problem as multiple devices need to be trusted.

In this work, we propose a framework for providing secure execution on a bus-based multiprocessor system that tackles the key distribution problem, the overhead of encryption/decryption and the memory integrity overheads. We show how to remove the encryption/decryption latencies from the critical path of execution using

pseudo

While verifying the integrity of all memory transactions, we use a special buffer to check for replay on a random set of memory lines. Replay can be detected with certainty of 99.99%, even if the lines replayed are less than 1%.

Khaled Z. Ibrahim

Covert Channel Analysis of the Password-Capability System

The Password-Capability System is a compact operating system with an access control mechanism based on password-capabilities. We show that the system is able to support several security paradigms which solve real-world problems not adequately addressed by conventional operating systems such as Windows and Unix. We show also that these paradigms are only effective if the system is free from covert channels. To this end, we carry out a covert channel analysis of the system and outline the elimination of all channels found.

Dan Mossop, Ronald Pose

Comparing Low-Level Behavior of SPEC CPU and Java Workloads

Andy Georges, Lieven Eeckhout, Koen De Bosschere

Application of Real-Time Object-Oriented Modeling Technique for Real-Time Computer Control

This paper considers the design technique of the real-time control algorithm to implement the electronic interlocking system, which is the most important station control system in railway signal field. The proposed technique consists of the structure design and the detail design, which are based on the Real-time Object-Oriented Modeling (ROOM). The structure is designed by a modeling using the heuristic searching technique that catch and make out the specific requested condition. The detail design can be implemented if it may get the satisfied values through the repetitive modeling after comparing and examining the data obtained from the structure design for the more reliable and accurate system to be implemented. The technique proposed in this paper is implemented with C++ language that is easy to be transferred and compatible with the existing interfaces and the operating system is also designed and simulated by the VRTX that is a real-time operating system. This proposed technique is applied to the typical station model to prove the validity through verifying the performance of the modeling station.

Jong-Sun Kim, Ji-Yoon Yoo

VLSI Performance Evaluation and Analysis of Systolic and Semisystolic Finite Field Multipliers

Finite field multiplication in

GF

(2

m

) is an ineluctable operation in elliptic curve cryptography. The objective of this paper is to survey fast and efficient hardware implementations of systolic and semisystolic finite field multipliers in

GF

(2

m

) with two algorithmic schemes – LSB-first and MSB-first. These algorithms have been mapped to seven variants of recently proposed array-type finite-field multiplier implementations with different input-output configurations. The relative VLSI performance merits of these ASIC prototypes with respect to their field orders are evaluated and compared under uniform constraints and in properly defined simulation runs on a Synopsys environment using the TSMC 0.18

μ

m CMOS standard cell library. The results of the simulation provide an insight into the behavior of various configurations of array-type finite-field multiplier so that system architect can use them to determine the most appropriate finite field multiplier topology for required design features.

Ravi Kumar Satzoda, Chip-Hong Chang

Analysis of Real-Time Communication System with Queuing Priority

In this paper, we discuss the real-time issue related to QoS requirements in a communication system, which consists of several sensors and one central control unit. In order to guarantee real-time requirement and fairness among sensors, a polling mechanism with queuing priority is applied in it. By means of the imbedded Markov chain theory and generating function method, mean queue length and mean circle time of asymmetric system are studied quantitatively, and corresponding exact expressions are obtained. These results are critical for system device and system performance analysis. Finally, computer simulations demonstrate effectiveness of our analysis.

Yunbo Wu, Zhishu Li, Yunhai Wu, Zhihua Chen, Tun Lu, Li Wang, Jianjun Hu

FPGA Implementation and Analyses of Cluster Maintenance Algorithms in Mobile Ad-Hoc Networks

A study of hardware complexity and power consumption is vital for algorithms implementing cluster maintenance in mobile ad-hoc networks. This is because of the intrinsic physical limitations of mobile nodes, such as limited energy available, limited computational ability of nodes that form the network. Clustering is divided into two phases, initial cluster formation and cluster maintenance. Cluster maintenance handles situations of change such as a node moving away from a cluster, a new node joining a cluster, clusters splitting due to excessive number of nodes in the cluster, and merging of clusters. In this paper, we have compared the hardware and power efficiency of three cluster maintenance algorithms, Gayathri et al., Lin H.C and Chu Y.H. and Lin C.R and Gerla M. The three algorithms were implemented in synthesizable VHDL to enable porting into FPGA. The hardware complexity and power consumption forms the metrics of comparison of the algorithms studied. For all the algorithms, the CLB slices used was between 123 and 3093 with the operating frequency between 2 MHz and 70 MHz. The total power consumption is between 803 mW and 1002 mW and the total current consumption is between 408 mA and 555 mA.

Sai Ganesh Gopalan, Venkataraman Gayathri, Sabu Emmanuel

A Study on the Performance Evaluation of Forward Link in CDMA Mobile Communication Systems

In the CDMA cellular mobile communication systems, the connection between base and mobile station has a forward link from base station to mobile station. Four channels of forward link, pilot channel, sync channel, paging channel, traffic channel, have an absolute influence on forward coverage and speech quality. And overhead channels(pilot, sync, paging) are assigned output at the adequate rate to optimize the structure of coverage and to minimize the influence of neighbor base station. And the quality of forward link depends on the size of received field strength, as well as Ec/Io which includes both self signal and all noise factors affecting self signal is the main standard in order to decide the coverage and quality of forward link. In this paper, in order to improve a performance of forward link systems in the mobile communications using CDMA, I examine influencing factors in forward links, and measure at LAB., and analyze their results. As the results, I confirm that identify their effect.

Sun-Kuk Noh

Cache Leakage Management for Multi-programming Workloads

Power consumption is becoming a critical design issue of embedded systems due to the popularity of portable device such as cellular phones and personal digital assistants. Leakage is projected to most amount of cache power budget in 70nm technology. In this paper, we utilize the task-level information to manage cache leakage power. We partition the caches among tasks according to their working set size. We then apply different leakage management policies to the cache regions allocated to active and suspended tasks, respectively. Our proposed policies effectively reduce L1 cache leakage energy by 84% on the average for the multi-programming workloads with only negligible degradations in performances.

Chun-Yang Chen, Chia-Lin Yang, Shih-Hao Hung

A Memory Bandwidth Effective Cache Store Miss Policy

Memory bandwidth becomes more and more important in the forthcoming 10 billion transistors chip times. This paper discusses and implements a memory bandwidth effective cache store miss policy. Although the write-allocate policy is adopted, we find it is possible not to load the full cache block from lower memory hierarchy when cache store miss occurs, if the cache block is fully modified before any load instruction accesses the un-modified data of the same cache block. This cache store miss policy will partly reduce the pressure on memory bandwidth, and improve the cache hit rate. We provides a hardware mechanism, Store Merge Buffer, to implement the policy in Goodson-2 processor. Our experiments demonstrate the encouraging results: Memory bandwidth improved by almost 50% (tested by stream benchmark), and IPC on SPEC CPU2K improved by 9.4% on average.

Hou Rui, Fuxin Zhang, Weiwu Hu

Application-Specific Hardware-Driven Prefetching to Improve Data Cache Performance

Data cache hit ratio has a major impact on execution performance of programs by effectively reducing average data access time. Prefetching mechanisms improve this ratio by fetching data items that shall soon be required by the running program. Software-driven prefetching enables application-specific policies and potentially provides better results in return for some instruction overhead, whereas hardware-driven prefetching gives little overhead, however general-purpose processors cannot adapt to the specific needs of the running application. In the application-specific processors that we develop customized to an object-oriented application, we implement application-specific hardware prefetching to benefit from both worlds. This prefetching policy prefetches all data items that shall be unconditionally accessed by a class method when the class method is called. We mathematically analyze this mechanism and present its simulation results using some object-oriented benchmarks. Simulation results in absence and presence of the proposed prefetching mechanism confirm the theoretical results and show that on average, the miss ratio is reduced by 73%.

Mehdi Modarressi, Maziar Goudarzi, Shaahin Hessabi

Targeted Data Prefetching

Given the increasing gap between processors and memory, prefetching data into cache becomes an important strategy for preventing the processor from being starved of data. The success of any data prefetching scheme depends on three factors:

timeliness

,

accuracy

and

. In most hardware prefetching mechanism, the focus has been on accuracy – ensuring that the predicted address do turn out to be demanded in a later part of the code. In this paper, we introduce a simple hardware prefetching mechanism that targets

, i.e. loads that account for a large proportion of the load misses in an application. Our results show that our prefetch strategy can reduce up to 45% of stall cycles of benchmarks running on a simulated out-of-order superscalar processor with an overhead of 0.0005 prefetch per CPU cycle.

Weng-Fai Wong

Area-Time Efficient Systolic Architecture for the DCT

A reduced-complexity algorithm and its systolic architecture are presented for computation of the discrete cosine transform. The proposed scheme not only leads to a fully-pipelined regular and modular hardware, but also offers significantly higher throughput, lower latency and lower area-time complexity over the existing structures. The proposed design is devoid of complicated input/output mapping and complex control structure. Moreover, it does not have any restriction on the transform-length and it is easily scalable for higher transform-length as well.

Pramod Kumar Meher

Efficient VLSI Architectures for Convolution and Lifting Based 2-D Discrete Wavelet Transform

This paper presents efficient VLSI architectures for real time processing of separable convolution and lifting based 2-D discrete wavelet transform (DWT). Convolution based architecture uses partitioning algorithm based on the state space representation method and lifting based architecture applies pipelining to each lifting step. Both architectures use recursive pyramid algorithm(RPA) scheduling that intersperses both the row and column operations of the second and following levels among column operations of the first level without using additional filter for row operations of the second and following levels. As a result, proposed architectures have smaller hardware complexity compared to that of other conventional separable architectures with comparable throughput.

Gab Cheon Jung, Seong Mo Park, Jung Hyoun Kim

A Novel Reversible TSG Gate and Its Application for Designing Reversible Carry Look-Ahead and Other Adder Architectures

Reversible logic is emerging as a promising area of research having its applications in quantum computing, nanotechnology, and optical computing. The classical set of gates such as AND, OR, and EXOR are not reversible. In this paper, a new 4 *4 reversible gate called “TSG” gate is proposed and is used to design efficient adder units. The proposed gate is used to design ripple carry adder, BCD adder and the carry look-ahead adder. The most significant aspect of the proposed gate is that it can work singly as a reversible full adder i.e reversible full adder can now be implemented with a single gate only. It is demonstrated that the adder architectures using the proposed gate are much better and optimized, compared to their counterparts existing in literature, both in terms of number of reversible gates and the garbage outputs.

Himanshu Thapliyal, M. B. Srinivas

Implementation and Analysis of TCP/IP Offload Engine and RDMA Transfer Mechanisms on an Embedded System

The speed of present-day network technology exceeds a gigabit and is developing rapidly. When using TCP/IP in these high-speed networks, a high load is incurred in processing TCP/IP protocol in a host CPU. To solve this problem, research has been carried out into TCP/IP Offload Engine (TOE) and Remote Direct Memory Access (RDMA). The TOE processes TCP/IP on a network adapter instead of using a host CPU; this reduces the processing burden on the host CPU, and RDMA eliminates any copy overhead of incoming data streams by allowing incoming data packets to be placed directly into the correct destination memory location. We have implemented the TOE and RDMA transfer mechanisms on an embedded system. The experimental results show that TOE and RDMA on an embedded system have considerable latencies despite of their advantages in reducing CPU utilization and data copy on the receiver side. An analysis of the experimental results and a method to overcome the high latencies of TOE and RDMA transfer mechanisms are presented.

In-Su Yoon, Sang-Hwa Chung

Backmatter

Weitere Informationen