
2014 | Book

Advanced Computer Architecture

10th Annual Conference, ACA 2014, Shenyang, China, August 23-24, 2014. Proceedings

Edited by: Junjie Wu, Haibo Chen, Xingwei Wang

Publisher: Springer Berlin Heidelberg

Book Series: Communications in Computer and Information Science


About this book

This book constitutes the refereed proceedings of the 10th Annual Conference on Advanced Computer Architecture, ACA 2014, held in Shenyang, China, in August 2014. The 19 revised full papers presented were carefully reviewed and selected from 115 submissions. The papers are organized in topical sections on processors and circuits; high performance computing; GPUs and accelerators; cloud and data centers; energy and reliability; intelligence computing and mobile computing.

Table of Contents

Frontmatter

Processors and Circuits

Fusion Coherence: Scalable Cache Coherence for Heterogeneous Kilo-Core System
Abstract
Future heterogeneous systems will integrate CPUs and GPUs on a single chip to achieve both high computing performance and high throughput. Such systems will discard the current discrete design and instead build a unified shared memory system that avoids explicit data movement between CPUs and GPUs connected by a high-throughput NoC.
We propose Fusion Coherence, a scalable cache coherence solution for a heterogeneous kilo-core system architecture that integrates CPUs and GPUs on a single chip, to mitigate the coherence-bandwidth side effects of GPU memory requests as well as the overhead of copying data between CPU and GPU memories. Fusion Coherence coalesces the L3 data caches of CPUs and GPUs on top of a unified physical memory, and further integrates a region directory and a cuckoo directory into a two-level cache coherence directory without modifying the cache coherence protocol. Experimental results with a subset of the Rodinia benchmarks show that it effectively decreases data transfer overhead and achieves an average execution speedup of 2.4x; the highest speedup is approximately 4x for data-intensive applications.
Songwen Pei, Myoung-Seo Kim, Jean-Luc Gaudiot, Naixue Xiong
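For illustration, the cuckoo-directory idea mentioned in the abstract can be pictured with a small two-table cuckoo hash map. The sketch below is not the paper's directory organization; it is a minimal, self-contained C++ illustration of cuckoo lookup and insertion with bounded eviction, using hypothetical block addresses as keys and sharer bitmasks as values.

#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>
#include <utility>
#include <vector>

// Minimal two-table cuckoo hash: each key (block address) maps to a sharer
// bitmask. Illustrative only; not the directory design from the paper.
class CuckooDirectory {
    struct Slot { uint64_t key = 0; uint64_t sharers = 0; bool used = false; };
    std::vector<Slot> t1_, t2_;
    size_t size_;

    size_t h1(uint64_t k) const { return std::hash<uint64_t>{}(k) % size_; }
    size_t h2(uint64_t k) const { return std::hash<uint64_t>{}(k ^ 0x9e3779b97f4a7c15ULL) % size_; }

public:
    explicit CuckooDirectory(size_t size) : t1_(size), t2_(size), size_(size) {}

    std::optional<uint64_t> lookup(uint64_t key) const {
        const Slot& a = t1_[h1(key)];
        if (a.used && a.key == key) return a.sharers;
        const Slot& b = t2_[h2(key)];
        if (b.used && b.key == key) return b.sharers;
        return std::nullopt;
    }

    // Insert with a bounded eviction chain; returns false if a rehash would be needed.
    bool insert(uint64_t key, uint64_t sharers) {
        Slot cur{key, sharers, true};
        for (int attempt = 0; attempt < 32; ++attempt) {
            Slot& a = t1_[h1(cur.key)];
            if (!a.used || a.key == cur.key) { a = cur; return true; }
            std::swap(a, cur);                     // evict the occupant of table 1
            Slot& b = t2_[h2(cur.key)];
            if (!b.used || b.key == cur.key) { b = cur; return true; }
            std::swap(b, cur);                     // evict the occupant of table 2
        }
        return false;                              // would trigger a rehash in practice
    }
};

int main() {
    CuckooDirectory dir(1024);
    dir.insert(0x1000, 0b0011);                    // block 0x1000 shared by cores 0 and 1
    if (auto s = dir.lookup(0x1000)) std::cout << "sharers mask: " << *s << "\n";
}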
ACRP: Application Customized Reconfigurable Pipeline
Abstract
Reconfigurable architectures have become popular in the high performance computing field in recent years because of their reconfigurability and abundant computing resources. These architectures combine the high performance of ASICs with the flexibility of microprocessors. In this paper, a novel architecture named Application Customized Reconfigurable Pipeline (ACRP) is proposed for domain-specific applications. By analyzing and abstracting the computing characteristics of the domain, an application Customized Functional Unit (CFU) is designed to execute frequent instruction sequences efficiently. The CFU is shared by a hardware pipeline composed of several Simple Process Elements (SPEs). The experimental results show that ACRP can efficiently exploit CFU-, pipeline-, and data-level parallelism under the area constraint.
Guanwu Wang, Lei Liu, Sikun Li
SRS: A Split-Range Shared Memory Consistency Model for Thousand-Core Processors
Abstract
A novel memory consistency model for thousand-core processors is presented. The model simplifies cache coherence for the full chip and reduces cache design complexity. In addition, the model can describe the direct exchange of data on chip, thereby alleviating off-chip memory bandwidth requirements. The paper gives a formal definition of the model and proves that it is sequentially consistent. All aspects of the definition are used in the proof, which means that there is no redundancy in the definition. Therefore, based on the split-range shared memory consistency model, a shared memory system can achieve high performance at low hardware cost. The model is also easy for programmers to understand and use.
Hui Lyu, Fang Zheng, Xianghui Xie
A Partition Method of SoC Design Serving the Multi-FPGA Verification Platform
Abstract
FPGA (Field-Programmable Gate Array) technology can provide excellent accuracy and efficiency for chip verification, which has become the key bottleneck of SoC design. Due to the resource constraints of a single FPGA chip, multi-FPGA architectures have been applied to the verification of large-scale SoC designs. In recent years, a variety of multi-FPGA verification platforms have been developed, but most of them partition the SoC design indirectly at the netlist level after synthesis. A partition method is proposed in this paper that works directly on the RTL (Register Transfer Level) code. It presents a universal partition methodology with a realistic and detailed implementation, applying a linear partition algorithm. In the experiments, leon3, an SoC design based on the SPARC processor, runs correctly at 8 MHz, over 100,000 times faster than software simulation and 1-2 times the speed of the well-known BEE4 FPGA-based platform.
Shenglai Yang, Kuanjiu Zhou, Jie Wang, Bin Liu, Ting Li
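As background, one plausible reading of the linear partition algorithm named in the abstract can be sketched as follows: RTL modules are kept in their original linear (code) order and greedily grouped into consecutive partitions so that each partition's estimated resource usage stays within a per-FPGA capacity. The module list, LUT estimates, and capacity below are hypothetical, and the sketch is not the paper's method.

#include <iostream>
#include <string>
#include <vector>

struct Module { std::string name; int luts; };   // hypothetical resource estimate per RTL module

// Greedy linear partitioning: keep modules in order and start a new FPGA
// partition whenever adding the next module would exceed the per-chip capacity.
std::vector<std::vector<Module>> linear_partition(const std::vector<Module>& modules, int capacity) {
    std::vector<std::vector<Module>> partitions(1);
    int used = 0;
    for (const Module& m : modules) {
        if (used + m.luts > capacity && !partitions.back().empty()) {
            partitions.emplace_back();
            used = 0;
        }
        partitions.back().push_back(m);
        used += m.luts;
    }
    return partitions;
}

int main() {
    std::vector<Module> soc = {{"cpu_core", 90000}, {"l2_cache", 40000},
                               {"dma", 15000}, {"eth_mac", 25000}, {"uart", 2000}};
    auto parts = linear_partition(soc, 120000);    // hypothetical capacity of one FPGA
    for (size_t i = 0; i < parts.size(); ++i) {
        std::cout << "FPGA " << i << ":";
        for (const auto& m : parts[i]) std::cout << " " << m.name;
        std::cout << "\n";
    }
}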

High Performance Computing

A Novel Node-to-Set Node-Disjoint Fault-Tolerant Routing Strategy in Hypercube
Abstract
This paper proposes a node-to-set node-disjoint routing algorithm based on a path storage model for hypercube networks with faulty nodes. Two properties of the storage model are established, on the condition that the n-dimensional hypercube has no more than n-1 faulty nodes: first, the path length is no more than the Hamming distance plus 2; second, a sub-cube model can be partitioned from the global model. Based on this model, a novel routing algorithm is proposed to generate node-to-set node-disjoint fault-tolerant paths. It adopts a divide-and-conquer strategy to take full advantage of the regularity of the hypercube. The routing algorithm reduces the path length to at most n + f + 2 and decreases the time complexity to O(mn) in a hypercube with faulty nodes (where n is the number of dimensions, m is the number of destination nodes, and f is the number of faulty nodes). Experimental results show that the average path length is shortened by 9-10% compared with existing algorithms in a ten-dimensional hypercube with no more than nine faulty nodes.
Endong Wang, Hongwei Wang, Jicheng Chen, Weifeng Gong, Fan Ni
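As background for the routing strategy above, the sketch below shows how a single fault-avoiding path can be built in an n-dimensional hypercube by flipping the bits in which source and destination differ, skipping a dimension whose neighbor is faulty. It is a simplified single-path illustration, not the node-to-set node-disjoint algorithm of the paper; the node IDs and faulty set are hypothetical.

#include <bitset>
#include <iostream>
#include <unordered_set>
#include <vector>

// Hamming distance between two hypercube node IDs.
int hamming(unsigned a, unsigned b) { return static_cast<int>(std::bitset<32>(a ^ b).count()); }

// Greedy fault-avoiding route from src to dst in an n-dimensional hypercube:
// at each step, flip one differing dimension whose neighbor is not faulty.
// Returns the path (including src and dst), or an empty vector if blocked.
std::vector<unsigned> route(unsigned src, unsigned dst, int n,
                            const std::unordered_set<unsigned>& faulty) {
    std::vector<unsigned> path{src};
    unsigned cur = src;
    while (cur != dst) {
        bool advanced = false;
        for (int d = 0; d < n; ++d) {
            unsigned bit = 1u << d;
            if ((cur ^ dst) & bit) {                 // this dimension still differs
                unsigned next = cur ^ bit;
                if (!faulty.count(next)) {           // avoid the faulty neighbor
                    cur = next;
                    path.push_back(cur);
                    advanced = true;
                    break;
                }
            }
        }
        if (!advanced) return {};                    // all useful neighbors are faulty
    }
    return path;
}

int main() {
    std::unordered_set<unsigned> faulty = {0b0100};  // hypothetical faulty node
    auto p = route(0b0000, 0b0111, 4, faulty);
    std::cout << "hops: " << (p.empty() ? 0 : static_cast<int>(p.size()) - 1)
              << ", hamming distance: " << hamming(0b0000, 0b0111) << "\n";
}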
Filtering and Matching of Data Blocks to Avoid Disk Bottleneck in De-duplication File System
Abstract
Since the growing scale of data generates huge redundancy, de-duplication, which eliminates redundancy and improves the space utilization of storage devices, has been widely adopted. A de-duplication filesystem can provide a unified interface to upper-layer applications and implement inline de-duplication. In this paper, we design and implement FmdFS, a kernel-space de-duplication filesystem. Due to memory limitations, the metadata of FmdFS is stored on disk in groups. Meanwhile, a scale-adaptive binary tree filter is constructed in memory, which not only avoids accessing on-disk metadata when searching for the fingerprints of most new data, but also records the groups where duplicate data is stored. In addition, FmdFS uses an LRU hash cache, which holds recently accessed metadata groups, to exploit locality when matching duplicate data and thereby avoid accessing metadata on disk. Compared with traditional de-duplication filesystems, FmdFS achieves higher write performance.
Jiajia Zhang, Xingjun Zhang, Runting Zhao, Xiaoshe Dong
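The filter-then-cache control flow described in the abstract can be approximated with standard containers. The sketch below is an assumption-laden simplification: it uses std::hash in place of a cryptographic fingerprint and a std::set in place of the scale-adaptive binary tree filter, and only illustrates the order of checks (filter first, then the LRU cache of recently used fingerprints); it is not FmdFS itself.

#include <cstdint>
#include <functional>
#include <iostream>
#include <list>
#include <set>
#include <string>
#include <unordered_map>

// Stand-in fingerprint: a real deduplicating filesystem would use SHA-1/SHA-256.
uint64_t fingerprint(const std::string& block) { return std::hash<std::string>{}(block); }

class DedupIndex {
    std::set<uint64_t> filter_;                        // stand-in for the binary tree filter
    std::list<uint64_t> lru_;                          // most recently used fingerprints at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> cache_;
    size_t capacity_;

    void touch(uint64_t fp) {
        auto it = cache_.find(fp);
        if (it != cache_.end()) lru_.erase(it->second);
        else if (cache_.size() >= capacity_) {          // evict the least recently used entry
            cache_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(fp);
        cache_[fp] = lru_.begin();
    }

public:
    explicit DedupIndex(size_t capacity) : capacity_(capacity) {}

    // Returns true if the block is a duplicate (the write can be elided).
    bool write_block(const std::string& block) {
        uint64_t fp = fingerprint(block);
        if (!filter_.count(fp)) {                       // definitely new: no disk lookup needed
            filter_.insert(fp);
            touch(fp);
            return false;
        }
        touch(fp);                                      // duplicate: refresh the LRU cache
        return true;
    }
};

int main() {
    DedupIndex idx(1024);
    std::cout << idx.write_block("hello world") << "\n";  // 0: new block
    std::cout << idx.write_block("hello world") << "\n";  // 1: duplicate
}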
Performance Optimization of a CFD Application on Intel Multicore and Manycore Architectures
Abstract
This paper reports our experience optimizing the performance of a high-order, high-accuracy Computational Fluid Dynamics (CFD) application (HOSTA) on a state-of-the-art multicore processor and the emerging Intel Many Integrated Core (MIC) coprocessor. We focus on effective loop vectorization and memory access optimization. A series of techniques, including data structure transformations, procedure inlining, compiler SIMDization, OpenMP loop collapsing, and the use of huge pages, is explored. Detailed execution times and event counts from the Performance Monitoring Units are measured. The results show that our optimizations improve the performance of HOSTA by 1.61× on a compute node with two Intel Sandy Bridge processors and by 1.97× on an Intel Knights Corner coprocessor, the publicly available MIC product. The microarchitecture-level effects of these optimizations are also discussed.
Yonggang Che, Lilun Zhang, Yongxian Wang, Chuanfu Xu, Wei Liu, Xinghua Cheng
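Two of the optimizations named above, OpenMP loop collapsing and compiler SIMDization, can be shown on a generic stencil-like triple loop. The sketch below (compile with -fopenmp) is not code from HOSTA; the array layout, sizes, and update formula are placeholders chosen only to make the pragmas concrete.

#include <vector>

// Generic triple loop over a 3-D field. 'collapse(2)' merges the two outer loops
// into one parallel iteration space so more threads get work when nz alone is
// small; 'omp simd' asks the compiler to vectorize the innermost loop.
void smooth(std::vector<double>& a, const std::vector<double>& b,
            int nz, int ny, int nx) {
#pragma omp parallel for collapse(2)
    for (int k = 0; k < nz; ++k) {
        for (int j = 0; j < ny; ++j) {
#pragma omp simd
            for (int i = 1; i < nx - 1; ++i) {
                int idx = (k * ny + j) * nx + i;
                a[idx] = 0.5 * b[idx] + 0.25 * (b[idx - 1] + b[idx + 1]);
            }
        }
    }
}

int main() {
    int nz = 32, ny = 64, nx = 128;
    std::vector<double> a(nz * ny * nx, 0.0), b(nz * ny * nx, 1.0);
    smooth(a, b, nz, ny, nx);
    return 0;
}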

GPUs and Accelerators

A Throughput-Aware Analytical Performance Model for GPU Applications
Abstract
Graphics processing units (GPUs) have become increasingly popular for general-purpose parallel processing. Their massively parallel architecture allows GPUs to execute tens of thousands of threads in parallel and solve heavily data-parallel problems efficiently. However, despite this tremendous computing power, optimizing GPU kernels for high performance remains a challenge, owing to the sea change from CPU to GPU and the lack of tools for programming and performance analysis.
In this paper, we propose a throughput-aware analytical model to estimate the performance of GPU kernels and optimizations. We model the servicing of global memory accesses as a pipeline and redefine compute throughput and memory throughput as the rates at which memory requests arrive at and leave the pipeline. Based on the identified throughput-limiting factor, GPU programs are classified into compute-bound and memory-bound categories, and performance is predicted for each category. Our model can also indicate promising directions for optimization and predict the potential performance benefits. We demonstrate our model on a manually written benchmark as well as a matrix-multiply kernel, and show that the geometric mean of the absolute error of our model is less than 6.5%.
Zhidan Hu, Guangming Liu, Wenrui Dong
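The compute-bound versus memory-bound distinction used above can be illustrated with a simple roofline-style check that compares a kernel's arithmetic intensity against the machine balance of the GPU. This is a generic approximation, not the paper's throughput pipeline model; the device and kernel numbers below are placeholders.

#include <iostream>
#include <string>

struct Device {
    double peak_gflops;        // peak compute throughput (GFLOP/s)
    double peak_bandwidth_gbs; // peak DRAM bandwidth (GB/s)
};

struct Kernel {
    double flops;              // floating-point operations executed
    double bytes;              // bytes moved to/from global memory
};

// Roofline-style classification: if the kernel's arithmetic intensity (FLOP/byte)
// is below the machine balance, memory throughput bounds performance.
std::string classify(const Kernel& k, const Device& d) {
    double intensity = k.flops / k.bytes;
    double machine_balance = d.peak_gflops / d.peak_bandwidth_gbs;
    return intensity < machine_balance ? "memory-bound" : "compute-bound";
}

double attainable_gflops(const Kernel& k, const Device& d) {
    double intensity = k.flops / k.bytes;
    double mem_limited = intensity * d.peak_bandwidth_gbs;
    return mem_limited < d.peak_gflops ? mem_limited : d.peak_gflops;
}

int main() {
    Device gpu{1030.0, 144.0};                       // placeholder device figures
    Kernel saxpy{2.0e9, 12.0e9};                     // 2 FLOPs and 12 bytes per element
    std::cout << classify(saxpy, gpu) << ", upper bound "
              << attainable_gflops(saxpy, gpu) << " GFLOP/s\n";
}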
Parallelized Race Detection Based on GPU Architecture
Abstract
In order to harness abundant hardware resources, parallel programming has become a necessity in the multicore era. However, parallel programs are prone to concurrency bugs, especially data races. Even worse, current software tools suffer from both large runtime overheads and poor scalability, while most hardware support for race detection is not available to parallel programmers; building a practical and fast race detection tool therefore remains a challenge. Nowadays, GPUs with massive parallel computation resources have become one of the most popular hardware platforms, and their prevalence opens an opportunity to accelerate data race detection.
In this paper, we first analyze data race detection algorithms such as happens-before in depth and observe that these algorithms exhibit very good computation and data parallelism. Based on this observation, we propose Grace, a software approach that leverages the massively parallel computation units of GPU architectures to accelerate data race detection. Grace deploys detection, the most computation-intensive workload, on the GPU to fully utilize its computation resources. Moreover, Grace leverages coarse-grained pipeline parallelism and data parallelism by exploiting the computation resources of multi-core CPUs to further improve performance. Experimental results show that Grace is fast and scalable: it achieves over an 80x speedup compared to the sequential version, even under a 128-thread configuration.
Zhuofang Dai, Zheng Zhang, Haojun Wang, Yi Li, Weihua Zhang
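The happens-before analysis that Grace parallelizes can be sketched with vector clocks: two accesses to the same location race if at least one is a write and neither access's clock is ordered before the other's. The code below is a minimal sequential illustration of that check, not Grace's GPU implementation; the two-thread trace is hypothetical.

#include <iostream>
#include <vector>

using VectorClock = std::vector<unsigned>;

// Happens-before check: a happened before (or equals) b if a <= b componentwise.
bool happens_before(const VectorClock& a, const VectorClock& b) {
    for (size_t i = 0; i < a.size(); ++i)
        if (a[i] > b[i]) return false;
    return true;
}

struct Access { int thread; bool is_write; VectorClock clock; };

// Two accesses to the same address race if at least one writes and the
// accesses are unordered under happens-before.
bool is_race(const Access& x, const Access& y) {
    if (!x.is_write && !y.is_write) return false;
    return !happens_before(x.clock, y.clock) && !happens_before(y.clock, x.clock);
}

int main() {
    // Hypothetical trace: T0 writes, T1 reads, with no synchronization between them.
    Access w0{0, true,  {1, 0}};
    Access r1{1, false, {0, 1}};
    std::cout << (is_race(w0, r1) ? "race" : "no race") << "\n";   // prints "race"
}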
A Novel Design of Flexible Crypto Coprocessor and Its Application
Abstract
Accelerating security protocols has been a great challenge for general-purpose processors due to the complexity of crypto algorithms. Most crypto algorithms are employed at the function level across different security protocols. We propose a novel flexible crypto (FC) coprocessor architecture that relies on Reconfigurable Cryptographic Blocks (RCBs) to achieve a balance between high performance and flexibility, and implement the architecture for security applications on an FPGA. Pipelining is adopted to process data in parallel and to reduce communication costs. We consider several crypto algorithms as examples to illustrate the design of the RCB in the FC coprocessor. Finally, we create a prototype of the FC coprocessor on a Xilinx XC5VLX330 FPGA chip. The experimental results show that the coprocessor, running at 216 MHz, outperforms software-based file encryption running on an Intel Core i3 530 CPU at 2.93 GHz by a factor of 29× for typical encryption applications.
Shice Ni, Yong Dou, Kai Chen, Lin Deng

Cloud and Data Centers

A PGSA Based Data Replica Selection Scheme for Accessing Cloud Storage System
Abstract
The data replica management scheme is a critical component of a cloud storage system. To enhance scalability and reliability while improving system response time, multiple data replicas are adopted. When a cloud user issues an access request, a suitable replica should be selected to respond to it, in order to shorten user access time and promote system load balance. In this paper, considering network status, storage node load, and the historical information of replica selection comprehensively, a PGSA (Plant Growth Simulation Algorithm) based data replica selection scheme for cloud storage is proposed to improve average access time and replica utilization. The proposed scheme has been implemented on CloudSim and its performance evaluated. Simulation results show that it is both feasible and effective, with better performance than existing schemes.
Bang Zhang, Xingwei Wang, Min Huang
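Independently of the plant-growth-simulation search itself, the selection objective described above combines network status, storage node load, and historical selection information. A minimal sketch of one such weighted scoring function is shown below; the weights, metrics, and the simple exhaustive scan (used here instead of PGSA) are assumptions for illustration only.

#include <iostream>
#include <vector>

struct Replica {
    double network_delay_ms;   // current network status toward the requesting user
    double node_load;          // storage node load in [0, 1]
    double past_success_rate;  // historical selection feedback in [0, 1]
};

// Lower score is better: penalize delay and load, reward good history.
// The weights are illustrative and would normally be tuned (or searched by PGSA).
double score(const Replica& r) {
    const double w_delay = 0.5, w_load = 0.3, w_hist = 0.2;
    return w_delay * r.network_delay_ms + w_load * (100.0 * r.node_load)
         - w_hist * (100.0 * r.past_success_rate);
}

size_t select_replica(const std::vector<Replica>& replicas) {
    size_t best = 0;
    for (size_t i = 1; i < replicas.size(); ++i)
        if (score(replicas[i]) < score(replicas[best])) best = i;
    return best;
}

int main() {
    std::vector<Replica> candidates = {{20.0, 0.8, 0.9}, {35.0, 0.2, 0.95}, {15.0, 0.9, 0.5}};
    std::cout << "selected replica: " << select_replica(candidates) << "\n";
}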
Location-Aware Multi-user Resource Allocation in Distributed Clouds
Abstract
Resource allocation for multiple users across multiple data centers is an important problem in cloud computing environments. Many geographically distributed users may request virtualized resources simultaneously, and the distances from users to their allocated resources strongly affect quality of service (QoS) in a multi-data-center environment. Most existing methods do not take all these factors into account when allocating resources; they usually result in poor runtime performance of users' virtual computing environments and remarkable differences in users' QoS. In this paper, we propose RAMD, a resource allocation algorithm based on multi-stage decision making across multiple data centers. RAMD allocates VMs to users while taking into account the correlation and interaction between multiple users, so as to minimize the sum of all users' service distances (determined by user location and the network distance to the virtual machines). Experimental results show that the algorithm can effectively handle cloud resource allocation for multiple users across multiple data centers, improving the runtime performance of users' virtualized resources and reducing differences in QoS.
Jiaxin Li, Dongsheng Li, Jing Zheng, Yong Quan
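The optimization target above, minimizing the sum of users' service distances subject to data center capacity, can be made concrete with a simple greedy baseline: each user in turn is assigned to the nearest data center that still has free VM slots. This is only a baseline sketch, not RAMD's multi-stage decision procedure; the distance matrix and capacities are hypothetical.

#include <iostream>
#include <limits>
#include <vector>

// distance[u][d]: service distance from user u to data center d (hypothetical values).
// capacity[d]: remaining VM slots in data center d.
// Greedy baseline: serve users in order, each to the nearest data center with free slots.
std::vector<int> allocate(const std::vector<std::vector<double>>& distance,
                          std::vector<int> capacity) {
    std::vector<int> assignment(distance.size(), -1);
    for (size_t u = 0; u < distance.size(); ++u) {
        double best = std::numeric_limits<double>::max();
        for (size_t d = 0; d < capacity.size(); ++d) {
            if (capacity[d] > 0 && distance[u][d] < best) {
                best = distance[u][d];
                assignment[u] = static_cast<int>(d);
            }
        }
        if (assignment[u] >= 0) --capacity[assignment[u]];
    }
    return assignment;
}

int main() {
    std::vector<std::vector<double>> distance = {{10, 40}, {12, 35}, {50, 8}};
    std::vector<int> capacity = {1, 2};             // data center 0 has only one free slot
    auto a = allocate(distance, capacity);
    for (size_t u = 0; u < a.size(); ++u)
        std::cout << "user " << u << " -> data center " << a[u] << "\n";
}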

Energy and Reliability

Parallel Rank Coherence in Networks for Inferring Disease Phenotype and Gene Set Associations
Abstract
The RCNet (Rank Coherence in Networks) algorithm has been used to find associations between gene sets and disease phenotypes. However, it suffers from high computational cost when the dataset is very large. In this paper, we design three mechanisms to accelerate the RCNet algorithm on a heterogeneous CPU-GPU system based on the CUDA and OpenMP programming models. The pipeline mechanism is well suited to collaborative computing on a CPU and dual GPUs and achieves more than a 33-fold performance gain. This work plays an important role in reconstructing disease phenome-genome associations efficiently.
Tao Li, Duo Wang, Shuai Zhang, Yulu Yang
Dynamic Power Estimation with Hardware Performance Counters Support on Multi-core Platform
Abstract
Power estimation has attracted plenty of attention for its significant guidance for OS scheduling and the development of power-efficient designs. Previous research indicates that power consumption can be estimated by monitoring related hardware events, such as instruction retirement, cache accesses, etc. However, models based on these hardware events introduce errors of around 5%. In this paper, a more accurate hardware-event-directed power model is proposed. We identify the events that best reflect the major power-consuming components. By analyzing hardware events in the processor through performance counters, a unified run-time power estimation model is introduced. Our model has been verified against real-time measurements and shows errors of 3.01% and 1.99% for the PARSEC and SPLASH-2 benchmark suites, respectively. Our power estimation model can serve as a foundation for intelligent, power-aware systems that dynamically balance power assignment and smooth peak power at run time.
Xin Liu, Li Shen, Cheng Qian, Zhiying Wang
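Counter-based power models of the kind described above are typically linear combinations of event rates with fitted coefficients. The sketch below shows such a linear model; the event names and coefficients are placeholders, not the fitted values from the paper.

#include <iostream>
#include <vector>

// Linear counter-based power model: P = base + sum_i(coeff_i * event_rate_i),
// where event_rate_i is events per second read from a performance counter.
// The coefficients are placeholders; in practice they are fitted against
// measured power (e.g., by least squares over a training workload set).
struct PowerModel {
    double base_watts;
    std::vector<double> coeffs;                  // one coefficient per monitored event

    double estimate(const std::vector<double>& event_rates) const {
        double p = base_watts;
        for (size_t i = 0; i < coeffs.size(); ++i)
            p += coeffs[i] * event_rates[i];
        return p;
    }
};

int main() {
    // Hypothetical events: retired instructions/s, L2 misses/s, DRAM accesses/s.
    PowerModel model{18.0, {4.0e-9, 9.0e-8, 2.5e-7}};
    std::vector<double> rates = {2.0e9, 5.0e6, 1.0e6};
    std::cout << "estimated power: " << model.estimate(rates) << " W\n";
}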
Double Circulation Wear Leveling for PCM-Based Embedded Systems
Abstract
Phase change memory (PCM) has emerged as a promising candidate to replace DRAM in embedded systems thanks to its attractive features. However, the endurance of PCM greatly limits its adoption in embedded systems: it can only sustain a limited number of write operations. To address this issue, we propose a simple, novel, and effective wear leveling technique, called Double Circulation Wear Leveling (DCWL), to evenly distribute write activities across PCM chips. The basic idea is to periodically move the hot region across the whole PCM chip. When a movement of the hot region is triggered, several small areas in the hot region are moved to the right. The experimental results show that our wear leveling technique can effectively improve the lifetime of PCM chips compared with previous work.
Guan Wang, Fei Peng, Lei Ju, Lei Zhang, Zhiping Jia
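At the heart of such schemes is a cheap logical-to-physical remapping that shifts periodically so that hot logical addresses do not keep hitting the same physical lines. The sketch below shows a simple rotating remap driven by a write counter; it is a generic illustration of the idea, not the DCWL region-movement policy, and a real implementation would also migrate the data of remapped lines whenever the mapping advances.

#include <cstdint>
#include <iostream>

// Simple rotating wear leveler: after every 'period' writes the whole logical
// space is shifted by one more line, so repeated writes to the same logical
// line land on different physical lines over time. Generic illustration only.
class RotatingWearLeveler {
    uint64_t lines_;        // number of PCM lines
    uint64_t period_;       // writes between shift increments
    uint64_t writes_ = 0;
    uint64_t shift_ = 0;

public:
    RotatingWearLeveler(uint64_t lines, uint64_t period)
        : lines_(lines), period_(period) {}

    uint64_t remap(uint64_t logical_line) const {
        return (logical_line + shift_) % lines_;
    }

    uint64_t on_write(uint64_t logical_line) {
        if (++writes_ % period_ == 0) ++shift_;   // periodically advance the mapping
        return remap(logical_line);
    }
};

int main() {
    RotatingWearLeveler wl(1024, 4);
    for (int i = 0; i < 10; ++i)                   // hot logical line 7 written repeatedly
        std::cout << wl.on_write(7) << " ";        // physical line drifts: 7 7 7 8 8 8 8 9 9 9
    std::cout << "\n";
}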

Intelligence Computing and Mobile Computing

Reputation-Based Participant Incentive Approach in Opportunistic Cognitive Networks
Abstract
Sufficient reputable participants are critical to effective data collection and dissemination in opportunistic cognitive networks. However, it is difficult to identify reputable or malicious participants in opportunistic networks. Cognitive network technology can be applied to the communication system of opportunistic networks to provide reputation-aware schemes for the participants. Furthermore, keeping participants enthusiastic about network activities is also important. In this work, we propose a Reputation-Based Participant Incentive Approach (RBPIA) to motivate reputable participants. RBPIA scores participants by reputation degree according to their sensing data and bid prices, and encourages them to stay interested in network activities through rewards. Simulations are performed in different scenarios to evaluate the efficiency of the approach. The results show that RBPIA can identify participant types well and remarkably reduces the incentive cost.
Jie Li, Rui Liu, Ruiyun Yu, Xingwei Wang, Zhijie Zhao
Semantic Similarity Calculation of Short Texts Based on Language Network and Word Semantic Information
Abstract
We first analyze the deviation that arises when current text similarity calculation methods are applied to short texts, and propose a similarity calculation method for short texts based on a language network and word semantic information. First, the short texts are modeled as a language network according to the complex-network characteristics of human language. Then, the comprehensive eigenvalues of the words in the language network and the word similarity between different texts are analyzed to obtain word semantics. The similarity between short texts is then calculated by combining the language network and word semantics. Finally, the effectiveness of the proposed algorithm is verified through clustering experiments.
Zhijian Zhan, Feng Lin, Xiaoping Yang
A New Technology for MIMO Detection: The μ Quantum Genetic Sphere Decoding Algorithm
Abstract
Multiple-input multiple-output (MIMO) detection is a key enabling technology in high-rate wireless communication, and its performance directly affects the data throughput of the whole system. How to improve MIMO detection, so as to increase the detection rate and reliability and lower the bit error rate (BER), has become a hot topic in the field of wireless digital communication. Since the original sphere decoding algorithm (OSDA) has relatively high computational complexity and a relatively long decoding time, in this paper we present a new technology for MIMO detection: the μ quantum genetic sphere decoding algorithm (μQGSDA). It combines the super-parallelism of μ quantum computing with the global search capability of the genetic algorithm (GA) and reduces a multi-dimensional search to a single-dimensional search, thereby avoiding a large number of complex matrix operations and improving detection efficiency. Simulation results demonstrate that our method offers good robustness, search capability, and convergence rate; moreover, the detection performance of μQGSDA is greatly improved over that of OSDA.
Jian Zhao, Hengzhu Liu, Xucan Chen, Ting Chen
Research on a Kind of Optimization Scheme of MIMO-OFDM Sphere Equalization Technology for Unmanned Aerial Vehicle Wireless Image Transmission Data Link System
Abstract
While an unmanned aerial vehicle (UAV) is on a mission, the large amount of acquired data needs to be transmitted to the base in real time. Consequently, how to achieve high-speed, high-quality data transmission over limited bandwidth and spectrum resources has become a hot research topic in wireless communication and aeronautical telemetry. Aiming at these problems, in this paper we present an optimization scheme of multi-input multi-output (MIMO) orthogonal frequency division multiplexing (OFDM) sphere equalization technology for the UAV wireless image transmission data link system. The scheme combines MIMO with OFDM to increase spectrum utilization and improve system performance while resisting the multipath effect. Moreover, by collaboratively optimizing the original sphere equalization technology (OSET) and introducing support for configurable parameters, the system's computational complexity is significantly reduced, and the detection efficiency as well as the adaptability to complex environments is improved. Simulation results demonstrate that our method achieves near-optimal bit error rate (BER) performance, high bandwidth efficiency, good robustness, and a fast convergence rate, with comprehensive performance greatly improved over OSET. Furthermore, our method also provides an important reference for the development of equalization technologies for UAV-based wireless image transmission data link systems, as well as for related domestic and international research.
Jian Zhao, Hengzhu Liu, Xucan Chen, Botao Zhang, Ting Chen
Backmatter
Metadata
Title
Advanced Computer Architecture
Edited by
Junjie Wu
Haibo Chen
Xingwei Wang
Copyright Year
2014
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-662-44491-7
Print ISBN
978-3-662-44490-0
DOI
https://doi.org/10.1007/978-3-662-44491-7
