
About this book

This book constitutes the refereed proceedings of the 12th Annual Conference on Advanced Computer Architecture, ACA 2018, held in Yingkou, China, in August 2018.

The 17 revised full papers presented were carefully reviewed and selected from 80 submissions. The papers of this volume are organized in topical sections on: accelerators; new design explorations; towards efficient ML/AI; parallel computing system.





A Scalable FPGA Accelerator for Convolutional Neural Networks

Convolutional neural networks (CNNs) have achieved undisputed success in many practical applications, such as image classification, face detection, and speech recognition. FPGA-based CNN prediction is known to be more efficient than GPU-based schemes, especially in terms of power consumption. In addition, OpenCL-based high-level synthesis tools for FPGAs are widely used because of their fast verification and implementation flows. In this paper, we propose an FPGA accelerator with a scalable architecture of deeply pipelined OpenCL kernels. The design is verified by implementing three representative large-scale CNNs, AlexNet, VGG-16 and ResNet-50, on an Altera OpenCL DE5-Net FPGA board. Our design achieves a peak performance of 141 GOPS for convolution operations and 103 GOPS for the entire VGG-16 network performing ImageNet classification on the DE5-Net board.
Ke Xu, Xiaoyun Wang, Shihang Fu, Dong Wang
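
As a rough illustration of how a GOPS figure like those quoted in the abstract above is typically derived, the sketch below counts the operations of one convolutional layer (one multiply plus one add per weight per output element) and divides by a latency. The layer shape and the latency value are hypothetical, and the function names are assumptions for illustration, not taken from the paper.

```python
def conv_ops(h_out, w_out, c_in, c_out, kernel):
    """Operation count of one convolutional layer (multiply and add counted separately)."""
    macs = h_out * w_out * c_out * c_in * kernel * kernel
    return 2 * macs

def gops(ops, latency_s):
    """Throughput in giga-operations per second for a measured latency."""
    return ops / latency_s / 1e9

# Hypothetical layer: 3x3 kernel, 64 -> 64 channels, 224x224 output feature map.
ops = conv_ops(h_out=224, w_out=224, c_in=64, c_out=64, kernel=3)
# Hypothetical latency of 50 ms, chosen only to make the arithmetic concrete.
throughput = gops(ops, latency_s=0.05)
```

Whether bias additions and activation functions are counted varies between papers, so figures from different sources are only comparable when the counting convention matches.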

Memory Bandwidth and Energy Efficiency Optimization of Deep Convolutional Neural Network Accelerators

Deep convolutional neural networks (DNNs) achieve state-of-the-art accuracy, but at the cost of massive computation and memory operations. Although highly parallel devices effectively meet the computational requirements, energy efficiency remains a hard problem.
In this paper, we present two novel computation sequences, \(N\!H\!W\!C_{fine}\) and \(N\!H\!W\!C_{coarse}\), for DNN accelerators. We then combine the two computation sequences with appropriate data layouts. The proposed modes enable continuous memory access patterns and reduce the number of memory accesses by leveraging and transforming the local data reuse of weights and feature maps in high-dimensional convolutions.
Experiments with various convolutional layers show that the proposed modes, made up of computation sequences and data layouts, are more energy efficient than the baseline mode on various networks. Total energy consumption is reduced by up to 4.10\(\times \), and off-chip memory access latency by up to 5.11\(\times \).
Zikai Nie, Zhisheng Li, Lei Wang, Shasha Guo, Qiang Dou
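
To illustrate why the data layout mentioned in the abstract above affects how contiguous memory accesses are, here is a minimal sketch of flat-offset computation for the common NHWC and NCHW layouts. It does not reproduce the paper's \(N\!H\!W\!C_{fine}\) or \(N\!H\!W\!C_{coarse}\) sequences, whose details the abstract does not give; the function names and the tensor shape are assumptions for illustration.

```python
def nhwc_offset(n, h, w, c, shape):
    """Flat offset of element (n, h, w, c) in a row-major NHWC tensor."""
    _, H, W, C = shape
    return ((n * H + h) * W + w) * C + c

def nchw_offset(n, c, h, w, shape):
    """Flat offset of the same element when the tensor is stored as NCHW."""
    _, H, W, C = shape
    return ((n * C + c) * H + h) * W + w

# Hypothetical feature map: batch 1, 4x4 spatial size, 8 channels.
shape = (1, 4, 4, 8)

# Distance in memory between adjacent channels of the same pixel.
nhwc_channel_stride = nhwc_offset(0, 0, 0, 1, shape) - nhwc_offset(0, 0, 0, 0, shape)
nchw_channel_stride = nchw_offset(0, 1, 0, 0, shape) - nchw_offset(0, 0, 0, 0, shape)
```

In NHWC the channels of one pixel are adjacent in memory, whereas in NCHW they are separated by a full H*W plane, so a loop that accumulates over channels reads contiguously only in the former.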

Research on Parallel Acceleration for Deep Learning Inference Based on Many-Core ARM Platform

Deep learning is one of the hottest research directions in the field of artificial intelligence. It has achieved results that surpass those of traditional methods. However, its demand for hardware computing capability is also increasing. Academia and industry mainly use heterogeneous GPUs to accelerate computation, but ARM is relatively more open than GPUs. The purpose of this paper is to study the performance of ThunderX high-performance many-core ARM chips, and the related acceleration techniques, under large-scale inference tasks. To study the computational performance of the target platform objectively, several deep models are adapted for acceleration. Through the selection of computational libraries, the adjustment of parallel strategies, and the application of various performance optimization techniques, we thoroughly exploit the computing capability of many-core ARM platforms. The final experimental results show that the performance of a single ThunderX chip is equivalent to that of the i7-7700K, and the overall performance of the dual-chip configuration reaches 1.77 times that of the latter. In terms of energy efficiency, the former is inferior to the latter; a stronger cooling system or poor power management may lead to higher power consumption. Overall, high-performance ARM chips can be deployed in the cloud to carry out large-scale deep learning inference tasks that require high throughput.
Keqian Zhu, Jingfei Jiang

Research on Acceleration Method of Speech Recognition Training

Recurrent neural networks (RNNs) are now widely used in speech recognition. Experiments show that they have significant advantages over traditional methods, but their complex computation limits their application, especially in real-time scenarios. An RNN depends heavily on the preceding and following states during computation, and much of this information overlaps, so reducing the overlapping information can accelerate training. This paper constructs a training acceleration structure that reduces computation cost and speeds up training by discarding the RNN's dependence on the preceding and following states. Errors in the recognition results are then corrected with a text corrector. We verify the proposed method on the TIMIT and Librispeech datasets; the results show that this approach achieves a speedup of about 3 times with little relative accuracy reduction.
Liang Bai, Jingfei Jiang, Yong Dou

New Design Explorations


A Post-link Prefetching Based on Event Sampling

Data prefetching is an effective approach to improving performance by hiding long memory latency. Existing profiling feedback optimizations perform well at prefetching pointer-based linked data structures. However, these optimizations, which instrument and optimize source code during compilation or post-link, usually incur tremendous overhead at the profiling stage. Furthermore, these methods cannot optimize a program without its source code. This work designs and implements an Event Sampling based Prefetching Optimizer, a post-link prefetching approach based on event sampling with hardware performance counters. Evaluation on the SW26010 processor shows that with the proposed prefetching approach, 9 out of 29 SPEC2006 programs are sped up by about 4.3% on average, with an average sampling overhead of less than 10%.
Hongmei Wei, Fei Wang, Zhongsheng Li

The Design of Reconfigurable Instruction Set Processor Based on ARM Architecture

In embedded systems, performance and flexibility are two of the most important concerns. To combine the flexibility of a GPP (General Purpose Processor) with the performance of an ASIC (Application Specific Integrated Circuit), this paper proposes an ARM-based RISP (Reconfigurable Instruction Set Processor) architecture that adopts partial reconfiguration and a coprocessor mechanism to realize dynamic online reconfiguration of processor instructions. A prototype of the architecture is implemented on a Xilinx KC705 FPGA, and reconfigurable resource management software is designed and developed for the prototype. DES encryption/decryption algorithms are tested on the prototype, and the results show that the architecture offers both the flexibility of a GPP and the performance of an ASIC, giving it broad application prospects.
Jinyong Yin, Zhenpeng Xu, Xinmo Fang, Xihao Zhou

Stateful Forward-Edge CFI Enforcement with Intel MPX

This paper presents a stateful forward-edge CFI mechanism based on a novel use of the Intel Memory Protection Extensions (MPX) technology. To enforce stateful CFI policies, we protect against malicious modification of pointers on the dereference paths of indirect jumps or function calls by saving these pointers into shadow memory. Intel MPX, which stores pointers' bounds in shadow memory, offers the capability of managing the copies of these indirectly dereferenced pointers. There are two challenges in applying MPX to forward-edge CFI enforcement. First, because MPX is designed to protect every pointer that may cause memory errors, it incurs unacceptable runtime overhead. Second, the MPX defense has holes when maintaining interoperability with legacy code. We address these challenges by protecting only the pointers on the dereference paths of indirect function calls and jumps, and by performing a further check on the loaded pointer value. We have implemented our mechanism in the LLVM compiler and evaluated it on a commodity Intel Skylake machine with MPX support. Evaluation results show that our mechanism is effective in enforcing forward-edge CFI while incurring acceptable performance overhead.
Jun Zhang, Rui Hou, Wei Song, Zhiyuan Zhan, Boyan Zhao, Mingyu Chen, Dan Meng

Analytical Two-Level Near Threshold Cache Exploration for Low Power Biomedical Applications

Emerging biomedical applications generally work at low to medium frequencies and require ultra-low energy. Near threshold processors with near threshold caches have been proposed as the computing platforms for these applications. The design space of multi-level near threshold cache hierarchies is large and calls for a fast design space exploration framework. In this paper, we first propose three different two-level near threshold cache architectures with different performance and energy tradeoffs. Then, we describe the design space of a two-level near threshold cache hierarchy and develop an accurate and fast analytical design space exploration framework to analyze this space. Experiments indicate that our new near threshold cache architecture achieves a significant energy saving of \(59\%\) on average. Moreover, our analytical framework is shown to be both accurate and efficient.
Yun Liang, Shuo Wang, Tulika Mitra, Yajun Ha

DearDRAM: Discard Weak Rows for Reducing DRAM’s Refresh Overhead

Due to leakage current, DRAM devices need periodic refresh operations to maintain the validity of the data in each DRAM cell. The shorter the refresh period, the more refresh overhead DRAM devices have to amortize. Because the retention times of DRAM cells differ as a result of process variation, DRAM vendors usually set the default refresh period to the retention time of the weakest cells, which account for less than 0.1% of total capacity.
In this paper, we propose DearDRAM (Discard weak rows DRAM), an efficient refresh approach that substantially reduces refresh overhead using two mechanisms: selectively disabling weak rows and remapping their physical addresses to a reserved region. DearDRAM allows DRAM devices to perform refresh operations with a much longer period (increased from 64 ms to 256 ms), which reduces energy consumption. Compared with previous schemes, DearDRAM is easy to implement, does not modify the DRAM chip, and introduces only slight modifications to the memory controller. Experimental results show that DearDRAM saves 76.12% of refresh energy on average, saves about 12.51% of total energy, and improves IPC by 4.56% on average in normal temperature mode.
Xusheng Zhan, Yungang Bao, Ninghui Sun
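
A back-of-the-envelope sketch of what extending the refresh period from 64 ms to 256 ms, as stated in the abstract above, means for the number of refresh operations. This only counts refreshes per row per second; it is not the paper's energy model, and the function name is an assumption for illustration.

```python
def refreshes_per_second(period_ms):
    """How many times each row is refreshed per second for a given refresh period."""
    return 1000.0 / period_ms

baseline = refreshes_per_second(64)    # default period from the abstract
extended = refreshes_per_second(256)   # extended period from the abstract

# Fraction of refresh operations that the longer period avoids.
refresh_reduction = 1.0 - extended / baseline
```

The count alone gives a 75% reduction in refresh operations; the energy figures reported in the abstract come from the authors' experiments rather than from this simple ratio.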

Towards Efficient ML/AI


EffectFace: A Fast and Efficient Deep Neural Network Model for Face Recognition

Although Deep Neural Networks (DNNs) have achieved great success in image recognition, DNN applications still demand too many resources in terms of both memory usage and computing time, which makes it barely possible to deploy a whole DNN system on resource-limited devices such as smartphones and small embedded systems. In this paper, we present a DNN model named EffectFace, designed for higher storage and computation efficiency without compromising accuracy.
EffectFace includes two sub-modules, EffectDet for face detection and EffectApp for face recognition. In EffectDet, we use sparse and small-scale convolution kernels (filters) to reduce the number of weights and thus memory usage. In EffectApp, we use pruning and weight sharing to further reduce the number of weights. At the output stage of the network, we use a new loss function rather than the traditional Softmax function to obtain feature vectors of the input face images, which reduces the dimension of the network output from n to a fixed 128, where n is the number of categories to classify. Experiments show that, compared with previous models, the number of weights in EffectFace is dramatically decreased (to less than 10% of previous models) without losing recognition accuracy.
Weicheng Li, Dan Jia, Jia Zhai, Jihong Cai, Han Zhang, Lianyi Zhang, Hailong Yang, Depei Qian, Rui Wang

A Power Efficient Hardware Implementation of the IF Neuron Model

Because of its parallel computing structure and localized storage, the human brain offers high throughput and low power consumption. Inspired by the brain, many researchers try to imitate the behavior of neurons on hardware platforms in order to obtain computational acceleration equal or close to that of the brain. In this paper, we propose a hardware structure that implements a single neuron with the Integrate-and-Fire (IF) model on a Virtex-7 XC7VX485T-ffg1157 FPGA. Through simulation and synthesis, we quantitatively analyze the device utilization and power consumption of our structure. The function of the proposed hardware implementation is verified on the classic XOR benchmark with a 4-layer SNN, and the scalability of our hardware neuron is tested on a handwritten digit recognition task on the MNIST database using a 6-layer SNN. Experimental results show that the proposed neuron hardware implementation passes the XOR benchmark test and meets the needs of the handwritten digit recognition task. The total on-chip power of the 4-layer SNN is 0.114 W, the lowest among ANNs and firing-rate-based SNNs at the same scale.
Shuquan Wang, Shasha Guo, Lei Wang, Nan Li, Zikai Nie, Yu Deng, Qiang Dou, Weixia Xu
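
For readers unfamiliar with the IF model named in the abstract above, the following is a minimal software sketch of a discrete-time integrate-and-fire neuron: the membrane potential accumulates the input at every step, and the neuron emits a spike and resets once a threshold is reached. The threshold, reset value, and input sequence are assumptions for illustration; the paper's hardware structure is not described in the abstract and is not reproduced here.

```python
def simulate_if_neuron(input_currents, threshold=1.0, v_reset=0.0):
    """Simulate a discrete-time integrate-and-fire neuron (no leak term).

    Returns a list with 1 at each time step where the neuron fires, else 0.
    """
    v = v_reset
    spikes = []
    for current in input_currents:
        v += current              # integrate the input
        if v >= threshold:        # fire when the threshold is reached
            spikes.append(1)
            v = v_reset           # reset the membrane potential
        else:
            spikes.append(0)
    return spikes

# Hypothetical input sequence; values are exact binary fractions.
spikes = simulate_if_neuron([0.5, 0.5, 0.25, 0.25, 0.5])
```

With these inputs the potential reaches the threshold at the second and fifth steps, so the neuron fires twice.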

paraSNF: An Parallel Approach for Large-Scale Similarity Network Fusion

With the rapid accumulation of multi-dimensional disease data, the integration of multiple similarity networks is essential for understanding the development of diseases and identifying disease subtypes. The recent computationally efficient method named SNF is suitable for integrating similarity networks and has been extensively applied in bioinformatics analysis. However, the computational and space complexity of the SNF method grows with the number of samples. In this research, we develop a parallel SNF algorithm named paraSNF to improve the speed and scalability of SNF. The experimental results on two large-scale simulation datasets reveal that the paraSNF algorithm is 30x–100x faster than serial SNF, and that its speedup over SNF running on multiple cores with multiple threads is 8x–15x. Furthermore, paraSNF saves more than 60% of memory space, which can greatly improve the scalability of SNF.
Xiaolong Shen, Song He, Minquan Fang, Yuqi Wen, Xiaochen Bo, Yong Dou

An Experimental Perspective for Computation-Efficient Neural Networks Training

Driven by the strong demand for computation-efficient neural networks that allow deep learning models to be deployed on inexpensive and widely used devices, many lightweight networks have been proposed, such as the MobileNet series and ShuffleNet. These computation-efficient models are designed for very limited computational budgets, e.g., 10–150 MFLOPs, and can run efficiently on ARM-based devices. They have a smaller CMR than large networks such as VGG, ResNet, and Inception.
Although these models are quite efficient for inference on ARM, how do they fare for inference or training on GPUs? Unfortunately, compact models usually cannot fully utilize a GPU, even though they run fast because of their small size. In this paper, we present a series of extensive experiments on training compact models, including training on a single host with GPU and CPU and in a distributed environment. We then give some analysis and suggestions on the training.
Lujia Yin, Xiaotao Chen, Zheng Qin, Zhaoning Zhang, Jinghua Feng, Dongsheng Li

Parallel Computing System


Distributed Data Load Balancing for Scalable Key-Value Cache Systems

In recent years, in-memory key-value cache systems have become increasingly popular for real-time and interactive data processing tasks. Caching systems are often used to help with the temporary storage and processing of data. Skewed and dynamic workload patterns, e.g. data growth or shrinkage or changes in the read/write ratio of requests, can cause load imbalance and degrade the performance of caching systems.
Migrating data is often essential for balancing load in distributed storage systems. However, it can be difficult to determine when to move data, where to move it, and how much to move. This depends on the resources required, e.g. CPU, memory and bandwidth, as well as on data movement policies. Since frequent and global rebalancing may affect the QoS of applications that use caching systems, it is necessary to minimize system imbalance whilst considering the total migration cost. We propose a novel distributed load balancing method for Redis Cluster, a mainstream cloud-based data framework. We show how distributed graph clustering through load balancing can be used to handle varying rebalancing scenarios comprising local and global needs. The rebalancing process has three phases: random walk matching load balancing, local round-robin migration, and data migration between the trigger node and newly added servers. Our experiments show that the proposed approach reduces migration time by 30 s compared with other approaches, and that the load imbalance degree can be reduced by 4x when the locality degree reaches 50%, whilst achieving high throughput.
Shanshan Chen, Xudong Zhou, Guiping Zhou, Richard O. Sinnott

Performance Analysis and Optimization of Cyro-EM Structure Determination in RELION-2

REgularised LIkelihood OptimisatioN (RELION) is one of the most popular software packages for single-particle cryo-EM structure determination. Although efforts have been made to optimize the RELION workflow, the refinement step remains a performance bottleneck. In this paper, we thoroughly analyze the cause of this bottleneck and propose corresponding optimizations. The experimental results show that our approach achieves a speedup of 3.17\(\times \) without degrading the resolution.
Xin You, Hailong Yang, Zhongzhi Luan, Depei Qian

The Checkpoint-Timing for Backward Fault-Tolerant Schemes

To improve the performance of backward fault-tolerant schemes in long-running parallel applications, a general checkpoint-timing method is proposed that determines unequal checkpointing intervals for an arbitrary failure rate in order to reduce the total execution time. First, a new model is introduced to evaluate the mean expected execution time. Second, the optimality condition is derived from the model for a constant failure rate, from which the optimal equal checkpointing interval can be obtained easily. Subsequently, a general method is derived to determine the checkpoint timing for other failure rates. The final results show that the proposal offers a practical trade-off between the re-processing overhead and the checkpointing overhead in backward fault-tolerant schemes.
Min Zhang
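
For the constant-failure-rate case mentioned in the abstract above, a well-known first-order estimate of the equal checkpointing interval is Young's approximation, sqrt(2 * C / lambda), where C is the cost of one checkpoint and lambda is the failure rate. The sketch below uses that classical formula as a stand-in; it is not the model or optimality condition derived in the paper, and the parameter values are hypothetical.

```python
import math

def young_interval(checkpoint_cost, failure_rate):
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost: time to write one checkpoint (seconds)
    failure_rate:    constant failure rate (failures per second, i.e. 1 / MTBF)
    """
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("checkpoint_cost and failure_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)

# Hypothetical values: 60 s per checkpoint, one failure per day on average.
interval = young_interval(checkpoint_cost=60.0, failure_rate=1.0 / 86400.0)
```

The approximation captures the trade-off the abstract describes: checkpointing more often raises checkpointing overhead, while checkpointing less often raises the expected re-processing time after a failure.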

Quota-constrained Job Submission Behavior at Commercial Supercomputer

Understanding user behavior is very helpful for assessing HPC job scheduling, improving allocation efficiency, and increasing user satisfaction. Current research on user behavior focuses mainly on the think time (i.e. the time between two consecutive jobs) of non-commercial supercomputer systems. In this paper, we present a methodology to characterize the workloads of commercial supercomputers. We use it to analyze 2.7 million jobs from users in various fields on Tianhe-1A from 2016.01 to 2017.12 and 0.89 million jobs on Sugon 5000A from 2015.09 to 2017.03.
To identify the main factors affecting users' job submission behavior on commercial supercomputers, this paper analyzes the correlation between that behavior and various factors such as job characteristics and quota constraints. The results show that, on a commercial supercomputer, a user's job submission behavior is not obviously affected by the previous job's runtime and waiting time. It is affected by the number of processors the job uses, the previous job's status, and the total amount of resources available to the user for submitting jobs. We also find that there are three job submission peaks each day. Within a time window of 8 h, 86% of the jobs of the same user have the same number of processors, and nearly 40% of them differ little in runtime.
Jinghua Feng, Guangming Liu, Zhiwei Zhang, Tao Li, Yuqi Li, Fuxing Sun
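
The think time mentioned in the abstract above can be computed from a job log in a few lines. The sketch below assumes think time is measured from the end of one job to the submission of the user's next job, which is one common reading of "time between two consecutive jobs"; the record format and the sample values are hypothetical.

```python
def think_times(jobs):
    """Think times for one user's jobs.

    jobs: list of (submit_time, end_time) tuples in seconds.
    Returns the gap between each job's end and the next job's submission,
    with jobs ordered by submission time. A negative value means the next
    job was submitted before the previous one finished.
    """
    ordered = sorted(jobs)
    return [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]

# Hypothetical log for a single user.
jobs = [(0, 100), (160, 400), (420, 500)]
gaps = think_times(jobs)
```

Here the user waited 60 s after the first job and 20 s after the second before submitting again.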

