
About This Book

This book constitutes the refereed proceedings of the 13th Conference on Advanced Computer Architecture, ACA 2020, held in Kunming, China, in August 2020. Due to the COVID-19 pandemic, the conference was held online.

The 24 revised full papers presented were carefully reviewed and selected from 105 submissions. The papers of this volume are organized in topical sections on: interconnection network, router and network interface architecture; accelerator-based, application-specific and reconfigurable architecture; processor, memory, and storage systems architecture; model, simulation and evaluation of architecture; new trends of technologies and applications.

Table of Contents

Interconnection Network, Router and Network Interface Architecture

Frontmatter

Abstract
After the Stuxnet incident in 2010, the security problems of SCADA systems were revealed to the public, attracting more and more researchers to design new security mechanisms to address them. In particular, since software-defined networking (SDN) arose, it has become a promising approach to improving SCADA security. In this paper, a formalized vulnerability detection platform named SDNVD-SCADA is presented based on SDN technology, which can be used to find the most common vulnerabilities in SCADA design, implementation, deployment, and operation processes. A general security-mechanism description language and a SCADA vulnerability pattern database are embedded in SDNVD-SCADA to achieve automatic vulnerability detection.
Jinjing Zhao, Ling Pang, Bai Lin

Optimal Implementation of In-Band Network Management for High-Radix Switches

Abstract
To manage (e.g., configure and monitor) the numerous network chips and their ports efficiently, in-band management technology is used in the interconnection networks of high-performance computing systems. However, with the rapid development of network switching chips towards higher radix, the traditional ring-structured in-band management implementation faces a delay-performance scalability problem. This work proposes two optimized implementation structures for in-band management, the four-quadrant double-layer ring and the four-quadrant star ring, to solve this problem. Resource-consumption assessment and delay-performance simulation show that, for high-radix switching chips with 64, 80, 96, 112, 128, 144, and 160 ports, the LUT (Look-Up Table) resource occupancies of the four-quadrant double-layer ring and star ring structures increase by an average of 5.46% and 1.71%, respectively, compared to the traditional ring structure. Meanwhile, the LUTRAM (Look-Up Table memory) occupancies increase by an average of 30.89% and 21.81%, the FF (Flip-Flop) occupancies by an average of 3.86% and 0.19%, and the forwarding delay of management packets decreases by 25.75% and 21.81%, respectively. Considering both resource consumption and delay performance, the star ring is the more suitable structure for addressing the delay-performance scalability problem, and it can be applied to realize in-band management for higher-radix switching chips in the future.
Jijun Cao, Mingche Lai, Xingyun Qi, Yi Dai, Zhengbin Pang

A 32 Gb/s Low Power Little Area Re-timer with PI Based CDR in 65 nm CMOS Technology

Abstract
This paper presents a 32 Gb/s low-power, small-area re-timer with a Phase Interpolator (PI) based Clock and Data Recovery (CDR) circuit. To further ensure signal integrity, both a Continuous-Time Linear Equalizer (CTLE) and a Feed-Forward Equalizer (FFE) are adopted. To save power, a quarter-rate 3-tap FFE is proposed. To reduce chip area, a Bang-Bang Phase Detector (BBPD) based PI CDR is employed. In addition, a second-order digital filter is adopted to improve the jitter performance of the CDR loop. The re-timer is implemented in 65 nm CMOS technology with a 1.1 V supply. Simulation results show that the proposed re-timer works at 32 Gb/s and consumes 91 mW. It can equalize channel attenuation greater than −12 dB and tolerate a frequency offset of 200 ppm.
Zhengbin Pang, Fangxu Lv, Weiping Tang, Mingche Lai, Kaile Guo, Yuxuan Wu, Tao Liu, Miaomiao Wu, Dechao Lu

DBM: A Dimension-Bubble-Based Multicast Routing Algorithm for 2D Mesh Network-on-Chips

Abstract
Network-on-Chips (NoCs) are widely used today for efficient communication in multicore systems. Existing NoCs mostly use a 2D mesh topology in commercial and experimental manycore processors, since it maps well to the 2D layout. For 2D meshes, dimension-order routing and various adaptive routing algorithms perform well under unicast traffic but suffer from poor performance under one-to-many (multicast) traffic. An efficient multicast routing algorithm is an important target in the design of special-purpose on-chip networks such as those for neural networks. Recently proposed multicast routing algorithms are either inefficient or can introduce unbalanced load in some situations. In this paper, we propose DBM, a novel multicast routing algorithm based on dimension-bubble flow control for 2D mesh networks. DBM is deadlock-free while achieving minimal-path, fully adaptive multicast routing. Moreover, DBM simplifies the deadlock-avoidance condition so that no escape channel is necessary. Evaluation results show that DBM achieves much better performance than existing multicast routing algorithms, with an 18% reduction in packet latency and a 16% improvement in network throughput.
Canwen Xiao, Hui Lou, Cunlu Li, Kang Jin

MPLEG: A Multi-mode Physical Layer Error Generator for Link Layer Fault Tolerance Test

Abstract
In the design of high-speed communication network chips, the fault-tolerant design of the link layer is among the most important parts. During the design process, the link-layer fault-tolerance function needs to be fully tested and verified, but relying only on traditional case-by-case simulation is far from sufficient. To test and verify this function completely, this paper proposes a configurable multi-mode physical-layer error-generation method implemented on chip: MPLEG (a Multi-mode Physical Layer Error Generator). With MPLEG, a desired bit-error pattern can be generated at the physical layer in all stages of chip design, including simulation verification, FPGA prototype system verification, sample chip testing, and actual system operation. Statistical analysis of the experimental results shows that MPLEG can generate error patterns almost identical to real link errors. Meanwhile, MPLEG enables relatively complete and efficient testing and verification of the various link-layer fault-tolerance functions.
Xingyun Qi, Pingjing Lu, Jijun Cao, Yi Dai, Mingche Lai, Junsheng Chang

GNN-PIM: A Processing-in-Memory Architecture for Graph Neural Networks

Abstract
Graph neural networks (GNNs) have attracted increasing interest in recent years. Due to the poor data locality and huge data movement during GNN inference, it is challenging to employ GNNs to process large-scale graphs. Fortunately, processing-in-memory (PIM) architecture has been widely investigated as a promising approach to addressing the “Memory Wall”. In this work, we propose a PIM architecture to accelerate GNN inference. We develop an optimized dataflow to leverage the inherent parallelism of GNNs. Targeting this dataflow, we further propose a hierarchical NoC to perform concurrent data transmission. Experimental results show that our design significantly outperforms prior works.
Zhao Wang, Yijin Guan, Guangyu Sun, Dimin Niu, Yuhao Wang, Hongzhong Zheng, Yinhe Han

A Software-Hardware Co-exploration Framework for Optimizing Communication in Neuromorphic Processor

Abstract
Spiking neural networks (SNNs) have been widely used to solve complex tasks such as pattern recognition and image classification. Neuromorphic processors that use SNNs for computation have proven to be powerful and energy-efficient. These processors generally use a Network-on-Chip (NoC) as the interconnect structure between neuromorphic cores. However, the connections between neurons in an SNN are very dense: when a neuron fires, it generates a large number of data packets, which results in congestion and dramatically increases packet transmission latency in the NoC.
In this paper, we propose a software-hardware co-exploration framework to alleviate this problem. The framework consists of three parts: software simulation, packet extraction and mapping, and hardware evaluation. At the software level, we explore the impact of packet loss on the classification accuracy of different applications. At the hardware level, we explore the impact of packet loss on transmission latency and power consumption in the NoC. Experimental results show that when the neuromorphic processor runs the MNIST handwritten-digit recognition application, the communication delay is reduced by 11%, the power consumption by 5.3%, and the classification accuracy reaches 80.75% (2% higher than the original accuracy). When running the FSDD speech recognition application, the communication delay is reduced by 22%, the power consumption by 2.2%, and the classification accuracy reaches 78.5% (1% higher than the original accuracy).
Shiying Wang, Lei Wang, Ziyang Kang, Lianhua Qu, Shiming Li, Jinshu Su

A CNN Hardware Accelerator in FPGA for Stacked Hourglass Network

Abstract
The stacked hourglass network is a widely used deep neural network model for body-pose estimation. The essence of this model can be roughly viewed as a combination of Deep Convolutional Neural Networks (DCNNs) and cross-layer feature-map fusion operations. FPGAs are advantageous for accelerating such a model because of their customizable data parallelism and high on-chip memory bandwidth. However, unlike accelerating a bare DCNN model, stacked hourglass networks introduce implementation difficulty by presenting massive feature-map fusion in a first-in-last-out manner. This poses a greater challenge to memory-bandwidth utilization and control-logic complexity on top of the already complicated DCNN dataflow design. In this work, an FPGA accelerator is proposed as a pioneering effort in accelerating the stacked hourglass model. To achieve this goal, we propose an address-mapping method to handle the upsampling convolutional layers and a network mapper for scheduling the feature-map fusion. A 125 MHz fully working demo on a Xilinx XC7Z045 FPGA achieves a performance of 8.434 GOP/s with a power efficiency of 4.924 GOP/s/W. Our system achieves 296× higher performance than the compared Arm Cortex-A9 CPU and 3.2× higher power efficiency, measured in GOP/s/W, than a GPU implementation on an Nvidia 1080Ti.
Dongbao Liang, Jiale Xiao, Yangbin Yu, Tao Su

PRBN: A Pipelined Implementation of RBN for CNN Training

Abstract
Recently, training CNNs (Convolutional Neural Networks) on-chip has attracted much attention. With the development of CNNs, the proportion of execution time taken by the BN (Batch Normalization) layer is increasing and can even exceed that of the convolutional layers. The BN layer accelerates the convergence of training, yet little work focuses on efficient hardware implementation of BN computation in training. In this work, we propose an accelerator, PRBN, which supports both BN and convolution computation in training. In our design, a systolic array accelerates the convolution and matrix multiplication, and an RBN (Range Batch Normalization) array based on the hardware-friendly RBN algorithm performs the BN-layer computation. We implement PRBN on the FPGA PYNQ-Z1; it runs at 50 MHz and consumes 0.346 W. Experimental results show that, compared with a CPU i5-7500, PRBN achieves a 3.3× speedup in performance and an 8.9× improvement in energy efficiency.
Zhijie Yang, Lei Wang, Xiangyu Zhang, Dong Ding, Chuan Xie, Li Luo

Processor, Memory, and Storage Systems Architecture

Frontmatter

Abstract
Energy and power density have forced the industry to introduce many-core chips, where a large number of processor cores are integrated onto a single die. In such settings, the communication latency of the network-on-chip (NoC) can become the performance bottleneck of multi-core and many-core processors. Unfortunately, existing approaches for mapping running tasks to the underlying hardware resources often ignore the impact of the NoC, leading to sub-optimal performance and energy efficiency. This paper presents a novel approach to allocating NoC resources among running tasks, based on topology partitioning of the NoC's shared routers. We evaluate our approach by comparing it against two state-of-the-art methods in simulation. Experimental results show that our approach reduces NoC communication latency by 5.19% and 2.99%, and energy consumption by 17.94% and 12.68%, over the two competing approaches.
Xiaole Sun, Yong Dong, Juan Chen, Zheng Wang

Dissecting the Phytium 2000+ Memory Hierarchy via Microbenchmarking

Abstract
An efficient use of the memory system on multi-cores is critical to improving data locality and achieving better program performance. But the hierarchical memory system with caches often works in a “black-box” manner, automatically moving data across memory layers and making code optimization a daunting task. In this article, we dissect the memory system of the Phytium 2000+ many-core with micro-benchmarks. We measure the latency and bandwidth of moving cachelines across memory levels on a single core or between two distinct cores. We design a set of micro-benchmarks that use the pointer-chasing method to measure latency and the chunk-accessing method to measure bandwidth. During measurement, we place the cacheline on the specified memory layer and set its initial coherence state. The experimental results on the Phytium 2000+ quantify its actual memory performance and reveal undocumented performance data and micro-architectural details. Our work thus provides quantitative guidelines for optimizing Phytium 2000+ memory accesses.
Wanrong Gao, Jianbin Fang, Chuanfu Xu, Chun Huang

TSU: A Two-Stage Update Approach for Persistent Skiplist

Abstract
Skiplist, a widely used in-memory index structure, can incur crash inconsistency when running on emerging NVRAM (Non-Volatile Random Access Memory). Logging or strict serialization can ensure crash consistency, at the cost of severe performance degradation. In this paper, we propose TSU, a two-stage update approach that improves the performance of a persistent skiplist while preserving crash consistency. TSU exploits the spatial locality of the skiplist and the atomic writes of NVRAM, thus effectively reducing expensive cache-line flush (clflush) operations. To this end, we categorize the four crash-inconsistent states into two types: recoverable and unrecoverable. TSU guarantees that the crash state is recoverable by constraining the memory-access order for insertion and deletion. We further design a persistency algorithm that reduces clflush operations by preserving the memory-persistence order of skiplist updates. In addition, we develop a concurrent search for TSU. The evaluation results show that TSU reduces cache-line flushes by up to 47.6% and decreases the average request latency of insertions by up to 36% compared to strict serialization.
Shucheng Wang, Qiang Cao

NV-BSP: A Burst I/O Storage Pool Based on NVMe SSDs

Abstract
The High-Performance Computing (HPC) systems built for future exascale computing, big-data analytics, and artificial intelligence applications raise an ever-increasing demand for high-performance and highly reliable storage systems. In recent years, as Non-Volatile Memory Express (NVMe) Solid-State Drives (SSDs) have been deployed in HPC storage systems, the performance penalty paid for the legacy I/O software stack and storage network architecture has turned out to be non-trivial. In this paper, we propose NV-BSP, an NVMe SSD-based Burst I/O Storage Pool, to leverage the performance benefits of NVMe SSDs, the NVMe over Fabrics (NVMeoF) protocol, and Remote Direct Memory Access (RDMA) networks in HPC storage systems. NV-BSP disaggregates NVMe SSDs from HPC compute nodes to enhance the scalability of HPC storage systems, employs fine-grained chunks rather than physical NVMe SSD devices as the RAID-based data-protection areas, and exploits a highly concurrent I/O processing model to alleviate the performance overhead of lock contention and context switches on the critical I/O path. We implement NV-BSP in Linux and evaluate it with synthetic FIO benchmarks. Our experimental results show that NV-BSP achieves scalable system performance as the number of NVMe SSDs and CPU cores increases, and obtains much better performance than the built-in MD-RAID in Linux. Compared with node-local SSDs in HPC, NV-BSP provides a full-system solution for storage disaggregation, delivers comparable performance, and significantly improves system reliability.
Qiong Li, Dengping Wei, Wenqiang Gao, Xuchao Xie

Pin-Tool Based Execution Backtracking

Abstract
Checkpoint/restart is a common fault-tolerance technique that periodically dumps state to reliable storage and restarts applications after failure. Most existing checkpoint/restart implementations handle only volatile state and lack support for the persistent state of applications. Even algorithms specifically designed for file checkpointing may not support complex operations, and some require source-code modification. This paper presents a new checkpointing technique, which uses dynamic instrumentation to temporarily cache disk operations in memory and an existing memory-checkpointing tool to dump or restore process state at runtime. We show that this method not only creates regular checkpoints for both volatile and persistent state, but also has important applications in execution backtracking.
Shuangjian Wei, Weixing Ji, Qiurui Chen, Yizhuo Wang

Directory Controller Verification Based on Genetic Algorithm

Abstract
The directory protocol is the most widely used cache-coherence implementation in large-scale shared-memory multi-core processors, and it is very complex and difficult to verify. In this paper, we propose a random test-generation method based on a genetic algorithm to verify the directory controller of a 64-core processor. We analyze the test features to encode the symbols of the genetic algorithm, and evaluate the merit of each test using a fitness function based on functional coverage. We establish the relationship between coverage and test vectors, and analyze the relationship between coverage and test stimuli through the genetic algorithm. The experimental results show that, compared with the pseudo-random method, the functional coverage of this method increases by nearly 20%–30%, the bug-detection rate is relatively high, and both verification efficiency and quality are improved.
Li Luo, Li Zhou, Hailiang Zhou, Quanyou Feng, Guoteng Pan

Prediction and Analysis Model of Telecom Customer Churn Based on Missing Data

Abstract
In the field of business data analysis, customer-churn prediction plays an important role. This paper combines traditional statistical prediction methods with artificial intelligence prediction methods to propose a customer-churn prediction and analysis model based on missing data, in an attempt to explore a new solution in this field. Starting from the missing data, the model uses factor analysis and data-mining techniques to generate key factor sets and their values, which form the input neurons and their initial values. The number of hidden-layer neurons is determined by combinatorial prediction. Using an improved genetic algorithm, the initial weights and thresholds of the BP network are determined. Finally, the prediction results and the key attribute data related to them are generated for decision makers to analyze. The experiments evaluate the model in terms of accuracy, precision, recall, and F-measure, showing that the model is effective.
Rui Zeng, Lingyun Yuan, Zhixia Ye, Jinyan Cai

How to Evaluate Various Commonly Used Program Classification Methods?

Abstract
Understanding the characteristics of scientific computing programs has been of great importance due to its close relationship with the design and implementation of program-optimization methods. Generally, scientific computing programs can be divided into three categories according to their computing, memory-access, and communication characteristics: compute-intensive, memory-intensive, and communication-intensive. There is more than one commonly used program classification method, particularly for compute-intensive and memory-intensive programs. In most cases, the various classification methods produce consistent results, but occasionally they disagree. Why and where do these inconsistencies occur? How should such inconsistencies be understood, and what is the reason behind them? We answer these questions by analyzing four representative program classification methods (IPC, MPKI, MEM/Uop, and Roofline) on two platforms. First, we identify several cases of occasional inconsistency: inconsistency across indicators, inconsistency from multi-phase characteristics, and inconsistency across platforms, together with some possible reasons. Second, we explore the impact of threshold settings on classification inconsistencies. All the experimental and analysis results, together with data collected from other references, show that different classification methods give the same classification results in most cases but occasionally disagree, especially for in-between programs that fall between memory-intensive and compute-intensive, which can adversely affect some optimization algorithms.
Xinxin Qi, Yuan Yuan, Juan Chen, Yong Dong

A Performance Evaluation Method for Machine Learning Cloud

Abstract
In recent years, the application of machine learning algorithms has become more and more extensive, and the combination of cloud platforms and machine learning algorithms ever closer. With the popularity of cloud platforms and the growing number of cloud platform providers, comparing the performance of different cloud platforms becomes crucial. Cloud-platform performance benchmarks can provide a relatively objective reference for consumers. However, the current mature cloud-platform benchmarks cannot meet the requirements of testing the machine learning capabilities of cloud platforms, while recent ones test only machine learning performance. Building on previous cloud-platform performance-testing methods, this paper designs a cloud-platform performance evaluation method for machine learning applications that combines an AI-based testing benchmark with a CPU-based testing benchmark. It can evaluate not only the CPU performance of a cloud platform but also its GPU performance when running machine learning algorithms.
Yue Zhu, Shazhou Yang, Yongheng Liu, Longfei Zhao, ZhiPeng Fu

Parallelization and Optimization of Large-Scale CFD Simulations on Sunway TaihuLight System

Abstract
TRIP is an in-house Computational Fluid Dynamics (CFD) software package that can simulate subsonic, transonic, and supersonic flows with complex geometries. With the increasing computation and memory requirements of large-scale CFD simulations, using massively parallel computers is an inevitable trend. In this paper, using a dual-level hybrid and heterogeneous programming model based on MPI + OpenACC, we port and optimize the TRIP software on the Sunway TaihuLight supercomputer. A series of optimization techniques, including data reconstruction, data packing, loop refactoring, and array swapping, are explored. In addition, a grid-preprocessing tool is developed to reduce the load imbalance caused by the non-cubic shape of sub-grids. Scalability tests show that TRIP achieves 66.9% parallel efficiency for strong scaling and 96% efficiency for weak scaling when the core count is increased from 10,400 to 665,600.
Hao Yue, Liang Deng, Dehong Meng, Yuntao Wang, Yan Sun

Liquid State Machine Applications Mapping for NoC-Based Neuromorphic Platforms

Abstract
The Liquid State Machine (LSM) is a kind of spiking neural network (SNN) that contains recurrent connections in its reservoir. Nowadays, LSMs are widely deployed on a variety of neuromorphic platforms to handle vision and audio tasks. These platforms adopt a Network-on-Chip (NoC) architecture for multi-core interconnection. However, the large communication volume stemming from the reservoir of the LSM has a significant effect on platform performance. In this paper, we propose an LSM mapping method using the SNEAP toolchain for mapping LSMs to multi-core neuromorphic platforms, which aims to reduce the energy and latency caused by spike communication on the interconnect. The method includes two key steps: partitioning the LSM to reduce the spikes communicated between partitions, and mapping the partitions of the LSM onto the NoC to reduce the average hop count of spikes under the constraint of hardware resources. The method is also effective for large-scale LSMs. The experimental results show that our method achieves a 1.5× reduction in end-to-end execution time, and reduces average energy consumption by 57% on an 8 × 8 2D-mesh NoC and average spike latency by 23% on a 4 × 4 2D-mesh NoC, compared to SpiNeMap.
Shiming Li, Lei Wang, Shiying Wang, Weixia Xu

Compiler Optimizing for Power Efficiency of On-Chip Memory

Abstract
As is well known, the power constraint is the biggest challenge in building an exascale computing system. Among all parts of a high-performance processor, on-chip memory, including the registers, caches, and so on, is accessed frequently and incurs high power consumption during program execution. Owing to its trivial overhead and good portability, compiler optimization is a promising way to reduce the power consumption and thermal dissipation of a processor. In this paper, we focus on compiling to save the power of on-chip memory accesses. A compiler optimization based on register bypassing is proposed to reduce the number of register accesses and thereby lower the power of the register files. In addition, to save cache power, another compiler optimization adjusts loop transformations to make better use of the L0 cache. Finally, to evaluate the effectiveness of these techniques, we build a systematic evaluation platform named GEAT, which consists of a compiler, a performance simulator, and a power simulator. Experimental results show that our techniques effectively reduce the power consumption of on-chip memory with trivial performance overhead.
Wei Wu, Qi Zhu, Fei Wang, Rong-Fen Lin, Feng-Bin Qi

Structural Patch Decomposition Fusion for Single Image Dehazing

Abstract
In this paper, we present a new image dehazing method via structural patch decomposition image fusion, which does not rely on accurate estimates of the global atmospheric light and the transmission. Instead of estimating the exact global atmospheric light and the transmission separately, as most previous methods do, our method directly constructs initial dehazed images with different exposures through histogram analysis and a structural patch decomposition image-fusion filter to improve the visual dehazing effect. Experimental results show that this method outperforms state-of-the-art haze-removal methods in terms of both efficiency and visual dehazing quality.
Yin Gao, Hongyun Li, Yijing Su, Jun Li

Historic and Clustering Based QoS Aggregation for Composite Services

Abstract
Web services run in an open, heterogeneous, and multi-tenant network environment, which makes their QoS uncertain and difficult to describe by a known probability distribution. Therefore, calculating the QoS aggregation of composite services is challenging. This paper presents a new method for the aggregation calculation of composite services. In this method, the QoS of Web services is characterized by the sample space formed by their historical records, and a clustering method is adopted to control the number of samples in the sample space, so as to avoid combinatorial explosion during the aggregation calculation. The method does not restrict the QoS distribution and is suitable for composite services described by various common workflows and all kinds of QoS attributes. Experiments show that our method has advantages in both time cost and computational accuracy over existing methods.
Zhang Lu, Ye Heng Zhou

A High-Performance with Low-Resource Utility FPGA Implementation of Variable Size HEVC 2D-DCT Transform

Abstract
High Efficiency Video Coding (HEVC) is a new international video-compression standard offering much better compression efficiency than previous standards, at the expense of much higher computational complexity. This paper presents a two-dimensional (2D) discrete cosine transform (DCT) hardware architecture dedicated to HEVC on field-programmable gate array (FPGA) platforms. The proposed methodology efficiently organizes the 2D-DCT computation to fit the internal components and characteristics of FPGA resources. The architecture supports variable DCT sizes, including 4 × 4, 8 × 8, 16 × 16, and 32 × 32, and has been implemented in Verilog and synthesized on various FPGA platforms. Compared with existing related work, our architecture demonstrates significant performance improvements with low FPGA resource utilization, which is very important for a complete FPGA solution of the whole HEVC codec.
Ying Zhang, Gen Li, Lei Wang

Backmatter

Further Information