2019 | Book

Network and Parallel Computing

16th IFIP WG 10.3 International Conference, NPC 2019, Hohhot, China, August 23–24, 2019, Proceedings

Edited by: Dr. Xiaoxin Tang, Quan Chen, Pradip Bose, Weiming Zheng, Jean-Luc Gaudiot

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this Book

This book constitutes the proceedings of the 16th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2019, held in Hohhot, China, in August 2019.
The 22 full and 11 short papers presented in this volume were carefully reviewed and selected from 107 submissions. They were organized in topical sections named: graph computing; NOC and networks; neural networks; big data and cloud; HPC; emerging topics; memory and file system.

Table of Contents

Frontmatter
Correction to: Efficient Processing of Convolutional Neural Networks on SW26010

In the originally published version of this chapter, “swDGEMM” in the second-to-last sentences of Sections 2.2 and 3.3 was corrected to “swDNN”. Furthermore, “16” in the last sentence of Section 3.3 was corrected to “17”, and a reference to https://github.com/feifeibear/swDNNv1.0 was added.

Yi Zhang, Bing Shu, Yan Yin, Yawei Zhou, Shaodi Li, Junmin Wu

Graph Computing

Frontmatter
GraphScSh: Efficient I/O Scheduling and Graph Sharing for Concurrent Graph Processing

With the increasing need for analyzing graph data, graph systems have to deal efficiently with concurrent graph processing (CGP) jobs. However, existing platforms are inherently designed for a single job, so they incur high cost when CGP jobs are executed. In this work, we observe that existing systems do not allow CGP jobs to share graph structure data across iterations, introducing redundant accesses to the same graph. Moreover, most real-world graphs have highly skewed power-law degree distributions, and the gain from adding external storage devices diminishes rapidly, so reasonable scheduling is needed to balance I/O pressure across the storage devices. Following this direction, we propose GraphScSh, which handles CGP jobs efficiently on a single machine by focusing on reducing I/O conflicts and sharing graph structure data among CGP jobs. We apply a CGP-balanced partition method to break graphs into multiple partitions that are stored across multiple external storage devices. Additionally, we present a CGP I/O scheduling method that reduces I/O conflicts and shares graph data among multiple jobs. We have implemented GraphScSh in C++, and experiments show that GraphScSh outperforms existing out-of-core systems by up to 82%.

Shang Liu, Zhan Shi, Dan Feng, Shuo Chen, Fang Wang, Yamei Peng
Game-Based Multi-MD with QoS Computation Offloading for Mobile Edge Computing of Limited Computation Capacity

Mobile edge computing (MEC) is a promising paradigm that brings cloud computing capabilities to the edge of the network, where it can serve mobile devices (MDs) running computation-intensive and delay-sensitive tasks. Facing the high demands of many MDs, it is essential for an MEC server with limited computation capacity to serve as many MDs as possible while meeting QoS. For each mobile device, low energy consumption within an expected deadline is also desirable. To solve these problems, we propose a Game-based Computation Offloading (GCO) algorithm, which comprises task offloading profiling and transmission power control using a non-cooperative game. Our mechanism maximizes the number of MDs served within their deadlines, while minimizing the energy consumption of each MD whose task is executed on the MEC server. Specifically, given an allocation of transmission power, a Greedy-Pruning algorithm is proposed to determine the set of tasks executed on the MEC server. Besides, each MD adapts its transmission power control strategy to compete for the computation resources of the MEC server or to minimize its energy consumption. A game model illustrating the task offloading problem is formulated to find a proper transmission power for each task, and the existence of a Nash equilibrium solution is proved. Simulation experiments evaluate the effectiveness of the proposed algorithm.

Junyan Hu, Chubo Liu, Kenli Li, Keqin Li
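
The abstract above does not give the Greedy-Pruning details, so the following is only one plausible reading of a deadline-aware greedy admission step, sketched under stated assumptions (the capacity model, task tuple format, and pruning rule are all illustrative, not the authors' algorithm):

```python
# Hypothetical sketch: admit offloaded tasks in deadline order and prune the
# most demanding admitted task whenever MEC capacity is exceeded.
def greedy_admit(tasks, mec_capacity):
    """tasks: list of (task_id, cpu_cycles, deadline); returns admitted ids."""
    admitted, used = [], 0
    for tid, cycles, deadline in sorted(tasks, key=lambda t: t[2]):
        admitted.append((tid, cycles, deadline))
        used += cycles
        while used > mec_capacity and admitted:
            worst = max(admitted, key=lambda t: t[1])  # largest CPU demand
            admitted.remove(worst)
            used -= worst[1]
    return [tid for tid, _, _ in admitted]
```
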

NOC and Networks

Frontmatter
KLSAT: An Application Mapping Algorithm Based on Kernighan–Lin Partition and Simulated Annealing for a Specific WK-Recursive NoC Architecture

Application mapping is a critical phase in NoC design because it strongly affects running time, network latency, and power consumption. To mitigate these problems for applications running on multicore architectures, we propose a novel application mapping algorithm, called the KLSAT mapping algorithm. It targets the triplet-based architecture (TriBA) topology, a WK-recursive network that conforms well to modular design thanks to its regularity and scalability. The KLSAT mapping algorithm exploits the advantages of both the Kernighan–Lin partitioning algorithm and the simulated annealing algorithm to reduce overall power consumption and network latency. Compared to a random mapping algorithm, the experimental results reveal that the solutions generated by the proposed algorithm reduce average power consumption and network latency by 6.4% and 12.2% when mapping 27 cores, and by 29.5% and 26.7% when mapping 81 cores, respectively.

XiaoJun Wang, Feng Shi, Hong Zhang
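
For readers unfamiliar with the simulated annealing half of such a mapping algorithm, here is a minimal generic skeleton, assuming a user-supplied cost(mapping) that combines power and latency estimates; the swap neighborhood and cooling schedule are illustrative, not KLSAT's:

```python
import math
import random

def anneal(mapping, cost, t0=100.0, alpha=0.95, steps=10_000):
    """Generic swap-based simulated annealing over a core-to-node mapping."""
    best, best_cost = list(mapping), cost(mapping)
    cur, cur_cost, t = list(mapping), best_cost, t0
    for _ in range(steps):
        i, j = random.sample(range(len(cur)), 2)
        cur[i], cur[j] = cur[j], cur[i]            # try swapping two placements
        new_cost = cost(cur)
        if new_cost < cur_cost or random.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost                    # accept (possibly worse) move
            if cur_cost < best_cost:
                best, best_cost = list(cur), cur_cost
        else:
            cur[i], cur[j] = cur[j], cur[i]        # reject: undo the swap
        t *= alpha                                 # cool down
    return best, best_cost
```
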
Modeling and Analysis of the Latency-Based Congestion Control Algorithm DX

Nowadays, low latency has become one of the primary goals of congestion control in data center networks. To achieve low latency, many congestion control algorithms have been proposed, among which DX is the first latency-based one. Specifically, DX tackles the accurate latency measurement problem, reduces flow completion time, and significantly outperforms the de facto DCTCP algorithm in terms of median queueing delay. Although the advantages of DX have been confirmed experimentally, its behaviors have not been fully revealed, and some drawbacks of DX under special environments remain unexplored. Therefore, in this paper, we conduct a fluid-flow analysis of DX, deducing a sufficient condition for its stability and revealing its behaviors. Analytical results uncover two problems of DX: (1) it has poor throughput when either the base RTT is very large or the number of flows is relatively small; (2) it suffers from large queueing delay when either the base RTT is relatively small or the number of flows is very large. These results are instructive for the improvement and deployment of DX. Simulation results based on NS-3 verify our analysis.

Wanchun Jiang, Lijuan Peng, Chang Ruan, Jia Wu, Jianxin Wang
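
To make "latency-based" concrete, here is a schematic window update in the style DX popularized: grow by one per RTT when no queueing delay is measured, otherwise shrink in proportion to the measured queueing delay. The exact DX equations and its headroom term are simplified here; this is an illustration, not DX itself:

```python
def update_cwnd(cwnd, rtt, base_rtt, v=1.0):
    """Latency-based AIMD-style update; v is an illustrative headroom term."""
    queueing_delay = max(rtt - base_rtt, 0.0)
    if queueing_delay == 0.0:
        return cwnd + 1.0                        # additive increase per RTT
    # Multiplicative decrease proportional to the measured queueing delay.
    return max(cwnd * (1.0 - queueing_delay / (queueing_delay + v)), 1.0)
```
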
Distributed Quality-Aware Resource Allocation for Video Transmission in Wireless Networks

The rapid development of wireless networks makes it more convenient for people to enjoy high-quality multimedia. However, video applications are throughput-demanding, and radio resources always seem insufficient in comparison. Hence, a distributed algorithm is designed in this paper to allocate limited wireless resources among multiple users for video streaming. To distinguish multimedia services from ordinary data transmission, a QoE-oriented utility function is considered first. Then, a potential game model is formulated in which all video receivers can update their rate strategies with very little information exchange. Through this updating, bandwidth allocation is achieved intelligently. The algorithm converges to a set of correlated equilibria. Numerical simulation results indicate that it brings remarkable benefits to both the resource provider and the video users.

Chao He, Zhidong Xie, Chang Tian

Neural Networks

Frontmatter
PRTSM: Hardware Data Arrangement Mechanisms for Convolutional Layer Computation on the Systolic Array

The systolic array is an array of processing units that share an internal data flow. Since the 2D systolic array naturally fits the multiply-and-accumulate (MAC) operation, many groups use it to accelerate DNN (deep neural network) computation. However, the performance of the systolic array is limited by data bandwidth. Some groups solve this problem with loop tiling but pay little attention to the pixel-reuse potential of the convolutional layer. In this paper, we propose PRTSM (Pixels Reuse with Time and Spatial Multiplexing), a novel method that reuses the pixels of the input feature map through time and spatial multiplexing. With it, we can significantly reduce bandwidth pressure and save data-preparation time for convolutional layers on the systolic array. We propose three algorithms for this method and implement the corresponding hardware mechanisms on a Xilinx FPGA XCVU440. Experiments show that our hardware mechanisms can reduce at least $$72.03\%$$ of the off-chip traffic and reach a peak performance of 64.034 GOPS at a frequency of 167 MHz.

Shuquan Wang, Lei Wang, Shiming Li, Tian Shuo, Shasha Guo, Ziyang Kang, Shuzheng Zhang, Weixia Xu
PParabel: Parallel Partitioned Label Trees for Extreme Classification

Extreme classification consists of extreme multi-class or multi-label prediction, whose objective is to learn classifiers that can label each data point with the most relevant labels. Approaches such as the 1-vs-all method have been proposed to accomplish the task, but their training time is linear in the number of classes, which makes them unrealistic in real-world applications such as text and image tagging. In this work, we present a two-stage thread-level parallel method based on Partitioned Label Trees for Extreme Classification (Parabel). Our method trains the tree nodes in different parallel ways according to their number of labels. We compare our algorithm with a recent state-of-the-art approach on publicly available real-world datasets with up to 670,000 labels. The experimental results demonstrate that our algorithm achieves the shortest training time.

Jiaqi Lu, Jun Zheng, Wenxin Hu
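
A minimal sketch of the two-stage idea described above, under stated assumptions: the Node type, train_node function, and label-count threshold are hypothetical placeholders, not the paper's code.

```python
from concurrent.futures import ThreadPoolExecutor

def train_tree(nodes, train_node, big_threshold=10_000, workers=8):
    """Train large tree nodes sequentially, small ones concurrently."""
    big = [n for n in nodes if n.label_count >= big_threshold]
    small = [n for n in nodes if n.label_count < big_threshold]
    for node in big:
        train_node(node)                     # stage 1: large nodes, one at a time
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(train_node, small))    # stage 2: small nodes in parallel
```
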
Statistical Analysis and Prediction of Parking Behavior

In China, more and more families own cars, and parking is undergoing a revolution from manual to automatic charging. In this revolution, understanding parking behavior and making effective predictions is important for parking companies and municipal policymakers. We obtained real parking data from a large parking company for parking behavior analysis and prediction. The dataset comes from a shopping mall in Ningbo, Zhejiang, and consists of 136,973 records over 396 days. Specifically, we mainly explore the impact of weather factors on parking behavior. We study several models and find that the random forest model makes the most accurate parking behavior predictions. Experiments show that the random forest model reaches 89% accuracy.

Ningxuan Feng, Feng Zhang, Jiazao Lin, Jidong Zhai, Xiaoyong Du
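
A minimal sketch of the kind of model the abstract reports on, assuming hypothetical calendar and weather features and synthetic data (the paper's actual features and dataset are not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 396                                    # one row per day, as in the dataset
X = np.column_stack([
    rng.integers(0, 7, n),                 # day of week (assumed feature)
    rng.integers(0, 2, n),                 # rain indicator (assumed feature)
    rng.normal(18, 8, n),                  # temperature in deg C (assumed)
])
# Synthetic target: weekend bump, rain penalty, plus noise.
y = 300 + 40 * (X[:, 0] >= 5) - 60 * X[:, 1] + rng.normal(0, 20, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out days:", model.score(X_te, y_te))
```
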

Big Data+Cloud

Frontmatter
ASTracer: An Efficient Tracing Tool for HDFS with Adaptive Sampling

Existing distributed tracing tools such as HTrace use static probabilistic samplers to collect function call trees for performance analysis, which may fail to capture important but rarely executed call trees and thus miss opportunities for performance optimization. To address this problem, we propose ASTracer, a new distributed tracing tool with two adaptive samplers. The advantage of adaptive samplers is that they adjust the sampling rate dynamically, capturing comprehensive function call trees while keeping the trace file at an acceptable size. In addition, we propose an auto-tuning mechanism to search for the optimal parameter settings of the adaptive samplers in ASTracer. The experimental results demonstrate that the adaptive samplers are more effective in tracing function call trees than a probabilistic sampler. Moreover, we provide several case studies to demonstrate how ASTracer can identify potential performance bottlenecks.

Yang Song, Yunchun Li, Shuhan Wu, Hailong Yang, Wei Li
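
A sketch in the spirit of such an adaptive sampler: lower the sampling rate for call trees seen very often and keep it high for rare ones, so rare trees are still captured while the trace stays small. The rate formula and target parameter are illustrative assumptions, not ASTracer's exact samplers:

```python
import random
from collections import defaultdict

class AdaptiveSampler:
    def __init__(self, target=100):
        self.seen = defaultdict(int)
        self.target = target              # desired samples per distinct tree

    def should_sample(self, tree_signature):
        """Sample each distinct call tree roughly `target` times overall."""
        self.seen[tree_signature] += 1
        rate = min(1.0, self.target / self.seen[tree_signature])
        return random.random() < rate
```
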
BGElasor: Elastic-Scaling Framework for Distributed Streaming Processing with Deep Neural Network

In the face of constant fluctuations and sudden bursts in data streams, the elasticity of distributed stream processing systems has become increasingly important. Proactive policies offer a powerful means to realize effective elastic scaling, but existing methods fail to capture the latent features of the data stream, which leads to poor prediction; poor prediction in turn results in high adaptation cost and instability. To address these issues, we propose BGElasor, a proactive and low-cost elastic-scaling framework based on accurate prediction using deep neural networks. It captures potentially complicated patterns to enhance prediction accuracy, reduce adaptation cost, and avoid adaptation bumps. The experimental results show that BGElasor not only improves prediction accuracy on three kinds of typical loads, but also ensures end-to-end QoS latency at low cost.

Weimin Mu, Zongze Jin, Junwei Wang, Weilin Zhu, Weiping Wang
High Performance DDoS Attack Detection System Based on Distribution Statistics

Nowadays, web servers often face the threat of distributed denial-of-service attacks, and their intrusion prevention systems cannot detect these attacks effectively. Many existing intrusion prevention systems detect attacks based on per-flow state, and their processing speed cannot fulfill the requirements of real-time detection under high-speed traffic. In this paper, we propose TreeSketchShield, a powerful system that improves the sketch data structure and detects attacks quickly. First, we discuss TreeSketch, a novel structure for gathering network flow statistics that uses the stepped structure of a binary tree to map the distribution and reduces the complexity of the statistics calculation. Second, we present a two-level detection scheme that strikes a compromise between detection speed and detection accuracy. Experimental results show that our method can process more than 100,000 records per second, and the false alarm rate improves by 2% to 25%.

Xia Xie, Jinpeng Li, Xiaoyang Hu, Hai Jin, Hanhua Chen, Xiaojing Ma, Hong Huang
DDP-B: A Distributed Dynamic Parallel Framework for Meta-genomics Binary Similarity

Great efforts have been made in meta-genomics for new species exploration over the past decades. With the development of next-generation sequencing technology, meta-genomics datasets have grown to hundreds of gigabytes or even several terabytes, which brings a severe challenge to data analysis. Besides, conventional meta-genomics comparison algorithms may not take full advantage of the computing capacity of parallel techniques due to their lack of parallelism. In this paper, we propose DDP-B, a distributed dynamic parallel framework for meta-genomics binary similarity analysis, to overcome these limitations. In this framework, we introduce a binary distance algorithm for meta-genomics similarity measurement and develop parallel implementations of the algorithm at different levels of granularity using MPI, OpenMP, and SIMD techniques. Moreover, we establish a dynamic scheduling method to dispatch asynchronous parallel computing tasks and design a distributed cluster to deploy the dynamic parallel system, which compares 2.97K pairs of meta-genomics vectors per second and achieves a 134.79x speedup over the baseline under optimal conditions. Our framework shows stable scalability under larger workloads.

Mengxian Chi, Xu Jin, Feng Li, Hong An
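
To illustrate what a binary distance between presence/absence vectors looks like, here is a vectorized Jaccard-style version; numpy stands in for the paper's SIMD/OpenMP kernels, and the exact distance definition used by DDP-B is an assumption here:

```python
import numpy as np

def binary_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard-style distance over boolean feature vectors."""
    both = np.count_nonzero(a & b)      # features present in both samples
    either = np.count_nonzero(a | b)    # features present in either sample
    return 1.0 - both / either if either else 0.0

a = np.random.randint(0, 2, 1_000_000).astype(bool)
b = np.random.randint(0, 2, 1_000_000).astype(bool)
print(binary_distance(a, b))
```
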
Optimal Resource Allocation Through Joint VM Selection and Placement in Private Clouds

The goal of private cloud platforms is to optimize the resource allocation process and minimize the expense of processing tasks. Essentially, resource allocation in clouds involves two phases, virtual machine selection (VMS) and virtual machine placement (VMP), which can be considered jointly. However, existing solutions separate VMS and VMP and therefore reach only locally optimal resource utilization. In this paper, we explore how to optimize resource allocation globally by considering VMS and VMP jointly. First, we formulate the joint virtual machine selection and placement (JVMSP) problem and prove its NP-hardness. Then, we propose the Resource-Decoupling algorithm, which converts the JVMSP problem into two independent sub-problems: Max-Capability and Min-Cost. We prove that the optimal solutions of the two sub-problems guarantee the optimal solution of the JVMSP problem. Furthermore, we design the efficient Max-Balanced-Utility and Extent-Greedy heuristic algorithms to solve Max-Capability and Min-Cost, respectively. We evaluate our algorithms on datasets with different resource distributions, and the results demonstrate that they significantly improve resource utilization efficiency compared with traditional solutions and existing algorithms.

Hongkun Chen, Feilong Tang, Linghe Kong, Wenchao Xu, Xingjun Zhang, Yanqin Yang
A Parallel Multi-keyword Top-k Search Scheme over Encrypted Cloud Data

With searchable encryption in cloud computing, users can outsource their sensitive data to the cloud in ciphertext while retaining efficient, privacy-preserving multi-keyword top-k search. However, most existing top-k search schemes over encrypted cloud data are centralized schemes, which are limited in large-scale data environments. To support scalable searches, we propose a parallel multi-keyword top-k search scheme over encrypted cloud data. In this scheme, a fragment-based encrypted inverted index is designed, which is indistinguishable and can be used for parallel searching. On the basis of such indexes, a MapReduce-based distributed computing framework is adopted to implement the parallel multi-keyword top-k search algorithms. Security analysis and experimental evaluation show that the proposed scheme is privacy-preserving, efficient, and scalable.

Maohu Yang, Hua Dai, Jingjing Bao, Xun Yi, Geng Yang
N-Docker: A NVM-HDD Hybrid Docker Storage Framework to Improve Docker Performance

Docker has been widely adopted in production environments, but deployment and cold-start of containers are unfortunately limited by the low speed of disks. The emerging non-volatile memory (NVM) technology, which is fast and stores data permanently, brings a new chance to accelerate container deployment and cold-start. However, replacing the whole hard disk drive (HDD) with NVM is expensive. To achieve the fastest deployment and cold-start at the lowest cost, we conduct an in-depth analysis of the Top-134 images in Docker Hub and obtain two main insights: (1) the storing latency of layered images has become the bottleneck of container deployment; (2) only a few image layers are required for container cold-start. Based on these two findings, we propose N-Docker, an NVM-HDD hybrid Docker storage framework. It accelerates container cold-start by detecting the bottleneck layers as well as the layers required for cold-start, and storing them in NVM for faster container startup with limited NVM capacity. Experimental results show that N-Docker can accelerate container deployment by 1.21X and cold-start by 2.96X. Compared to NVM-Docker, which stores all images in NVM, N-Docker achieves the same performance improvements while reducing NVM usage by 88.22%.

Lin Gu, Qizhi Tang, Song Wu, Hai Jin, Yingxi Zhang, Guoqiang Shi, Tingyu Lin, Jia Rao

HPC

Frontmatter
MMSR: A Multi-model Super Resolution Framework

Single image super-resolution (SISR), as an important image processing method, has received great attention from both industry and academia. Currently, most super-resolution image reconstruction approaches are based on deep-learning techniques and usually focus on the design and optimization of different network models, but they tend to ignore the differences among image texture features and use the same model to train all input images, which greatly reduces training efficiency. In this paper, we build a framework that improves training efficiency by assigning an appropriate model to each type of image according to its texture characteristics, and we propose MMSR, a multi-model super resolution framework. In this framework, all input images are classified by an approach called TVAT (Total Variance above the Threshold). Experimental results indicate that our MMSR framework brings a 66.7% performance speedup on average without influencing the accuracy of the resulting HR images. Moreover, the MMSR framework exhibits good scalability.

Ninghui Yuan, Zhihao Zhu, Xinzhou Wu, Li Shen
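
The abstract does not define TVAT precisely, so the following is only a guess at its flavor, clearly hedged: score an image by the fraction of local gradient magnitudes exceeding a threshold, then route it to a texture-appropriate SR model. The threshold values and the three-model split are illustrative assumptions:

```python
import numpy as np

def tvat_score(img: np.ndarray, tau: float = 10.0) -> float:
    """Fraction of local variation (first differences) above threshold tau."""
    gx = np.abs(np.diff(img.astype(float), axis=1))
    gy = np.abs(np.diff(img.astype(float), axis=0))
    tv = np.concatenate([gx.ravel(), gy.ravel()])
    return float(np.mean(tv > tau))

def pick_model(img, models, cuts=(0.05, 0.2)):
    """Route an image to one of three hypothetical texture-class SR models."""
    s = tvat_score(img)
    key = "smooth" if s < cuts[0] else "medium" if s < cuts[1] else "textured"
    return models[key]
```
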
HiPower: A High-Performance RDMA Acceleration Solution for Distributed Transaction Processing

Increasingly complex tasks and growing data sizes have necessitated distributed transaction processing (DTP), which decouples tasks and data among multiple nodes for joint processing. However, compared with the revolutionary development of computation power, network capability falls relatively behind, leaving communication as an ever more distinct bottleneck. This paper focuses on the emerging RDMA technology, which can greatly improve communication performance but often cannot be well exploited due to improper interaction design between the requester and the responder. Our research finds that the typical implementation of confirming per work request (CPWR) triggers considerable CPU involvement, which further degrades the overall performance of RDMA communication. Targeting this, we propose HiPower, which leverages a batched confirmation scheme with lower CPU utilization to improve high-frequency communication efficiency. Our experiments show that, compared with CPWR, HiPower improves communication efficiency by up to 75% and reduces CPU cost by up to 79%, which speeds up the overall FCT (Flow Completion Time) by up to 14% on a real workload (ResNet-152).

Runhua Zhang, Yang Cheng, Jinkun Geng, Shuai Wang, Kaihui Gao, Guowei Shen

Emerging Topics

Frontmatter
LDAPRoam: A Generic Solution for Both Web-Based and Non-Web-Based Federate Access

Identity federation technology has been widely used in recent years, but the solutions for federated access are totally different between Web-based and non-Web-based scenarios, and support for non-Web-based scenarios is still highly limited. This paper proposes a generic federated access solution based on LDAP roaming, which can provide reliable identity roaming for any internet service. To service providers, our solution is transparent and looks like an ordinary LDAP server. The paper first presents the difficulties in realizing LDAP roaming and offers solutions for its implementation. It then evaluates the ease of integration and usability of LDAP roaming. Finally, it compares the generic solution with existing federated access solutions.

Qi Feng, Wei Peng
Characterizing Perception Module Performance and Robustness in Production-Scale Autonomous Driving System

Autonomous driving is a field that attracts great interest in academia and industry and represents one of the most important challenges of the coming years. Although the individual algorithms of autonomous driving have been studied and are well understood, there is still a lack of study of these tasks in production-scale systems. In this work, we profile and analyze the perception module of the open-source autonomous driving system Apollo, developed by Baidu, in terms of response time and robustness against sensor errors. The perception module is fundamental to the proper functioning and safety of autonomous driving; it relies on several sensors, such as LIDARs and cameras, to detect obstacles and perceive the surrounding environment. We identify the computation characteristics and potential bottlenecks in the perception module. Furthermore, we design multiple noise models for camera frames and LIDAR point clouds to test the robustness of the whole module in terms of accuracy drop against a noise-free baseline. Our insights are useful for future performance and robustness optimization of autonomous driving systems.

Alessandro Toschi, Mustafa Sanic, Jingwen Leng, Quan Chen, Chunlin Wang, Minyi Guo
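
Two simple noise injectors of the general kind such robustness studies use, sketched under stated assumptions (the noise parameters and these exact models are illustrative, not necessarily the paper's):

```python
import numpy as np

def noisy_frame(frame: np.ndarray, sigma: float = 8.0) -> np.ndarray:
    """Additive Gaussian noise on an 8-bit camera frame."""
    noise = np.random.normal(0.0, sigma, frame.shape)
    return np.clip(frame.astype(float) + noise, 0, 255).astype(np.uint8)

def drop_points(cloud: np.ndarray, drop_prob: float = 0.1) -> np.ndarray:
    """Random point dropout on an (N, 3) LIDAR point cloud."""
    keep = np.random.random(len(cloud)) >= drop_prob
    return cloud[keep]
```
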

Memory and File System

Frontmatter
Spindle: A Write-Optimized NVM Cache for Journaling File System

Journaling techniques are widely employed in modern file systems to guarantee crash consistency. However, journaling usually degrades system performance due to the frequent storage accesses it entails. Architects can utilize emerging non-volatile memory (NVM) as a persistent cache or journaling device to reduce the storage accesses of journaling file systems, yet problems such as double writes, metadata write amplification, and heavy transaction ordering overhead still exist in current solutions. Therefore, we propose Spindle, a write-optimized NVM cache that addresses these challenges. Spindle decouples data and metadata accesses by processing data in DRAM while pinning metadata in NVM. With a redesigned metadata log and state switch mechanism, Spindle eliminates double writes and relieves metadata write amplification. Moreover, Spindle adopts a lightweight transaction scheme to guarantee crash consistency and reduce transaction ordering overhead. Experimental results reveal that Spindle achieves up to $$47\%$$ throughput improvement compared with a state-of-the-art design.

Ge Yan, Kaixin Huang, Linpeng Huang
Two-Erasure Codes from 3-Plexes

We present a family of parity array codes called 3-PLEX for tolerating two disk failures in storage systems. It uses only exclusive-or operations to compute parity symbols. We give two data/parity layouts for 3-PLEX: (a) when the number of disks in the array is at most 6, we use a horizontal layout similar to EVENODD codes; (b) otherwise, we choose a hybrid layout like that of HoVer codes. The major advantages of 3-PLEX are that it has theoretically optimal encoding/decoding/updating complexity and that the number of disks in a 3-PLEX array is less constrained than in other array codes, which enables greater parameter flexibility for trade-offs between storage efficiency and performance.

Liping Yi, Rebecca J. Stones, Gang Wang
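
The XOR primitive underlying such array codes is simple: a parity strip is the bytewise XOR of the data strips, and any single lost strip is the XOR of all the others. The sketch below shows only this primitive, not the 3-PLEX layouts themselves:

```python
import numpy as np

def parity(strips: list[np.ndarray]) -> np.ndarray:
    """Bytewise XOR of equal-length strips."""
    p = np.zeros_like(strips[0])
    for s in strips:
        p ^= s
    return p

data = [np.random.randint(0, 256, 4096, dtype=np.uint8) for _ in range(5)]
p = parity(data)
recovered = parity(data[1:] + [p])   # reconstruct the lost strip data[0]
assert np.array_equal(recovered, data[0])
```
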
Deep Fusion: A Software Scheduling Method for Memory Access Optimization

Deep neural networks (DNNs) are considered state-of-the-art artificial intelligence methods across a very broad range of applications, but they are compute- and memory-intensive, which makes them difficult to employ in practical scenarios. Thanks to their favorable parallel computing ability, a series of DNN accelerators have been proposed. However, the improvement of on-chip computing capacity and the increasing number of parameters in neural networks make memory access a bottleneck. In this paper, we analyze existing DNN algorithms and observe that the special structure of neural networks gives them two useful characteristics: unilateral directivity and local independence. Based on these characteristics, we propose a general software scheduling method to reduce memory access cost. Experimental results show that our method reduces memory access cost by 32% and achieves a speedup of 1.6x on average on our experimental platform; the best result, on ResNet-50, is up to 56% and 2.62x.

Yimin Zhuang, Shaohui Peng, Xiaobing Chen, Shengyuan Zhou, Tian Zhi, Wei Li, Shaoli Liu
Optimizing Data Placement on Hierarchical Storage Architecture via Machine Learning

As storage hierarchies get deeper on modern high-performance computing systems, intelligent data placement strategies that choose the optimal storage tier dynamically are the key to realizing the potential of hierarchical storage architectures. However, providing a general solution that can be applied across different storage architectures and diverse applications is challenging. In this paper, we propose the adaptive storage learner (ASL), which explores the idea of using machine learning techniques to mine the relationship between data placement strategies and I/O performance under varied workflow characteristics and system statuses, and uses the learned model to choose the optimal storage tier intelligently. We implement a prototype and integrate it into an existing data management system. Empirical comparison based on real scientific workflow tests shows that ASL is capable of combining workflow characteristics and real-time system status to make optimal data placement decisions.

Peng Cheng, Yutong Lu, Yunfei Du, Zhiguang Chen, Yang Liu

Short Papers

Frontmatter
I/O Optimizations Based on Workload Characteristics for Parallel File Systems

Parallel file systems usually provide a unified storage solution, which fails to meet specific application needs. In this paper, we propose an extended file handle scheme to address this problem. It allows the file system to specify optimizations for individual files or directories based on workload characteristics. One case study shows that our proposed approach improves the aggregate throughput of large files and small files by up to 5% and 30%, respectively. To further improve the access performance of small files in parallel file systems, we also propose a new metadata-based small-file optimization method. The experimental results show that the aggregate throughput of small files can be effectively improved by our method.

Bing Wei, Limin Xiao, Bingyu Zhou, Guangjun Qin, Baicheng Yan, Zhisheng Huo
Energy Consumption of IT System in Cloud Data Center: Architecture, Factors and Prediction

In recent years, as cloud data centers have grown constantly in size and quantity, their energy consumption has increased dramatically, so it is of great significance to study the energy-saving issues of cloud data centers in depth. This paper analyzes the energy consumption architecture of the IT system in cloud data centers and proposes a new framework for collecting energy consumption data. Based on this framework, the factors affecting energy consumption are studied, and various parameters closely related to energy consumption are selected. Finally, an RBF neural network is used to model and predict the energy consumption of cloud data centers, with the aim of verifying the accuracy of the collection framework and the influencing factors. The experimental results show that the parameters selected under this framework give better accuracy and adaptability for predicting the energy consumption of cloud data centers than previous energy consumption prediction models.

Haowei Lin, Xiaolong Xu, Xinheng Wang
Efficient Processing of Convolutional Neural Networks on SW26010

Artificial intelligence has developed rapidly in recent years, and deep neural networks are the basis of many artificial intelligence applications, so accelerating their computational processing is very important. To explore the potential for accelerating deep neural networks on various hardware platforms, we propose a convolutional neural network optimization method based on the weight-stationary scheme for the SW26010 processor. We restructure the convolution loops and use a hybrid DMA transmission mode to increase memory bandwidth and reduce memory access overhead. On top of those, further optimizations are applied based on register communication, double-buffered asynchronous DMA transfer, instruction scheduling, and other schemes. Finally, we achieve double-precision convolution performance over 2.4 Tflops, reaching 81% of the processor's peak performance. Across multiple parameter settings, we achieve a $$2.4-4.0\times$$ speedup compared to the Tesla K80 GPU with cuDNN v7.

Yi Zhang, Bing Shu, Yan Yin, Yawei Zhou, Shaodi Li, Junmin Wu
ADMMLIB: A Library of Communication-Efficient AD-ADMM for Distributed Machine Learning

The alternating direction method of multipliers (ADMM) has recently been identified as a compelling approach for solving large-scale machine learning problems in a cluster setting. To reduce synchronization overhead in distributed environments, asynchronous distributed ADMM (AD-ADMM) was proposed. However, due to the high communication overhead of the master-slave architecture, AD-ADMM still does not scale well. To address this challenge, this paper proposes ADMMLIB, a library of AD-ADMM for distributed machine learning. We employ a set of network optimization techniques: first, a hierarchical communication architecture; second, ring-based allreduce and mixed-precision training, integrated into ADMMLIB to further reduce inter-node communication cost. Evaluation with large datasets demonstrates that ADMMLIB achieves significant speedups of up to 2x over the original AD-ADMM implementation, while reducing overall communication cost by 83%.

Jinyang Xie, Yongmei Lei
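
To show why ring-based allreduce cuts inter-node traffic, here is a single-process toy simulation of the chunked reduce-scatter/allgather pattern: each of N workers exchanges one chunk per step, so per-worker traffic stays O(size) regardless of N. Real deployments would use MPI or NCCL; this sketch only illustrates the idea, not ADMMLIB's implementation:

```python
import numpy as np

def ring_allreduce(grads: list[np.ndarray]) -> np.ndarray:
    """Sum equal-length 1-D gradient vectors with a simulated ring allreduce."""
    n = len(grads)
    chunks = [np.array_split(g.copy(), n) for g in grads]
    for step in range(n - 1):                      # reduce-scatter phase
        for rank in range(n):
            src = (rank - step) % n                # chunk this rank forwards
            chunks[(rank + 1) % n][src] += chunks[rank][src]
    for step in range(n - 1):                      # allgather phase
        for rank in range(n):
            src = (rank + 1 - step) % n            # fully reduced chunk
            chunks[(rank + 1) % n][src] = chunks[rank][src]
    return np.concatenate(chunks[0])

parts = [np.ones(12) * r for r in range(4)]
print(ring_allreduce(parts))                        # every element is 0+1+2+3
```
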
Energy-Aware Resource Scheduling with Fault-Tolerance in Edge Computing

Edge computing extends computation and storage resources to the edge of the network, which largely alleviates the performance problems that cloud computing incurs from bandwidth limitations. However, it still faces the challenges of energy and reliability. In this paper, we propose an energy-aware fault-tolerant resource scheduling algorithm to improve system reliability while minimizing energy consumption. We first allocate resources to tasks with a reliability- and energy-aware resource scheduling method. Then, CPU temperature prediction and time-between-failures (TBF) prediction are used to trigger a proactive fault tolerance mechanism (VM migration). The experimental results show that reliability is greatly improved while the energy consumption generated by VM migration remains modest compared to other methods.

Yanfen Xue, Guisheng Fan, Huiqun Yu, Huaiying Sun
DIN: A Bio-Inspired Distributed Intelligence Networking

Software-Defined Networking (SDN) is a promising method to simplify network management and facilitate network evolution. However, SDN is a logically centralized technology with a global network-wide view, and it faces problems of scalability and reliability. In this paper, we propose a novel method termed Distributed Intelligence Networking (DIN). DIN optimizes network management based on the distributed coordination of multiple forwarding nodes, similar to the coordination in bird flocking motion; it is a fully physically and logically distributed structure based on a neighbor network-wide view. This architecture naturally has the advantages of scalability and reliability.

Yufeng Li, Yankang Du, Chenhong Cao, Han Qiu
A DAG Refactor Based Automatic Execution Optimization Mechanism for Spark

In today's big data era, the traditional disk-based MapReduce big data framework has encountered bottlenecks due to its low memory utilization and inefficient orchestration of complex tasks. Taking full advantage of memory resources, Spark provides many data manipulation operators and uses a DAG to express their dependences. Spark splits an entire job into multiple stages according to the DAG and schedules them in a distributed execution environment, which is better adapted to the new characteristics of big data processing. However, Spark does not consider the resource requirements of different operators and schedules them indiscriminately, which can cause load imbalance across the nodes of a cluster and turn some nodes into bottlenecks due to extraordinary resource consumption. In the past, solving this problem required developers to have extensive Spark experience and write sophisticated code. In this paper, we propose a DAG-refactor-based automatic execution optimization mechanism for Spark. The experimental results show that the DAG refactor mechanism can greatly improve Spark performance, by up to 8.8X, without changing the original program semantics.

Hang Zhao, Yu Rao, Donghua Li, Jie Tang, Shaoshan Liu
BTS: Balanced Task Scheduling Strategy Based on Multi-resource Prediction and Allocation in Cloud Environment

Cloud computing is a new computing paradigm equipped with large-scale servers to satisfy diverse application demands. Managing and scheduling various application tasks on cloud servers is very challenging. In this paper, we propose a Balanced Task Scheduling (BTS) strategy that combines multi-objective particle swarm optimization and a time series prediction model to achieve better load balance among cloud servers. We consider not only the current server load, which is what most existing scheduling methods use, but also the predicted future load change. Experiments on the public Alibaba cluster trace with 1310 servers show that the proposed strategy achieves more balanced resource utilization.

Yongzhong Sun, Kejiang Ye, Wenbo Wang, Cheng-Zhong Xu
DAFL: Deep Adaptive Feature Learning for Network Anomaly Detection

With the rapid development of the Internet and the growing complexity of network topologies, network anomalies have become more diverse. In this paper, we propose an algorithm named Deep Adaptive Feature Learning (DAFL) for traffic anomaly detection based on a deep learning model. By setting proper feature parameters $$\theta$$ on the neural network structure, DAFL can effectively generate low-dimensional new abstract features. Experimental results show that the DAFL algorithm has good adaptability and robustness, effectively improving detection accuracy while significantly reducing detection time.

Shujian Ji, Tongzheng Sun, Kejiang Ye, Wenbo Wang, Cheng-Zhong Xu
SIRM: Shift Insensitive Racetrack Main Memory

Racetrack memory (RM) is a potential DRAM alternative due to its high density, low energy cost, and access latency comparable to SRAM. We therefore propose SIRM, a shift-insensitive racetrack main memory architecture. SIRM provides uniform access latency to the upper system, which makes it easy to manage. Experiments demonstrate that RM can outperform DRAM in main memory designs with higher density and energy efficiency.

Hongbin Zhang, Bo Wei, Youyou Lu, Jiwu Shu
PDRM: A Probability Distribution Based Resource Management for Batch Workloads in Heterogeneous Cluster

Predicting resource consumption and dynamically provisioning resources based on historical consumption are common methods to improve cluster resource utilization; however, they face the challenge that fluctuations in resource consumption hinder accurate prediction. We propose PDRM, an efficient resource management scheme for batch workloads based on resource consumption probability distributions, to deal with this dilemma. Based on the observation that tasks of the same type have similar resource consumption on the same node, we derive the resource consumption probability distribution of each task type to describe the fluctuations in its resource consumption. Based on this distribution function, we can allocate resources precisely for tasks. Experimental results demonstrate that PDRM performs well for various applications in a heterogeneous cluster, effectively improving resource utilization and reducing job completion time.

Jun Zhou, Dan Feng, Fang Wang
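
A minimal sketch of distribution-based provisioning in this spirit: model a task type's per-node consumption from history and allocate to a high quantile rather than the mean, so fluctuations rarely exceed the allocation. The 95th-percentile choice and the synthetic samples are illustrative assumptions, not PDRM's actual scheme:

```python
import numpy as np

def allocation(history_mb: np.ndarray, q: float = 95.0) -> float:
    """Provision to the q-th percentile of observed memory consumption."""
    return float(np.percentile(history_mb, q))

samples = np.random.normal(1200, 150, 500)    # hypothetical memory samples (MB)
print(f"allocate {allocation(samples):.0f} MB per task of this type")
```
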
Collaborating CPUs and MICs for Large-Scale LBM Multiphase Flow Simulations

This paper highlights the use of the OpenMP 4.5 accelerator programming model to combine CPUs and Intel Many Integrated Core (MIC) co-processors for large-scale LBM multiphase flow simulations on the Tianhe-2 supercomputer. To enhance the collaborative efficiency of intra-node CPUs and co-processors, we propose a flexible load balance model with heterogeneous domain decomposition for CPU-MIC task allocation, as well as asynchronous offloading to overlap operations of CPUs and multiple MICs. Tests on a 3D multiphase (liquid and gas) problem with about 100 billion lattice sites, simulating drop impact under gravity using a D3Q19 Lattice Boltzmann discretization and the Shan-Chen BGK single-relaxation-time collision model, achieve a weak parallel efficiency above 80% when scaling from 128 to 2048 compute nodes.

Chuanfu Xu, Xi Wang, Dali Li, Yonggang Che, Zhenghua Wang
Multiple Algorithms Against Multiple Hardware Architectures: Data-Driven Exploration on Deep Convolution Neural Network

With the rapid development of deep learning (DL), various convolution neural network (CNN) models have been developed, and many accelerators have been proposed to execute different DL workloads efficiently. To guide the design of both CNN models and hardware architectures for high-performance inference systems, we choose five types of CNN models, test them on six processors, and measure three metrics. From our experiments, we draw two observations and derive two insights for the design of CNN algorithms and hardware architectures.

Chongyang Xu, Zhongzhi Luan, Lan Gao, Rui Wang, Han Zhang, Lianyi Zhang, Yi Liu, Depei Qian
A Parallel Retinex Image Enhancement Algorithm Based on OpenMP

The Retinex image enhancement algorithm plays an important role in eliminating uneven exposure, low contrast, and smog effects in images. However, with increasing image resolution, the real-time performance of the serial Retinex algorithm no longer satisfies the requirements of practical applications. This paper proposes an OpenMP-based parallel Retinex algorithm. The parallelism of the Retinex algorithm is first identified through theoretical analysis. Then, the time-consuming sub-algorithms of the serial algorithm, such as Gaussian convolution and exponential transformation, are redesigned and executed in parallel. Experimental results on the Tianhe-2 supercomputer platform show that the speedup of the parallel algorithm is significant, with an average speedup of 12 on the test image set. This indicates that the parallel algorithm can satisfy the needs of real-time processing in the image enhancement field.

Shixiong Cheng, Bin Liu, Dongjian He, Jinrong He, Yuancheng Li, Yanning Du
Backmatter
Metadata
Title
Network and Parallel Computing
Edited by
Dr. Xiaoxin Tang
Quan Chen
Pradip Bose
Weiming Zheng
Jean-Luc Gaudiot
Copyright Year
2019
Electronic ISBN
978-3-030-30709-7
Print ISBN
978-3-030-30708-0
DOI
https://doi.org/10.1007/978-3-030-30709-7
