2022 | Book

Network and Parallel Computing

18th IFIP WG 10.3 International Conference, NPC 2021, Paris, France, November 3–5, 2021, Proceedings

About this book

This book constitutes the proceedings of the 18th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2021, which was held in Paris, France, during November 3–5, 2021.

The 20 papers presented in this volume were carefully reviewed and selected from 62 submissions. They were organized in topical sections as follows: algorithms and applications; system software and resource management; storage; and networks and communications.

Table of Contents

Frontmatter

Algorithms and Applications

Frontmatter
High Resolution of City-Level Climate Simulation by GPU with Multi-physical Phenomena
Abstract
In this paper, we describe the Graphics Processing Unit (GPU) implementation of our City-LES code for detailed large-eddy simulations, covering multi-physical phenomena including fluid dynamics, heat absorption and reflection by surface and building materials, cloud effects, and even sunlight effects. Because a detailed simulation involving these phenomena is required for street-level analyses at a resolution of several meters, the amount of computation is enormous, and ordinary CPU computation cannot provide sufficient performance. Therefore, we implemented the entire code on GPU clusters for large-scale computing. We applied OpenACC, which allows incremental porting with relatively easy programming, and eliminated data transfers between the CPU and GPU memories. Based on this research, we determined that eliminating data transfers is effective even when parts of the code run slower on the GPU than on the CPU owing to the absence of spatial parallelism. The objective of this study is to perform a complete climate simulation of a few square kilometers around Tokyo Station, at the finest resolution, covering the area originally highlighted for the marathon race of the Olympic Games Tokyo 2020. We successfully ported the entire code to the GPU, achieving approximately eight times the performance of CPU-only computation on a large-scale cluster with multiple GPUs per node.
Koei Watanabe, Kohei Kikuchi, Taisuke Boku, Takuto Sato, Hiroyuki Kusaka
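As an illustration of the porting strategy described above, the sketch below shows the basic OpenACC pattern of keeping fields resident in GPU memory across a whole time loop, so that no per-kernel CPU-GPU transfer occurs even for loops that run slower on the GPU. The field names and update formulas are placeholders, not the actual City-LES code.

```cpp
// Minimal OpenACC sketch (hypothetical fields and physics): one data
// region spans the whole time loop, so fields stay resident in GPU
// memory and no per-kernel CPU<->GPU transfer occurs -- even for
// kernels that would run faster on the CPU.
void simulate(double *u, double *t, int n, int steps) {
    #pragma acc data copy(u[0:n], t[0:n])   // transferred in once, out once
    for (int s = 0; s < steps; ++s) {
        #pragma acc parallel loop present(u[0:n], t[0:n])
        for (int i = 0; i < n; ++i)
            u[i] += 0.1 * t[i];             // stand-in for fluid dynamics
        #pragma acc parallel loop present(u[0:n], t[0:n])
        for (int i = 0; i < n; ++i)
            t[i] += 0.01 * u[i];            // stand-in for surface heat effects
    }
}
```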
dgQuEST: Accelerating Large Scale Quantum Circuit Simulation through Hybrid CPU-GPU Memory Hierarchies
Abstract
With the advancement of quantum computing, verifying the correctness of quantum circuits becomes critical when developing new quantum algorithms. Constrained by the obstacles of building practical quantum computers, quantum circuit simulation has become a feasible approach to develop and verify quantum algorithms. Although many quantum simulators are available, they either achieve low performance on CPUs or support only a limited simulation scale (e.g., number of qubits) on GPUs due to limited memory capacity. Therefore, we propose dgQuEST, a novel acceleration method that utilizes hybrid CPU-GPU memory hierarchies for large-scale quantum circuit simulation across multiple nodes. dgQuEST adopts efficient memory management and communication schemes to leverage the distributed CPU and GPU memories for accelerating large-scale quantum simulation. Our evaluation demonstrates that dgQuEST achieves an average speedup of 403\(\times\) compared to QuEST on quantum circuit simulations with 32 qubits, and scales to simulations with 35 qubits on two GPU nodes, far beyond what the state-of-the-art implementation HyQuas can support.
Tianyu Feng, Siyan Chen, Xin You, Shuzhang Zhong, Hailong Yang, Zhongzhi Luan, Depei Qian
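For context, the inner loop of a QuEST-style state-vector simulator applies a 2×2 gate to pairs of amplitudes whose indices differ only in the target qubit's bit; dgQuEST's contribution is distributing and moving this \(2^n\)-amplitude vector across CPU and GPU memories. A minimal single-node sketch of the gate kernel (not dgQuEST's actual code):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using amp = std::complex<double>;

// Apply a 2x2 single-qubit gate to qubit q of an n-qubit state vector.
// The gate mixes pairs of amplitudes whose indices differ only in bit q.
// dgQuEST-style simulators partition this 2^n vector across CPU and GPU
// memories on multiple nodes; that machinery is not shown here.
void apply_gate(std::vector<amp> &state, int q,
                amp g00, amp g01, amp g10, amp g11) {
    const std::size_t stride = std::size_t{1} << q;
    for (std::size_t base = 0; base < state.size(); base += 2 * stride)
        for (std::size_t i = base; i < base + stride; ++i) {
            const amp a0 = state[i], a1 = state[i + stride];
            state[i]          = g00 * a0 + g01 * a1;   // bit q = 0 amplitude
            state[i + stride] = g10 * a0 + g11 * a1;   // bit q = 1 amplitude
        }
}
```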
vSketchDLC: A Sketch on Distributed Deep Learning Communication via Fine-grained Tracing Visualization
Abstract
The intensive communication cost of synchronizing gradients and parameters is becoming the bottleneck of distributed deep learning training, so measuring communication operations effectively is crucial for optimizing it. However, many existing communication measurement tools, such as the MXNet profiler, still suffer from serious limitations. Specifically, they cannot satisfy two requirements simultaneously: fine-grained collection of low-level communication operations and user-friendly analysis of comprehensive measurement results. In this paper, we make the first attempt to propose an open-source, fine-grained, and user-friendly communication measurement tool on top of MXNet, called vSketchDLC. vSketchDLC can trace low-level communication events between the framework and the communication library interface, and capture end-to-end push and pull communications between workers and servers. It generates communication records in a standard format, enabling users to analyze the traces with off-the-shelf visualization tools such as Chrome Trace Viewer. Our design exploits in-memory buffers and asynchronous record writes to ensure that measurement activities do not impact training performance. We conduct extensive experiments on a public-cloud GPU cluster to verify the effectiveness of vSketchDLC for MXNet. Experimental results show that vSketchDLC empowers users to analyze fine-grained communication records through friendly interactions and to identify potential training bottlenecks from multiple perspectives, including training timeline and iterations, DNN layers, and workers or servers. Users can visually inspect the relationships between communications, e.g., by highlighting a selected period of traces or zooming in and out, to identify the root causes of communication bottlenecks and to improve training performance.
Yanghai Wang, Shuo Ouyang, Dezun Dong, Enda Yu, Xiangke Liao
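The standard format mentioned above is the Chrome Trace Event JSON format, which Chrome Trace Viewer loads directly: an array of events carrying a name, category, phase, timestamp, and duration. A minimal sketch emitting two complete ("X") events, with illustrative names only (not vSketchDLC's actual record contents):

```cpp
#include <cstdio>

// Emit two complete ("X") events in the Chrome Trace Event JSON format:
// name, category, phase, timestamp and duration (microseconds), and
// process/thread ids. Chrome Trace Viewer loads this output directly.
int main() {
    std::printf("[\n");
    std::printf("{\"name\":\"push:layer3\",\"cat\":\"comm\",\"ph\":\"X\","
                "\"ts\":1000,\"dur\":250,\"pid\":0,\"tid\":1},\n");
    std::printf("{\"name\":\"pull:layer3\",\"cat\":\"comm\",\"ph\":\"X\","
                "\"ts\":1300,\"dur\":180,\"pid\":0,\"tid\":1}\n");
    std::printf("]\n");
    return 0;
}
```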
Scalable Algorithms Using Sparse Storage for Parallel Spectral Clustering on GPU
Abstract
Spectral clustering has many fundamental advantages over k-means, but its high computational complexity (\(\mathcal{O}(n^3)\)) and memory requirement (\(\mathcal{O}(n^2)\)) make it prohibitively expensive for large datasets. In this paper we present our GPU solution to address the scalability challenge of spectral clustering. First, we propose optimized algorithms for constructing the similarity matrix directly in CSR sparse format on the GPU. Next, we leverage the spectral graph partitioning API of the GPU-accelerated nvGRAPH library for the remaining computations, especially eigenvector extraction. Finally, experiments on synthetic and real-world large datasets demonstrate the high performance and scalability of our GPU implementation of spectral clustering.
Guanlin He, Stephane Vialle, Nicolas Sylvestre, Marc Baboulin
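To make the sparse-construction step concrete, the sketch below builds a thresholded Gaussian similarity matrix directly in CSR form (values, column indices, row pointers), avoiding the dense \(n \times n\) intermediate. This is a serial illustration with 1-D points; the paper's algorithms do this on the GPU.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build a thresholded Gaussian similarity matrix directly in CSR form
// (val / col / row_ptr), never materializing the dense n x n matrix.
struct CSR {
    std::vector<double> val;
    std::vector<int>    col;
    std::vector<int>    row_ptr;
};

CSR similarity_csr(const std::vector<double> &x, double sigma, double eps) {
    CSR m;
    m.row_ptr.push_back(0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double d = x[i] - x[j];
            const double s = std::exp(-d * d / (2 * sigma * sigma));
            if (i != j && s >= eps) {        // keep only significant edges
                m.val.push_back(s);
                m.col.push_back(static_cast<int>(j));
            }
        }
        m.row_ptr.push_back(static_cast<int>(m.val.size()));
    }
    return m;
}
```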
XSP: Fast SSSP Based on Communication-Computation Collaboration
Abstract
Single-source shortest path (SSSP) is an important graph search algorithm for data-intensive applications, which finds the minimum distance from a source vertex to every other vertex in a given graph. Although it has been extensively studied for both single- and multi-node scenarios, SSSP search still poses a severe communication challenge when processing large graphs consisting of billions of vertices on hundreds of computing nodes. To address this problem, in this paper we propose XSP, a fast SSSP search method based on communication-computation collaboration, which optimizes the communication of parallel SSSP in two aspects. First, we design a group-based scalable batching mechanism that effectively reduces the inter-machine communication overhead. Second, we propose a CCO (Communication-Computation Overlapping) method that realizes non-blocking execution of communication and computation. We have implemented XSP, and extensive evaluation results show that its performance is significantly higher than that of state-of-the-art parallel SSSP methods.
Xinbiao Gan, Wen Tan, Menghan Jia, Jie Liu, Yiming Zhang
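The CCO idea, stripped to its core, is the standard non-blocking MPI pattern: post the transfers, do local work while they are in flight, then wait and merge. A generic sketch (not XSP's implementation; the single peer and the distance-minimum merge are simplifications):

```cpp
#include <cstddef>
#include <mpi.h>
#include <vector>

void relax_local(std::vector<double> &dist) { /* relax locally owned edges */ }

// Communication-computation overlap with non-blocking MPI: post the
// sends/receives for distance updates, relax local edges while the
// transfers are in flight, then wait and merge remote results.
void sssp_round(std::vector<double> &dist, int peer) {
    std::vector<double> sendbuf(dist), recvbuf(dist.size());
    MPI_Request req[2];
    MPI_Isend(sendbuf.data(), (int)sendbuf.size(), MPI_DOUBLE,
              peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recvbuf.data(), (int)recvbuf.size(), MPI_DOUBLE,
              peer, 0, MPI_COMM_WORLD, &req[1]);

    relax_local(dist);                        // overlapped with the transfer

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    for (std::size_t v = 0; v < dist.size(); ++v)
        if (recvbuf[v] < dist[v])
            dist[v] = recvbuf[v];             // merge remote distances
}
```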
A Class of Fast and Accurate Multi-layer Block Summation and Dot Product Algorithms
Abstract
Basic recursive summation and the common dot product algorithm have backward error bounds that grow linearly with the vector dimension. Blanchard [1] proposed a class of fast and accurate summation and dot product algorithms, respectively called FABsum and FABdot, which trade off accuracy and speed via the block size. Castaldo [2] proposed multi-layer block summation and dot product algorithms, called SuperBlocksum and SuperBlockdot, that can increase accuracy while adding almost no additional computation. We combine the idea of [1] with the multi-layer block structure to propose SuperFABsum (for "super fast and accurate block summation") and SuperFABdot (for "super fast and accurate block dot product"). Our algorithms have two variants: SuperFAB(within) and SuperFAB(outside). They further improve accuracy and speed compared with FAB and SuperBlock. We conducted accuracy and speed tests on the high-performance FT2000+ processor. Experimental results show that the SuperFABdot(within) algorithm is more accurate than FABdot and SuperBlockdot. Compared with FABdot, the SuperFABdot(outside) algorithm achieves up to 1.2\(\times\) speedup while ensuring similar accuracy.
Kang He, Roberto Barrio, Lin Chen, Hao Jiang, Jie Liu, Tongxiang Gu, Jin Qi
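For orientation, the FABsum idea the paper builds on can be sketched in a few lines: a fast plain sum within each block of size b, and an accurate (here compensated) sum across the block sums, so the dominant error term grows with the block size rather than with n. The SuperFAB variants add further blocking layers; this two-level sketch only illustrates the baseline idea.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// FABsum-style two-level summation: a fast plain sum inside each block
// of size b, and an accurate compensated (Kahan) sum across the block
// sums. The SuperFAB variants add more blocking layers on top.
double fab_sum(const std::vector<double> &x, std::size_t b) {
    double total = 0.0, comp = 0.0;           // Kahan state for block sums
    for (std::size_t i = 0; i < x.size(); i += b) {
        double block = 0.0;                   // fast in-block summation
        const std::size_t end = std::min(i + b, x.size());
        for (std::size_t j = i; j < end; ++j)
            block += x[j];
        const double y = block - comp;        // compensated outer summation
        const double t = total + y;
        comp = (t - total) - y;
        total = t;
    }
    return total;
}
```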
A KNN Query Method for Autonomous Driving Sensor Data
Abstract
Autonomous driving cars need to perceive and extract environmental information through sensors around the body while driving. Sensor data streams are rich in semantic information, and semantic annotation can produce semantic text with geographic location information. This information is important for scenario applications such as effective updating of high-precision maps and more accurate perception of the surrounding environment, so it must be stored and organized effectively. At the same time, users' demand for retrieving such information is increasing, and it is increasingly important to efficiently retrieve useful information from this huge volume of data. In this paper, we propose an index, the SIR-tree, to efficiently organize the spatial, textual, and social information of objects. The SIR-tree is an R-tree augmented with both social relevance and textual relevance; each parent node contains the spatial, textual, and social information of all its children, and the structure has good scalability. Based on the SIR-tree index, we propose the priority traversal algorithm BF, which uses a priority queue to store all candidate objects and traverses the nodes to filter them according to their ranking function values. To optimize the query algorithm, we propose two pruning strategies, the distance-based BF algorithm and the X-HOP_BF algorithm, to improve query efficiency. Experimental results show that the BF algorithm takes at least 70 s to return 20 objects on the Gowalla dataset, while the X-HOP_BF algorithm takes only about 20 s.
Tang Jie, Zhang Jiehui, Zeng Zhixin, Liu Shaoshan
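The BF traversal is the classic best-first search over a hierarchical index: a min-priority queue keyed by a bound of the ranking function, so subtrees that cannot yield better objects are never expanded. A generic sketch assuming a smaller-is-better score (the SIR-tree's actual ranking combines spatial, textual, and social relevance):

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Generic best-first (BF) traversal with a min-priority queue. `bound`
// and `score` are abstract stand-ins for the SIR-tree ranking function.
struct Node {
    double bound;                    // lower bound of scores in this subtree
    double score;                    // exact score (leaves; smaller is better)
    std::vector<const Node*> kids;   // empty for leaf objects
};

std::vector<const Node*> bf_topk(const Node *root, std::size_t k) {
    using Entry = std::pair<double, const Node*>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<>> pq;
    pq.push({root->bound, root});
    std::vector<const Node*> results;
    while (!pq.empty() && results.size() < k) {
        const Node *n = pq.top().second;
        pq.pop();
        if (n->kids.empty()) {
            results.push_back(n);                // next-best object overall
        } else {
            for (const Node *c : n->kids)        // expand internal node
                pq.push({c->kids.empty() ? c->score : c->bound, c});
        }
    }
    return results;
}
```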

System Software and Resource Management

Frontmatter
A Novel Task-Allocation Framework Based on Decision-Tree Classification Algorithm in MEC
Abstract
Mobile-edge computing (MEC) has emerged as a promising paradigm that extends cloud computing tasks to edge mobile devices to improve quality of service. This paradigm addresses problems in the cloud computing architecture by enabling lower latency, higher bandwidth, and better privacy and security. Previous studies of task allocation in MEC systems generally consider only a single influence factor, such as the distance between the mobile device and the cloudlets, to select suitable cloudlets for tasks. However, tasks come in various types with complex individual requirements that should be considered, such as transmission bandwidth and the computing and storage capacity of cloudlets. In this paper, we propose an on-demand, service-oriented task-allocation framework based on machine learning, called Edgant. It classifies tasks into three types using a decision-tree model according to the tasks' characteristics, including their demands on server resources and the user's requirements. For each type, we provide a selection strategy that allocates the task to the most suitable cloudlet based on the task's characteristics and the current state of the cloudlets. Simulated evaluations demonstrate that Edgant achieves lower latency and better accuracy than other task allocation methods, as well as high reliability under high mobility of the mobile devices.
Wenwen Liu, Zhaoyang Yu, Meng Yan, Gang Wang, Xiaoguang Liu
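A hypothetical sketch of the classification step: a small decision tree maps a task's requirements to one of three types, each tied to its own cloudlet-selection strategy. The features and thresholds below are invented for illustration; Edgant's tree is learned from task characteristics.

```cpp
// Toy decision tree over invented task features; Edgant learns its
// tree rather than hard-coding thresholds like these.
enum class TaskType { LatencySensitive, ComputeHeavy, StorageHeavy };

struct Task {
    double max_latency_ms;   // user's latency requirement
    double cpu_demand;       // normalized compute requirement
    double data_size_mb;     // input/output volume
};

TaskType classify(const Task &t) {
    if (t.max_latency_ms < 50.0)           // tight deadline dominates
        return TaskType::LatencySensitive;
    if (t.cpu_demand > 0.7)                // then compute requirement
        return TaskType::ComputeHeavy;
    return TaskType::StorageHeavy;         // otherwise data-bound
}
```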
QoS-Aware Scheduling for Cellular Networks Using Deep Reinforcement Learning
Abstract
This research presents a reinforcement learning (RL) approach that provides coarse-grained decisions to maximize QoS requirement satisfaction in mobile networks. Deep reinforcement learning has demonstrated agents that can capture the dynamics of extremely complex systems. At each scheduling interval, our RL agent provides a scheduling policy suitable for optimal resource allocation in a mobile network. By using a deep neural network to approximate the action-value function (Q-function), scheduling decisions can be made using an optimal policy. Utilising a 4G-LTE network simulator and PyTorch, this research explores three scenarios of diverse traffic and UE density. The implementation shows stable and effective performance compared to baseline static schedulers. Additionally, the RL agent selects the optimal scheduler for both single- and mixed-traffic simulations. Being both scalable and cheap to compute, the implementation offers a simple and effective method of radio resource management.
Jonathan Robert Malin, Gun Ko, Won Woo Ro
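A minimal sketch of the decision rule: epsilon-greedy selection over action values, where each action is a candidate scheduling policy. In the paper the Q-values come from a deep network evaluated on the network state; here they are simply an input vector.

```cpp
#include <random>
#include <vector>

// Epsilon-greedy policy over action values: with probability eps pick a
// random scheduler (explore), otherwise the argmax of Q (exploit).
int select_scheduler(const std::vector<double> &q, double eps,
                     std::mt19937 &rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < eps) {                       // explore
        std::uniform_int_distribution<int> pick(0, (int)q.size() - 1);
        return pick(rng);
    }
    int best = 0;                                // exploit: argmax_a Q(s,a)
    for (int a = 1; a < (int)q.size(); ++a)
        if (q[a] > q[best]) best = a;
    return best;
}
```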
Adaptive Buffering Scheme for PCM/DRAM-Based Hybrid Memory Architecture
Abstract
Phase Change Memory (PCM) motivates new hybrid memory architectures that consist of PCM and DRAM. A critical issue in PCM/DRAM-based hybrid memory architectures is how to devise efficient buffering schemes for the heterogeneous memory. A feasible way is to use PCM as a page buffer in the memory hierarchy. Based on this assumption, this paper presents a new buffering scheme named FWQ-LRU for PCM/DRAM-based hybrid memory. FWQ-LRU introduces two new designs. First, it uses two queues for DRAM to identify write-cold DRAM pages, which are moved to PCM during page replacement and page migrations. Second, to reduce PCM writes, it adopts an adaptive page migration mechanism that swaps write-hot PCM pages with write-cold DRAM pages. Specifically, FWQ-LRU devises a queue called the FWQ (FIFO Write Queue) to detect write-hot PCM pages. Further, we periodically measure the page-migration benefit under the current workload to adjust the page migration policy for future requests. With this mechanism, write requests are mainly redirected to DRAM, and writes to PCM are reduced. In addition, FWQ-LRU maintains the same hit ratio as LRU. We conduct comparative trace-driven experiments, and the results demonstrate the efficiency of our proposal.
Xiaoliang Wang, Kaimeng Chen, Peiquan Jin
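A sketch of one plausible FWQ mechanism: a fixed-capacity FIFO of recently written PCM page ids, where a page written again while still queued is flagged write-hot and becomes a migration candidate. The capacity and the swap itself are assumptions outside this sketch.

```cpp
#include <cstddef>
#include <deque>
#include <unordered_set>

// FIFO write queue for write-hot detection: a PCM page re-written while
// still in the FIFO is considered write-hot (a candidate to swap with a
// write-cold DRAM page, not shown here).
class FifoWriteQueue {
    std::deque<long> fifo_;
    std::unordered_set<long> member_;
    std::size_t cap_;
public:
    explicit FifoWriteQueue(std::size_t cap) : cap_(cap) {}
    // Returns true if this write identifies `page` as write-hot.
    bool on_pcm_write(long page) {
        if (member_.count(page)) return true;    // re-written while queued
        fifo_.push_back(page);
        member_.insert(page);
        if (fifo_.size() > cap_) {               // evict oldest entry
            member_.erase(fifo_.front());
            fifo_.pop_front();
        }
        return false;
    }
};
```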
Efficiency-First Fault-Tolerant Replica Scheduling Strategy for Reliability Constrained Cloud Application
Abstract
Meeting reliability requirements is an important prerequisite for application execution in the cloud. Although replica management can improve the reliability of applications, it also brings resource waste and overhead issues. Therefore, we propose the efficiency-first fault-tolerant algorithm (EFFT), which minimizes the execution cost of a cloud application under reliability constraints and addresses the excessive overhead caused by too many replicas. The EFFT algorithm is divided into two stages: initial allocation and dynamic adjustment. In the initial allocation stage, a sorting rule is defined to determine the priority of tasks and instances. During the adjustment stage, by defining an actual efficiency ratio indicator to measure the cost-effectiveness of an instance, EFFT makes a good trade-off between cost and reliability in order to minimize execution cost. We ran our algorithm on randomly generated parallel applications of different scales and compared the results with four advanced algorithms. The experiments show that our algorithm outperforms the others in terms of execution cost and fault tolerance.
Yingxue Zhang, Guisheng Fan, Huiqun Yu, Xingpeng Chen
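The abstract names but does not define the efficiency ratio, so the following is speculative: one plausible reading ranks candidate instances by reliability gained per unit cost. The real EFFT indicator may be defined differently.

```cpp
#include <algorithm>
#include <vector>

// Speculative sketch: rank instances by reliability per unit cost so
// the most cost-effective candidates for a replica come first.
struct Instance {
    double reliability;   // probability the task succeeds here
    double cost;          // price of running the task here
};

void rank_by_efficiency(std::vector<Instance> &c) {
    std::sort(c.begin(), c.end(), [](const Instance &a, const Instance &b) {
        return a.reliability / a.cost > b.reliability / b.cost;
    });
}
```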
Towards an Optimized Containerization of HPC Job Schedulers Based on Namespaces
Abstract
Recently, container technology has gained increasing attention and has become an alternative to traditional virtual machines. The technology is used to deploy large-scale applications in areas such as Big Data, AI, and High-Performance Computing (HPC). In the HPC field, several management tools exist, such as Slurm; on the other hand, the literature has considered many container scheduling strategies. The majority of container scheduling strategies do not consider the amount of data transmitted between containers. This paper presents a new container scheduling strategy that automatically groups containers belonging to the same group (Namespace) on the same node. In brief, the strategy is application-aware, as long as the user knows which containers should be grouped in the same Namespace. The objective is to compact nodes with containers of the same group in order to reduce the number of nodes used and the inter-node communication costs, and to improve the overall Quality-of-Service (QoS) of containerized applications. Our proposed strategy is implemented under the Kubernetes framework, and experiments demonstrate its potential under different scenarios. Most importantly, we show first that cohabitation between our new scheduling strategy and the default Kubernetes strategy is possible and benefits the system; thanks to Namespaces, cohabitation is not limited to two methods for scheduling either batch jobs or online services. Second, thanks to deployment automation, we also demonstrate that multiple Slurm clusters can be instantiated from a pool of bare-metal nodes. This contributes to the concept of "HPC as a Service."
Tarek Menouer, Nicolas Greneche, Christophe Cérin, Patrice Darmon
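A conceptual sketch of the compaction rule: when placing a container, prefer the node that already hosts the most containers of the same Namespace, packing each group onto few nodes. The real strategy is a Kubernetes scheduler extension; capacity and feasibility checks are omitted here.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Pick the node already hosting the most containers of this Namespace,
// so groups are compacted onto few nodes and inter-node traffic shrinks.
struct NodeState {
    std::unordered_map<std::string, int> per_namespace;  // containers per ns
};

int pick_node(const std::vector<NodeState> &nodes, const std::string &ns) {
    int best = 0, best_count = -1;
    for (int i = 0; i < (int)nodes.size(); ++i) {
        auto it = nodes[i].per_namespace.find(ns);
        const int count = (it == nodes[i].per_namespace.end()) ? 0 : it->second;
        if (count > best_count) { best_count = count; best = i; }
    }
    return best;   // index of the node to bind the container to
}
```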
Architecture of an On-Time Data Transfer Framework in Cooperation with Scheduler System
Abstract
Technological advancements in networking and IoT have given researchers new methods and techniques to perform numerical analysis and simulation with the latest data observed by IoT sensors and other measurement devices. In general, large-scale simulations necessitate high-performance computing (HPC) systems, and such HPC systems are operated in a shared manner among researchers. It is therefore inherently difficult for researchers to use the latest observation data generated at remote data sources in their simulations. To enable researchers to utilize fresh data from a remote data source for computation, we propose an on-time data transfer framework that enables the execution of jobs with fresh data generated at a remote site on a shared HPC system, by extending the SLURM scheduler. The proposed framework consists of two functions: job pinning and on-time data transfer. With the job pinning function, the framework prevents the scheduling algorithm from rearranging the scheduled start time of a job. The on-time data transfer function is in charge of moving data from the remote site to the data transfer node, and attempts to complete the transfer just at the pinned start time of the job. The evaluation in this paper indicates that the proposed framework keeps data freshness high while minimizing the job waiting time caused by data transfer.
Kohei Yamamoto, Arata Endo, Susumu Date
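The timing logic reduces to simple arithmetic: work backwards from the pinned job start time so that the transfer finishes just in time, maximizing data freshness. A sketch, where the bandwidth estimate and safety margin are assumed inputs:

```cpp
// Work backwards from the pinned job start so the transfer finishes
// just in time; bandwidth estimate and safety margin are assumptions.
double transfer_start_time(double job_start,     // pinned start (s, epoch)
                           double bytes,         // data set size
                           double bandwidth,     // estimated bytes/s
                           double margin) {      // safety slack (s)
    return job_start - bytes / bandwidth - margin;
}
```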

Storage

Frontmatter
Data Delta Based Hybrid Writes for Erasure-Coded Storage Systems
Abstract
Erasure coding is widely used in storage systems since it offers higher reliability at lower redundancy than data replication. However, erasure-coded storage systems have to perform a partial write to an entire erasure coding group for each small write, which causes a time-consuming write-after-read. This paper presents an efficient data-delta-based hybrid write scheme, named DABRI, which supports fast partial writes for erasure-coded storage systems. DABRI uses data deltas, the differences between the latest and the original data values, to bypass the computation of parity deltas and the read of old data. For a series of n partial writes to the same data, DABRI performs log-based data and parity updates for the first write, and in-place data updates with log-based parity updates for the remaining \(n-1\) writes. This enables new data to be written to data nodes and parity nodes in parallel, mitigating the overhead of data reads and parity updates. Based on DABRI, we design and implement a prototype erasure-coded storage system that delivers high performance for small-write-intensive applications. Experimental results on real-world traces show that DABRI improves I/O throughput by up to 77.41% compared with the state-of-the-art cited in this paper.
Qiang Huang, Hui Chen, Bing Wei, Jigang Wu, Limin Xiao
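For context, the sketch below shows the baseline partial-write arithmetic in an XOR-coded stripe that DABRI attacks: each small write reads the old data, forms a delta, and applies it to parity. DABRI's data-delta logging takes the old-data read and the parity-delta computation off the critical path for repeated writes; only the baseline is shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline partial write in an XOR-coded stripe:
//   p_new = p_old ^ d_old ^ d_new
// Blocks are assumed to have equal length.
void xor_partial_write(std::vector<uint8_t> &data_block,
                       std::vector<uint8_t> &parity_block,
                       const std::vector<uint8_t> &new_data) {
    for (std::size_t i = 0; i < data_block.size(); ++i) {
        const uint8_t delta = data_block[i] ^ new_data[i];  // d_old ^ d_new
        parity_block[i] ^= delta;                           // update parity
        data_block[i] = new_data[i];                        // in-place update
    }
}
```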
BDCuckoo: an Efficient Cuckoo Hash for Block Device
Abstract
Hashing is widely used in various storage systems due to its excellent insertion and search performance. However, existing hash designs are not friendly to block devices because they generate many random small I/Os, which significantly reduce I/O efficiency on block devices. This paper proposes BDCuckoo (Block Device Cuckoo) hash, an I/O-optimized cuckoo hash for block devices. BDCuckoo reduces the number of I/Os during slot probing by limiting the locations where an element may be stored in the hash table. Unlike traditional cuckoo hashing, which triggers a disk I/O for each probe, BDCuckoo loads all possible slots of the target element into DRAM in a single large disk I/O. The paper also presents a use case that applies BDCuckoo to optimize large directory indexing in the EXT4 file system. The evaluation shows that BDCuckoo outperforms the traditional cuckoo hash for all tested YCSB workloads, with up to a 2.64 times performance improvement for workload D at a load factor of 0.7. In the use case, the directory index of the BDCuckoo-optimized file system achieves a 1.79 times performance improvement for the stat command and a 1.64 times improvement for the rm command.
Xianqi Zheng, Jia Ma, Yubo Liu, Zhiguang Chen
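One way to realize the constrained layout (an assumption; the paper's exact slot mapping may differ): keep both candidate slots of a key inside a single block-sized region, so a lookup reads that one region in a single large I/O. The disk is simulated by a flat in-memory table here.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Both candidate slots of a key live in one block-sized region, so a
// lookup needs ONE large read instead of two random small I/Os.
constexpr int SLOTS_PER_REGION = 64;             // about one disk block

struct Slot { uint64_t key = 0; uint64_t val = 0; bool used = false; };
struct Region { Slot slots[SLOTS_PER_REGION]; };

struct BDCuckooSketch {
    std::vector<Region> table;
    explicit BDCuckooSketch(std::size_t nregions) : table(nregions) {}

    bool lookup(uint64_t key, uint64_t &val_out) const {
        const std::size_t r = std::hash<uint64_t>{}(key) % table.size();
        const Region &region = table[r];         // one large read on disk
        const int s1 = (int)(key % SLOTS_PER_REGION);
        const int s2 = (int)(std::hash<uint64_t>{}(key ^ 0x9e3779b9u)
                             % SLOTS_PER_REGION);
        for (int s : {s1, s2}) {                 // both candidates in DRAM
            const Slot &slot = region.slots[s];
            if (slot.used && slot.key == key) { val_out = slot.val; return true; }
        }
        return false;
    }
};
```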
A Two Tier Hybrid Metadata Management Mechanism for NVM Storage System
Abstract
NVM is an effective means of improving the performance of storage systems, and metadata management efficiency has a significant impact on their I/O performance. However, current metadata management strategies are not efficient and decrease the endurance of NVM devices. We design a two-tier hybrid metadata management mechanism. Based on the intrinsic characteristics of NVM devices and metadata management, metadata management is decomposed and redistributed between the file system and the device driver, instead of being concentrated in the file system. A hybrid distributed directory tree, a dual-zone cache management algorithm, and a multi-skiplist index embedded in the driver are designed to shorten the I/O software stack for accessing and managing metadata, and to improve the concurrency of metadata management, especially on computers with multi-core processors. A prototype is implemented on a computer equipped with Intel Optane DC persistent memory (PMEM) and evaluated with several common application workloads. The results demonstrate that it improves IOPS by 76.9% and 26.9% compared with EXT4 and NOVA on NVM, respectively, and that it scales better with the number of files and threads.
Tao Cai, Pengfei Gao, Fuli Chen, Dejiao Niu, Fei Wang, Yueming Ma, Lei Li
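A speculative sketch of the concurrency idea behind the driver-embedded multi-skiplist index: hash metadata keys across several independently locked index shards so that operations on different cores rarely contend. std::map stands in for each skiplist here; the real index lives in the NVM driver.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Multiple independently locked index shards (skiplists in the paper;
// std::map as a stand-in) to reduce contention on multi-core CPUs.
class ShardedIndex {
    struct Shard { std::mutex mu; std::map<std::string, long> idx; };
    std::vector<Shard> shards_;
    Shard &shard_for(const std::string &key) {
        return shards_[std::hash<std::string>{}(key) % shards_.size()];
    }
public:
    explicit ShardedIndex(std::size_t n) : shards_(n) {}
    void put(const std::string &key, long ino) {
        Shard &s = shard_for(key);
        std::lock_guard<std::mutex> g(s.mu);
        s.idx[key] = ino;
    }
    bool get(const std::string &key, long &ino) {
        Shard &s = shard_for(key);
        std::lock_guard<std::mutex> g(s.mu);
        auto it = s.idx.find(key);
        if (it == s.idx.end()) return false;
        ino = it->second;
        return true;
    }
};
```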
A Novel CFLRU-Based Cache Management Approach for NAND-Based SSDs
Abstract
To ensure better I/O performance of NAND-based SSD storage, a DRAM cache is commonly placed inside SSDs to absorb overwrites or writes to the underlying SSD cells. This paper presents a simple, effective cache management method for SSDs based on clean-first least recently used (CFLRU) that also takes spatial locality into account. That is, we introduce a data management approach that separates hot and cold buffered data in the SSD cache by considering both temporal and spatial locality when accessing or inserting write data. As a result, hot buffered data and their spatially adjacent buffered data are preferentially kept in the cache, while other cold data are evicted first. Simulation tests on several realistic disk traces show that our proposal improves cache hits by up to 4.5% and cuts I/O response time by up to 9.4% compared with the commonly used CFLRU scheme.
Haodong Lin, Jun Li, Zhibing Sha, Zhigang Cai, Jianwei Liao, Yuanquan Shi
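For reference, the CFLRU baseline the paper extends can be sketched as follows: eviction scans a clean-first window at the LRU end and prefers a clean victim, avoiding a flash write-back. The paper's addition, spatial-locality-aware hot/cold separation, is not shown.

```cpp
#include <cstddef>
#include <iterator>
#include <list>

// CFLRU eviction: scan a clean-first window at the LRU end and evict
// the clean page closest to the LRU position if one exists; otherwise
// fall back to the plain LRU victim (which would need a write-back).
struct Page { long id; bool dirty; };

// `lru` is ordered MRU (front) -> LRU (back) and assumed non-empty.
long evict_cflru(std::list<Page> &lru, std::size_t window) {
    auto victim = std::prev(lru.end());           // default: LRU page
    auto it = victim;
    for (std::size_t scanned = 0; scanned < window; ++scanned) {
        if (!it->dirty) { victim = it; break; }   // prefer a clean page
        if (it == lru.begin()) break;
        --it;
    }
    const long id = victim->id;
    lru.erase(victim);                            // dirty victims flush first
    return id;
}
```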

Networks and Communications

Frontmatter
Taming Congestion and Latency in Low-Diameter High-Performance Datacenters
Abstract
High-performance computing (HPC) and data centers are showing a trend of merging into high-performance data centers (HPDC), which aim to provide extremely low latency for HPC and data center workloads. In addition to adopting a low-diameter network topology, HPDC requires a more advanced congestion control mechanism. This paper implements the state-of-the-art data center congestion control method on the Dragonfly topology and proposes Bowshot, a fast and accurate congestion control method for low-latency HPDC. Bowshot uses fine-grained feedback to accurately describe the network state. It uses switch feedback, ACK padding, and ACK-first scheduling to reduce feedback delay, and switch-side calculation to reduce the overhead of congestion control. A large-scale evaluation shows that Bowshot reduces the average flow completion time (FCT) by 33% and the 99th-percentile FCT by 45% compared to the state-of-the-art work. Bowshot reduces feedback delay by 89% while maintaining higher throughput and a near-zero queue length.
Renjie Zhou, Dezun Dong, Shan Huang, Zejia Zhou, Yang Bai
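Bowshot's exact update rule is not given in the abstract; for orientation, the sketch below shows the generic INT-driven window update this line of work builds on: scale the window by how far the bottleneck utilization, read from switch feedback, sits from a target. Constants are illustrative, and Bowshot additionally moves part of this calculation into the switch.

```cpp
#include <algorithm>

// Generic INT-driven window update: multiplicative correction toward a
// target utilization plus a small additive increase. Illustrative only.
double update_window(double window,        // current window (bytes)
                     double utilization,   // bottleneck link utilization
                     double target,        // e.g. 0.95
                     double add_step) {    // additive increase (bytes)
    const double w =
        window * (target / std::max(utilization, 1e-6)) + add_step;
    return std::max(w, 1500.0);            // never below one MTU
}
```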
Evaluation of Topology-Aware All-Reduce Algorithm for Dragonfly Networks
Abstract
Dragonfly is a popular topology for current and future high-speed interconnection networks, and gathering topology information to accelerate collective operations is a very active research field. All-reduce is a key collective communication algorithm, widely used in distributed machine learning (DML) and high-performance computing (HPC). The hierarchical characteristics of the Dragonfly topology can be exploited to take advantage of the low communication delay between adjacent nodes and reduce the completion time of All-reduce operations. In this paper, we propose g-PAARD, a general proximity-aware All-reduce communication algorithm for Dragonfly networks. We study the impact of different routing mechanisms on the All-reduce algorithm and their sensitivity to topology size and message size. Our results show that the proposed topology-aware algorithm can significantly reduce communication delay, while having little impact on the network topology.
Junchao Ma, Dezun Dong, Cunlu Li, Ke Wu, Liquan Xiao
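The proximity-aware structure can be sketched with the classic hierarchical all-reduce: reduce inside each Dragonfly group over cheap local links, all-reduce among group leaders over global links, then broadcast back. g-PAARD's actual algorithm and its routing interplay are more involved; the group id is assumed derivable from the rank.

```cpp
#include <mpi.h>
#include <vector>

// Hierarchical all-reduce: local reduce -> leader all-reduce -> local
// broadcast, exploiting the low latency of intra-group links.
void hierarchical_allreduce(std::vector<double> &buf, int group,
                            MPI_Comm world) {
    int rank; MPI_Comm_rank(world, &rank);

    MPI_Comm local;                        // ranks within one group
    MPI_Comm_split(world, group, rank, &local);
    int lrank; MPI_Comm_rank(local, &lrank);

    MPI_Comm leaders;                      // one leader (lrank 0) per group
    MPI_Comm_split(world, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    // Step 1: reduce to the group leader over low-latency local links.
    if (lrank == 0)
        MPI_Reduce(MPI_IN_PLACE, buf.data(), (int)buf.size(),
                   MPI_DOUBLE, MPI_SUM, 0, local);
    else
        MPI_Reduce(buf.data(), nullptr, (int)buf.size(),
                   MPI_DOUBLE, MPI_SUM, 0, local);

    // Step 2: all-reduce among leaders across the global links.
    if (leaders != MPI_COMM_NULL) {
        MPI_Allreduce(MPI_IN_PLACE, buf.data(), (int)buf.size(),
                      MPI_DOUBLE, MPI_SUM, leaders);
        MPI_Comm_free(&leaders);
    }

    // Step 3: broadcast the result back within each group.
    MPI_Bcast(buf.data(), (int)buf.size(), MPI_DOUBLE, 0, local);
    MPI_Comm_free(&local);
}
```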
MPICC: Multi-Path INT-Based Congestion Control in Datacenter Networks
Abstract
Network congestion causes severe transmission performance loss, and congestion control is the key to achieving low latency, high bandwidth, and high stability. Due to the high regularity of their topologies, datacenter networks inherently contain abundant equal-cost paths and have huge potential for high-bandwidth transmission. Taking in-network telemetry (INT) information as the congestion signal, congestion control algorithms can monitor the network based on precise link load data; hence, INT-based algorithms are the focus of recent high-precision congestion control research. However, existing INT-based congestion control algorithms handle only single-path transmission between the sender and the receiver, and cannot make full use of the abundant equal-cost path resources in the network. In this paper, we make the first attempt to propose a multi-path INT-based congestion control algorithm, called MPICC. Compared with the most advanced INT-based congestion control algorithm, HPCC, our method realizes multi-path coordinated transmission of data packets in datacenter networks. We conduct extensive experiments under various workloads to evaluate the performance of MPICC. The experimental results show that, while ensuring high end-host throughput, MPICC uses multiple paths to transmit data packets, greatly reducing the flow completion time (FCT) of flows in the network. Specifically, under three common datacenter workloads, Cache Follower, Web Search, and Web Server, MPICC reduces the average FCT by 20.1%, 14.4%, and 39.5%, respectively, and shortens the 97th-percentile FCT by 39.9%, 18.9%, and 57.9%, respectively.
Guoyuan Yuan, Dezun Dong, Xingyun Qi, Baokang Zhao
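MPICC's path-selection details are not in the abstract; as a hedged illustration, the sketch below keeps independent per-path congestion state and sends each packet on the equal-cost path with the most window headroom, which is the generic ingredient any multi-path INT scheme needs.

```cpp
#include <vector>

// Per-path congestion state; in an INT-based scheme the window would be
// driven by per-path telemetry feedback (not modeled here).
struct PathState { double window; double inflight; };

// Returns the chosen path index, or -1 if every path is window-limited.
int pick_path(const std::vector<PathState> &paths) {
    int best = -1;
    double best_room = 0.0;
    for (int i = 0; i < (int)paths.size(); ++i) {
        const double room = paths[i].window - paths[i].inflight;
        if (room > best_room) { best_room = room; best = i; }
    }
    return best;
}
```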
Backmatter
Metadata
Title
Network and Parallel Computing
Editors
Christophe Cérin
Depei Qian
Jean-Luc Gaudiot
Guangming Tan
Stéphane Zuckerman
Copyright Year
2022
Electronic ISBN
978-3-030-93571-9
Print ISBN
978-3-030-93570-2
DOI
https://doi.org/10.1007/978-3-030-93571-9
