
2022 | Book

Network and Parallel Computing

19th IFIP WG 10.3 International Conference, NPC 2022, Jinan, China, September 24–25, 2022, Proceedings

About this book

This book constitutes the proceedings of the 19th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2022, which was held in Jinan, China, during September 24-25, 2022.

The 23 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 89 submissions. They were organized in topical sections as follows: computer architecture; cloud computing; deep learning; emerging applications; and storage and I/O.

Table of Contents

Frontmatter

Architecture

Frontmatter
A Routing-Aware Mapping Method for Dataflow Architectures

Dataflow architecture is a promising parallel computing platform with high performance, efficiency and flexibility. The dataflow mapping algorithm, which offloads programs onto dataflow hardware, has a significant impact on the performance of the architecture. Dataflow mapping methods in previous studies are hardly efficient, as they rarely consider the requirements of routing resources. In this paper, we propose a routing-aware mapping algorithm that combines hardware resources and dataflow graph characteristics to explore better mapping schemes. Our method first focuses on the influence of predecessor and successor nodes when mapping a node, and then comprehensively considers the competition for computing resources and the routing cost to find the mapping solution with the lowest overhead. Experiments demonstrate that our method can achieve up to a 2.06× performance improvement and a 12.8% energy consumption reduction compared to state-of-the-art methods.

Zhihua Fan, Wenming Li, Tianyu Liu, Xuejun An, Xiaochun Ye, Dongrui Fan
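
As an illustration only (not the authors' algorithm), the sketch below shows what a routing-aware greedy mapper can look like: dataflow-graph nodes are placed one by one onto a PE grid, scoring each candidate PE by its current compute load plus the Manhattan routing distance to already-mapped predecessors. The grid size, cost weights, and greedy strategy are illustrative assumptions.

```python
# Illustrative sketch: greedy routing-aware mapping of dataflow-graph nodes onto
# a PE grid. The score combines compute contention (load) with routing cost
# (Manhattan distance to mapped predecessors). Hypothetical parameters.
from itertools import product

def route_cost(a, b):
    """Manhattan distance as a stand-in for routing cost between two PEs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def map_dataflow(nodes, preds, grid=(4, 4), w_compute=1.0, w_route=1.0):
    """nodes: topologically ordered node ids; preds: node -> list of predecessor ids."""
    load = {pe: 0 for pe in product(range(grid[0]), range(grid[1]))}
    placement = {}
    for n in nodes:  # topological order keeps predecessors mapped first
        best_pe, best_cost = None, float("inf")
        for pe in load:
            cost = (w_compute * load[pe] +
                    w_route * sum(route_cost(pe, placement[p]) for p in preds.get(n, [])))
            if cost < best_cost:
                best_pe, best_cost = pe, cost
        placement[n] = best_pe
        load[best_pe] += 1
    return placement

# Example: a small diamond-shaped dataflow graph a -> {b, c} -> d
print(map_dataflow(["a", "b", "c", "d"], {"b": ["a"], "c": ["a"], "d": ["b", "c"]}))
```
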
Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion

Convolution operations are the essential components in modern CNNs (Convolutional Neural Networks) and are also the most time-consuming. Several fast convolution algorithms, including FFT and Winograd, have been proposed to solve this problem. Winograd convolution is used to improve the inference performance of convolution operators with small kernels, which are the mainstream in current popular CNNs. However, the implementations of Winograd convolution in many highly optimized deep neural network libraries and deep learning compilers are not efficient, and the complex data dependencies among the four stages of Winograd convolution make it very challenging to optimize. In this paper, we improve the inference performance of the Winograd convolution operator on GPUs. We propose a sync-free implementation of the calculation stage of Winograd and further propose PKF (Partial Kernel Fusion) methods utilizing different memory levels of GPUs. We implemented PKF-Reconstructor based on TVM for PKF Winograd convolution. Evaluations on convolution operators from real-world CNNs show that our method achieves a speedup of 8.22×–13.69× compared to cuDNN and 4.89×–9.10× compared to the fastest vanilla TVM Winograd implementation.

Gan Tong, Run Yan, Ling Yang, Mengqiao Lan, Jing Zhang, Yuanhu Cheng, Wentao Ma, Yashuai Lü, Sheng Ma, Libo Huang
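
To make the "four stages" concrete, the following NumPy sketch shows 1D Winograd F(2,3) with the standard textbook transform matrices: filter transform, input transform, element-wise multiplication (a batched GEMM in the 2D case), and output transform. It only illustrates the data dependencies between stages; it is not the paper's GPU kernels or fusion scheme.

```python
# Minimal NumPy sketch of the four Winograd F(2,3) stages using the standard
# transform matrices from the Winograd-convolution literature.
import numpy as np

BT = np.array([[1, 0, -1,  0],
               [0, 1,  1,  0],
               [0, -1, 1,  0],
               [0, 1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 1D valid correlation of a 4-tap input tile d with a 3-tap filter g."""
    U = G @ g          # stage 1: filter transform
    V = BT @ d         # stage 2: input transform
    M = U * V          # stage 3: element-wise multiply
    return AT @ M      # stage 4: output transform

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])
print(winograd_f23(d, g))                    # Winograd result
print(np.correlate(d, g, mode="valid"))      # direct computation for comparison
```
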
Adaptive Low-Cost Loop Expansion for Modulo Scheduling

This paper presents a novel modulo scheduling method called Expanded Modulo Scheduling (EMS). Unlike existing methods, which treat loop unrolling and scheduling separately, EMS supports adaptive loop expansion and provides a unified scheduling strategy with a conflict-elimination mechanism for all unrolled layers. EMS constructs the data dependence graph (DDG) only once for the initial loop, and the expansion step is performed on the DDG rather than on the loop itself. As a heuristic, EMS focuses on the criticality of operations and tries to schedule interdependent operations as close together as possible, thus reducing register pressure. The paper describes this technique and evaluates it on MT-3000, achieving an average performance improvement of over 25x for classical assemblies and better resource utilization than other methods.

Hongli Zhong, Zhong Liu, Sheng Liu, Sheng Ma, Chen Li
SADD: A Novel Systolic Array Accelerator with Dynamic Dataflow for Sparse GEMM in Deep Learning

Nowadays, deep learning is prevalent in many fields. The primary workload in deep learning is the General Matrix-matrix Multiplication (GEMM). The TPU is the state-of-the-art GEMM accelerator; however, it does not support sparsity. In this paper, we design and implement SADD, a systolic array accelerator that supports sparsity and dynamic dataflow. First, we propose Group-Structure-Maintained Compression (GSMC). Then, based on GSMC, we propose the Sparsity-supported Weight Stationary Dataflow (SWS) and the Sparsity-supported Input Stationary Dataflow (SIS) to exploit sparsity for systolic arrays. Finally, by combining SIS and SWS, we propose the Sparsity-supported Dynamic Dataflow (SDD), which can change the dataflow according to the computing environment. The experimental results show that the SDD in SADD performs efficiently in any computing environment. When running AlexNet, the performance of SADD is 2× better than that of the TPU. In addition, SADD incurs only a small additional hardware overhead.

Bo Wang, Sheng Ma, Zhong Liu, Libo Huang, Yuan Yuan, Yi Dai
CSR&RV: An Efficient Value Compression Format for Sparse Matrix-Vector Multiplication

Sparse Matrix-Vector Multiplication (SpMV) plays a critical role in many areas of science and engineering. The value array accounts for a large share of the storage of general real-valued sparse matrices, yet existing compressed formats cannot balance compression ratio and computational speed. To address this issue, we propose an efficient value compression format implemented with AVX-512 instructions, called Compressed Sparse Row and Repetition Value (CSR&RV). This format stores each distinct value only once and uses an index array to record the position of values, which reduces the storage space by compressing the value array. We conduct a series of experiments on an Intel Xeon processor and compare it with five other formats on 30 real-world matrices. Experimental results show that CSR&RV can achieve a speedup of up to 3.86× (1.66× on average) in single-core throughput and up to 12.42× (3.12× on average) in multi-core throughput. Meanwhile, our format reduces the memory space by 48.57% on average.

Junjun Yan, Xinhai Chen, Jie Liu
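
The value-compression idea described in the abstract can be sketched with a scalar model: keep the usual CSR row pointers and column indices, but replace the value array with a small table of distinct values plus per-nonzero indices into that table. This is only an illustrative Python model, not the paper's AVX-512 implementation or its exact layout.

```python
# Illustrative sketch of the value-compression idea behind CSR&RV:
# distinct values stored once, per-nonzero indices into that table.
import numpy as np

def compress_values(values):
    uniq, val_idx = np.unique(values, return_inverse=True)
    return uniq, val_idx.astype(np.int32)

def spmv_csr_rv(row_ptr, col_idx, uniq_vals, val_idx, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += uniq_vals[val_idx[k]] * x[col_idx[k]]
    return y

# Tiny example: a 3x3 sparse matrix whose nonzeros take only two distinct values.
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
values  = np.array([2.0, 5.0, 2.0, 5.0, 2.0])
uniq, vidx = compress_values(values)     # uniq has 2 entries instead of 5
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr_rv(row_ptr, col_idx, uniq, vidx, x))   # [7. 2. 7.]
```
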
Rgs-SpMM: Accelerate Sparse Matrix-Matrix Multiplication by Row Group Splitting Strategy on the GPU

The Sparse Matrix-Matrix Multiplication (SpMM) operation is widely used in different fields, especially in the recently popular GNN frameworks. Researchers have designed many GPU kernels to accelerate the SpMM operation. Existing methods mostly adopt a row splitting strategy to obtain better parallelism and memory access efficiency. However, due to irregularities of sparse matrices, such as short rows with few non-zero elements, current methods suffer from underutilization of thread resources on the GPU. In this paper, we rearrange the distribution of non-zero elements in the sparse matrix and design an SpMM kernel based on a row group splitting strategy. In contrast to previous methods, which assign a “row” task unit to a warp for processing, we combine short rows of a sparse matrix into “row groups” as task units, which allocate more appropriately sized non-zero-element workloads to GPU resources. This method reduces thread divergence within a warp and improves load balancing among warps. Our experimental data come from the SNAP Matrix Collection. The results show that our kernel is faster than cuSPARSE and GE-SpMM, with average speedups of 1.61 and 1.42, respectively.

Mingfeng Guo, Yaobin Wang, Jun Huang, Qingfeng Wang, Yaqing Zhang, Mu Xu, Fang Lu
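
A minimal sketch of the row-group idea, under the assumption of a simple nonzero budget per group: consecutive short rows are packed into "row groups" so every task unit receives roughly the same number of nonzeros. The threshold is hypothetical, and this CPU-side grouping model is not the paper's GPU kernel.

```python
# Illustrative sketch: pack consecutive rows into row groups of bounded nonzeros
# so each warp-sized task unit gets a comparable workload.

def build_row_groups(row_ptr, nnz_per_group=32):
    groups, current, current_nnz = [], [], 0
    for row in range(len(row_ptr) - 1):
        row_nnz = row_ptr[row + 1] - row_ptr[row]
        if current and current_nnz + row_nnz > nnz_per_group:
            groups.append(current)
            current, current_nnz = [], 0
        current.append(row)
        current_nnz += row_nnz
    if current:
        groups.append(current)
    return groups   # each group would then be assigned to one task unit

# Example: many short rows collapse into a few balanced groups.
row_ptr = [0, 2, 3, 5, 6, 40, 41, 43]
print(build_row_groups(row_ptr, nnz_per_group=8))   # [[0, 1, 2, 3], [4], [5, 6]]
```
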

Cloud Computing

Frontmatter
Interference-aware Workload Scheduling in Co-located Data Centers

Modern data centers typically contain thousands of servers, providing various computing and storage services for users. A common strategy to provide reliable, high-performance online services is to over-allocate resources for them, which results in a waste of cluster resources. Therefore, cloud vendors tend to co-locate online services and offline batch jobs in the same cluster to improve resource utilization. However, co-location leads to contention on shared resources and causes mutual performance interference, which may degrade the QoS (Quality of Service) of online services. We present a performance interference model based on linear regression to predict performance interference. Furthermore, the model can perceive the status of servers in real time for more refined and accurate prediction. We then design an interference-aware workload scheduling strategy that schedules batch jobs to the servers where they introduce minimal interference. The evaluation demonstrates that our scheduling strategy can increase the throughput of batch jobs by up to 48.95% and 27.09% compared with round-robin scheduling and random scheduling, respectively, while guaranteeing the QoS of online services. The paper aims to provide some elements of an answer to the following general question: how do we design cloud systems and data center scheduling strategies that can withstand the pressure of global-scale use and still provide robust and secure services to end users?

Ting Zhang, Dongyang Ou, Zhefeng Ge, Congfeng Jiang, Christophe Cérin, Longchuan Yan
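
A minimal sketch, assuming hypothetical server-status features and toy data: a linear regression maps server status to a predicted interference score, and each batch job is placed on the server with the lowest prediction. The feature set, training data, and least-squares fit are illustrative stand-ins for the paper's model.

```python
# Illustrative sketch: linear-regression interference prediction plus
# minimum-interference placement of a batch job. Hypothetical features/data.
import numpy as np

def fit_interference_model(X, y):
    """X: (samples, features) server-status matrix; y: measured interference."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_interference(w, status):
    return float(np.append(status, 1.0) @ w)

def schedule_batch_job(w, server_status):
    """Pick the server whose current status predicts the least interference."""
    scores = {s: predict_interference(w, st) for s, st in server_status.items()}
    return min(scores, key=scores.get)

# Toy training data: higher utilization -> more measured interference.
X = np.array([[0.2, 0.1], [0.5, 0.4], [0.8, 0.7], [0.9, 0.9]])
y = np.array([0.05, 0.20, 0.55, 0.80])
w = fit_interference_model(X, y)
servers = {"s1": np.array([0.3, 0.2]), "s2": np.array([0.7, 0.6])}
print(schedule_batch_job(w, servers))   # expected: "s1"
```
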
FaaSPipe: Fast Serverless Workflows on Distributed Shared Memory

Serverless workflows consist of multiple chained functions that pass intermediate data through function invocations. Existing platforms implement data passing via remote storage services, which incur significant performance and maintenance overhead. Some recent works optimize data passing with local shared memory but require specialized function scheduling support and are available only in constrained settings. In this paper, we observe that a simplified, peer-to-peer form of distributed shared memory (DSM) is sufficient and efficient for passing data in serverless workflows. Based on this observation, we propose FaaSPipe, a serverless workflow runtime built on the simplified DSM. FaaSPipe provides PipeFunc, a user-friendly shared-memory-based serverless workflow programming model. To support PipeFunc and take full advantage of the simplified DSM, FaaSPipe designs an intra-workflow memory sharing scheme for address space coordination and builds full-duplex memory transfer channels to enable fast, non-blocking peer-to-peer data passing. Evaluation results on real-world workflow applications show that FaaSPipe reduces workflow latency by up to 61% and consumes up to 2.07× less network traffic compared to state-of-the-art serverless platforms.

Ruizhe Tong
TopKmer: Parallel High Frequency K-mer Counting on Distributed Memory

High-throughput DNA sequencing is a crucial technology for genomics research. As genetic data grows to hundreds of gigabytes or even terabytes that traditional devices cannot handle, high-performance computing plays an important role. However, current high-performance approaches to extracting k-mers incur a large memory footprint due to the high error rate of short-read sequences. This paper proposes TopKmer, a parallel k-mer counting workflow that indexes high-frequency k-mers within a tiny counting structure. On 2048 cores of Tianhe-2, we construct k-mer index tables in 18 s for 174 GB of FASTQ files and complete queries in 1 s for 1 billion k-mers, with a scaling efficiency of 95%. Compared with the state of the art, the counting table's memory usage is reduced by 50% with no performance degradation.

Li Mocheng, Chen Zhiguang, Xiao Nong, Liu Yang, Luo Xi, Chen Tao
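
For readers unfamiliar with k-mer counting, the sketch below shows the basic operation on a single node: extract all length-k substrings from reads and keep only those above a frequency threshold. A plain dictionary stands in for the paper's compact, distributed counting structure; the reads and threshold are hypothetical.

```python
# Illustrative sketch of high-frequency k-mer counting (single node, in memory).
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def high_frequency_kmers(reads, k, min_count=2):
    return {kmer: c for kmer, c in count_kmers(reads, k).items() if c >= min_count}

reads = ["ACGTACGT", "CGTACGTT"]
print(high_frequency_kmers(reads, k=4))
# {'ACGT': 3, 'CGTA': 2, 'GTAC': 2, 'TACG': 2}
```
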
Flexible Supervision System: A Fast Fault-Tolerance Strategy for Cloud Applications in Cloud-Edge Collaborative Environments

With the development of cloud-edge collaborative computing technology, more and more cloud applications are being transferred to edge devices. Some cloud applications in relatively unstable edge scenarios put forward higher requirements for fault tolerance. Therefore, we design and implement a flexible supervision system. The system provides a higher frequency of fault detection than existing cloud management platforms such as Kubernetes, and it implements a more efficient checkpoint-restart fault handling scheme based on a distributed in-memory database. Meanwhile, we also consider minimizing the extra time costs caused by the fault-tolerance operations and saving cloud system resources, including computing, storage, and network.

Weilin Cai, Heng Chen, Zhimin Zhuo, Ziheng Wang, Ninggang An
Adjust: An Online Resource Adjustment Framework for Microservice Programs

The categories of programs running in data centers are changing from traditional monolithic programs to loosely coupled microservice programs. Microservice programs are easy to update and can facilitate software heterogeneity. However, microservice programs have more stringent performance requirements, and estimating the resource requirements of each service becomes the key to ensuring their QoS. In this paper, we propose Adjust, a QoS-aware framework for microservice programs. Adjust establishes a neural-network-based microservice QoS prediction model. Moreover, Adjust identifies the causes of abnormal behavior in microservice programs and determines performance assurance strategies at both the system and microservice levels. By dynamically adjusting resource allocation, Adjust can effectively guarantee the performance of microservice programs while improving system resource utilization.

Lin Wang, Tianyuan Huang, Shichao Geng, Donghua Li, Xiaomin Zhu, Huaxiang Zhang
Cloud-Native Server Consolidation for Energy-Efficient FaaS Deployment

The lack of a function-oriented power management scheme has seriously hindered the cost efficiency of serverless platforms. In this paper, we analyze the invocation patterns of serverless functions and investigate their implications for server energy efficiency. Rather than using a one-size-fits-all strategy, we propose DAC, a software-hardware co-design solution that offers differentiated cloud-native server consolidation. We build a proof-of-concept framework and show that DAC can improve the energy efficiency of tail function deployment by up to 23%.

Lu Zhang, Yifei Pu, Cheng Xu, Du Liu, Zeyi Lin, Xiaofeng Hou, Pu Yang, Shang Yue, Chao Li, Minyi Guo

Deep Learning

Frontmatter
NeuProMa: A Toolchain for Mapping Large-Scale Spiking Convolutional Neural Networks onto Neuromorphic Processor

Neuromorphic processors, the new generation of brain-inspired non-von Neumann computing systems, have the potential to perform complex computations with more energy efficiency than conventional architectures. Neuromorphic processors typically run spiking neural network (SNN)-based applications. However, a non-optimized mapping of SNNs onto the neuromorphic processor may increase the on-chip communication delay and the data exchange between off-chip and on-chip memory, especially when the size of the SNNs exceeds the capacity of the processor, which is limited by its on-chip resources. This paper proposes a toolchain, called NeuProMa, to map large-scale spiking convolutional neural networks (SCNNs) onto resource-constrained neuromorphic processors. We exploit the implicit regular connections in SCNNs and split them into multiple sub-networks while reducing the data exchange between off-chip and on-chip memory. We then partition the sub-networks into multiple clusters sequentially in a specific order, which significantly reduces the spike messages between neuromorphic cores. Finally, NeuProMa dispatches the clusters to the neuromorphic cores, minimizing the maximum workload of the routers. Our experiments using six SCNN-based applications show that NeuProMa significantly reduces the data exchange between off-chip and on-chip memory, and reduces the spike latency and energy consumption by up to 17% and 85%, respectively, compared with the state of the art.

Chao Xiao, Jihua Chen, Lei Wang
Multi-clusters: An Efficient Design Paradigm of NN Accelerator Architecture Based on FPGA

With the continuing development of neural network models, choosing a suitable platform for these complex computing applications is essential. The Field-Programmable Gate Array (FPGA) is gradually becoming an accelerating platform that balances power and performance. FPGA-based neural network accelerator architectures fall into two categories, stream and single-engine, and both design paradigms have advantages and disadvantages. The stream paradigm more easily achieves high performance because of model customization but has low kernel compatibility; the single-engine paradigm is more flexible but has more scheduling overhead. Therefore, this work proposes a new design paradigm for FPGA-based neural network accelerators, called Multi-clusters (MC), which combines the characteristics of the two design categories. We divide the original network model according to its computational features, and different cores are designed to map these parts separately for efficient execution. A fine-grained pipeline is performed inside the cores, while multiple cores are executed under software scheduling to implement a coarse-grained schedule, thereby improving overall computing performance. The experimental results show that the accelerator with the MC design achieves a 39.7× improvement in performance and a 7.9× improvement in energy efficiency compared with CPU and GPU, and finally reaches a peak computing performance of nearly 680.3 GOP/s.

Teng Wang, Lei Gong, Chao Wang, Yang Yang, Yingxue Gao
TrainFlow: A Lightweight, Programmable ML Training Framework via Serverless Paradigm

Distributed ML training is widely used to improve training performance. However, current distributed training frameworks bring undesirable burdens to application-oriented users due to their server-centric design. It is also difficult for users to customize training (e.g., with adaptive policies) to guarantee performance in dynamic environments. Thus, it is meaningful to make the training framework lightweight and programmable, and we argue that the serverless paradigm can effectively help meet these demands. In this paper, we propose TrainFlow, which adopts the serverless paradigm to simplify and extend the programmability of data-parallel training. First, the basic framework is built with a novel serverless process model, providing a high-level view and various kinds of state sharing; training can then be divided into two processes with specific workflows. Second, TrainFlow provides an event-driven hook mechanism, allowing users to customize the training workflow. We implement and evaluate TrainFlow with OpenFaaS. Experiments demonstrate its availability and programmability. For availability, TrainFlow can support various training patterns and shows advantages in performance (e.g., a 1.6× higher speedup ratio than the baseline) and resource consumption (e.g., up to 41.0% less memory consumption than the baseline). For programmability, TrainFlow works with adaptive policies as expected (e.g., up to 1.48× higher throughput in one case).

Wenting Tan, Xiao Shi, Zhengyu Lei, Dong Liang, Cunchi Lv, Xiaohong Wang, Xiaofang Zhao
DRP: Discrete Rank Pruning for Neural Network

Although deep neural networks (DNNs) have achieved excellent performance in computer vision applications in recent years, it is still challenging to deploy them on resource-limited devices such as mobile phones. To solve this problem, we propose a novel filter pruning method for neural networks named Discrete Rank Pruning (DRP). Moreover, many methods apply sparse regularization to the filter weights of the convolution layers to reduce the degradation of performance after pruning. We analyze these methods and find that it is necessary to consider the influence of the bias term. Based on this, we propose a novel sparsity method named Consideration Bias Sparsity (CBS). Extensive experiments on the MNIST, CIFAR-10 and CIFAR-100 datasets with LeNet-5, VGGNet-16, ResNet-56, GoogLeNet and DenseNet-40 demonstrate the effectiveness of CBS and DRP. For LeNet-5, CBS achieves a 1.87% higher accuracy than sparse regularization on MNIST. For VGGNet-16, DRP achieves a 66.6% reduction in FLOPs by removing 83.3% of parameters with only a 0.36% decrease in accuracy on CIFAR-10. For ResNet-56, DRP achieves a 47.45% reduction in FLOPs by removing 42.35% of parameters with only a 0.82% decrease in accuracy on CIFAR-100.

Songwen Pei, Jie Luo, Sheng Liang
TransMigrator: A Transformer-Based Predictive Page Migration Mechanism for Heterogeneous Memory

Page migration strategies are crucial to the performance of a hybrid main memory system consisting of DRAM and non-volatile RAM. Previous locality-based migration strategies have limitations in deciding which pages should be placed in the limited DRAM. In this paper, we propose TransMigrator, a transformer-based predictive page migration mechanism. TransMigrator uses an end-to-end neural network to directly predict the page that will be accessed most in the near future, by learning patterns from a long memory access history. The network achieves an average prediction accuracy of 0.7245 with a model size of 0.804 MB. In addition, a threshold-based method is used alongside the predictor to make the system robust. TransMigrator reduces access time by 23.59% on average compared with AC-CLOCK, THMigrator and VC-HMM.

Songwen Pei, Jianan Li, Yihuan Qian, Jie Tang, Jean-Luc Gaudiot
Hardware Acceleration for 1D-CNN Based Real-Time Edge Computing

The one-dimensional convolutional neural network (1D-CNN) has a major advantage of low-cost implementation on edge devices for time series classification. However, for edge devices working in real-time computing (RTC) systems, the nonconcurrent availability of input signals leads to a more complex computing process and a bigger challenge in satisfying the resource and timing constraints. In this paper, an energy-efficient, high-performance 1D-CNN architecture is proposed for edge inference in RTC systems, which performs 1D-CNN operations element-wise and concurrently as the input sequence is streamed. We present a data reuse scheme that maximally reduces the computational and memory resources, based on how 1D-CNN feature maps are generated during RTC. A compiler is developed to generate the pipelined hardware architecture for any given 1D-CNN model. We implement our proposed architecture in a 65-nm CMOS technology and show that this design achieves up to 1.72 TOPS/W power efficiency. Regarding computational latency, our design outperforms state-of-the-art CNN accelerators by more than one order of magnitude.

Xinyu Liu, Gaole Sai, Shengyu Duan
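
As a behavioral illustration of element-wise streaming convolution (a software model, not the proposed hardware), the sketch below pushes each arriving sample into a sliding window and emits one output value as soon as the window is full. The filter taps and input samples are hypothetical.

```python
# Illustrative software model of streaming 1D convolution for real-time input.
from collections import deque

class Streaming1DConv:
    def __init__(self, weights, bias=0.0):
        self.w = list(weights)
        self.b = bias
        self.window = deque(maxlen=len(weights))   # sliding input window

    def push(self, sample):
        """Feed one input sample; return an output value once enough samples have arrived."""
        self.window.append(sample)
        if len(self.window) < len(self.w):
            return None
        return sum(x * w for x, w in zip(self.window, self.w)) + self.b

conv = Streaming1DConv(weights=[0.5, 1.0, -0.5])
for t, sample in enumerate([1.0, 2.0, 3.0, 4.0]):
    print(f"t={t}: {conv.push(sample)}")   # None, None, 1.0, 2.0
```
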

Emerging Applications

Frontmatter
DLC: An Optimization Framework for Full-State Quantum Simulation

Quantum simulation on classical computers is one of the main approaches to evaluating quantum computing devices and developing quantum algorithms. Existing quantum simulators mainly fall into two categories: full-state simulators and tensor network simulators. The former consumes a lot of memory to hold the quantum state vectors, so the time spent on computation is much lower than that spent on memory accesses and communication. Traditional optimization techniques such as latency hiding are therefore not well suited to quantum simulation, and high-performance devices like GPGPUs cannot be fully utilized. This paper proposes the DLC (Data Locality and Communication) optimizer, which performs data locality and data layout optimizations. Both optimizations are based on identifying amplitudes that can be processed by a sequence of quantum gates. They not only increase data locality on the GPU side, but also reduce the data communication overhead and the number of data exchanges. In addition, dynamically laying out data can significantly reduce the memory space required on the GPGPU side for data communication. We evaluate our scheme on a small-scale CPU + GPU cluster. Experimental results show that for quantum circuits with 30–34 qubits, the ratio of communication to calculation increases from 12% to 79%, and a performance improvement of 1.25–7× is achieved. Theoretically, our optimizations become more effective as the number of qubits increases.

Zhikai Qin, Tao Li, Li Shen
Approximation Algorithms for Reliability-Aware Maximum VoI on AUV-Aided Data Collections

Underwater Wireless Sensor Networks (UWSNs) show great potential for data collection in ocean exploration. Recently, with the increasing amount of underwater sensed data, the Autonomous Underwater Vehicle (AUV) has been introduced as a mobile sink to collect data from sensors. Existing research mainly regards the Value of Information (VoI) as a metric of real-time value, such as data importance and timeliness, and is committed to finding a path with maximum VoI for efficient collection. However, due to the limitation of AUV energy, some sensors and their data may be omitted by the optimal path. From the perspective of integrality, data in the areas not covered by the path are indispensable for UWSN applications, and omitting them eventually reduces the reliability of the collection. To maximize VoI and improve reliability simultaneously, in this work we propose approximation algorithms for a reliability-aware AUV-aided data collection framework that extends the definition of VoI to the combination of real-time VoI and reliability VoI. To find the optimal path under the framework, we first propose a dynamic priority strategy to re-quantify the VoI of each sensor. We then utilize an existing (2+ε)-approximation algorithm to find the optimal path without considering data timeliness, formulated as the Orienteering Problem (OP). After that, we propose a novel polynomial-factor approximation algorithm that accounts for the decay of real-time VoI by reducing this variant of the OP to k-TSP. Finally, simulation results validate the effectiveness of the proposed approximation algorithms.

Hao Guo, Xiaohui Wei, Xingwang Wang, Xiaonan Wang, Chenghao Ren, Meikang Qiu
CCSBD: A Cost Control System Based on Blockchain and DRG Mechanism

Diagnosis Related Groups (DRG) allow patients to be grouped according to their initial diagnosis and to be prepaid within the group. Actual costs in follow-up treatment cannot exceed the prepaid value, achieving the purpose of medical cost control. Three problems exist in this process. First, some treatment operations are highly overlapping and therefore cannot be accurately classified. Second, classification data cannot be credibly shared across hospitals. Third, the historical payment paths required to predict costs cannot be fully traced. To address these problems, we design a Cost Control System Based on Blockchain and the DRG Mechanism (CCSBD). We propose a fusion classification model that assesses the contribution of important feature factors, leading to accurate classification. To ensure the security and consistency of shared information, we establish a Hyperledger-based blockchain architecture for secure sharing of medical data. Through smart contracts, the architecture realizes dynamic consensus endorsement of data and cross-chain authentication of departmental attributes. We realize value data screening and clinical path tracking through logical chaincode to generate reasonable cost metrics for predicting expenses, and we implement CCSBD on the Fabric consortium-chain platform. Through comparative analysis with three single classification models, we show that CCSBD improves classification accuracy by 7%. Furthermore, the security and efficiency of the sharing structure are demonstrated by smart contract latency tests and consistency attack tests.

Weiqi Dai, Yan Yu, Xia Xie, Dezhong Yao, Hai Jin
Number of UAVs and Mission Completion Time Minimization in Multi-UAV-Enabled IoT Networks

The application of unmanned aerial vehicles (UAVs) in IoT networks, especially for data collection, has received extensive attention. Due to the urgency of missions and the limitation of network cost, the number of UAVs and their mission completion time are research hotspots. Most studies focus mainly on optimizing UAV trajectories to shorten the mission completion time; however, under different data collection modes, the collection time also greatly affects the mission completion time. This paper studies the data collection of ground IoT devices (GIDs) in multi-UAV-enabled IoT networks. The data collection problem is formulated to minimize the number of UAVs and their maximum mission completion time by jointly optimizing the mission allocation of UAVs, the hovering locations, and the UAV trajectories. In view of the complexity and non-convexity of the formulated problem, we design an improved ant colony optimization (IACO) algorithm to determine the number of UAVs through mission allocation. Then, based on a data collection scheme combining flying mode (FM) and hovering mode (HM), a joint optimization algorithm (JOATC) is proposed to minimize flight time and collection time by optimizing the UAV trajectories. Simulation results show that our scheme achieves excellent performance.

Xingxia Gao, Xiumin Zhu, Linbo Zhai
A Spatial-Temporal Similarity-Based Cooperative Surveillance Framework by Edge

Most current cooperative video surveillance strategies upload all video clips the cameras take, which causes great data redundancy and bandwidth waste. In this paper, we combine temporal similarity with spatial similarity and introduce the concept of spatiotemporal similarity. In particular, we design a framework that calculates spatial-temporal similarity to reduce the complexity of collecting and transmitting source data. Besides, we model the problem of minimizing spatiotemporal similarity under a bandwidth limitation as a knapsack problem and propose a dynamic programming-based algorithm to determine which videos to upload. The results show the framework can achieve a 10% reduction in data redundancy and bandwidth usage.

Jie Tang, Yuxuan Zhao, Rui Huang
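
The knapsack formulation mentioned above can be illustrated with a standard 0/1-knapsack dynamic program: choose which clips to upload under a bandwidth budget so the selected set carries the most non-redundant content. The per-clip "value" (dissimilarity score) and "size" (bandwidth cost) below are hypothetical placeholders, not the paper's metric.

```python
# Illustrative 0/1-knapsack DP: pick clips under a bandwidth budget to maximize
# total dissimilarity (i.e. minimize redundant uploads).

def select_clips(sizes, values, budget):
    n = len(sizes)
    dp = [0.0] * (budget + 1)                       # dp[b] = best value with bandwidth b
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for b in range(budget, sizes[i] - 1, -1):
            if dp[b - sizes[i]] + values[i] > dp[b]:
                dp[b] = dp[b - sizes[i]] + values[i]
                keep[i][b] = True
    chosen, b = [], budget                          # backtrack to recover chosen clips
    for i in range(n - 1, -1, -1):
        if keep[i][b]:
            chosen.append(i)
            b -= sizes[i]
    return sorted(chosen), dp[budget]

sizes = [4, 3, 2, 5]            # bandwidth cost of each clip
values = [0.9, 0.6, 0.5, 0.7]   # dissimilarity (information) score of each clip
print(select_clips(sizes, values, budget=7))   # ([0, 1], 1.5)
```
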
A Progressive Transmission Method of Cloud Point Data for HD Map in Autonomous Driving

The High-Definition map (HD map) is an indispensable part of autonomous driving vehicle positioning and navigation. Because of its high accuracy, a general high-precision map has a very large data volume, and existing networks cannot meet the requirements for high-speed transmission. This results in significant time delays and greatly threatens driving safety. Therefore, we propose a progressive cloud point data transmission model for HD map applications. It consists of three-level modeling of data compression, transmission time, and delivered data restoration. It can also adjust the data transmission accuracy to obtain a better transmission delay according to different application demands. Experiments show that with map data being progressively delivered, autonomous driving can receive a more fluent HD map service even when the network is unstable.

Jie Tang, Kai Jiang, Rui Huang
IMRSim: A Disk Simulator for Interlaced Magnetic Recording Technology

The emerging interlaced magnetic recording (IMR) technology achieves a higher areal density for hard disk drives (HDDs) than conventional magnetic recording (CMR) technology. Unfortunately, no related disk simulator or product has been available to the public. In this work, we therefore implement the first public IMR disk simulator, called IMRSim, which simulates the interlaced tracks and implements many state-of-the-art data placement strategies. IMRSim is built on an actual CMR-based HDD to precisely simulate the I/O performance of IMR drives. We release IMRSim as an open-source IMR disk simulation tool and hope to provide a platform for related research.

Zhimin Zeng, Xinyu Chen, Laurence T. Yang, Jinhua Cui

Storage and I/O

Frontmatter
Alleviating Performance Interference Through Intra-Queue I/O Isolation for NVMe-over-Fabrics

The NVMe-over-Fabrics (NVMeoF) protocol enables high-performance Protocol Data Unit (PDU) exchanges between hosts and remote NVMe controllers. The performance benefits of NVMeoF are mainly derived from the multiple deep queue pairs used for parallel PDU transfers. NVMeoF has significantly facilitated NVMe SSD disaggregation from compute nodes for better resource utilization and scaling independence. However, as the performance of NVMe SSDs and network infrastructure increases, the near-perfect performance delivery of NVMeoF is harder to achieve. The primary reasons are the increased CPU interrupts and the performance interference among I/O requests served by the same NVMeoF queue pair. In this paper, we investigate how intra-queue requests mutually affect each other and propose PINoF, a Performance Isolated remote storage access mechanism for NVMe-over-Fabrics. PINoF separates CMD and Data PDU resources in each NVMeoF queue pair to achieve intra-queue I/O isolation, transfers PDUs in batches along read- or write-specific I/O paths to achieve isolated interrupt coalescing, and introduces differentiated PDU reordering schemes to achieve isolated scheduling. Our experimental results demonstrate that, compared with state-of-the-art NVMeoF implementations, PINoF achieves 23.92% lower latency, increases bandwidth by up to 19.59%, and improves IOPS by 12.41% on average.

Wenhao Gu, Xuchao Xie, Dezun Dong
WALOR: Workload-Driven Adaptive Layout Optimization of Raft Groups for Heterogeneous Distributed Key-Value Stores

In a heterogeneous cluster based on the Raft protocol, ALOR was proposed to solve the performance slowdown caused by a leader placed on a slow node. However, the leader distribution of ALOR is not optimal. In this paper, we propose Workload-driven Adaptive Layout Optimization of Raft groups (WALOR), which changes the leader distribution of ALOR to further improve performance by better matching the read-write request ratio of the system's workload. Our experiments on an actual heterogeneous cluster show that, on average, WALOR improves throughput by 82.96% and 32.42% compared to the even distribution (ED) solution and ALOR, respectively.

Yangyang Wang, Yunpeng Chai, Qingpeng Zhang
Efficient Data Placement for Zoned Namespaces (ZNS) SSDs

ZNS (Zoned Namespace) SSDs are a new type of SSD. They divide the entire SSD space into multiple zones, and only sequential writes are allowed within each zone. ZNS SSDs effectively improve the read/write throughput of SSDs and reduce write amplification. However, the sequential-write constraint and zone partitioning of ZNS SSDs pose challenges to existing storage allocation strategies. In this paper, we propose a new ZNS SSD-aware data placement algorithm. Specifically, inserted and modified data is placed according to its estimated lifetime, and the variance of data lifetimes within each zone is used, on top of the conventional garbage collection strategy, for zone management and garbage collection. Experiments show that the lifetime-based insertion algorithm greatly improves stability compared with average insertion and round-robin insertion algorithms, while time performance is slightly reduced due to the overhead of lifetime calculation. The lifetime-variance-aware garbage collection algorithm is 9% better than the conventional garbage collection algorithm in terms of time performance and is more stable.

Hongtao Wang, Yang Liu, Peiquan Jin, Mingchen Lu, Xiangyu Zhuang, Yuanjing Lin, Kuankuan Guo
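
A minimal sketch of lifetime-based placement, under illustrative assumptions (zone capacity, lifetime estimates, and the selection rule are hypothetical, not the paper's algorithm): each write is steered to the open zone whose current average data lifetime is closest to the estimated lifetime of the incoming data, which keeps per-zone lifetime variance low and makes later garbage collection cheaper.

```python
# Illustrative sketch: steer writes to the open zone with the closest mean lifetime.

class Zone:
    def __init__(self, zone_id, capacity_blocks):
        self.zone_id = zone_id
        self.capacity = capacity_blocks
        self.lifetimes = []          # lifetimes of data written so far (append-only)

    def full(self):
        return len(self.lifetimes) >= self.capacity

    def mean_lifetime(self):
        return sum(self.lifetimes) / len(self.lifetimes) if self.lifetimes else None

def place(zones, est_lifetime):
    """Pick the non-full zone with the closest mean lifetime; empty zones act as wildcards."""
    candidates = [z for z in zones if not z.full()]
    best = min(candidates,
               key=lambda z: abs(z.mean_lifetime() - est_lifetime)
               if z.mean_lifetime() is not None else 0.0)
    best.lifetimes.append(est_lifetime)   # sequential append models the zone write pointer
    return best.zone_id

zones = [Zone(0, 4), Zone(1, 4)]
for lt in [10, 100, 12, 95, 9]:          # short- and long-lived data interleaved
    print(lt, "-> zone", place(zones, lt))
```
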
SchedP: I/O-aware Job Scheduling in Large-Scale Production HPC Systems

Job schedulers on High Performance Computing systems can serve more purposes than just maximising computing resource utilisation if they are equipped with more awareness of other aspects of the system. In this work, we focus on making a job scheduler I/O-aware to assist system I/O management. We propose SchedP as the first practical effort on I/O-aware job scheduling that can work in a production HPC environment. It trains a neural network model to predict each job's I/O pattern, then makes a delay decision if starting a job right away would lead to I/O congestion in the system. We integrated SchedP into Slurm and evaluated it with real HPC workloads in a production environment for about a month. The results show that: a) the neural network model of SchedP reaches over 99% training and test accuracy in predicting jobs' I/O patterns; and b) SchedP has an obvious effect on alleviating system I/O contention.

Kaiyue Wu, Jianwen Wei, James Lin
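
The delay decision can be sketched as a simple admission test: if the predicted I/O demand of the candidate job plus that of already running jobs would exceed the system's I/O capacity, the job is held back. The predictor here is a trivial lookup stand-in (not the paper's neural network), and the capacity figure is a hypothetical value.

```python
# Illustrative sketch of an I/O-aware delay decision in the spirit of SchedP.

SYSTEM_IO_CAPACITY_GBPS = 100.0   # hypothetical system-wide I/O capacity

def predicted_io_demand(job):
    """Stand-in predictor: in SchedP this would come from the trained NN model."""
    return job.get("predicted_io_gbps", 0.0)

def should_delay(candidate, running_jobs, capacity=SYSTEM_IO_CAPACITY_GBPS):
    current = sum(predicted_io_demand(j) for j in running_jobs)
    return current + predicted_io_demand(candidate) > capacity

running = [{"name": "job_a", "predicted_io_gbps": 60.0},
           {"name": "job_b", "predicted_io_gbps": 25.0}]
candidate = {"name": "job_c", "predicted_io_gbps": 30.0}
print(should_delay(candidate, running))   # True -> hold the job back for now
```
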
SpacKV: A Pmem-Aware Key-Value Separation Store Based on LSM-Tree

Key-value (KV) stores based on persistent memories such as Intel Optane Pmem can deliver higher throughput and lower latency compared to traditional SSDs/HDDs. Many KV stores adopt the LSM-tree as the backbone index structure. However, the LSM-tree suffers from severe write amplification, which degrades system performance and exacerbates the wear-out of persistent memory. In this paper, we propose SpacKV, a hybrid DRAM-Pmem KV store that applies a KV separation scheme and exploits Pmem's device characteristics to achieve high throughput. We design a dedicated value storage structure that maintains the localized order of values for efficient range queries, and a compaction-triggered garbage collection mechanism that minimizes intermediate I/O overhead. Moreover, we leverage Pmem's key features, namely byte-addressability, a 256-byte access unit, and specific persistence instructions, to further mitigate write amplification. The experimental results show that SpacKV achieves 1.4–10.8×, 4.7–9.7×, and 6.7–13.5× improvements in write, read, and range query performance, respectively, over three state-of-the-art LSM-tree-based KV stores: LevelDB-Pmem, RocksDB-Pmem, and MatrixKV.

Xuran Ge, Mingche Lai, Yang Liu, Lizhou Wu, Zhutao Zhuang, Yang Ou, Zhiguang Chen, Nong Xiao
Consistent and Efficient Batch Operations for NoSQL Databases with Hybrid Timestamp

NoSQL databases such as HBase or Cassandra employ weak consistency models to provide good scalability and availability. However, they often lack functionality that would help programmers reason about the correctness of their applications. Notably, they do not support consistent batch operations that could be used for important tasks such as batch updates or maintaining secondary indexes. Some systems add transaction support to NoSQL databases, but they often bring much overhead to existing single-row operations. This paper proposes an efficient algorithm for supporting batch operations on existing NoSQL databases. It reuses the existing local timestamps and adds a global timestamp to ensure the consistency of batch operations. Our implementation based on HBase shows that, compared to transactional systems, our algorithm improves the throughput of batch operations by up to 2×. Meanwhile, the latency of single-row operations only increases by around 12%, whereas other transactional systems increase latency by over 3×.

Qianmian Yu, Jing Zhou
Backmatter
Metadata
Title
Network and Parallel Computing
Editors
Shaoshan Liu
Xiaohui Wei
Copyright Year
2022
Electronic ISBN
978-3-031-21395-3
Print ISBN
978-3-031-21394-6
DOI
https://doi.org/10.1007/978-3-031-21395-3
