
2024 | Book

Advanced Parallel Processing Technologies

15th International Symposium, APPT 2023, Nanchang, China, August 4–6, 2023, Proceedings

Editors: Chao Li, Zhenhua Li, Li Shen, Fan Wu, Xiaoli Gong

Publisher: Springer Nature Singapore

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 15th International Symposium on Advanced Parallel Processing Technologies, APPT 2023, held in Nanchang, China, during August 4–6, 2023.

The 23 full papers and 1 short paper included in this book were carefully reviewed and selected from 49 submissions. They are organized in the following topical sections: High Performance Computing and Parallelized Computing, Storage Systems and File Management, Networking and Cloud Computing, Computer Architecture and Hardware Acceleration, Machine Learning and Data Analysis, Distinguished Work from Student Competition.

Table of Contents

Frontmatter

High Performance Computing and Parallelized Computing

Frontmatter
Enhancing Multi-physics Coupling on ARM Many-Core Cluster
Abstract
In scientific and engineering computing, there are a large number of complex simulations in which multiple physical fields superimpose and interact with each other; solving such simulations is the multiphysics coupling problem. A typical approach is to decouple the complex physics problem into multiple separate physical models, which are solved independently and coupled by explicitly exchanging data with each other. The key to this method is the design of the multiphysics coupler, which transmits data between two physical models with high fidelity and high efficiency. However, current multiphysics data transmission algorithms have scalability and performance bottlenecks caused by communication and computation overhead. In this paper, we take full advantage of modern multi-core hardware to improve the performance of multiphysics data transfer algorithms, and we improve the scalability of the coupler by optimizing the communication algorithm, the search algorithm, and KD-Tree reuse strategies. Experimental results on an ARM multi-core platform show that our improved multiphysics coupling methods achieve more than 10\(\times \) acceleration compared with the original program, and the scalability of our method is also greatly improved.
Wencheng Shi, Nan Hu, Jiangsu Du, Dan Huang, Yutong Lu
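A minimal Python sketch of the KD-Tree reuse idea described in the abstract above (the SciPy-based nearest-neighbor transfer and all names are illustrative assumptions, not the authors' implementation): the search structure over the source mesh is built once and reused across coupling steps instead of being rebuilt for every data exchange.

```python
import numpy as np
from scipy.spatial import cKDTree

class CouplerSketch:
    """Transfers a field from a source mesh to a target mesh by
    nearest-neighbor search; the KD-Tree is built once and reused."""

    def __init__(self, source_points):
        # Build the KD-Tree a single time (the expensive step).
        self.tree = cKDTree(source_points)

    def transfer(self, source_field, target_points):
        # Reuse the prebuilt tree for every coupling step.
        _, idx = self.tree.query(target_points)
        return source_field[idx]

# usage: one tree, many exchanges
src = np.random.rand(10000, 3)
coupler = CouplerSketch(src)
for step in range(5):
    field = np.random.rand(len(src))                            # field produced by model A
    mapped = coupler.transfer(field, np.random.rand(2000, 3))   # consumed by model B
```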
Polaris: Enhancing CXL-based Memory Expanders with Memory-side Prefetching
Abstract
The use of CXL-based memory expanders introduces increased latency compared to local memory due to control and transmission overheads. This latency difference negatively impacts tasks that are sensitive to latency. While cache prefetching has traditionally been used to mitigate memory latency, addressing this performance gap requires improved CPU prefetch coverage. However, tuning a CPU prefetcher for CXL memory necessitates costly CPU modifications and can result in cache pollution and wasted memory bandwidth. To address these challenges, we propose a solution called Polaris, a novel CXL memory expander that integrates a hardware prefetcher in the CXL memory controller chip. Polaris analyzes incoming memory requests and prefetches cachelines to a dedicated SRAM buffer without requiring modifications to CPUs or software. In cases where prefetch hits occur, Polaris establishes a “shortcut” for rapid memory access, significantly reducing the performance gap between CXL and local DDR memory. Furthermore, if small CPU changes are allowed, such as extending Intel’s DDIO, Polaris can further minimize CXL memory access overheads by actively pushing high-confidence prefetches to the CPU’s last-level cache (LLC). Extensive experiments demonstrate that, in conjunction with various CPU-side prefetchers, Polaris enables up to 85% of common workloads (on average, 43%) to effectively tolerate CXL memory’s longer latency.
Zhe Zhou, Shuotao Xu, Yiqi Chen, Tao Zhang, Ran Shu, Lei Qu, Peng Cheng, Yongqiang Xiong, Guangyu Sun
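A toy functional model of memory-side prefetching in the spirit of the abstract above (cacheline size, buffer size, and the next-N-line policy are assumptions; Polaris itself is a hardware design inside the CXL controller chip): request addresses are observed, neighboring cachelines are staged in a small dedicated buffer, and a buffer hit takes a fast path instead of a full media access.

```python
from collections import OrderedDict

CACHELINE = 64          # bytes (assumed)
BUFFER_LINES = 1024     # capacity of the dedicated SRAM prefetch buffer (assumed)
PREFETCH_DEGREE = 2     # next-N-line policy (assumed)

class MemorySidePrefetcher:
    def __init__(self):
        self.buffer = OrderedDict()   # models the SRAM buffer with FIFO eviction
        self.hits = self.misses = 0

    def _insert(self, line):
        if line not in self.buffer:
            if len(self.buffer) >= BUFFER_LINES:
                self.buffer.popitem(last=False)   # evict the oldest entry
            self.buffer[line] = True

    def access(self, addr):
        line = addr // CACHELINE
        if line in self.buffer:
            self.hits += 1            # "shortcut": served from the SRAM buffer
        else:
            self.misses += 1          # full media access
        for d in range(1, PREFETCH_DEGREE + 1):
            self._insert(line + d)    # stage the next lines for future requests

# a sequential stream hits in the buffer after warm-up
p = MemorySidePrefetcher()
for a in range(0, 64 * 10000, 64):
    p.access(a)
print(p.hits, p.misses)
```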
ExtendLife: Weights Mapping Framework to Improve RRAM Lifetime for Accelerating CNN
Abstract
Process-in-memory (PIM) engines based on resistive random-access memory (RRAM) are used to accelerate convolutional neural networks (CNNs). RRAM performs computation by mapping weights onto its crossbars and applying a high voltage to obtain results. The computing process degrades RRAM from a fresh status, in which it supports high data precision, to an aged status, in which it supports only low precision, potentially leading to significant CNN training accuracy degradation. Fortunately, many previous studies show that different weights tolerate the loss caused by limited RRAM precision differently with respect to CNN training accuracy. This motivates us to map different weights onto RRAM in different statuses so as to keep CNN training accuracy high and to extend the number of high-accuracy training iterations of RRAM-based PIM engines, which we regard as the lifetime of the RRAM for CNN training. In this paper, we propose a method to evaluate how well a weights mapping extends the lifetime of the RRAM, and we present a weights mapping framework specifically designed for a hybrid of aged and fresh RRAM to extend the lifetime of RRAM engines for CNN training. Experimental results demonstrate that our weights mapping framework achieves up to 6.3\(\times \) average lifetime enhancement compared to random weights mapping.
Fan Yang, Yusen Li, Zeyuan Niu, Gang Wang, Xiaoguang Liu
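A minimal sketch of the hybrid fresh/aged mapping idea from the abstract above (the sensitivity metric and capacities are assumptions, not the paper's framework): weights are ranked by an accuracy-sensitivity score, and the most critical ones are placed on fresh, high-precision crossbars while the rest go to aged ones.

```python
import numpy as np

def map_weights(weights, sensitivity, fresh_capacity):
    """Assign each weight to a 'fresh' or 'aged' crossbar.

    weights        : 1-D array of CNN weights
    sensitivity    : 1-D array, e.g. accumulated gradient magnitude (assumed metric)
    fresh_capacity : number of weights the fresh crossbars can hold
    """
    order = np.argsort(-sensitivity)              # most sensitive first
    assignment = np.full(len(weights), "aged", dtype=object)
    assignment[order[:fresh_capacity]] = "fresh"  # protect accuracy-critical weights
    return assignment

w = np.random.randn(1000)
s = np.abs(np.random.randn(1000))                 # stand-in sensitivity scores
print((map_weights(w, s, 200) == "fresh").sum())  # -> 200
```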
The Optimization of IVSHMEM Based on Jailhouse
Abstract
The hypervisor, with its resource isolation, security guarantees, and ability to meet high real-time requirements, offers significant advantages in real-time scenarios. Furthermore, its communication capabilities play a crucial role in enabling collaborative computation tasks across different virtual machines. The Jailhouse hypervisor, known for its real-time capabilities and secure embedded platform, demonstrates outstanding performance in real-time scenarios. However, the inter-virtual machine (inter-VM) communication protocol based on Jailhouse is not yet mature, necessitating optimization to enhance its suitability for real-time communication scenarios. Firstly, the existing communication mechanism underwent reconstruction, involving the disabling of the one-shot interrupt mode and expanding the shared memory area. Secondly, an experimental platform was established on the Raspberry Pi-4B, configuring the real-time system and adopting the io_uring methods. Finally, experimental evaluations were conducted to assess the differences in communication delay, throughput, and data transmission delay before and after the communication protocol reconstruction. Additionally, the mitigating effect of the new communication mechanism on VMexit behavior was also evaluated. The experimental results demonstrate that the enhanced communication mechanism significantly reduces both the system call overhead and the number of VMexit compared to the native communication protocol (Inter-VM Shared Memory, IVSHMEM). Moreover, the throughput exhibits a notable improvement of approximately 200 MB/s.
Jiaming Zhang, Fengyun Li, Liu Yang, Yucong Chen, Hubin Yang, Qingguo Zhou, Yan Li, Rui Zhou
Multi-agent Cooperative Computing Resource Scheduling Algorithm for Periodic Task Scenarios
Abstract
The scheduling of large-scale service requests and jobs usually requires the service cluster to make full use of node computing resources. However, due to the increasing number of server devices, the dependence between resource allocation and requests, and the periodicity of the external requests received, scheduling edge-oriented service requests is a complicated scientific problem. Existing studies do not take into account the periodic characteristics of service requests in different periods, leading to inaccurate scheduling decisions for external requests. This paper proposes a Coordinated Multi-Agent Recurrent Actor-Critic (CMARAC) based on a recurrent network to solve the problem of computing resource allocation for periodic requests in edge computing scenarios. According to the resource information in the server cluster and the status of the task queue, system state information and historical information are captured and maintained by integrating an LSTM, and the most appropriate service resources are then selected by processing them in the Actor-Critic network. Tracking experiments using actual request data show that CMARAC can successfully learn the periodic state of external requests in the face of large-scale service requests. Compared with the baseline, the average throughput of the system implemented by CMARAC is improved by 2.1%, and the algorithm convergence rate is improved by 0.69 times. Finally, we optimized the parameters through experiments and determined the best parameter configuration for CMARAC.
Zheng Chen, Ruijin Wang, Zhiyang Zhang, Ting Chen, Xikai Pei, Zhenya Wu

Storage Systems and File Management

Frontmatter
CLMS: Configurable and Lightweight Metadata Service for Parallel File Systems on NVMe SSDs
Abstract
With the trend of running large-scale data-intensive applications on High-Performance Computing (HPC) systems, the I/O workloads of HPC storage systems are becoming more complex, such as the increasing metadata-intensive I/O operations in Exascale computing and High-Performance Data Analytics (HPDA). To meet the increasing performance requirements of the metadata service in HPC parallel file systems, this paper proposes a Configurable and Lightweight Metadata Service (CLMS) design for parallel file systems on NVMe SSDs. CLMS introduces a configurable metadata distribution policy that simultaneously enables directory-based and hash-based metadata distribution strategies, which can be activated according to the application I/O access pattern, thus improving the processing efficiency of metadata accesses from different kinds of data-intensive applications. CLMS further reduces the memory copy and serialization overhead in the I/O path through a full-user-space metadata service design. We implemented the CLMS prototype and evaluated it under the MDTest benchmarks. Our experimental results demonstrate that CLMS can significantly improve the performance of metadata services. In addition, CLMS achieves a linear growth trend as the number of metadata servers increases for the unique-directory file distribution pattern.
Qiong Li, Shuaizhe Lv, Xuchao Xie, Zhenlong Song
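A tiny illustration of a configurable metadata distribution policy like the one described above (hypothetical helper names; not the CLMS code): the same namespace can be partitioned either by hashing full paths, which spreads metadata-intensive workloads evenly, or by parent directory, which preserves locality, with the policy chosen per application.

```python
import hashlib
import os

def hash_based(path, num_servers):
    """Spread entries evenly: each full path hashes to one metadata server."""
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return h % num_servers

def directory_based(path, num_servers):
    """Keep a directory's entries together: hash only the parent directory."""
    parent = os.path.dirname(path) or "/"
    return hash_based(parent, num_servers)

def place(path, num_servers, policy="hash"):
    # the policy would be selected according to the application I/O access pattern
    return (hash_based if policy == "hash" else directory_based)(path, num_servers)

print(place("/proj/run1/file0001", 4, policy="hash"))
print(place("/proj/run1/file0001", 4, policy="directory"))
```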
Combining Cache and Refresh to Optimize SSD Read Performance Scheme
Abstract
In the era of continuous advances in flash technology, the storage density of NAND flash memory is increasing, but the availability of data is declining. To improve data availability, low-density parity-check (LDPC) codes, which offer strong error-correction capability, are used in flash memory. However, although LDPC can solve the problem of low data availability, it also brings the problem of long decoding time. Moreover, the LDPC decoding delay is related to the decoding level: the higher the decoding level, the longer the delay. Long decoding delays hurt the read performance of the flash. Therefore, to improve the speed of reading data from flash memory, this paper proposes a scheme combining cache and refresh. The main idea is to use the cache to reduce LDPC decoding time and, at the same time, refresh the high-latency pages evicted from the cache so that they can be restored to the state of low-latency pages. Experimental results show that this scheme can significantly reduce LDPC decoding delay, improve data availability with little overhead, and optimize the read performance of flash memory. Compared with the original strategy, the average response time is reduced by 24%, and the average IOPS is increased by 32%.
Jinli Chen, Peixuan Li, Ping Xie
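An illustrative model of the cache-plus-refresh read path sketched in the abstract above (latencies, levels, and thresholds are invented for illustration): reads first check a cache; on a miss the page is decoded with a latency that grows with its LDPC level, and pages at a high level are queued for a refresh that restores them to a low-latency state.

```python
class FlashReadPath:
    DECODE_LATENCY = {0: 50, 1: 80, 2: 140, 3: 260}   # us per LDPC level (illustrative)
    REFRESH_THRESHOLD = 2                              # refresh pages at level >= 2 (assumed)

    def __init__(self):
        self.cache = set()
        self.page_level = {}        # page -> current LDPC decoding level
        self.refresh_queue = []

    def read(self, page):
        if page in self.cache:
            return 10               # cache hit: no LDPC decoding needed
        level = self.page_level.get(page, 0)
        latency = self.DECODE_LATENCY[level]
        self.cache.add(page)
        if level >= self.REFRESH_THRESHOLD:
            self.refresh_queue.append(page)   # rewrite the page -> back to level 0
        return latency

    def do_refresh(self):
        for page in self.refresh_queue:
            self.page_level[page] = 0
        self.refresh_queue.clear()

fp = FlashReadPath()
fp.page_level[42] = 3
print(fp.read(42))   # slow first read; the page is queued for refresh
fp.do_refresh()
```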
CCS: A Motif-Based Storage Format for Micro-execution Dependence Graph
Abstract
Micro-execution dependence graphs model the program execution on a microprocessor as relationships of micro-execution events intra- and inter-instructions for performance analysis. Each instruction constitutes a motif whose structure is defined by the dependence graph model. With the size of the application increasing dramatically, storing a large-scale dependence graph with billions of instructions becomes difficult. However, popular graph storage formats, such as CSR and CSC, are inefficient for motifs. And the current motif-based compression methods involve the time-consuming process of subgraph isomorphism checking, which is NP-hard. To reduce redundancy, we propose a novel motif-based lossless storage format called compressed common subgraph (CCS) for micro-execution dependence graphs. The key idea is to divide the graph into the intra- and inter-motif parts and compress the common subgraph structures in the intra-motif part by storing the same structures only once. Our method avoids subgraph isomorphism checking because the motifs (instructions) are regularly arranged. Furthermore, the CCS format has two variant implementations, compressed common single subgraph (CCSS) and compressed common multiple subgraphs (CCMS) to adapt to various dependence graph models. Experimental results show that our CCSS and CCMS formats use 16.66% and 8.67% less memory size than the CSC graph format, respectively.
Yawen Zheng, Chenji Han, Tingting Zhang, Chao Yang, Jian Wang
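A small sketch of the deduplication at the heart of the CCS format as described above (the data layout is simplified and hypothetical): each instruction's intra-motif edges are put into a canonical form and stored only once in a shared table, with instructions keeping just an index, so no subgraph-isomorphism check is needed.

```python
def compress_common_subgraphs(motifs):
    """motifs: list of intra-motif edge sets, one per instruction.
    Returns (shared_table, per_instruction_index)."""
    table, index_of = [], {}
    indices = []
    for edges in motifs:
        key = tuple(sorted(edges))          # canonical form; works because motifs
        if key not in index_of:             # (instructions) are regularly arranged
            index_of[key] = len(table)
            table.append(key)
        indices.append(index_of[key])
    return table, indices

# three instructions, two of which share the same intra-motif structure
motifs = [{(0, 1), (1, 2)}, {(0, 1), (1, 2)}, {(0, 2)}]
table, idx = compress_common_subgraphs(motifs)
print(len(table), idx)   # -> 2 [0, 0, 1]
```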
Hydis: A Hybrid Consistent KVS with Effective Sync Among Replicas
Abstract
Distributed storage systems distribute user loads across regions: user requests from different geographical locations are directed to the nearest data center, which reduces service latency and improves service quality. However, maintaining consistency among regions works against the availability and richness of the underlying data services. To address these constraints, our study proposes Hydis, a hybrid-consistency distributed key-value storage system based on optimized replica synchronization. Hydis guarantees high availability and scalability for geographically distributed systems and uses Conflict-free Replicated Data Types to construct HybridLattice, which supports various consistency models. A novel Writeless-Consistency strategy is proposed to improve the synchronization efficiency between replicas, and a dynamic synchronization optimization based on this strategy is implemented for the consistency algorithms to effectively reduce the synchronization overhead of distributed storage systems. A performance evaluation of a Hydis cluster deployed on Alibaba Cloud showed that the strong consistency algorithm in Hydis outperformed the Raft algorithm by 1.8X. Additionally, the causal consistency algorithm in Hydis outperformed the traditional Vector Clock algorithm by 2.5X.
Junsheng Lou, Zichen Xu

Networking and Cloud Computing

Frontmatter
An Automatic Deployment Method for Hybrid Cloud Simulation Platform
Abstract
Simulation resources are now virtualized and deployed on cloud platforms, which can provide on-demand simulation tests and improve the efficiency of simulation systems. Simulation resources are packaged into virtual machines or containers: complex software is packaged into virtual machines, and simple simulation services are packaged into containers. However, how to deploy simulation systems under resource constraints is a problem worth studying. This paper studies a hybrid cloud simulation platform based on virtual machines and containers. An automatic deployment method is proposed to reduce the labor cost and errors of manual deployment. A simulation case is used to verify the usefulness and efficiency of our approach.
Xilai Yao, Yizhuo Wang, Weixing Ji, Qiurui Chen
Reliability Optimization Scheduling and Energy Balancing for Real-Time Application in Fog Computing Environment
Abstract
Fog computing has the characteristics of stronger localized computing power and lower data transmission load, thus better meeting the high energy efficiency, reliability, and real-time response requirements of intelligent connected vehicle applications. Currently, research on fog computing task scheduling has become a hot topic, with existing work mainly focusing on low-energy or highly real-time parallel task scheduling, which cannot meet the high reliability requirements of intelligent connected vehicle scenarios. Therefore, this paper establishes a fog computing task model based on directed acyclic graphs (DAGs) to precisely define energy, time, and reliability. To quantitatively optimize the time and reliability indicators under energy constraints, a fog computing task scheduling algorithm, ECLLRS, is proposed and compared with existing scheduling algorithms. The proposed algorithm is then used to solve DAG task list optimization problems based on fast Fourier transform (FFT) and Gaussian elimination (GE) structures. The experimental results show that, compared with the existing ECLL method, ECLLRS is significantly better at satisfying the real-time and reliability requirements of the system under a limited energy budget.
Ruihua Liu, Huijuan Huang, Yulei He, Xiaochuan Guo, Can Yan, Junhao Dai, Wufei Wu
Federated Classification for Multiple Blockchain Systems
Abstract
As blockchain technology continues to advance, it has become increasingly utilized as a fundamental infrastructure in various industries, such as business, justice, and finance. The widespread adoption of blockchain technology has created a pressing need for effective information exchange among different institutional units within blockchain networks. Fortunately, cross-chain technology has emerged as a promising solution for enhancing information interaction among diverse blockchain units. In this study, we examined several variables and employed multiple methodologies to validate our proposed hypothesis. Using cross-chain technology, we introduce a blockchain cross-chain federated learning framework (BCFL) that facilitates the interaction and mutual verification of data and parameters across different blockchains. This approach enables federated learning without the need to collect or coordinate model weights on a central server, while also enhancing the security of the federated learning process through the consensus algorithm mechanism of blockchains. Finally, we conduct a comparative analysis of the effectiveness of BCFL compared to traditional machine learning and centralized federated learning.
Zhanyi Yuan, Fuhui Sun, Yurong Cheng, Xiaoyan Wang
Towards Privacy-Preserving Decentralized Reputation Management for Vehicular Crowdsensing
Abstract
The reputation of a vehicle plays a critical role in most vehicular crowdsensing applications, which incentivize vehicles to perform crowdsensing tasks by submitting high-quality data and getting remunerated accordingly. Unfortunately, existing centralized reputation systems are vulnerable to collusion attacks, and decentralized approaches are susceptible to Sybil attacks. What's worse, both of them have privacy leakage and fairness problems. To address these issues, we take advantage of various cryptographic primitives and blockchain technology to present a privacy-preserving decentralized reputation management system. Specifically, a compact traceable ring signature is proposed to provide identity privacy protection and resist Sybil attacks. To ensure fairness, the quantification of data quality is fulfilled by combining the rating feedback mechanism with comprehensive updating factors. Additionally, our system allows the reputation to be updated automatically through smart contracts deployed on the consortium blockchain. The authenticity of the reputation can be verified by a zero-knowledge proof when a vehicle shows its reputation. Finally, a proof-of-concept prototype system built on Parity Ethereum is presented. Extensive security analysis and implementations demonstrate the feasibility and efficiency of the proposed system.
Zhongkai Lu, Lingling Wang, Ke Geng, Jingjing Wang, Lijun Sun
Delay Optimization for Consensus Communication in Blockchain-Based End-Edge-Cloud Network
Abstract
With the rapid development of smart IoT technology, various innovative mobile applications improve many aspects of our daily life. End-edge-cloud collaboration provides data transmission connecting heterogeneous IoT devices and machines, with improvements in quality of service and capacity. However, the end-edge-cloud architecture still faces challenges, including data privacy risks and transmission delay. Blockchain is a promising solution for processing data in a secure and efficient way. In this paper, blockchain is considered as an infrastructure of the end-edge-cloud network, and the time cost of the PBFT consensus is analyzed from the perspective of the leader's position. Considering the concurrent processing of tasks in cellular networks, multi-agent deep reinforcement learning is used to train the assignment strategy of the edge server. The numerical results show that the proposed method achieves better performance in terms of the time consumption of data processing.
Shengcheng Ma, Shuai Wang, Wei-Tek Tsai, Yaowei Zhang

Computer Architecture and Hardware Acceleration

Frontmatter
A Low-Latency Hardware Accelerator for YOLO Object Detection Algorithms
Abstract
Object detection is an important computer vision task with a wide range of applications, including autonomous driving, smart security, and other domains. However, the high computational requirements pose challenges for deploying object detection on resource-limited edge devices, so dedicated hardware accelerators are desired to deliver improved detection speed and latency. Post-processing is a key step in object detection and involves intensive computation on the CPU or GPU. The non-maximum suppression (NMS) algorithm is the core of post-processing; it eliminates redundant boxes belonging to the same object. However, NMS becomes a bottleneck for hardware acceleration because it requires multiple iterations and must wait for all predicted boxes to be generated.
In this paper, we propose a novel hardware-friendly NMS algorithm for FPGA accelerator design. Our algorithm alleviates the performance bottleneck of NMS by implementing the iterative algorithm as an efficient pipelined hardware circuit. We validate the algorithm on the VOC2007 dataset and show that it introduces only a 0.27% difference compared to the baseline NMS. Additionally, the exponential and sigmoid functions are extremely hardware-costly. To address this issue, we propose an approximate exponential function circuit that computes both functions with minimal logic cost and zero DSP cost.
We deploy our post-processing accelerator on Xilinx's Alveo U50 FPGA board. The final design achieves an end-to-end detection latency of 283 µs for the YOLOv2 model. Following the user guides provided by Xilinx and Intel, we converted the logic resources of the different FPGA implementations into LUT resources and compared the resource utilization of the acceleration module in the current state-of-the-art object detection system deployed on Intel FPGAs with ours. Compared with it, we consume 13.5\(\times \) fewer LUT resources and use far fewer DSP resources.
Aibin Wang, Youshi Ye, Yu Peng, Dezheng Zhang, Zhihong Yan, Dong Wang
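For reference, a plain software NMS of the kind the abstract identifies as the post-processing bottleneck (standard textbook formulation, not the paper's hardware-friendly variant): boxes are sorted by score, and any box overlapping an already-kept box above an IoU threshold is suppressed, which is inherently iterative and must wait for all predicted boxes.

```python
import numpy as np

def iou(box, boxes):
    """box: [x1, y1, x2, y2]; boxes: (N, 4). Returns IoU of box against each row."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a + b - inter)

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(-scores)               # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]   # suppress overlapping boxes
    return keep
```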
SCFM: A Statistical Coarse-to-Fine Method to Select Cross-Microarchitecture Reliable Simulation Points
Abstract
With computer microarchitectures advancing and benchmark sizes expanding, the need for agile pre-silicon performance estimation becomes increasingly crucial. SimPoint is a widely used sampling method for this problem, making it a promising research area. However, previous studies mainly focus on how to enhance the estimation accuracy, speedup, and usability of SimPoint, while ignoring the critical problem of cross-microarchitecture estimation reliability. We have observed that although SimPoint can provide an accurate performance estimation, it can fail to yield reliable estimations across different microarchitectures due to the difficulty of (a) rapidly evaluating the cross-microarchitecture reliability of SimPoint and (b) effectively selecting reliable simulation points.
To address this problem, we propose SCFM, a statistical coarse-to-fine method to select cross-microarchitecture reliable simulation points. The SCFM introduces two key metrics: the micro-independent metric \(E_{repre}\) and micro-dependent metric Loss, to rapidly evaluate the simulation points. Our method could efficiently scan a large SimPoint parameter space by rapidly evaluating their program characteristic representation abilities and precisely assessing their cross-microarchitecture estimation capabilities. To verify the effectiveness of SCFM, we conducted thorough evaluations, configuring thirty distinct machine models to select reliable simulation points and preparing three test models to implement the verification. Experimental results demonstrate that the final-selected reliable simulation points could yield statistically accurate estimations for SPEC CPU 2006 on the test models, giving average errors of less than 1%.
Chenji Han, Hongze Tan, Tingting Zhang, Xinyu Li, Ruiyang Wu, Fuxin Zhang
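For context, the standard SimPoint-style estimate that selected simulation points feed into (textbook formula, not SCFM itself): a whole-program metric is approximated as the weighted average of the metric measured on each simulation point.

```python
def simpoint_estimate(point_metrics, weights):
    """Whole-program CPI (or another metric) estimated from simulation points.
    point_metrics[i] is the metric measured on simulation point i;
    weights[i] is the fraction of program intervals that point represents."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(m * w for m, w in zip(point_metrics, weights))

print(simpoint_estimate([1.2, 0.9, 1.5], [0.5, 0.3, 0.2]))  # ≈ 1.17
```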
On-Demand Triggered Memory Management Unit in Dynamic Binary Translator
Abstract
User-level Dynamic Binary Translators (DBTs) linearly map the guest virtual memory to host virtual memory to achieve optimal performance. When the host page size exceeds the guest page size, multiple small guest pages are mapped to a single large host page, resulting in inappropriate permission mapping. DBTs face security and correctness risks when accessing the inappropriately mapped host page. Our survey reveals that most state-of-the-art user-level DBTs suffer from these risks. System-level DBTs can avoid these risks through a software Memory Management Unit (MMU); however, the software MMU fully emulates guest memory management, leading to slower performance than the linear mapping approach of user-level DBTs.
To balance performance and risk, we propose a DBT memory management method named On-Demand Triggered MMU (ODT-MMU), which combines the strengths of both user-level and system-level DBTs. ODT-MMU uses linear mapping for non-risky page accesses and triggers a software MMU when accessing risky pages. We implement ODT-MMU in two ways to accommodate various application scenarios: a platform-independent implementation named ODT-InterpMMU, and a hardware-accelerated implementation named ODT-ManipTLB. ODT-ManipTLB is designed for host Instruction Set Architectures (ISAs) that support a programmable TLB. Experimental results demonstrate that both implementations effectively mitigate the page-size-related risks. Furthermore, ODT-ManipTLB achieves over 2000\(\times \) performance improvement compared with ODT-InterpMMU while maintaining performance comparable to the DBT without ODT-MMU. Additionally, our work has been applied to two industrial DBTs, XQM and LATX.
Benyi Xie, Xinyu Li, Yue Yan, Chenghao Yan, Tianyi Liu, Tingting Zhang, Chao Yang, Fuxin Zhang
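A high-level functional sketch of the on-demand triggering idea described above (page sizes, the risky-page set, and the page-table layout are illustrative, not the DBT's actual data structures): ordinary guest accesses go through cheap linear mapping, and only addresses falling in pages flagged as risky are routed through a software-MMU walk.

```python
GUEST_PAGE = 4096           # e.g. 4 KiB guest pages (illustrative)
HOST_PAGE  = 16384          # e.g. 16 KiB host pages (illustrative)
BASE_OFFSET = 0x10000000    # linear guest->host offset (illustrative)

risky_guest_pages = {0x2000 // GUEST_PAGE}   # small guest pages with mismatched permissions

def software_mmu(vaddr, guest_page_table):
    # Full emulation: walk the guest page table and check permissions.
    entry = guest_page_table[vaddr // GUEST_PAGE]
    if not entry["present"] or not entry["readable"]:
        raise MemoryError("guest fault at 0x%x" % vaddr)
    return entry["host_frame"] * HOST_PAGE + vaddr % GUEST_PAGE

def translate(vaddr, guest_page_table):
    if vaddr // GUEST_PAGE in risky_guest_pages:
        return software_mmu(vaddr, guest_page_table)   # triggered on demand
    return vaddr + BASE_OFFSET                          # fast linear mapping

gpt = {2: {"present": True, "readable": True, "host_frame": 7}}
print(hex(translate(0x2000, gpt)))   # risky page -> software MMU path
print(hex(translate(0x5000, gpt)))   # normal page -> linear mapping
```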
MFHBT: Hybrid Binary Translation System with Multi-stage Feedback Powered by LLVM
Abstract
The shortage of applications has become a major concern for new Instruction Set Architectures (ISAs). Binary translation is a common solution to overcome this challenge. However, the performance of binary translation depends heavily on the quality of the translated code. To achieve high-quality translation, recent studies focus on integrating binary translators with compilation optimization methods. Nevertheless, such integration faces two main challenges. Firstly, it is hard to employ complex compilation optimization techniques in a dynamic binary translator (DBT) without introducing significant runtime overhead. Secondly, implementing register mapping in the compiler, which can reduce the expensive memory access instructions generated to maintain the guest CPU state, is challenging. To resolve these challenges, we propose a hybrid binary translation system with multi-stage feedback, combining dynamic and static binary translators, named MFHBT. This system eliminates the runtime overhead caused by compilation optimization. Additionally, we introduce a mechanism to implement register mapping through inline constraints and stack variables in the compiler. We implement a prototype of this system powered by LLVM. Experimental results demonstrate an 81% decrease in the number of memory access instructions and a performance improvement of 3.28 times compared to QEMU.
Zhaoxin Yang, Xuehai Chen, Liangpu Wang, Weiming Guo, Dongru Zhao, Chao Yang, Fuxin Zhang
Step and Save: A Wearable Technology Based Incentive Mechanism for Health Insurance
Abstract
The market for wearables has grown explosively over the past few years. The majority of these devices are related to health care and fitness. Unfortunately, users easily lose interest in them and thus fail to improve their health. Recently, “be healthy and be rewarded” programs have been gaining popularity in the health insurance market: insurance companies give financial rewards to policyholders who take the initiative to stay healthy. This gives policyholders incentives to lead a healthier lifestyle, and the insurer can also benefit from fewer medical claims. However, there are hardly any studies discussing how to design the incentive mechanism in this emerging health promotion program; improper design would fail to change policyholders' unhealthy behavior, and the insurer could not benefit from it. In this paper, we propose a mechanism for this health promotion program. We model it as a monopoly market using contract theory, in which there is one insurer and many policyholders, and we theoretically analyze how all parties would behave in this program. We propose a design that guarantees policyholders will faithfully participate in the program while the insurer maximizes its profit. Simulation results show that the insurer can improve its profit by \(40\%\) using the optimal contract.
Qianyi Huang, Wei Wang, Qian Zhang

Machine Learning and Data Analysis

Frontmatter
Spear-Phishing Detection Method Based on Few-Shot Learning
Abstract
With the further development of Internet technology, various online activities are becoming more frequent, especially online office work and online transactions. This trend makes network security issues increasingly prominent: the security situation is more complex, and attack methods and means emerge endlessly. Because of its characteristics of target accuracy, attack durability, camouflage concealment, and damage severity, spear-phishing has become the most common initial means for attackers and APT organizations to invade targets. Thus, automated spear-phishing detection based on machine learning and deep learning has become a focus of researchers in recent years. However, because of its smaller scope and lower attack frequency, the number of spear-phishing emails is very limited; how to detect spear-phishing with machine learning and deep learning from small samples has therefore become a key issue. Meanwhile, in machine learning and deep learning, few-shot learning aims to train a good classification model with only a few samples. We therefore propose a spear-phishing detection method based on few-shot learning that combines the basic features and the message body of emails. We propose a simple word-embedding model to analyze the message body, which processes message bodies of different lengths into text feature vectors of the same dimension, thus retaining semantic information to the greatest extent. The text feature vectors are then combined with the basic features of emails and input into commonly used machine learning classifiers for detection. Our simple word-embedding method does not require complex model training to learn a large number of parameters, thereby reducing the model's dependence on large amounts of training data. The experimental results show that the proposed method achieves better performance than existing spear-phishing detection methods; in particular, its advantages are more obvious with small samples.
Qi Li, Mingyu Cheng
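A minimal sketch of the fixed-dimension body representation idea described above (averaged pretrained word vectors and a logistic-regression classifier are stand-ins; the paper's exact embedding model and classifiers may differ): message bodies of any length become same-size vectors, which are concatenated with the basic email features and fed to an ordinary classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 100
word_vectors = {}   # word -> np.ndarray(DIM); assumed to be loaded from pretrained embeddings

def embed_body(body):
    """Average word vectors so bodies of different lengths share one dimension."""
    vecs = [word_vectors[w] for w in body.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def features(email):
    # email = {"body": str, "basic": list[float]}  (basic = header/metadata features)
    return np.concatenate([embed_body(email["body"]), np.asarray(email["basic"])])

def train(emails, labels):
    # no embedding parameters are trained, so few labeled samples suffice
    X = np.stack([features(e) for e in emails])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```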
Time Series Classification Based on Data-Augmented Contrastive Learning
Abstract
Time series classification has become a popular research topic in data mining and has a wide range of applications in many fields in daily life. When analyzing and classifying time series, it is challenging to address their dynamic distribution characteristics and preserve key temporal information. In this paper, we propose a novel time series classification algorithm based on data-augmented contrastive learning. The proposed model consists of four parts, the Data Augmentation module, the Encoder, the Feature Space Contrastive Learning module and the Classifier. The four parts work together to jointly accomplish the task of time series classification. During the process of training the time series representation encoder, we adopt a loss function combining contrastive loss and classification loss to optimize the encoder, which can learn label-related representations from time series data and extract internal features. We conduct extensive experiments based on 30 open datasets, which show that the proposed method outperforms the state-of-the-art baseline algorithms.
Junyao Wang, Jiangyi Hu, Taishan Xu, Xiancheng Ren, Wenzhong Li
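A condensed PyTorch-style sketch of the joint objective described above (the encoder/classifier interfaces, the contrastive form, and the λ weight are assumptions): augmented views of each series are pulled together by a contrastive term while a classification head trains on the same encoder, and the two losses are combined.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Simple contrastive loss between two batches of augmented-view embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                       # similarity of view1[i] vs view2[j]
    targets = torch.arange(z1.size(0), device=z1.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

def joint_loss(encoder, classifier, x, x_aug, y, lam=0.5):
    z, z_aug = encoder(x), encoder(x_aug)                    # representations of both views
    cls_loss = F.cross_entropy(classifier(z), y)             # label-related supervision
    con_loss = nt_xent(z, z_aug)                             # internal-feature supervision
    return cls_loss + lam * con_loss                         # combined training objective
```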
From Ledger to P2P Network: De-anonymization on Bitcoin Using Cross-Layer Analysis
Abstract
Cryptocurrencies, with their characteristics of decentralization and anonymization, have emerged and attracted widespread attention from various parties. However, cryptocurrency anonymization breeds illegal activities such as money laundering, gambling, and phishing, so it is essential to deanonymize cryptocurrency transactions. This paper proposes a cross-layer analysis method for Bitcoin transaction deanonymization. By acquiring large-scale original transaction information and combining the characteristics of the network layer and the transaction layer, we propose a propagation pattern extraction model and an associated address clustering model. We match a suspected transaction with the originator's IP address with high precision and low overhead. Through experimental analysis in a real Bitcoin system, the cross-layer method can effectively match the original transaction with the target node, reaching an accuracy of 81.3%, which is 30% higher than the state-of-the-art method. By controlling several factors, such as different times and nodes, the characteristics of the extracted transaction propagation pattern are shown to be reasonable and reliable. The practicality and effectiveness of the cross-layer analysis are higher than those of a single-level scheme.
Che Zheng, Shen Meng, Duan Junxian, Zhu Liehuang
Robust Online Crowdsourcing with Strategic Workers
Abstract
Crowdsourcing has facilitated a wide range of applications by leveraging public workers to contribute to a large number of tasks. However, most prior works only considered static environments and overlooked system dynamics. In practice, the task set to be allocated is time-varying, and workers may be strategic when deciding whether to accept tasks. In this paper, we formulate the online crowdsourcing problem as a sequential optimization problem, in which a requestor needs to allocate tasks repeatedly to the workers to maximize the long-term cumulative utility. To deal with the dynamics, we first build an environmental model to predict the system dynamics; the model can also embed the tasks into a fixed lower-dimensional space. Next, we propose a multi-agent reinforcement learning algorithm to optimize the allocation mechanism for the requestor. The underlying intuition is that the mechanism can remain robust even with adversarial workers. We conducted extensive experiments to evaluate the performance. The results validate that our method achieves the best performance in almost all cases and remains robust when deployed in an adversarial environment.
Bolei Zhang, Jingtao Zhang, Lifa Wu, Fu Xiao

Distinguished Work from Student Competition

Frontmatter
New Filter2D Accelerator on the Versal Platform Powered by the AI Engine
Abstract
Filter2D, a fundamental operator of CNNs, is of vital optimization and acceleration significance in computer vision (CV) applications, so it was chosen as the CV track of the CCFSys-CCC2023 competition. Based on the Versal ACAP architecture designated by the CCC2023 competition, we propose an AI Engine (AIE) kernel and AIE graph design scheme and reconstruct the programmable logic (PL) and Processing System (PS) accordingly. Results show that, compared to the PS-only scheme, our design achieves about a 104.51\(\sim \)139.41\(\times \) speedup on the specified Versal ACAP platform, outperforming all other 50+ groups and winning the championship of CCC2023.
Wenbo Zhang, Tianshuo Wang, Yiqi Liu, Yiming Li, Zhenshan Bao
Backmatter
Metadata
Title
Advanced Parallel Processing Technologies
Editors
Chao Li
Zhenhua Li
Li Shen
Fan Wu
Xiaoli Gong
Copyright Year
2024
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9978-72-4
Print ISBN
978-981-9978-71-7
DOI
https://doi.org/10.1007/978-981-99-7872-4
