
Network and Parallel Computing

21st IFIP WG 10.3 International Conference, NPC 2025, Nha Trang, Vietnam, November 14–16, 2025, Proceedings, Part I

  • 2026
  • Book

About this book

This two-volume set, LNCS 16305 and 16306, constitutes the proceedings of the 21st IFIP International Conference on Network and Parallel Computing, NPC 2025, held in Nha Trang, Vietnam, during November 14–16, 2025.

The 76 full papers included in this volume were carefully reviewed and selected from 223 submissions. Topics of interest include, but are not limited to, parallel and distributed applications and algorithms, parallel and distributed architectures and systems, and parallel and distributed software environments and tools.

Table of Contents

Frontmatter
HSampler: Optimizing Multi-GPU GNN Sampling with Collision-Avoid Selection

Graph Neural Networks (GNNs) have emerged as a leading approach to graph learning tasks in recent years. As real-world graphs grow larger, sampling-based GNN training is widely used in both academia and industry instead of training on the whole graph. Because the training stage consumes little execution time on the small sampled subgraphs, the data preparation stage becomes the performance bottleneck, especially graph sampling. Many sampling methods have been proposed to improve efficiency, but they still suffer from the significant overhead of high-degree node selection collisions in biased sampling. In this paper, we present HSampler, a multi-GPU GNN sampling system. It selects sampled subgraphs with reordering and sliding windows to reduce repeated trials in biased sampling, thus improving the performance of the graph sampling stage. Evaluation on a node with 4 GPUs shows that HSampler significantly outperforms state-of-the-art systems such as DGL and P³ by 2.1× to 6.2×.

Yuyang Jin, Jidong Zhai, Kezhao Huang, Weimin Zheng
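The collision overhead that HSampler targets can be illustrated with a toy rejection-style sketch (not the paper's algorithm): when one high-degree neighbor carries most of the sampling weight, drawing k distinct neighbors without replacement wastes many trials on repeated picks.

```python
import random

def weighted_sample_without_replacement(neighbors, weights, k, rng):
    """Naive rejection-based biased sampling: redraw whenever an
    already-selected (collided) neighbor comes up again."""
    chosen, trials = set(), 0
    while len(chosen) < k:
        trials += 1
        # draw one neighbor with probability proportional to its weight
        v = rng.choices(neighbors, weights=weights, k=1)[0]
        chosen.add(v)  # a repeated draw is a "collision" and is wasted work
    return sorted(chosen), trials

rng = random.Random(0)
# one high-weight (high-degree) neighbor dominates the draws,
# so rejection sampling spends extra trials on collisions
neighbors = list(range(10))
weights = [100] + [1] * 9
sample, trials = weighted_sample_without_replacement(neighbors, weights, 5, rng)
```

Techniques such as HSampler's reordering and sliding-window selection aim to cut exactly these wasted trials.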
ServScale: Concurrency-Aware Serverless Execution and Scaling Paradigm

Many modern cloud applications require intra-instance concurrency to efficiently handle multiple requests or tasks in parallel. While platforms like Kubernetes are widely adopted for scalable deployment on Infrastructure-as-a-Service (IaaS), this approach often leads to resource inefficiencies due to data center tax and idle containers. Serverless computing provides elastic scaling and pay-as-you-go benefits, but migrating such applications remains challenging. Most function containers of existing serverless platforms process requests serially and rely solely on horizontal scaling, forcing developers to modify application logic or over-provision resources, resulting in compromised runtime efficiency for typical workloads such as microservices. We propose ServScale, a concurrency-aware serverless execution and scaling system that efficiently handles the tradeoff between intra-instance and inter-instance concurrency. ServScale introduces a hybrid scaling mechanism combining horizontal and vertical scaling, dynamically adapting to real-time workload characteristics. We design specialized prewarm and node scheduling strategies to minimize scaling latency while maintaining application QoS. Implemented on Kubernetes, ServScale reduces core-hour consumption by 20.3% compared to multi-threading-enabled serverless systems using only horizontal scaling, and by 50.1% compared to traditional IaaS-based deployments.

Zichen Xu, Zijun Li, Quan Chen, Minyi Guo
Pallas: Optimizing LLM-Based Anomaly Traffic Classification with Compressed Prompt Engineering

To alleviate the impact of Distributed Denial-of-Service (DDoS) attacks, many traffic classifiers have been deployed to filter malicious traffic. Most existing classifiers are based on machine learning (ML) or deep learning (DL), which require comprehensive data collection or well-designed algorithms. These well-trained classifiers usually perform poorly on new, unseen traffic distributions and need considerable time to adapt. Motivated by the excellent generalization promise of Large Language Models (LLMs), some LLM-based classifiers have been proposed. However, direct application of foundation LLMs yields unsatisfactory results due to the high-dimensional and non-text format of network traffic data. To this end, we propose Pallas, a compressed prompt engineering approach for optimizing LLM-based anomaly traffic classifiers. Pallas consists of two main components: i) Feature Purification is responsible for selecting highly discriminative features to eliminate noisy features and reduce token overhead; ii) Prompt Engineering Integration is designed to efficiently inject background knowledge into LLMs and employs carefully crafted reasoning templates to reduce hallucinations and enhance classification performance. Experimental results on public datasets show that Pallas improves accuracy by 9.78% compared with the state-of-the-art, while saving 69.37% of token consumption for users.

Hengxian Wang, Changhao Qiu, Bangbang Ren, Rui Chen, Lailong Luo, Deke Guo
A Reputation-Driven Malicious User Detection for Truth Discovery in Mobile Crowdsensing

Truth discovery has emerged as a vital technique for resolving data conflicts in Mobile Crowdsensing. While prior studies have primarily focused on securing data transmission, they often overlook a critical threat: the distortion of inferred truths by malicious or unreliable users. In practice, even users deemed trustworthy may unintentionally provide inaccurate data due to environmental noise or coordinated manipulation by adversaries. To tackle these challenges, we propose a novel truth discovery algorithm that leverages the inherent volatility of sensing data to detect malicious users. Unlike traditional methods, our approach combines statistical inference with a dynamic reputation mechanism to identify and suppress malicious behaviors over time. Specifically, we first construct a baseline reputation scoring framework under the assumption that users are benign, using confidence intervals to quantify their reliability. We then integrate hypothesis testing and data volatility analysis to flag abnormal reporting patterns that may indicate adversarial behavior. Extensive simulations show that our method not only outperforms existing solutions in accurately identifying malicious users but also significantly enhances the overall quality of the aggregated truth.

Dingwen Chi, Jun Tao, Yu Gao, Haotian Wang
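As a minimal illustration of the truth-discovery loop this line of work builds on (the paper's reputation scoring and hypothesis-testing machinery is not reproduced here), the classic iteration alternates between reliability-weighted truth estimation and error-based reliability updates; all user names and values below are illustrative.

```python
def truth_discovery(reports, rounds=10):
    """Iterative truth discovery: estimate each object's truth as a
    reliability-weighted mean, then re-score each user's reliability
    from its squared distance to the current truth estimates."""
    users = sorted(reports)
    objects = sorted(next(iter(reports.values())))
    weights = {u: 1.0 for u in users}
    truths = {}
    for _ in range(rounds):
        total = sum(weights.values())
        truths = {o: sum(weights[u] * reports[u][o] for u in users) / total
                  for o in objects}
        errors = {u: sum((reports[u][o] - truths[o]) ** 2 for o in objects)
                  for u in users}
        # inverse-error weighting: unreliable users are suppressed over time
        weights = {u: 1.0 / (errors[u] + 1e-9) for u in users}
    return truths, weights

reports = {
    "alice":   {"temp": 20.1, "humid": 55.2},
    "bob":     {"temp": 19.9, "humid": 54.8},
    "mallory": {"temp": 35.0, "humid": 90.0},  # adversarial reports
}
truths, weights = truth_discovery(reports)
```

After a few rounds the adversarial user's weight collapses and the aggregated truth tracks the honest reports, which is the behavior the reputation mechanism in the abstract strengthens against coordinated manipulation.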
CFLBD: Distance-Informed Dynamic Clustering via Bhattacharyya Metrics for Federated Learning

Federated learning (FL) is a privacy-preserving distributed learning paradigm in which clients perform localized training iterations and send model updates to a central server rather than sharing raw data. However, data heterogeneity among clients significantly impacts the performance of the global model. Existing FL methods tackle this issue by optimizing the loss function through client-side training, but fail to account for the distinct distribution differences of client data and rely on uniform client weighting during server-side aggregation. This limitation restricts the model’s effectiveness and weakens its generalization ability. To address this, we propose a novel FL optimization method, Clustering FL with Bhattacharyya Distance (CFLBD). This approach integrates the local similarity assessment based on client data heterogeneity and adaptive clustering to dynamically weigh the aggregation of local models in each communication round. Specifically, CFLBD quantifies the distribution similarity by calculating the Bhattacharyya distance between the local and global models. Then, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is leveraged to adaptively cluster the local models on the server, minimizing the distribution variance within each cluster. The global aggregation relies on a dynamically weighted average of the clustered clients, enhancing its performance and robustness across heterogeneous data scenarios. Experimental results show that CFLBD consistently outperforms baseline methods in accuracy and robustness across various heterogeneity settings.

Xiaowen Duan, Rui Zhao, Rui Zhou, Lei Qiao, Xin Liu, Qingguo Zhou
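For reference, the Bhattacharyya distance that CFLBD computes between local and global models has a simple closed form for univariate Gaussians; the paper's exact distribution representation may differ, so this is only a sketch of the metric itself.

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians
    N(mu1, var1) and N(mu2, var2)."""
    # variance term: penalizes mismatched spreads
    term_var = 0.25 * math.log(0.25 * (var1 / var2 + var2 / var1 + 2.0))
    # mean term: penalizes separated centers, scaled by the pooled variance
    term_mean = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    return term_var + term_mean
```

Identical distributions give distance 0, and the distance grows with mean separation, which is what makes it usable as a clustering metric (e.g., as input to DBSCAN, as in the abstract).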
SAMSformer: A Multi-scale Prediction Model Based on Parallel Transformer

Existing Transformer-based time series forecasting models generally suffer from inadequate adaptive multi-scale capability and a decline in accuracy for long-term predictions. This paper proposes SAMSformer, a time series forecasting framework based on an adaptive multi-scale parallel Transformer for long-term forecasting. First, the model identifies the time series data characteristics through a sequence decomposition module. It then uses a dynamic selector to adaptively choose time scales based on the data features and assigns weights to different scales. The multi-branch parallel Transformer structure models time segments of different scales in parallel. Each branch utilizes a Local-Global Attention mechanism to compute local and global attention in parallel at different segment granularities. Local attention captures the local dependencies within each time segment, while global attention uses a multi-scale segment attention mechanism to extract global dependencies across time segments. Finally, to address the issue of accuracy decay in long-term predictions, the Time-Decaying Robust Loss function is introduced. Experiments on benchmark datasets from finance, transportation, electricity, and other domains demonstrate that the proposed method outperforms existing baseline models, making it a viable choice as a foundational architecture for time series forecasting.

Haiwei Xia, Houqun Yang, Fen Chen, Hongjuan Xue, Zhengyu Li, Xia Xie
Computing Measurement-Based Deployment of Service Function Chains in Computing Power Networks

Current service function chain (SFC) deployment algorithms face challenges in adapting to computing power networks (CPN) characterized by multidimensional resources and high dynamics, which hampers their ability to meet high-quality service demands. We investigate the SFC deployment problem in CPN for achieving low end-to-end latency service provisioning. Specifically, a computing measurement (CM) model is established for evaluating the performance of computing nodes considering multidimensional resources. Furthermore, we propose a novel heuristic SFC deployment algorithm, called CM-based routing selection and virtual network function (VNF) placement (CMRP), to find an approximate solution. Extensive simulations demonstrate that the performance of the presented algorithm closely approximates the optimal solution in small-scale networks and surpasses that of the compared algorithm in terms of end-to-end latency in large-scale networks.

Yuhan Zhang, Ran Wang, Jie Hao, Qiang Wu, Zehui Xiong, Jiawen Kang
Exploiting Transformer-Based Static Binary Analysis for Identifying Inefficient Locks

Multithreaded production software often suffers from lock-related inefficiencies that cause severe performance degradation. These issues are difficult to detect before a significant performance drop, and even skilled programmers struggle to resolve them without knowing whether frequent lock acquisition or contention is to blame. As transformer-based language models have shown strong potential in automating code analysis, we present Luna, a transformer-based static binary analysis tool for identifying inefficient locks. We formalize the classification task for frequent lock acquisition and contention in multithreaded binaries and design a transformer-based model with calling context awareness. By combining this model with static control flow analysis, Luna can identify suspicious lock operations along with their inefficiency type and call path attribution. Guided by Luna, developers can detect inefficient locks without executing the program and achieve significant performance gains through early optimization.

Zhibo Xuan, Xin You, Hailong Yang, Jingqi Chen, Zhongzhi Luan, Yi Liu, Depei Qian
Open Services Availability First-Based Routing and Scheduling Optimization for Wide-Area Deterministic Networks

Emerging applications, such as large-scale engineering computing, are accelerating the development of wide-area deterministic networks to accommodate increasing traffic demands and differentiated service requirements. However, existing technologies face substantial scalability and coordination challenges in multi-domain, heterogeneous environments, rendering them unsuitable for large-scale deployments. Moreover, relying on a single technological paradigm is insufficient to support an open, wide-area deterministic network at scale. To overcome these limitations, this paper proposes a large-scale deterministic network architecture based on open services availability first (OSAF). The proposed architecture employs cross-domain collaboration and a hierarchical management mechanism to support tiered resource scheduling and dynamic path optimization in heterogeneous cross-domain networks. In addition, a Transformer-based reinforcement learning algorithm (T-DRL) is introduced, enabling OSAF to optimize routing and scheduling decisions with a latency-first objective, thereby enhancing deterministic quality of service (QoS) guarantees for end-to-end differentiated open services.

Shengnan Cao, Qiang Wu, Ran Wang, Shuyang Li
Incremental Distributed Algorithms for Game-Theoretic Betweenness Centralities in Dynamic Graphs

Maintaining game-theoretic betweenness centralities in highly dynamic networks is challenging due to the high computational cost of recalculating them from scratch. This paper presents distributed incremental algorithms in the classic CONGEST model for maintaining Shapley- and semi-value-based betweenness centralities. By addressing the challenges of parallel traversal congestion and communication overhead, we propose incremental algorithms with round complexity O(D^G(A_B^max + D_B^max) + |F_Batch| + |Batch|) for multi-edge updates. Here, D^G, A_B^max, and D_B^max denote the diameter of the graph, the maximum number of articulation points, and the maximum diameter of the biconnected components, respectively; |F_Batch| and |Batch| denote the number of affected vertices resulting from insertions and the number of inserted edges, respectively. Experimental results demonstrate that the proposed multi-edge incremental algorithm achieves speedup factors of up to 7× and 16× compared to the single-edge incremental algorithm and the static algorithm, respectively.

Yefei Wang, Qiang-Sheng Hua, Wenjie Gao, Hai Jin
LightCacheRL: A Lightweight Reinforcement Learning Framework for Unified Cache Management

Recent advancements have demonstrated the potential of reinforcement learning (RL) in enhancing cache management policies. However, existing approaches still face two critical limitations: (1) most methods treat prefetching and cache replacement as separate tasks, which restricts their adaptability to diverse and dynamic workloads; and (2) the high computational and storage overheads of typical RL techniques hinder their deployment in real-world hardware systems. To overcome these challenges, we propose LightCacheRL, a lightweight and practical RL-based cache management framework that unifies prefetching and replacement decisions within a single formulation. Specifically, LightCacheRL models the cache management problem as a lightweight online Multi-Armed Bandit (MAB) process, enabling efficient and adaptive policy learning with minimal runtime overhead. The reward function in LightCacheRL integrates both instructions-per-cycle (IPC) and system-level bandwidth feedback, providing a hardware-aware optimization objective. We conduct comprehensive evaluations through simulation and hardware synthesis across single-core and multi-core configurations. Experimental results show that LightCacheRL achieves an 11.5% average IPC improvement over LRU on a 16-core system, and outperforms state-of-the-art policies, including Hawkeye (4.35%), Glider (3.75%), and Mockingjay (2.1%), with negligible hardware cost.

Kunming Zhang, Zhihua Fan, Yingchun Fu, Yanhuan Liu, Lexin Wang, Yuqun Liu, Haibin Wu, Wenming Li
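A multi-armed bandit formulation like the one LightCacheRL adopts can be sketched with a minimal epsilon-greedy learner. This is illustrative only: the arms stand in for hypothetical cache-policy choices, and the synthetic reward is a stand-in for the IPC-plus-bandwidth signal described in the abstract.

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy multi-armed bandit: per-arm state is just a
    pull count and a running mean reward, so overhead stays tiny."""
    def __init__(self, n_arms, epsilon=0.1, rng=None):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.epsilon = epsilon
        self.rng = rng or random.Random()

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental mean: no replay buffer or neural network needed
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

rng = random.Random(42)
bandit = EpsilonGreedyBandit(3, epsilon=0.1, rng=rng)
# arm 2 yields the highest mean reward in this toy simulation
means = [0.2, 0.5, 0.8]
for _ in range(2000):
    arm = bandit.select()
    bandit.update(arm, means[arm] + rng.gauss(0, 0.05))
```

The appeal for hardware deployment is visible in the state size: two scalars per arm, as opposed to the tables or networks of heavier RL policies.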
Joint Optimization of Computation and Communication Resources for GPU Allocation in Heterogeneous Clusters

The exponential growth of deep neural network (DNN) model size and data volume makes distributed training indispensable in cloud environments. Today’s cloud datacenters typically operate as heterogeneous clusters, where computing nodes are equipped with diverse GPU generations. In such environments, cloud providers face the critical challenge of optimally matching user-submitted training jobs with suitable GPU resources to accelerate the job training process. Existing works either improve computational efficiency by leveraging affinity between heterogeneous GPUs and jobs, or arrange jobs on the same type of GPUs as much as possible to reduce cross-rack communication overhead. However, none of these methods simultaneously considers both computation and communication factors, despite their combined importance in determining job completion time (JCT) for distributed training. In this paper, we propose HetSpeed, a heterogeneity-aware resource allocation method that jointly optimizes computational efficiency and communication overhead to speed up the model training process. HetSpeed formulates a binary integer programming problem that incorporates real-world scenario constraints and proves its NP-hardness. To solve this problem, HetSpeed presents an effective submodular-based greedy algorithm with a tight approximation ratio of (1 − 1/e). We evaluate HetSpeed with real-world job traces, and HetSpeed is able to minimize the cluster’s total cross-rack traffic, decreasing the average JCT by 27.8% compared to the state-of-the-art solution.

Jiacheng Zhu, Chu Xu, Gongming Zhao, Hongli Xu, Gangyi Luo, Hao Zheng
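The (1 − 1/e) guarantee cited in the abstract comes from greedy maximization of a monotone submodular function under a cardinality constraint. A standard max-coverage sketch (not HetSpeed's actual objective) shows the pattern: repeatedly take the element with the largest marginal gain.

```python
def greedy_max_coverage(sets, k):
    """Greedy maximization of a monotone submodular coverage function under
    a cardinality constraint; the classic (1 - 1/e) approximation applies."""
    chosen, covered = [], set()
    for _ in range(k):
        # pick the set with the largest marginal gain over what is covered
        best = max(sets, key=lambda s: len(sets[s] - covered))
        if not (sets[best] - covered):
            break  # no remaining marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

# toy universe: each named set could stand for one candidate allocation
sets = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
chosen, covered = greedy_max_coverage(sets, 2)
```

Greedy first takes "a" (gain 4), then "c" (gain 2, beating "b" whose marginal gain has shrunk to 1), covering the whole universe with two picks.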
upTSA: A DIMM-Based Near Data Processing Accelerator for Time Series Analysis

Time series analysis (TSA) is an important technique for extracting information from data in various domains. TSA is memory-bound on conventional platforms due to excessive off-chip data movement between processing units and the main memory of the system. Near data processing (NDP) has emerged as a promising solution to alleviate the bottleneck of memory access for data-intensive applications by enabling processing to be performed near the data within memory. In this work, we propose and implement upTSA, a parallel near data processing accelerator on real-world commercial DRAM Dual-Inline Memory Module (DIMM) NDP hardware. Our solution offers high bandwidth with large memory capacity and multi-level parallelism to accelerate TSA. We begin with a detailed characterization of TSA on conventional CPUs. We then design and implement a multi-level parallel accelerator on the real-world DIMM-based NDP hardware UPMEM. We further explore hardware enhancements to improve the computational capability of current DIMM NDP hardware. Experimental results show that upTSA improves performance by 2.1× compared to the server-class CPU baseline. The enhanced DIMM NDP architecture achieves up to 3.3× speedup on TSA compared with current commercial NDP hardware.

Shunchen Shi, Fan Yang, Qijia Yang, Xiaohui Peng, Xueqi Li, Ninghui Sun
HSC: Scalable Task Scheduling in Large-Scale Edge Environments

Cloud-edge collaboration scheduling has emerged as a solution to alleviate cloud workloads by pushing tasks to edge nodes. However, the existing deep-learning-based scheduling methods are challenging to scale to large-scale environments due to high memory overhead during model training. We propose a hierarchical cloud-edge collaborative task scheduling framework, HSC, to address the scalability issue. HSC organizes cloud servers and edge nodes into areas and employs a two-level hybrid scheduling workflow. The inter-area scheduling uses lightweight load-balanced rules to distribute tasks efficiently, while the intra-area scheduling applies a deep-learning-based model for high-quality decisions. HSC breaks down large-scale scheduling problems into smaller area-level scheduling problems, significantly reducing the amount of training memory required. Experiments demonstrate that HSC reduces task response time by 2.8% to 3.9% and service-level agreement violation ratios by 11.5% to 20.8% compared to the cutting-edge deep-learning-based scheduling model GOSH. HSC is capable of supporting large-scale environments with 2000 cloud and edge nodes, which is 10 times larger than GOSH under the same memory constraint.

Wei Wang, Zhaokang Wang, Yanchao Zhao
A Semi-Decoupled VLM Planner with a Memory Mechanism for Autonomous Driving

In recent years, Vision-Language Models (VLMs) have demonstrated remarkable capabilities in scene understanding, common sense reasoning, and decision-making in the field of autonomous driving, bringing new opportunities for the development of autonomous driving technology. However, existing methods typically integrate VLMs directly into the decision loop, resulting in high inference latency that severely affects the system’s real-time response and driving safety. To address this issue, we propose a semi-decoupled system architecture that decouples VLMs from the real-time control loop, allowing them to function as asynchronous mid-term planners. Furthermore, to ensure efficient and stable collaboration between the high-latency asynchronous VLM decisions and the high-frequency underlying vehicle control, we design a three-layer planning system consisting of macroscopic global routes, VLM mid-term target points, and microscopic PID controllers. Additionally, we introduce a memory module to enhance decision robustness in rare scenarios. Experimental results in the CARLA simulator demonstrate that the system reduces control-layer latency by at least 25% (to 42 ms from the baseline of 56 ms), while maintaining high decision accuracy and cross-scenario generalization ability across various driving tasks.

Yibo Wang, Liang Zhao, Ammar Hawbani, Saeed Hamood Alsamhi, Zhi Liu, Qiang He
An FPGA-Based Distributed Shared Memory Architecture Supporting CXL 2.0+ Specification

Big data and AI applications pose significant challenges to traditional distributed shared memory architectures, where network bandwidth and latency constraints have become critical bottlenecks. Although the Compute Express Link (CXL) protocol promises low-latency, high-bandwidth interconnects for memory expansion, existing CXL 1.1 devices still cannot support fine-grained memory sharing across multiple nodes. This paper proposes an FPGA-based distributed shared memory architecture supporting the CXL 2.0+ specification. It features three key innovations for transparent cross-node memory accesses: 1) replacing conventional network stacks with CXL physical links to mitigate the performance overhead of frequent data copying; 2) a hardware-managed memory controller with interleaved access mechanisms to optimize the bandwidth utilization of the CXL-DDR channel; 3) hierarchical queues to ensure memory access orders under high concurrency. This fine-grained memory sharing architecture supports zero-copy data swapping across multiple servers via a pass-by-reference manner. Experimental results show that the end-to-end access latency of our CXL-based shared memory architecture is as low as 1.25 μs, 5× lower than that of one-sided Remote Direct Memory Access (RDMA).

Xiuhao Huang, Jinge Ding, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin
DPS: A Congestion-Aware Allreduce Job Placement for In-Network Aggregation

As one of the most fundamental communication paradigms in distributed systems, Allreduce is critical for efficient data synchronization in large-scale machine learning and high-performance computing. While traditional software-based Allreduce implementations exhibit inherent limitations such as high latency and low throughput, in-network aggregation technology mitigates these issues by offloading computation to programmable network devices. However, existing deployment approaches overly focus on host-side resource allocation while neglecting network link load balancing, ultimately restricting system throughput. To address this challenge, we propose the Dynamic job Placement Strategy (DPS), which leverages a tree-topology network with in-network aggregation support to jointly optimize computational resource allocation and network load balancing, thereby achieving high-throughput, low-latency distributed job placement. Extensive experimental results demonstrate DPS’s superior performance, showing a 3× throughput enhancement for individual Allreduce operations and a 19.3% reduction in average completion time for concurrent Allreduce workloads compared to state-of-the-art alternatives.

Yanrong Hu, Guannan Zhang, Dezun Dong, Zihao Wei, Zhen Ruan
Carbon-Aware Task Scheduling in Distributed Computing Continuum: A Lyapunov-Guided Reinforcement Learning Approach

The rapid expansion of cloud, fog, and edge services is pushing the Distributed Computing Continuum (DCC) toward unsustainable carbon footprints, while existing schedulers optimize latency or energy in isolation and therefore struggle to respect strict carbon budgets. We propose a carbon-aware task-scheduling framework that couples short-term performance with long-term sustainability. A Lyapunov-guided virtual queue converts horizon-wide emission constraints into per-slot stability targets, and a dynamically weighted Proximal Policy Optimization (PPO) agent allocates tasks in real time. After placement, a closed-form convex solver slices heterogeneous node resources, completing a dual-layer loop that blends theoretical guarantees with Deep Reinforcement Learning (DRL). Experiments on a 100-node DCC driven by carbon-intensity traces show that, under a tight 0.15 g CO₂eq budget, the proposed method attains 3.02 s average latency and 0.14 g emissions, achieving 11% faster service and 7% lower emissions than the best-performing reinforcement learning baseline while reducing latency by 58% compared with uniform slicing. Dynamic weight adaptation stabilizes the virtual queue, whereas static-weight and no-convex variants either violate carbon limits or incur much higher delays. Consistent advantages across all tested budgets from 0.125 to 0.20 g CO₂eq confirm the practicality of learning-driven, Lyapunov-constrained orchestration and open a path toward carbon-aware management of next-generation distributed infrastructures.

Shujia Niu, Zhenli He, Yuanfei Xiao, Yuxaun Nie, Yao Chen, Bingning Liu
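The Lyapunov virtual-queue idea in the abstract reduces a long-horizon budget to a simple per-slot backlog update: the queue grows when a slot exceeds the budget and drains otherwise, and keeping it stable enforces the long-run constraint. The sketch below reuses the abstract's 0.15 g CO₂eq figure as the per-slot budget; the emissions trace is illustrative.

```python
def virtual_queue_step(q, emitted, budget):
    """One Lyapunov virtual-queue update: backlog accumulates any
    per-slot budget overshoot and can never go negative."""
    return max(q + emitted - budget, 0.0)

budget = 0.15  # per-slot carbon budget (g CO2eq), figure from the abstract
emissions = [0.20, 0.10, 0.18, 0.12, 0.14]  # illustrative per-slot emissions
q = 0.0
trace = []
for e in emissions:
    q = virtual_queue_step(q, e, budget)
    trace.append(round(q, 4))
```

A scheduler that weights the queue length into its per-slot objective is then pushed to compensate for over-budget slots with under-budget ones, which is the stability target the PPO agent in the abstract is trained against.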
DCTS-RDMA: Adaptive FEC via Dynamic Coding for Efficient RDMA over Lossy Networks

Remote Direct Memory Access (RDMA) is widely deployed in Data Center Networks (DCNs), but it is sensitive to packet loss due to the Go-Back-N retransmission. Forward Error Correction (FEC) can mitigate this issue by introducing redundancy to recover lost packets. However, the fixed coding parameters of FEC hinder its adaptability to dynamic network conditions, leading to inefficiencies and unreliability. To address this issue, we propose DCTS-RDMA, a dynamic coding transmission system for RDMA. DCTS-RDMA incorporates a monitor to detect packet loss in real time and a Dynamic Coding Packet (DCP) algorithm to adaptively adjust the coding group size and redundancy based on the detected loss, thereby overcoming the inefficiency of fixed coding parameters. To support varying coding decisions, the selected parameters are transmitted to an encoder at the sender to generate redundant packets, while a decoder at the receiver reconstructs lost data using both the original and redundant packets. We implement DCTS-RDMA using C++/HLS on an FPGA-based simulation platform and conduct extensive experiments. Under low packet loss conditions (less than 1%), our DCTS-RDMA reduces redundant overhead by approximately 85% and decreases total execution time by 7.5% compared to fixed coding strategies.

Zhiyi Yang, Zhexiong Li, Deze Zeng, Lin Gu
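To make the FEC mechanics concrete, the sketch below pairs a toy redundancy controller (the 2x margin is illustrative, not DCTS-RDMA's DCP algorithm) with single-packet XOR parity, the simplest erasure code: one repair packet recovers any one lost packet in its group.

```python
import math

def choose_fec_params(loss_rate, group_size=32):
    """Toy redundancy controller: size the number of repair packets per
    coding group from the observed loss rate (values are illustrative)."""
    expected_losses = loss_rate * group_size
    redundancy = max(1, math.ceil(expected_losses * 2))  # 2x safety margin
    return group_size, min(redundancy, group_size)

def xor_parity(packets):
    """Single XOR repair packet over equal-length packets."""
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return parity

packets = [b"\x01\x02", b"\x0f\x00", b"\xaa\x55"]
parity = xor_parity(packets)
# recover packets[1] by XORing the parity with the surviving packets
recovered = xor_parity([packets[0], packets[2], parity])
```

Fixed parameters waste the repair packets when the link is clean and fall short under bursts; adapting the group size and redundancy to the measured loss rate, as DCTS-RDMA does in hardware, avoids both failure modes.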
Adap DP-FR: Adaptive Differential Privacy for Federated Recommendation

Distributed systems are fundamental to large-scale personalized recommendation services, enabling collaborative learning across decentralized data sources. Federated recommendation, as a distributed paradigm, leverages Graph Neural Networks (GNNs) to capture high-order interactions while preserving user data locality. However, ensuring user privacy in such distributed environments remains a significant challenge. Existing solutions typically employ differential privacy by injecting fixed noise into client gradients, but this approach often struggles to balance privacy protection and model utility—leading to either degraded recommendation accuracy or insufficient privacy guarantees. To address these limitations, we propose Adap DP-FR, a distributed federated GNN-based recommendation framework that jointly optimizes model performance and privacy. Central to our framework is an Adaptive Sensitivity-based Differential Privacy (ASDP) mechanism, which dynamically assesses the sensitivity of client gradients, allocates privacy budgets accordingly, and injects noise adaptively. This design effectively minimizes utility loss while providing strong privacy guarantees. Extensive experiments on two real-world recommendation datasets demonstrate that, under the same privacy budget, Adap DP-FR reduces RMSE by 2% and lowers the membership inference attack success rate by approximately 67%, significantly outperforming existing dynamic budget allocation baselines.

Guiquan Zheng, Meiju Yu, Dan Qin, Pantong Wang, Xiliang Pang, Bo Wu
LingXi: An Architecture for COM/MON-Based High-Integrity TSN/TTE Switch

COM/MON-based high-integrity switches are essential for fault-tolerant time synchronization in TTE and TSN. Although the principles of COM/MON are simple, the implementation details and challenges of realizing high-integrity switches have rarely been thoroughly discussed. Our FPGA-based experiments identify false alarms caused by inconsistent frame-output sequences from COM and MON as a key issue. Eliminating such false alarms requires strict consistency in the frame-input sequences of COM and MON, which is difficult to enforce due to the ultra-low-latency and low-intrusiveness requirements in time-sensitive switching contexts. We propose LingXi, a COM/MON-based high-integrity switch architecture. Its integrated consensus module enforces strict input-sequence synchronization between COM and MON without intrusive modifications to the switching pipeline. We theoretically prove that LingXi introduces only 1–2 additional clock cycles of delay. Experimental results show that LingXi completely eliminates false alarms while consuming only 2.35% additional logic resources.

Pengye Xia, Zhigang Sun, Weiliang Li, Yiqin Dai, Jiabo Zhang, Xuyan Jiang
TAIR: Achieving Tenant Anomaly Isolation with Request Scheduling in Serverless Computing

In the past decade, serverless computing has emerged as a widely used paradigm due to its simplicity and cost-effectiveness. Under this architecture, cloud service providers receive requests from tenants and schedule them to the clusters. Existing scheduling solutions, driven by economic considerations, often aim for load balancing or maximizing resource utilization. However, resource sharing among tenants in serverless computing may lead to anomalies of one tenant affecting others, thereby reducing quality of service for tenants. To bridge this gap, we explore scheduling strategies that can achieve anomaly isolation between all tenants and formalize it as a nonlinear mixed-integer programming problem. We introduce an approximation algorithm based on submodular function theory, capable of running in polynomial time with a guaranteed approximation ratio of 1 − 1/e. Simulations using real-world production data demonstrate the effectiveness of our approach. The method successfully prevents risks caused by thousands of function isolation conflicts, while maintaining load balancing factor degradation within 5% compared to non-isolation-enforced schemes.

Junhong Lu, Chu Xu, Gongming Zhao, Hongli Xu, Gangyi Luo, Hao Zheng
GICNet: Goal Interaction Conditioned Network for Human Trajectory Forecasting

Pedestrian motion prediction is of critical significance for intelligent and safe autonomous driving systems design. Human movement is by nature highly non-deterministic and multi-modal. In particular, humans’ travel goals and their behavioral decisions interact with each other. In this work, we present the Goal Interaction Conditioned Network (GICNet) for flexible and accurate human trajectory forecasting. Social influence, multi-modality, and goal constraints are incorporated into GICNet to infer socially compliant human trajectories. The approach operates in three key stages: learning a probability distribution of motion intentions from historical data, grouping pedestrians and modeling goal-goal interactions using a masked Graph Attention Network (GAT), and integrating intention with motion history for prediction. Additionally, we present a novel iterative pooling method to adaptively fuse the impact of different pedestrian attributes during trajectory generation, enhancing robustness to neighbor misidentification. We demonstrate that GICNet generates realistic multi-modal trajectories and improves state-of-the-art performance on the Stanford Drone trajectory prediction benchmark by ~6.7% and on the ETH-UCY benchmark by ~23.0% under the Best-of-20 evaluation protocol, significantly outperforming existing probabilistic models in dense interaction scenarios.

Jie Tang, Ken Chen, Feihe Guo
FPGA-Accelerated CNN-Transformer Hybrid Model for Real-Time Semantic Segmentation in Autonomous Driving

Semantic segmentation is a crucial task for autonomous driving and intelligent transportation systems, requiring high accuracy and low latency. Traditional architectures based solely on convolutional neural networks (CNNs) or Transformers have notable limitations: CNNs lack global perception capabilities, while Transformers incur high computational overhead and are challenging to deploy on edge devices. To address these issues, this paper proposes a lightweight CNN-Transformer hybrid model and designs a customized FPGA accelerator to achieve real-time semantic segmentation. The model uses a deformable attention module to capture global context, combines depthwise separable convolution to improve the efficiency of local feature extraction, and introduces a skip connection enhancement module to restore edge details, balancing segmentation accuracy against computational overhead. To adapt to resource-constrained platforms, the accelerator optimizes parallel computing, memory access, and data flow scheduling through 8-bit quantization, operator decomposition, and a 3D systolic array design. Experimental results show that the system achieves 72.8% mIoU on the Cityscapes dataset and an inference speed of 67 FPS, outperforming mainstream CPU/GPU platforms in energy efficiency and latency. This study provides an efficient and feasible solution for deploying complex neural networks on edge devices, with broad application potential.
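As background for the efficiency claim, a quick parameter count shows why depthwise separable convolutions are cheaper than standard ones (generic arithmetic, not the paper's actual layer sizes):

```python
def conv2d_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weight count of a depthwise (k x k per channel) + pointwise (1x1) pair."""
    return k * k * c_in + c_in * c_out

# 3x3 conv, 128 -> 128 channels:
standard = conv2d_params(3, 128, 128)               # 147456 weights
separable = depthwise_separable_params(3, 128, 128)  # 17536 weights, ~8.4x fewer
```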

Ao Zhang, Yongjiang Xue, Fei Qiao, Qingzeng Song
ASR of CoMP-UAV Cellular Networks with Specific Eavesdropper

Unmanned aerial vehicles (UAVs) operating as aerial base stations in open airspace face inherent security vulnerabilities during data transmission, particularly from eavesdroppers. To address this, this paper proposes a coordinated multi-point (CoMP) UAV-assisted network architecture to enhance downlink transmission security. The network space is tessellated into equal-sized hexagonal cells, where UAVs within each cell jointly serve ground users. Using stochastic geometry, we develop an analytical framework to characterize downlink secrecy rate performance. Key metrics, including coverage probability, transmission rate, and average secrecy rate (ASR), are derived in closed form. Our analysis quantifies the impact of critical system parameters on security performance. Simulations validate the analytical model, demonstrating close alignment between numerical and theoretical results. Notably, the proposed CoMP scheme achieves a 41.2% higher ASR compared to a non-cooperative benchmark, significantly improving communication security.

Yan Li, Caoshuai Zhu, Renqi Zhu, Lailong Luo
FedLay: An Energy-Efficient Hierarchical Federated Learning Framework for Heterogeneous Edge Devices

Federated learning has emerged as a key privacy-preserving paradigm for training models on decentralized data. However, the inherent heterogeneity of edge devices in terms of computational power and energy resources poses a significant challenge, leading to prohibitive energy consumption and reduced training efficiency. Existing studies are predicated on an idealized assumption that all client devices are homogeneous, ignoring the profound impact of device heterogeneity on energy consumption. To address this issue, we propose FedLay, a novel hierarchical federated training framework. FedLay introduces a client-leveling mechanism that organizes devices into a three-tier structure, mitigating the high energy costs associated with data transmission. At its core, an adaptive strategy dynamically manages local training effort, ensuring that even devices with limited power can contribute effectively without compromising the overall learning objective. Extensive experiments conducted on real-world datasets demonstrate the efficiency of FedLay. Compared to baseline methods, FedLay reduces device transmission energy consumption by approximately 30.84% while maintaining competitive model performance.

Zhuopu Zhang, Renqi Zhu, Zongyang Yuan, Jiaqi Li, Lailong Luo, Deke Guo
Bridge the Gap Between QoS and QoE in Mobile Short Video Service: A CDN Perspective

Emerging mobile short video services pose different yet stringent performance requirements compared to traditional long video services. Content providers (CPs) aspire to better user-perceived Quality of Experience (QoE) at the application layer, which is imperceptible to the Content Delivery Network (CDN), since the CDN monitors only Quality of Service (QoS) at the transport layer. This mismatch leads to a complex and diverse mapping correlation between the two metrics. In this paper, we illustrate the QoS-QoE mapping correlation in mobile short video services. Although data-driven QoE prediction models can achieve the desired accuracy, complex scenario features prove necessary, and the prediction model still lacks interpretability. Deeper quantitative analysis shows that the correlation becomes complex and diverse when resources are insufficient. Our clustering-based prediction framework successfully summarizes scenario features and predicts QoE from QoS metrics alone. Furthermore, we propose predictive QoE-based CDN scheduling. Experiments show that, compared to scheduling with QoS metrics, QoE-aware scheduling achieves an average QoE improvement of 9.9% under comparable QoS quality.

Chuanqing Lin, Yangguang Liang, Fuhua Zeng, Zhipeng Huang, Xiaodong Li, Jingyu Yang, Yu Tian, Gerui Lv, Qinghua Wu, Zhenyu Li, Gaogang Xie
RSCAC-NET: A Remote Sensing Image Change Description Network Based on Change-Aware and Multi-stage Global Fusion

Analyzing land cover changes using multi-temporal remote sensing images is of great significance for environmental protection and land planning. However, traditional remote sensing change detection methods cannot directly reveal high-level semantic information such as the attributes of objects within the change regions and the relationships between these objects. Therefore, this paper proposes a Change Description Network called RSCAC. The network is composed of a feature extractor (ResNet101), a Change-aware Attention Module (CAAM), a Multi-Stage Global Fusion Module (MSGFM), and a description decoder. The CAAM integrates a similarity module and a cross-attention mechanism. The MSGFM utilizes an innovative global fusion mechanism to effectively merge and extract global visual feature representations, enabling a more comprehensive description of the entire change scene. Comparative experiments on the LEVIR-CC dataset demonstrate that RSCAC generates more coherent, accurate, and comprehensive change descriptions. Compared to the recently well-performing Chg2Cap, RSCAC achieves improvements of 2.05%, 0.17%, and 1.92% in BLEU-4, METEOR, and CIDEr-D, respectively. The visualization results also show that our model can focus on the changes of interest while ignoring irrelevant changes.

Hongyi Dong, Xiuzhen He, Yan Wang, Jing Liu, Feilong Bao, Bing Jia
Cultivator: Multi-granularity Tree Construction in Heterogeneous Edge-Cloud Training

The rapid advancement of large AI models such as GPT-4 has transformed numerous domains. However, their deployment on edge devices remains limited by resource constraints, prompting the need for collaborative training with lightweight models. Existing methods often rely on fixed data granularities, which hinder adaptation to models with diverse capacities. Meanwhile, data fragmentation and privacy concerns in edge–cloud systems hinder the centralized construction of multi-granularity representations. Unsupervised methods further struggle to cope with high-dimensional complexity and diverse Quality of Service (QoS) demands. In this paper, we propose Cultivator, a QoS-aware framework for the distributed construction of multi-granularity trees, designed for heterogeneous model scales. Cultivator integrates multi-modal fusion with Federated Learning (FL) to address fragmented data and limited semantics. A dynamic QoS-aware mechanism further guides strategy selection between balanced regularized consistency for low-latency needs and optimal-transport-based multi-modal consistency for high-performance scenarios. Evaluations on various datasets demonstrate that Cultivator improves overall accuracy by up to 11.91% and increases accuracy density by an average of 13.64%, highlighting its efficiency in enhancing distributed training under complex edge–cloud environments.

Meilin Ding, Yunfeng Zhao, Chao Qiu, Fei Gao, Xiaofei Wang, Dajun Zhang
Adaptive Anomaly Detection for IoT Networks: Improved Feature Engineering and Classification

Intrusion Detection Systems (IDS) are vital for securing Internet-of-Things (IoT) networks, yet rule-based and feature-based schemes fall short against fast-evolving attacks in resource-constrained settings. While deep learning has raised detection accuracy, many models still struggle with temporal dependencies, class imbalance, and interpretability. This study introduces a unified, lightweight, and explainable IDS framework evaluated on an IoT benchmark that captures diverse attacks and normal traffic across real IoT protocols. The detection pipeline integrates three components: (i) FW-SMOTE to balance minority classes, (ii) mutual-information SelectKBest to prune redundant features, and (iii) a Bi-CNN-GRU with attention to learn fine-grained spatio-temporal patterns without heavy computation. Experiments on binary, attack-type, and subtype classification tasks show consistent gains over a CNN-GRU baseline: 99.95% accuracy for binary detection, 97.30% macro-F1 for attack-type, and 71.00% macro-F1 for subtype classification even under severe class imbalance. These results demonstrate that coupling attention and bidirectional recurrence with light preprocessing yields a scalable, interpretable IDS suitable for real-world IoT deployments, advancing both methodology and practical security.
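The exact pipeline is not given in the abstract; the idea behind mutual-information SelectKBest (score each feature by its mutual information with the label, keep the top k) can be sketched on toy discrete data with only the standard library:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_k_best(feature_columns, labels, k):
    """Rank discrete feature columns by MI with the label, keep top-k indices."""
    scores = [(mutual_information(col, labels), i)
              for i, col in enumerate(feature_columns)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# toy traffic: feature 0 predicts the label perfectly, feature 1 is mostly noise
labels = [0, 0, 1, 1, 0, 1]
features = [[0, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]]
kept = select_k_best(features, labels, 1)  # keeps feature 0
```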

Ayesha Sabir, Songjie Wei, Muhammad Usman Sabir, Abida Naz
BlockSDN-VC: A SDN-Based Virtual Coordinate-Enhanced Transaction Broadcast Framework for High-Performance Blockchains

Modern blockchains need fast, reliable propagation to balance security and throughput. Virtual-coordinate methods speed dissemination but rely on slow iterative updates, leaving nodes out of sync. We present BlockSDN-VC, a transaction-broadcast protocol that centralises coordinate computation and forwarding control in an SDN controller, delivering global consistency, minimal path stretch and rapid response to churn or congestion. In geo-distributed simulations, BlockSDN-VC cuts median latency by up to 62% and accelerates convergence fourfold over state-of-the-art schemes with under 3% control-plane overhead. In a real blockchain environment, BlockSDN-VC boosts confirmed-transaction throughput by 17% under adversarial workloads, requiring no modifications to existing clients.

Wenyang Jia, Jingjing Wang, Ziwei Yan, Guohui Yuan, Tanren Liu, Yakun Ren, Kai Lei
LLM-Enhanced Heterogeneous Graph Embedding Model for Multi-Task DNS Security

Current DNS security analysis methods primarily rely on traditional feature engineering, often neglecting the intrinsic relationships among heterogeneous DNS elements and deep semantic nuances essential for identifying advanced threats. These methods struggle to capture global higher-order relationships and typically lack targeted, interpretable explanations for complex threat patterns. To address these limitations, we propose an advanced framework integrating a Joint DNS Embedding (JDE) model with a specialized Large Language Model (LLM). The JDE model utilizes similarity-enhanced heterogeneous graph embeddings and hypergraph structures, effectively representing complex domain-IP associations to support broad Malicious Domain Detection (MDD). Complementarily, the LLM is fine-tuned specifically to analyze domain string characteristics, precisely identifying DGAs and DNS exfiltration patterns. The JDE model synthesizes spatial domain statistics with graph-derived embeddings, incorporating outputs from the fine-tuned LLM to enhance detection performance. Additionally, the specialized LLM contributes targeted explanations, significantly improving transparency and actionable insights for IP reputation evaluation (IRE). Experiments on a three-month real-world DNS traffic dataset demonstrate that our combined system achieves state-of-the-art results.

Wenyang Jia, Jingjing Wang, Ziwei Yan, Tanren Liu, Kai Lei
Learnable Cloud-Guided LLM Quantization for Resource-Constrained Edge Devices

Model quantization is crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, in cloud-edge collaboration, edge devices (EDs) often lack the resources for on-device quantization. Moreover, existing Post-Training Quantization (PTQ) methods employ a static parameter approach, which fails to adapt to diverse local data distributions. To address these challenges, we propose a novel method, Learnable Quantization Guided by Distribution Correction (LQGDC), for generating an optimal, lightweight model that can be delivered to the ED for local inference within a cloud-edge collaborative framework. In this framework, edge devices upload a small amount of local data to the cloud as a calibration set. The cloud server then selects a suitable pre-trained model from a Model Pool and applies LQGDC to quantize the model. LQGDC introduces learnable parameters for weights, activations, and key-value (KV) cache. LQGDC employs a composite loss function that combines Mean Squared Error (MSE), cosine similarity, and Kullback-Leibler (KL) divergence to fine-tune parameters, thereby matching each device’s unique data distribution. Experiments on seven datasets demonstrate that LQGDC outperforms all three current baselines in both language generation and zero-shot tasks. Specifically, when quantizing LLaMA-13B to W4A4KV4, LQGDC reduces average perplexity (PPL) by 1.92 and improves zero-shot task accuracy by 2.14% compared to the best baseline. This approach shows promise for single AI task implementation on resource-constrained EDs (e.g., complex voice command processing on smartphones, real-time visual defect detection on industrial drones, and document analysis with long-term context).
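As a hedged sketch only: a composite calibration loss combining MSE, cosine distance, and KL divergence over full-precision and quantized outputs might look as follows. The weights and the softmax normalization for the KL term are illustrative assumptions, not LQGDC's actual formulation:

```python
import math

def mse(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

def kl_divergence(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def composite_loss(fp32_out, quant_out, w=(1.0, 0.5, 0.5)):
    """Weighted MSE + cosine distance + KL between full-precision and
    quantized outputs; outputs are softmaxed for the KL term."""
    def softmax(v):
        m = max(v)
        e = [math.exp(x - m) for x in v]
        return [x / sum(e) for x in e]
    return (w[0] * mse(fp32_out, quant_out)
            + w[1] * cosine_distance(fp32_out, quant_out)
            + w[2] * kl_divergence(softmax(fp32_out), softmax(quant_out)))
```

Identical outputs yield zero loss, and each term penalizes a different kind of drift: magnitude (MSE), direction (cosine), and output distribution (KL).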

Qinxiao Deng, Tianfu Pang, Benteng Zhang, Bingbing Nie, Xiaoming He, Yingchi Mao, Jie Wu
Fluid-DataTable: Elastic and Efficient Caching for Cloud Native Big Data Query System

Nowadays, big data query systems are often deployed on cloud native platforms for advantages like automated deployment and elastic scalability. Nonetheless, traditional optimization approaches often fail to fully exploit the potential of cloud native platforms and cache acceleration. This results in cache-related QoS drawbacks when running queries in cloud native environments, including inadequate space for caching, static cache configurations unsuitable for dynamic workloads, and query task scheduling that disregards cache reuse. To address these issues, we propose Fluid-DataTable, an elastic and efficient table cache management and query task scheduling service designed specifically for query systems on cloud native platforms. It prioritizes loading tables with high acceleration gain into the cache cluster, dynamically adjusts the number of cache replicas and cache nodes based on data access frequency, and plans the execution order of queries with consideration of the cache state. Experimental results show that Fluid-DataTable achieves significant improvement in cache efficiency and query performance due to the proposed cloud native cache management and adaptation mechanisms. Moreover, the cache-aware scheduling strategy reduces cache replacement frequency, yielding around 30% performance improvement compared to cutting-edge solutions.
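The paper's replica-adjustment rule is not specified in the abstract; a toy, assumed policy that scales a table's cache replica count with its recent access frequency could look like this (all thresholds are hypothetical):

```python
def plan_replicas(access_counts, accesses_per_replica=100, max_replicas=4):
    """Assumed toy policy: grant one cache replica per `accesses_per_replica`
    accesses in the last window, capped to bound total cache usage.
    Every table keeps at least one replica so cold data stays servable."""
    return {table: min(max_replicas, max(1, count // accesses_per_replica))
            for table, count in access_counts.items()}

plan = plan_replicas({"orders": 520, "users": 90, "logs": 5})
# hot "orders" table is capped at 4 replicas; cold tables keep a single copy
```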

Wenxiao Wang, Guowang Chen, Rong Gu, Guoding Ji, Chaozhong Yan, Yan Ding
An Effective Defense Scheme Based on SDN for DDoS Attacks System

To counter DDoS attacks, this paper proposes an effective SDN-based defense scheme. The proposed scheme includes an attack module, a monitoring module, and a defense module. In the attack module, DDoS attack scenarios are simulated in Mininet by generating attack traffic. In the monitoring module, sFlow technology is used to monitor and extract common attack traffic characteristics. In the defense module, DDoS attack defense is implemented via the API of the ONOS controller. When traffic exceeds the threshold, its rate is limited to prevent the network from being paralyzed, and the traffic is categorized as attack, ordinary, or business traffic. After DDoS defense is performed, the rate limit is lifted, and QoS rules are used to ensure the forwarding of business traffic. The simulation results demonstrate the excellent performance of the proposed SDN-based solution for defending against DDoS attacks.

Qinghui Chen, Xiaojian Song, Hong Wen, Yazhi Shi, Zhenheng Chen
EdgeInferFlow: A Distributed Inference Acceleration Method for Deep Learning Chained Structure Models for Edge Devices

With the increasing deployment of complex deep learning models on edge devices, addressing the high inference latency caused by their inherent computational bottlenecks is paramount for enabling real-time intelligent applications. To mitigate this latency, we propose EdgeInferFlow, a novel framework designed to accelerate the distributed inference of chain-structured models across edge-device clusters. The framework is predicated on a two-level optimization methodology: model partitioning and distributed inference orchestration. At the partitioning level, EdgeInferFlow employs a dynamic load-balancing algorithm to mitigate computational load imbalance among partitioned sub-models. This algorithm is integrated with a distributed computation scheme for linear layers and a multi-stage strategy to minimize inter-node data dependencies and communication overhead. At the inference level, EdgeInferFlow constructs a distributed computational graph to orchestrate the data flow and parallel execution of tensors throughout the cluster. Experimental results demonstrate that our proposed method reduces end-to-end inference latency by up to 70%.
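Balanced partitioning of a chained model can be illustrated with the classic linear-partition dynamic programme, which splits a sequence of per-layer costs into contiguous groups minimizing the heaviest group (a generic sketch, not EdgeInferFlow's actual algorithm):

```python
def partition_chain(costs, k):
    """Split a chain of per-layer costs into k contiguous groups,
    minimizing the maximum group load (linear partition DP)."""
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    INF = float("inf")
    # dp[j][i]: best achievable max-load using j groups for the first i layers
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for m in range(j - 1, i):
                cand = max(dp[j - 1][m], prefix[i] - prefix[m])
                if cand < dp[j][i]:
                    dp[j][i], cut[j][i] = cand, m
    # walk the cut table backwards to recover the group boundaries
    bounds, i = [], n
    for j in range(k, 0, -1):
        bounds.append((cut[j][i], i))
        i = cut[j][i]
    return dp[k][n], bounds[::-1]

# six layers onto three devices: an optimal split is [4,1] [1,4] [2,3], load 5
load, groups = partition_chain([4, 1, 1, 4, 2, 3], 3)
```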

Hanfeng Zhai, Yifan Wang, Xiaohui Peng, Lei Li, Xueqi Li
Performance Analysis of Multipath QUIC Schedulers for Video Streaming over Hybrid 5G-Satcom Networks

Ensuring reliable real-time video streaming over heterogeneous networks, particularly those combining 5G and satellite (Satcom) links, poses significant challenges due to dynamic bandwidth, latency, and loss characteristics. This paper conducts an in-depth performance analysis of existing Multipath QUIC (MP-QUIC) schedulers in a hybrid 5G-Satcom environment, using a controlled emulation setup based on real network measurements from two operator datasets. We evaluate scheduler behavior across key Quality of Experience (QoE) metrics, including Peak Signal-to-Noise Ratio (PSNR), frame rate (FPS), and bitrate stability. Performance variability is captured using cumulative distribution functions (CDFs) across 50 experimental runs to ensure reproducibility. Our results reveal that while static rule-based schedulers (e.g., Round Robin, MinRTT) offer stable but conservative performance, they underutilize high-throughput paths in dynamic environments. Conversely, adaptive schedulers (e.g., Peekaboo, DEAR) exploit link diversity more aggressively, improving throughput under dynamic conditions. These observations illustrate critical trade-offs between path utilization and QoE stability. The study provides actionable insights for future multipath scheduling frameworks and underscores the potential of incorporating context-aware or learning-based decision mechanisms to improve video delivery resilience in complex, hybrid network scenarios.
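As a minimal sketch of the MinRTT rule named above (the per-path state and field names here are assumptions, not any MP-QUIC stack's API):

```python
def min_rtt_schedule(paths):
    """Return the name of the usable path with the lowest smoothed RTT.

    `paths` maps path name -> (srtt_ms, cwnd_free_bytes); a path whose
    congestion window is exhausted is skipped, which is why MinRTT can
    still spill traffic onto a slow satellite link under load.
    """
    usable = {name: srtt for name, (srtt, free) in paths.items() if free > 0}
    return min(usable, key=usable.get) if usable else None

# 5G has the lower RTT but no window left, so the next packet goes to satcom
chosen = min_rtt_schedule({"5g": (28.0, 0), "satcom": (550.0, 65536)})
```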

Shravan Kumar Pattiwar, Paresh Saxena, Ozgu Alay
A Privacy-Preserving Edge Inference Framework for Low-Altitude UAV Swarm Intelligence

The integration of Artificial Intelligence (AI) and low-altitude Unmanned Aerial Vehicle (UAV) swarms enables powerful capabilities in real-time surveillance, environmental sensing, and collaborative decision-making. In this paper, we propose a Privacy-preserving Edge Inference (PPEI) framework tailored for UAV swarms, leveraging edge computing and Fully Homomorphic Encryption (FHE) to ensure end-to-end data confidentiality. A hierarchical secure data transmission architecture is designed to support encrypted information exchange among UAVs, edge nodes, and cloud servers. UAV sensory data are encrypted using a hybrid scheme that combines symmetric encryption for efficiency and asymmetric encryption for secure key distribution. All learning and inference tasks are conducted directly on FHE-encrypted data, producing ciphertext predictions that protect both mission-critical inputs and outputs from leakage. Furthermore, we integrate FHE-based model component encryption and differential privacy mechanisms to hide the model structure and weight parameters from untrusted inference environments and potential adversaries. Experimental results demonstrate that our FHE-based inference framework provides strong privacy guarantees with acceptable overhead.

Jianguo Chen, Guoqing Xiao, Longxin Zhang, Guocheng Liao, Bodong Wang, Weijian You
Backmatter
Title
Network and Parallel Computing
Editors
Xiaoliang Wang
Xiaohong Jiang
Noel Crespi
Baoliu Ye
Copyright Year
2026
Electronic ISBN
978-3-032-10459-5
Print ISBN
978-3-032-10458-8
DOI
https://doi.org/10.1007/978-3-032-10459-5

PDF files of this book have been created in accordance with the PDF/UA-1 standard to enhance accessibility, including screen reader support, described non-text content (images, graphs), bookmarks for easy navigation, keyboard-friendly links and forms and searchable, selectable text. We recognize the importance of accessibility, and we welcome queries about accessibility for any of our products. If you have a question or an access need, please get in touch with us at accessibilitysupport@springernature.com.
