
Network and Parallel Computing

21st IFIP WG 10.3 International Conference, NPC 2025, Nha Trang, Vietnam, November 14–16, 2025, Proceedings, Part II

  • 2026
  • Book

About this book

This two-part volume, LNCS 16305 and 16306, constitutes the proceedings of the 21st IFIP International Conference on Network and Parallel Computing, NPC 2025, held in Nha Trang, Vietnam, during November 14–16, 2025.

The 76 full papers included in this volume were carefully reviewed and selected from 223 submissions. Topics of interest include, but are not limited to, parallel and distributed applications and algorithms, parallel and distributed architectures and systems, and parallel and distributed software environments and tools.

Table of Contents

Frontmatter
Dynamic Resource Allocation with Adaptive Mode Selection in D2D-V2X Networks

Cellular vehicle-to-everything (V2X) networks leveraging device-to-device (D2D) communications face critical interference and reliability bottlenecks in safety-critical scenarios. This work introduces a dynamic resource orchestration framework that jointly optimizes transmission mode selection, spectrum sharing, and power allocation. By decomposing the mixed-integer non-convex optimization problem through block coordinate descent, our approach iteratively solves coupled subproblems. Non-convex constraints are transformed via successive convex approximation with first-order Taylor expansions, enabling efficient solution convergence. The proposed scheme maximizes vehicle-to-infrastructure (V2I) sum-rate while rigorously guaranteeing ultra-reliable low-latency requirements for vehicle-to-vehicle (V2V) links through adaptive mode switching between dedicated and reused spectrum access. Simulations confirm significant performance gains over conventional methods across diverse urban scenarios.

Xiang Xiao, Peidong Zhu, Jia Song, Gang Su, Lu Feng, Peng Wu, Li Zhu
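The successive convex approximation step described in the abstract above can be illustrated on a toy one-dimensional problem. This sketch is not from the paper; the objective x⁴ − 2x² and its closed-form surrogate minimizer are illustrative assumptions:

```python
def sca_minimize(x0, iters=60):
    """Successive convex approximation (SCA) on f(x) = x**4 - 2*x**2.

    The concave term -2*x**2 is upper-bounded by its first-order Taylor
    expansion around the current iterate x_k, giving the convex surrogate
    g(x) = x**4 - 2*x_k**2 - 4*x_k*(x - x_k). Setting g'(x) = 0 yields
    the closed-form update x_{k+1} = x_k ** (1/3).
    """
    x = x0
    for _ in range(iters):
        x = x ** (1.0 / 3.0)  # minimizer of the convex surrogate
    return x

# Iterates converge to x = 1, a global minimizer of x**4 - 2*x**2.
x_star = sca_minimize(0.3)
```

Because the surrogate upper-bounds the objective and touches it at x_k, each iteration can only decrease f; this monotone-improvement argument is what underpins SCA-based resource allocation schemes in general.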
A Digital Twin-Assisted Multi-agent Task Offloading Method with Priority Scheduling in Vehicular Edge Networks

Vehicular Edge Computing (VEC) has emerged as a promising solution for offloading computation-intensive tasks from vehicles to nearby edge servers, enabling low-latency and energy-efficient services. However, real-world vehicular workloads are often heterogeneous, varying significantly in data size, deadline sensitivity, and task priority. Such heterogeneities introduce significant challenges in effective task scheduling and resource allocation. In this paper, we propose a novel digital twin-assisted offloading method based on a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) framework, named DT-MAP. This method jointly considers task heterogeneity, vehicular mobility, and priority-aware scheduling. Additionally, the proposed method incorporates a dynamic reward shaping mechanism that accounts for task priority, delay sensitivity and load penalty, enabling agents to learn cooperative offloading policies under constrained edge resources. To evaluate our approach, we develop a simulated VEC environment inspired by digital twins, which dynamically reflects vehicle mobility, network conditions, and edge server status. Experiments with varying vehicle numbers show that DT-MAP outperforms baseline strategies (Random, Greedy, SAC), with an average load balancing improvement of 17.48%, 14.81%, and 3.78%, respectively. It also reduces average latency by 9.57%, 7.30%, and 3.24%. Additionally, DT-MAP achieves near-saturation resource utilization, with an average efficiency of 98.70%.

Taotao Yu, Zhou Zhou, Hongbing Cheng
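A dynamic reward shaping mechanism of the kind the abstract above describes might combine task priority, delay sensitivity, and a load penalty roughly as follows. The weights and signal names here are illustrative assumptions, not taken from DT-MAP:

```python
def shaped_reward(latency, deadline, priority, server_load,
                  w_priority=1.0, w_delay=1.0, w_load=0.5):
    """Toy priority-aware reward: reward meeting the deadline scaled by
    task priority, penalise lateness relative to the deadline, and
    subtract a load penalty that discourages piling tasks on busy
    servers. All weights are illustrative."""
    on_time = latency <= deadline
    delay_term = max(0.0, latency - deadline) / deadline
    return (w_priority * priority * (1.0 if on_time else -1.0)
            - w_delay * delay_term
            - w_load * server_load)

# A timely, lightly loaded offload scores higher than a late, overloaded one.
r_good = shaped_reward(latency=40, deadline=50, priority=2.0, server_load=0.3)
r_bad = shaped_reward(latency=80, deadline=50, priority=2.0, server_load=0.9)
```

Shaping terms of this kind let agents trade off deadline compliance against load balancing without hand-coding a scheduler.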
DDPG-Based Joint Dynamic Task Offloading and Resource Allocation for Multi-user MEC Networks

With the rapid development of 5G and IoT technologies, edge computing networks face dynamic challenges in task offloading and resource allocation complexity. In this paper, we propose a Deep Deterministic Policy Gradient (DDPG)-based optimization framework for multi-user multi-edge-server scenarios. The dynamic task offloading and resource allocation problem is formalized as a Markov Decision Process (MDP), defining a state space incorporating server resources, user locations, and task states, along with a hybrid action space combining continuous resource allocation and discrete offloading decisions. The key innovation lies in designing a reward function maximizing task completion volume, implementing dynamic policy optimization through an Actor-Critic network architecture, and enhancing stability via experience replay with target network soft updates. Simulation results demonstrate that the Deep Deterministic Policy Gradient algorithm achieves significantly higher average task completion rewards compared to other algorithms, along with substantially lower reward standard deviation.

Shuang Yang, Xiang Xiao, Peidong Zhu, Lulu Wang, Yu Zheng, Ruihan Chen, Mingzhuo Xie
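The target-network soft update mentioned in the abstract above is the standard Polyak averaging used in DDPG. A minimal sketch over plain dictionaries of parameters (the tau values are illustrative):

```python
def soft_update(target, online, tau=0.005):
    """Polyak soft update: theta_target <- tau*theta + (1 - tau)*theta_target.
    A small tau makes the target network track the online network slowly,
    stabilizing the bootstrapped critic targets."""
    return {k: tau * online[k] + (1.0 - tau) * v for k, v in target.items()}

target = {"w": 0.0}
online = {"w": 1.0}
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
# with tau = 0.5 the gap to the online weights halves each step
```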
CGO: Cloud Game Orchestration via Resource Perception and CODEC Optimization

Cloud gaming acceleration faces critical challenges in balancing latency and visual quality. To address these issues, we propose a resource-aware mechanism for fine-grained game analysis, enabling precise identification and management of resource-intensive stages in real-time. By utilizing motion vector (MV) information stored in the encoder, our method enables the edge device to reconstruct and enhance the video frames locally before delivering them to the user. This hybrid architecture not only reduces computational load on the cloud but also minimizes network bandwidth consumption. Experimental results demonstrate that the proposed approach achieves significant improvements in visual quality while maintaining smooth gameplay experiences. Our solution provides a promising direction for optimizing cloud gaming performance by efficiently integrating resource computing with advanced CODEC acceleration techniques.

Taolei Wang, Chao Li, Jing Wang, Xiaofeng Hou, Minyi Guo
Role-Aware Dynamic Grouping for Efficient Coordination in Multi-agent Reinforcement Learning

Cooperative multi-agent reinforcement learning aims to train decentralized agents to accomplish joint tasks by maximizing a global reward. While existing value decomposition methods under the centralized training and decentralized execution paradigm have achieved notable success, they often overlook the latent role structures inherent in multi-agent systems. In real-world scenarios, agents may exhibit functional heterogeneity or behavioral diversity, even when sharing identical observation and action spaces. To address this limitation, we propose Role-Aware Dynamic Grouping (RADG), a novel framework that learns contrastive role representations from agents’ trajectory information and performs adaptive grouping based on these learned roles. The extracted role embeddings capture meaningful behavioral patterns that guide flexible and dynamic group formation. Within each group, agents coordinate more effectively through shared policy information and group-aware value decomposition. RADG enables structured cooperation, improves exploration efficiency, and enhances generalization across diverse tasks. Importantly, it operates without relying on manual supervision or domain-specific priors, making it well-suited for dynamic and complex environments. Experimental results on standard cooperative MARL benchmarks demonstrate that RADG consistently outperforms existing baselines in coordination performance, training stability, and adaptability to varying team structures.

Hongxin Zhang, Zhi Li, Junbo Wang
SynergiCache: A Novel Cluster Cache for Enhancing Performance in Cloud Storage Systems

In cloud storage systems, caching is a commonly employed method to enhance system performance. Utilizing Solid State Drives (SSDs) as caches for Hard Disk Drives (HDDs) can improve overall system performance at a relatively low cost. However, traditional caching systems suffer from inefficiencies in write operations due to their overwrite approach. Furthermore, when slow disks are present in cloud storage systems, caching algorithms are often unable to detect them, resulting in degraded performance of the storage system. To address these challenges, we propose SynergiCache, a novel cluster caching system for cloud storage systems that significantly enhances performance and efficiency. Initially, we convert the traditional overwrite approach to an efficient append-only method, leveraging Remote Direct Memory Access (RDMA) technology to enhance data transfer efficiency. Furthermore, we employ a composite Key-Value (KV) storage mode and implement a cooperative garbage collection mechanism to optimize data access and storage performance. Subsequently, we introduce a slow-disk-sensitive adaptive (SDSA) caching algorithm that optimizes the flow of data between SSDs and HDDs, thereby reducing the adverse impact of slow disks on cloud storage systems. We implemented SynergiCache, adapted it for integration with Ceph, and conducted comprehensive experiments. The experimental results demonstrate that SynergiCache significantly enhances cloud storage performance, reducing average latency by 89.55% and increasing IOPS by 9.52× compared to traditional caching systems.

Yucheng Kang, Jiawei Li, Chenming Chang, Keqiang Li, Yupeng Chen, Yi Zhang
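The shift from overwrite to append-only writes with garbage collection, as described in the abstract above, can be sketched in miniature. This is a single-node toy that ignores RDMA, clustering, and SSD/HDD tiering, all of which are the paper's actual subject:

```python
class AppendOnlyKV:
    """Toy append-only key-value store: every write appends a record to
    a log and updates an in-memory index; garbage collection compacts
    the log by rewriting only the latest record per key."""
    def __init__(self):
        self.log = []      # list of (key, value) records, append-only
        self.index = {}    # key -> offset of the latest record

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def gc(self):
        live = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.log, self.index = [], {}
        for k, v in live:
            self.put(k, v)

store = AppendOnlyKV()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)   # supersedes the first record for "a", no overwrite
store.gc()          # compaction drops the stale record
```

Append-only writes turn random updates into sequential ones, which is what makes them attractive for SSD caches; the cost is the periodic compaction shown in `gc`.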
Image Compressive Sensing Approach Based on Mixed Precision Training and Deep Unrolling Network

In this paper, we present Mix-ISTA, an enhanced version of the interpretable optimization-inspired deep network for image compressive sensing (ISTA-Net) based on mixed precision training, which preserves high-quality compressed reconstruction while speeding up training. ISTA-Net is a deep neural network architecture inspired by the Iterative Shrinkage-Thresholding Algorithm (ISTA), combining the advantages of optimization-based and network-based approaches. However, training ISTA-Net on large-scale datasets can be quite time-consuming. To overcome this problem, we introduce a mixed precision training technique that effectively reduces memory and compute costs by combining single-precision (FP32) and half-precision (FP16) operations. The experimental results show that mixed precision training significantly improves training speed while leaving the reconstruction performance of ISTA-Net unaffected. By greatly accelerating training without sacrificing reconstruction quality, the method gains practical value in real-world applications.

Lei Feng, Mingzhu Bian, Jun Zhu, Bo Zhang, Xiuliang Zhang
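The FP32/FP16 recipe described in the abstract above follows the usual mixed-precision pattern: half-precision compute, full-precision master weights, and loss scaling against gradient underflow. A framework-free sketch on a one-parameter linear model; the model, learning rate, and loss scale are illustrative, not from the paper:

```python
import numpy as np

def mixed_precision_step(w32, x, y, lr=0.1, loss_scale=1024.0):
    """One SGD step on y ~ w*x with a simplified mixed-precision recipe:
    FP16 forward/backward, FP32 master weights, and loss scaling so
    small FP16 gradients do not underflow to zero."""
    w16 = w32.astype(np.float16)                 # half-precision copy for compute
    x16, y16 = x.astype(np.float16), y.astype(np.float16)
    err = w16 * x16 - y16                        # FP16 forward pass + residual
    scaled = 2 * err * x16 * np.float16(loss_scale)
    grad = scaled.mean(dtype=np.float32) / loss_scale  # unscale in FP32
    return w32 - lr * grad                       # FP32 master-weight update

w = np.float32(0.0)
x = np.linspace(-1, 1, 64).astype(np.float32)
y = 0.5 * x                                      # ground-truth slope 0.5
for _ in range(200):
    w = mixed_precision_step(w, x, y)
```

Keeping the master weights in FP32 is what prevents the tiny per-step updates from being rounded away in FP16, while the bulk of the arithmetic still runs at half precision.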
FastDAG: A Low-Latency and Parallel Wave-Execution Consensus with a Double-Layer DAG

DAG-based Byzantine Fault Tolerant protocols have gained popularity due to their high throughput, but they often suffer from high latency caused by the serial wave-execution model. In this paper, we propose FastDAG, the first asynchronous DAG-based consensus protocol that adopts a parallel wave-execution model. FastDAG introduces a double-layer DAG structure to parallelize voting and a cross-referencing approach to link the two layers, significantly reducing latency. To address the challenges of inconsistency between the two layers, we design a Cross-Reference Fast commit approach that determines block commitment based on voting results from both layers. To address the challenge of Byzantine behavior of the leader, we design a planned-and-forced switching approach. Real-world experiment results on 46 cloud servers show that FastDAG outperforms existing protocols, achieving 22.5% lower latency than GradedDAG [1] and 35% lower than Tusk [2].

Yi Hua, Xiulong Liu, Hao Xu, Chenyu Zhang, Licheng Wang, Keqiu Li
Long-Term Cloud Workload Prediction with Multi-period Augmented LSTM

Long-term Cloud Workload Prediction (LCWP) is critical for efficient resource provisioning and cost optimization in cloud computing environments. However, traditional prediction approaches compress complex workload patterns into a single token, leading to catastrophic forgetting of historical variations. Additionally, they lack the capability to capture distinct periodic patterns (e.g., minutely and hourly cycles) and bursty trends. To overcome these limitations, we propose the Multi-period Augmented LSTM (MUPA), a comprehensive encoder-decoder model featuring explicit cross-period connections designed to maximize utilization of inherent periodic information. MUPA architecture integrates two novel LSTM variants as core components: Multi-input LSTM, which aggregates latent representations across time steps to establish global workload dynamics understanding, and Broadened LSTM, which enhances memory mechanisms by progressively expanding the cell state’s value range to learn long-term dependencies. Extensive experiments on real-world cloud workload datasets demonstrate MUPA’s superior effectiveness for workload prediction tasks.

Wentao Shi, Jiarui Hu, Xiangkai Ma, Wenzhong Li, Shuai Li, Sanglu Lu
In-Orbit Container Registry Planning for Fast Image Downloading in LEO Satellite Constellation

In-orbit computing in Low Earth Orbit (LEO) satellite constellations represents a significant advancement in enhancing the efficiency of satellite data processing. Container-based cloud-native solutions are increasingly applied to enhance the elasticity of in-orbit computing. Despite this high potential, performance is significantly constrained by the container image downloading delay over satellite-ground links. An in-orbit container registry is therefore required to reduce the expensive image downloading overhead. However, due to the motion of LEO satellites, the network topology changes dynamically, leading to fluctuations in Inter-Satellite Link (ISL) connectivity and communication rates. This imposes significant challenges on in-orbit container registry planning. Additionally, the planning must also account for the limited on-satellite storage and request popularity. To this end, we investigate the problem of in-orbit container registry planning for overall downloading time minimization. The problem is formulated in ILP form and proved to be NP-hard. We further propose the In-orbit Registry Planning algorithm based on Randomized Rounding (RR-IRP). The experimental results demonstrate the effectiveness of our RR-IRP algorithm, which reduces container image download time by 22.71% on average compared to classic solutions.

Lifeng Tian, Yuepeng Li, Deze Zeng, Lin Gu, Chengyu Hu, Liang Zhong
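The randomized-rounding step at the heart of RR-IRP-style algorithms turns a fractional solution of the ILP relaxation into a 0/1 placement. A toy version with a naive retry-based feasibility repair; the capacity model, probabilities, and repair rule below are assumptions for illustration only:

```python
import random

def randomized_rounding(frac, capacity, rng):
    """Round a fractional registry-placement vector to a feasible 0/1
    placement: satellite i hosts the image with probability frac[i],
    resampling until total storage capacity is respected and at least
    one replica exists (a toy feasibility repair, not the paper's)."""
    while True:
        placement = [1 if rng.random() < p else 0 for p in frac]
        if 1 <= sum(placement) <= capacity:
            return placement

rng = random.Random(42)
# hypothetical fractional LP solution over four satellites, capacity 2
plan = randomized_rounding([0.9, 0.6, 0.1, 0.05], capacity=2, rng=rng)
```

The appeal of randomized rounding is that the expected cost of the rounded placement matches the fractional optimum, which is how such algorithms obtain approximation guarantees.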
TIDF: Timing-Based Device Fingerprinting for PLCs

Industrial Control Systems (ICS) often lack device-level authentication, making them vulnerable to unauthorized access to Programmable Logic Controllers (PLCs). To address this challenge, we propose a lightweight hybrid fingerprinting method, Timing-based Device Fingerprinting (TIDF), for detecting unauthorized PLCs based on communication processing time and clock pulse period. TIDF leverages stable ICS network conditions and inherent PLC hardware characteristics, and integrates these features into a unified system consisting of filtering, training, and anomaly detection modules. By employing Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and One-Class Support Vector Machine (OCSVM), TIDF achieves accurate and efficient classification with low overhead. We evaluate TIDF on real-world data from 13 PLCs, including Siemens and Xinje, and further test its robustness against basic forgery attempts. The results show an anomaly detection rate of 96%, demonstrating the effectiveness of TIDF in detecting unauthorized device access attacks and enhancing ICS security.

Lei Xiang, Hao Han
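TIDF's clustering stage uses DBSCAN over timing features; a minimal one-dimensional DBSCAN is enough to show how a forged device's deviant response time ends up labelled as noise. The sample values, eps, and min_pts below are invented, not from the paper:

```python
def dbscan_1d(samples, eps, min_pts):
    """Minimal 1-D DBSCAN: points with at least min_pts neighbours
    within eps are core points; anything unreachable from a core point
    keeps the noise label -1. Returns one label per input sample."""
    labels = [-1] * len(samples)
    cluster = 0
    for i, s in enumerate(samples):
        if labels[i] != -1:
            continue
        neigh = [j for j, t in enumerate(samples) if abs(t - s) <= eps]
        if len(neigh) < min_pts:
            continue                    # not a core point (yet)
        stack = list(neigh)             # expand a new cluster from the core
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            more = [k for k, t in enumerate(samples) if abs(t - samples[j]) <= eps]
            if len(more) >= min_pts:    # only core points extend the cluster
                stack.extend(more)
        cluster += 1
    return labels

rtt_ms = [10.0, 10.1, 9.9, 10.05, 25.0]   # last sample: suspicious device
labels = dbscan_1d(rtt_ms, eps=0.5, min_pts=3)
```

The four legitimate response times form one dense cluster, while the 25 ms outlier stays labelled -1, which is exactly the signal an anomaly-detection stage would act on.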
Deadlock-Free Transaction Processing in Payment Channel Networks

Payment channel networks (PCNs) enhance blockchain scalability by enabling off-chain payments, where Hashed Time-Locked Contracts (HTLCs) ensure the security of multi-hop payments across intermediaries. As the scale of payment channel networks expands, the number of transactions naturally increases, while concurrent executions on overlapping channels trigger resource contention for the limited capacity of shared channels, potentially leading to deadlocks. Based on the principle that a consistent execution order across overlapping channels prevents deadlocks, we systematically address the deadlock problem: we formalize the deadlock formation conditions, design detection methods based on these conditions, develop total-order and partial-order sorting strategies to break potential cyclic dependencies, and ultimately propose a fairness-aware, deadlock-free scheduling mechanism that enhances transaction success rates. Extensive simulations validate our approach, demonstrating competitive transaction success rates alongside robust deadlock prevention.

Rong Cao, Jingjing Zhang, Peizong Yang, Litong Sun, Weigang Wu, Jing Bian
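The total-order strategy in the abstract above rests on a classic idea: if every transaction acquires its shared channels in one globally consistent order, no cyclic wait can form. A minimal sketch with thread locks standing in for channel capacity; the channel names and transactions are illustrative:

```python
import threading

def acquire_channels(locks, channel_ids):
    """Acquire channel locks in one globally consistent (sorted) order.
    Identical ordering across all transactions rules out cyclic waits,
    and hence deadlocks, regardless of each payment's hop order."""
    for cid in sorted(channel_ids):
        locks[cid].acquire()

def release_channels(locks, channel_ids):
    for cid in sorted(channel_ids, reverse=True):
        locks[cid].release()

locks = {c: threading.Lock() for c in ("A", "B")}
done = []

def payment(name, path):
    acquire_channels(locks, path)   # hop order differs, lock order does not
    done.append(name)
    release_channels(locks, path)

# Two payments traverse the same channels in opposite hop orders;
# naive in-path locking could deadlock, sorted locking cannot.
t1 = threading.Thread(target=payment, args=("tx1", ["B", "A"]))
t2 = threading.Thread(target=payment, args=("tx2", ["A", "B"]))
t1.start(); t2.start()
t1.join(); t2.join()
```

A partial-order variant would only constrain the relative order of channels that actually overlap between concurrent transactions, trading stronger concurrency for more bookkeeping.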
QoSmart-IoT: Secure QoS-Based Reconfiguration and Protocol Adaptation for Hybrid Clustered IoT Systems in Constrained Environments

Current Internet of Things (IoT) communication systems typically employ static protocols and hardcoded resource allocation mechanisms, limiting their ability to adapt to dynamic, resource-constrained environments. These schemes lack real-time adaptability and cannot provide fine-grained Quality of Service (QoS) management, particularly under network stress or time-varying conditions. This paper addresses this limitation by introducing QoSmart-IoT, a three-layered framework operating at client, cluster, and edge levels to enable real-time adaptation through QoS-driven decisions. The framework integrates hybrid clustering algorithms with adaptive Message Queuing Telemetry Transport (MQTT) and Constrained Application Protocol (CoAP) switching, utilizing live performance metrics including latency, energy consumption, throughput, and session-level security feedback. Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms optimize resource allocation, while the Advanced Encryption Standard (AES-128) provides security. Implemented on Contiki-NG and evaluated through comprehensive runtime analysis across multiple optimization scenarios and stress tests, the system demonstrates significant performance improvements. Our work reduces average latency by 23.3% (from 55.1 ms to 42.3 ms), improves QoS scores from 0.82 to 0.88 during heavy load conditions, and achieves energy savings of up to 0.4 W during protocol switching operations. The adaptive clustering mechanism successfully executed 9 reconfigurations within 30 s each, enhancing recovery time and communication reliability and demonstrating QoSmart-IoT’s effectiveness in providing secure, adaptive performance in IoT environments.

Osama Dighriri, Priyadarsi Nanda, Manoranjan Mohanty, Bashair Alrashed, Ibrahim Haddadi
ManuMatic: Strategy Injection for Robust Automatic Hybrid Parallelism in Distributed DNN Training

Training modern deep neural networks (DNNs) requires hybrid parallelism. Automatic planners search data, tensor/model, and pipeline shardings with cost models, but decisions can drift from runtime optima due to framework/planner decoupling and overlap mis-modeling. We present ManuMatic, a light-touch planner that lets users pin a few critical operator shardings while automatically deriving globally consistent strategies for the rest. Inside a binary recursive partitioner, ManuMatic prioritizes pins via an infinite compromise price and decomposes multi-dimensional hints into two-way refinements; when hard constraints are infeasible, a soft-penalty variant applies. The design is profiling-free, preserves D-Rec’s short compilation time, and degenerates to D-Rec when no pins are given. Built atop D-Rec, ManuMatic delivers consistent speedups without cost-model reengineering: on Mixtral-8×7B, an expert-parallel-aware BMM pin achieves 2.24× over D-Rec; on Llama3-8B, a sequence-parallel-aware MatMul pin reaches 2.04×; on Qwen2.5-72B, a sequence-parallel-aware MatMul pin combined with BMPipe yields 1.45× over D-Rec and 1.30× over an expert plan. These results show that minimal guidance can robustify automatic parallelism while largely preserving automation.

Ruiwen Wang, Chong Li, Hongxing Wang, Raja Appuswamy, Yujie Yuan
Maximizing the Utility of Multiple UAV Service Providers: A Hierarchical Cooperation Approach

In recent years, unmanned aerial vehicles (UAVs) have been utilized as mobile edge computing (MEC) platforms to tackle computing resource limitations and communication coverage issues, particularly in areas without fixed infrastructure. However, the independent operation of UAV providers often leads to imbalanced service loads, inefficient resource usage, and limited coverage. To address these issues, this paper proposes a hierarchical cooperation approach for multiple UAV service providers, optimizing coalition formation and task offloading strategies to enhance overall system utility. We model the collaboration between UAV providers as a coalition formation game (CFG) and the joint order is employed to ensure stable coalitions, thus maximizing system performance. Task offloading and resource allocation within each coalition are formulated as a many-to-one matching problem to optimize resource utilization and computational efficiency. The Shapley value is applied for fair utility distribution, incentivizing UAV providers to maintain cooperation. Extensive experiments demonstrate the effectiveness of our approach, with the joint order improving system utility by 4.76% and 31.58% over traditional selfish and Pareto orders, respectively. Furthermore, the proposed offloading scheme shows significant performance gains of 13.22% and 20.24% compared to the shortest distance and random access algorithms, respectively.

Zhangzhou Li, Geyao Cheng, Bangbang Ren, Xiaolei Zhou, Lailong Luo, Deke Guo
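The Shapley value used for utility distribution in the paper above averages each provider's marginal contribution over all join orders. An exact brute-force computation for a toy three-provider coalition game; the utility table is invented for illustration:

```python
from itertools import permutations

def shapley(players, value):
    """Exact Shapley value: average each player's marginal contribution
    over every permutation (join order) of the grand coalition."""
    shares = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = []
        for p in order:
            before = value(frozenset(coalition))
            coalition.append(p)
            shares[p] += value(frozenset(coalition)) - before
    return {p: s / len(orders) for p, s in shares.items()}

# Hypothetical superadditive utilities for three UAV providers a, b, c.
v = {frozenset(): 0, frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 2,
     frozenset("ab"): 3, frozenset("ac"): 4, frozenset("bc"): 4,
     frozenset("abc"): 6}
shares = shapley(["a", "b", "c"], lambda s: v[s])
```

The shares sum exactly to the grand-coalition utility (efficiency), which is the property that makes Shapley-based payoff division a credible incentive for providers to stay in the coalition.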
Slotqueue: A Wait-Free Distributed Multi-producer Single-Consumer Queue with Constant Remote Operations

For some distributed applications, e.g. the distributed actor model, distributed multi-producer, single-consumer (MPSC) queues play a vital role. For these applications to be fault tolerant and performant, a highly efficient non-blocking distributed MPSC queue is desired. Currently, in the literature, there is no non-blocking distributed MPSC queue. Therefore, a question naturally arises: Does there exist a performant non-blocking distributed MPSC queue? We answer this question by proposing Slotqueue, a wait-free distributed MPSC queue with only a constant number of remote operations per enqueue and dequeue call. This is achieved through the use of timestamps to order the items in the queue and a flat array structure to maintain the timestamps. To demonstrate how well Slotqueue performs in practice, we develop a microbenchmark to measure its throughput against the other distributed MPSC queues we have surveyed in the literature. We find that Slotqueue is fault tolerant while performing comparably to other distributed MPSC queues.

Do Nguyen An Huy, Thanh-Dang Diep, Karl Fürlinger, Nam Thoai
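The timestamp-plus-flat-array idea behind Slotqueue can be sketched in a single process: each producer owns a slot, items carry globally ordered timestamps, and the consumer always takes the pending item with the smallest timestamp. This toy deliberately ignores the distributed, wait-free machinery that is the paper's actual contribution:

```python
import itertools

class SlotQueueSketch:
    """Single-process sketch of a timestamp-ordered MPSC queue: one slot
    list per producer, a shared counter standing in for a distributed
    fetch-and-add timestamp, and a consumer that scans the flat array
    of slots for the minimum timestamp."""
    def __init__(self, n_producers):
        self.slots = [[] for _ in range(n_producers)]
        self.clock = itertools.count()

    def enqueue(self, producer, item):
        self.slots[producer].append((next(self.clock), item))

    def dequeue(self):
        pending = [(s[0][0], i) for i, s in enumerate(self.slots) if s]
        if not pending:
            return None
        _, idx = min(pending)            # smallest timestamp across slots
        return self.slots[idx].pop(0)[1]

q = SlotQueueSketch(2)
q.enqueue(0, "a"); q.enqueue(1, "b"); q.enqueue(0, "c")
order = [q.dequeue() for _ in range(3)]
```

Because the consumer only inspects the head of each producer's slot, a dequeue touches a bounded number of locations per producer, which is the structural property that makes a constant-remote-operation implementation plausible.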
Exploiting Hard Samples for Stealthy Backdoor Attacks on Large Language Models

Large language models (LLMs) have made remarkable advances in natural language processing. However, as these models increase in scale and expand their application scope, vulnerabilities in text processing tasks become more prominent. Traditional backdoor attack methods are ineffective against LLMs due to their intricate decision boundaries, massive training data, and outstanding generalization capabilities. This paper introduces a backdoor attack framework based on hard samples. By analyzing and quantifying forgetting events during training, we accurately identify hard samples with ambiguous decision boundaries and implant subtle backdoor triggers in them. This approach leverages the model’s inconsistent classification behavior on specific samples to facilitate backdoor activation while maintaining normal functionality, enabling highly stealthy targeted attacks. Our experiments conducted on the Emotion and Twitter datasets using Llama2-7B and Llama3-8B models demonstrate that, with only a 30% poisoning rate targeting a single label, the proposed attack framework achieves an attack success rate (ASR) that exceeds traditional methods by more than 70%. Meanwhile, benign accuracy decreases by less than 2%, indicating strong generalization across various models and datasets.

Diqun Yan, Rangding Wang
FedSM: A Federated Spectrum Management Architecture for 6G Network

Efficient spectrum management is critical for 6G mobile communications to meet stringent latency and bandwidth requirements of emerging edge computing applications. However, current spectrum management approaches face two primary challenges: inefficient spectrum utilization due to competitive conflicts among edge devices, and privacy concerns when sharing sensitive channel state information across the network. In this work, inspired by federated learning’s capability for privacy preservation, we present FedSM (Federated Spectrum Management), a novel hierarchical framework that addresses these challenges through two integrated modules: coalition-based spectrum allocation using hedonic coalition game theory to partition devices into strategic groups to reduce competitive conflicts, and bandit-based spectrum sharing employing contextual multi-armed bandit algorithms for adaptive resource allocation within coalitions while preserving privacy. Comprehensive evaluation on both Komondor simulator-based prototype testing and real-world VR application deployments demonstrates FedSM’s superior performance, achieving 93.51% channel utilization compared to 68.92% for baseline approaches in simulation environments, and 78% versus 30% for local management in real-world testbeds, while maintaining reasonable latency of 248.68 ms in simulation and competitive delay performance in real-world scenarios, all with complete privacy preservation.

Jinqi Yan, Zhili He, Chuang Hu, Dazhao Cheng
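The bandit-based spectrum sharing module described in the abstract above can be approximated by a per-context epsilon-greedy learner that discovers which channel performs best under a given load context. The contexts, channels, and reward signal here are invented for illustration:

```python
import random

class EpsilonGreedyBandit:
    """Per-context epsilon-greedy channel selector: tracks a running
    mean reward per (context, channel), exploits the best-known channel
    most of the time, and explores uniformly with probability epsilon."""
    def __init__(self, channels, epsilon=0.1, rng=None):
        self.channels = list(channels)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts, self.means = {}, {}

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.channels)          # explore
        return max(self.channels,
                   key=lambda c: self.means.get((context, c), 0.0))

    def update(self, context, channel, reward):
        key = (context, channel)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        m = self.means.get(key, 0.0)
        self.means[key] = m + (reward - m) / n             # incremental mean

def link_quality(context, channel):        # hypothetical channel feedback
    return 1.0 if channel == "ch2" else 0.2

bandit = EpsilonGreedyBandit(["ch1", "ch2", "ch3"], rng=random.Random(0))
for ch in bandit.channels:                 # warm-up: probe each channel once
    bandit.update("high_load", ch, link_quality("high_load", ch))
for _ in range(100):
    ch = bandit.select("high_load")
    bandit.update("high_load", ch, link_quality("high_load", ch))
```

FedSM uses contextual bandits rather than this plain epsilon-greedy variant, but the loop is the same: observe context, pick a channel, update the estimate from local feedback only, which is what keeps sensitive channel state private to each coalition.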
TriCooling-Sim: Efficient Thermal Simulation for High-Density Micro AI Data Centers

Miniaturized artificial intelligence (AI) data centers (MAIDC) built from high-performance embedded AI nodes have shown great promise in accelerating next-generation edge computing applications. However, MAIDCs are hard to design since they face more stringent thermal constraints due to high power densities exceeding traditional server architectures, limited cooling capacity from compact form factors, and variable conditions such as fluctuating ambient temperatures. In this work, we take the first step toward exploring the thermal behavior of MAIDCs and present TriCooling-Sim, a hierarchical and adaptive thermal–computation co-simulation framework for high-density MAIDCs composed of system-on-chip (SoC) nodes. Our design features a novel lightweight physics-guided modeling strategy that can achieve proactive workload–cooling co-optimization, supporting power-efficient architecture design and intelligent resource management. The framework allows multi-scale thermal simulation across six orders of temporal magnitude and three orders of spatial magnitude without prohibitive overhead. Validation across 16 representative MAIDC configurations shows that TriCooling-Sim attains a mean absolute error of 1.7 °C compared with reference CFD simulations while reducing simulation time by up to two orders of magnitude, enabling both rapid design-space exploration and near-real-time operational decision-making for future MAIDC deployments.

Jinyang Guo, Xinkai Wang, Jing Wang, Xiaofeng Hou, Chao Li, Minyi Guo
Semantic-Driven Task-Traffic Co-scheduling for TSN with Generalization Ability: A Heterogeneous Graph Neural Network-Based Method

In the Industrial Internet of Things (IIoT), Time-Sensitive Networking (TSN) is a promising field network for implementing application functions across distributed devices. For a TSN-engaged IIoT system, co-scheduling task execution and TSN transmission is crucial to guarantee the chain execution of application tasks. However, the generalization ability of co-scheduling across varying scenarios is hindered in existing works, which lack characterization for resource conflicts arising from semantic relations among tasks, traffic, and the underlying topology. To address this, we propose a heterogeneous graph neural network (HGNN)-based co-scheduling method featuring explicit conflict characterization. We design a semantic-aware encoder within the HGNN, which aggregates heterogeneous component features through designated graph paths to capture their semantic relations. An agent then extracts conflict patterns from this encoding, and decodes conflict-free scheduling decisions on offloading, task priority assignment, and traffic offset design. To enhance generalization ability in unseen scenarios, the conflict extraction ability and the inductive encoding ability are refined through deep reinforcement learning feedback. Experiments demonstrate that our method achieves 12% higher schedulability and 20% lower task chain delay, and maintains its performance in unseen topologies and task scenarios.

Zhihao Yang, Lei Xu, Shouliang Wang, Kankan Wu, Cailian Chen, Xiaolin Wang
CPU–GPU Heterogeneity Based Pipeline Parallel Architecture in Physical Layer Processing

Efficient performance analysis and software-hardware decoupling are crucial for evaluating future communication technologies. However, current general-purpose processors fail to fully leverage the synergistic computational capabilities of the Central Processing Units (CPUs) and Graphics Processing Units (GPUs) when evaluating the performance of wireless protocol stacks, resulting in inefficient processing of compute-intensive tasks and an inability to meet the high-throughput demands of real-time scenarios. To address this issue, this paper proposes a pipeline parallel processing architecture based on CPU-GPU coordinated scheduling. During the pipeline parallel processing of multiple data frames, this architecture intelligently assigns computational tasks to the most suitable processing unit based on real-time load and processing unit structures, thereby improving processing efficiency, reducing power consumption, and enhancing overall system performance. Experimental results indicate that on the mid-range heterogeneous platform, the pipeline parallel architecture achieves a 172.15% throughput enhancement and a 63.26% latency reduction; on the high-end platform, it attains a 326.32% throughput enhancement and a 76.54% latency reduction, demonstrating robustness across hardware levels. These improvements alleviate the bottlenecks of existing Software-Defined Radio (SDR) simulation architectures.

Shiwen He, Xunzhe Deng, Zhenyu An, Chengzuo Peng, Linhua Liu, Wei Huang
HBD-CE: Efficient Cross-HBD Communication for LLM Training in High-Bandwidth Domain Cluster via Hierarchical Collectives

Large language model (LLM) training relies on multiple parallel strategies, where high-bandwidth domains (HBDs) play a key role in enabling efficient communication between NPUs. Compared to intra-HBD communication, cross-HBD communication is significantly slower and remains unavoidable due to limited HBD sizes and different model partitioning strategies. Hierarchical collectives can reduce cross-HBD communication overhead, but their effectiveness is hindered by imbalanced parallel group distribution across HBDs, different communication costs with different distribution patterns, and interdependencies among parallelism strategies. To address these challenges, we propose HBD-CE, a model placement scheme designed for communication-efficient cross-HBD LLM training in HBD clusters, which optimizes model placement to enhance cross-HBD communication efficiency. We formulate the model placement problem as a mixed-integer nonlinear programming (MINLP) problem. By leveraging the properties of hierarchical collectives, we transform the MINLP into a series of 0–1 integer linear programming (ILP) problems and develop an exact algorithm to solve them. Experiment results demonstrate that HBD-CE can reduce per-iteration communication time by 5%–77.3% across diverse topologies and model configurations.

Huihuang Qin, Shuangwu Chen, Zijian Wen, Zian Wang, Ziyang Zou, Tao Zhang, Xiaobin Tan, Jian Yang
NNia-8: An 8-Core RISC-V Neural Network Inference Accelerator with Efficient Processing Elements and Memory Utilization

RISC-V is widely used for edge AI acceleration, but most existing processors rely on scalar architectures, resulting in low PE density, inefficient bandwidth utilization, and complicated parallel processing schemes. Traditional Dot-Product (Dot-P) instruction extensions support only four 8-bit MAC operations per cycle, a computational density that cannot sustain efficient neural network implementation. Moreover, the general-purpose register constraint in RISC-V strictly limits matrix kernel dimensions, degrading computational resource utilization. Current multicore solutions require a Tightly Coupled Data Memory (TCDM) with twice as many memory banks as cores, along with mandatory padding operations to prevent bank conflicts, increasing memory subsystem complexity and area overhead. To address these challenges, we propose NNia-8, an 8-core RISC-V processor with six custom instructions for neural network inference acceleration (NNia). By replacing Dot-P operations with Out-Product (Out-P) computation paradigms, we achieve enhanced PE density optimization. Through implicit register invocation techniques that integrate computation and memory access operations, along with dedicated buffer enhancements, bandwidth utilization and PE efficiency are substantially improved. Evaluated under a 55 nm CMOS process, NNia-8 achieves 49.9 GOPS at 8-bit precision with energy efficiency reaching 322 GOPS/W, outperforming state-of-the-art solutions.

Xingbo Wang, Yucong Huang, Xinyu Kang, Yuru Li, Qi Wang, Terry Tao Ye
GPowerT: LLM-Driven Automated Programming for Power-Constrained IoT Applications

Large Language Models (LLMs) are reshaping Internet of Things (IoT) application development by letting developers generate functional code from natural language. Yet battery-powered IoT devices operate under tight energy budgets, and current LLM-driven programming frameworks offer limited support for power-constrained IoT design. In this paper, we propose GPowerT, an LLM-driven automated programming system for power-constrained IoT applications. GPowerT extracts power-related constraints from natural language requirements, including battery capacity, target lifetime, and energy-saving policies, drawing on hardware power profiles and a low-power design knowledge base. During code generation, it injects strategies such as deep sleep, periodic sampling, and batch communication. A static power estimation module analyzes operation and sleep cycles, communication frequency, and peripheral usage to predict total energy consumption and check compliance with the budget. If the target is not met, the system pinpoints the sources of overconsumption and returns optimization hints to the LLM, enabling closed-loop refinement. Evaluations on representative IoT tasks show that GPowerT improves power-budget compliance and developer efficiency, and the generated programs achieve long battery lifetimes, demonstrating the system's potential and practicality for low-power IoT application development.

Ruitong Ye, Ming Gao
Millisecond-Level Interference-Aware Scheduling for Multi-Inference Co-Location on Ascend NPUs

The growing popularity of AI inference services has created a substantial demand for AI accelerators. However, the strict tail-latency requirements of inference workloads often conflict with the throughput optimization objectives of these accelerators. Many AI accelerators (e.g., Ascend NPUs, Kunlun chips) support co-located deployment of inference tasks via temporal sharing to improve throughput, where interference between tasks can be abstracted as kernel queuing delays. Motivated by this observation, we design nShare, a system that detects interference at the kernel level on hardware supporting temporal sharing and performs millisecond-scale scheduling control. nShare models interference on temporally shared devices as SLO (Service Level Objective) slack and incorporates a dynamic batching mechanism to improve utilization without violating latency constraints. Compared to baseline systems, nShare improves throughput by 39.48%–51.53% while meeting the 99th-percentile SLO.
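The SLO-slack batching idea described above can be sketched as a simple admission rule: grow the batch only while the predicted service time still fits inside the slack left by kernel queuing. The linear latency model and all parameter values below are illustrative assumptions for exposition, not nShare's actual implementation.

```python
# Sketch of SLO-slack-driven batch sizing: enlarge the batch only while the
# predicted service time (a hypothetical linear model: base cost plus a
# per-request cost) still fits inside the slack left after kernel queuing.
def pick_batch_size(slo_ms, queue_delay_ms, per_req_ms, base_ms, max_batch=32):
    slack = slo_ms - queue_delay_ms           # time budget left for compute
    batch = 1
    # Keep growing while one more request would still meet the SLO.
    while batch < max_batch and base_ms + per_req_ms * (batch + 1) <= slack:
        batch += 1
    return batch

print(pick_batch_size(slo_ms=50, queue_delay_ms=10, per_req_ms=2, base_ms=8))
```

With heavy queuing the slack shrinks and the rule degrades gracefully to a batch of one, which is the behavior a latency-bounded batcher needs.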

Wenhao Huang, Fupeng Li, Laiping Zhao, Yeju Zhou, Keqiu Li
An Unsupervised Learning Log Anomaly Detection Method Based on Graph Neural Network

Log anomaly detection is crucial for the security of distributed computing systems. Although existing methods can capture the statistical characteristics and explicit timing patterns of log events, there are still two major challenges for semi-structured log files: 1) Complex topological dependencies between log events and implicit associations with multi-dimensional semantics of key fields have not been fully modeled, and anomalous patterns are easily overwhelmed by high-frequency normal events. 2) Dynamic log formats require field extraction relying on manual rules or supervised learning, which is difficult to adapt to operation and maintenance (O&M) requirements in low-resource scenarios. To address these issues, we propose an unsupervised log anomaly detection framework based on graph neural networks (GNN). Specifically, key fields (e.g., anomalous IPs, failed APIs) are automatically extracted using prompt-based few-shot learning, then a weighted directed graph model fusing semantic embedding and temporal dependency is constructed to fully characterize the dynamic interaction patterns among system components. Moreover, global anomaly identification across events is achieved by co-optimizing graph representation learning and anomaly detection objectives based on one-class directed graph convolutional networks. Experimental results show that our method performs remarkably well on multiple benchmark datasets and exhibits excellent generalization capabilities for unseen log templates, improving distributed system security.

Xianlang Hu, Guangsheng Feng, Xinling Huang, Xiangying Kong, Hongwu Lv
Adaptive Reed-Solomon Coding for OFDM in Mobile Visible Light Communications

This paper proposes an adaptive orthogonal frequency division multiplexing with Reed-Solomon coding (Adaptive OFDM-RS) scheme for mobile visible light communication (VLC) scenarios. This scheme supports real-time code rate selection based on channel state information, choosing k from the set {163, 183, 203, 223, 243}. By continuously monitoring the channel signal-to-noise ratio (SNR), the system dynamically adjusts the RS coding strength to maintain error-free transmission while maximizing spectral efficiency. Simulation results demonstrate that the proposed real-time code rate selection mechanism achieves: error-free transmission at SNR ≥ 13 dB for a 16-QAM modulated VLC-OFDM system; and error-free transmission at SNR ≥ 20 dB for a 64-QAM modulated system. These results validate the high robustness and effectiveness of the system under dynamic channel conditions.
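The real-time code-rate selection described above can be illustrated with a small threshold rule that maps the measured SNR to one of the abstract's k values. The RS(n = 255, k) block length and the specific SNR thresholds below are hypothetical assumptions, not the paper's parameters; only the set of k values comes from the abstract.

```python
# Sketch of SNR-driven RS code-rate selection for an assumed RS(255, k) code.
# The dB thresholds are illustrative placeholders, not the paper's values.
RS_N = 255
K_OPTIONS = [163, 183, 203, 223, 243]            # k values from the abstract
SNR_THRESHOLDS = [13.0, 15.0, 17.0, 19.0, 21.0]  # hypothetical thresholds (dB)

def select_rs_k(snr_db: float) -> int:
    """Pick the largest k (highest code rate) whose SNR threshold is met."""
    chosen = K_OPTIONS[0]  # strongest coding (lowest rate) as the fallback
    for k, thresh in zip(K_OPTIONS, SNR_THRESHOLDS):
        if snr_db >= thresh:
            chosen = k
    return chosen

print(select_rs_k(14.0))  # low SNR selects strong coding: 163
print(select_rs_k(25.0))  # high SNR selects a high code rate: 243
```

The monotone threshold table is what lets the scheme trade coding strength for spectral efficiency as the channel improves.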

Qinghui Chen, Binyue Qing, Hong Wen, Ming Chen, Jie Ma, Zhenheng Chen
Fast and Accurate RDMA Congestion Control with Self-Adapting Rate Adjustment

Modern data centers adopt Remote Direct Memory Access (RDMA) to reduce CPU overhead and network latency. RDMA operates over a lossless network, and RDMA congestion control (CC) protocols are key enablers for achieving low-latency and high-throughput data delivery. Through in-depth experimental analysis, we reveal that existing RDMA CC protocols still suffer from sluggish congestion response and convergence speed. In this paper, we propose a switch-driven CC algorithm named FACC. FACC enables switches to precisely identify flows that actually cause congestion and promptly notifies senders of the congestion information. At the sender, FACC leverages the intrinsic packet conservation property of a lossless network to assess the extent of network congestion. Then, FACC employs a PI controller to adaptively adjust the sending rate, thereby achieving rapid congestion elimination and improving both the transmission rate and convergence speed. We conduct extensive experiments to evaluate the performance of FACC. The results show that FACC improves convergence speed while achieving at most 86.6% lower flow completion time (FCT) compared with state-of-the-art approaches.
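A PI controller of the kind mentioned above can be sketched in a few lines: the error signal is the congestion level reported by the switch, and the output is the adjusted sending rate. The gains, units, and clamping bounds below are illustrative assumptions, not FACC's actual parameters.

```python
# Sketch of PI-based sending-rate adjustment: the controller drives the
# reported congestion signal (here, queue backlog) toward zero.
# Gains and units are illustrative, not the paper's parameters.
class PIRateController:
    def __init__(self, kp=0.05, ki=0.01, min_rate=0.1, max_rate=100.0):
        self.kp, self.ki = kp, ki
        self.min_rate, self.max_rate = min_rate, max_rate
        self.integral = 0.0

    def adjust(self, rate_gbps, backlog_pkts):
        # Positive backlog means congestion: both terms push the rate down;
        # the integral term keeps correcting while backlog persists.
        self.integral += backlog_pkts
        delta = -(self.kp * backlog_pkts + self.ki * self.integral)
        return min(self.max_rate, max(self.min_rate, rate_gbps + delta))

ctrl = PIRateController()
rate = ctrl.adjust(10.0, backlog_pkts=20)   # congestion -> rate decreases
```

The proportional term gives the fast reaction and the integral term removes steady-state error, which is the usual reason a PI loop converges faster than pure additive-increase schemes.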

Xin He, Zihao Zhang, Junchang Wang, Zheng Wu, Weibei Fan
Age of Information Minimization for Secure and Covert UAV Communications

Secure and covert communication in unmanned aerial vehicle (UAV)-assisted wireless systems aims to ensure the security of transmitted information, while age of information (AoI) quantifies its timeliness. However, existing works have not explored AoI minimization for secure and covert UAV communications. This paper formulates AoI minimization as an optimization problem that jointly optimizes the UAV transmit power and flight path under low-probability-of-detection constraints. We then propose an efficient iterative algorithm combining block coordinate descent and successive convex approximation to solve the non-convex problem. Simulation results demonstrate that the proposed scheme achieves a superior trade-off between communication secrecy, information freshness, and power allocation in covert UAV networks.

Yiwen Zhang, Shikai Shen, Bin Yang, Fei Deng, Yumei She, Kaiguo Qian, Riyu Wang, Kai Yang
LGSVE: Leader-Guided Soft Voting Ensemble Model for Class-Imbalanced IoT Intrusion Detection

IoT intrusion detection ensemble methods hold promise for alleviating decision bias under long-tailed class distributions, where single models often fail to achieve adequate minority-class recall. Nevertheless, existing works continue to confront three primary challenges: (1) class-agnostic fusion rules that ignore heterogeneity in feature distributions and model competence across attack types; (2) insufficient exploitation of complementary inductive biases between classifiers, limiting coverage of heterogeneous IoT traffic; and (3) computationally inefficient, noise-prone hyperparameter tuning in high-dimensional parameter spaces. Therefore, we propose a Leader-Guided Soft Voting Ensemble (LGSVE) model that integrates LightGBM, Random Forest, and Bi-Temporal Convolutional Networks to jointly capture structured feature patterns and long-range temporal dependencies. LGSVE employs a class-specific leader strategy combined with weighted soft voting to enhance robustness, particularly for minority-class detection. Furthermore, a Bayesian optimization framework is adopted to improve hyperparameter tuning efficiency and avoid premature convergence. Experiments on CIC-DDoS-2019, CIC-IDS-2017, and ToN-IoT demonstrate that LGSVE significantly boosts minority-class performance while consistently surpassing state-of-the-art baselines in accuracy and robustness.
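The class-specific leader strategy combined with weighted soft voting can be sketched as follows: per class, the base model with the best validation score for that class gets extra weight when class probabilities are averaged. The fixed leader boost and the toy scores below are illustrative assumptions, not LGSVE's actual weighting scheme.

```python
# Sketch of leader-guided soft voting over per-model class probabilities.
# Boost factor and validation scores are illustrative placeholders.
def leader_soft_vote(probas, class_scores, leader_boost=2.0):
    """probas: per-model lists of per-sample class-probability lists.
    class_scores: per-model, per-class validation scores."""
    n_models = len(class_scores)
    n_classes = len(class_scores[0])
    # Base weight 1 for every model; the per-class leader gets leader_boost.
    weights = [[1.0] * n_classes for _ in range(n_models)]
    for c in range(n_classes):
        leader = max(range(n_models), key=lambda m: class_scores[m][c])
        weights[leader][c] = leader_boost
    totals = [sum(weights[m][c] for m in range(n_models)) for c in range(n_classes)]
    fused = []
    for i in range(len(probas[0])):
        fused.append([
            sum(weights[m][c] * probas[m][i][c] for m in range(n_models)) / totals[c]
            for c in range(n_classes)
        ])
    return fused

fused = leader_soft_vote(
    probas=[[[1.0, 0.0]], [[0.0, 1.0]]],    # two models, one sample
    class_scores=[[0.9, 0.2], [0.5, 0.8]],  # model 0 leads class 0, model 1 class 1
)
```

Because the leader's vote dominates only in the classes it is demonstrably good at, a minority-class specialist is not drowned out by models tuned to the majority classes.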

Shuai Zhao, Huiqiang Wang, Hongwu Lv, Yifan Zou, Runong Yang
LLM-Guided Soft Actor-Critic for Resource Allocation in Mobile Edge Computing Networks

Mobile edge computing (MEC) enhances computation efficiency and reduces latency by offloading tasks from mobile devices (MDs) to nearby edge servers (ESs). To optimize the offloading process and manage resource allocation effectively, deep reinforcement learning (DRL) has been widely adopted as a promising solution. However, as the number of MDs increases, the rapidly expanding state-action space poses significant challenges to the ability of DRL to learn effective policies, often resulting in suboptimal decision-making. Large language models (LLMs), with strong reasoning capabilities and extensive prior knowledge, offer a potential solution by enabling more efficient exploration and guiding the DRL agent toward better policies. Therefore, we propose an LLM-guided soft actor-critic (LLM-guided SAC) algorithm, which integrates LLM-generated policy priors with a probabilistic mixed strategy to facilitate learning in high-dimensional decision spaces. By refining the state-action representation, the proposed algorithm enhances policy learning efficiency, while LLM-guided priors enable informed exploration and improve early-stage decision quality. Moreover, the mixed strategy balances exploration and exploitation, contributing to stable and effective learning. Experimental results demonstrate that LLM-guided SAC consistently outperforms baseline methods, particularly in large and complex decision spaces, highlighting its strong potential for resource optimization in MEC networks.
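The probabilistic mixed strategy mentioned above can be sketched as a two-armed selector: with some probability follow the LLM-suggested prior action, otherwise follow the learned SAC policy. The function names, the action strings, and the mixing probability below are hypothetical stand-ins, not the paper's implementation.

```python
import random

# Sketch of a probabilistic mixed strategy combining an LLM policy prior with
# a learned policy. Both policy callables are illustrative placeholders.
def mixed_action(state, sac_policy, llm_prior, eps=0.3, rng=random):
    if rng.random() < eps:
        return llm_prior(state)     # informed exploration from the LLM prior
    return sac_policy(state)        # exploitation of the learned SAC policy

sac_policy = lambda s: "offload_to_es1"   # hypothetical learned policy
llm_prior = lambda s: "offload_to_es2"    # hypothetical LLM suggestion
action = mixed_action({"queue": 5}, sac_policy, llm_prior, eps=0.0)
```

Annealing eps toward zero over training would reproduce the abstract's intent: the prior dominates early decisions and the learned policy takes over as it improves.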

Jianmeng Guo, Xiuhua Li, Jinlong Hao, Lingxiao Chen, Xiaofei Wang, Victor C. M. Leung
ASALP: An Automatic Scaling Architecture for Edge Node Resources Based on Load Prediction

Edge computing provides inherent advantages of low latency and user proximity; however, it encounters significant challenges in achieving resource elasticity and balancing dynamic traffic loads. The default scaling mechanism in Kubernetes, the Horizontal Pod Autoscaler (HPA), adopts a reactive strategy that restricts its capacity to address real-time demands and exhibits limited effectiveness in edge environments. To overcome these limitations, we introduce ASALP (Automatic Scaling Architecture for Edge Node Resources based on Load Prediction), which augments the Kubernetes–KubeEdge framework with an enhanced RWKV-EFE load prediction model and incorporates Nginx, Consul, and Prometheus to enable dynamic load balancing. Evaluated on the MQPS dataset, RWKV-EFE achieves substantially lower mean squared error (MSE) and mean absolute error (MAE), reducing them by 28.71% and 12.58% compared with FEDformer, and by 77.24% and 53.88% compared with Autoformer. Furthermore, in comparison with HPA, THPA, reactive ASALP, and ASALP-FEDformer, ASALP improves the request success rate by 57.17%, 21.33%, 14.62%, and 7.59%, respectively, while also alleviating the adverse effects of unstable communication links. These experimental results confirm the effectiveness of ASALP in enabling efficient resource scaling and load balancing for real-world edge computing deployments.

Hui Liu, Hui Xiang, Yong Wu, Zeguang Liu, Junzhao Du
CRANE: Two-Stage Coordinated Resource Allocation of Network and Compute for Deterministic Workloads

Telecom operators historically built extensive central offices for telephony, which now provide a unique physical substrate for edge computing. The emergence of model training, immersive media, and real-time analytics is transforming these facilities from voice switching hubs into distributed compute sites, demanding deterministic resource scheduling that simultaneously satisfies compute capacity and network SLAs. To this end, we propose CRANE, a coordinated scheduling framework that enables this evolution. CRANE integrates real-time compute and network awareness, multi-objective decision-making for selecting optimal compute nodes, and SRv6 traffic engineering for network SLA enforcement. Experiments demonstrate that CRANE achieves high SLA compliance, lowers end-to-end latency, and improves overall resource utilization, enabling the transformation of legacy telecom infrastructure into a distributed platform that supports deterministic workloads.

Yaqi Yan, Wenlong Zhang, Mingui Zhang, Yajun Li, Wenrui Liu, Wenxuan Zhao, Yuming Xing, Tian Pan
Federated Learning via TEE-Based Dual-Branch Architecture and Interaction-Aware Pruning

Federated learning enhances data privacy but faces security risks like model theft, while traditional cryptographic methods often compromise efficiency or accuracy. Although Trusted Execution Environments (TEEs) offer hardware-level security, their limited memory and lack of hardware acceleration hinder deep neural network deployment. Current partitioning solutions alleviate memory constraints but introduce new vulnerabilities, highlighting the need for lightweight TEE optimization. We propose FedDualPrune, a lightweight federated learning framework that leverages a TEE-based dual-branch architecture. During local training, a frozen pre-trained model in the rich execution environment (REE) serves as a general feature extractor, while a unidirectional feature fusion mechanism securely integrates its outputs with a trainable counterpart inside the TEE. Furthermore, a channel-interaction-aware joint pruning strategy structurally compresses the TEE-hosted branch to satisfy secure memory constraints while preserving model performance. Extensive experiments show that under high pruning rates and Non-IID conditions, the accuracy loss is controlled within 5%, achieving synergistic optimization of privacy protection and model performance.

Wenxuan Zhou, Zhenyu Zhu, Mingyang Xie, Zhihao Qu
Breaking Language Barriers: A Domain-Specific Translation Workflow for Industry

Machine translation (MT) systems have achieved remarkable progress in recent years, especially with the emergence of deep learning and Transformer-based architectures. Nonetheless, the application of a general neural machine translation approach to domain-specific scenarios remains limited. Firstly, the presence of domain-specific terminology may reduce the accuracy of general translation models, which typically lack fine-tuning for the target domain. Secondly, fine-tuning such models necessitates domain-specific training data, which is frequently limited in availability. Finally, general translation tools like Google Translate do not provide a unified pipeline for seamless integration into industry workflows. This paper introduces a unified translation workflow developed to address these limitations, offering adaptability and extensibility for domain-specific applications in industry. Our proposed approach delivers a complete pipeline for end-users, incorporating customizable domain-specific terminology and supporting translation across multiple document formats, thereby enhancing both usability and output quality in real-world applications. We evaluate the proposed approach in an industrial setting, focusing on the textile and garment domain. Using a dataset of textile and apparel domain-specific documents in various formats, along with human evaluation, the experiments demonstrate the effectiveness and practicality of our approach for domain-adaptive machine translation.

Nguyen Duc Loc, Ngo Minh Quan, Ha Trung Chien, Vu Van Minh
PHITS: A Parallel Hyperlink-Induced Topic Search Algorithm with Graph Partitioning and Communication Optimization

In the era of big data, the efficiency of the Hyperlink-Induced Topic Search (HITS) algorithm in processing large-scale web link data has become a critical issue. To address the inefficiency of the traditional serial HITS algorithm when dealing with massive web link graphs, this paper proposes a novel parallel HITS algorithm. The core of the algorithm lies in the judicious partitioning of the web link graph. By adopting a graph-based partitioning method, the large-scale web link graph is divided into multiple subgraphs. Each subgraph is assigned to a computing node, enabling each node to independently calculate the authority and hub values of the web pages within its assigned subgraph. In addition, to reduce communication overhead in the parallel computing process, we further design data compression and asynchronous communication strategies. The former is applied to web link data before transmission to effectively reduce the amount of data transferred, while the latter enables processing units to perform other tasks while waiting for data transmission, thereby improving resource utilization. Experimental results demonstrate that the proposed parallel HITS algorithm not only maintains the accuracy of the original HITS algorithm but also achieves a significant improvement in computing efficiency.
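For reference, the serial HITS baseline that the paper parallelizes alternates authority and hub updates with L2 normalization. This textbook sketch is independent of the paper's partitioning and communication strategies; it only illustrates the per-subgraph computation each node would perform.

```python
# Minimal serial HITS iteration (the textbook baseline).
# graph maps each page to the list of pages it links to.
def hits(graph, iterations=50):
    nodes = list(graph)
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority(p) = sum of hub scores of the pages linking to p
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # hub(p) = sum of authority scores of the pages p links to
        hub = {n: sum(auth[t] for t in graph[n]) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return auth, hub

auth, hub = hits({"a": ["b", "c"], "b": ["c"], "c": []})
```

In the toy graph, "c" is pointed to by everything and ends up the top authority, while "a" links to everything and ends up the top hub; the mutual reinforcement between the two score vectors is exactly what makes the parallel version communication-heavy across partitions.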

Xuanye Chen, Xiaoshuang Xing, Mengjiao Ou, Jialin Chen, Xiaoyu Ma
Reducing Load-Balancing Cost for Multithreading Applications on Asymmetric NUMA Machine

Many applications adopt multithreading to increase concurrency or computational efficiency. In this scenario, the active threads often far outnumber the cores of a server and are scheduled by the operating system. However, we observe that the current load balancing mechanism based on scheduling domains leads to poor performance on asymmetric NUMA machines. Our investigation shows that the poor performance is due to unnecessary "far" scheduling with high cost, because the current algorithm for building scheduling domains neglects the physical relevance of NUMA nodes in a machine. We therefore propose a physical relevance-based algorithm that constructs scheduling domains on asymmetric NUMA machines to reduce the scheduling cost. Experimental results show that the proposed scheduling domains improve the performance of real-world applications by 14.94% on average (up to 22.33%).
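The physical-relevance idea can be illustrated with a toy domain builder: starting from a SLIT-style NUMA distance matrix, nodes are grouped level by level so that a domain only spans nodes within a distance threshold, keeping "far" migrations confined to the outermost level. The matrix, thresholds, and union-find construction below are illustrative assumptions, not the kernel's or the paper's actual algorithm.

```python
# Sketch: build scheduling-domain levels from a SLIT-style distance matrix,
# grouping nodes whose pairwise distance fits under each threshold.
def build_domains(dist, thresholds):
    n = len(dist)
    levels = []
    for limit in thresholds:
        parent = list(range(n))
        def find(x):
            while parent[x] != x:     # union-find with path halving
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i in range(n):
            for j in range(i + 1, n):
                if dist[i][j] <= limit:
                    parent[find(i)] = find(j)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        levels.append(sorted(groups.values()))
    return levels

# Hypothetical asymmetric 4-node machine: nodes 0-1 and 2-3 form close pairs,
# and the two pairs are far from each other.
slit = [[10, 16, 32, 32],
        [16, 10, 32, 32],
        [32, 32, 10, 16],
        [32, 32, 16, 10]]
levels = build_domains(slit, thresholds=[16, 32])
```

Here the inner level keeps load balancing within each close pair, and only the outer level permits the expensive cross-pair migrations, which is the effect the paper's distance-aware domains aim for.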

Yuhang Fang, Pu Pang, Quan Chen, Li Li, Minyi Guo
Data Plane Driven Adaptive Routing with In-Network Reinforcement Learning

The rapid evolution of modern network traffic, characterized by its dynamic and unpredictable nature, poses significant challenges for traditional static routing protocols. While Software-Defined Networking (SDN) and programmable data planes like P4 have emerged as promising solutions to enhance network flexibility, a key challenge remains in developing intelligent routing decision-making mechanisms that can adapt in real-time to changing network conditions. This paper proposes a novel intelligent routing framework that integrates the Q-Learning algorithm with a P4-based programmable data plane. By modeling the routing problem as a reinforcement learning task, our approach enables network agents to autonomously learn optimal routing policies based on real-time network states, such as congestion and latency, without the need for pre-defined rules. We define the network state, agent actions, and a reward function to guide the learning process. The learned optimal routing policies are then dynamically translated into P4 rules and deployed on the data plane. Through extensive simulations, we demonstrate that our proposed Q-Learning-based algorithm significantly outperforms traditional routing protocols in terms of reducing end-to-end latency and improving network throughput.
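The tabular Q-learning core of such a framework can be sketched in a few lines. Since the abstract does not fix the state/action encoding or the reward, the negative-latency reward and the port-name actions below are illustrative assumptions.

```python
import random

# Sketch of tabular Q-learning for next-hop selection, assuming the reward is
# the negative of the observed per-hop latency (illustrative, not the paper's
# exact state/action/reward design).
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def update(q, state, action, latency_ms, next_state, actions):
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Standard Q-learning temporal-difference update with reward -latency_ms
    q[(state, action)] = old + ALPHA * (-latency_ms + GAMMA * best_next - old)

def choose(q, state, actions, rng=random):
    if rng.random() < EPSILON:        # epsilon-greedy exploration
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

q = {}
update(q, "s1", "portA", 5.0, "s2", ["portA", "portB"])
```

The learned greedy action per state is what would then be compiled into P4 table entries and pushed to the data plane.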

Bo Wu, Meiju Yu, Pantong Wang, Dan Qin, Xiliang Pang, Guiquan Zheng
Backmatter
Title
Network and Parallel Computing
Editors
Xiaoliang Wang
Xiaohong Jiang
Noel Crespi
Baoliu Ye
Copyright Year
2026
Electronic ISBN
978-3-032-10466-3
Print ISBN
978-3-032-10465-6
DOI
https://doi.org/10.1007/978-3-032-10466-3

PDF files of this book have been created in accordance with the PDF/UA-1 standard to enhance accessibility, including screen reader support, described non-text content (images, graphs), bookmarks for easy navigation, keyboard-friendly links and forms and searchable, selectable text. We recognize the importance of accessibility, and we welcome queries about accessibility for any of our products. If you have a question or an access need, please get in touch with us at accessibilitysupport@springernature.com.
