
Advanced Parallel Processing Technologies

16th International Symposium, APPT 2025, Athens, Greece, July 13-16, 2025, Proceedings

  • 2026
  • Book

About this book

This book constitutes the refereed proceedings of the 16th International Symposium on Advanced Parallel Processing Technologies, APPT 2025, held in Athens, Greece, during July 13-16, 2025. The 17 full papers and 10 short papers in this book were carefully reviewed and selected from 74 submissions. They are organized in the following topical sections: Chip and Accelerators; Memory and Storage; Cloud and Networking; Design for LLM and ML/AI; Big Data and Graph Processing; and Secure and Dependable Systems.

Table of Contents

Frontmatter

Best Paper Candidates

Frontmatter
DACO: Unlocking Latent Dataflow Opportunities in Edge-Side SIMT Accelerators

Edge AI accelerators commonly use SIMT architectures, but AI kernels often suffer from high memory access overhead, limiting performance. Dataflow execution can improve locality and reduce redundant memory traffic, yet existing solutions are hardware-specific and incompatible with general-purpose SIMT programming. We present DACO, a Dataflow-Aware Compilation Optimization method that extends SIMT compilers to exploit dataflow opportunities automatically. DACO identifies three common patterns—intra-block dataflow, inter-block dataflow, and compute-memory dataflow—through static memory access analysis and generates optimized code with minimal developer effort. Experiments on real AI models show that DACO improves performance by up to 50.1% over the baseline SIMT compiler, highlighting its effectiveness and practicality.

Han Zhao, Yiying Xiang, Yu Liu, Xiaochun Ye, Deze Zeng, Jing Yang, Weihao Cui, Quan Chen, Jingwen Leng, Minyi Guo
ATLAS: Efficient Dynamic GNN System Through Abstraction-Driven Incremental Execution

Dynamic graph neural networks (DGNNs) are increasingly vital for modeling evolving graph-structured data across diverse applications. However, existing methods often incur significant computational redundancy by processing large portions of the graph—even when updates are localized and sparse. In this paper, we present ATLAS, a high-performance DGNN framework that enables abstraction-driven incremental execution through tight algorithm-system co-design. At the algorithmic level, ATLAS constructs lightweight, connectivity-aware graph abstractions anchored at influential nodes, enabling fine-grained and efficient propagation of dynamic updates. At the system level, it applies abstraction-driven scheduling and memory optimizations to balance workload and enhance locality, achieving efficient parallel execution. Extensive experiments demonstrate that ATLAS outperforms current state-of-the-art systems, achieving speedups of 2.44×, 3.17×, 5.91×, and 10.57× over RACE, DeltaGNN, DGL, and PyG, respectively, while incurring only negligible accuracy loss (less than 1%).

Jingyi Zhou, Yu Huang, Long Zheng, Yang Wu, Huize Li, Amelie Chi Zhou, Xiaofei Liao, Hai Jin, Jingling Xue
Segmentation-Aware Optimization of Collective for Waferscale Chips

Rapid scaling of large language models (LLMs) has led to an explosive increase in computation and communication demands, prompting interest in waferscale chips (WSCs). However, WSCs suffer from inefficient collective communication due to the large diameter of the mesh network. Previous work improves bandwidth utilization via packet segmentation and pipeline scheduling, yet overlooks the trade-offs induced by the segmentation configuration. In this work, we introduce a segmentation-aware cost model that extends the Alpha-Beta model to capture the connection between packet size, die-to-die (D2D) bandwidth, and pipeline depth. Based on this model, we propose an optimization framework that identifies the optimal segmentation configuration and a pipeline scheduling strategy tailored to the dynamics of D2D bandwidth introduced by segmentation. Our approach reduces communication overhead by 73.43% and improves collective efficiency by 21.12% over SOTA algorithms.

Qize Yang, Jiaxin Liu, Taiquan Wei, Yuxin Jin, Shouyi Yin, Yang Hu
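
The abstract above extends the classical Alpha-Beta communication model with packet segmentation. As an illustration only, the following Python sketch shows how such an extension could be swept to pick a segment count; the parameter names, the pipelined-transfer formula, and the exhaustive search are assumptions for exposition, not the paper's actual cost model.

```python
# Hypothetical sketch: extending the Alpha-Beta model with packet segmentation.
# alpha: per-message startup latency (s); beta: inverse bandwidth (s/byte);
# hops: pipeline depth across the mesh. All names and values are illustrative.

def segmented_transfer_time(msg_bytes: int, segments: int,
                            alpha: float, beta: float, hops: int) -> float:
    """Pipelined transfer of `segments` packets over `hops` die-to-die links."""
    seg_bytes = msg_bytes / segments
    per_seg = alpha + beta * seg_bytes          # time for one segment on one link
    # Pipelining: the last segment finishes (hops - 1) stages after the first.
    return (segments + hops - 1) * per_seg

def best_segmentation(msg_bytes: int, alpha: float, beta: float, hops: int,
                      max_segments: int = 1024) -> tuple:
    """Exhaustively pick the segment count minimizing the modeled latency."""
    return min(((s, segmented_transfer_time(msg_bytes, s, alpha, beta, hops))
                for s in range(1, max_segments + 1)), key=lambda x: x[1])

if __name__ == "__main__":
    s, t = best_segmentation(msg_bytes=16 << 20, alpha=1e-6, beta=1 / 100e9, hops=8)
    print(f"best segments={s}, modeled time={t * 1e3:.3f} ms")
```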
Area-Efficient Automated Logic Design with Monte-Carlo Tree Search

Automated logic design for digital circuits significantly reduces manual effort in chip design processes, as logic design and verification are the most manually intensive stages in the entire chip design flow. Without human programming of Hardware Description Language (HDL) code, state-of-the-art iterative data-driven methods monotonically reduce error rates using input-output examples to ensure design functionality. This approach enables the design of large-scale circuits, such as RISC-V CPUs, for tapeout. However, these data-driven methods can lead to significant area overhead due to a lack of prior knowledge about the circuit structure. This paper proposes a Monte Carlo Tree Search (MCTS) based approach, A-BSD, to optimize the area overhead of automated logic design while maintaining the ability to design accurately. The key insight is that, with specific automated circuit design parameters, the area overhead can be significantly reduced. Although the potential parameter-design space is vast and challenging to explore, we formulate automated circuit design as a Monte-Carlo Tree Search problem to reduce the computing complexity with two novel operations in the design process, i.e., (1) layer insertion and (2) variable switching. To further reduce the costly computational complexity, we train an efficient evaluation model on randomly generated circuits with fast approximate simulation results to guide the evaluation of the design area overhead. Experimental results on standard benchmark circuits demonstrate that our method reduces the design area by 26% compared to the state-of-the-art baselines while maintaining design functionality.

Shuyao Cheng, Xiangtao Guan, Zidong Du
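
As a rough illustration of the MCTS formulation described in the abstract above, the generic UCT loop below uses action labels that mirror the two named operations (layer insertion, variable switching). The design state, legal-move generator, and area estimator are placeholders, not the paper's encoding or learned evaluator.

```python
# Hypothetical UCT skeleton for area-aware automated logic design.
import math, random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def legal_actions(state):
    # Placeholder: enumerate ("insert_layer", pos) and ("switch_vars", i, j) moves.
    return ([("insert_layer", i) for i in range(len(state))] +
            [("switch_vars", i, j) for i in range(len(state)) for j in range(i + 1, len(state))])

def apply(state, action):        # placeholder state transition
    return state + [action]

def estimated_area(state):       # placeholder for a learned area evaluator
    return len(state) + random.random()

def mcts(root_state, iterations=500):
    root = Node(list(root_state))
    for _ in range(iterations):
        node = root
        while node.children:                       # 1. selection
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        for a in legal_actions(node.state)[:8]:    # 2. expansion (capped fan-out)
            node.children.append(Node(apply(node.state, a), node, a))
        leaf = random.choice(node.children) if node.children else node
        reward = -estimated_area(leaf.state)       # 3. simulation: smaller area is better
        while leaf:                                # 4. backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits).action

print(mcts(root_state=[0, 1, 2]))
```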

Chip and Accelerators

Frontmatter
NFMap: Node Fusion Optimization for Efficient CGRA Mapping with Reinforcement Learning

Coarse-Grained Reconfigurable Arrays (CGRAs) have gradually become a research hotspot to satisfy the growing demand for computing power and efficiency. However, the execution efficiency of a CGRA depends on its mapping framework. Traditional mapping algorithms relying on combinational logic or heuristics struggle with growing application complexity due to high compilation costs and poor mapping effectiveness, as they lack the ability to learn from experience. Reinforcement Learning (RL) has been increasingly adopted for mapping strategies, but long-dependency routings often become bottlenecks, limiting RL-based algorithms. To this end, this paper proposes NFMap, a mapping algorithm that incorporates a network flow-based algorithm to optimize DFGs through node fusion. NFMap consists of three steps: First, NFMap adopts the network flow-based algorithm under the resource constraints of the hardware platform to fuse nodes into node blocks. Then, NFMap generates attributes for the node blocks for RL mapping. Finally, NFMap uses the reinforcement learning-based algorithm for end-to-end mapping. Experiments show that compared to the state-of-the-art RL-based mapping framework E2EMap, NFMap achieves an average mapping quality improvement and compilation speedup of 1.43× and 1.14×, respectively.

Yudong Mu, Siyi Li, Zhihua Fan, Wenming Li, Xuejun An, Xiaochun Ye
A Unified Synthesis Framework for Dataflow Accelerators Through Multi-level Software and Hardware Intermediate Representations

Dataflow accelerators leverage massive parallelism through customized architectural designs, making them highly efficient for tensor computations like matrix multiplications. However, designing dataflow accelerators is a complex task that requires seamless integration of software mapping and hardware optimization. Existing synthesis techniques often focus exclusively on either software or hardware optimizations, resulting in inefficiencies and extended development cycles. This paper proposes a unified synthesis framework that integrates software and hardware parts through a multi-level intermediate representation (IR). The framework progressively lowers dataflow representations into synthesizable hardware descriptions, enabling the rapid exploration of software mapping strategies and transparent hardware synthesis optimizations. The experimental results demonstrate a 14× improvement over CIRCT HLS by leveraging parallelism through software optimizations. The multi-level IR also facilitates efficient cross-level simulation to identify bugs and performance bottlenecks.

Xiaochen Hao, Ruifan Xu, Yun Liang
Defect-Aware Task Scheduling and Mapping for Redundancy-Enhanced Spatial Accelerators

As computational workloads continue to scale exponentially, modern spatial accelerators have emerged as critical enablers due to their superior parallelism, architectural flexibility, and scalable design. However, the increasing die size introduces significant hardware defect challenges that degrade both computational performance and manufacturing yield. While spatial architectures typically incorporate redundant resources for defect tolerance, effectively leveraging these redundancies presents substantial optimization challenges. This paper proposes a hierarchical defect recovery methodology that strategically balances repair efficiency with architectural scalability. Our dual-phase approach synergistically combines intra-chiplet task remapping for optimized local redundancy repairing with inter-chiplet rescheduling that enables global load balancing when local redundancy is exhausted. We formulate this dual-phase optimization problem and develop a comprehensive framework that navigates critical tradeoffs between defect avoidance, spatial locality preservation, and network-on-chip efficiency. Experimental evaluations across diverse defect scenarios demonstrate 35.17% to 50.11% performance improvements over baseline methods, while maintaining over 80% of ideal performance even in severe defect conditions.

Jingchen Zhu, Zhao Wang, Guangyu Sun
Irregular Sparsity-Enabled Search-in-Memory Engine for Accelerating Spiking Neural Networks

Spiking neural networks (SNNs) have gained significant attention due to their efficiency in processing event-driven information. The core computations in SNNs, such as matrix bit-wise AND and ADD operations, align naturally with process-in-memory (PIM) architectures. However, the extended input spike trains in SNNs and the bit-serial processing mechanism of PIM introduce notable latency and frequent analog-to-digital conversions, undermining performance and energy efficiency. To this end, we propose a novel Search-in-Memory (SIM) architecture, called SIMSnn, designed to accelerate SNN inference. Unlike traditional bit-by-bit processing over multiple time steps, SIMSnn processes a sequence of spikes in parallel through associative matches in a CAM crossbar. Additionally, SIMSnn leverages non-structured pruning, which is typically incompatible with most PIM architectures, to reduce CAM overhead. As a weight-agnostic SNN accelerator, SIMSnn adapts seamlessly to evolving SNN models without requiring crossbar array rewrites. Experiments show that SIMSnn achieves a 25.3× higher energy efficiency and a 13.7× speedup on average compared to the ISAAC-like design. When compared to the state-of-the-art PIM design, NEBULA, SIMSnn can also realize up to a 7.9× energy savings and a 5.7× speedup.

Fangxin Liu, Zongwu Wang, Ning Yang, Haomin Li, Tao Yang, Haibing Guan, Li Jiang

Memory and Storage

Frontmatter
QRAMsim: Efficiently Simulating, Analyzing, and Optimizing Large-Scale Quantum Random Access Memory

A fundamental challenge in quantum computing is efficiently loading classical data onto quantum computers. Quantum Random Access Memory (QRAM) offers a solution, acting as a universal architecture designed to implement oracles effectively. This article introduces QRAMsim, a novel framework for simulating, analyzing, and optimizing large-scale QRAMs. We first propose a basis vector tracking theorem to leverage the natural sparsity in the Hilbert space of QRAM, significantly reducing the space and time complexity for modeling QRAM. Furthermore, we develop a high-performance simulator integrated with a mapping algorithm to efficiently deploy QRAM on quantum devices. Experiments show that we can achieve an 8.2× fidelity improvement compared to the conventional qubit mapping method [27], and a 10^8× simulation speedup compared to Qiskit. QRAMsim is publicly available at https://github.com/Chenning-Tao/QRAM_simulator. To the best of our knowledge, this is the first open-source high-performance QRAM simulation framework.

Chenning Tao, Yujie Ji, Liqiang Lu, Size Zheng, Jianwei Yin
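
The basis-vector-tracking idea above exploits the fact that a QRAM query touches only the few basis states that actually carry amplitude. The sketch below illustrates the underlying sparse-state representation with an assumed interface; it is not QRAMsim's API, and the query model is a simplified oracle.

```python
# Hypothetical sparse state-vector sketch: only nonzero amplitudes are stored,
# so a QRAM query over n address qubits never materializes the full 2^n vector.
from collections import defaultdict

def qram_query(address_amplitudes, memory):
    """Given a superposition over addresses {addr: amplitude} and classical
    memory contents, return the joint (address, data) state after the query.
    Mirrors the oracle |a>|0> -> |a>|m[a]>; only populated addresses appear."""
    out = defaultdict(complex)
    for addr, amp in address_amplitudes.items():
        out[(addr, memory[addr])] += amp
    return dict(out)

# Example: a superposition over two of 2^20 addresses stays a size-2 dictionary.
mem = list(range(1 << 20))
state = {3: 2 ** -0.5, 77: 2 ** -0.5}
print(qram_query(state, mem))
```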
CeDMA: Enhancing Memory Efficiency of Heterogeneous Accelerator Systems Through Central DMA Controlling

As heterogeneous computing systems continue to evolve, emerging workloads increasingly span multiple types of accelerators, resulting in frequent inter-accelerator data transfers. However, traditional CPU-managed memory systems often struggle to coordinate these transfers efficiently, leading to high latency, poor memory bandwidth utilization, and scalability bottlenecks. We propose CeDMA, a centralized and programmable Direct Memory Access (DMA) control architecture that enables high-performance, CPU-decoupled memory coordination across diverse accelerators. CeDMA combines a unified hardware-software co-design: a modular DMA engine with integrated address translation and dual-level arbitration logic on the hardware side, and a lightweight instruction-driven memory management model with adaptive scheduling on the software side. CeDMA enables fine-grained control over memory transfers, minimizes off-chip bandwidth consumption, and exploits memory-level parallelism through dynamic resource partitioning. Cycle-accurate simulation results across a diverse workload suite—including GEMM, Conv2D, and graph traversal kernels—demonstrate up to 75% reduction in external memory access, 60% improvement in performance, and 45% reduction in access latency. Furthermore, CeDMA maintains high throughput and predictable latency at scale, supporting up to 32 concurrent accelerators. These results position CeDMA as a scalable, general-purpose memory management substrate for future heterogeneous SoC architectures.

Ruoshi Li, Long Zheng, Yu Huang, Zhiyuan Shao, Amelie Chi Zhou, Xiaofei Liao, Hai Jin, Jingling Xue
PAMM: Adaptive Memory Management for CXL-/UB-Based Heterogeneous Memory Pooling Systems

The rapid growth of memory-intensive workloads has exposed significant limitations in traditional homogeneous memory architectures. Emerging memory interconnect technologies, such as Compute Express Link (CXL) and Unified Bus (UB), enable flexible memory pooling across multiple computing nodes, but introduce heterogeneous memory latencies, leading to performance degradation for applications designed with uniform memory assumptions. Existing management approaches for these heterogeneous memory systems typically lack real-time performance awareness, making it difficult to dynamically balance application performance and memory resource utilization. This paper presents PAMM, a performance-aware adaptive memory management framework for heterogeneous memory environments enabled by advanced memory interconnects. By leveraging low-overhead runtime profiling and machine learning models, PAMM quantifies real-time application performance and anticipates potential slowdowns. Subsequently, it dynamically adjusts memory allocation strategies and places memory pages into appropriate tiers according to observed access characteristics. Evaluation shows that PAMM achieves substantial reductions in local memory usage while incurring minimal performance degradation.

Jianqin Yan, Zhaoxiang Huang, Yue Yu, Zhenlong Song, Yiming Zhang
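
As a toy illustration of the tier-placement decision described above, the sketch below uses a simple per-page hotness counter. The thresholds, capacities, and decision rule are illustrative stand-ins for PAMM's profiling- and ML-driven policy, which is not reproduced here.

```python
# Hypothetical hotness-based page placement between a fast local tier and a
# slower CXL/UB-attached tier. All parameters are illustrative.
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    accesses_last_epoch: int
    tier: str = "remote"   # "local" (node DRAM) or "remote" (pooled memory)

def rebalance(pages, local_capacity, promote_threshold=64):
    """Promote the hottest pages into the limited local tier, demote the rest."""
    hot = sorted(pages, key=lambda p: p.accesses_last_epoch, reverse=True)
    for rank, page in enumerate(hot):
        wants_local = rank < local_capacity and page.accesses_last_epoch >= promote_threshold
        page.tier = "local" if wants_local else "remote"
    return pages

pages = [Page(i, accesses) for i, accesses in enumerate([5, 900, 70, 3, 500])]
for p in rebalance(pages, local_capacity=2):
    print(p)
```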
STAMP: Accelerating Second-Order DNN Training Via ReRAM-Based Processing-in-Memory Architecture

Deep Neural Network (DNN) training is both compute- and memory-intensive. In this work, we propose a hardware-software co-design approach that leverages ReRAM-based processing-in-memory (PIM) technology and second-order training to enhance DNN training efficiency. Second-order training reduces the number of iterations. Importantly, the key operation in second-order training, matrix inversion (INV), can be performed in ReRAM crossbars with O(1) time complexity, minimizing the overhead. However, current ReRAM-based INV circuits suffer from insufficient precision. To overcome this limitation, we propose a high-precision matrix inversion method with an 8-bit INV circuit. Building on this foundation, we introduce STAMP, a ReRAM-based PIM accelerator specifically designed for second-order training. Experimental results demonstrate that STAMP achieves an average speedup of 114.8× and an energy saving of 41.9× compared to a GPU counterpart on large-scale DNNs.

Yilong Zhao, Fangxin Liu, Mingyu Gao, Xiaoyao Liang, Qidong Tang, Chengyang Gu, Tao Yang, Naifeng Jing, Li Jiang

Cloud and Networking

Frontmatter
Cochain: Architectural Support Mechanism for Blockchain-Based Task Scheduling

Blockchain-based distributed computing systems are proposed to solve many problems such as single points of failure; however, multiple crucial operations such as task scheduling and execution are still conducted off-chain for performance reasons, making these operations untrusted. Previous works lack architectural designs for supporting robust and high-performance distributed computing upon blockchain. In this paper, we present Cochain, a novel architectural support mechanism for blockchain-based task scheduling. The core idea of Cochain is to provide architectural support for on-chain task scheduling and execution to realize a trusted, transparent, and high-performance system. In Cochain, we creatively organize the blockchain nodes by scheduling domains and realize cross-domain scheduling to reduce the search space and ease the storage burden. Cochain enables parallel support for task scheduling to improve task processing capability and provides a novel mechanism for fast block generation. Cochain is evaluated in the mainstream blockchain client, Geth, by comparing Cochain with six benchmarks. Results show that Cochain reduces the ledger size by 97.5% compared with Geth. Cochain achieves 18.5× speedups for task processing. Cochain offers a new perspective on fast and robust distributed computing.

Yaozheng Fang, Yibing Jiang, Xueshuo Xie, Zhaolong Jian, Tao Li, Zhiguo Wan, Grace Wang
DyQNet: Optimizing Dynamic Entanglement Routing with Online Request in Quantum Network

Quantum networks are a fundamental technique for connecting remote quantum computers for distributed quantum computing and quantum key distribution. In quantum networks, entanglement routing is essential for establishing entanglement connections between quantum nodes. Prior works mainly focus on the optimization of static entanglement routing, assuming that requests and resources are determined offline. However, in practical scenarios, entangled pairs always arise dynamically with online requests, which makes the routing problem more complicated. In this paper, we propose DyQNet, a framework for modeling and optimizing dynamic entanglement routing of quantum networks. First, we systematically formulate the problem of dynamic entanglement routing and its objective function. Subsequently, we develop a scheduling algorithm that aims to maximize throughput and resource utilization. The algorithm is integrated into a deployable simulation framework, comprising a simulator architecture and a control flow, which supports rigorous simulation with different configurations. Evaluation results show that, compared with the static scheme, DyQNet improves connection throughput by 3.96× and reduces connection delay time by 68.55% under random quantum networks. The source code of DyQNet will be publicly available at https://github.com/cty-github/quantum_networks.

Tianyao Chu, Liqiang Lu, Shiyu Li, Xinghui Jia, Chenren Xu, Siwei Tan, Jianwei Yin
Veyth: Adaptive Container Placement for Optimizing Cross-Server Network Traffic of Microservice Applications

The microservice architecture is widely adopted for cloud applications. These microservice applications are typically deployed using containers on distributed servers in cloud datacenters, where efficient placement is critical to meet diverse resource demands and ensure Quality-of-Service (QoS). However, existing approaches often fall short in optimizing cross-server traffic, as they overlook shared microservices and multi-service patterns and lack a global traffic-aware scheme. We therefore propose Veyth, an adaptive container placement system that reduces cross-server traffic and tail latency through three strategies: service-based decoupling, traffic-aware stateless microservice placement, and optimized stateful microservice placement. The decoupling strategy treats queries of each service as an independent execution graph, reducing cross-server traffic and latency through service-level isolation. The stateless microservice placement strategy minimizes cross-server traffic and allocates resources efficiently considering traffic and resource constraints. The stateful microservice placement strategy uses a two-stage server selection method to reduce cross-server traffic. Experimental results show that Veyth reduces the cross-server traffic and 99%-ile latency by 81.8% and 53.4%, respectively, compared to state-of-the-art works.

Jinyuan Chen, Jiuchen Shi, Quan Chen, Lin Gu, Minyi Guo

Design for LLM and ML/AI

Frontmatter
Unifying Two Operators with One PIM: Leveraging Hybrid Bonding for Efficient LLM Inference

Inference of transformer-based large language models (LLMs) is both compute- and memory-intensive. To address this, prior work adopts heterogeneous systems that combine processing-in-memory (PIM) architectures for memory-bound matrix-vector multiplication (GEMV) and neural processing units for compute-bound matrix-matrix multiplication (GEMM). However, these designs suffer from low hardware utilization during edge inference, primarily due to the limited parallelism available in single-batch processing. This paper presents HB-PIM, a hybrid bonding-based PIM architecture with dual-mode execution support. HB-PIM employs a software-hardware co-designed approach, enabling efficient LLM inference. At the hardware level, HB-PIM leverages high-density copper interconnects to integrate logic and DRAM dies, providing high inter-die memory bandwidth and substantial compute capability. A dual-mode processing unit (DuoPU) on the logic die is designed to support adaptive execution of both GEMV and GEMM operations. At the software level, a scheduling framework is introduced to optimize workload partitioning and data mapping. We demonstrate the effectiveness of the proposed schemes using extensive experiments. Experimental results show that HB-PIM significantly enhances hardware utilization while reducing processing latency compared to state-of-the-art baselines.

Jiaxian Chen, Yuxuan Qi, Kaoyi Sun, Zhiliang Lin, Tianyu Wang, Chenlin Ma, Yi Wang
AsymServe: Demystifying and Optimizing LLM Serving Efficiency on CPU Acceleration Units

Current data centers are accommodating more AI-based workloads, especially large language model (LLM) training and serving in recent years. Given the limited count and significant energy consumption of expensive GPUs, cloud providers tend to utilize more cost-efficient processors for LLM serving, such as Intel Scalable CPUs equipped with AMX acceleration units. To understand the improvements, bottlenecks, and opportunities on this new platform, we first undertake a comprehensive characterization of LLM serving using AMX on two generations of modern CPUs with various memory devices. Our characterization reveals that the hardware and software behaviors of LLM serving on CPUs are distinct from conventional cloud workloads and vary greatly. In this paper, we propose AsymServe to maximize LLM serving efficiency on scalable CPU platforms by handling software and hardware asymmetry. It adjusts hardware allocation and software configurations adaptively to maximize CPU performance-per-watt. Through extensive evaluation, we show that AsymServe improves LLM serving performance. Specifically, it achieves up to 1.71× faster first-token generation, 3.13× greater throughput, and 11.09× better energy efficiency.

Xinkai Wang, Yiming Zhuansun, Chao Li, Jing Wang, Xiaofeng Hou, Lingyu Sun, Luping Wang, Minyi Guo
SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity

Convolutional Neural Networks (CNNs) are an efficient and high-performance choice for feature extraction and encoding. However, the intensive computational demands of convolution operations hinder their broader adoption in video encoders. Given the temporal continuity in video frames, changes between consecutive frames are minimal, allowing redundant computations to be skipped. This technique, which we term Diff Computation, presents two primary challenges. First, Diff Computation requires caching intermediate feature maps to ensure the correctness of non-linear computations, leading to significant memory consumption. Second, the imbalance of sparsity among layers, introduced by Diff Computation, incurs accuracy degradation. To address these issues, we propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation. We integrate these techniques into SparseTem, a unified framework for CNN-based video encoders. SparseTem achieves speedups of 1.79× for EfficientDet and 4.72× for CRNN, with minimal accuracy drop and no additional memory overhead, setting a new state-of-the-art in leveraging temporal redundancy for acceleration.

Kunyun Wang, Shuo Yang, Jieru Zhao, Wenchao Ding, Quan Chen, Jingwen Leng, Minyi Guo
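
The sketch below illustrates the basic Diff Computation idea from the abstract above with a per-tile change mask: only tiles that changed between frames are recomputed. The tile size, threshold, and stand-in convolution are assumptions; SparseTem's actual scheduling, caching of intermediate feature maps, and accuracy adjustment are not modeled.

```python
# Hypothetical sketch of Diff Computation: recompute a layer only on tiles
# that changed between consecutive frames, reusing the previous output elsewhere.
import numpy as np

def conv_tile(tile: np.ndarray) -> np.ndarray:
    return tile * 0.5 + 1.0            # stand-in for an expensive conv kernel

def diff_compute(prev_frame, cur_frame, prev_out, tile=32, threshold=1e-3):
    out = prev_out.copy()
    h, w = cur_frame.shape
    recomputed = 0
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            sl = (slice(y, y + tile), slice(x, x + tile))
            if np.max(np.abs(cur_frame[sl] - prev_frame[sl])) > threshold:
                out[sl] = conv_tile(cur_frame[sl])   # only changed tiles pay the cost
                recomputed += 1
    return out, recomputed

prev = np.zeros((128, 128), dtype=np.float32)
cur = prev.copy()
cur[0:32, 0:32] = 1.0                                # exactly one tile changed
_, n = diff_compute(prev, cur, conv_tile(prev))
print(f"recomputed {n} of {(128 // 32) ** 2} tiles")
```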
TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems

The increasing demand for large language model (LLM) serving has necessitated significant advancements in the optimization and profiling of LLM inference systems. As these models become integral to a wide range of applications, the need for efficient and scalable serving solutions has grown exponentially. This work introduces TokenSim, a comprehensive hardware and software exploration system designed specifically for LLM inference. TokenSim is characterized by its support for extensible system optimizations including scheduling and memory management. Furthermore, TokenSim facilitates various insightful explorations into the performance and optimization of LLM serving systems. The code is available at https://github.com/pku-lemonade/TokenSim.

Feiyang Wu, Zhuohang Bian, Guoyang Duan, Tianle Xu, Junchi Wu, Teng Ma, Yongqiang Yao, Ruihao Gong, Youwei Zhuo

Big Data and Graph Processing

Frontmatter
Achieving Efficient Temporal Graph Transformation on the GPU

Temporal graphs, which associate time information with their edges, are fundamental to various time-sensitive applications. To efficiently handle temporal graphs, existing solutions typically apply a transformation-based execution model. This model first transforms the temporal graph into its equivalent Directed Acyclic Graph (DAG) with embedded timing information and then computes this transformed graph using a single-scan strategy. However, due to the intricate vertex expansion based on the timestamps, the temporal graph transformation suffers from high runtime overhead and graph redundancy problems. To overcome these challenges, this paper proposes a redundancy-aware temporal graph transformation method on the GPU, called FASTGT. In detail, it detects and merges virtual edges that do not affect the correctness of temporal path problems via temporal path analysis, thereby eliminating graph redundancy. Then, by decoupling the data dependencies of redundancy detection across different virtual edges, it enhances the parallelism of the temporal graph transformation on the GPU. Experiments on an A100 GPU show that FASTGT reduces temporal graph transformation time by up to 3.2× and 9.7×, while achieving 10% and 50% reductions in GPU global memory usage, compared to TeGraph and OTBC, respectively. Besides, FASTGT achieves up to 2.0× and 5.5× improvements in end-to-end performance (i.e., including both temporal graph transformation and computation) over TeGraph and OTBC, respectively.

Linchen Yu, Zihan Li, Jin Zhao, Longlong Lin, Hengshan Yue
GASgraph: A GPU-Accelerated Streaming Graph Processing System Based on SubHPMAs

Streaming graph processing is increasingly critical across various domains, requiring continuous edge updates and real-time analysis of evolving graph structures. With its massive parallelism and high-bandwidth memory access, the GPU is a promising platform for accelerating streaming graph processing. However, streaming graph processing on GPUs faces inefficiencies in graph updates and computations. To address these issues, we design and implement GASgraph, a GPU-Accelerated Streaming graph processing system. Specifically, to mitigate the high overhead of expansion and global rebalancing, we propose a novel data structure, subHPMAs, which integrates an adaptive hybrid update strategy with a subPMA-based graph representation. We implement a GPU-optimized incremental computation engine based on subHPMAs and provide a user-friendly programming interface. Extensive experiments show that GASgraph achieves significant performance improvements, with average speedups of 66.79× and 1.25× over GPMA and LPMA during graph updates, and 9.06× and 3.46× over Tigr and KickStarter in graph analytics.

Chunxiang Wang, Yuan Zhang, Huawei Cao, Xuejun An, Xiaochun Ye
Accelerating Large-Scale Out-of-GPU-Core GNN Training with Two-Level Historical Caching

Large-scale graph neural network (GNN) training systems on GPUs with CPU memory and storage face the challenge of efficiently caching embedding data with accuracy guarantees. In this paper, we propose HCGNN, an out-of-GPU-memory GNN training system that combines GPU sampling and historical embedding caching. Our system supports dynamic embedding data caching through a heuristic-based two-level historical cache design with lightweight proactive data eviction and a high cache hit ratio. Compared with SOTA frameworks, HCGNN shows up to 6.7× speedup on graph sampling and 4.3× speedup on feature gathering within 0.5% accuracy loss.

Jing Wang, Taolei Wang, Juntao Huang, Yibo Liu, Xinkai Wang, Marius Kreutzer, Chao Li, Minyi Guo
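
A minimal two-level cache sketch in the spirit of the abstract above: a small GPU-resident tier backed by a larger CPU-resident tier, with LRU-style demotion. The capacities and eviction policy are illustrative; HCGNN's heuristics and proactive eviction are not reproduced here.

```python
# Hypothetical two-level historical-embedding cache; misses fall through to
# recomputation or storage in the caller.
from collections import OrderedDict

class TwoLevelCache:
    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu = OrderedDict()   # node_id -> embedding, "GPU" tier
        self.cpu = OrderedDict()   # node_id -> embedding, "CPU" tier
        self.gpu_capacity, self.cpu_capacity = gpu_capacity, cpu_capacity

    def get(self, node_id):
        for tier in (self.gpu, self.cpu):
            if node_id in tier:
                tier.move_to_end(node_id)           # refresh recency on a hit
                return tier[node_id]
        return None                                  # miss: caller recomputes

    def put(self, node_id, embedding):
        self.gpu[node_id] = embedding
        self.gpu.move_to_end(node_id)
        if len(self.gpu) > self.gpu_capacity:        # demote coldest to CPU tier
            victim, emb = self.gpu.popitem(last=False)
            self.cpu[victim] = emb
            if len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)         # evict entirely

cache = TwoLevelCache(gpu_capacity=2, cpu_capacity=4)
for nid in range(5):
    cache.put(nid, [0.1 * nid])
print(cache.get(0), cache.get(4))   # 0 was demoted to the CPU tier, 4 stays on the GPU tier
```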
Understanding Data Preprocessing for Effective End-to-End Training of DNN

In this paper, we primarily focus on understanding the data preprocessing pipeline for DNN training in the public cloud. First, we run experiments to test the performance implications of the two major data preprocessing methods, using either raw data or record files. The preliminary results show that data preprocessing is a clear bottleneck, even with the most efficient software and hardware configuration enabled by NVIDIA DALI, a highly optimized data preprocessing library. Second, we identify the potential causes, exercise a variety of optimization methods, and present their pros and cons. We hope this work will shed light on the new co-design of “data storage, loading pipeline” and “training framework” and flexible resource configurations between them so that the resources can be fully exploited and performance can be maximized.

Ping Gong, Yuxin Ma, Cheng Li, Xiaosong Ma, Sam H. Noh

Secure and Dependable System

Frontmatter
TwinStore: Secure Key-Value Stores Made Faster with Hybrid Trusted/Untrusted Storage

Key-value (KV) stores are one of the most prominent data storage engines in many cloud services, including those that keep sensitive user information and must be protected against privileged attackers in public clouds. While modern hardware processors can offer isolated execution environments during processing, existing secure KV stores still need to use software-based protection to ensure confidentiality, integrity, and freshness for external data I/O to disks. The recent development of hardware-based trusted I/O and disks provides a new way to implement secure KV stores and obviate software encryption. However, we find that although trusted I/O simplifies freshness enforcement, directly putting all data on the trusted disk is not optimal due to repetitive encryption agnostic to application behaviors. We thus propose TwinStore to combine the benefits of the software and hardware approaches, by storing only the metadata associated with freshness on the trusted disks to minimize the performance overheads. With our prototypes on two KV store structures, TwinStore outperforms the software-only and hardware-only designs by 18.5× and 1.1×, respectively.

Xiang Li, Huanchen Zhang, Mingyu Gao
The Future of Fully Homomorphic Encryption System: From a Storage I/O Perspective

Fully Homomorphic Encryption (FHE) allows computations to be performed on encrypted data, significantly enhancing user privacy. However, the I/O challenges associated with deploying FHE applications remain understudied. We analyze the impact of storage I/O on the performance of FHE applications and summarize key lessons from the status quo. Key results include that storage I/O can degrade the performance of ASICs by as much as 357× and reduce GPU performance by up to 22×.

Lei Chen, Erci Xu, Yiming Sun, Shengyu Fan, Xianglong Deng, Guiming Shi, Guang Fan, Liang Kong, Yilan Zhu, Shoumeng Yan, Mingzhe Zhang
LASM: A Lightweight and General TEE Secure Monitor Framework

With growing security concerns in mobile and cloud environments, Trusted Execution Environments (TEEs) offer secure isolation for security-critical applications. However, existing designs rely on vertical privilege mechanisms, such as dedicated privilege levels and virtualization, making secure monitor (SM) implementations complex and less portable and increasing the Trusted Computing Base (TCB). To address this, we propose LASM, a horizontally extended hardware framework that enables secure SM execution without virtualization or extra privilege levels. LASM introduces three lightweight hardware-software co-designed mechanisms: (1) SM isolation based on horizontal state extension, (2) secure communication and exception handling, and (3) enclave memory isolation based on mapping. We prototype LASM on QEMU and FPGA, and evaluations on multiple benchmarks show strong security with low overhead (mostly under 10%). LASM provides a lightweight, general solution for future TEE designs.

Baojun Wang, Tingting Zhang, Tianyi Liu, Huandong Wang, Changbin Xu, Longbing Zhang
Identifying Potential Anomalous Operations in Graph Neural Network Training

Graph Neural Networks (GNNs) have demonstrated transformative potential across domains, driving the development of specialized frameworks like Deep Graph Library (DGL) and PyTorch Geometric (PyG) that employ emerging techniques to overcome computational bottlenecks in large-scale graph learning. However, due to the inherent sparsity of GNN models and the complexity of heterogeneous computing systems, optimizing GNN performance remains a significant challenge. Existing profiling tools, such as Nsight Systems, primarily focus on visualizing resource utilization over time, helping users identify inefficient execution patterns. While this approach provides insights into hardware-level performance, it lacks higher-level, code-centric analysis, making it difficult for developers to pinpoint and resolve performance bottlenecks in GNN training. To address these limitations, we propose GNNProf, an automated performance analysis tool designed to detect and diagnose potential inefficiencies in GNN training. GNNProf collects and restructures CPU function-level performance data into an analyzable format, and applies machine learning and unsupervised learning techniques to identify potential performance anomalies. By automatically recognizing inefficient functions and highlighting performance-critical regions, GNNProf enables developers to gain deeper insights into the execution behavior of GNN training. Additionally, it provides intuitive visualizations that facilitate performance debugging and optimization, ultimately improving training efficiency on heterogeneous systems.

Zhibo Xuan, Hailong Yang, Xin You, Zhongzhi Luan, Yi Liu, Depei Qian

APPT Posters

Frontmatter
DraEC: A Decentralized Routing Algorithm in Erasure-Coded Deduplication System

Data deduplication and erasure coding are two widely used strategies for reducing storage overhead and resisting unexpected failures. However, combining these approaches without careful consideration can potentially diminish fault tolerance and increase access latency. In this paper, we propose DraEC, a decentralized routing algorithm designed specifically for erasure-coded deduplication systems. DraEC encodes data before performing deduplication. It then employs cuckoo hashing to maximize fault tolerance offered by erasure coding without compromising the data deduplication. DraEC further adaptively determines the placement of data and parity blocks to balance the storage load and mitigate access hotspots. Extensive testbed experiments show that DraEC can achieve a 9.2% improvement in write performance, which can increase to 94.1% under intensive write requests. DraEC also achieves a 17.6% improvement in read performance under intensive reads, while introducing less than 1% additional storage overhead.

Ronglong Wu, Jiebin Zhai, Defang Chen, Zhirong Shen
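
A simplified cuckoo-hashing placement sketch in the spirit of the abstract above: each block has two candidate nodes and insertions displace residents up to a bounded number of kicks. The hash choices, one-slot-per-node simplification, and parameters are illustrative; DraEC's erasure-coding and fault-tolerance constraints are not modeled.

```python
# Hypothetical cuckoo-hashing placement of block fingerprints onto storage nodes.
import hashlib

NUM_NODES = 8
MAX_KICKS = 32

def candidates(fingerprint: str):
    h1 = int(hashlib.sha256(fingerprint.encode()).hexdigest(), 16) % NUM_NODES
    h2 = int(hashlib.sha256((fingerprint + "#").encode()).hexdigest(), 16) % NUM_NODES
    return h1, h2

def place(block: str, slots: dict) -> bool:
    """Place `block` on one of its two candidate nodes (one slot per node here),
    kicking out residents cuckoo-style when both candidates are occupied."""
    node = candidates(block)[0]
    for _ in range(MAX_KICKS):
        if node not in slots:
            slots[node] = block
            return True
        slots[node], block = block, slots[node]      # displace the resident
        a, b = candidates(block)
        node = b if node == a else a                 # evicted block goes to its alternate
    return False                                     # would require rebalancing

slots = {}
for blk in ["blk-a", "blk-b", "blk-c", "blk-d"]:
    print(blk, "->", place(blk, slots))
print(slots)
```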
Spatial-Aware Orchestration of LLM Attention on Waferscale Chips

Transformer-based LLMs have driven AI progress but face mounting computational challenges as their scale and context lengths expand. Wafer-scale chips (WSCs) offer compelling alternatives to conventional GPUs with substantially higher transistor density and inter-die bandwidth. However, their rigid 2D mesh topology undermines GPU-optimized ring-attention patterns, while causal attention masks create workload imbalances that conventional token reordering fails to address in wafer-scale settings. We analyze communication overhead in LLM attention blocks on WSCs and develop a spatial-aware cost model tailored to wafer-scale topologies. Our spatial-aware orchestration method optimizes communication patterns and strategically places tensors to leverage high-bandwidth wafer-scale interconnects, reducing latency and balancing workloads. This approach yields 1.5× average performance improvement across diverse LLM architectures compared to SOTA training systems.

Taiquan Wei, Huizheng Wang, Zichuan Wang, Shouyi Yin, Yang Hu
ACLP: Towards More Accurate Loop Prediction for Execution Efficiency in High-Performance Processors

Branch prediction is critical to the execution efficiency of high-performance processors and has seen significant advancements. However, existing loop predictors often struggle to accurately track loop iterations, particularly in out-of-order or speculative execution. This paper proposes the Accurate Count Loop Predictor (ACLP), a loop prediction architecture that improves the accuracy of loop iteration tracking in high-performance processors. ACLP employs a dual confidence mechanism to suppress the influence of unstable loop branches and records committed loop branch counts to infer accurate iteration counts. Experimental results show that ACLP reduces mispredictions per kilo instructions (MPKI) by an average of 4.5% compared to a state-of-the-art loop predictor, with a marginal increase in area and power.

Zhen Xue, Wei He, Biwei Xie, Yungang Bao
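
A toy model of a confidence-gated loop predictor in the spirit of the abstract above: per-branch entries learn a trip count from committed iterations and only override the base predictor once that count has repeated often enough. Table sizes, counter widths, and the single confidence counter are simplified stand-ins for ACLP's dual-confidence design.

```python
# Hypothetical loop-predictor sketch trained only on committed branch outcomes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoopEntry:
    trained_trip_count: int = 0
    current_iteration: int = 0
    confidence: int = 0            # saturating confidence counter

class LoopPredictor:
    def __init__(self, confidence_threshold: int = 3):
        self.table = {}            # branch PC -> LoopEntry
        self.threshold = confidence_threshold

    def predict(self, pc: int) -> Optional[bool]:
        """True/False for taken/not-taken, or None to defer to the base predictor."""
        e = self.table.get(pc)
        if e is None or e.confidence < self.threshold:
            return None
        return e.current_iteration + 1 < e.trained_trip_count   # exit on the last trip

    def commit(self, pc: int, taken: bool) -> None:
        """Update on committed (non-speculative) outcomes only."""
        e = self.table.setdefault(pc, LoopEntry())
        if taken:
            e.current_iteration += 1
            return
        observed = e.current_iteration + 1       # loop exited: one full trip observed
        if observed == e.trained_trip_count:
            e.confidence = min(e.confidence + 1, 7)
        else:
            e.trained_trip_count, e.confidence = observed, 0
        e.current_iteration = 0

lp = LoopPredictor()
for _ in range(5):                               # train on a loop with 4 iterations
    for outcome in (True, True, True, False):
        lp.commit(pc=0x400, taken=outcome)
print(lp.predict(0x400))                         # True: mid-loop, predicts taken
```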
DSL-SGD: Distributed Local Stochastic Gradient Descent with Delayed Synchronization

Communication overhead is a key challenge in distributed deep learning. This paper introduces DSL-SGD, a distributed training scheme with a local update mechanism. DSL-SGD allows local weights to participate in multiple steps until global updates are completed, mitigating communication latency. To reduce parameter differences, it accumulates and averages gradients from multiple steps to update local weights. Experiments on a 32-GPU cluster show DSL-SGD matches the convergence accuracy of synchronous SGD while reducing end-to-end time by 7.9× compared to synchronous SGD and by 71.8% compared to Local-SGD. It also demonstrates superior scalability and computational efficiency by reducing weight updates by a factor of k.

Enda Yu, Zhe Bai, Dezun Dong
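
A pure-NumPy toy of local SGD with delayed synchronization, as a rough analogue of the scheme above: simulated workers take k local steps on their own weight copies before averaging. The quadratic toy loss, learning rate, and k are illustrative; the real DSL-SGD runs on a GPU cluster with asynchronous communication, which is not modeled here.

```python
# Hypothetical single-process simulation of delayed-synchronization local SGD.
import numpy as np

def local_grad(w: np.ndarray, data: np.ndarray) -> np.ndarray:
    # Gradient of 0.5 * ||w - mean(data)||^2, a stand-in for a real model.
    return w - data.mean(axis=0)

def dsl_sgd(num_workers=4, k=8, rounds=20, lr=0.1, dim=3, seed=0):
    rng = np.random.default_rng(seed)
    shards = [rng.normal(loc=i, size=(64, dim)) for i in range(num_workers)]
    weights = [np.zeros(dim) for _ in range(num_workers)]
    for _ in range(rounds):
        for w_id in range(num_workers):
            for _ in range(k):                     # k local steps between syncs
                weights[w_id] -= lr * local_grad(weights[w_id], shards[w_id])
        global_w = np.mean(weights, axis=0)        # delayed synchronization point
        weights = [global_w.copy() for _ in range(num_workers)]
    return global_w

print(dsl_sgd())
```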
Exploiting Large Language Models for Software-Defined Solid-State Drives Design

Software-Defined SSDs enable customizable hardware components to effectively optimize storage performance for specific workloads. However, optimizing configurations is challenging due to complex inter-dependencies among numerous parameters. Existing methods are limited by insufficient workload-awareness, high search overhead, and an inability to leverage external insights. To address these challenges, LLMs could be a promising technique, as they excel in handling complex, high-dimensional parameter space exploration by leveraging their advanced capability to identify patterns and optimize solutions. In this work, we explore the potential of LLMs in understanding and efficiently managing the Software-Defined SSD design space. Specifically, we propose LLM-S3D, an LLM-driven framework that comprehensively understands workloads via a novel compression scheme, efficiently explores configuration spaces, and iteratively optimizes SSD parameters. Evaluation results demonstrate that LLM-S3D delivers a 59.57% performance improvement for target workloads compared to commodity SSDs.

Qian Wei, Zehao Chen, Tianren Zhou, Wenbin Zhu, Zhenge Jia, Mengying Zhao, Zhaoyan Shen
Comber: QoS-Aware and Efficient Deployment for Co-located Microservices and Best-Effort Tasks in Disaggregated Datacenters

Current cloud providers mostly adopt the microservice architecture for latency-critical (LC) services, while co-locating Best-Effort (BE) tasks like big data analytics to improve the resource utilization of datacenters. Meanwhile, datacenters are evolving toward the disaggregated architecture, in which the compute cluster offers strong computing capabilities while the storage cluster is closer to data storage. Current works partition computing resources between the LC service and BEs to improve the execution performance of BEs while ensuring the Quality-of-Service (QoS) of the LC service. However, without appropriate deployment between disaggregated clusters, they result in increased job completion time (JCT) of BE tasks. We propose Comber, a deployment system that reduces BE JCT while ensuring microservice QoS in disaggregated datacenters. Comber consists of a BE task placer and a microservice distributor. Experiments show it reduces BE JCT by 72.4% on average compared to state-of-the-art methods.

Ruogang Ma, Jiuchen Shi, Quan Chen, Minyi Guo
NISA-DV: Verification Framework for Neuromorphic Processors with Customized ISA

Effective and efficient verification is a fundamental part of the chip design process, and it becomes more important and critical as designs grow more complex. The efficacy of classical verification methods such as random testing is significantly diminished when applied to emerging architectures such as neuromorphic processors, due to the lack of corresponding decoding for their instruction encodings and computational characteristics. Therefore, this paper proposes NISA-DV (Design Verification for Neuromorphic ISA), a verification framework for RISC-V-based neuromorphic ISA extensions, comprising random, directed, and boundary-condition test methods. Using a combination of constrained random tests with directed feedback and manually designed directed tests, we obtain improvements of 8.34% in block coverage and 29.63% in toggle coverage on the NeuroRVCore sample neuromorphic processor, compared to the random method using only RISCV-DV.

Yuanfeng Luo, Zhijie Yang, Yi Wei, Ping Yu, Yang Guo, Lei Wang
Lembda: Optimizing LLM Inference on Embedded Platforms via CPU/FPGA Co-processing

The rapid advancement of Large Language Models (LLMs) has provided new opportunities for edge applications. Embedded FPGA platforms are well-suited for deploying LLMs at the edge due to their hardware programmability and high energy efficiency. However, the limited hardware resources of embedded systems pose significant challenges to comprehensive acceleration. As not all operators can be offloaded to programmable logic (PL), the remaining operations must execute on the processing system (PS), creating potential performance bottlenecks and degrading inference efficiency. In this paper, we introduce Lembda, a collaborative optimization framework that harnesses the computational capability of both PL and PS for efficient embedded LLM inference. On the PL side, we employ W4A8 quantization and implement high-throughput GEMM/GEMV kernels. On the PS side, we optimize high-precision operations by exploiting model sparsity in attention layers and approximate nonlinear functions via lightweight polynomial fitting. Moreover, we carefully orchestrate PL/PS operations to exploit operation-level parallelism and further enhance performance. Evaluations on the AMD Kria KV260 platform demonstrate that Lembda delivers 187.9195 tok/s for prefilling and 9.7857 tok/s for decoding on the Qwen2.5-0.5B-Instruct model, achieving 65.9×/3.8× speedups compared to the baseline method with negligible accuracy loss.

Jinwei Zhou, Chenhao Xue, Xiping Dong, Yi Ren, Jiaxing Zhang, Guangyu Sun, Xinnan Lin
QDLoRA: Enhanced LoRA Fine-Tuning on Quantized LLMs via Integrated Low-Rank Decomposition

We propose QDLoRA, a parameter-efficient fine-tuning (PEFT) framework that integrates low-rank decomposition and quantization into the LoRA fine-tuning process for pretrained large language models (LLMs). Unlike prior methods such as LoftQ and ApiQ that rely solely on quantization and suffer performance degradation under extreme compression, QDLoRA preserves more informative structure at the same compression ratio, thereby improving fine-tuning results. To further enhance robustness, QDLoRA introduces a similarity-aware rank selection strategy and a quantization-aware initialization scheme. Experimental results on various model architectures across diverse NLP benchmarks demonstrate that QDLoRA achieves superior accuracy and efficiency compared to existing methods, particularly under limited resource budgets. The proposed method offers a practical and scalable solution for efficient fine-tuning of large language models.

Xingyi Su, Rui Wang, Zhongzhi Luan, Yi Liu, Depei Qian
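
A minimal NumPy sketch of the general "quantize plus low-rank residual" initialization family the abstract above refers to (in the spirit of LoftQ-style methods): the frozen weight is quantized and the LoRA factors are initialized from an SVD of the quantization residual. The 4-bit uniform quantizer and the fixed rank are illustrative; QDLoRA's similarity-aware rank selection and quantization-aware initialization are not reproduced here.

```python
# Hypothetical sketch: initialize LoRA factors A, B so that W ≈ Q(W) + A @ B,
# letting fine-tuning start close to the full-precision weight.
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int = 4) -> np.ndarray:
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo     # dequantized values

def lowrank_residual_init(w: np.ndarray, rank: int, bits: int = 4):
    """Return (q, A, B) with q frozen/quantized and A @ B capturing the residual."""
    q = quantize_uniform(w, bits)
    u, s, vt = np.linalg.svd(w - q, full_matrices=False)
    a = u[:, :rank] * s[:rank]       # shape (out_features, rank)
    b = vt[:rank, :]                 # shape (rank, in_features)
    return q, a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)
q, a, b = lowrank_residual_init(w, rank=16)
print("quantization-only error:", np.linalg.norm(w - q))
print("quantization + low-rank error:", np.linalg.norm(w - (q + a @ b)))
```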
Backmatter
Title
Advanced Parallel Processing Technologies
Edited by
Chao Li
Xuehai Qian
Dimitris Gizopoulos
Boris Grot
Copyright Year
2026
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9510-21-4
Print ISBN
978-981-9510-20-7
DOI
https://doi.org/10.1007/978-981-95-1021-4

The PDF files of this book have been created in accordance with the PDF/UA-1 standard to improve accessibility. This includes screen reader support, described non-textual content (images, graphics), bookmarks for easy navigation, keyboard-friendly links and forms, and searchable, selectable text. We recognize the importance of accessibility and welcome inquiries about the accessibility of our products. For questions or accessibility needs, please contact us at accessibilitysupport@springernature.com.
