
Database Systems for Advanced Applications

29th International Conference, DASFAA 2024, Gifu, Japan, July 2–5, 2024, Proceedings, Part II

  • 2025
  • Book

About this Book

The seven-volume set LNCS 14850-14856 constitutes the proceedings of the 29th International Conference on Database Systems for Advanced Applications, DASFAA 2024, held in Gifu, Japan, in July 2024. The 147 full papers and 85 short papers included in this seven-volume set were carefully reviewed and selected from 722 submissions. Additionally, 14 industrial papers, 18 demo papers and 6 tutorials are included. The conference presents papers on the following topics. Part I: spatial and temporal data; core database technology; federated learning. Part II: machine learning; text processing. Part III: recommendation; multimedia. Part IV: privacy and security; knowledge bases and graphs. Part V: natural language processing; large language models; time series and stream data. Part VI: graphs and networks; hardware acceleration. Part VII: emerging applications; industrial papers; demo papers.

Table of Contents

Frontmatter

Machine Learning

Frontmatter
Variational Kernel Density Estimation Recommendation Algorithm for Users with Diverse Activity Levels

Top-N recommendation is widely accepted as an effective method in personalized services, serving users with different interests well. However, as analysis of state-of-the-art methods (SOTAs) shows, their performance differs significantly across users with diverse activity levels, which seriously damages the quality of personalized recommendation. Existing studies pay little attention to this issue; they simply assume that the preferences of all users follow a common probability distribution and then use a fixed schema (e.g., one latent vector) to model user representations. This assumption makes existing models hard to adapt to users of diverse activity levels. In this work, we propose a Variational Kernel Density Estimation (VKDE) model, a non-parametric estimator that aims to fit arbitrary preference distributions for users. VKDE constructs the user (global) preference distribution from multiple local distributions collectively. We propose a variational kernel function to infer a user's one-faceted interests and generate each local distribution. A sampling strategy for user one-faceted interests is further proposed to reduce training complexity while preserving recommendation effectiveness. Our experimental results on three public datasets show that VKDE outperforms SOTAs and greatly improves accuracy for users of diverse activity levels.

Wei Liu, Shangsong Liang, Huaijie Zhu, Leong Hou U, Jianxing Yu, Xiang Li, Jian Yin
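As a rough illustration of the non-parametric idea behind VKDE, building a global distribution as the collective average of per-sample local distributions, a plain Gaussian kernel density estimate can be sketched as follows (the paper's variational kernel and sampling strategy are not reproduced here; function names and the bandwidth value are illustrative):

```python
import numpy as np

def gaussian_kde(samples, query, bandwidth=0.5):
    """Evaluate a Gaussian kernel density estimate at the query points.

    Each sample contributes a local Gaussian 'bump'; the global density
    is the average of these local distributions, which is how a KDE can
    fit an arbitrary distribution non-parametrically.
    """
    samples = np.asarray(samples, dtype=float)   # shape (n,)
    query = np.asarray(query, dtype=float)       # shape (m,)
    diff = (query[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diff ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth      # shape (m,)
```

A multimodal sample set (e.g., two clusters of interactions) yields a multimodal density, which a single fixed latent vector could not represent.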
Accelerating Training of Large Neural Models by Gradient-Based Growth Learning

With the development of large neural models, the scale of their parameters is increasing rapidly. The amount of training computation grows almost exponentially, leading to high costs. To address this problem, this paper proposes a new accelerated method called Grad-Grow (Gradient-based Growth Learning) to improve training efficiency while obtaining competitive performance. The method adopts a local-to-global learning strategy, which first optimizes a sub-model and then grows and trains larger sub-models step by step until the whole model is learned. We design the growth strategy from the perspective of progressive optimization and use gradient signals to guide the growth direction. With this growth-based learning scheme, a small sub-model can be trained quickly and provides a better initialization for learning the larger sub-model in the next stage, which boosts convergence and reduces computational costs. Extensive experiments are conducted on datasets from two tasks with different network architectures. The results verify the effectiveness of our approach, with 30% training acceleration on average.

Haowei Jiang, Jianxing Yu, Libin Zheng, Huaijie Zhu, Wei Liu, Jian Yin
Characterizing the Influence of Topology on Graph Learning Tasks

Graph neural networks (GNN) have achieved remarkable success in a wide range of tasks by encoding features combined with topology to create effective representations. However, how graph topology influences the performance of learning models on downstream tasks is not yet well understood. In this paper, we propose a metric, TopoInf, which characterizes the influence of graph topology by measuring the level of compatibility between the topological information of graph data and downstream task objectives. We provide analysis based on decoupled GNNs on the contextual stochastic block model to demonstrate the effectiveness of the metric. Through extensive experiments, we demonstrate that TopoInf is an effective metric for measuring topological influence on corresponding tasks and can be further leveraged to enhance graph learning.

Kailong Wu, Yule Xie, Jiaxin Ding, Yuxiang Ren, Luoyi Fu, Xinbing Wang, Chenghu Zhou
Distributed Temporal Graph Neural Network Learning over Large-Scale Dynamic Graphs

Temporal Graph Neural Networks (TGNNs) have achieved success in real-world graph-based applications. The increasing scale of dynamic graphs necessitates distributed training. However, deploying TGNNs in a distributed setting poses challenges due to the temporal dependencies in dynamic graphs, the need for computation balance during distributed training, and the non-negligible communication costs across disjoint trainers. In this paper, we propose DisTGL, a distributed temporal graph neural network learning system. Leveraging a temporal-aware partitioning scheme and a series of enhanced communication techniques, DisTGL ensures efficient distributed computation and minimizes communication overhead. Based on that, DisTGL facilitates fast TGNN training and downstream tasks. An evaluation of DisTGL using various TGNN models shows that i) DisTGL achieves acceleration of up to 10× compared to existing TGNN frameworks; and ii) the proposed distributed dynamic graph partitioning reduces cross-machine operations by 25%, while the optimized communication reduces costs by 1.5–2.5×.

Ziquan Fang, Qichen Sun, Qilong Wang, Lu Chen, Yunjun Gao
Score Network with Adaptive Augmentation Aggregator for Multivariate Time Series Representation Contrastive Learning

The complexity of multichannel data, intricate temporal dynamics, and the diverse frequency characteristics of time series pose significant challenges for self-supervised representation learning. To address these issues, we present the Teacher Student Score (TSS) framework, a novel contrastive learning approach for multidimensional time series representations. This framework introduces two key innovations. First, we present a time-channel-frequency consistency (TCF-C) approach that builds contrastive representations across the time, channel, and frequency dimensions and incorporates them into the contrastive learning framework. This technique uses a weighting mechanism to prioritize self-supervised tasks that emphasize consistency across these dimensions. Second, we propose a Score Network with an Adaptive Augmentation Aggregator (AAA) module. This module dynamically combines augmentation strategies to create a unified augmented representation, enhancing the efficacy of augmentation in contrastive learning. We evaluate our method on the UEA datasets against eight state-of-the-art methods, and the results show that TSS achieves significant improvements over existing SOTAs in self-supervised learning for time series classification.

Guichun Zhou, Yijiang Chen, Xiangdong Zhou
Reusing Keywords for Fine-grained Representations and Matchings

Question retrieval aims to find semantically equivalent questions for an exemplary question, and suffers from a key challenge: the lexical gap. Previous solutions mainly focus on utilizing translation models, topic models and deep learning techniques to perform global matching. Different from these solutions, we propose the new insight of reusing important keywords to construct fine-grained semantic representations of questions and then fine-grained matchings between two questions, which may inspire future research to explore and mine new solutions from the questions themselves. To realize this insight, we propose a practical fine-grained matching network with two cascaded units: (i) a fine-grained representation unit, which uses multi-level keyword sets to represent question semantics at different granularities; (ii) a fine-grained matching unit, which performs matching at multiple granularities (to achieve both global and local matching) and from multiple views (to achieve both lexical and semantic matching). We conduct extensive experiments on three public datasets and the experimental results show that our proposed model outperforms the state-of-the-art solutions.

Li Chong, Denghao Ma, Yueguo Chen, Xueqiang Lv
Modeling Learning Transfer Effects in Knowledge Tracing: A Dynamic and Bidirectional Perspective

Knowledge Tracing (KT) is a crucial task in online intelligent education systems, which aims to dynamically monitor students’ evolving knowledge state. During students’ learning process, not only do knowledge states on later-learned concepts impact the understanding of earlier-learned concepts, but the knowledge states on earlier-learned concepts also influence the learning of later-learned concepts, forming bidirectional learning transfer. However, existing work typically focuses on either the backward impact of the currently learned concept on previously learned concepts or the forward impact of previously learned concepts on future concepts. Moreover, they commonly assume the transfer influence weights between concepts remain fixed throughout the entire learning process, thereby neglecting students’ dynamic transfer mechanisms. In this paper, we introduce the Dynamic Bidirectional Transfer Knowledge Tracing (DBTKT) model, which simultaneously takes into account learning transfer effects in both directions and utilizes students’ personalized learning experiences to measure dynamic transfer influence weights. Extensive experimental results on public datasets demonstrate that our model not only generates improved performance predictions compared to existing methods but also offers meaningful insights into knowledge state evolution from a learning transfer perspective.

Weizhe Huang, Shuanghong Shen, Zhenya Huang, Qi Liu, Junyu Lu, Yu Su
Acceleration-Guided Diffusion Model for Multivariate Time Series Imputation

Multivariate time series data are pervasive in various domains and often plagued by missing values due to diverse reasons. Diffusion models have demonstrated their prowess for imputing missing values in time series by leveraging stochastic processes. Nonetheless, a persistent challenge surfaces when diffusion models must accurately model time series with quick changes. In response to this challenge, we present the Acceleration-guided Diffusion model for Multivariate time series Imputation (ADMI). Time-series representation learning is first conducted through an acceleration-guided masked modeling framework. Subsequently, representations that take special care of quick changes are incorporated as guiding elements in the diffusion model via a cross-attention mechanism. Thus, our model can self-adaptively adjust the weights associated with the representation during the denoising process. Our experiments, conducted on real-world datasets featuring genuine missing values, conclusively demonstrate the superior performance of our ADMI model. It excels in both imputation accuracy and the overall enhancement of downstream applications.

Xinyu Yang, Yu Sun, Shaoxu Song, Xiaojie Yuan, Xinyang Chen
Multi-Interest Granularity Guided Semi-Joint Learning for N-Successive POI Recommendation

Massive user check-in histories provide valuable opportunities to understand users' behavior and make successive point-of-interest (POI) recommendations. However, most existing works only focus on predicting the POIs that users will visit next, while ignoring when users are interested in visiting these POIs. A few works set a fixed time window to constrain the target, but still suffer from the following problems: 1) the inability to respond to interests within future dynamic time windows; 2) the inability to answer when, and in what order, to visit the recommended POIs. As a result, they perform poorly at providing recommendations that are more beneficial to merchants and more likely to excite users. In contrast, we propose a new and meaningful task, N-successive POI recommendation, which aims to suggest POI sequences that users will visit in the next N consecutive time slots. The challenge of this task lies in how to efficiently integrate interests of different granularities to maximize their value, and how to exploit consecutive interest dependencies. To this end, we propose a multi-interest granularity guided semi-joint learning model. It performs multiple combined encodings of short-term interests, long-term interests, and clock influence in a simple and effective way while learning consecutive interest dependencies. The experimental results show that our model can effectively perform N-successive POI recommendations.

Fuqiang Yu, Fenghua Tong, Dawei Zhao, Lijuan Xu, Ning Liu
PT-Tuning: Bridging the Gap between Time Series Masked Reconstruction and Forecasting via Prompt Token Tuning

Self-supervised learning has been actively studied in the time series domain, especially for masked reconstruction. Most of these methods follow the "Pre-training + Fine-tuning" paradigm, in which a new decoder replaces the pre-trained decoder to fit a specific downstream task, leading to inconsistency between upstream and downstream tasks. In this paper, we point out that unifying task objectives and adapting for task difficulty are critical for bridging the gap between masked reconstruction and forecasting. By reserving the pre-trained mask token during the fine-tuning stage, forecasting can be taken as a special case of masked reconstruction, where future values are masked and reconstructed based on history values. This guarantees the consistency of task objectives, but a gap in task difficulty remains, because masked reconstruction can utilize contextual information while forecasting can only use historical information to reconstruct. To further mitigate this gap, we propose a simple yet effective prompt token tuning (PT-Tuning) paradigm, in which all pre-trained parameters are frozen and only a few trainable prompt tokens are added to extended mask tokens in an element-wise manner. Extensive experiments on real-world datasets demonstrate the superiority of the proposed paradigm, with state-of-the-art performance compared to representation learning and end-to-end forecasting methods.

Hao Liu, Jinrui Gan, Xiaoxuan Fan, Yi Zhang, Chuanxian Luo, Jing Zhang, Guangxin Jiang, Yucheng Qian, Changwei Zhao, Huan Ma, Zhenyu Guo
Beyond Users: Denoising Behavior-based Contrastive Learning for Disentangled Cross-Domain Recommendation

Cross-Domain Recommendation (CDR) has emerged as an effective solution to the data sparsity issue in Recommender Systems (RS). Existing CDR methods typically disentangle user features into domain-invariant and domain-specific features to avoid negative transfer, known as DCDR. Nevertheless, these methods often neglect the side effects of noisy behaviors (interactions) during disentanglement. Furthermore, they fail to account for item features during disentanglement, which significantly influence the generation of user features. These two critical oversights lead to the degraded performance of existing DCDR methods. To overcome these issues, we introduce a Denoising Contrastive learning framework specifically tailored for DCDR (DCDC). DCDC conducts denoising at both the structure and feature levels. Structure denoising prunes unreliable behaviors, restricting message propagation during graph convolution to reliable edges. Feature denoising modifies the similarity between nodes based on the edge’s reliability. Additionally, we design two contrastive learning (CL) constraints based on user and item mutual information for thorough disentanglement. Contrastive exclusion constraint distinguishes domain-invariant and domain-specific features within a domain, while contrastive proximity constraint minimizes the distance of cross-domain invariant features. The final results showcase the consistent outperformance of our model compared to state-of-the-art methods across three diverse real-world datasets.

Lele Sun, Jing Liu, Shenyuan Zhang, Weizhi Nie, Anan Liu, Yuting Su
Speal: Achieving a More Accurate Model with Less Training Data in Performance Evaluation of Storage System through Sampling Optimization

Performance evaluation, as a crucial component of Quality of Service (QoS), holds significant importance for modern storage systems. Previous machine learning-based methods ignore the varied improvements in the model after applying different datasets for training. Suboptimal random sampling methods may lead to the collection of unnecessary training data, resulting in excessively high dataset construction costs. This problem becomes more pronounced when there are constraints on the sampling and storage system resources. In this paper, we propose Speal, a Storage System Performance Evaluator with Active Learning, which utilizes machine learning to predict the performance of the workload running on the storage system. We present a straightforward yet highly effective active learning algorithm called E2 sampling, employed during the model construction phase to reduce the cost of training dataset acquisition. Furthermore, we apply Speal to the storage system to facilitate bandwidth control and optimize performance. In our experiments using performance data collected from a real storage system, Speal exhibits up to a 1.75x reduction in prediction error compared to other active learning algorithms. Additionally, applying the bandwidth control enhanced by Speal's performance evaluation to the storage system leads to an average throughput improvement of up to 1.51x and a reduction in tail latency of up to 1.71x, surpassing the baseline.

Liang Bao, Hua Wang, Ke Zhou, Guangyu Zhang, Ji Zhang, Xi Peng, Qingqing Yang, Renhai Chen, Gong Zhang
Enhancing Spatiotemporal Prediction with Intra- and Inter-granularity Contrastive Learning

Spatiotemporal prediction (STP) utilizes historical properties to predict future trends. However, most STP methods consider only single-granularity data and ignore the diversity of spatiotemporal patterns, i.e., different granularities, and thus perform mediocrely. In this paper, we propose an Intra- and Inter-granularity Contrastive Learning Framework (IICLF) to enhance STP, facing several key challenges: i) it is difficult to learn reliable spatiotemporal representations under different granularities; ii) STP suffers from uneven data distribution. To address the first challenge, we devise an intra- and inter-granularity contrastive learning module, which enhances spatiotemporal representations by incorporating the commonalities and differences across granularities. To address the second, we design a dual hypergraph convolutional module integrating a geographical hypergraph and a semantic hypergraph to capture higher-order dependencies among nodes and mitigate the problem of uneven data distribution. In addition, we introduce a supervised task that distinguishes the importance of different granularities and makes compelling predictions. Extensive experiments on three real-world datasets validate that our proposed IICLF is superior to various state-of-the-art STP methods.

Qilong Han, Shanshan Sui, Dan Lu, Shiqing Wu, Guandong Xu
TDMixer: Lightweight Long-Term Series Forecasting using Time-Continuous Embedding and Magnitude Decomposition

Time series forecasting plays a key role in several fields, including energy, transportation, weather, etc. Conventional forecasting techniques, along with deep learning-based series forecasting methods such as RNN, CNN, Transformer, and MLP models, are currently thriving. Nevertheless, current approaches struggle to decrease the computational complexity of the model without sacrificing accuracy. To this end, we propose TDMixer, a hybrid lightweight network for long-term series forecasting using time-continuous embedding and magnitude decomposition. (1) To start, we propose time-continuous embedding, a method that converts past timestamps into continuous temporal relationships. This facilitates the model in understanding historical temporal correlation, contributing to improved predictive performance. (2) Additionally, to decrease computational complexity, we suggest utilizing the magnitude decomposition method and constructing a lightweight temporal patterns learner to comprehend diverse temporal patterns. Our investigation uncovers that the significant temporal patterns are predominantly concentrated in the higher magnitude region in the frequency domain. We evaluate TDMixer on five real-world datasets and the experimental results demonstrate its excellent predictive performance and low computational complexity.

Hui Liu, Qiaoqiao Liu, Zhihan Yang, Junzhao Du
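The observation that dominant temporal patterns concentrate in high-magnitude frequency bins can be illustrated with a generic FFT-based split (an illustrative decomposition, not TDMixer's exact formulation; the function name and `k` are assumptions):

```python
import numpy as np

def magnitude_decompose(x, k=3):
    """Split a 1-D series into a 'dominant' component built from the k
    largest-magnitude frequency bins and a 'residual' component from
    the remaining bins."""
    spec = np.fft.rfft(x)
    top = np.argsort(np.abs(spec))[-k:]          # k largest-magnitude bins
    mask = np.zeros_like(spec, dtype=bool)
    mask[top] = True
    dominant = np.fft.irfft(np.where(mask, spec, 0), n=len(x))
    residual = np.fft.irfft(np.where(mask, 0, spec), n=len(x))
    return dominant, residual
```

By linearity of the FFT, the two components sum back to the original series, so a model can process the compact dominant part cheaply while treating the residual separately.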
AoSE-GCN: Attention-Aware Aggregation Operator for Spatial-Enhanced GCN

Graph Neural Networks (GNNs) have shown remarkable success in various graph-relevant applications and can generally be classified into spatial-based and spectral-based methods. Spatial methods capture local neighborhoods well but lack global structural insight, since they are defined over local nodes via aggregation operators. Conversely, spectral methods incorporate global structural information but struggle with local details due to the nature of the Laplacian matrix. Notably, spectral methods adjust frequency components on filters to achieve effective convolutions, yet they remain less flexible than spatial methods. The challenge lies in balancing these approaches, enabling GNN models to capture both global-level and local-level information and thus promoting graph representation learning. To tackle this problem, we introduce a novel attention-aware aggregation operator, denoted $$L_{att}$$, which appends the attention score as an additional weight in the Laplacian matrix. Inspired by its benefits, we integrate $$L_{att}$$ into the GCN model to perceive fields at different levels, called AoSE-GCN. Notably, our $$L_{att}$$ is not limited to GCN; any spectral method can easily be plugged in. Extensive experiments on benchmark datasets validate the superiority of AoSE-GCN for node classification tasks in fully-supervised and semi-supervised settings.

Jiazhen Ye, Chunyan An, Qiang Yang, Zhixu Li
Data-free Knowledge Distillation based on GNN for Node Classification

Data-free Knowledge Distillation (KD) circumvents the limitation of extracting knowledge from original training data by utilizing generated data. Data-free KD has made good progress in models for processing grid data. However, for Graph Neural Networks (GNN) that process non-grid data, existing related methods are primarily designed for graph classification tasks involving small graphs; node classification on larger graphs remains unexplored. To address this issue, we propose the first Data-Free Knowledge Distillation framework for Node Classification (DFKD-NC). DFKD-NC obtains training data by generating each target node and its neighbor information. For data generation, we predefine a full-tree pseudo-subgraph template for each target node and use a generator component to generate node features. We train the student model and the generator using adversarial training. Moreover, we introduce contrastive learning to diversify the generated data and regularize the generated node features to facilitate student model convergence. Extensive experiments on six benchmark datasets demonstrate that DFKD-NC achieves state-of-the-art results in node classification tasks. Code is available at https://github.com/zengxinfeng/DFKD-NC.

Xinfeng Zeng, Tao Liu, Ming Zeng, Qingqiang Wu, Meihong Wang
Multi-class Imbalanced Data Classification by Deep Multi-set Discriminant Metric Learning with Optimal Balance Sampling

Data classification is one of the core technologies of data mining and has great scientific significance and commercial value, being widely used in medical science, bioinformatics, and computer science. However, data often exhibit class imbalance, which leads to minority-class samples being misclassified into the majority class and thus reduces classifier performance. Compared with the two-class scenario, multi-class imbalanced data classification is more difficult, since samples from different minority classes can be misclassified into majority classes. In this paper, we propose a novel class-imbalanced learning model for multi-class imbalanced data classification. Specifically, we first define a multiple balanced subset construction strategy via optimal balance sampling and then design a deep multi-set discriminant metric learning network for multiple-subset feature learning. Extensive experimental results on four typical class-imbalanced datasets from three important fields demonstrate that, compared with state-of-the-art methods, our approach improves the average classification performance by 4.02% on contraceptive, 7.82% on yeast, 5.50% on mushroom, and 4.12% on pageblocks.

Xinyu Zhang, Xiao-Yuan Jing, Xiaocui Li, Jiagang Liu
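A minimal version of the multiple-balanced-subset idea, splitting an imbalanced dataset into several subsets that each contain equally many samples per class, can be sketched with plain undersampling (a generic scheme; the paper's optimal balance sampling and metric network are not reproduced, and the function name is an assumption):

```python
import math
import random

def balanced_subsets(samples_by_class, seed=0):
    """Build multiple balanced subsets: each subset holds
    min-class-size samples per class, cycling through each class so
    every original sample appears in at least one subset."""
    rng = random.Random(seed)
    n_min = min(len(s) for s in samples_by_class.values())
    n_max = max(len(s) for s in samples_by_class.values())
    k = math.ceil(n_max / n_min)  # number of subsets needed
    shuffled = {c: rng.sample(s, len(s)) for c, s in samples_by_class.items()}
    subsets = []
    for i in range(k):
        subset = {c: [s[(i * n_min + j) % len(s)] for j in range(n_min)]
                  for c, s in shuffled.items()}
        subsets.append(subset)
    return subsets
```

Each balanced subset can then feed one branch of a multi-set learner, so no majority-class sample is discarded outright.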
Time-aware Dual-kernel Hawkes Process for Sequential Recommendation

Sequential recommendation systems usually learn users' personalized preferences from their historical behavior sequences. Previous methods often use rich item interaction information combined with context to mine users' sequential patterns. However, the time decay and dynamic evolution of users' preferences are rarely taken into consideration. In this paper, we propose TDH4Rec (Time-aware Dual-kernel Hawkes process for sequential recommendation). First, we establish a time-aware attention embedding module that utilizes the time sensitivity and importance of interactions to capture the time dependence and frequency dependence of items during the interaction process. Second, we design a Hawkes-process-based dual-kernel learning module: a Hawkes interaction kernel function models the effect of historical interactions, while a Hawkes time kernel function with a time-decay effect captures the influence of time changes of interacted items. Finally, extensive experiments on three real-world datasets show the efficacy of our model compared with conventional and state-of-the-art methods.

Jingyang Liu, Nan Wang, Yingli Zhong, Zhonghui Shen
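For intuition, the time-decay effect that the dual-kernel module builds on appears in the textbook univariate Hawkes conditional intensity with an exponential kernel (a standard formulation, not the paper's learned kernels; the parameter values are illustrative):

```python
import math

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.8, beta=1.0):
    """Conditional intensity of a univariate Hawkes process:
    lambda(t) = mu + sum over past events t_i of alpha * exp(-beta*(t - t_i)).
    Each past event adds excitation that decays exponentially, so
    recent interactions influence the present more than old ones."""
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in event_times if ti < t)
```

In a recommender setting, `event_times` would be a user's past interactions with an item, so the intensity naturally encodes both frequency (more events, higher intensity) and recency (decay).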
Shapley-Optimized Reinforcement Learning for Human-Machine Collaboration Policy

Human-machine collaboration is a promising training framework for learning optimal strategies in high-cost exploration scenarios, but such work is challenging. On one hand, current research on human-machine collaboration primarily focuses on imitation learning, overlooking the optimization of interactions between the collaborating entities; hence, we propose a conceptual framework and modeling approach for collaborative learning based on imitation learning. On the other hand, it is difficult to explain the contributions of humans and machines in the learning process, and metrics for measuring the learned strategies and the uncertainty of decision gradients are lacking. To address these issues, we introduce an RL (Reinforcement Learning) framework for human-machine collaboration, known as Human-Machine RL. This framework employs reward shaping techniques for offline policy learning. To assess the policies, we design an estimation algorithm tailored for human-machine collaboration scenarios, based on reinforcement learning. Additionally, we incorporate Shapley values as a mathematical interpretive tool for policy rewards and tackle the gradient variance that Shapley may introduce. The feasibility of our approach is demonstrated theoretically, and we have made the source code available for result reproducibility.

Jie Zhang, Yiqun Niu, Wei He, Cheng Jin, Chongjun Wang
Fine-Grained Urban Flow Inference with Dynamic Multi-scale Representation Learning

Fine-grained urban flow inference (FUFI) is a crucial transportation service aimed at improving traffic efficiency and safety. FUFI infers fine-grained urban traffic flows based solely on observed coarse-grained data. However, most existing methods focus on the influence of single-scale static geographic information on FUFI, neglecting the interactions and dynamic information between regions at different scales within the city, even though geographical features at different scales can capture redundant information from the same spatial areas. To effectively learn multi-scale information across time and space, we propose an effective fine-grained urban flow inference model called UrbanMSR, which uses self-supervised contrastive learning to obtain dynamic multi-scale representations of neighborhood-level and city-level geographic information, and fuses these multi-scale representations to improve fine-grained accuracy. We validate the performance through extensive experiments on three real-world datasets. The results, compared with state-of-the-art methods, demonstrate the superiority of the proposed model.

Shilu Yuan, Dongfeng Li, Wei Liu, Xinxin Zhang, Meng Chen, Junjie Zhang, Yongshun Gong
SP-Aug: Towards Efficient Semantic-Preserving Augmentations in Contrastive Learning via Hierarchical Outlier Factor

Data augmentation is an effective way to generate abundant synthetic data from original data, with the same labels preserved. It has recently demonstrated its significant advantages in contrastive learning by constructing positive pairs with augmented data. However, the implicit underlying assumption that data augmentation is consistently semantic-preserving is unrealistic and we observe that not all augmentations are beneficial for contrastive learning. Yet it is challenging to distinguish which augmentations preserve the semantic consistency of labels for a specific task. To tackle this challenge, we formulate the problem of selecting optimal augmentations as a multi-level anomaly detection problem, and propose a novel method towards efficient Semantic-Preserving Augmentations (SP-Aug) to leverage intrinsic hierarchical clustering information to guide the detection of unfavorable augmentations from local and intra-cluster perspectives. We design a new metric named Hierarchical Outlier Factor (HOF) to measure the multi-level semantic inconsistency of augmentations across different clustering partitions. Empirically we evaluate our method on 3 popular contrastive learning models and compare it with 3 state-of-the-art augmentation selection methods, demonstrating the superiority of ours in discovering outliers and identifying semantic-preserving augmentations in classification tasks.

Qianwen Meng, Hangwei Qian, Yonghui Xu, Lizhen Cui

Text Processing

Frontmatter
An Efficient Algorithm for Regular Expression Matching Using Variable-length-gram Inverted Index

Regular expression (regex) matching is widely used in many applications, such as code search, entity extraction, and intrusion detection, which require high matching efficiency. Traditional approaches use a finite state automaton to match the regex query against a text; even though they can employ filtering strategies to skip characters that cannot be part of a match, large amounts of content still need to be verified by the automaton. Recent methods use a positional inverted index based on q-grams (q-length substrings) to match all results for the regex query, which avoids the time-consuming automaton-based verification. However, using a fixed substring length to index the positions of the text can force highly frequent q-grams to be used for matching, which limits efficiency. To this end, we employ a variable-length-gram technique to boost index-based regex matching. First, we build a positional inverted index based on variable-length grams, obtaining a better balance between the number of grams and the number of gram occurrences in the text. Then, we design a data structure (VGgraph) based on variable-length grams to represent the regex query and propose a VGgraph-based matching algorithm using the variable-length-gram inverted index. Although computing the optimal VGgraph with minimal matching cost is NP-hard, we propose a greedy algorithm that constructs a VGgraph with a $$\ln n$$ approximation to the optimal one. Extensive experiments on real-world datasets demonstrate that the variable-length-gram technique significantly improves the efficiency of index-based regex matching.

Tao Qiu, Zheng Gong, Mengxiang Wang, Chuanyu Zong, Rui Zhu, Xiaochun Yang
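To make the positional variable-length-gram index concrete, here is a minimal sketch. A hand-picked gram dictionary stands in for the paper's cost-based gram selection, and a literal sub-pattern stands in for a full regex; `build_index` and `match_literal` are illustrative names, not the paper's API:

```python
def build_index(text, grams):
    """Positional inverted index: gram -> sorted list of start offsets.
    `grams` is a pre-chosen variable-length gram dictionary (the paper
    selects it to balance gram count against occurrence frequency)."""
    index = {g: [] for g in grams}
    for g in grams:
        start = text.find(g)
        while start != -1:
            index[g].append(start)
            start = text.find(g, start + 1)
    return index

def match_literal(pattern, index, grams):
    """Find pattern occurrences by decomposing the pattern into covering
    grams and intersecting posting lists at the proper offsets."""
    # Greedy left-to-right cover of the pattern with dictionary grams.
    cover, pos = [], 0
    while pos < len(pattern):
        g = next((g for g in sorted(grams, key=len, reverse=True)
                  if pattern.startswith(g, pos)), None)
        if g is None:
            return []                     # pattern not coverable by the grams
        cover.append((pos, g))
        pos += len(g)
    # A start s is valid only if every covering gram occurs at s + offset.
    first_off, first_g = cover[0]
    candidates = {p - first_off for p in index[first_g]}
    for off, g in cover[1:]:
        candidates &= {p - off for p in index[g]}
    return sorted(candidates)

text = "abcabcab"
grams = ["abc", "ab", "ca", "b"]          # variable-length gram dictionary
idx = build_index(text, grams)
```

Intersecting offset-shifted posting lists finds both occurrences of `"abcab"` in `"abcabcab"` without scanning the text with an automaton, which is the efficiency argument the abstract makes.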
A Retrieval-Augmented Framework for Tabular Interpretation with Large Language Model

Relational tables on the web hold a vast amount of knowledge, and it is critical for machine learning models to capture the semantics of these tables so that they perform well on table interpretation tasks such as entity linking, column type annotation, and relation extraction. However, it is very challenging for ML models to process large numbers of tables and/or retrieve inter-table context information from them. Instead, existing works usually rely on heavily engineered features, user-defined rules, or pre-training corpora. In this work, we propose a unified Retrieval-Augmented Framework for tabular interpretation with Large language models ($$\textsf{RAFL}$$), a novel two-step framework for the table interpretation task. $$\textsf{RAFL}$$ first adopts a graph-enhanced model to obtain inter-table context information by retrieving schema-similar and topic-relevant tables from a large corpus; it then conducts tabular interpretation learning by combining a lightweight pre-ranking model with a re-ranking-based large language model. We verify the effectiveness of $$\textsf{RAFL}$$ through extensive evaluations on three tabular interpretation tasks (entity linking, column type annotation, and relation extraction), where it substantially outperforms existing methods on all tasks.

Mengyi Yan, Weilong Ren, Yaoshu Wang, Jianxin Li
A Type Fusion and Span Relation Enhanced Event Extraction Framework for Confused Event

Event extraction (EE) is a crucial aspect of information extraction and faces challenges with overlapped, nested, and confused events. Most prior efforts have focused on overlapped or nested events while overlooking confused events. To address these issues, we introduce a novel event extraction model that employs type fusion to guide the argument extraction task and utilizes span relations for event splitting to handle confused events. The model, named ONCEE, simultaneously addresses the challenges of Overlapped, Nested, and Confused events in Event Extraction. Evaluation results on the public event extraction datasets FewFC and FNDEE demonstrate a noteworthy improvement: the average F1 score increased by 1.8% compared to CasEE and 1.2% compared to OneEE. Moreover, our model demonstrates superior performance in extracting confused events.

Fanshen Meng, Rongheng Lin
Cryptocurrency Topic Burst Prediction via Hybrid Twin-structured Multi-modal Learning

Social media users actively engage in discussions concerning news and events within the dynamic cryptocurrency market, resulting in the widespread dissemination of cryptocurrency-related topics across various platforms. The ongoing monitoring of these topics is crucial to informed investment, effective platform regulation, and heightened community engagement. Nevertheless, the abruptness of bursts in cryptocurrency-related topics presents significant challenges to conventional prediction methods based on features and time series, particularly in understanding their regularity and causation. To address these challenges, this paper introduces Nostredame, an innovative cryptocurrency topic burst prediction method based on hybrid twin-structured multi-modal learning. By employing a defined burst score as a quantitative measure for day-wise topic evolution, the proposed method aims to predict bursts from a multi-modal perspective. This is achieved through hyperbolic temporal encoders that capture temporal behaviors with cross-modal representations in the twin structure, followed by the fusion within the hybrid learning module. Experimental results indicate that cryptocurrency transactions, social media contents, and topic evolution collectively influence topic bursts, and the proposed method notably outperforms all baseline methods across common metrics, highlighting its efficacy in predicting cryptocurrency-related topic bursts.

Keting Yin, Xiaen Sun, Tian Feng
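The abstract quantifies day-wise topic evolution with a burst score but does not define it here, so the following is only an assumed stand-in that compares a day's mention count against a short rolling baseline; the window length and ratio form are illustrative choices, not the paper's definition:

```python
def burst_scores(daily_counts, window=3):
    """Illustrative day-wise burst score: a day's mention count relative
    to the mean of the preceding `window` days. The paper defines its
    own burst score; this is only a stand-in for the idea."""
    scores = []
    for t, c in enumerate(daily_counts):
        past = daily_counts[max(0, t - window):t]
        baseline = sum(past) / len(past) if past else c
        scores.append(c / baseline if baseline else 0.0)
    return scores

# Toy mention counts for one topic: a clear burst on day 3.
counts = [10, 12, 11, 50, 12]
scores = burst_scores(counts)
```

A spike relative to its own recent history yields the largest score, making the burst day stand out as the prediction target.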
EBCPL: A Novel Evidence-Based Method for Concept Prerequisite Relation Learning

Concept prerequisite relation learning (CPL) plays a crucial role in building educational applications such as learning path planning and educational question answering systems. Previous deep learning-based methods usually attempt to extract information relevant to prerequisite relations between concepts from lengthy documents of educational data. However, the process of selecting evidence that can support inference of prerequisite relations still lacks explicitness and interpretability. To explicitly select evidence and utilize it to improve CPL, we propose a novel Evidence-Based method for Concept Prerequisite relation Learning (EBCPL). First, we introduce an evidence extraction method tailored to educational data, which explicitly extracts evidence sentences from these documents. Second, we construct a BiLSTM-based relation extraction model to infer prerequisite relations from the extracted evidence. Our experiments on multiple datasets demonstrate that the proposed method achieves state-of-the-art results in comparison with existing methods.

JiaYu Zhang, XiaoYan Zhang, XiaoFeng Du, TianBo Lu
LongSum: An Efficient Transformer for Long Document Summarization

Pre-trained models have made remarkable strides in fields such as natural language processing and multi-modal learning. However, one limitation of these models is their inability to process extremely long texts effectively, as they often struggle to retain and handle all the essential information. Long document summarization is a challenging task that uses a pre-trained Transformer-based architecture to summarize the contents of documents. However, due to the quadratic cost of self-attention, scaling Transformer-based models to long sequences is prohibitively expensive. Additionally, an information locality pattern commonly exists in text summarization, calling for an inductive bias toward local concentration. Therefore, we propose a new Transformer variant called LongSum, which addresses these challenges with a novel Local-Global Attention that efficiently captures and fuses representative information from both local and global views. Extensive experiments are carried out on multiple datasets from different domains. LongSum outperforms baseline models on summarization performance both quantitatively and qualitatively, demonstrating its effectiveness and efficiency.

Jitong Wei, Yang Gao
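A local-global attention pattern can be sketched as a sparse attention mask: each position attends to a local window, while a few designated global tokens attend everywhere and are attended to by everyone. This illustrates the general pattern only, not necessarily LongSum's exact scheme; the window size and global-token choice below are assumptions:

```python
def local_global_mask(seq_len, window, global_tokens):
    """Boolean attention mask: position i may attend to j if j lies
    within i's local window, or either i or j is a global token."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= window
            glob = i in global_tokens or j in global_tokens
            mask[i][j] = local or glob
    return mask

# 8 tokens, window of 1, token 0 (e.g. a [CLS]-like token) kept global.
mask = local_global_mask(seq_len=8, window=1, global_tokens={0})
```

With a window of size w and g global tokens, the number of attended pairs drops from O(n²) to roughly O(n·(w + g)), which is what makes long inputs tractable.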
Chimera Model of Candidate Soups for Non-Autoregressive Translation

Non-Autoregressive Translation (NAT) models have drawn much attention because of their excellent decoding speed. However, NAT models suffer a significant drop in translation quality compared to Autoregressive Translation (AT) models. Candidate Soups (CandiSoups) is an effective method that makes full use of different candidate translations, significantly improving translation quality for NAT models. However, it needs an additional AT model for re-scoring to achieve its best performance, which slows down inference and consumes more computing resources. In this paper, we propose a Chimera Model framework of CandiSoups (CMCS), which significantly accelerates inference while maintaining the superior performance of CandiSoups. Specifically, by modifying the decoder, we fuse the AT and NAT models to construct a Chimera Model that can perform self-rescoring. Moreover, we propose a novel adaptive training method to train Chimera Models better. Experimental results on two major benchmarks demonstrate the effectiveness of our approach, which significantly improves translation quality while maintaining excellent inference speed.

Huanran Zheng, Wei Zhu, Xiaoling Wang
Multi-Aspect Matching between Disentangled Representations of User Interests and Content for News Recommendation

Personalized news recommendation is a crucial technique for helping users find content of interest among massive amounts of news. Most news recommendation approaches learn a single representation for both users and news, overlooking the nuanced diversity of user interests. Some recent works have focused on learning multi-aspect representations of user interests. However, they ignore that a news article can touch on several aspects of a user's interests, failing to capture the intricate interactions between news content and user preferences. Meanwhile, a user may occasionally click on some news by mistake. In this paper, we propose a novel news recommendation model that learns disentangled representations for both user interests and news content, capturing the characteristics of different aspects of each. Aspect-wise matching is then applied to capture the fine-grained interactions between news and users, and a disentanglement loss is proposed to encourage independence between aspects. Furthermore, we leverage contrastive learning at the news level to emphasize aspect-related information, and at the user level to mitigate the impact of misclicked news and thus further improve the model's robustness. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model.

Yingzhi Miao, Martin Pavlovski, Zhiqiang Chen, Fang Zhou
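Aspect-wise matching between disentangled representations can be sketched as a per-aspect dot product followed by aggregation. The max aggregation, aspect names, and two-aspect toy vectors below are assumptions for illustration, not the paper's exact scoring function:

```python
def aspect_match_score(user_aspects, news_aspects):
    """Aspect-wise matching: dot product per aspect, aggregated by max,
    so a single strongly matching aspect can drive the score."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(dot(u, n) for u, n in zip(user_aspects, news_aspects))

# Two disentangled aspects (say "sports" and "politics");
# this user's interest is concentrated in the first aspect.
user = [[1.0, 0.0], [0.1, 0.1]]
sports_news = [[0.9, 0.1], [0.0, 0.0]]
politics_news = [[0.0, 0.1], [0.2, 0.1]]
```

Because each aspect is matched separately, a user's strong sports interest lifts the score of sports news even when the other aspects barely match, which a single entangled vector would blur.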
A Dynamic Pre-trained Model for Chinese Classical Poetry

Chinese classical poetry, inheriting thousands of years of Chinese civilization, reflects the social ethos and cultural aspects of its times. In recent years, researchers have increasingly used artificial intelligence to analyze Chinese classical poetry, and many of these studies rely on pre-trained language models. Unfortunately, Chinese classical poetry has a unique form, and directly using a general pre-trained language model is ineffective. Its long time span, frequent changes in word meaning, and small amount of training data limit the development of pre-trained models for Chinese classical poetry. To address these challenges, we construct a dynamic pre-trained model for Chinese classical poetry based on SikuBERT, using contrastive learning and a multi-task training strategy. During training, we search for hard negative and positive examples and use them for data augmentation, and we introduce a sliding window to dynamically learn poetry information. Compared to the encoding provided by the baseline model, our model's encoding achieves better performance on the downstream tasks of classification, translation, and poem-poet matching.

Xiaotong Wang, Xuanning Liu, Haorui Wang, Bin Wu
Detecting Incoming Fake News in News Streams via Efficient Topic-Based Correlation

The news stream on social media has become the primary source for instant insights into real-world topics. However, the rapid spread of fake news poses a significant challenge that must be strictly addressed. Existing fake news detection methods analyze individual articles in a temporally sequenced news stream but often overlook topic-based correlations. Consequently, they are inefficient at detecting incoming fake news on topics similar to those with prior fake likelihoods. To improve the detection efficiency of incoming fake news in news streams, this study introduces an efficient Topic-based Correlation Framework (TCF) consisting of two innovative components: a dynamic correlation mapping module and a transfer correlation learning module. The dynamic correlation mapping module uses an incremental clustering approach to establish mapping relationships between news articles and topics, automatically classifying input news passages into familiar and unfamiliar topic domains. The transfer correlation learning module then employs a domain adaptation network to detect incoming fake news across familiar and unfamiliar topic domains. With this approach, our framework demonstrates superior performance compared to existing state-of-the-art methods in detecting incoming fake news in news streams.

Xiaomei Wei, Yongcheng Zhang, Ruohan Yang, Huan Wang
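The incremental clustering behind the dynamic correlation mapping module can be sketched as nearest-centroid assignment with a distance threshold: a close article joins a familiar topic (whose centroid is updated), a distant one opens an unfamiliar topic. The threshold, 2-D embeddings, and running-average centroid update are illustrative assumptions, not the paper's learned module:

```python
import math

def incremental_topic_assign(embedding, topics, threshold=1.0):
    """Map an incoming news embedding to the nearest existing topic
    centroid if within `threshold` (familiar), otherwise open a new
    topic (unfamiliar). Mutates `topics` in place."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    if topics:
        tid, d = min(((i, dist(embedding, c)) for i, c in enumerate(topics)),
                     key=lambda x: x[1])
        if d <= threshold:
            # Familiar topic: nudge its centroid toward the new article.
            cx, cy = topics[tid]
            topics[tid] = ((cx + embedding[0]) / 2, (cy + embedding[1]) / 2)
            return tid, "familiar"
    topics.append(embedding)
    return len(topics) - 1, "unfamiliar"

topics = []
a = incremental_topic_assign((0.0, 0.0), topics)   # first article: new topic
b = incremental_topic_assign((0.2, 0.1), topics)   # nearby: familiar topic
c = incremental_topic_assign((5.0, 5.0), topics)   # far away: unfamiliar topic
```

The familiar/unfamiliar label is exactly what routes an article to the appropriate branch of the transfer correlation learning module.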
Document Hashing by Exploiting Noisy Neighborhood Information with Fault-Tolerant Mutual-Information-Preserving VAE

Semantic hashing is an effective technique for large-scale information retrieval. Some methods have suggested learning high-quality binary hash codes of documents by leveraging both document contents and neighborhood information. However, erroneous connections often exist in the provided neighborhood information, yet these models never take them into account. To alleviate their negative impact on hash code learning, we first build a basic generative model that simultaneously models the document content and neighborhood. We then show that this generative model can be placed under a more general framework, dubbed the mutual-information (MI) preserving variational auto-encoder (VAE). Capitalizing on this connection, we further develop a new hashing method that tolerates the noisy character of the neighborhood information by proposing a novel fault-tolerant lower bound for MI. Extensive experiments are conducted on six real-world datasets, and significant performance gains are observed over current state-of-the-art models.

Jiayang Chen, Qinliang Su, Zetong Li, Hai Wan, Defu Lian
Few-Shot Log Analysis with Prompt-Based Multi-task Transfer Learning

Log analysis is critical to software system operation and maintenance. Sparse annotated log data can reduce the performance of mainstream automated log analysis methods. Fortunately, the various types of log analysis tasks can mutually promote each other's performance. We argue that leveraging knowledge learned from source log analysis tasks can improve the performance of target log analysis tasks in few-shot scenarios. In this paper, we propose LogMT, a two-stage method that leverages deep prompt tuning to learn log analysis knowledge from multiple source tasks and transfers it to few-shot target tasks through a mixture-of-experts (MoE) router. To evaluate LogMT's performance, we conduct nine few-shot log analysis experiments, each consisting of eight source log analysis tasks and one few-shot target task. The results demonstrate that LogMT achieves state-of-the-art performance on all nine experiments. The source code of this paper can be found at https://github.com/nonauthor/LogMT .

Mingjie Zhou, Weidong Yang, Lipeng Ma, Sihang Jiang, Bo Xu, Yanghua Xiao
Post-hoc Facts augmented Legal Judgment Prediction

Legal Judgment Prediction (LJP) has emerged as a fundamental aspect of legal AI systems, encompassing both civil and criminal cases. In this paper, we focus on civil LJP, which aims to predict the judgment result based on both case facts and the plaintiff's claim. Most existing approaches rely on post-hoc facts summarized by judges; recent work has instead proposed leveraging the actual facts (i.e., court debate data) to make judgment predictions. However, this approach still suffers from limitations (e.g., data noise), resulting in suboptimal predictions. To this end, we propose a post-hoc facts augmented LJP method (PF-LJP) to further explore this “real pattern”. Specifically, we encourage the court debate data to move closer to its corresponding post-hoc fact in the vector space using contrastive learning and a distribution alignment method. During inference, only the court debate data is used to yield results, without any post-hoc facts. Experimental results on a real-world dataset validate the effectiveness of PF-LJP.

Yanqing An, Linan Yue, Weibo Gao, Kai Zhang, Qi Liu
SGDG: Improving Transformer Seq2Seq Models through Span Generation and Denoise Generation

Recently, Transformer-based generative models have made remarkable advancements in various domains. However, the generation quality and inference stability of existing approaches are unsatisfactory due to the local overfitting problem. To this end, we propose a novel Span Generation and Denoise Generation strategy, SGDG, to alleviate this problem. Span Generation enhances the model's ability to globally fit the target text by predicting a continuous segment (span) simultaneously with a span attention mechanism. Additionally, we incorporate Denoise Generation, which randomly replaces the token from the most recent step with prediction noise to prevent the model from relying excessively on local historical information. Our extensive experiments on three tasks (dialogue generation, summarization, and question generation) demonstrate the improved generation quality of Transformer Seq2Seq models with the proposed SGDG over existing strategies.

Zhenfei Yang, Beiming Yu, Chenxiao Dou, Qian Zhang, Yansong Chua
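Denoise Generation can be sketched as noising the teacher-forced decoder inputs: at each step, with some probability, the token from the most recent step is replaced by a random vocabulary token so the model cannot over-rely on local history. The function name, noise probability, and token-level replacement below are illustrative assumptions, not the paper's exact procedure:

```python
import random

def denoise_inputs(target_tokens, vocab, noise_prob=0.3, seed=0):
    """Build shifted-right decoder inputs for teacher forcing, then with
    probability `noise_prob` replace each step's most recent history
    token with a random vocabulary token (the denoise perturbation)."""
    rng = random.Random(seed)
    inputs = ["<bos>"] + list(target_tokens[:-1])   # shifted-right inputs
    for t in range(1, len(inputs)):
        if rng.random() < noise_prob:
            inputs[t] = rng.choice(vocab)           # noise step t's history
    return inputs

vocab = ["a", "b", "c", "d"]
noisy = denoise_inputs(["a", "b", "c", "d"], vocab, noise_prob=0.5, seed=1)
```

Training against these perturbed histories forces the model to lean on the broader context rather than on the immediately preceding token, which is the anti-local-overfitting effect the abstract describes.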
What if User Preferences Shifts: Causal Disentanglement for News Recommendation

In the realm of personalized news recommendation (NR), prevailing approaches assist users in discovering content of interest under the assumption that user preferences are invariant. Unfortunately, violations of this assumption are common in realistic scenarios where user preferences shift. For example, concerning sports news, users in South America typically tend to be interested in football, whereas basketball attracts more interest in North America. To bridge this gap, we formulate a novel NR problem, Generalizable NR against Shifted Preference (GNR-SP), which allows user preferences to shift. From a causal perspective, we address GNR-SP by disentangling the representations of news content and user preference, where popularity serves as an observed confounder that influences both semantic content and user preferences simultaneously. To this end, we propose a Causal Disentanglement for News Recommendation (CDNR) framework that optimizes a Transformer-based Identifiable Variational Autoencoder (T-iVAE). Our experiments on two real-world datasets showcase the efficacy of our model in handling news recommendation under preference shifts.

Yingzhi Miao, Zhiqiang Chen, Fang Zhou
Backmatter
Title
Database Systems for Advanced Applications
Edited by
Makoto Onizuka
Jae-Gil Lee
Yongxin Tong
Chuan Xiao
Yoshiharu Ishikawa
Sihem Amer-Yahia
H. V. Jagadish
Kejing Lu
Copyright year
2025
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9757-79-4
Print ISBN
978-981-9757-78-7
DOI
https://doi.org/10.1007/978-981-97-5779-4

