
Chinese Computational Linguistics

24th China National Conference, CCL 2025, Jinan, China, August 11–14, 2025, Proceedings

  • 2026
  • Book

About this book

This book constitutes the refereed proceedings of the 24th China National Conference on Computational Linguistics, CCL 2025, held in Jinan, China, during August 11–14, 2025. The book covers key topics in the areas of Large Language Models; Machine Translation and Multilingual Information Processing; Text Generation, Dialogue and Summarization; Information Retrieval, Text Classification and QA; Language Resource and Evaluation; NLP Applications; and the ACL ARR Fast Track.

Table of Contents

Frontmatter

NLP Applications

Frontmatter
Self-supervised Contrastive Learning for Content-Centric Speech Representation
Abstract
Self-supervised learning (SSL) speech models have achieved remarkable performance across various tasks, with the learned representations often exhibiting a high degree of generality and applicability to multiple downstream tasks. However, these representations contain both speech content and some paralinguistic information, which may be redundant for content-focused tasks. Decoupling this redundant information is challenging. To address this issue, we propose a Self-Supervised Contrastive Representation Learning method (SSCRL), which effectively disentangles paralinguistic information from speech content by aligning similar content speech representations in the feature space using self-supervised contrastive learning with pitch perturbation and speaker perturbation features. Experimental results demonstrate that the proposed method, when fine-tuned on the LibriSpeech 100-hour dataset, achieves superior performance across all content-related tasks in the SUPERB Benchmark, generally outperforming prior approaches.
Jinlong Li, Ling Dong, Wenjun Wang, Zhengtao Yu, Shengxiang Gao
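As a rough illustration of the contrastive alignment this abstract describes, the sketch below assumes an InfoNCE-style objective over utterance-level features from an original and a pitch-/speaker-perturbed view of the same content; names and shapes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(orig_repr, perturbed_repr, temperature=0.1):
    """InfoNCE-style loss pulling together representations of the same
    speech content under pitch/speaker perturbation (illustrative sketch).

    orig_repr, perturbed_repr: (batch, dim) utterance-level features from
    the original and perturbed views of the same content.
    """
    z1 = F.normalize(orig_repr, dim=-1)
    z2 = F.normalize(perturbed_repr, dim=-1)
    logits = z1 @ z2.t() / temperature   # (batch, batch) pairwise similarity
    targets = torch.arange(z1.size(0))   # positives sit on the diagonal;
    # other utterances in the batch act as negatives.
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for SSL-model outputs.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```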
A Chunk-Based Chain of Thought Prompting Method for Mitigating Over-Correction in Chinese Grammatical Error Correction
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in semantic understanding and text generation. However, when applied to downstream tasks such as Chinese Grammatical Error Correction (CGEC), they often suffer from over-correction issues, where grammatically correct parts are mistakenly altered. Moreover, while some existing methods aim to address over-correction in Sequence-to-Sequence (Seq2Seq) models, they are difficult to adapt to decoder-only LLMs. To address these challenges, we propose a Chunk-based Chain of Thought (CoT) Prompting Method. Our study is structured into three key components. Initially, we identify specific types of grammatical errors in the input sentences. Following this, sentences are segmented into smaller chunks, and each chunk is analyzed to match the detected error types. Ultimately, the aggregated information guides LLMs in performing localized corrections within the input sentences. The experimental results demonstrate the effectiveness of our method in mitigating over-correction, achieving a higher \(F_{0.5}\) score while maintaining robust grammatical error correction performance. This method provides innovative perspectives on employing LLMs to enhance the precision and granularity of the CGEC task.
Xinquan Chang, Junguo Zhu
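The three components the abstract walks through (error-type detection, chunk segmentation, guided localized correction) suggest a prompt of roughly the following shape. The template wording is a hypothetical sketch, not the paper's actual prompt.

```python
def build_chunk_cot_prompt(sentence, error_types, chunks):
    """Compose a chunk-based chain-of-thought prompt for CGEC.

    error_types: error labels detected in step 1, e.g. ["word order"]
    chunks: (chunk_text, matched_error_types) pairs from step 2, where
            the matched list may be empty for error-free chunks.
    """
    lines = [f"Sentence: {sentence}",
             f"Detected error types: {', '.join(error_types) or 'none'}",
             "Analyze each chunk and correct ONLY chunks with errors:"]
    for i, (chunk, matched) in enumerate(chunks, 1):
        status = ", ".join(matched) if matched else "no error, keep as is"
        lines.append(f"  Chunk {i}: {chunk}  [{status}]")
    lines.append("Output the corrected sentence, leaving error-free "
                 "chunks unchanged to avoid over-correction.")
    return "\n".join(lines)

prompt = build_chunk_cot_prompt(
    "我昨天去了在图书馆。", ["redundant preposition"],
    [("我昨天去了", []), ("在图书馆。", ["redundant preposition"])])
```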
Linguistic Differences Between AI and Human Comments in Weibo: Detect AI-Generated Text Through Stylometric Features
Abstract
LLM-enhanced social robots (LLM-Bots) generate responses similar to human interactions and pose risks to social media platforms. Distinguishing AI-generated texts (AIGTs) from human-written content is important for mitigating these threats. However, current AIGT detection technologies face limitations in social media contexts, including inadequate performance on short texts, poor interpretability, and a reliance on synthetic datasets. To address these challenges, this study first constructs a social media dataset composed of 463,382 Weibo comments to capture real-world interactions between LLM-Bots and human users. Second, a stylometric feature set tailored to Chinese social media is developed. We conduct a comparative analysis of these features to reveal linguistic differences between human-written and AI-generated comments. Third, we propose a lightweight stylometric feature-based self-attention classifier (SFSC). This model achieves a strong F1-score of 91.8% for detecting AI-generated short comments in Chinese while maintaining low computational overhead. Additionally, we provide interpretable criteria for the SFSC in AIGT detection through feature importance analysis. This study advances detection for AI-generated short texts in Chinese social media.
Ziqi Li, Qi Zhang
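A minimal stand-in for a stylometric-feature self-attention classifier of the kind described: each scalar stylometric feature is embedded as a token, attended over, and pooled into a human-vs-AI decision. Layer sizes and structure are assumptions for illustration, not the SFSC architecture itself.

```python
import torch
import torch.nn as nn

class SFSCSketch(nn.Module):
    """Lightweight self-attention classifier over stylometric features."""
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Linear(1, d_model)   # embed each scalar feature
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)
        self.head = nn.Linear(d_model, 2)    # human vs. AI-generated

    def forward(self, feats):                # feats: (batch, n_features)
        x = self.embed(feats.unsqueeze(-1))  # (batch, n_features, d_model)
        x, _ = self.attn(x, x, x)            # self-attention across features
        return self.head(x.mean(dim=1))      # pooled classification logits

# Toy usage: 4 comments, each described by 30 stylometric features.
logits = SFSCSketch()(torch.randn(4, 30))
```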
Fine-Tuning GEC Model Based on Language Family Corpus
Abstract
It is widely known that the first language (L1) of English learners influences their language study, causing them to make biased errors. However, research on using L1 information to improve Grammatical Error Correction (GEC) models is relatively limited. Among the limited research, a common method is to train a set of GEC models, where each model is trained on a corpus from one (and only one) specific L1 background. This method has been proven effective, but the waste of training/fine-tuning data makes it suffer from data limitation issues. This paper introduces a novel method to address this issue by exploiting the linguistic similarities between a language family and its member languages. We expand the fine-tuning data from one specific L1 background to that of its language family, increasing the quantity substantially. We use the Italic language family corpus as our language family corpus and experiment with two approaches for two situations, differing mainly in development data. The results show that, for the approach that uses the Italic language family corpus as the fine-tuning data together with development data whose L1 background matches that of the test data, the GEC models improve clearly; however, the way the models are influenced is not uniform and varies by error type.
Yitao Liu, Mark Dras
Statistically Optimized SGNS Model: Enhancing Word Vector Representation with Global Semantic Weight
Abstract
Addressing the limitations of the Skip-gram with Negative Sampling (SGNS) model related to negative sampling, subsampling, and its fixed context window mechanism, this paper first presents an in-depth statistical analysis of the optimal solution for SGNS matrix factorization, deriving the theoretically optimal distribution for negative sampling. Building upon this analysis, we propose the concept of Global Semantic Weight (GSW), derived from Pointwise Mutual Information (PMI). We integrate GSW with word frequency information to improve the effectiveness of both negative sampling and subsampling. Furthermore, we design dynamic adjustment mechanisms for the context window size and the number of negative samples based on GSW, enabling the model to adaptively capture contextual information commensurate with the semantic importance of the center word. Notably, our optimized model maintains the same time complexity as the original SGNS implementation. Experimental results demonstrate that our proposed model achieves competitive performance against state-of-the-art word embedding models, including SGNS, CBOW, and GloVe, across multiple benchmark tasks. Compared with current mainstream dynamic word vector models, this work emphasizes achieving a balance between efficiency and performance within a static embedding framework, and offers a potential complement and support for complex models such as LLMs.
Yulin Liu, Feng Xiong, Minghui Wu, Wanwei Liu
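Since GSW is described as derived from PMI, the following toy computes a positive-PMI-based per-word weight from co-occurrence counts. This is a loose reading of the abstract; the paper's exact GSW formula is not reproduced here.

```python
import math
from collections import Counter

def global_semantic_weights(corpus, window=5):
    """Toy PPMI-derived global weight per word, in the spirit of GSW."""
    word_count, pair_count = Counter(), Counter()
    total_pairs = 0
    for sent in corpus:
        for i, w in enumerate(sent):
            word_count[w] += 1
            for c in sent[max(0, i - window): i]:   # left-context pairs
                pair_count[(w, c)] += 1
                pair_count[(c, w)] += 1
                total_pairs += 2
    total_words = sum(word_count.values())
    gsw = {}
    for w in word_count:
        # Average positive PMI of w with its observed context words:
        # PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ).
        pmis = [max(0.0, math.log((pair_count[(w, c)] / total_pairs) /
                    ((word_count[w] / total_words) *
                     (word_count[c] / total_words))))
                for c in word_count if pair_count[(w, c)] > 0]
        gsw[w] = sum(pmis) / len(pmis) if pmis else 0.0
    return gsw

weights = global_semantic_weights([["the", "cat", "sat"],
                                   ["the", "dog", "sat"]])
```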
HFSD-V2C: Zero-Shot Visual Voice Cloning Via Hierarchical Face-Styled Diffusion Model
Abstract
The goal of this work is zero-shot visual voice cloning (ZS-V2C), which aims to generate speech samples with unseen speaker identity and prosody derived from a video clip and an acoustic reference. ZS-V2C presents two major challenges: 1) unseen speaker modeling and 2) unseen prosody modeling. Unlike previous works, we propose a novel ZS-V2C framework that incorporates a hierarchical face-styled diffusion model (HFSD-V2C). Specifically, we first leverage cross-modal biometrics to predict unseen speaker embeddings based on facial features. Then, we jointly model the unseen prosodic features at the text, speech, and video levels. Finally, a diffusion model is constructed based on the embeddings of the unseen speaker and prosodic features, enabling the generation of expressive and diverse speech. Extensive experiments on the LRS2 and GRID benchmark datasets demonstrate the superior performance of our proposed method.
Yaping Liu, Linqin Wang, Shengxiang Gao, Zhengtao Yu, Ling Dong
CRAF: Cross-Modal Representation Alignment and Fusion for Speech Translation
Abstract
The end-to-end speech translation task involves directly transforming speech into the text of another language, bypassing the generation of an intermediate transcription. However, existing methods may lose key information during cross-modal length alignment and fail to effectively integrate different representations, resulting in low quality of the fused representation. To address these issues, we propose an efficient method named CRAF for effective cross-modal alignment and fusion for speech translation, which reduces information loss and enhances the integration of cross-modal representations. First, CRAF minimizes information loss by improving the cross-modal length alignment, ensuring the alignment process retains more critical information from the speech modality. Second, CRAF strengthens the integration of cross-modal representations by allowing the model to combine complementary features from diverse modalities, enhancing its capacity to concentrate on the most pertinent and critical information. Finally, we evaluate CRAF by conducting extensive experiments on eight language pairs from the MuST-C dataset. Experiments show that the average BLEU score of CRAF reaches 29.0, outperforming other comparison methods. Our code is available at https://github.com/wu-wen-zhou/first/tree/master.
Zhenbei Guo, Wenzhou Wu, Hua Lai, Yan Xiang, Yuxin Huang, Zhengtao Yu
Lao-English Code-Switched Speech Synthesis Via Neural Codec Language Modeling
Abstract
This paper addresses the challenges of data scarcity and limited speaker resources in Lao-English code-switched speech synthesis. We propose a neural encoder-decoder-based method for mixed-lingual speech synthesis. The method first extracts phoneme-level speech representations and employs a dot-product attention mechanism to map Lao and English phonemes into a shared latent space, thereby enhancing the model’s capability to represent cross-lingual phonetic information. In addition, a language ID embedding module is extended to explicitly indicate the language of each input token, helping the model distinguish and adapt to language-specific pronunciation characteristics. Experiments are conducted on the open-source English dataset LibriTTS and a proprietary Lao speech corpus. Both subjective evaluations (MOS, AB preference tests) and objective metrics (RMSE) demonstrate that the proposed approach significantly outperforms the baseline VALL-E X model in terms of naturalness and language-switching fluency. Furthermore, ablation studies confirm that both the shared phoneme latent space and the language ID module play critical roles in improving synthesis quality. This approach offers a novel solution for integrating low-resource languages into mixed-lingual speech synthesis.
Yaping Liu, Linqin Wang, Shengxiang Gao, Zhengtao Yu, Ling Dong, Tian Tian
UMAD: Enhancing LLM Debiasing via Multi-agent Debate and Token-Level Bias Interpretation
Abstract
Textual data often contain biases that compromise fairness in AI systems, particularly in sensitive areas such as gender, race, and politics. While large language models (LLMs) have shown success across various tasks, they still face limitations due to inherent biases within the models and restrictive safety policies that hinder direct bias mitigation. To overcome these challenges, we propose UMAD (Unsupervised Multi-Agent Debate), a novel framework that leverages a Multi-Agent Debate mechanism alongside Best-Worst Scaling (BWS) to foster more effective discussions among LLMs, facilitating the identification of biases. By combining this with gradient-based interpretation techniques, UMAD extracts token-level bias insights, which are then integrated into models using in-context learning. This enhances the debiasing performance, as shown by our experiments across three bias categories—gender, religion, and politics—using five different LLMs. Our approach demonstrates significant improvements in metrics, with large models matching or even surpassing GPT-4 in Style Accuracy (STA). We release our code at: https://github.com/Couen/UMAD.git.
[Content warning from the paper: “Warning: this paper contains content that may be offensive or upsetting.”]
Hanwen Gu, Jie Ma, Ying Qin, Ling Hu
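The token-level bias extraction via gradients might look like standard gradient-times-input attribution, sketched below against a Hugging Face-style model interface. `bias_logit_fn` is a hypothetical scoring hook, not the paper's API, and the paper's exact attribution variant may differ.

```python
import torch

def token_bias_attribution(model, input_ids, bias_logit_fn):
    """Gradient x input attribution over token embeddings.

    model: a Hugging Face-style model accepting inputs_embeds.
    bias_logit_fn: maps the model output to a scalar bias score.
    Returns a per-token relevance score of shape (batch, seq_len).
    """
    emb_layer = model.get_input_embeddings()
    # Re-embed tokens as a leaf tensor so we can take gradients w.r.t. it.
    embeds = emb_layer(input_ids).detach().requires_grad_(True)
    score = bias_logit_fn(model(inputs_embeds=embeds))  # scalar
    score.backward()
    # Per-token relevance: dot product of gradient and embedding.
    return (embeds.grad * embeds).sum(dim=-1)
```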
Instruction-Driven In-Context Learning for Domain-Specific Chinese Spelling Correction
Abstract
This paper investigates domain adaptation in Chinese Spelling Correction (CSC) based on the instruction-following ability of large language models (LLMs). In the instructions, we include a variety of domain-specific requirements for spelling correction, such as the domain’s formality or writing tone, which go beyond the considerations of previous CSC research. To evaluate the LLMs’ performance on instruction-following, we propose IDSpell, a semi-supervised construction pipeline for a CSC dataset containing a wide range of domain-specific sentences along with specific instructions. We construct a dataset with IDSpell and evaluate it on Qwen2.5 and GPT-4o, where we find that instructions exert a meaningful influence on correction, increasing the average F1 score by 10.4% compared to when instructions are not provided. To further enhance the result, we propose Contrastive Prompting, a method incorporating contrastive false examples into the prompt to better guide the model to understand the instruction. Experiments demonstrate that our method outperforms baseline prompting with an average improvement of 5.4%. Our dataset and code are publicly available for further research.
Hyunsoo Park, Hongqiu Wu, Hai Zhao
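Contrastive Prompting, as described, incorporates contrastive false examples into the prompt. A hypothetical template builder follows; the wording is an assumption rather than the paper's prompt.

```python
def contrastive_prompt(instruction, sentence, false_examples):
    """Build a CSC prompt that shows the model what NOT to do.

    false_examples: (input, bad_output, why_it_violates) triples that
    contrast with the domain-specific instruction.
    """
    parts = [f"Instruction: {instruction}",
             "Counter-examples of corrections that VIOLATE the instruction:"]
    for src, bad, why in false_examples:
        parts.append(f"  Input: {src}\n  Bad output: {bad}  ({why})")
    parts.append(f"Now correct the following, obeying the instruction: "
                 f"{sentence}")
    return "\n".join(parts)

p = contrastive_prompt(
    "Correct spelling only; keep the formal legal tone.",
    "本合同自双方签子之日起生效。",
    [("原告向法院提起诉讼。", "原告向法院打官司。",
      "rewrote wording instead of fixing spelling")])
```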
DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation
Abstract
This paper introduces DualReward, a novel reinforcement learning framework for automatic distractor generation in cloze tests. Unlike conventional approaches that rely primarily on supervised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements over state-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48–3.86% in P@1) on diverse, cross-domain data (MCQ), suggesting its particular effectiveness for handling varied question types and domains. Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation.
Tianyou Huang, Xinglu Chen, Jingshen Zhang, Xinying Qiu, Ruiying Niu
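The dual reward structure with adaptive scaling could be pictured as follows: gold (human-created) distractors receive a fixed trusted reward, while model-generated candidates are scored and rescaled by model confidence. The weighting scheme below is an assumption; the paper's exact formulation is not reproduced here.

```python
def dual_reward(candidate, gold_set, quality_fn, confidence, alpha=0.5):
    """Sketch of a dual reward with confidence-adaptive scaling.

    gold_set: human-created gold-standard distractors.
    quality_fn: scores a model-generated candidate in [0, 1].
    confidence: model confidence in [0, 1] for this candidate.
    """
    if candidate in gold_set:
        return 1.0                        # trusted human example
    # Low confidence -> shrink the reward toward the conservative floor.
    scale = alpha + (1 - alpha) * confidence
    return scale * quality_fn(candidate)

r = dual_reward("banana", {"apple", "pear"}, lambda c: 0.7, confidence=0.9)
```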

Large Language Models

Frontmatter
RankLLM: A Multi-Criteria Decision-Making Method for LLM Performance Evaluation in Sentiment Analysis
Abstract
Large Language Models (LLMs) have made significant advancements in sentiment analysis, yet their quality and reliability vary widely. Existing LLM evaluation studies are limited in scope, lack a comprehensive framework for integrating diverse capabilities, and fail to quantify the impact of prompt design on performance. To address these gaps, this paper introduces a set of LLM evaluation criteria with detailed explanations and mathematical formulations, aiding users in understanding LLM limitations and selecting the most suitable model for sentiment analysis. Using these criteria, we apply the Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS), a classic decision-making method, to rank the performance of LLMs in sentiment analysis. We evaluated six popular LLMs on three Twitter datasets covering different topics and analyzed the impact of prompt design by assessing model-prompt combinations. Additionally, a validation experiment on a publicly available annotated dataset further confirms our ranking results. Finally, our findings offer valuable insights into the evaluation and selection of LLMs for sentiment analysis.
Huzhi Xue, Butian Zhao, Haihua Xie, Zeyu Sun
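TOPSIS itself is a standard multi-criteria decision-making method: normalize the decision matrix, weight it, and rank alternatives by relative closeness to the ideal solution. A compact implementation, assuming all criteria are benefit-type (larger is better); the paper's specific criteria and weights are not reproduced.

```python
import numpy as np

def topsis(scores, weights):
    """Rank alternatives with TOPSIS.

    scores: (n_models, n_criteria) matrix, larger is better everywhere.
    weights: (n_criteria,) nonnegative weights summing to 1.
    """
    norm = scores / np.linalg.norm(scores, axis=0)  # vector-normalize columns
    v = norm * weights                              # weighted matrix
    ideal, anti = v.max(axis=0), v.min(axis=0)      # best/worst per criterion
    d_best = np.linalg.norm(v - ideal, axis=1)      # distance to ideal
    d_worst = np.linalg.norm(v - anti, axis=1)      # distance to anti-ideal
    closeness = d_worst / (d_best + d_worst)        # higher = better
    return np.argsort(-closeness), closeness

# Toy example: three LLMs scored on four evaluation criteria.
rank, c = topsis(np.array([[0.8, 0.7, 0.9, 0.6],
                           [0.6, 0.9, 0.7, 0.8],
                           [0.9, 0.6, 0.8, 0.7]]),
                 np.array([0.3, 0.3, 0.2, 0.2]))
```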
MQM-MSC: Enhancing Translation Quality Estimation Interpretability with Mask-Driven Self-correction in Large Language Models
Abstract
Large Language Models (LLMs) have demonstrated significant potential in interpretable translation quality estimation by providing both holistic ratings and fine-grained feedback. However, state-of-the-art methods, such as GEMBA-MQM, still suffer from an excessive number of false positives in error prediction, leading to misalignment with human annotations and reducing interpretability. To address this issue, we propose MQM-MSC, a novel training-free framework that employs a mask-driven self-correction (MSC) mechanism. The core of MSC is to use masks to highlight error spans in the initial prediction, enabling the model to re-evaluate these masked portions and verify their correctness. This approach mirrors human cognitive processes: when individuals express inconsistent judgments about the same issue at different times, it often indicates that their initial assessment was flawed. Similarly, MSC exploits contradictions between two evaluations to identify and filter false positives, thereby improving the accuracy and reliability of error annotations. Experimental results show that MQM-MSC effectively reduces false positives across four LLMs and three language pairs, consistently improving the reliability and quality of error annotations in the GEMBA-MQM approach.
Guanghui Cai, Junguo Zhu
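The mask-driven self-correction loop can be pictured as: mask each predicted error span, ask the model to re-judge it, and keep only the spans where the second judgment agrees with the first. The verification rule and the `llm_eval` interface below are assumptions for illustration, not the paper's design.

```python
def mask_self_correct(llm_eval, source, translation, initial_spans):
    """Filter false-positive error spans via a second masked evaluation.

    llm_eval: callable(source, masked_translation, span) -> bool, asking
    the LLM whether the masked span really contains a translation error.
    """
    confirmed = []
    for span in initial_spans:
        masked = translation.replace(span, "[MASK]")
        # Keep the error only if the re-evaluation agrees; a contradictory
        # second judgment marks the span as a likely false positive.
        if llm_eval(source, masked, span):
            confirmed.append(span)
    return confirmed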
Improving Abstract Reasoning Ability of Large Language Models Through Mixture Program-Based Data Synthesis
Abstract
Abstract reasoning is a challenging task that involves identifying patterns from limited input-output grids and applying them to new grids. With the development of large language models (LLMs), recent studies attempt to transfer the problems to textual format and tackle abstract reasoning tasks using models such as GPT-4. However, the overall accuracy is still low, which also results in the poor quality of abstract reasoning data directly synthesized by GPT-4, making it unsuitable as effective fine-tuning data. In this paper, we propose mixture program-based data synthesis strategies, including low-level code-based synthesis, high-level DSL-based synthesis, and shuffle-based synthesis. Through these strategies, we construct diverse and valid abstract reasoning instruction data to help improve the general abstract reasoning ability of LLMs across multiple datasets. Experimental results show that, by supervised fine-tuning of Qwen-2.5-7B on our synthesized instruction data, the resulting model shows improved abstract reasoning ability and outperforms various strong baseline LLMs, including the closed-source model GPT-4 and open-source models such as LLaMA-3 and Qwen-2.5. We release the logs from GPT and our model at https://github.com/szu-tera/ARC.
Yile Wang, Hui Huang
RJAG: Retrieval Judgment Augmented Generation
Abstract
Large Language Models (LLMs) inevitably suffer from hallucinations, as relying solely on their parametric knowledge cannot guarantee the accuracy of generated content. To enhance text generation, retrieval-augmented generation (RAG) has been proposed to incorporate external knowledge. However, its effectiveness heavily depends on the relevance of retrieved documents, which poses a critical challenge: how to ensure the accuracy and reliability of model responses when retrieval results are inaccurate. To tackle this challenge, we propose Retrieval Judgment Augmented Generation (RJAG), a method that enhances RAG through an LLM-driven fine-grained relevance judgment mechanism and a task-adaptive knowledge combination strategy. RJAG judges and dynamically combines retrieved documents for both open-ended generation and closed-ended selection tasks. Additionally, large-scale web search is included to expand the knowledge beyond static corpora. Experimental results on multiple benchmarks show that RJAG outperforms existing RAG methods, significantly enhancing accuracy and reliability while maintaining the system’s simplicity. Code is available at https://github.com/wangkz2023/RJAG.
Kuangzhi Wang, Zhenhua Hu, Min Ren, Xiangzhi Tao
Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
Abstract
In real-world scenarios, large language models (LLMs) can serve as assistants to help users accomplish their jobs and support the development of advanced applications. For the wide application of LLMs, inference efficiency is an essential concern that has been widely studied in existing work, accompanied by numerous optimization algorithms and code libraries to improve it. Nonetheless, users still find it challenging to compare the effectiveness of all the above methods and understand the underlying mechanisms. In this work, we propose a coarse-to-fine method that encompasses both experimental and analytical components. This method can be applied across various models and inference libraries. Specifically, we examine four usage scenarios within two practical applications. We further provide both theoretical and empirical fine-grained analyses of each module in the Transformer architecture. Our methods can be a general and invaluable resource for researchers to evaluate various code libraries and improve inference strategies across different LLMs. We open-source the supporting dataset, code, and evaluation scripts at the link: https://github.com/RUCAIBox/Inference-Efficiency-Evaluation.
Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen

Language Resource and Evaluation

Frontmatter
Unveiling the Linguistic Acceptability Judgments of Large Language Models in Multilingual Contexts
Abstract
Linguistic acceptability judgments are essential for evaluating how language models internalize human-like grammatical knowledge. Though some studies have evaluated large language models (LLMs) in this context, existing research lacks systematic exploration of diverse learning paradigms in a multilingual setting. In this paper, we present the first multilingual evaluation of LLMs across four languages (English, Chinese, Japanese, and Russian) in the field of linguistic acceptability. Our evaluation spans both general-purpose (i.e., GPT-4o, GPT-4o mini, DeepSeek-V3, GLM-4-32B, and the Qwen series) and reasoning-oriented (QwQ-32B-Preview and DeepSeek-R1-32B) models under zero-shot prompting as well as monolingual, cross-lingual, and multilingual fine-tuning settings, with comparisons to pre-trained language model (PLM) baselines. Our analysis highlights the strong generalizability of large-scale LLMs through zero-shot prompting, the challenges of fine-tuning small-sized LLMs with skewed training data, the effectiveness of multilingual fine-tuning for low-resource languages, the scaling law exhibited on the task, and the limitation of reasoning-oriented models on the task, even when “aha moments” occur during the reasoning process.
Fuyu Xing, Haoyu Huang, Dawei Mo, Xinzhuo Yang, Zixuan Gao, Wei Wang, Zimu Wang, Haiyang Zhang
MASP: A Multilingual Dataset for Probing Scalar Modifier Understanding in LLMs
Abstract
This study aims to test how large language models (LLMs) understand gradable adjectives and whether their understanding compares with humans, under the framework of formal semantics. We introduce a diagnostic dataset, referred to as the Modifier-Adjective Scale Probe (MASP), to evaluate how well LLMs understand a gradable adjective (e.g., long) when the adjective is combined with one modifier (e.g., very long or slightly long, a condition referred to as degree modification) or is further negated (e.g., very not long and not very long, a condition referred to as compositional negation). The dataset consists of over 80,000 natural language inference questions in both Chinese and English. We apply the MASP dataset to test both humans and 11 popular LLMs, including GPT-4 and Gemini-2.0-Flash. The results show that most LLMs can correctly understand whether a modifier boosts (e.g., very) an adjective. However, they fail to understand the modifiers that weaken the degree and the negation forms of modifiers. Furthermore, we parameterize the human and LLM behavior, and find that the judgment patterns of LLMs differ from humans especially in the Chinese tests. These findings suggest that LLMs are still not well aligned with humans in terms of the interpretation of simple adjective phrases, and MASP provides a new approach to quantify the interpretation of adjective phrases in LLMs.
Xinyu Gao, Nai Ding, Wei Liu

Text Generation, Dialogue and Summarization

Frontmatter
Self-preference: An Automated Method for Preference-Aligned Data Constructed from Business Metrics
Abstract
Large language models (LLMs) have become integral components of various AI solutions, with the reinforcement learning from human feedback (RLHF) stage playing a critical role in aligning model outputs with human preferences. However, generating the human preference data required for RLHF is often costly and time-consuming due to its reliance on human evaluation. This study addresses this challenge within the dialogue scenarios of the fintech industry. We leverage rich, non-confidential, multi-turn dialogue data, such as call center dialogue records, which include associated business metrics (e.g., problem-solving rates, turnover ratios) to construct preference-aligned data. We introduce Self-Preference, an automated method for creating preference-aligned data guided by these objective business metrics. The approach involves clustering dialogue histories based on their semantic representations and calculating a well-designed conditional probability ratio that correlates sequences with business metrics to generate preference data. In contrast to traditional preference alignment data generation methods that depend on subjective human evaluations, Self-Preference significantly reduces labeling costs and mitigates model-induced biases. Experimental results indicate that models trained with Self-Preference generated data demonstrate a strong positive correlation with target business metrics, highlighting the method’s effectiveness in facilitating efficient, goal-oriented alignment of LLMs.
Feng Gao, Xuan Zhang, Boyi Ni, Chunping Wang, Lei Chen
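One way to picture the conditional probability ratio is as a lift of a response's business-metric success rate over its cluster's base rate, used to rank responses into chosen/rejected preference pairs. This toy is a loose reading of the abstract, not the paper's formula.

```python
def preference_pair(dialogues, cluster_id):
    """Toy construction of a preference pair within one semantic cluster.

    dialogues: (response, cluster, metric_success: bool) tuples, e.g.
    call-center turns labeled by whether the problem was solved.
    """
    in_cluster = [d for d in dialogues if d[1] == cluster_id]
    base = sum(d[2] for d in in_cluster) / len(in_cluster)  # base rate

    def lift(resp):
        with_r = [d for d in in_cluster if d[0] == resp]
        p_success = sum(d[2] for d in with_r) / len(with_r)
        return p_success / max(base, 1e-8)  # lift over the cluster base rate

    ranked = sorted({d[0] for d in in_cluster}, key=lift, reverse=True)
    return ranked[0], ranked[-1]            # (chosen, rejected)

pair = preference_pair(
    [("I can waive that fee.", 0, True),
     ("I can waive that fee.", 0, True),
     ("Please call back later.", 0, False)],
    cluster_id=0)
```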
TAG: Dialogue Summarization Based on Topic Segmentation and Graph Structures
Abstract
In recent years, dialogue summarization has emerged as a rapidly growing area of research in natural language processing. Dialogue summarization is challenging due to dispersed key information, redundant expressions, ambiguous topic identification, and difficult content selection. To address these challenges, we propose an innovative approach to dialogue summarization that integrates topic segmentation and graph-structured modeling. Specifically, we first perform topic segmentation of the dialogue through clustering and quantify the key information in each utterance, thereby capturing the dialogue topics more effectively. Then, a redundancy graph and a keyword graph are constructed to suppress redundant information and extract key content, thereby enhancing the conciseness and coherence of the summary. Evaluations were conducted on the DialogSum, SAMSum, CSDS, and NaturalConv datasets. The experimental results demonstrate that the proposed method significantly outperforms existing benchmark models in terms of summary accuracy and information coverage. The Rouge-1 scores achieved were 48.03%, 53.75%, 60.78%, and 81.48%, respectively, validating its effectiveness in the dialogue summarization task. Our code is available at https://anonymous.4open.science/r/TAG-E64A.
Yatian Shen, Qichao Hao, Guosong Deng, Songyang Wang, Eryan Zhang
Enabling Real-Time Conversations with Minimal Training Costs
Abstract
Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the duplex capability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of input and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
Wang Xu, Haoyu Wang, Shuo Wang, Weilin Zhao, Xu Han, Yukun Yan, Haiyan Zhao, Yudi Zhang, Zhe Tao, Zhiyuan Liu, Wanxiang Che

Information Retrieval, Text Classification and QA

Frontmatter
EDGE: Enhanced Debiased Gradient Extraction for Robust Fine-Tuning
Abstract
Recent advances in large-scale pre-training have substantially enhanced the robustness and generalization capabilities of foundation models (e.g., Qwen3 and Llama-4). However, when fine-tuned on downstream tasks, these models often latch onto dataset-specific biases, learning spurious correlations tied to easy-to-learn but non-robust features. This undermines their performance under distribution shifts, despite strong in-distribution (ID) accuracy. Existing fine-tuning methods, including full-parameter and parameter-efficient techniques, primarily optimize for ID performance and largely overlook out-of-distribution (OOD) robustness. Meanwhile, while debiasing has been explored for full fine-tuning, debiasing strategies for Parameter-Efficient Fine-Tuning (PEFT) remain underexplored. To this end, we propose Enhanced Debiased Gradient Extraction (EDGE), a lightweight gradient projection-based method that explicitly suppresses bias-amplifying updates during the fine-tuning process. EDGE is a model-agnostic, plug-and-play debiasing method that operates without relying on predefined bias types or labels. It seamlessly integrates with both full and parameter-efficient fine-tuning, and generalizes across NLP and vision tasks. Experiments on synthetic and real-world benchmarks demonstrate that EDGE effectively reduces bias and consistently improves OOD generalization, offering a unified and practical framework for robust adaptation under dataset bias.
Jinglong Li, Kun Zhang, Chenyu Zou, Wei Shi, Xin Li, Si Wei
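Gradient-projection debiasing of the general kind EDGE belongs to removes the component of the task gradient that aligns with a bias-amplifying direction. A generic sketch follows; how EDGE actually estimates the bias direction is not reproduced here.

```python
import torch

def debiased_gradient(task_grad, bias_grad, eps=1e-8):
    """Project the bias-aligned component out of the task gradient.

    task_grad: gradient of the task loss for one parameter tensor.
    bias_grad: an estimated bias-amplifying direction, same shape.
    """
    g, b = task_grad.flatten(), bias_grad.flatten()
    coef = torch.dot(g, b) / (torch.dot(b, b) + eps)
    if coef > 0:              # only remove updates that amplify the bias
        g = g - coef * b      # orthogonal projection away from b
    return g.view_as(task_grad)

g = debiased_gradient(torch.randn(10, 10), torch.randn(10, 10))
```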
BiSaGA: A Novel Bidirectional Sparse Graph Attention Adapter for Evidence-Based Fact-Checking
Abstract
Evidence-based fact-checking aims to verify or debunk claims using evidence and has greatly benefited from advancements in Large Language Models (LLMs). This task relies on clarifying and discriminating relations between entities. However, autoregressive LLMs struggle with understanding relations presented in different orders or narratives, as their unidirectional nature hampers effective performance. To address this challenge, we propose a novel method that leverages bidirectional attention as an external adapter to facilitate two-way information aggregation. Additionally, we employ hierarchical sparse graphs to merge local and global information and introduce an efficient feature-compression technique to minimize the number of adapter parameters. Experimental results on both English and Chinese datasets demonstrate the significant improvements achieved by our approach, showcasing state-of-the-art performance in the evidence-based fact-checking task.
Junfeng Ran, Weiyao Luo, Zailong Tian, Guangxiang Zhao, Dawei Zhu, Longyun Wu, Hailiang Huang, Sujian Li

ACL ARR Fast Track

Frontmatter
Cross-Modal Ambiguity Learning with Heterogeneous Interaction Analysis for Rumor Detection
Abstract
Rumor detection on social media has recently attracted significant attention. Due to the complexity of user groups and the lack of regulation, rumor-spreaders intentionally disseminate rumors to sway public opinion, severely harming the general interest. Existing approaches generally perform rumor detection by analyzing both image and text modalities, and pay less attention to the interaction behaviors in social media, which can assist in distinguishing rumors from normal information. Furthermore, the images associated with rumors are often inconsistent or manipulated; distinguishing these different features and utilizing them effectively has become crucial in preventing the widespread dissemination of rumors. To address the aforementioned issues, we propose Cross-modal Ambiguity Learning with Heterogeneous Interaction Analysis (CAHIA) for rumor detection. Specifically, we design a novel heterogeneous graph feature extractor to fully utilize the different types of behavioral patterns in social interaction networks, and a frequency inception net to extract manipulated visual features, adopting different fusing strategies to detect various types of rumors according to the ambiguity between text and image. Finally, a hierarchical cross-modal fusing mechanism is used to simulate the process by which users view and determine the authenticity of posts. Extensive experimental results demonstrate that CAHIA outperforms state-of-the-art models on four large-scale datasets for rumor detection in social media.
Zhuo Fan, Qing Zhu, Yang Xiao
Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models
Abstract
Although Large Language Models (LLMs) have demonstrated strong instruction-following ability, in real-world scenarios they are further expected to be controlled and guided by inferential rules to be safe, accurate, and intelligent. This demands that LLMs possess inferential rule-following capability. However, no prior work has clearly evaluated this capability: previous studies that attempt to do so fail to distinguish inferential rule-following scenarios from instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into improvements for LLMs toward better inferential rule-following intelligent agents. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract inferential rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://anonymous.4open.science/r/llm-rule-following-B3E3/.
Wangtao Sun, Chenxiang Zhang, Xueyou Zhang, Xuanqing Yu, Ziyang Huang, Haotian Xu, Shizhu He, Jun Zhao, Kang Liu
DSMR-SQL: Enhancing Text-to-SQL with Dual-Strategy SQL Generation and Multi-role SQL Selection
Abstract
Recent advancements in Large Language Models (LLMs) have markedly improved SQL generation. Nevertheless, existing approaches typically rely on single-model designs, limiting their capacity to effectively handle complex user queries. In addition, current methods often face difficulties in selecting the optimal SQL from multiple candidates. To mitigate these limitations, this study presents DSMR-SQL, a two-stage framework consisting of: (1) Dual-Strategy SQL Generation: DSMR-SQL aims to produce a broader spectrum of SQL queries by using multiple models with two strategies: Supervised Fine-Tuning and In-Context Learning; (2) Multi-Role SQL Selection: DSMR-SQL seeks to identify the SQL that best aligns with user intent by introducing a collaborative framework involving three roles (i.e., Proposer, Critic, Summarizer). Extensive experiments on various datasets substantiate the efficacy of DSMR-SQL in enhancing SQL generation.
Yiming Huang, Jiyu Guo, Jichuan Zeng, Cuiyun Gao, Peiyi Han, Chuanyi Liu

Machine Translation and Multilingual Information Processing

Frontmatter
Quality-Aware Neural Machine Translation with Self-evaluation
Abstract
The performance of neural machine translation relies on a large amount of data, but crawled sentence pairs are of different quality. The low-quality sentence pairs may provide helpful translation knowledge but also teach the model to generate low-quality translations. Making the model aware of the quality of training instances may help the model distinguish between good and bad translations while leveraging the translation knowledge. In this paper, we evaluate the quality of training instances with the average per-token loss (negative log-likelihood) from translation models, convert the quality scores into embeddings through vector interpolation and feed the quality embedding into the translation model during its training. We ask the model to decode with the best quality score to generate good translations during inference. Experiments on the IWSLT 14 German to English, WMT 14 English to German and WMT 22 English to Japanese translation tasks show that our method can effectively lead to consistent and significant improvements across multiple metrics.
Jiajia Cui, Lingling Mu, Qiuhui Liu, Hongfei Xu
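The abstract's "quality scores into embeddings through vector interpolation" suggests something like the following: map the average per-token loss of a training instance to [0, 1] and interpolate between two endpoint embeddings. The normalization and endpoint handling here are assumptions; in practice the endpoint embeddings would be learned, and at inference the model is conditioned on the best-quality embedding.

```python
import torch

def quality_embedding(avg_token_loss, e_best, e_worst, loss_min, loss_max):
    """Interpolate between 'best' and 'worst' quality embeddings using
    the instance's average per-token loss (negative log-likelihood).
    """
    t = (avg_token_loss - loss_min) / max(loss_max - loss_min, 1e-8)
    t = min(max(t, 0.0), 1.0)             # clamp quality score to [0, 1]
    return (1 - t) * e_best + t * e_worst # low loss -> close to e_best

# Toy usage; at inference time one would simply feed e_best.
e = quality_embedding(2.3, torch.zeros(512), torch.ones(512), 0.5, 5.0)
```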
Backmatter
Title
Chinese Computational Linguistics
Edited by
Maosong Sun
Peiyong Duan
Zhiyuan Liu
Ruifeng Xu
Weiwei Sun
Yubo Chen
Zhiliang Tian
Zhenghao Liu
Copyright Year
2026
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-95-2725-0
Print ISBN
978-981-95-2724-3
DOI
https://doi.org/10.1007/978-981-95-2725-0

The PDF files of this book were created in accordance with the PDF/UA-1 standard to improve accessibility. This includes support for screen readers, described non-text content (images, graphs), bookmarks for easy navigation, keyboard-friendly links and forms, and searchable and selectable text. We recognize the importance of accessibility and welcome inquiries about the accessibility of our products. If you have questions or accessibility needs, please contact us at accessibilitysupport@springernature.com.
