
2024 | Book

Neural Information Processing

30th International Conference, ICONIP 2023, Changsha, China, November 20–23, 2023, Proceedings, Part VI

Editors: Biao Luo, Long Cheng, Zheng-Guang Wu, Hongyi Li, Chaojie Li

Publisher: Springer Nature Singapore

Book Series: Lecture Notes in Computer Science


About this book

The six-volume set LNCS 14447–14452 constitutes the refereed proceedings of the 30th International Conference on Neural Information Processing, ICONIP 2023, held in Changsha, China, in November 2023.
The 652 papers presented in the proceedings set were carefully reviewed and selected from 1,274 submissions. They focus on theory and algorithms; cognitive neurosciences; human centred computing; and applications in neuroscience, neural networks, deep learning, and related fields.

Table of Contents

Frontmatter

Applications

Frontmatter
MIC: An Effective Defense Against Word-Level Textual Backdoor Attacks

Backdoor attacks, which manipulate model output, have garnered significant attention from researchers. However, some existing word-level backdoor attacks on NLP models are difficult to defend against effectively due to their concealment and diversity. These covert attacks use two words that appear similar to the naked eye but are mapped to different word vectors by the NLP model, thereby bypassing existing defenses. To address this issue, we propose incorporating triple metric learning into the standard training phase of NLP models to defend against existing word-level backdoor attacks. Specifically, metric learning is used to minimize the distance between the vectors of similar words while maximizing the distance between them and the vectors of other words. Additionally, given that metric learning may reduce a model’s sensitivity to semantic changes caused by subtle perturbations, we add contrastive learning after the model’s standard training. Experimental results demonstrate that our method performs well against the two most stealthy existing word-level backdoor attacks.

Shufan Yang, Qianmu Li, Zhichao Lian, Pengchuan Wang, Jun Hou
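As a concrete illustration of the metric-learning step described in the MIC abstract above, here is a minimal triplet-style loss sketch in PyTorch; the margin value and the way anchor/positive/negative word vectors are chosen are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def triplet_metric_loss(anchor, positive, negative, margin=1.0):
    """Pull the vectors of visually similar words together and push the
    vectors of other words away (margin is an illustrative choice)."""
    d_pos = F.pairwise_distance(anchor, positive)  # anchor vs. similar-looking word
    d_neg = F.pairwise_distance(anchor, negative)  # anchor vs. unrelated word
    return F.relu(d_pos - d_neg + margin).mean()
```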
Active Learning for Open-Set Annotation Using Contrastive Query Strategy

Active learning has achieved remarkable success in minimizing labeling costs for classification tasks where all data samples are drawn from known classes. However, in real scenarios, most active learning methods fail when encountering the open-set annotation (OSA) problem, i.e., numerous samples from unknown classes. The main reason for such failure is that existing query strategies unavoidably select unknown-class samples. To tackle this problem and select the most informative samples, we propose a novel active learning framework named OSA-CQ, which simplifies the detection of samples from known classes and enhances classification performance with an effective contrastive query strategy. Specifically, OSA-CQ first adopts an auxiliary network to distinguish samples using confidence scores, which can dynamically select the samples most likely to come from known classes in the unlabeled set. Secondly, by comparing the predictions of the auxiliary network and the classifier together with feature similarity, OSA-CQ designs a contrastive query strategy to select the most informative samples from the unlabeled known-class set. Experimental results on CIFAR10, CIFAR100 and Tiny-ImageNet show that the proposed OSA-CQ can select highly informative samples from known classes, and achieves higher classification performance with lower annotation cost than state-of-the-art active learning algorithms.

Peng Han, Zhiming Chen, Fei Jiang, Jiaxin Si
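The contrastive query strategy in OSA-CQ is described only at a high level; the toy scoring below, with hypothetical inputs aux_scores, clf_probs and feat_sims, shows one way an auxiliary network's confidence could be combined with classifier disagreement to pick known-class samples for labeling. It is not the paper's exact formulation.

```python
import numpy as np

def contrastive_query(aux_scores, clf_probs, feat_sims, budget):
    """Toy query in the spirit of OSA-CQ (inputs and scoring rule assumed)."""
    known = np.where(aux_scores > np.median(aux_scores))[0]  # likely known-class
    # Disagreement between the two predictors, weighted by feature similarity,
    # flags informative samples that are still probably from known classes.
    score = np.abs(aux_scores - clf_probs) * feat_sims
    order = np.argsort(-score[known])
    return known[order[:budget]]  # indices to send to annotators
```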
Cross-Domain Bearing Fault Diagnosis Method Using Hierarchical Pseudo Labels

Data-driven bearing fault diagnosis methods have become increasingly crucial for the health management of rotating machinery. However, in actual industrial scenarios, the scarcity of labeled data presents a challenge. To alleviate this problem, many transfer learning methods have been proposed. Some domain adaptation methods use models trained on the source domain to generate pseudo labels for target-domain data, which are further employed to refine the models. Domain shift may introduce noise into the pseudo labels, thereby compromising the stability of the model. To address this issue, we propose a Hierarchical Pseudo Label Domain Adversarial Network. In this method, we divide pseudo labels into three levels and use different training approaches for the different levels of samples. Compared with traditional threshold-filtering methods that focus on high-confidence samples, our method can effectively exploit the positive information in a large quantity of medium-confidence samples and mitigate the negative impact of mislabeling. Our proposed method achieves higher prediction accuracy than state-of-the-art domain adaptation methods in harsh environments.

Mingtian Ping, Dechang Pi, Zhiwei Chen, Junlong Wang
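The abstract's three-level split of pseudo labels can be pictured as a simple confidence partition; the thresholds below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def split_pseudo_labels(probs, hi=0.9, lo=0.5):
    """Partition target-domain samples by pseudo-label confidence
    (hi/lo thresholds are assumed for illustration)."""
    conf = probs.max(axis=1)              # confidence of each pseudo label
    high = conf >= hi                     # trusted: train with hard labels
    medium = (conf >= lo) & (conf < hi)   # exploited with a weaker objective
    low = conf < lo                       # down-weighted or excluded
    return high, medium, low
```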
Differentiable Topics Guided New Paper Recommendation

A large number of scientific papers are published each year. Since progress on scientific theories and technologies varies widely, it is challenging to recommend valuable new papers to interested researchers. In this paper, we investigate the new paper recommendation task from the viewpoint of the involved topics and use the concept of subspace to distinguish academic contributions. We model papers as topic distributions over subspaces through a neural topic model. The academic influences between papers are modeled as topic propagation, learned by asymmetric graph convolution on the academic network, reflecting the asymmetry of academic knowledge propagation. Experimental results on real datasets show that our model outperforms the baselines on new paper recommendation. In particular, the introduced subspace concept can help find the differences between high-quality papers and others, which relate to their innovations. Besides, we conduct experiments from multiple aspects to verify the robustness of our model.

Wen Li, Yi Xie, Hailan Jiang, Yuqing Sun
IIHT: Medical Report Generation with Image-to-Indicator Hierarchical Transformer

Automated medical report generation has become increasingly important in medical analysis. It can produce computer-aided diagnosis descriptions and thus significantly alleviate doctors’ workload. Inspired by the huge success of neural machine translation and image captioning, various deep learning methods have been proposed for medical report generation. However, due to the inherent properties of medical data, including data imbalance and the length of and correlation between report sequences, reports generated by existing methods may exhibit linguistic fluency but lack adequate clinical accuracy. In this work, we propose an image-to-indicator hierarchical transformer (IIHT) framework for medical report generation. It consists of three modules: a classifier module, an indicator expansion module and a generator module. The classifier module first extracts image features from the input medical images and produces disease-related indicators with their corresponding states. The disease-related indicators are subsequently utilised as input for the indicator expansion module, incorporating the “data-text-data” strategy. The transformer-based generator then leverages these extracted features along with image features as auxiliary information to generate final reports. Furthermore, the proposed IIHT method makes it feasible for radiologists to modify disease indicators in real-world scenarios and integrate the operations into the indicator expansion module for fluent and accurate medical report generation. Extensive experiments and comparisons with state-of-the-art methods under various evaluation metrics demonstrate the strong performance of the proposed method.

Keqiang Fan, Xiaohao Cai, Mahesan Niranjan
OD-Enhanced Dynamic Spatial-Temporal Graph Convolutional Network for Metro Passenger Flow Prediction

Metro passenger flow prediction is crucial for efficient urban transportation planning and resource allocation. However, it faces two challenges. The first is extracting the diverse passenger flow patterns at different stations, e.g., stations near residential areas versus stations near commercial areas, while the second is modeling the complex dynamic spatial-temporal correlations caused by Origin-Destination (OD) flows. Existing studies often overlook these two aspects, especially the impact of OD flows. To this end, we propose an OD-enhanced dynamic spatial-temporal graph convolutional network (DSTGCN) for metro passenger flow prediction. First, we propose a static spatial module to extract the flow patterns of different stations. Second, we utilize a dynamic spatial module to capture the dynamic spatial correlations between stations with OD matrices. Finally, we employ a multi-resolution temporal dependency module to learn the delayed temporal features. We conduct experiments on two real-world datasets from Shanghai and Hangzhou. The results show the superiority of our model over state-of-the-art baselines.

Lei Ren, Jie Chen, Tong Liu, Hang Yu
Enhancing Heterogeneous Graph Contrastive Learning with Strongly Correlated Subgraphs

Graph contrastive learning maximizes the mutual information between the embedding representations of the same data instances in different augmented views of a graph, obtaining feature representations for graph data in an unsupervised manner without the need for manual labeling. Most existing node-level graph contrastive learning models only consider embeddings of the same node in different views as positive sample pairs, ignoring rich inherent neighboring relations and thus losing some contrastive information. To address this issue, we propose a heterogeneous graph contrastive learning model that incorporates strongly correlated subgraph features. We design a contrastive learning framework suitable for heterogeneous graphs and introduce high-level neighborhood information during the contrasting process. Specifically, our model selects a strongly correlated subgraph for each target node in the heterogeneous graph based on both topological structure information and node attribute features. When calculating the contrastive loss, we perform feature-shifting operations on positive and negative samples based on the subgraph encoding to enhance the model’s ability to discriminate between similar samples. We conduct node classification and ablation experiments on multiple public heterogeneous datasets, and the results verify the effectiveness of our model’s contributions.

Yanxi Liu, Bo Lang
DRPDDet: Dynamic Rotated Proposals Decoder for Oriented Object Detection

Oriented object detection has gained popularity in diverse fields. However, for two-stage detection algorithms, generating high-quality proposals with a high recall rate remains a formidable challenge, especially in remote sensing images where sparse and dense scenes coexist. To address this, we propose the DRPDDet method, which aims to improve the accuracy and recall of proposals for oriented object detection. Our approach generates high-quality horizontal proposals and dynamically decodes them into rotated proposals to predict the final rotated bounding boxes. To achieve high-quality horizontal proposals, we introduce the HarmonyRPN module, which integrates foreground information from the RPN classification branch into the original feature map, creating a fused feature map that incorporates multi-scale foreground information. By doing so, the RPN generates horizontal proposals that focus more on foreground objects, which leads to improved regression performance. Additionally, we design a dynamic rotated proposals decoder that adaptively generates rotated proposals based on the constraints of the horizontal proposals, enabling accurate detection in complex scenes. We evaluate our proposed method on the DOTA and HRSC2016 remote sensing datasets, and the experimental results demonstrate its effectiveness in complex scenes. Our method improves the accuracy of proposals in various scenarios while maintaining a high recall rate.

Jun Wang, Zilong Wang, Yuchen Weng, Yulian Li
MFSFFuse: Multi-receptive Field Feature Extraction for Infrared and Visible Image Fusion Using Self-supervised Learning

Infrared and visible image fusion aims to fuse complementary information from different modalities to improve image quality and resolution and to facilitate subsequent visual tasks. Most current fusion methods suffer from incomplete feature extraction or redundancy, resulting in indistinct targets or lost texture details. Moreover, infrared and visible image fusion lacks ground truth, so fusion results obtained with unsupervised network training may also lose important features. To solve these problems, we propose an infrared and visible image fusion method using self-supervised learning, called MFSFFuse. Specifically, we introduce a Multi-Receptive Field dilated convolution block that extracts multi-scale features using dilated convolutions, and employ different attention modules to enhance information extraction in different branches. Furthermore, a dedicated loss function is devised to guide the optimization of the model toward an ideal fusion result. Extensive experiments show that, compared to state-of-the-art methods, our method achieves competitive results in both quantitative and qualitative evaluations.

Xueyan Gao, Shiguang Liu
Progressive Temporal Transformer for Bird’s-Eye-View Camera Pose Estimation

Visual relocalization is a crucial technique in visual odometry and SLAM for predicting the 6-DoF camera pose of a query image. Existing works mainly focus on ground-level views in indoor or outdoor scenes, while camera relocalization on unmanned aerial vehicles has received less attention; frequent view changes and a large depth of field make it more challenging. In this work, we establish a Bird’s-Eye-View (BEV) dataset for camera relocalization: a large dataset containing four distinct scenes (roof, farmland, bare ground, and urban area) with challenging conditions such as frequent view changes, repetitive or weak textures, and large depths of field. Every image in the dataset is associated with a ground-truth camera pose. With 177,242 images, the BEV dataset is a challenging large-scale benchmark for camera relocalization. We also propose a Progressive Temporal transFormer (dubbed PTFormer) as the baseline model. PTFormer is a sequence-based transformer with a progressive temporal aggregation module for exploiting temporal correlations and a parallel absolute and relative prediction head for implicitly modeling the temporal constraint. Thorough experiments on both the BEV dataset and the widely used handheld datasets 7Scenes and Cambridge Landmarks demonstrate the robustness of our proposed method.

Zhuoyuan Wu, Jiancheng Cai, Ranran Huang, Xinmin Liu, Zhenhua Chai
Adaptive Focal Inverse Distance Transform Maps for Cell Recognition

The quantitative analysis of cells is crucial for clinical diagnosis, and effective analysis requires accurate detection and classification. Using point annotations for weakly supervised learning is a common approach for cell recognition, since it significantly reduces the labeling workload. Cell recognition methods based on point annotations primarily rely on manually crafted smooth pseudo labels. However, the diversity of cell shapes can render such fixed encodings ineffective. In this paper, we propose a multi-task cell recognition framework. The framework utilizes a regression task to adaptively generate smooth pseudo labels that carry cell morphological features, guiding the robust learning of the probability branch, and utilizes an additional branch for classification. Meanwhile, to address the issue of multiple high-response points within one cell, we introduce Non-Maximum Suppression (NMS) to avoid duplicate detections. On a bone marrow cell recognition dataset, our method is compared with five representative methods. Compared with the best-performing method, ours achieves improvements of 2.0 and 3.6 F1 points in detection and classification, respectively.

Wenjie Huang, Xing Wu, Chengliang Wang, Zailin Yang, Longrong Ran, Yao Liu
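Since the abstract applies NMS to point detections rather than boxes, a greedy point-NMS sketch may help; the radius-based suppression rule is an assumption about the exact criterion used.

```python
import numpy as np

def point_nms(points, scores, radius):
    """Keep the highest-response cell-center points and suppress any other
    point within `radius` of an already kept one (a simplified sketch)."""
    kept = []
    for i in np.argsort(-scores):                 # visit points by response
        if all(np.linalg.norm(points[i] - points[j]) > radius for j in kept):
            kept.append(i)
    return kept                                   # indices of detected cells
```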
Stereo Visual Mesh for Generating Sparse Semantic Maps at High Frame Rates

The Visual Mesh is an input transform for deep learning that allows depth independent object detection at very high frame rates. The present study introduces a Visual Mesh based stereo vision method for sparse stereo semantic segmentation. A dataset of simulated 3D scenes was generated and used for training to show that the method is capable of processing high resolution stereo inputs to generate both left and right sparse semantic maps. The new stereo method demonstrated better classification accuracy than the corresponding monocular approach. The high frame rates and high accuracy may make the proposed approach attractive to fast-paced on-board robot or IoT applications.

Alexander Biddulph, Trent Houliston, Alexandre Mendes, Stephan Chalup
Micro-expression Recognition Based on PCB-PCANet+

Micro-expressions (MEs) are characterized by small motion amplitude and short duration. How to learn discriminative ME features is a key issue in ME recognition. Motivated by the success of the PCB model in person retrieval, this paper proposes a ME recognition method called PCB-PCANet+. Considering that the important information of MEs is mainly concentrated in a few key facial areas such as the eyebrows and eyes, we use multi-branch LSTM networks on top of the output of a shallow PCANet+ to separately learn local spatio-temporal features for each facial ROI. In addition, in the multi-branch fusion stage, we design a feature weighting strategy according to the significance of different facial regions to further improve recognition performance. Experimental results on the SMIC and CASME II datasets validate the effectiveness of the proposed method.

Shiqi Wang, Fei Long, Junfeng Yao
Exploring Adaptive Regression Loss and Feature Focusing in Industrial Scenarios

Industrial defect detection aims to detect quality defects in industrial products. However, the surface defects of different industrial products vary greatly, for example in the variety of texture shapes and the complexity of background information. A lightweight Focus Encoder-Decoder Network (FEDNet) is presented to solve these problems. Specifically, the novelty of FEDNet is as follows. First, the feature focusing module (FFM) is designed to focus attention on defect features in complex backgrounds. Second, a lightweight texture extraction module (LTEM) is proposed to cheaply extract the texture and relative location information of defect features in shallow layers. Finally, AZIoU, an adaptive adjustment loss function, is re-examined with respect to the prediction box’s perimeter and length-width terms. Experiments on two industrial defect datasets show that FEDNet achieves an accuracy of 42.86% on Steel and 72.19% on DeepPCB using only 15.3 GFLOPs.

Mingle Zhou, Zhanzhi Su, Min Li, Delong Han, Gang Li
Optimal Task Grouping Approach in Multitask Learning

Multi-task learning has become a powerful solution in which multiple tasks are trained together to leverage the knowledge learned from one task to improve the performance of the others. However, tasks are not always constructive for each other in the multi-task formulation and might interact negatively during training, leading to poor results. Thus, this study focuses on finding the optimal group of tasks that should be trained together for multi-task learning in an automotive context. We propose a multi-task learning approach to model multiple long-term vehicle behaviors using low-resolution data, and utilize gradient descent to efficiently discover the optimal group of tasks/vehicle behaviors that can increase the performance of the predictive models in a single training process. In this study, we also quantify the contribution of individual tasks within their groups and to the other groups’ performance. The experimental evaluation on data collected from thousands of heavy-duty trucks shows that the proposed approach is promising.

Reza Khoshkangini, Mohsen Tajgardan, Peyman Mashhadi, Thorsteinn Rögnvaldsson, Daniel Tegnered
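One common gradient-based signal for deciding which tasks to group is the cosine similarity between per-task gradients; the sketch below illustrates that general idea and is not the paper's exact procedure.

```python
import itertools
import torch

def gradient_affinity(task_grads):
    """Pairwise cosine similarity between flattened per-task gradient
    vectors; positive affinity hints that two tasks may train well together.
    `task_grads` maps task name -> 1-D gradient tensor (hypothetical input)."""
    affinity = {}
    for (a, ga), (b, gb) in itertools.combinations(task_grads.items(), 2):
        affinity[(a, b)] = torch.cosine_similarity(ga, gb, dim=0).item()
    return affinity
```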
Effective Guidance in Zero-Shot Multilingual Translation via Multiple Language Prototypes

In a multilingual neural machine translation (MNMT) model that fully shares parameters across all languages, a popular approach is to use an artificial language token to guide translation into the desired target language. However, recent studies have shown that the language-specific signals in prepended language tokens are not adequate to guide MNMT models to translate into the right directions, especially in zero-shot translation (the off-target translation issue). We argue that the representations of prepended language tokens are overly affected by their context, resulting in potential information loss in the language tokens and insufficient indicative ability. To address this issue, we introduce multiple language prototypes to guide translation into the desired target language. Specifically, we categorize sparse contextualized language representations into a few representative prototypes over the training set, and inject their representations into each individual token to guide the model. Experiments on several multilingual datasets show that our method significantly alleviates the off-target translation issue and improves translation quality in both zero-shot and supervised directions.

Yafang Zheng, Lei Lin, Yuxuan Yuan, Xiaodong Shi
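The prototype construction is described as categorizing contextualized language representations into a few representative prototypes; a simple k-means reading of that step is sketched below (the clustering algorithm and the value of k are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_language_prototypes(token_reps, k=4):
    """Cluster contextualized language-token representations collected over
    the training set into k prototype vectors (k-means is one plausible choice)."""
    km = KMeans(n_clusters=k, n_init=10).fit(token_reps)  # token_reps: (N, d)
    return km.cluster_centers_                            # prototypes: (k, d)
```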
Extending DenseHMM with Continuous Emission

Traditional Hidden Markov Models (HMMs) allow us to discover the latent structure of observed data (both discrete and continuous). The recently proposed DenseHMM provides hidden-state embeddings and uses a co-occurrence-based learning schema. However, it is limited to discrete emissions, which does not fit many real-world problems. We address this shortcoming by discretizing observations and using a region-based co-occurrence matrix in the training procedure. This allows embedding hidden states for continuous-emission problems and reduces the training time for large sequences. An application of the proposed approach concerns recommender systems, where we try to explain how the current interest of a given user in a given group of products (the current state of the user) influences the saturation of the list of recommended products with that group of products. Computational experiments confirmed that the proposed approach outperformed regular HMMs in several benchmark problems. Although the emissions are estimated roughly, we can accurately infer the states.

Klaudia Balcer, Piotr Lipinski
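The key trick in the abstract, discretizing continuous observations so that co-occurrence-based DenseHMM training still applies, can be sketched with quantile binning; the paper's region construction may differ from this simple choice.

```python
import numpy as np

def discretize(observations, n_bins=16):
    """Map 1-D continuous emissions to integer symbols via quantile bins,
    after which a (region-based) co-occurrence matrix can be accumulated."""
    inner = np.linspace(0, 1, n_bins + 1)[1:-1]        # interior quantiles
    edges = np.quantile(observations, inner)
    return np.digitize(observations, edges)            # symbols in [0, n_bins)
```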
An Efficient Enhanced-YOLOv5 Algorithm for Multi-scale Ship Detection

Ship detection has gained considerable attention from industry and academia. However, due to the diverse range of ship types and complex marine environments, multi-scale ship detection suffers from great challenges such as low detection accuracy. To solve these issues, we propose an efficient enhanced-YOLOv5 algorithm for multi-scale ship detection. Specifically, to dynamically extract two-dimensional features, we design a MetaAconC-inspired adaptive spatial-channel attention module that reduces the impact of complex marine environments on large-scale ships. In addition, we construct a gradient-refined bounding box regression module to enhance the sensitivity of the loss function gradient and strengthen the feature learning ability, which relieves the issue of uneven horizontal and vertical features in small-scale ships. Finally, a Taylor expansion-based classification module is established, which increases the feedback contribution of the gradient by vertically adjusting the first polynomial coefficient and improves the detection performance of the model on few-sample ship objects. Extensive experimental results confirm the effectiveness of the proposed method.

Jun Li, Guangyu Li, Haobo Jiang, Weili Guo, Chen Gong
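The Taylor expansion-based classification module adjusts the first polynomial coefficient of the cross-entropy expansion; a Poly-1 style sketch of that idea is shown below, with eps1 as an illustrative coefficient rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def poly1_ce_loss(logits, targets, eps1=1.0):
    """Cross-entropy plus an adjusted first polynomial term (1 - p_t),
    which boosts gradient feedback on hard, few-sample classes."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return (ce + eps1 * (1.0 - pt)).mean()
```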
Double-Layer Blockchain-Based Decentralized Integrity Verification for Multi-chain Cross-Chain Data

With the development of blockchain technology, issues like storage, throughput, and latency emerge. Multi-chain solutions are devised to enable data sharing across blockchains, but in complex cross-chain scenarios, data integrity faces risks. Due to the decentralized nature of blockchain, centralized verification schemes are not feasible, making decentralized cross-chain data integrity verification a critical and challenging problem. In this paper, based on the ideas of “governing the chain by chain” and “double layer blockchain”, we propose a double-layer blockchain-based decentralized integrity verification scheme. We construct a supervision-chain by selecting representative nodes from multiple blockchains, which is responsible for cross-chain data integrity verification and recording results. Specifically, our scheme relies on two consensus phases: integrity consensus for verification and block consensus for result recording. We also integrate a reputation system and an election algorithm within the supervision-chain. Through security analysis and performance evaluation, we demonstrate the security and effectiveness of our proposed scheme.

Weiwei Wei, Yuqian Zhou, Dan Li, Xina Hong
Inter-modal Fusion Network with Graph Structure Preserving for Fake News Detection

The continued spread of fake news online threatens the stability and security of society, prompting researchers to focus on fake news detection. The development of social media has made it challenging to detect fake news using only uni-modal information. Existing studies tend to integrate multi-modal information to pursue completeness in information mining. How to effectively eliminate modality differences while capturing structure information from multi-modal data remains a challenging issue. To solve this problem, we propose an Inter-modal Fusion network with Graph Structure Preserving (IF-GSP) approach for fake news detection. An inter-modal cross-layer fusion module is designed to bridge modality differences by integrating features across different layers of the two modalities. Intra-modal and cross-modal contrastive losses are designed to enhance inter-modal semantic similarity while focusing on modality-specific discriminative representation learning. A graph structure preserving module makes the learned features fully perceive the graph structure information via a graph convolutional network (GCN). A multi-modal fusion module utilizes an attention mechanism to adaptively integrate cross-modal feature representations. Experiments on two widely used datasets show that IF-GSP outperforms related multi-modal fake news detection methods.

Jing Liu, Fei Wu, Hao Jin, Xiaoke Zhu, Xiao-Yuan Jing
Learning to Match Features with Geometry-Aware Pooling

Finding reliable and robust correspondences across images is a fundamental and crucial step for many computer vision tasks, such as 3D reconstruction and virtual reality. However, previous studies still struggle in challenging cases, including large view changes, repetitive patterns and textureless regions, due to the neglect of geometric constraints in the feature encoding process. Accordingly, we propose a novel GPMatcher, designed to introduce geometric constraints and guidance into feature encoding. To achieve this goal, we compute camera poses from the corresponding features in each attention layer and adopt geometry-aware pooling to reduce redundant information in the next layer. By these means, an iterative geometry-aware pooling and pose estimation pipeline is constructed, which avoids updating redundant features and reduces the impact of noise. Experiments conducted on a range of evaluation benchmarks demonstrate that our method improves matching accuracy and achieves state-of-the-art performance.

Jiaxin Deng, Xu Yang, Suiwu Zheng
PnP: Integrated Prediction and Planning for Interactive Lane Change in Dense Traffic

Making human-like decisions for autonomous driving in interactive scenarios is crucial and difficult, requiring the self-driving vehicle to reason about the reactions of interactive vehicles to its behavior. To handle this challenge, we provide an integrated prediction and planning (PnP) decision-making approach. A reactive trajectory prediction model is developed to predict the future states of other actors in order to account for the interactive nature of the behaviors. Then, n-step temporal-difference search is used to make a tactical decision and plan the tracking trajectory for the self-driving vehicle by combining the value estimation network with the reactive prediction model. The proposed PnP method is evaluated using the CARLA simulator, and the results demonstrate that PnP obtains superior performance compared to popular model-free and model-based reinforcement learning baselines.

Xueyi Liu, Qichao Zhang, Yinfeng Gao, Zhongpu Xia
Towards Analyzing the Efficacy of Multi-task Learning in Hate Speech Detection

Secretary-General António Guterres launched the United Nations Strategy and Plan of Action on Hate Speech in 2019, recognizing the alarming trend of increasing hate speech worldwide. Despite extensive research, benchmark datasets for hate speech detection remain limited in volume and vary in domain and annotation. In this paper, the following research objectives are addressed: (a) performance comparison of multi-task models against single-task models; (b) a performance study of different multi-task models (fully shared, shared-private) for hate speech detection, considering each dataset as a separate task; (c) the effect of different combinations of existing datasets on the performance of multi-task settings. A total of six datasets containing offensive and hate speech on the grounds of race, sex, and religion are considered in this study. Our analysis suggests that a proper combination of datasets in a multi-task setting can overcome data scarcity and help develop a unified framework.

Krishanu Maity, Gokulapriyan Balaji, Sriparna Saha
Exploring Non-isometric Alignment Inference for Representation Learning of Irregular Sequences

The development of Internet of Things (IoT) technology has led to increasingly diverse and complex data collection methods. This unstable sampling environment results in a large number of irregular monitoring data streams, posing significant challenges for downstream data analysis tasks. We observe that irregular sequences have uneven sampling densities, containing randomly occurring dense and sparse intervals. This data imbalance often leads to overfitting in the dense regions and underfitting in the sparse regions, ultimately impeding models’ representation performance. Conversely, the irregularity at the data level has limited impact on the deep semantics of sequences. Based on this observation, we propose a novel Non-isometric Alignment Inference Architecture (NAIA), which utilizes a multi-level semantic continuous representation structure based on inter-interval segmentation to learn representations of irregular sequences. This architecture efficiently extracts the latent features of irregular sequences. We evaluate NAIA on multiple datasets for downstream tasks and compare it with recent benchmark methods, demonstrating NAIA’s state-of-the-art performance.

Fang Yu, Shijun Li, Wei Yu
Retrieval-Augmented GPT-3.5-Based Text-to-SQL Framework with Sample-Aware Prompting and Dynamic Revision Chain

Text-to-SQL aims at generating SQL queries for the given natural language questions and thus helping users to query databases. Prompt learning with large language models (LLMs) has emerged as a recent approach, which designs prompts to lead LLMs to understand the input question and generate the corresponding SQL. However, it faces challenges with strict SQL syntax requirements. Existing work prompts the LLMs with a list of demonstration examples (i.e. question-SQL pairs) to generate SQL, but the fixed prompts can hardly handle the scenario where the semantic gap between the retrieved demonstration and the input question is large. In this paper, we propose a retrieval-augmented prompting method for an LLM-based Text-to-SQL framework, involving sample-aware prompting and a dynamic revision chain. Our approach incorporates sample-aware demonstrations, which include the composition of SQL operators and fine-grained information related to the given question. To retrieve questions sharing similar intents with input questions, we propose two strategies for assisting retrieval. Firstly, we leverage LLMs to simplify the original questions, unifying the syntax and thereby clarifying the users’ intentions. To generate executable and accurate SQLs without human intervention, we design a dynamic revision chain that iteratively adapts fine-grained feedback from the previously generated SQL. Experimental results on three Text-to-SQL benchmarks demonstrate the superiority of our method over strong baseline models.

Chunxi Guo, Zhiliang Tian, Jintao Tang, Shasha Li, Zhihua Wen, Kaixuan Wang, Ting Wang
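The interplay of sample-aware prompting and the dynamic revision chain described above can be summarized as a loop; `retrieve`, `llm`, and `execute` below are hypothetical callables standing in for the retriever, the LLM, and the database executor, and the prompt wording is illustrative.

```python
def text_to_sql(question, retrieve, llm, execute, max_rounds=3):
    """Sketch of retrieval-augmented prompting with a revision chain
    (prompt templates and round limit are assumptions)."""
    simplified = llm(f"Simplify this question: {question}")  # unify intent/syntax
    demos = retrieve(simplified)                             # sample-aware examples
    sql = llm(f"{demos}\nQuestion: {question}\nSQL:")
    for _ in range(max_rounds):                              # dynamic revision chain
        ok, feedback = execute(sql)                          # fine-grained feedback
        if ok:
            break
        sql = llm(f"{demos}\nQuestion: {question}\n"
                  f"Previous SQL: {sql}\nError: {feedback}\nRevised SQL:")
    return sql
```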
Improving GNSS-R Sea Surface Wind Speed Retrieval from FY-3E Satellite Using Multi-task Learning and Physical Information

Global Navigation Satellite System Reflectometry (GNSS-R) technology has great advantages over traditional satellite remote sensing detection of sea surface wind field in terms of cost and timeliness. It has attracted increasing attention and research from scholars around the world. This paper focuses on the Fengyun-3E (FY-3E) satellite, which carries the GNOS II sensor that can receive GNSS-R signals. We analyze the limitations of the conventional sea surface wind speed retrieval method and the existing deep learning model for this task, and propose a new sea surface wind speed retrieval model for FY-3E satellite based on a multi-task learning (MTL) network framework. The model uses the forecast product of Hurricane Weather Research and Forecasting (HWRF) model as the label, and inputs all the relevant information of Delay-Doppler Map (DDM) in the first-level product into the network for comprehensive learning. We also add wind direction, U wind and V wind physical information as constraints for the model. The model achieves good results in multiple evaluation metrics for retrieving sea surface wind speed. On the test set, the model achieves a Root Mean Square Error (RMSE) of 2.5 and a Mean Absolute Error (MAE) of 1.85. Compared with the second-level wind speed product data released by Fengyun Satellite official website in the same period, which has an RMSE of 3.37 and an MAE of 1.9, our model improves the performance by 52.74% and 8.65% respectively, and obtains a better distribution.

Zhenxiong Zhou, Boheng Duan, Kaijun Ren
Incorporating Syntactic Cognitive in Multi-granularity Data Augmentation for Chinese Grammatical Error Correction

Chinese grammatical error correction (CGEC) has recently attracted much attention due to its real-world value. The current mainstream approaches are all data-driven, but the following flaws remain. First, there is little high-quality training data containing complex and varied errors, and data-driven approaches frequently fail to improve significantly due to this lack of data. Second, existing data augmentation methods for CGEC mainly focus on word-level augmentation and ignore syntactic-level information. Third, current data augmentation methods are highly randomized, and few fit students’ cognitive patterns of syntactic errors. In this paper, we propose a novel multi-granularity data augmentation method for CGEC: we construct a syntactic error knowledge base for the error types Missing and Redundant Components, and syntactic conversion rules for the error type Improper Word Order, based on a finely labeled syntactic structure treebank. Additionally, we compile a knowledge base of character and word errors from actual student essays. A data augmentation algorithm incorporating character, word, and syntactic noise is then designed to build the training set. Extensive experiments show that the F0.5 score on the test set is 36.77%, a 6.2% improvement over the best model in the NLPCC Shared Task, proving the validity of our method.

Jingbo Sun, Weiming Peng, Zhiping Xu, Shaodong Wang, Tianbao Song, Jihua Song
Long Short-Term Planning for Conversational Recommendation Systems

In Conversational Recommendation Systems (CRS), the central question is how the conversational agent can naturally ask for user preferences and provide suitable recommendations. Existing works mainly follow a hierarchical architecture, where a higher-level policy decides whether to invoke the conversation module (to ask questions) or the recommendation module (to make recommendations). This architecture prevents the two components from fully interacting with each other. In contrast, this paper proposes a novel architecture, the long short-term feedback architecture, to connect these two essential components in CRS. Specifically, the recommendation module predicts the long-term recommendation target based on the conversational context and the user history. Driven by the targeted recommendation, the conversational model predicts the next topic or attribute to verify whether the user preference matches the target. This feedback loop continues until the short-term planner’s output matches the long-term planner’s output, which is when the system should make the recommendation.

Xian Li, Hongguang Shi, Yunfei Wang, Yeqin Zhang, Xubin Li, Cam-Tu Nguyen
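The long short-term feedback loop reads naturally as pseudocode; the callables below are hypothetical stand-ins for the two planners, and the turn limit is an illustrative safeguard.

```python
def converse(long_term, short_term, context, history, max_turns=10):
    """Ask about topics until the short-term planner's output matches the
    long-term recommendation target, then recommend (a toy sketch)."""
    target = None
    for _ in range(max_turns):
        target = long_term(context, history)   # long-term recommendation target
        topic = short_term(context, target)    # next topic/attribute to verify
        if topic == target:                    # preference confirmed: recommend
            break
        context = context + [topic]            # otherwise ask about the topic
    return target
```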
Gated Bi-View Graph Structure Learning

Graph structure learning (GSL), which aims to optimize the graph structure and learn suitable graph neural network (GNN) parameters simultaneously, has shown great potential in boosting the performance of GNNs. As a branch of GSL, multi-view methods mainly learn an optimal graph structure (the final view) from multiple information sources (basic views). However, the structural information of the basic views is insufficient, and existing methods ignore the fact that different views can complement each other. Moreover, existing methods obtain the final view through simple combination and fail to constrain noise, which inevitably introduces irrelevant information. To tackle these problems, we propose a Gated Bi-View GSL architecture, named GBV-GSL, which lets two basic views interact through a selection gating mechanism, so as to “turn off” noise as well as supplement insufficient structures. Specifically, two basic views that focus on different knowledge are extracted from the original graph as the two inputs of the model. Furthermore, we propose a novel view interaction technique based on a selection gating mechanism to remove redundant structural information and supplement insufficient topology while retaining each view’s focused knowledge. Finally, we design a view attention fusion mechanism to adaptively fuse the two interacted views into the final view. In numerical experiments involving both clean and attacked conditions, GBV-GSL shows significant improvements in the effectiveness and robustness of structure learning and node representation learning. Code is available at https://github.com/Simba9257/GBV-GSL .

Xinyi Wang, Hui Yan
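The selection gating mechanism that lets the two basic views interact can be illustrated with a per-dimension sigmoid gate; this is a minimal sketch, not the GBV-GSL implementation.

```python
import torch
import torch.nn as nn

class SelectionGate(nn.Module):
    """Gate two view embeddings: g decides, per dimension, how much of view 1
    to keep versus view 2, 'turning off' noise while supplementing structure."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h1, h2):
        g = torch.sigmoid(self.gate(torch.cat([h1, h2], dim=-1)))
        return g * h1 + (1.0 - g) * h2
```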
How Legal Knowledge Graph Can Help Predict Charges for Legal Text

Existing methods for predicting Easily Confused Charges (ECC) primarily rely on the factual descriptions of legal cases. However, these approaches overlook key information hidden in these descriptions, making it impossible to accurately differentiate between ECC. Legal domain knowledge graphs can represent personal information and criminal processes in cases, but they primarily focus on entities in isolation while ignoring the logical relationships between them, and different relationships often lead to distinct charges. To address these problems, this paper proposes a charge prediction model that integrates a Criminal Behavior Knowledge Graph (CBKG), called Charge Prediction Knowledge Graph (CP-KG). First, we define a diverse range of legal entities and relationships based on the characteristics of ECC and conduct fine-grained annotation of key elements and logical relationships in the factual descriptions. Subsequently, we match the descriptions against the CBKG to extract the key elements, which are encoded by a Text Convolutional Neural Network (TextCNN). Additionally, we extract case subgraphs containing sequential behaviors from the CBKG based on the factual descriptions and encode them with a Graph Attention Network (GAT). Finally, we concatenate the representations of key elements, case subgraphs, and factual descriptions, and use them jointly to predict the defendant’s charges. To evaluate CP-KG, we conduct experiments on two charge prediction datasets consisting of real legal cases. The experimental results show that CP-KG achieves Macro-F1 scores of 99.10% and 90.23% on the two datasets, respectively, improving on the baseline methods by 25.79% and 13.82%.

Shang Gao, Rina Sa, Yanling Li, Fengpei Ge, Haiqing Yu, Sukun Wang, Zhongyi Miao
CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

Scene text recognition, as a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored in the process of semantic mining, which limits performance on irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch. The position self-enhanced encoder provides character-sequence position encoding for both other branches. The visual recognition branch performs visual recognition based on the visual features extracted by a CNN and the position encoding. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way humans recognize scene text and integrates cross-modal visual cues for text recognition. Experiments demonstrate that the proposed CMFN achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.

Jinzhi Zheng, Ruyi Ji, Libo Zhang, Yanjun Wu, Chen Zhao
Introducing Semantic-Based Receptive Field into Semantic Segmentation via Graph Neural Networks

Current semantic segmentation models typically use deep learning models as encoders. However, these models have a fixed receptive field, which can mix information within the receptive field and lead to confounding effects during neural network training. To address these limitations, we propose the “semantic-based receptive field”, based on our analysis of current models. This approach seeks to improve segmentation performance by aggregating image patches with similar representations rather than by physical location, aiming to enhance the interpretability and accuracy of semantic segmentation models. For implementation, we incorporate graph representation learning (GRL) approaches into current semantic segmentation models. Specifically, we divide the input image into patches and construct from them graph-structured data that expresses semantic similarity. Our Graph Convolution Receptor (GCR) block uses this graph-structured data, purpose-built from image data, and adopts a node-classification-like perspective to address semantic segmentation. The GCR module models the relationships between semantically related patches, allowing us to mitigate the adverse effects of confounding information and improve the quality of feature representations. We evaluate our proposed module on multiple semantic segmentation models and compare its performance to baseline models on multiple semantic segmentation datasets. Our empirical evaluations demonstrate the effectiveness and robustness of the proposed module, which consistently outperforms the baselines.

Daixi Jia, Hang Gao, Xingzhe Su, Fengge Wu, Junsuo Zhao
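A semantic-based receptive field can be formed by connecting each patch to its most similar patches instead of its spatial neighbors; the k-nearest-neighbor construction below is a simplified reading of that idea, with k as an illustrative choice.

```python
import torch
import torch.nn.functional as F

def build_patch_graph(patch_feats, k=8):
    """Link every image patch to its k most cosine-similar patches,
    ignoring spatial location (a sketch of the semantic receptive field)."""
    f = F.normalize(patch_feats, dim=1)          # (P, d) unit-norm features
    sim = f @ f.t()                              # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                     # forbid self-loops
    nbrs = sim.topk(k, dim=1).indices            # (P, k) neighbor indices
    src = torch.arange(f.size(0)).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])  # edge_index of shape (2, P*k)
```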
Transductive Cross-Lingual Scene-Text Visual Question Answering

Multilingual modeling has gained increasing attention in recent years, as cross-lingual Text-based Visual Question Answering (TextVQA) requires understanding questions and answers across different languages. Current research mainly works on multimodal information, assuming that multilingual pretrained models encode questions effectively. However, the semantic comprehension of a text-based question varies between languages, creating challenges in directly deducing its answer from an image. To this end, we propose a novel multilingual text-based VQA framework suited for cross-language scenarios (CLVQA), transductively considering multiple answer-generating interactions with questions. First, a question reading module densely connects encoding layers in a feedforward manner, which can adaptively work together with answering. Second, a multimodal OCR-based module decouples OCR features in an image into visual, linguistic, and holistic parts to facilitate the localization of a target-language answer. Incorporating enhancements from these two input encoding modules, the proposed framework produces its answer candidates mainly from the input image with an object detection module. Finally, a transductive answering module jointly understands the input multimodal information and the identified answer candidates at the multilingual level, autoregressively generating cross-lingual answers. Extensive experiments show that our framework outperforms state-of-the-art methods on both cross-lingual (English↔Chinese) and mono-lingual (English↔English and Chinese↔Chinese) tasks in terms of accuracy-based metrics. Moreover, significant improvements are achieved in zero-shot cross-lingual settings (French↔Chinese).

Lin Li, Haohan Zhang, Zeqin Fang, Zhongwei Xie, Jianquan Liu
Learning Representations for Sparse Crowd Answers

When collecting answers from crowds over many instances, each worker can only answer a small subset of the instances, so the instance-worker answer matrix is sparse. Solutions for improving the quality of crowd answers, such as answer aggregation, are usually proposed in an unsupervised fashion. In this paper, to enhance the quality of crowd answers used for inferring true answers, we propose a self-supervised solution that effectively learns the latent information in sparse crowd answers. Our method, named CrowdLR, first learns rich instance and worker representations from the crowd answers based on two types of self-supervised signals. We create a multi-task model with a Siamese structure to learn two classification tasks for the two self-supervised signals in one framework. We then utilize the learned representations to fill in the missing answers, and apply answer aggregation methods to the completed answers. Experimental results on real datasets show that our approach can effectively learn representations from crowd answers and improve the performance of answer aggregation, especially when the crowd answers are sparse.

Jiyi Li
Identify Vulnerability Types: A Cross-Project Multiclass Vulnerability Classification System Based on Deep Domain Adaptation

Software Vulnerability Detection (SVD) is an important means of ensuring system security due to the ubiquity of software. Deep learning-based approaches achieve state-of-the-art performance in SVD, but one of the most crucial issues is coping with the scarcity of labeled data in the projects to be detected. One reliable solution is to employ transfer learning to leverage labeled data from other software projects. However, existing cross-project approaches only detect whether function code is vulnerable or not. Identifying vulnerability types is essential because it offers information for patching the vulnerabilities. Our aim in this paper is to propose the first system for cross-project multiclass vulnerability classification. We detect at the granularity of code snippets, which is finer-grained compared to functions and effective for catching inter-procedural vulnerability patterns. After generating code snippets, we define several principles to extract snippet attentions and build a deep model to obtain fused deep features; we then extend different domain adaptation approaches to reduce the feature distribution discrepancy between projects. Experimental results indicate that our system outperforms other state-of-the-art systems.

Gewangzi Du, Liwei Chen, Tongshuai Wu, Chenguang Zhu, Gang Shi
Backmatter
Metadata
Title: Neural Information Processing
Editors: Biao Luo, Long Cheng, Zheng-Guang Wu, Hongyi Li, Chaojie Li
Copyright Year: 2024
Publisher: Springer Nature Singapore
Electronic ISBN: 978-981-9980-76-5
Print ISBN: 978-981-9980-75-8
DOI: https://doi.org/10.1007/978-981-99-8076-5
