2024 | Book

Neural Information Processing

30th International Conference, ICONIP 2023, Changsha, China, November 20–23, 2023, Proceedings, Part IV

Editors: Biao Luo, Long Cheng, Zheng-Guang Wu, Hongyi Li, Chaojie Li

Publisher: Springer Nature Singapore

Book Series: Lecture Notes in Computer Science


About this book

The six-volume set LNCS 14447 to 14452 constitutes the refereed proceedings of the 30th International Conference on Neural Information Processing, ICONIP 2023, held in Changsha, China, in November 2023.
The 652 papers presented in the proceedings set were carefully reviewed and selected from 1274 submissions. They focus on theory and algorithms; cognitive neurosciences; human centred computing; and applications in neuroscience, neural networks, deep learning, and related fields.

Table of Contents


Human Centred Computing

Cross-Modal Method Based on Self-Attention Neural Networks for Drug-Target Prediction

Prediction of drug-target interactions (DTIs) plays a crucial role in drug retargeting, which can save costs and shorten the time needed for drug development. However, existing methods are still unable to integrate the multimodal features of existing DTI datasets. In this work, we propose a new multi-head self-attention neural network approach, called SANN-DTI, for DTI prediction. Specifically, entity embeddings in the knowledge graph are learned using DistMult, then this information interacts with traditional drug and protein representations via multi-head self-attention neural networks, and finally DTIs are computed from the interaction features using fully connected neural networks. SANN-DTI was evaluated in three scenarios across two baseline datasets. After ten-fold cross-validation, our model outperforms the most advanced methods. In addition, SANN-DTI has been applied to drug retargeting of breast cancer via HRBB2 targets. It was found that four of the top ten recommended drugs have been supported by the literature. Ligand-target docking results showed that the second-ranked drug in the recommended list had a clear affinity with HRBB2, which provides a promising approach for better understanding drug mode of action and drug repositioning.

Litao Zhang, Chunming Yang, Chunlin He, Hui Zhang
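
The SANN-DTI abstract above names DistMult as the knowledge-graph embedding model. As a minimal sketch of the standard DistMult scoring function only (not the authors' implementation), a triple (head, relation, tail) is scored by the sum of the element-wise product of the three embedding vectors. The embedding values and the names `drug`, `interacts_with`, and `protein` are hypothetical and chosen purely for illustration.

```python
def distmult_score(head, relation, tail):
    """DistMult triple score: sum over i of head[i] * relation[i] * tail[i]."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("embeddings must share one dimension")
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


# Hypothetical 3-dimensional embeddings, for illustration only.
drug = [1.0, 0.5, -1.0]
interacts_with = [2.0, 1.0, 0.5]
protein = [0.5, 2.0, 1.0]

score = distmult_score(drug, interacts_with, protein)
print(score)
```

Because the score is a plain product over the three vectors, swapping head and tail gives the same value, so DistMult treats every relation as symmetric.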
GRF-GMM: A Trajectory Optimization Framework for Obstacle Avoidance in Learning from Demonstration

Learning from demonstrations (LfD) provides a convenient way to teach a robot skills without manual programming. As an LfD approach, Gaussian mixture model/Gaussian mixture regression (GMM/GMR) has been widely used for its robustness and effectiveness. However, GMM still has many problems when an obstacle that was not present in the original demonstrations appears in the robot's workspace. To address these problems, this paper presents a novel method based on a Gaussian repulsive field-Gaussian mixture model (GRF-GMM) for obstacle avoidance by optimizing the model parameters. A Gaussian repulsive force is calculated through Gaussian functions and applied to the Gaussian components to optimize the mixture distribution learnt from the original demonstrations. Our approach allows the reproduced trajectory to keep a safe distance away from the obstacle. Finally, the feasibility and effectiveness of the proposed method are demonstrated through simulations and experiments.

Bin Ye, Peng Yu, Cong Hu, Binbin Qiu, Ning Tan
SLG-NET: Subgraph Neural Network with Local-Global Braingraph Feature Extraction Modules and a Novel Subgraph Generation Algorithm for Automated Identification of Major Depressive Disorder

Major depressive disorder (MDD) is a severe mental illness that poses significant challenges to both society and families. Recently, several graph neural network (GNN)-based methods have been proposed for MDD diagnosis and have achieved promising results. However, these methods encode the entire braingraph directly and overlook its subgraph structure, which leads to poor specificity to braingraphs. Additionally, the GNN framework they use is rudimentary, resulting in insufficient feature extraction capabilities. In light of these two shortcomings, this paper designs a novel depression diagnosis framework named SLG-NET based on a subgraph neural network. To the best of our knowledge, this study is the first attempt to apply subgraph neural networks to the field of depression diagnosis. In order to enhance the specificity of our model to braingraphs, we propose a novel subgraph generation algorithm based on sub-structure information of the brain. To improve feature extraction capabilities, local and global braingraph feature extraction modules are proposed to extract braingraph properties at both local and global levels. Comprehensive experiments performed on the REST-meta-MDD dataset show that the performance of the proposed SLG-NET significantly surpasses many state-of-the-art methods, which shows that SLG-NET has the potential for auxiliary diagnosis of depression in clinical scenarios.

Yan Zhang, Xin Liu, Panrui Tang, Zuping Zhang
CrowdNav-HERO: Pedestrian Trajectory Prediction Based Crowded Navigation with Human-Environment-Robot Ternary Fusion

Navigating safely and efficiently in complex and crowded scenarios is a challenging problem of practical significance. A realistic and cluttered environmental layout usually significantly impacts crowd distribution and robotic motion decision-making during crowded navigation. However, most previous methods either learn and evaluate navigation strategies in unrealistic barrier-free settings or assume that expensive features like pedestrian speed are available, although accurately measuring pedestrian speed in large-scale scenarios is itself a difficult problem. To fully investigate the impact of static environment layouts on crowded navigation and alleviate the reliance of robots on costly features, we propose a novel crowded navigation framework with Human-Environment-Robot (HERO) ternary fusion named CrowdNav-HERO. Specifically, (i) a simulator that integrates an agent, a variable number of pedestrians, and a series of realistic environments is customized to train and evaluate crowded navigation strategies. (ii) Then, a pedestrian trajectory prediction module is introduced to eliminate the dependence of navigation strategies on pedestrian speed features. (iii) Finally, a novel crowded navigation strategy is designed by combining the pedestrian trajectory predictor and a layout feature extractor. Comparative analysis and benchmark tests demonstrate the superiority of our approach in terms of success rate, collision rate, and cumulative rewards. The code is published at .

Siyi Lu, Bolei Chen, Ping Zhong, Yu Sheng, Yongzheng Cui, Run Liu
Modeling User’s Neutral Feedback in Conversational Recommendation

Conversational recommendation systems (CRS) enable traditional recommender systems to obtain dynamic user preferences through interactive conversations. Although CRS has shown success in generating recommendation lists based on user’s preferences, existing methods restrict users to binary responses, i.e., accept or reject, after a recommendation, which prevents users from fully expressing their needs. In fact, the user’s rejection feedback may contain other valuable information. To address this limitation, we try to refine user’s negative item-level feedback into attribute-level feedback and extend CRS to a more realistic scenario that incorporates not only positive and negative feedback, but also neutral feedback. Neutral feedback denotes incomplete satisfaction with recommended items, which can help CRS infer user’s preferences. To better cope with the new setting, we propose a CRS model called Neutral Feedback in Conversational Recommendation (NFCR). We adopt a joint learning task framework for feature extraction and use inverse reinforcement learning to train the decision network, helping CRS make appropriate decisions at each turn. Finally, we utilize the fine-grained neutral feedback from users to acquire their dynamic preferences in the update and deduction module. We conducted comprehensive evaluations on four benchmark datasets to demonstrate the effectiveness of our model.

Xizhe Li, Chenhao Hu, Weiyang Kong, Sen Zhang, Yubao Liu
A Domain Knowledge-Based Semi-supervised Pancreas Segmentation Approach

The five-year survival rate of pancreatic cancer is extremely low, and the survival time of patients can be extended by timely detection and treatment. Deep learning-based methods have been used to assist radiologists in diagnosis, with remarkable achievements. However, obtaining sufficient labeled data is time-consuming and labor-intensive. Semi-supervised learning is an effective way to alleviate dependence on annotated data by combining it with unlabeled data. However, existing semi-supervised pancreas segmentation works tend to ignore domain knowledge, leading to location and shape bias. In this paper, we propose a semi-supervised pancreas segmentation method based on domain knowledge. Specifically, the prior constraints for different organ sub-regions are used to guide the pseudo-label generation for unlabeled data. Then the bidirectional information flow regularization is designed by further utilizing pseudo-labels, encouraging the model to align the labeled and unlabeled data distributions. Extensive experiments on NIH pancreas datasets show that the proposed method achieves Dice scores of 76.23% and 80.76% under 10% and 20% labeled data, respectively, which is superior to other semi-supervised pancreas segmentation methods.

Siqi Ma, Zhe Liu, Yuqing Song, Yi Liu, Kai Han, Yang Jiang
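
The pancreas segmentation abstract above reports its results as Dice scores. As a minimal sketch of how the standard Dice coefficient is computed for two binary masks (the example masks are made up, and this is not the authors' evaluation code), Dice is twice the overlap divided by the total foreground size of both masks.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * overlap / (size of pred + size of target) for flat 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Made-up flattened masks: 1 marks a pixel labelled as pancreas.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]

dice = dice_coefficient(pred, target)
print(f"Dice = {dice:.4f}")
```

Here two foreground pixels overlap and each mask has three foreground pixels, so the score is 4/6. A reported value such as 76.23% corresponds to 0.7623 on this scale.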
Soybean Genome Clustering Using Quantum-Based Fuzzy C-Means Algorithm

Bioinformatics is a new area of research in which many computer scientists are working to extract useful information from genome sequences in much less time, whereas traditional methods may take years to obtain it. One of the studies that belongs to the area of bioinformatics is protein sequence analysis. In this study, we consider soybean protein sequences, which do not have class information; therefore, clustering of these sequences is required. As these sequences are very complex and overlap with one another, the Fuzzy C-Means (FCM) algorithm may work better than crisp clustering. However, clustering these sequences is very time-consuming, and the results of existing crisp and fuzzy clustering algorithms are not satisfactory. Therefore, we propose a quantum Fuzzy C-Means algorithm that uses quantum computing concepts to represent the dataset in quantum form. The proposed approach also uses the quantum superposition concept, which speeds up the process and gives better results than the FCM algorithm.

Sai Siddhartha Vivek Dhir Rangoju, Keshav Garg, Rohith Dandi, Om Prakash Patel, Neha Bharill
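
The soybean clustering abstract above uses the classical Fuzzy C-Means (FCM) algorithm as its baseline. The sketch below shows only that classical baseline, not the quantum variant the authors propose: each point receives a graded membership in every cluster, and each centre is the membership-weighted mean of the points. The toy one-dimensional data, the initial centres, and the fuzzifier `m=2.0` are assumptions made for illustration.

```python
import math


def fcm(points, centers, m=2.0, iterations=50):
    """Classical Fuzzy C-Means on points given as tuples of floats.

    `centers` are the initial cluster centres; m > 1 is the fuzzifier.
    Returns (centers, memberships), where memberships[j][i] is the degree
    to which point j belonged to cluster i in the last iteration.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # Point coincides with a centre: assign it fully there.
                zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [z / sum(zeros) for z in zeros]
            else:
                row = [1.0 / sum((di / dk) ** exponent for dk in dists)
                       for di in dists]
            memberships.append(row)
        # Centre update: mean of the points weighted by membership ** m.
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(len(points[0]))
            ))
        centers = new_centers
    return centers, memberships


# Toy data: two well-separated groups on a line.
points = [(0.0,), (0.2,), (0.1,), (5.0,), (5.2,), (5.1,)]
centers, memberships = fcm(points, [(0.0,), (1.0,)])
print(centers)
```

Unlike crisp clustering, every row of `memberships` is a distribution over clusters, which is what lets overlapping sequences belong partly to more than one group.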
DAMFormer: Enhancing Polyp Segmentation Through Dual Attention Mechanism

Polyp segmentation has been a challenging problem for researchers because polyps have no specific shape, color, or size. Traditional deep learning models, based on convolutional neural networks (CNNs), struggle to generalize well on unseen datasets. However, the Transformer architecture has shown promising potential in addressing medical problems by effectively capturing long-range dependencies through self-attention. This paper introduces DAMFormer, a Transformer-based model that aims for high accuracy while remaining lightweight. DAMFormer utilizes a Transformer encoder to extract better global information. The Transformer outputs are fed into the ConvBlock and the Enhanced Dual Attention Module to effectively capture high-frequency and low-frequency information. These outputs are further processed through the Effective Feature Fusion module to combine global and local features efficiently. In our experiments, five standard benchmark datasets were used: Kvasir, CVC-Clinic DB, CVC-ColonDB, CVC-T, and ETIS-Larib.

Huy Trinh Quang, Mai Nguyen, Quan Nguyen Van, Linh Doan Bao, Thanh Dang Hong, Thanh Nguyen Tung, Toan Pham Van
BIN: A Bio-Signature Identification Network for Interpretable Liver Cancer Microvascular Invasion Prediction Based on Multi-modal MRIs

Microvascular invasion (MVI) is a critical factor that affects the postoperative cure of hepatocellular carcinoma (HCC). Precise preoperative diagnosis of MVI by magnetic resonance imaging (MRI) is crucial for effective treatment of HCC. Compared with traditional methods, deep learning-based MVI diagnostic models have shown significant improvements. However, the black-box nature of deep learning models poses a challenge to their acceptance in medical fields that demand interpretability. To address this issue, this paper proposes an interpretable deep learning model, called Biosignature Identification Network (BIN), based on multi-modal MRI images, for the liver cancer MVI prediction task. Inspired by the way biology distinguishes species through biosignatures, our proposed BIN method classifies patients into MVI absence (i.e., Non-MVI or negative) and MVI presence (i.e., positive) by utilizing Non-MVI and MVI biosignatures. The adoption of a transparent decision-making process in BIN ensures interpretability, while the proposed biosignatures overcome the limitations associated with manual feature extraction. Moreover, a multi-modal MRI based BIN method is also explored to further enhance the diagnostic performance and to make multi-modal MRI fusion interpretable. Through extensive experiments on a real dataset, it was found that BIN maintains deep model-level performance while providing effective interpretability. Overall, the proposed model offers a promising solution to the challenge of interpreting deep learning-based MVI diagnostic models.

Pengyu Zheng, Bo Li, Huilin Lai, Ye Luo
Human-to-Human Interaction Detection

Different from conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human-to-human interaction detection (HID). HID is devoted to detecting subjects, recognizing person-wise actions, and grouping people according to their interactive relations, in one model. First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I), by adding annotations on interactive relations in a frame-by-frame manner. AVA-I consists of 85,254 frames and 86,338 interactive groups, and each image includes up to 4 concurrent interactive groups. Second, we present a novel baseline approach SaMFormer for HID, containing a visual feature extractor, a split stage which leverages a Transformer-based model to decode action instances and interactive groups, and a merging stage which reconstructs the relationship between instances and groups. All SaMFormer components are jointly trained in an end-to-end manner. Extensive experiments on AVA-I validate the superiority of SaMFormer over representative methods.

Zhenhua Wang, Kaining Ying, Jiajun Meng, Jifeng Ning
Reconstructing Challenging Hand Posture from Multi-modal Input

3D hand reconstruction is critical for immersive VR/AR, action understanding, and human healthcare. Without considering actual skin or texture details, existing solutions have concentrated on recovering hand pose and shape using parametric models or learning techniques. In this study, we introduce a challenging hand dataset, CHANDS, which is composed of articulated precise 3D geometry corresponding to previously unheard-of challenging gestures performed by real hands. Specifically, we construct a multi-view camera setup to acquire multi-view images for initial 3D reconstructions and use a hand tracker to separately capture the skeleton. Then, we present a robust method for reconstructing an articulated geometry and matching the skeleton to the geometry using a template. In addition, we build a hand pose model from CHANDS that covers a wider range of poses and is particularly helpful for difficult poses.

Xi Luo, Yuwei Li, Jingyi Yu
A Compliant Elbow Exoskeleton with an SEA at Interaction Port

In recent years, various series elastic actuators (SEAs) have been proposed to enhance the flexibility and safety of wearable exoskeletons. This paper proposes an SEA composed of wave springs and installs it at the human-robot interaction port. Considering the nonlinear hysteresis characteristics of the SEA, displacement-force models of the SEA are established based on a long short-term memory (LSTM) model and a T-S fuzzy model in a nonlinear autoregressive moving average with exogenous input (NARMAX) structure. Based on the established models, the SEA can effectively serve as an interaction force sensor. Subsequently, the SEA is integrated into an elbow exoskeleton, and a compliant admittance controller is designed based on the displacement-force model. Experimental results demonstrate that the proposed approach effectively enhances the flexibility of human-robot interaction.

Xiuze Xia, Lijun Han, Houcheng Li, Yu Zhang, Zeyu Liu, Long Cheng
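
The exoskeleton abstract above mentions a compliant admittance controller driven by the interaction force that the SEA measures. As a rough sketch of the generic admittance idea only, the code below integrates a virtual mass-damper-spring law, M*a + B*v + K*x = F, to turn a measured force into a reference displacement. The function name, the parameter values, and the constant force are all hypothetical. In the paper the force would come from the SEA displacement-force model, which is not reproduced here.

```python
def simulate_admittance(force, mass=1.0, damping=10.0, stiffness=50.0,
                        dt=0.001, steps=5000):
    """Integrate M*a + B*v + K*x = F with semi-implicit Euler.

    Returns the reference displacement x after `steps` time steps of a
    constant interaction force. All parameter values are illustrative.
    """
    x, v = 0.0, 0.0
    for _ in range(steps):
        # Acceleration of the virtual mass under the applied force.
        a = (force - damping * v - stiffness * x) / mass
        v += a * dt
        x += v * dt
    return x


# A constant 5 N push settles near force / stiffness = 0.1.
x_final = simulate_admittance(force=5.0)
print(f"steady-state displacement: {x_final:.4f}")
```

Lowering the virtual stiffness or damping makes the reference yield more to the same force, which is the sense in which such a controller is compliant.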


Differential Fault Analysis Against AES Based on a Hybrid Fault Model

In this paper, a differential fault analysis (DFA) based on a hybrid fault model is proposed. The hybrid fault model comprises one-byte and multi-byte faults injected into the state. Through both theory and simulations, the keys of AES-128, AES-192, and AES-256 are successfully derived with two, three, and four pairs of faulty ciphertexts, respectively, without exhaustive search (a pair of faulty ciphertexts refers to the correct ciphertext and the corresponding faulty ciphertext obtained after injecting faults). Compared with the latest methods, the proposed method only requires faults injected in a single round, so it is easier for an attacker to carry out. For AES-192, fewer faulty ciphertexts are needed. In addition, for both AES-192 and AES-256, our method requires a shallower fault-injection depth (the entire key can be retrieved by inducing faults only in round T-2). Thus, the DFA proposed in this article is more efficient.

Xusen Wan, Jinbao Zhang, Weixiang Wu, Shi Cheng, Jiehua Wang
Towards Undetectable Adversarial Examples: A Steganographic Perspective

Over the past decade, adversarial examples have demonstrated an increasing ability to fool neural networks. However, most adversarial examples can be easily detected, especially under statistical analysis. Ensuring undetectability is crucial for the success of adversarial examples in practice. In this paper, we borrow the idea of the embedding suitability map from steganography and employ it to modulate the adversarial perturbation. In this way, the adversarial perturbations are concentrated in the hard-to-detect areas and are attenuated in predictable regions. Extensive experiments show that the proposed scheme is compatible with various existing attacks and can significantly boost the undetectability of adversarial examples against both human inspection and statistical analysis at the same attack ability. The code is available at .

Hui Zeng, Biwei Chen, Rongsong Yang, Chenggang Li, Anjie Peng
On Efficient Federated Learning for Aerial Remote Sensing Image Classification: A Filter Pruning Approach

To promote the application of federated learning in resource-constrained unmanned aerial vehicle swarms, we propose a novel efficient federated learning framework CALIM-FL, short for Cross-All-Layers Importance Measure pruning-based Federated Learning. In CALIM-FL, an efficient one-shot filter pruning mechanism is intertwined with the standard FL procedure. The model size is adapted during FL to reduce both communication and computation overhead at the cost of a slight accuracy loss. The novelties of this work come from the following two aspects: 1) a more accurate importance measure on filters from the perspective of the whole neural network; and 2) a communication-efficient one-shot pruning mechanism without data transmission from the devices. Comprehensive experiment results show that CALIM-FL is effective in a variety of scenarios, with a resource overhead saving of 88.4% at the cost of 1% accuracy loss.

Qipeng Song, Jingbo Cao, Yue Li, Xueru Gao, Chengzhi Shangguan, Linlin Liang
ASGNet: Adaptive Semantic Gate Networks for Log-Based Anomaly Diagnosis

Logs are widely used in the development and maintenance of software systems. Logs can help engineers understand the runtime behavior of systems and diagnose system failures. For anomaly diagnosis, existing methods generally use log event data extracted from historical logs to build diagnostic models. However, we find that existing methods do not make full use of two types of features. (1) Statistical features: some inherent statistical features in log data, such as word frequency and abnormal label distribution, are not well exploited. Compared with raw log data, statistical features are deterministic and naturally compatible with the corresponding tasks. (2) Semantic features: logs contain the execution logic behind software systems, so log statements share deep semantic relationships. How to effectively combine statistical features and semantic features in log data to improve the performance of log anomaly diagnosis is the key point of this paper. We propose adaptive semantic gate networks (ASGNet), which combine statistical features and semantic features and selectively use statistical features to consolidate the semantic representation of log text. Specifically, ASGNet encodes statistical features via a variational encoding module and fuses useful information through a well-designed adaptive semantic threshold mechanism. The threshold mechanism introduces the information flow into the classifier based on the confidence of the semantic features in the decision, which is conducive to training a robust classifier and can solve the overfitting problem caused by the use of statistical features. The experimental results on a real dataset show that our proposed method is superior to all baseline methods in terms of various performance indicators.

Haitian Yang, Degang Sun, Wen Liu, Yanshu Li, Yan Wang, Weiqing Huang
Propheter: Prophetic Teacher Guided Long-Tailed Distribution Learning

The problem of deep long-tailed learning, a prevalent challenge in the realm of generic visual recognition, persists in a multitude of real-world applications. To tackle the heavily-skewed dataset issue in long-tailed classification, prior efforts have sought to augment existing deep models with elaborate class-balancing strategies, such as class rebalancing, data augmentation, and module improvement. Despite the encouraging performance, the limited class knowledge of the tailed classes in the training dataset still bottlenecks the performance of the existing deep models. In this paper, we propose an innovative long-tailed learning paradigm that breaks the bottleneck by guiding the learning of deep networks with external prior knowledge. This is specifically achieved by devising an elaborate “prophetic” teacher, termed “Propheter”, that aims to learn the potential class distributions. The target long-tailed prediction model is then optimized under the instruction of the well-trained “Propheter”, such that the distributions of different classes are as distinguishable as possible from each other. Experiments on eight long-tailed benchmarks across three architectures demonstrate that the proposed prophetic paradigm acts as a promising solution to the challenge of limited class knowledge in long-tailed datasets. The developed code is publicly available at .

Wenxiang Xu, Yongcheng Jing, Linyun Zhou, Wenqi Huang, Lechao Cheng, Zunlei Feng, Mingli Song
Sequential Transformer for End-to-End Person Search

Person Search aims to simultaneously localize and recognize a target person from realistic and uncropped gallery images. One major challenge of person search comes from the contradictory goals of the two sub-tasks, i.e., person detection focuses on finding the commonness of all persons so as to distinguish persons from the background, while person re-identification (re-ID) focuses on the differences among different persons. In this paper, we propose a novel Sequential Transformer (SeqTR) for end-to-end person search to deal with this challenge. Our SeqTR contains a detection transformer and a novel re-ID transformer that sequentially addresses detection and re-ID tasks. The re-ID transformer comprises the self-attention layer that utilizes contextual information and the cross-attention layer that learns local fine-grained discriminative features of the human body. Moreover, the re-ID transformer is shared and supervised by multi-scale features to improve the robustness of learned person representations. Extensive experiments on two widely-used person search benchmarks, CUHK-SYSU and PRW, show that our proposed SeqTR not only outperforms all existing person search methods with a 59.3% mAP on PRW but also achieves comparable performance to the state-of-the-art results with an mAP of 94.8% on CUHK-SYSU.

Long Chen, Jinhua Xu
Multi-scale Structural Asymmetric Convolution for Wireframe Parsing

Extracting salient line segments with their corresponding junctions is a promising method for structural environment recognition. However, conventional methods extract these structural features using square convolution, whose geometric properties are incompatible with these primitives; this greatly restricts model performance and leads to unsatisfactory wireframes. In this paper, we propose a Multi-scale Structural Asymmetric Convolution for Wireframe Parsing (MSACWP) to simultaneously infer prominent junctions and line segments from images. Benefiting from the similar geometric properties of asymmetric convolution and line segments, the proposed Multi-Scale Asymmetric Convolution (MSAC) effectively captures long-range context features and suppresses irrelevant information from adjacent pixels. Besides, feature maps obtained from different stages in the decoder layers are combined using a Multi-Scale Feature Combination module (MSFC) to improve the multi-scale feature representation capacity of the backbone network. Extensive experiments on two public datasets (Wireframe and YorkUrban) are conducted to demonstrate the advantages of our proposed MSACWP compared with previous state-of-the-art methods.

Jiahui Zhang, Jinfu Yang, Fuji Fu, Jiaqi Ma
S3ACH: Semi-Supervised Semantic Adaptive Cross-Modal Hashing

Hash learning has been a great success in the large-scale data retrieval field because of its high retrieval efficiency and low storage consumption. However, labels for large-scale data are difficult to obtain, so supervised learning-based hashing methods are no longer applicable. In this paper, we introduce a method called Semi-Supervised Semantic Adaptive Cross-modal Hashing (S3ACH), which improves the performance of unsupervised hash retrieval by exploiting a small amount of available label information. Specifically, we first propose a higher-order dynamic weight public space collaborative computing method, which balances the contribution of different modalities in the common potential space by invoking an adaptive higher-order dynamic variable. Then, the limited available label information is utilized to enhance the semantics of hash codes. Finally, we propose a discrete optimization strategy to solve the quantization error brought by the relaxation strategy and improve the accuracy of hash code production. The results show that S3ACH achieves better results than current advanced unsupervised methods and is more widely applicable, with balanced performance, than existing cross-modal hashing methods.

Liu Yang, Kaiting Zhang, Yinan Li, Yunfei Chen, Jun Long, Zhan Yang
Intelligent UAV Swarm Planning Based on Undirected Graph Model

The coordination of multiple drones for formation flight and collaborative task execution in the air, known as drone cluster control, has become a key research focus in recent years. Collision avoidance during formation maintenance remains a challenging aspect of cluster control, as traditional cluster control and path planning algorithms struggle to enable individual drones to independently avoid obstacles and maintain formation within the cluster. To address this issue, this paper proposes a cluster modeling method based on an undirected graph, which optimizes the entire model by adding constraints. A distributed system is also utilized to plan the entire cluster, improving the efficiency and robustness of the system and allowing individual drones to execute their flight tasks independently. Experimental verification was conducted in a ROS-based simulation environment, and the results demonstrate that our proposed algorithm effectively maintains the formation of drone clusters with high performance and stability.

Tianyi Lv, Qingyuan Xia, Qiwen Zheng
Learning Item Attributes and User Interests for Knowledge Graph Enhanced Recommendation

Knowledge Graphs (KGs) manifest great potential in recommendation. This is ascribable to the rich attribute information contained in KGs, such as the price attribute of goods, which is further integrated into item and user representations and improves recommendation performance as side information. However, existing knowledge-aware methods leverage attribute information at a coarse-grained level in two aspects: (1) item representations do not accurately learn the distributional characteristics of different attributes, and (2) user representations do not sufficiently recognize the pattern of user preferences towards attributes. In this paper, we propose a novel attentive knowledge graph attribute network (AKGAN) to learn item attributes and user interests via attribute information in KGs. Technically, AKGAN adopts a novel graph neural network framework, which has a different design between the first layer and the latter layer. The first layer merges one-hop neighbors’ attribute information by a concatenation operation to avoid breaking down the independence of different attributes, and the latter layer recursively propagates attribute information without decreasing the weight of significant high-order neighbors. With one attribute placed in the corresponding range of element-wise positions, AKGAN employs a novel interest-aware attention unit, which relaxes the constraint that the attention weights sum to 1, to model the complexity and personality of user interests. Experimental results on three benchmark datasets show that AKGAN achieves significant improvements over the state-of-the-art methods. Further analyses show that AKGAN offers interpretable explanations for user preferences towards attributes.

Zepeng Huai, Guohua Yang, Jianhua Tao, Dawei Zhang
Multi-view Stereo by Fusing Monocular and a Combination of Depth Representation Methods

The design of plane-sweep deep MVS primarily relies on patch-similarity based matching. However, this approach becomes impractical when dealing with low-textured, similar-textured and reflective regions in the scene, resulting in inaccurate matching results. One way to avoid this kind of error is to incorporate semantic information in the matching process. In this paper, we propose an end-to-end method that uses monocular depth estimation to add semantic information to deep MVS. Additionally, we analyze the advantages and disadvantages of two main depth representations and propose a collaborative method to alleviate their drawbacks. Finally, we introduce a novel filtering criterion named Distribution Consistency, which can effectively filter out outliers with poor probability distribution, such as uniform distribution, to further enhance the reconstruction quality.

Fanqi Yu, Xinyang Sun
A Fast and Scalable Frame-Recurrent Video Super-Resolution Framework

Video super-resolution (VSR) methods based on deep learning have become the mainstream VSR methods and have been widely used in various fields. Although many deep learning-based VSR methods have been proposed, they cannot be applied to real-time VSR tasks because of their heavy computation and memory consumption. Lightweight VSR networks have faster inference speeds, but their super-resolution performance leaves room for improvement. In this paper, we analyze the explicit and implicit motion compensation methods commonly used in VSR networks and design a fast and scalable frame-recurrent VSR network (FFRVSR). FFRVSR incorporates the Frame-Recurrent Network and the Recurrent-Residual Network. This network structure can extract information from low-resolution video frames more efficiently and alleviate error accumulation during inference. We also design a super-resolution flow estimation network (SRFnet) that can more accurately estimate optical flow between video frames while reducing the introduction of erroneous information. Extensive experiments demonstrate that the proposed FFRVSR surpasses state-of-the-art methods in terms of inference speed. FFRVSR also has strong scalability and can be adapted for both real-time video super-resolution tasks and high-quality video super-resolution tasks.

Kaixuan Hou, Jianping Luo
Structural Properties of Associative Knowledge Graphs

This paper introduces a novel structural approach to constructing associative knowledge graphs. These graphs are composed of many overlapping scenes, with each scene representing a specific set of objects. In the knowledge graph, each scene is represented as a complete subgraph associating scene objects. Knowledge graph nodes represent various objects present within the scenes. The same object can appear in multiple scenes. The recreation of the stored scenes from the knowledge graph occurs through association with a given context, which includes some of the objects stored in the graph. The memory capacity of the system is determined by the size of the graph and the density of its synaptic connections. Theoretical dependencies are derived to describe both the critical graph density and the memory capacity of scenes stored in such graphs. The critical graph density represents the maximum density at which it is possible to reproduce all elements of the scene without errors.

Janusz A. Starzyk, Przemysław Stokłosa, Adrian Horzyk, Paweł Raif
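
The associative knowledge graph abstract above describes scenes stored as complete subgraphs and recalled from a partial context. A minimal sketch of that idea follows, assuming a simple recall rule (keep every object connected to all context objects); the rule, the function names `build_graph` and `recall`, and the toy scenes are illustrative assumptions, not the authors' algorithm.

```python
from itertools import combinations


def build_graph(scenes):
    """Store each scene as a complete subgraph (clique) over its objects."""
    adj = {}
    for scene in scenes:
        for obj in scene:
            adj.setdefault(obj, set())
        for a, b in combinations(scene, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def recall(adj, context):
    """Return the context plus every object linked to ALL context objects.

    Assumes a non-empty context whose objects are all stored in the graph.
    """
    context = set(context)
    linked_to_all = set.intersection(*(adj[c] for c in context))
    return context | linked_to_all


# Toy scenes; "cup" and "lamp" each appear in two scenes.
scenes = [
    {"cup", "table", "lamp"},
    {"cup", "sink", "soap"},
    {"lamp", "desk", "book"},
]
adj = build_graph(scenes)
exact = recall(adj, {"cup", "table"})
ambiguous = recall(adj, {"cup"})
```

Here the two-object context recovers the first scene exactly, while the single object "cup" also pulls in the second scene. As overlapping scenes make the graph denser, such spurious associations become more likely, which is the intuition behind a critical density.
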
Nonlinear NN-Based Perturbation Estimator Designs for Disturbed Unmanned Systems

This paper addresses the challenge of estimating perturbations in a classical unmanned system caused by a combination of internal uncertainties within the system and external disturbances. To accurately approximate these hard-to-measure perturbations, a novel nonlinear radial basis function neural network (RBFNN)-based estimator is introduced. This estimator is designed to reconstruct the perturbation structure effectively. The study demonstrates that utilizing RBFNN-based estimator designs, coupled with Lyapunov stability analysis, leads to achieving asymptotic estimation results. The effectiveness of the proposed perturbation estimation approach is validated through simulations conducted on both an unmanned marine system and a quadrotor system.

Xingcheng Tong, Xiaozheng Jin
DOS Dataset: A Novel Indoor Deformable Object Segmentation Dataset for Sweeping Robots

Path planning for sweeping robots requires avoiding specific obstacles, particularly deformable objects such as socks, ropes, faeces, and plastic bags. These objects can cause secondary pollution or hinder the robot’s cleaning capabilities. However, there is a lack of specific datasets for deformable obstacles in indoor environments. Existing datasets either focus on outdoor scenes or lack semantic segmentation annotations for deformable objects. In this paper, we introduce the first dataset for detecting and segmenting deformable objects in indoor sweeping robot scenarios, DOS Dataset. We believe that DOS will catalyze research in semantic segmentation of deformable objects for indoor robot obstacle avoidance applications.

Zehan Tan, Weidong Yang, Zhiwei Zhang
Leveraging Sound Local and Global Features for Language-Queried Target Sound Extraction

Language-queried target sound extraction is a fundamental audio-language task that aims to estimate the audio signal of the target sound event class by a natural language expression in a sound mixture. One of the key challenges of this task is leveraging the language expression to highlight the target sound features in the noisy mixture interpretably. In this paper, we leverage language expression to guide the model to extract the most informative features of the target sound event by adaptively using local and global features, and we present a novel language-aware synergic attention network (LASA-Net) for language-queried target sound extraction, as the first attempt to leverage local and global operations using language representation to extract target sound in single or multiple sound source environments. In particular, language-aware synergic attention consists of a local operation submodule, a global operation submodule, and an interaction submodule, in which local and global operation submodules extract sound local and global features while the interaction submodule adaptively selects the most discriminative features with the guidance of linguistic features. In addition, we introduce a linguistic-acoustic fusion module that leverages the well-proven correlation modeling power of self-attention for excavating helpful multi-modal contexts. Extensive experiments demonstrate that our proposed LASA-Net is able to achieve state-of-the-art performance while maintaining an attractive computational complexity.

Xinmeng Xu, Yiqun Zhang, Yuhong Yang, Weiping Tu
PEVLR: A New Privacy-Preserving and Efficient Approach for Vertical Logistic Regression

In our paper, we consider logistic regression in vertical federated learning. A new algorithm called PEVLR (Privacy-preserving and Efficient Vertical Logistic Regression) is proposed to efficiently solve vertical logistic regression with privacy preservation. To enhance the communication and computational efficiency, we design a novel local-update and global-update scheme for party $$\mathcal{A}$$ and party $$\mathcal{B}$$, respectively. For the local update, we utilize hybrid SGD rather than vanilla SGD to mitigate the variance resulting from stochastic gradients. For the global update, the full gradient is adopted to update the parameter of party $$\mathcal{B}$$, which leads to a faster convergence rate and fewer communication rounds. Furthermore, we design a simple but efficient plan to exchange intermediate information with a privacy-preserving guarantee. Specifically, random matrix sketches and randomly selected permutations are utilized to ensure the security of the original data, label information and parameters under the honest-but-curious assumption. The experimental results show the advantages of PEVLR in terms of convergence rate, accuracy and efficiency, compared with other related models.

Sihan Mao, Xiaolin Zheng, Jianguang Zhang, Xiaodong Hu
Semantic-Pixel Associative Information Improving Loop Closure Detection and Experience Map Building for Efficient Visual Representation

RatSLAM is a brain-inspired simultaneous localization and mapping (SLAM) system based on the rodent hippocampus model, which is used to construct the experience map for environments. However, the map it constructs has the problems of low mapping accuracy and poor adaptability to changing lighting environments due to the simple visual processing method. In this paper, we present a novel RatSLAM system by using more complex semantic object information for loop closure detection (LCD) and experience map building, inspired by the effectiveness of semantic information for scene recognition in the biological brain. Specifically, we calculate the similarity between current and previous scenes in LCD based on the pixel information computed by the sum of absolute differences (SAD) and the semantic information extracted by the YOLOv2 network. Then we build an enhanced experience map with object-level information, where the 3D model segmentation technology is used to perform instance semantic segmentation on the recognized objects. By fusing complex semantic information in visual representation, the proposed model can successfully mitigate the impact of illumination and fully express the multi-dimensional information in the environment. Experimental results on the Oxford New College, City Center, and Lab datasets demonstrate its superior LCD accuracy and mapping performance, especially for environments with changing illumination.

Yufei Deng, Rong Xiao, Jiaxin Li, Jiancheng Lv
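
The RatSLAM abstract above compares scenes partly through the sum of absolute differences (SAD) over pixels. The sketch below shows plain SAD on two small grayscale frames stored as nested lists; the function name `sad` and the toy frames are hypothetical, and the semantic (YOLOv2) part of the similarity is not modeled here.

```python
def sad(frame_a, frame_b):
    """Sum of absolute differences between two equally sized grayscale frames."""
    if len(frame_a) != len(frame_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(frame_a, frame_b)
    ):
        raise ValueError("frames must have the same shape")
    return sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )


# Toy 2x2 frames with pixel intensities; a lower SAD means more similar views.
current = [[10, 20], [30, 40]]
previous = [[12, 18], [30, 45]]
difference = sad(current, previous)
```

Because SAD works directly on raw intensities, a global lighting change shifts every pixel and inflates the score, which is consistent with the abstract's motivation for adding semantic information.
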
Knowledge Distillation via Information Matching

Knowledge distillation can enhance network generalization by guiding a smaller student network to learn from a more complex teacher network. The challenge lies in maximizing the performance of the student network under the supervision of the teacher network. Currently, the feature-based distillation approach utilizes the middle-layer features of the teacher network to improve the performance of the student network. However, this approach lacks a measure to evaluate the content of the information present in the intermediate layers of both the teacher and student networks, which leads to a distillation mismatch of features and damages the student’s performance. In this study, we propose a new feature distillation method to solve this problem. We measure the information content in the intermediate layers of the teacher and student networks based on the receptive fields of corresponding features. Subsequently, the suitable number and locations of transmission features are decided based on information content, effectively alleviating the risk of information mismatch during distillation. Our experimental results demonstrate that the proposed method significantly improves the performance of the student network.

Honglin Zhu, Ning Jiang, Jialiang Tang, Xinlei Huang
CenAD: Collaborative Embedding Network for Anomaly Detection with Leveraging Partially Observed Anomalies

Leveraging observed anomalies in anomaly detection can significantly improve detection accuracy. Assuming that observed anomalies cover all anomaly distributions, existing methods commonly learn the anomaly distributions from these observed anomalies and assign each object an anomaly score according to its similarity to the observed anomalies. However, these observed anomalies may only partially cover the anomaly distributions, which severely restrains the performance in detecting uncovered anomalies. To address this issue, we propose a novel collaborative embedding network for this task, named CenAD. By leveraging partially observed anomalies, the collaborative learning derives a loss with maximum neighbor dispersion and minimum volume estimation as guidance to make anomalies more dispersed. Each object is assigned an anomaly score according to its contribution to data dispersion, which distinguishes these anomalies from the entire data effectively. To investigate the effectiveness of CenAD with partially observed anomalies, we conduct extensive experiments on several datasets to validate the superiority of our method, in which we obtain an average improvement of up to 13.92% in AUC-ROC and 29.44% in AUC-PR compared with previous methods.

Li Cheng, Bin Li, Renjie He, Feng Yao
PAG: Protecting Artworks from Personalizing Image Generative Models

Recent advances in conditional image generation have led to powerful personalized generation models that generate high-resolution artistic images based on simple text descriptions through tuning. However, the abuse of personalized generation models may also increase the risk of plagiarism and the misuse of artists’ painting styles. In this paper, we propose a novel method called Protecting Artworks from Personalizing Image Generative Models framework (PAG) to safeguard artistic images from the malicious use of generative models. By injecting learned target perturbations into the original artistic images, we aim to disrupt the tuning process and introduce the distortions that protect the authenticity and integrity of the artist’s style. Furthermore, human evaluations suggest that our PAG model offers a feasible and effective way to protect artworks, preventing the personalized generation models from generating similar images to the given artworks.

Zhaorui Tan, Siyuan Wang, Xi Yang, Kaizhu Huang
Attention Based Spatial-Temporal Dynamic Interact Network for Traffic Flow Forecasting

The prediction of spatio-temporal traffic flow data is challenging due to the complex dynamics among different roads. Existing approaches often focused on capturing traffic patterns at a single temporal granularity, disregarding spatio-temporal interactions and relying heavily on prior knowledge. However, this limits the generality of the models and their ability to adapt to dynamic changes in traffic patterns. We argue that traffic flow changes co-occur in the road network’s temporal and spatial dimensions, which leads to commonalities and regularities in the data across these dimensions, with their dynamic changes depending on the temporal granularity. In this research, we propose an attention based spatio-temporal dynamic interaction network consisting of a spatio-temporal interaction filtering module and a spatio-temporal dynamic perception module. The interaction filtering module captures commonalities and regularities from a global perspective, ensuring adherence to the temporal and spatial dimensions of the road network structure. The dynamic perception module incorporates a sliding window attention mechanism to capture local dynamic correlations between the temporal and spatial dimensions at different time granularities. To address the issue of time series span, we design a more adaptive time-aware attention mechanism that effectively captures the impact of time intervals. Extensive experiments on four real-world datasets demonstrate that our approach achieves state-of-the-art performance and consistently outperforms other baseline methods. The source code is available at

Junwei Xie, Liang Ge, Haifeng Li, Yiping Lin
Staged Long Text Generation with Progressive Task-Oriented Prompts

Generating coherent and consistent long text remains a challenge for artificial intelligence. The state-of-the-art paradigm partitions the whole generation process into successive stages; however, the content plan applied in each stage may be error-prone, and fine-tuning large-scale language models, one for each stage, is resource-consuming. In this paper, we follow the above paradigm and devise three stages: keyphrase decompression, transition paraphrase, and text generation. We leverage task-oriented prompts to direct text production in each stage, which improves the quality of the generated text. Further, we propose a new content plan representation with elastic mask tokens to reduce model bias and irregular words. Moreover, we introduce length control and commonsense knowledge prompts to increase the adaptability of the proposed model. Extensive experiments conducted on two challenging tasks demonstrate that our model outperforms strong baselines significantly, and it is able to generate longer high-quality texts with fewer parameters.

Xingjin Wang, Linjing Li, Daniel Zeng
Learning Stable Nonlinear Dynamical System from One Demonstration

Dynamic systems (DS) methods constitute one of the most commonly employed frameworks for Learning from Demonstration (LfD). The field of LfD aims to enable robots or other agents to learn new skills or behaviors by observing human demonstrations, and DS provide a powerful tool for modeling and reproducing such behaviors. Due to their ability to capture complex and nonlinear patterns of movement, DS have been successfully applied in robotics applications. This paper presents a new LfD method based on DS. The proposed method ensures that the learned systems achieve global asymptotic stability, a valuable property that guarantees the convergence of the system to an equilibrium point from any initial condition. The original trajectory is first transformed to a higher-dimensional space and then subjected to a diffeomorphism transformation. This transformation maps the transformed trajectory forward to a straight line that converges towards the zero point. By deforming the trajectories in this way, the resulting system ensures global asymptotic stability for all generated trajectories.

Yu Zhang, Lijun Han, Zirui Wang, Xiuze Xia, Houcheng Li, Long Cheng
Towards High-Performance Exploratory Data Analysis (EDA) via Stable Equilibrium Point

Exploratory data analysis (EDA) is a vital procedure in data science projects. In this work, we introduce a stable equilibrium point (SEP)-based framework for improving the performance of EDA. By exploiting the SEPs as representative points, our approach aims to generate high-quality clustering and data visualization for real-world data sets. A unique property of the proposed method is that the SEPs directly encode the clustering properties of data sets. Compared with prior state-of-the-art clustering and data visualization methods, the proposed methods substantially improve solution quality for large-scale data analysis tasks. For instance, for the USPS data set, our method achieves more than $$10\%$$ clustering accuracy gain over the standard spectral clustering algorithm and a 3X speedup for the t-SNE visualization.

Yuxuan Song, Yongyu Wang
MVFAN: Multi-view Feature Assisted Network for 4D Radar Object Detection

4D radar is recognized for its resilience and cost-effectiveness under adverse weather conditions, thus playing a pivotal role in autonomous driving. While cameras and LiDAR are typically the primary sensors used in perception modules for autonomous vehicles, radar serves as a valuable supplementary sensor. Unlike LiDAR and cameras, radar remains unimpaired by harsh weather conditions, thereby offering a dependable alternative in challenging environments. Developing radar-based 3D object detection not only augments the competency of autonomous vehicles but also provides economic benefits. In response, we propose the Multi-View Feature Assisted Network (MVFAN), an end-to-end, anchor-free, and single-stage framework for 4D-radar-based 3D object detection for autonomous vehicles. We tackle the issue of insufficient feature utilization by introducing a novel Position Map Generation module to enhance feature learning by reweighing foreground and background points, and their features, considering the irregular distribution of radar point clouds. Additionally, we propose a pioneering backbone, the Radar Feature Assisted backbone, explicitly crafted to fully exploit the valuable Doppler velocity and reflectivity data provided by the 4D radar sensor. Comprehensive experiments and ablation studies carried out on Astyx and VoD datasets attest to the efficacy of our framework. The incorporation of Doppler velocity and RCS reflectivity dramatically improves the detection performance for small moving objects such as pedestrians and cyclists. Consequently, our approach culminates in a highly optimized 4D-radar-based 3D object detection capability for autonomous driving systems, setting a new standard in the field.

Qiao Yan, Yihan Wang
Time Series Anomaly Detection with a Transformer Residual Autoencoder-Decoder

Time series anomaly detection is of great importance in a variety of domains such as financial fraud detection, industrial production, and information systems. However, due to the complexity and multiple periodicity of time series, extracting global and local information from different perspectives remains a challenge. In this paper, we propose a novel Transformer Residual Autoencoder-Decoder model called TRAD for time series anomaly detection, which is based on a multi-interval sampling strategy combined with residual learning and a stacked autoencoder-decoder to promote the ability to learn global and local information. Prediction error is applied to calculate anomaly scores using the proposed model at different scales, and the aggregated anomaly scores are utilized to infer outliers of the time series. Extensive experiments are conducted on five datasets and the results demonstrate that the proposed model outperforms the previous state-of-the-art baselines.

Shaojie Wang, Yinke Wang, Wenzhong Li
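
The TRAD abstract above scores anomalies from prediction error at several scales and aggregates the scores to infer outliers. A minimal sketch of that scoring step follows, assuming absolute error, mean aggregation, and a fixed threshold; all three choices, the function names, and the toy numbers are illustrative assumptions, since the abstract does not specify them.

```python
def anomaly_scores(observed, predictions_by_scale):
    """Per-timestep score: mean absolute prediction error across scales (assumed rule)."""
    n_scales = len(predictions_by_scale)
    scores = []
    for t, x in enumerate(observed):
        errors = [abs(x - preds[t]) for preds in predictions_by_scale]
        scores.append(sum(errors) / n_scales)
    return scores


def flag_outliers(scores, threshold):
    """Indices whose aggregated score exceeds a hand-picked threshold."""
    return [t for t, s in enumerate(scores) if s > threshold]


# Toy series with a spike at index 3 and predictions from two hypothetical scales.
observed = [1.0, 1.1, 0.9, 5.0, 1.0]
predictions_by_scale = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.1, 1.0, 0.9, 1.2, 1.0],
]
scores = anomaly_scores(observed, predictions_by_scale)
flagged = flag_outliers(scores, threshold=1.0)
```

In a real system the predictions would come from the trained model rather than fixed lists, and the threshold would be chosen on validation data.
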
Adversarial Example Detection with Latent Representation Dynamic Prototype

In the realm of Deep Neural Networks (DNNs), one of the primary concerns is their vulnerability in adversarial environments, whereby malicious attackers can easily manipulate them. As such, identifying adversarial samples is crucial to safeguarding the security of DNNs in real-world scenarios. In this work, we propose a method of adversarial example detection. Our approach uses a Latent Representation Dynamic Prototype to sample more generalizable latent representations from a learnable Gaussian distribution, which relaxes the detection dependency on the nearest neighbour’s latent representation. Additionally, we introduce Random Homogeneous Sampling (RHS) to replace KNN when sampling reference samples, resulting in a lower inference time complexity of O(1). Lastly, we use cross-attention in the adversarial discriminator to capture the evolutionary differences of latent representations in benign and adversarial samples by comparing the latent representations from inference and reference samples globally. We conducted experiments to evaluate our approach and found that it performs competitively in the gray-box setting against various attacks with two $$\mathcal{L}_p$$-norm constraints for the CIFAR-10 and SVHN datasets. Moreover, our detector trained with the PGD attack exhibited detection ability for unseen adversarial samples generated by other adversarial attacks with small perturbations, ensuring its generalization ability in different scenarios.

Taowen Wang, Zhuang Qian, Xi Yang
A Multi-scale and Multi-attention Network for Skin Lesion Segmentation

Accurately segmenting the diseased areas from dermoscopy images is highly meaningful for the diagnosis of skin cancer, and in recent years, methods based on deep convolutional neural networks have become the mainstream for automatic segmentation of skin lesions. Although these methods have made significant improvements in the field of skin lesion segmentation, capturing long-range dependencies remains a major challenge for convolutional neural networks. In order to address this limitation, this paper proposes a deep learning model for skin lesion segmentation called the Multi-Scale and Multi-Attention Network (MSMA-Net). The encoder part utilizes a pretrained ResNet for feature extraction. In the skip connection part, we adopt a novel non-local method called the Fully Attentional Block (FLA), which effectively obtains long-range contextual information and retains attentions in all dimensions. In the decoder part, we propose a multi-attention decoder that consists of four attention modules, allowing effective attention to be given to the feature maps in three dimensions: spatial, channel, and scale. We conducted experiments on two publicly available skin lesion segmentation datasets, ISIC 2017 and ISIC 2018, and the results demonstrate that MSMA-Net outperforms other methods, confirming the effectiveness of MSMA-Net.

Cong Wu, Hang Zhang, Dingsheng Chen, Haitao Gan
Temporal Attention for Robust Multiple Object Pose Tracking

Estimating the pose of multiple objects has improved substantially since deep learning became widely used. However, the performance deteriorates when the objects are highly similar in appearance or when occlusions are present. This issue is usually addressed by leveraging temporal information that takes previous frames as priors to improve the robustness of estimation. Existing methods are either computationally expensive by using multiple frames, or are inefficiently integrated with ad hoc procedures. In this paper, we perform computationally efficient object association between two consecutive frames via attention through a video sequence. Furthermore, instead of heatmap-based approaches, we adopt a coordinate classification strategy that excludes post-processing, where the network is built in an end-to-end fashion. Experiments on real data show that our approach achieves state-of-the-art results on PoseTrack datasets.

Zhongluo Li, Junichiro Yoshimoto, Kazushi Ikeda
Correlation Guided Multi-teacher Knowledge Distillation

Knowledge distillation is a model compression technique that transfers knowledge from a redundant and strong network (teacher) to a lightweight network (student). Due to the limitations of a single teacher’s perspective, researchers advocate for the inclusion of multiple teachers to facilitate a more diverse and accurate acquisition of knowledge. However, the current multi-teacher knowledge distillation methods only consider the integrity of integrated knowledge from the teachers’ level in teacher weight assignments, which largely ignores the student’s preference for knowledge. This will result in inefficient and redundant knowledge transfer, thereby limiting the learning effect of the student network. To more efficiently integrate teacher knowledge suitable for student learning, we propose Correlation Guided Multi-Teacher Knowledge Distillation (CG-MTKD), which utilizes the feedback of the student’s learning effects to achieve the purpose of integrating the student’s preferred knowledge. Through extensive experiments on two public datasets, CIFAR-10 and CIFAR-100, we demonstrate that our method, CG-MTKD, can effectively integrate the knowledge of student preferences during teacher weight assignments.

Luyao Shi, Ning Jiang, Jialiang Tang, Xinlei Huang
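
For the multi-teacher knowledge distillation abstract above, the sketch below shows the generic weighted soft-target loss that such methods build on: teacher distributions are softened with a temperature, mixed with per-teacher weights, and compared to the student's distribution by KL divergence. The weights here are fixed by hand; how CG-MTKD derives them from student feedback is not described in the abstract and is not modeled. All names and numbers are hypothetical.

```python
import math


def softmax_t(logits, temperature):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, temperature=4.0):
    """Temperature-scaled KL between the weighted teacher mixture and the student.

    The weights are assumed to be non-negative and to sum to 1.
    """
    student = softmax_t(student_logits, temperature)
    target = [0.0] * len(student)
    for w, teacher_logits in zip(weights, teacher_logits_list):
        probs = softmax_t(teacher_logits, temperature)
        target = [t + w * p for t, p in zip(target, probs)]
    return temperature ** 2 * kl_divergence(target, student)


# Toy logits for a 3-class problem with two teachers and hand-picked weights.
student_logits = [1.0, 2.0, 3.0]
teachers = [[0.5, 2.5, 3.5], [3.0, 1.0, 0.0]]
loss = multi_teacher_kd_loss(student_logits, teachers, weights=[0.7, 0.3])
matched = multi_teacher_kd_loss(student_logits, [student_logits], weights=[1.0])
```

The loss is zero when the student already matches the mixed teacher distribution and grows as the two diverge.
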
