2025 | Book

Pattern Recognition and Computer Vision

7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part I

Edited by: Zhouchen Lin, Ming-Ming Cheng, Ran He, Kurban Ubul, Wushouer Silamu, Hongbin Zha, Jie Zhou, Cheng-Lin Liu

Publisher: Springer Nature Singapore

Book series: Lecture Notes in Computer Science

About this book

This 15-volume set LNCS 15031-15045 constitutes the refereed proceedings of the 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024, held in Urumqi, China, during October 18–20, 2024.
The 579 full papers presented were carefully reviewed and selected from 1526 submissions. The papers cover various topics in the broad areas of pattern recognition and computer vision, including machine learning, pattern classification and cluster analysis, neural network and deep learning, low-level vision and image processing, object detection and recognition, 3D vision and reconstruction, action recognition, video analysis and understanding, document analysis and recognition, biometrics, medical image analysis, and various applications.

Table of contents

Frontmatter

Machine Learning

Frontmatter
Cluster Center Initialization for Fuzzy K-Modes Clustering Using Outlier Detection Technique

The fuzzy K-modes clustering algorithm is an extension of the fuzzy K-means clustering algorithm that can handle massive categorical data. However, the quality of the initial cluster centers (initial centers for short) may significantly affect the results of fuzzy K-modes clustering, and unsuitable initial centers often lead to poor clustering results. Therefore, the selection of initial centers, that is, cluster center initialization (CCI), is a key issue in fuzzy K-modes clustering. This paper deals with the CCI problem of fuzzy K-modes clustering from the perspective of outlier detection and proposes a cluster center initialization algorithm, CCI_DOFD, for fuzzy K-modes clustering. CCI_DOFD selects initial centers based on the distance outlier factor of each object, the density of each object, and the distances between objects. By considering the distance outlier factor, CCI_DOFD avoids selecting an outlier as an initial center. Moreover, when calculating the density of each object and the distances between objects, CCI_DOFD assigns different weights to different attributes according to the significance of each attribute, which effectively reflects the differences between attributes. Experimental results on several UCI data sets demonstrate the effectiveness of our algorithm for the CCI of fuzzy K-modes clustering.
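
The attribute-weighted distance mentioned in the abstract can be illustrated with a weighted simple-matching dissimilarity between two categorical objects. This is only a sketch: the function name and the example weights are hypothetical, and the way CCI_DOFD derives weights from attribute significance is not given in the abstract and is not reproduced here.

```python
def weighted_matching_distance(x, y, weights):
    """Weighted simple-matching dissimilarity between two categorical objects.

    Each mismatching attribute contributes its weight; matches contribute 0.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Hypothetical objects with three categorical attributes.
a = ("red", "small", "round")
b = ("red", "large", "square")
w = (0.5, 0.3, 0.2)  # assumed attribute weights, not taken from the paper
d = weighted_matching_distance(a, b, w)
```

With uniform weights this reduces to the usual count of mismatching attributes used for categorical data.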

Yuqi Sha, Junwei Du, Zhiyong Yang, Feng Jiang
Few-Shot Class-Incremental Learning via Cross-Modal Alignment with Feature Replay

Few-shot class-incremental learning (FSCIL) studies the problem of continually learning novel concepts from limited training data without catastrophically forgetting the old ones in the meantime. While most existing works are established on the premise of learning from scratch, growing efforts have been devoted to incorporating the benefits of pre-trained Vision-Language Models (VLMs) within the FSCIL solution, considering that these models have shown powerful generalization abilities in zero-shot/few-shot learning. In this paper, we propose a simple yet effective FSCIL framework that leverages the prior knowledge of the CLIP model to attack the stability-plasticity dilemma. Considering the semantic gap between the pre-training and downstream data, we first combine soft prompts with visual adaptation to effectively accommodate the prior knowledge from both branches. Then, we condition the textual prototype on each visual input to adaptively capture the instance-specific information, taking into account their intrinsic heterogeneous structures. On top of this framework, we employ a simple feature replay strategy that models each class as a Gaussian distribution to alleviate the task interference in each new session. Extensive experimental results on three benchmarks, i.e., CIFAR100, CUB200 and miniImageNet, show that our proposed method can achieve compelling FSCIL results.
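
The feature replay idea (one Gaussian per class, sampled in later sessions) can be sketched as follows. The diagonal covariance, the function names, and the toy features are assumptions made for illustration; the abstract does not specify how the Gaussians are parameterized.

```python
import random
import statistics


def fit_class_gaussian(features):
    """Per-dimension mean and standard deviation of one class's feature vectors."""
    dims = list(zip(*features))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds


def replay_features(means, stds, n, rng):
    """Draw n pseudo-features from the diagonal Gaussian N(means, diag(stds^2))."""
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(n)]


# Toy features of one old class (hypothetical values).
old_class = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
means, stds = fit_class_gaussian(old_class)
samples = replay_features(means, stds, n=5, rng=random.Random(0))
```

Only the per-class statistics need to be stored between sessions, so no raw images from old classes are kept.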

Yanan Li, Linpu He, Feng Lin, Donghui Wang
Generalizing Soft Actor-Critic Algorithms to Discrete Action Spaces

ATARI is a suite of video games used by reinforcement learning (RL) researchers to test the effectiveness of learning algorithms. Receiving only the raw pixels and the game score, the agent learns to develop sophisticated strategies, even reaching a level comparable to that of a professional human games tester. Ideally, we also want an agent that requires very few interactions with the environment. Previous competitive model-free algorithms for the task use the value-based Rainbow algorithm without any policy head. In this paper, we change this by proposing a practical discrete variant of the soft actor-critic (SAC) algorithm. The new variant enables off-policy learning using policy heads for discrete domains. By incorporating it into the advanced Rainbow variant, i.e., the “bigger, better, faster” (BBF), the resulting SAC-BBF improves the previous state-of-the-art interquartile mean (IQM) from 1.045 to 1.088, and it achieves these results using a replay ratio (RR) of only 2. With the lower RR of 2, the training time of SAC-BBF is strictly one-third of the time required for BBF to achieve an IQM of 1.045 using RR 8. As an IQM greater than one indicates super-human performance, SAC-BBF is also the only model-free algorithm with a super-human level using only RR 2. The code is publicly available on GitHub at https://github.com/lezhang-thu/bigger-better-faster-SAC .
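
One reason SAC adapts naturally to discrete actions is that the expectation over actions can be computed exactly rather than sampled. The sketch below shows the entropy-regularized (soft) state value commonly used in discrete SAC formulations; it is a generic illustration, not the SAC-BBF objective, and the function name and example numbers are assumptions.

```python
import math


def soft_state_value(q_values, probs, alpha):
    """Exact soft value V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s)).

    Actions with zero probability are skipped, since p * log(p) tends to 0.
    """
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)


q = [1.0, 3.0]    # hypothetical Q-values for two actions
pi = [0.5, 0.5]   # hypothetical policy probabilities
v = soft_state_value(q, pi, alpha=0.2)
```

Here `alpha` is the entropy temperature: setting it to 0 recovers the ordinary expected Q-value under the policy.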

Le Zhang, Yong Gu, Xin Zhao, Yanshuo Zhang, Shu Zhao, Yifei Jin, Xinxin Wu
LarvSeg: Exploring Image Classification Data for Large Vocabulary Semantic Segmentation via Category-Wise Attentive Classifier

Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large-scale mask labels is labour-intensive and time-consuming. Recently, language-guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out-of-distribution categories. In this paper, we propose a new large vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models as large-vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image-level, while for segmentation we need to predict the label at pixel level. To address this issue, we first propose a general baseline framework to incorporate image-level supervision into the training process of a pixel-level segmentation model, making the trained network perform semantic segmentation on newly introduced categories in the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category-wise attentive classifier to apply supervision to the precise regions of corresponding categories to improve the model performance. Extensive experiments demonstrate that LarvSeg significantly improves the large vocabulary semantic segmentation performance, especially in the categories without mask labels. For the first time, we provide a 21K-category semantic segmentation model with the help of ImageNet21K. The code will be released soon.

Haojun Yu, Di Dai, Ziwei Zhao, Di He, Han Hu, Liwei Wang
Exploring Out-of-Distribution Scene Text Recognition for Driving Scenes with Hybrid Test-Time Adaptation

Scene Text Recognition (STR) in dynamic driving scenes is important for recognizing real-world kilometer markers to facilitate the scheduling and operation of industrial scenes. For example, the location of a train affects the safe and reliable operation of the transportation system, and it can be effectively determined by identifying kilometer markers with STR technology. However, most existing STR models make the independent and identically distributed (i.i.d.) assumption that all training and test data are drawn from the same data distribution. Although satisfactory performance is achieved under the i.i.d. assumption, existing STR models remain notoriously weak at generalizing to out-of-distribution (o.o.d.) data, making a system unreliable and unsafe. To address this weakness, we propose a new hybrid test-time adaptation (HTTA) to improve the performance of an STR model on o.o.d. test data. Previous test-time adaptation methods target classification models and do not consider the multi-step reasoning characteristic of sequence learning tasks. In HTTA, we deploy multiple semantically-reserved image augmentations and design a semantically-consistent auxiliary task to perform continual adaptation. Additionally, we construct a new Real-world Subway Kilometer Marker (RSKM) dataset for out-of-distribution STR practice under dynamic driving scenes. We conduct extensive experiments on RSKM by embedding our HTTA into multiple classical STR methods to show its effectiveness. The experimental results show that our semantically-consistent augmentation and HTTA significantly improve the generalization performance on o.o.d. STR practice.

Xiaoyu Xian, Jinghui Qin, Yukai Shi, Daxin Tian, Liang Lin
PhaseNN: An Unsupervised and Spatial-Frequency Integrated Network for Phase Retrieval

Phase retrieval (PR) aims to recover original phase signals from intensity-only measurements, which is a typical inverse problem in computational imaging. In recent years, deep learning algorithms have demonstrated considerable potential in dealing with such problems. However, convolutional neural networks (CNNs), the most widely used models, suffer from certain inherent limitations. For example, the fixed receptive field restricts the ability of conventional CNNs to capture global dependencies, while treating the network as a black box hinders the interpretability of the model. Moreover, these methods only work with precisely labeled images built from expensive and ultra-precise sensors. Unfortunately, slight system aberrations can easily break the dependency between the images and the labels, making the training process extremely challenging. To address these issues, we propose PhaseNN, an unsupervised physics-driven wavefront phase retrieval network. PhaseNN adopts an encoder-decoder architecture, where the encoder efficiently extracts image features by fusing spatial and frequency domain features through the designed spatial-frequency block based on the property of the Fourier transform, which captures the global dependency and perceives the receptive field size of the image. The decoder generates pseudo-labels based on the physical optical imaging model, incorporating physical constraints into the training process. This study is the first attempt to explore the combination of spatial and frequency information in PR tasks. Experimental results demonstrate that PhaseNN outperforms existing CNN-based methods and achieves comparable wavefront phase reconstruction performance to supervised learning methods under the aberration-free condition. PhaseNN also exhibits superior robustness against the static and dynamic aberrations inherent to optical systems, thus showcasing exceptional performance.

Haining Hu, Jie Tan, Xiaoguang Ren, Yuchen Hua, Xin Liu
Sequential Transfer of Pose and Texture for Pose Guided Person Image Generation

Pose Guided Person Image Generation (PGPIG) aims to transform persons in source images into given target poses. Most existing methods only distort texture information towards the target pose, ignoring the impact of pose information transformation, resulting in images with unrealistic poses or texture loss. In this paper, we propose a novel generation network that sequentially conducts pose and texture transformations to enhance PGPIG performance. Initially, we prioritize pose transformation, generating features aligning with the target pose while retaining source texture details. Subsequently, we concentrate on texture transformation, ensuring consistency with the target pose. To achieve the goal of each step, we propose the Pose Factor Transformer Block (PT) for injecting target pose information and the Texture Factor Transformer Block (TT) for refining texture details in the generated person image. Extensive experiments demonstrate the efficacy of our approach across evaluation metrics such as LPIPS, PSNR, and SSIM. Furthermore, our network does not require additional parsing labels and reduces training costs significantly.

Zifan Li, Qingxuan Shi, Shuishui Cheng
Balanced Clustering with Discretely Weighted Pseudo-label

Clustering has aroused much attention in the community of data mining and image processing. However, it has the following two problems, which greatly limit its applications: 1) Existing methods usually adopt unbalanced clustering structures, and the feature information of categories with too few samples cannot be fully expressed and utilized, thus affecting the accuracy of clustering. 2) The continuous pseudo-label matrix learned from the relaxed problem based on spectral analysis deviates from reality to some extent. To solve the above problems and improve the clustering performance, this paper proposes a novel method named balanced clustering with discretely weighted pseudo-label (BC_DWP). Initially, the balanced constraint is employed for canonical clustering, which can generate balanced clusters through minimization. Then, the weighted pseudo-label matrix with discrete features is introduced to avoid the trivial solution of unsupervised least squares regression. After that, the $l_{2,p}$-norm is introduced to satisfy the row sparsity of the selection matrix with flexible p. Finally, an efficient iterative algorithm is provided to optimize the model. Experimental results on six datasets show that the proposed method can not only handle the large-scale data, but also produce good clustering performance.
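
For readers unfamiliar with the $l_{2,p}$-norm, a common convention defines it as $(\sum_i \|w^i\|_2^p)^{1/p}$, where $w^i$ is the i-th row of the matrix. The sketch below computes it under that convention; whether BC_DWP uses this form or its p-th power is not stated in the abstract, so treat the exact definition as an assumption.

```python
import math


def l2p_norm(matrix, p):
    """(sum_i ||row_i||_2 ** p) ** (1 / p) over the rows of `matrix`, for p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    row_norms = [math.sqrt(sum(v * v for v in row)) for row in matrix]
    return sum(r ** p for r in row_norms) ** (1.0 / p)


W = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]  # hypothetical selection matrix
n1 = l2p_norm(W, p=1.0)                    # row norms are 5, 0 and 10
```

Because the penalty acts on whole-row norms, shrinking it pushes entire rows toward zero, which is what makes it useful for row-sparse selection matrices.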

Zien Liang, Shuping Zhao, Zhuojie Huang, Jigang Wu
Tensor Robust Principal Component Analysis with Hankel Structure

Tensor Robust Principal Component Analysis (TRPCA) aims to recover clean tensor data corrupted with noise. This method finds significant application in recovering multidimensional data. However, the majority of existing methods rely solely on local similarity and global information, neglecting the potential benefits offered by the non-local similarity within those data. This oversight often leads to inferior recovery performance. To improve the performance of data recovery, we propose Weight Hankel-TRPCA, which integrates additional prior information. Specifically, the multidimensional data is divided into three-dimensional patches, and similar patches are partitioned to obtain their non-local similarity. The partitioned patches are then projected onto the Hankel tensor to improve their low-rank properties. The resulting problem is solved using the well-known alternating direction method of multipliers (ADMM). Extensive experimental results show that the proposed method, HK-TRPCA, outperforms several state-of-the-art methods in terms of recovery performance.
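
The Hankel structure referred to above can be seen in its simplest form on a 1-D signal: stacking sliding windows yields a matrix whose anti-diagonals are constant. The sketch below shows only this matrix case with a hypothetical helper name; the paper's projection of 3-D patches onto a Hankel tensor is more involved and is not reproduced.

```python
def hankel_matrix(signal, window):
    """Rows are sliding windows of the signal, so entry (i, j) equals signal[i + j]."""
    if not 1 <= window <= len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    rows = len(signal) - window + 1
    return [list(signal[i:i + window]) for i in range(rows)]


H = hankel_matrix([1, 2, 3, 4, 5], window=3)
```

The redundancy introduced by overlapping windows is what tends to make Hankelized data closer to low rank than the original signal.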

Chao Xu, Hao Tan, Qingrong Feng, Yue Zhang, Jianjun Wang
Self-Distillation via Intra-Class Compactness

Knowledge distillation, a popular model compression method, transfers knowledge from a large teacher model to a smaller student model. Self-distillation takes this a step further by having the model itself act as both teacher and student. However, existing self-distillation methods often focus on individual instance knowledge, such as logits and intermediate features, but overlook the structural information within each category’s representation. To address this gap, we propose Self-Distillation via Intra-Class Compactness (SDICC). Specifically, in SDICC, we use previous epoch models as teachers to guide training in the current epoch, while also emphasizing intra-class compactness as an additional training objective. This facilitates our model’s learning process in bringing intra-class features closer together, thereby promoting more discriminative representations across different categories. Moreover, to better combine both the knowledge from logits and the compactness of features, we adaptively perform self-distillation for progressive knowledge transfer. We extensively evaluate SDICC on popular image classification datasets like CIFAR-100 and Tiny ImageNet. Our results demonstrate that SDICC outperforms recent state-of-the-art self-distillation methods, showcasing its effectiveness in knowledge transfer and model compression.
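
Intra-class compactness can be quantified in several ways; one simple choice is the mean squared distance of each feature to its class centroid, sketched below. The specific loss used by SDICC is not given in the abstract, so this centroid-based form, the function name, and the toy data are assumptions.

```python
def intra_class_compactness(features, labels):
    """Mean squared Euclidean distance from each feature to its class centroid."""
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    total = 0.0
    for group in by_class.values():
        centroid = [sum(col) / len(group) for col in zip(*group)]
        for f in group:
            total += sum((a - c) ** 2 for a, c in zip(f, centroid))
    return total / len(features)


feats = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]  # hypothetical 2-D features
labels = [0, 0, 1]
loss = intra_class_compactness(feats, labels)
```

A smaller value means features of the same class sit closer together; used as an auxiliary training term, it would be added to the usual classification and distillation losses.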

Jiaye Lin, Lin Li, Baosheng Yu, Weihua Ou, Jianping Gou
An Enhanced Dual-Channel-Omni-Scale 1DCNN for Fault Diagnosis

Accurately diagnosing faults in rotating machinery is crucial for ensuring production safety and improving production efficiency and equipment reliability. With the rapid development of deep learning, many excellent bearing fault diagnosis methods have emerged. However, most of these methods focus only on local or global features, and as the number of network layers increases, overfitting and a large number of model parameters arise. In response to these issues, this paper proposes a lightweight framework for end-to-end fault diagnosis. The framework uses an Omni-Scale block with an efficient channel attention mechanism (ECA-OS-block) to capture features at different scales, and performs global adaptive weighting to focus on critical signals via a signal attention mechanism. It is then combined with a Fully Convolutional Network as a dual channel to effectively extract the details of fault signals and to reduce the overfitting and underfitting that occur when a single model processes multi-bearing fault data. Experimental results show that the proposed approach achieves excellent results on multiple fault datasets, and the standard deviation of the results from repeated training is small. This indicates that the model has good generalization and stability. Even with a limited number of training samples, key features of the data can still be captured. In addition, its anti-interference ability is stronger than that of some existing models in multi-bearing systems.

Xiaona Zheng, Qintai Hu, Chunlin Li, Shuping Zhao
Visual-Guided Reasoning Path Generation for Visual Question Answering

Neural module network (NMN) based methods have shown promising performance in visual question answering (VQA). However, existing methods have overlooked the potential existence of multiple reasoning paths for a given question. They generate one reasoning path for a question, which restricts the diversity in module combinations. Additionally, these methods generate reasoning paths solely based on questions, neglecting visual cues, which may lead to sub-optimal paths in multi-step reasoning scenarios. In this paper, we introduce the Visual-Guided Neural Module Network (V-NMN), a neuro-symbolic method that integrates visual information to enhance the model’s reasoning capabilities. Specifically, we utilize the reasoning capability of large language models (LLM) to generate all feasible reasoning paths for the questions in a few-shot manner. Then, we assess the suitability of these paths for the image and select the optimal one based on the assessment. The final answer is derived by executing the reasoning process along the selected path. We evaluate our method on the GQA dataset and CX-GQA, a test set that requires multi-step reasoning. Experimental results demonstrate its effectiveness in real-world scenarios.

Xinyu Liu, Chenchen Jing, Mingliang Zhai, Yuwei Wu, Yunde Jia
FedGC: Federated Learning on Non-IID Data via Learning from Good Clients

Federated learning (FL) is a privacy-preserving solution for deep learning with decentralized data owners. An important issue that may degrade the performance of FL is statistical heterogeneity among the data distributions of data owners (clients). That is, the data of different clients are not independent and identically distributed (non-IID), so that client local objective functions are inconsistent. To cope with this issue, we reveal that the unbiased client selection strategy is not optimal for FL on non-IID data. Motivated by this observation, we propose a new method named FedGC for addressing the data heterogeneity problem, which tends to select clients with better-performing models. With the proposed FedGC, the negative impact of inconsistent local updates on the performance of the global model is alleviated by learning the optimization directions of selected clients. In addition, all clients may learn from the selected clients in the local training phase to reduce inconsistency in client local updates and increase consistency between local models and the global one. The experimental results on several benchmarks under various non-IIDness settings show that our proposed FedGC scheme generally outperforms the state-of-the-art methods and can serve as a useful plugin for enhancing the performance of FL methods.
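
The contrast with unbiased (uniform) client selection can be made concrete with a toy round in which only the best-scoring clients are aggregated. Ranking by a validation score and plain parameter averaging are assumptions chosen for illustration; the abstract does not state FedGC's actual selection criterion or aggregation rule, and all names and numbers below are hypothetical.

```python
def select_good_clients(scores, k):
    """Return the ids of the k clients with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def average_models(models):
    """Element-wise mean of equally sized parameter vectors."""
    if not models:
        raise ValueError("at least one model is required")
    return [sum(vals) / len(models) for vals in zip(*models)]


# Hypothetical per-client validation scores and flattened model parameters.
scores = {"c1": 0.62, "c2": 0.91, "c3": 0.78}
params = {"c1": [0.0, 0.0], "c2": [1.0, 2.0], "c3": [3.0, 4.0]}
chosen = select_good_clients(scores, k=2)
global_params = average_models([params[c] for c in chosen])
```

Uniform sampling would include `c1` as often as the others; a biased rule like this one trades that fairness for updates that point in more consistent directions.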

Xu Ji, Hao-Tian Wu, Ting Cui, Yiqun Zhang, Lingling Xu
Inter-Class Correlation-Based Online Knowledge Distillation

Online knowledge distillation has emerged as a powerful approach for training student networks in real-time, exhibiting promising results in image classification tasks. However, current online knowledge distillation methods primarily focus on enhancing performance by transferring prediction and feature knowledge, neglecting the potential benefits of leveraging feature correlation. To address this gap and effectively distill valuable relational knowledge, we introduce a novel online knowledge distillation method named Inter-class Correlation-based Online Knowledge Distillation (ICOKD). Our approach establishes feature correlations among inter-class samples within each network and facilitates the mutual transfer of relational knowledge across different online networks. Additionally, to extract more informative feature relation information, we incorporate a feature enhancement module to enrich features. Moreover, we design an adaptive distillation module to guide each student network to learn at the logits level, thereby further distilling effective logits-based knowledge. Experimental results on CIFAR-100 and Tiny-ImageNet datasets validate the effectiveness of our proposed ICOKD method, demonstrating its superiority over state-of-the-art online methods.

Hongfang Zhu, Jianping Gou, Lan Du, Weihua Ou
Accelerating Domain Adaptation with Cascaded Adaptive Vision Transformer

Domain adaptation (DA) aims to transfer knowledge from labeled source domains to unlabeled target domains, addressing the challenge of model generalization when there is a distribution mismatch between training and testing data. While many Vision Transformer (ViT)-based methods have been developed for DA, they focus primarily on improving accuracy, with less emphasis on accelerating inference on unlabeled target domains. In this paper, we propose a novel method named Cascaded Adaptive Vision Transformer (CAViT), which dynamically adjusts token counts for each input image by cascading multiple transformers with increasing tokens. During testing, “easier” images exit early, while “harder” images are processed further until confident predictions are achieved. We further enhance domain adversarial learning by incorporating a token-level domain discriminator in the attention layer, which assigns distinct weights to different patch tokens. This enables the network to learn features with cross-domain transferability and discriminative capabilities, achieving effective feature alignment. Experimental results demonstrate that our method not only improves accuracy but also significantly reduces computational costs, as evidenced by results on three benchmark datasets.
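
The early-exit behaviour described above can be sketched as a loop over stages that stops once the top softmax probability reaches a threshold. The stages here are stand-in functions returning fixed logits, and the confidence rule and threshold are assumptions; in CAViT each stage would be a transformer operating on progressively more tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cascaded_predict(stages, image, threshold):
    """Run stages in order and stop at the first whose top probability >= threshold.

    The last stage always answers. Returns (predicted class, exit stage index).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    for i, stage in enumerate(stages):
        probs = softmax(stage(image))
        conf = max(probs)
        if conf >= threshold or i == len(stages) - 1:
            return probs.index(conf), i


def coarse_stage(image):
    """Hypothetical cheap stage (few tokens) that is unsure about this input."""
    return [0.1, 0.2, 0.0]


def fine_stage(image):
    """Hypothetical expensive stage (more tokens) that is confident."""
    return [0.0, 5.0, 0.0]


pred, exit_stage = cascaded_predict([coarse_stage, fine_stage], image=None, threshold=0.9)
```

Lowering the threshold lets more inputs exit at the cheap stage, which trades accuracy for inference cost.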

Qilin Jiang, Chaoran Cui, Chunyun Zhang, Yongrui Zhen, Shuai Gong, Ziyi Liu, Fan’an Meng, Hongyan Zhao
Multistage Compression Optimization Strategies for Accelerating Diffusion Models

Diffusion models have recently gained widespread popularity in the field due to their exceptional image generation capabilities. Despite their powerful functionalities, the complex structure of these models and the step-by-step denoising process often lead to high computational costs and slow generation speeds, significantly limiting their wider application. Although various methods have been developed to reduce operational overhead and speed up image generation, these methods usually involve a trade-off between acceleration and maintaining quality. In this paper, we propose a new acceleration strategy that optimizes diffusion models by compressing the model structure and the generation process. Specifically, we first compress redundant tokens in the diffusion model’s generation process to reduce computational complexity. Next, we compress and reuse feature redundancies during the progressive sampling process to minimize unnecessary computation. To enhance the sampling efficiency of diffusion models, we employ an optimal path finding scheduler to approximate the entire generation process. We validated our method on a variety of datasets, including CIFAR, ImageNet and COCO2017, and tested under DDPM, LDM and Stable Diffusion. The experimental results confirmed the effectiveness of our approach in generating high-quality images across various settings of Stable Diffusion and LDM-4. Notably, we achieved a 2-6x acceleration effect while observing only minor changes in CLIP fraction.

Weiquan Huang, Qiang Chen
Defending Adversarial Patches via Joint Region Localizing and Inpainting

Deep neural networks are successfully used in various applications but are vulnerable to adversarial examples. With the development of adversarial patches, the feasibility of attacks in physical scenes increases, and defenses against patch attacks are urgently needed. However, the technology for defending against such adversarial patch attacks still needs improvement. In this paper, we analyze the characteristics of adversarial patches and find that adversarial patches lead to appearance or contextual inconsistency in the target objects. The patch region shows abnormal changes on the high-level feature maps of the objects extracted by a backbone network. Consequently, we propose a novel defense method based on a “localizing and inpainting” mechanism to pre-process the input examples. Specifically, we design a unified framework, where the “localizing” sub-network utilizes a two-branch structure corresponding to two characteristics of patches to accurately detect the adversarial patch region in the image. The “inpainting” sub-network utilizes the surrounding contextual cues to recover the original content covered by the adversarial patch. The quality of inpainted images is also evaluated by measuring the appearance consistency and the effects of adversarial attacks. These two sub-networks are jointly trained via an iterative optimization approach, allowing the “localizing” and “inpainting” modules to closely interact and learn a better solution. Extensive experiments on traffic sign classification and detection tasks demonstrate that our method outperforms the state-of-the-art method, increasing accuracy by 37%, which verifies the effectiveness and superiority of the proposed method.

Yafu Zhang, Shiji Zhao, Xingxing Wei, Sha Wei
Multi-view Spectral Clustering Based on Topological Manifold Learning

Multi-view clustering is an unsupervised learning strategy that divides data into multiple categories based on complementary and consistent information. Graph-based multi-view clustering methods have attracted much attention due to their simplicity and efficiency. Although graph-based multi-view clustering algorithms have achieved good clustering performance, there are still some issues that need to be addressed. Firstly, existing methods fail to consider the manifold topological structure in the data, which may result in low-quality similarity graphs. Secondly, many graph-based multi-view clustering algorithms treat the construction of similarity graphs and the learning of consistent spectral embedding as two separate procedures, in which the quality of similarity graphs heavily affects the clustering performance. To overcome these problems, we propose a novel method termed Multi-view Spectral Clustering based on Topological Manifold Learning (MSCTML), where both similarity graph construction and consistent spectral embedding learning are jointly performed in a unified framework. Concretely, an affine graph is initially constructed for each view. Subsequently, considering the manifold topological structure in the data, similarity graphs for different views are generated by using the above affine graphs. Furthermore, consistent spectral embedding is learned based on the constructed similarity graphs. Finally, the clustering result is obtained by the K-means algorithm. The proposed method is tested on six benchmark datasets. Compared with single-view and state-of-the-art multi-view clustering algorithms, extensive experimental results demonstrate the superior clustering performance of the proposed MSCTML method.

Shaojun Shi, Yibing Liu, Canyu Zhang, Xueling Chen
Client Selection Mechanism for Federated Learning Based on Class Imbalance

Due to limitations in the performance of client-side systems and constraints on communication costs, most existing Federated Learning (FL) algorithms cannot involve all clients in training. Therefore, in practice, some clients are randomly selected to participate in FL training. However, the datasets held by clients often exhibit non-independent and identically distributed (Non-IID) characteristics. Randomly selecting clients can lead to training the global model on more unbalanced datasets, ultimately decreasing the global model’s performance. To effectively mitigate the impact of dataset imbalance on FL, we propose FedCCBS (Federated Client Class Balanced Sampling), a class-balanced sampling method based on grouping by the number of client classes. It aims to select clients with complementary datasets for training, thereby alleviating the adverse effects of imbalanced datasets on the global model. We conducted experiments on the MNIST, FASHION MNIST and CIFAR-10 datasets, and the experimental results demonstrated that FedCCBS achieves faster convergence and maintains a more stable convergence process. Moreover, the classification accuracy of FedCCBS surpasses that of other baseline algorithms.

Linlin Zhang, Congjie Lin, Zhangshuai Bie, Shuo Li, Xuehua Bi, Kai Zhao
A New Paradigm for Enhancing Ensemble Learning Through Parameter Diversification

Ensemble learning has emerged as a pivotal area of interest within the machine learning community, consistently delivering superior performance across various predictive tasks. From the outset, diversity in the ensemble has been a critical factor in the exceptional performance of these models. In the ensemble learning methodology, efforts to enhance ensemble performance have predominantly focused on increasing diversity at the data level. This requires large amounts of training data to avoid overfitting in neural networks. This paper proposes a novel diversity-enhanced strategy for neural network ensembles to create more diversity through Parameter Diversification (PaD). In particular, we introduce a regularization term alongside a controlling parameter into the training loss function of each constituent network. This innovation enables us to cultivate a higher degree of diversity within the ensemble while concurrently maintaining the accuracy of the individual model. Our critical insight is straightforward: to encourage diversity in an ensemble by inducing other models to deviate from the optimal model, i.e., we desire the output of each network to be diverse. We validated our approach on multiple machine learning datasets and simulation datasets. The experimental results indicate that the proposed approach effectively creates favorable diversity for the ensemble, thereby endowing it with promising generalization capabilities.

Jiaqi Jiang, Fangqing Gu, Chikai Shang
Adaptive Multi-information Feature Fusion MLP with Filter Enhancement for Sequential Recommendation

In recent years, many self-attention models have achieved good sequential recommendation performance by capturing the sequential dependencies between users and items. However, user behavior data inevitably contains noise, and the embedding of location information may interfere with item embedding semantics, further increasing the noise in the data. At the same time, these self-attention models ignore the impact of high-relevance user-item interactions on the next item. To address these problems, we propose a new sequential recommendation system (AMFRec). Specifically, we adopt a three-way information (sequence, cross-channel, cross-feature) adaptive fusion scheme enhanced by a filtering algorithm. The proposed system is based entirely on the MLP architecture, attenuates noise in the frequency domain to reduce its impact on the model, and is naturally sensitive to location information. Finally, we design a squeeze incentive module suitable for recommendation systems to activate multiple highly relevant items. Experiments were conducted on three widely used datasets to demonstrate the effectiveness and efficiency of the proposed method.

Shuangquan Li, Xingyao Yang, Hongtao Shen, Jiong Yu, Yanfu Wu
FedDCP: Personalized Federated Learning Based on Dual Classifiers and Prototypes

Federated Learning aims to enable multiple clients to jointly train high-performance deep learning models while preserving data privacy by avoiding the upload of local data. The efficiency of collaborative learning is compromised by the heterogeneity of data distributions across clients. To address this, we propose FedDCP, a generalized framework for personalized federated learning that can mitigate the negative impact of the long-tail distribution of local data. The core idea is to use prototypes to limit the drift of local feature extraction and to let personalized models assimilate global insights alongside local data adaptation through dual classifiers. Furthermore, a series of comprehensive experiments on three distinct datasets shows that FedDCP can significantly enhance the personalization capabilities of existing federated learning methods. In comparison to eight other SOTA pFL algorithms, FedDCP demonstrates notable enhancements in accuracy. The code for FedDCP is publicly available on GitHub at https://github.com/awsl0/FedDCP.

Xiangxiang Li, Yang Hua, Xiaoning Song, Wenjie Zhang, Xiao-jun Wu
AtomTool: Empowering Large Language Models with Tool Utilization Skills

In recent years, significant strides have been made in harnessing large language models (LLMs) to leverage various tools across different fields, which largely expands the application scope of LLMs. However, current research predominantly focuses on the tool exploitation skills that LLMs inherit from their training data, leading to higher costs when integrating new tools. Additionally, most studies concentrate on English models, leaving a scarcity of open-source resources for other languages. This study investigates the zero-shot generalization of LLMs in tool usage, with a focus on Chinese models. We introduce AtomTool, an open-source framework for tool acquisition in LLMs, along with a dataset of 16,000 Chinese entries. This work marks the first effort to evaluate zero-shot generalization in Chinese models and provides the first open-source framework and dataset dedicated to tool acquisition in Chinese LLMs. Our experiments show that AtomTool outperforms closed-source models such as ChatGPT in zero-shot generalization in most cases. We also propose a novel dataset construction method and evaluation framework, examining the effects of prompt design and tool quantity on model performance. Overall, our work establishes a solid foundation for advancing tool acquisition in Chinese LLMs.

Yongle Li, Zheng Zhang, Junqi Zhang, Wenbo Hu, Yongyu Wu, Richang Hong
Making the Primary Task Primary: Boosting Few-Shot Classification by Gradient-Biased Multi-task Learning

Recent works in few-shot learning (FSL) have explored the incorporation of supplementary self-supervised auxiliary tasks to facilitate inductive knowledge transfer, yielding promising outcomes. Nevertheless, these approaches only optimize the shared parameters of the FSL model by minimizing a linear combination of two or more task losses, along with manually selecting the combination coefficients. Moreover, due to the unknown and intricate relationships between different tasks, such a simplistic linear combination operation is prone to inducing task conflicts, leading to adverse knowledge transfer. To tackle these challenges, we argue that in few-shot learning (FSL) augmented with auxiliary tasks, the emphasis should be laid on enhancing the performance of the primary FSL task. Specifically, to mitigate the aforementioned task conflicts, we introduce a new Gradient-biAsed Multi-task lEarning (GAME) method, which “makes the primary task primary” by considering both gradient direction and loss magnitude. Extensive experiments demonstrate that the proposed GAME method obtains substantial performance improvements over state-of-the-art methods.

Yunchen Wu, Boyao Shi, Jing Huo, Wenbin Li, Yang Gao, Hao Liu, Yunhao Wang, Tinghao Yu
Cascade Large Language Model via In-Context Learning for Depression Detection on Chinese Social Media

Depression is a common mental illness in modern society. However, incorrect and missed diagnoses of this illness still widely exist, making timely and accurate depression detection based on artificial intelligence (AI) technology urgent. Despite recent progress in depression detection, current models often struggle to analyze large amounts of data containing ambiguous content. The emergence of large language models (LLMs) sheds light on solving this issue. This paper proposes a Cascade Large Language Model via In-Context Learning (CLLM-ICL) for detecting depression. CLLM-ICL is a cascade model that includes an in-context learning prompt for GPT-3.5-Turbo-1106 and a small-sized linear neural network, subtly combining the strengths of the small-sized model and LLM in our application. We conduct experiments on the Weibo user depression detection dataset (WU3D) to evaluate the effectiveness of our model. The results show that our cascade model achieves better accuracy and F1-score than other recently proposed models.

Tong Zheng, Yanrong Guo, Richang Hong
TRAE: Reversible Adversarial Example with Traceability

Users often upload images containing sensitive personal information to social media platforms. Unfortunately, these images are susceptible to misuse by malicious entities, particularly for training deep neural networks. To protect privacy, current efforts are concentrated on utilizing reversible adversarial examples (RAE) to confuse and disrupt the training of deep neural networks, allowing authorized users to recover the original images. However, existing RAE methods lack research on traceability, leaving protected images vulnerable to attacks, especially when redistributed. To bolster image protection, we propose a secure solution called TRAE. TRAE provides a unified framework for both image traceability and reversible adversarial protection. It is built upon two sets of encoder-decoder pairs: two encoders for generating perturbations and embedding watermarks, and two decoders for extracting watermarks and restoring images. Experimental results demonstrate that images generated by TRAE exhibit favorable visual quality, strong attack capabilities, and efficient restoration capabilities across various datasets. Furthermore, the extracted watermark information maintains a high level of integrity.

Zhuo Tian, Xiaoyi Zhou, Fan Xing, Wentao Hao, Ruiyang Zhao
A Two-Stage Active Domain Adaptation Framework for Vehicle Re-Identification

Vehicle Re-Identification (Re-ID) plays a pivotal role in intelligent transportation, where the Domain Adaptation (DA) technique can deal well with the performance gap between the observable source domain and the unseen target domain. Traditional DA assumes that identities in the target domain are unavailable, ignoring the effective identity-related semantic information. However, it is feasible and acceptable to annotate a moderate amount of target data within a certain annotation budget. To address this issue, we propose a novel Two-Stage Active Learning (TSAL) framework to query the identity annotations for the most informative target samples, which could maximize model performance with a limited annotation budget. TSAL contains two important sequential stages: (1) the randomness-enhanced sample-level stage aims to improve the sampling randomness for maximizing data diversity, which is achieved by the sequential combination of the fixed-interval and random sampling strategies; (2) at the identity-focused feature-level stage, a novel Identity-Focus Score (IFS) is utilized to emphasize identity-related features for modeling uncertainty and diversity. Extensive experiments across various vehicle Re-ID datasets indicate that our method can achieve state-of-the-art (SOTA) performance by annotating only 10% of the target data, significantly outperforming existing baselines.

Linzhi Shang, Dawei Zhao, Yiming Nie, Kunlong Zhao, Liang Xiao, Bin Dai
FBR-FL: Fair and Byzantine-Robust Federated Learning via SPD Manifold

Federated learning (FL) enables collaborative learning among multiple clients to obtain an optimal global model. Because the FL server limits the number of clients per training round, client selection has emerged as a critical research issue. Existing client selection strategies primarily concentrate on either ensuring fairness or optimizing performance in attacker-free FL systems, ignoring the disruption caused by Byzantine attacks. In this work, we propose FBR-FL, a fair client selection scheme that tolerates Byzantine attacks. To catch abnormal geometry among the models of FL clients under attacks, FBR-FL projects local model updates into manifold space and employs geodesic distance to assess similarity on Riemannian geometry. Moreover, to achieve fairness under attacks, client selection is framed as an improved Lyapunov optimization problem with penalty rules, so that FL clients’ selection probabilities can be dynamically adjusted based on their reputations and contributions. Our extensive experiments demonstrate that FBR-FL ensures fair selection of clients under various attacks while maintaining accuracy comparable to FedAvg. In the unreliable scenario containing attackers, FBR-FL achieves an 18.59% higher Jain’s Fairness Index (JFI) than the state-of-the-art client selection scheme. Our code and supplementary material are available at https://github.com/DataMining-Lab/FBR-FL.git.

Tao Zhang, Haoshuo Li, Teng Liu, Anxiao Song, Yulong Shen
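
The Jain’s Fairness Index reported in the FBR-FL abstract has a standard closed form, J = (Σx)² / (n · Σx²), which equals 1 when all clients are treated equally and approaches 1/n when a single client dominates. The following is a minimal sketch of that formula only; the function name and the per-client selection counts are hypothetical and are not taken from the paper or its repository.

```python
def jain_fairness_index(allocations):
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2)."""
    values = list(allocations)
    if not values:
        raise ValueError("allocations must be non-empty")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        # All-zero allocation: every client is treated identically.
        return 1.0
    return (total * total) / (len(values) * squares)


# Hypothetical selection counts for four clients (illustrative only).
even = jain_fairness_index([5, 5, 5, 5])     # every client selected equally often
skewed = jain_fairness_index([20, 0, 0, 0])  # one client always selected
```

With four clients, the index ranges from 0.25 (one client receives everything) to 1.0 (perfectly even selection).
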
SecBFL-IoV: A Secure Blockchain-Enabled Federated Learning Framework for Resilience Against Poisoning Attacks in Internet of Vehicles

Federated Learning (FL) is a decentralized machine learning paradigm that facilitates collaboration among multiple clients by harnessing local computational power and model transmission. FL encounters challenges, particularly concerning data leakage due to the lack of robust privacy-preserving mechanisms during storage, transfer, and sharing processes. This poses significant risks to both data owners and suppliers. In addition, existing FL systems often lack robust defense mechanisms to detect and mitigate poisoning attacks. This paper aims to address this gap by proposing a novel defense strategy that can effectively identify and filter out malicious model updates before they are aggregated into the global model. By providing secure data-sharing platforms, Blockchain enhances the integrity and privacy of federated learning models, thus safeguarding sensitive information within the Internet of Vehicles (IoV) ecosystem. Furthermore, the integration of Homomorphic Encryption (HE) ensures end-to-end encryption of data and model updates, thereby strengthening the security and confidentiality of FL systems. Our approach achieves an Overall Accuracy of 97.20% and a Source-class Accuracy of 95.10% on the MNIST dataset, with a low Attack Success Rate of 0.46%. On the CIFAR-10 dataset, our method achieves an Overall Accuracy of 75.29% and a Source-class Accuracy of 59.30%, with an Attack Success Rate of 10.22%. These results demonstrate the effectiveness of our approach in countering poisoning attacks.

Irshad Ulllah, Xiaoheng Deng, Xinjun Pei, Husnain Mushtaq

Pattern Classification and Cluster Analysis

Frontmatter
Adapt and Refine: A Few-Shot Class-Incremental Learner via Pre-Trained Models

The intricate and ever-changing nature of the real world imposes greater demands on neural networks, necessitating the rapid assimilation of fleeting new concepts as they arise. Consequently, a novel learning paradigm has emerged, namely, few-shot class-incremental learning (FSCIL), which aims to continuously update knowledge of novel categories with insufficient instances while avoiding catastrophic forgetting of previous knowledge. However, recent FSCIL methods encounter significant performance limitations due to the low-quality latent representation spaces obtained from the base session. To this end, this paper introduces a novel FSCIL method, Adapt and REfine (ARE). Specifically, ARE initially strengthens the latent space through the powerful representational capabilities of pre-trained models (PTMs). Subsequently, we further adapt and refine the feature space and prototypes to improve FSCIL performance. Extensive experiments on benchmarks such as CIFAR100, mini-ImageNet, and CUB200 validate the effectiveness of the proposed method.

Sunyuan Qiang, Zhu Xiong, Yanyan Liang, Jun Wan, Du Zhang
Learning Fully Parametric Subspace Clustering

Subspace clustering is a challenging problem in large-scale datasets due to the high computational cost associated with the coding and spectral decomposition. To address this challenge, we introduce a novel approach, fully parametric subspace clustering (FPSC), that transforms subspace clustering into learning a neural-network-based classifier. Specifically, FPSC consists of three sequential networks: 1) a neural base-expressive model for learning low-dimensional coefficients, 2) a neural network for approximating the eigenvectors of the affinity matrix of the coefficients, and 3) a neural classifier that is trained with pseudo-labels generated by clustering the eigenvectors. Furthermore, we provide a theoretical analysis that establishes an upper bound on the reconstruction error. Experimental results demonstrate that our method significantly outperforms state-of-the-art subspace clustering methods.

Xuanrong Chen, Jianjun Qian, Shuo Chen, Guangyu Li, Jian Yang, Jun Li
A Comprehensive Exploration on Detecting Fake Images Generated by Stable Diffusion

Diffusion models, particularly Stable Diffusion Models (SDMs), have recently emerged as a focal point within the generative artificial intelligence sector, acclaimed for their superior visual fidelity and versatility. Despite their rising prominence, the challenge of detecting SDM-generated images has been somewhat overlooked, sparking concerns over their potential misuse for nefarious purposes. This paper aims to delve into the complexities of differentiating authentic images from those generated by SDMs, offering three significant contributions to the field. Firstly, we introduce a varied synthetic image dataset named SDM-Fakes, which consists of six subsets utilizing txt2img, img2img, and inpainting techniques. Secondly, we develop both CNN-based and Transformer-based detection models to identify artificial images, assessing a range of cutting-edge models. Thirdly, we pioneer the evaluation of these detection models’ generalization capabilities across different schemes. We also explore the impact of unknown perturbations on those detectors. Through comprehensive testing, we demonstrate that while current models are adept at recognizing SDM-generated images, there is a significant need to enhance their ability to generalize across schemes, as well as their robustness to unknown perturbations.

Jingyi Chen, Xiaolong Wang, Zhijian He, Xiaojiang Peng
Adaptive Margin Global Classifier for Exemplar-Free Class-Incremental Learning

Exemplar-free class-incremental learning (EFCIL) presents a significant challenge as the old class samples are absent for new task learning. Due to the severe imbalance between old and new class samples, the learned classifiers can be easily biased toward the new ones. Moreover, continually updating the feature extractor under EFCIL can compromise the discriminative power of old class features, e.g., leading to less compact and more overlapping distributions across classes. Existing methods mainly focus on handling biased classifier learning. In this work, both cases are considered using the proposed method. Specifically, we first introduce a Distribution-Based Global Classifier (DBGC) to avoid bias factors in existing methods, such as data imbalance and sampling. More importantly, the compromised distributions of old classes are simulated via a simple operation, variance enlarging (VE). Incorporating VE based on DBGC results in a novel classification loss for EFCIL. This loss is proven equivalent to an Adaptive Margin Softmax Cross Entropy (AMarX). The proposed method is thus called Adaptive Margin Global Classifier (AMGC). AMGC is simple yet effective. Extensive experiments show that AMGC achieves superior image classification results on its own under a challenging EFCIL setting.

Zhongren Yao, Xiaobin Chang
SACTGAN-EE Imbalanced Data Processing Method for Credit Default Prediction

Credit business is one of the primary operations in banking, and the control of credit default risk is critically important. A major challenge in the analysis of existing credit data is the highly imbalanced distribution between default and non-default cases. This imbalance can lead to biases in risk assessment. To address this issue, this study proposes a mixed sampling method that combines a self-attention CTGAN with EasyEnsemble (SACTGAN-EE) to handle data imbalance. This method boosts the data capturing capability of CTGAN through self-attention, enabling CTGAN to generate synthetic samples that are closer to the actual data distribution and significantly improving sample diversity and authenticity. Additionally, EasyEnsemble integrates multiple data subsets to effectively balance majority and minority classes, thus reducing the bias caused by class imbalance. The final experimental results show that this integrated approach significantly outperforms traditional methods in dealing with the imbalance of credit data. It not only generates high-quality balanced datasets but also improves the model’s ability to recognize minority classes, effectively reducing overfitting risk.

Shuxian Liu, Guoqiang Wang, Zhida Liu
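
The balancing step that the SACTGAN-EE abstract attributes to EasyEnsemble can be illustrated in a few lines. EasyEnsemble is generally described as drawing several random subsets of the majority class, each the size of the minority class, and pairing each with all minority samples so that one classifier can be trained per balanced subset. The sketch below covers only that subset construction; the function name, the seed, and the toy records are assumptions for illustration and do not come from the paper.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build EasyEnsemble-style balanced training subsets.

    Each subset holds every minority sample plus an equally sized
    random draw (without replacement) from the majority class.
    """
    majority = list(majority)
    minority = list(minority)
    if len(minority) > len(majority):
        raise ValueError("minority class must not be larger than majority class")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))
        subsets.append(sampled + minority)
    return subsets


# Toy records: 20 non-default cases and 4 default cases (illustrative only).
non_default = [("non_default", i) for i in range(20)]
default = [("default", i) for i in range(4)]
subsets = balanced_subsets(non_default, default, n_subsets=3)
```

Each of the three subsets contains four non-default and four default records; in a full pipeline a separate classifier would be trained on each subset and their predictions combined.
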
FedHC: Learning Imbalanced Clusters via Federated Hierarchical Clustering

Federated learning, which avoids privacy leakage while enabling co-learning among clients, has been widely studied in recent years. Most existing federated clustering methods focus on extending the learning strategy of k-means to a federated scenario. This unavoidably makes the clustering inherit the limitations of k-means, i.e., each learning client requires the number of true clusters k to be given in advance, which is not always the case in reality. Moreover, the “uniform effect” that prevents the existing clustering methods from effectively partitioning imbalanced clusters also remains unsolved for federated clustering. It is worth noting that a too-small k also contributes to the “uniform effect”, as the granularity searched by the clustering algorithm is fixed to be relatively large. To solve the above problems, we propose the Federated Hierarchical Clustering (FedHC) method to explore subclusters at each client and then merge them at the server to form a sought number of clusters. Such a process simultaneously protects privacy and aggregates the cluster distribution information that is finely learned by each client without requiring a pre-set “true” local k. Since we do not force each client to search for the same and small number of clusters, they can provide rich micro-cluster distribution information to the server, and the server hierarchically merges closely distributed subclusters to avoid the “uniform effect”. It turns out that FedHC can effectively explore imbalanced clusters without passing the privacy information. Several experiments have been conducted to illustrate the efficacy of FedHC.

Yue Zhang, Xinfa Liao, Qingsheng Chen, Haotian Wu, Yiqun Zhang
Enhancing Time Series Classification with Explainable Time-Frequency Features Representation

Time series classification is vital across many fields, yet complex data make precise classification challenging. Deep learning models have advanced, but interpretable models still dominate practical applications. Using time series Shapelet features reduces data and improves model interpretability, but extraction requires significant computing power and can lose some time series information. To address this, we apply an improved genetic algorithm for efficient Shapelet extraction. Additionally, we introduce the discrete Fourier transform for frequency domain feature description, aiming to capture periodic patterns losslessly. We propose a time series classification enhancement method combining Shapelet features and frequency domain characterization (E-STAR), providing a fast way to obtain Shapelets and enhancing recognition of global time series information. E-STAR outperforms four Shapelet-based algorithms on 32 UCR datasets, improving accuracy and interpretability.

Tao Ding, Wenjun Zhou, Bo Peng
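
To make the frequency-domain description in the E-STAR abstract concrete, the sketch below computes a plain discrete Fourier transform of a series and reads off the dominant periodic component. It is a textbook DFT written for clarity, not the feature pipeline used by E-STAR; the function names and the synthetic sine signal are assumptions for illustration.

```python
import cmath
import math


def dft_magnitudes(series):
    """Normalized magnitudes of DFT bins 0..n//2 for a real-valued series."""
    n = len(series)
    if n < 2:
        raise ValueError("series must contain at least two points")
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(series)
        )
        magnitudes.append(abs(coeff) / n)
    return magnitudes


def dominant_frequency_bin(series):
    """Index of the strongest non-constant frequency component."""
    magnitudes = dft_magnitudes(series)
    # Bin 0 is the series mean, so it is skipped.
    return max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])


# Synthetic series: three full sine cycles over 32 samples (illustrative only).
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
peak = dominant_frequency_bin(signal)
```

For this synthetic signal the strongest bin is the one matching its three cycles, which is the kind of global periodic pattern that local Shapelet features alone may miss.
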
Adaptive Unified Framework with Global Anchor Graph for Large-Scale Multi-view Clustering

Multi-view clustering faces serious challenges in reducing computational and memory demands for large-scale datasets while effectively extracting structural information from multi-view data. Most existing methods address algorithmic complexity by introducing anchors, typically through a two-stage process involving anchor sampling and subsequent bipartite graph construction. However, the quality of anchor selection directly affects the quality of the bipartite graph, and this two-stage mechanism lacks mutual optimization, thereby negatively impacting clustering performance. To address these issues, we propose the Adaptive Unified Framework with Global Anchor Graph for Large-scale Multi-view Clustering (AUF-LMC). Unlike the traditional sample-based anchor selection mechanism, AUF-LMC adaptively learns the underlying anchors across multiple views and builds a global bipartite graph on this basis, so that the two processes are linked and optimize each other, improving clustering performance. Furthermore, we unify all processes within a single framework and apply appropriate constraints to the bipartite graph. Experimental evaluations demonstrate that our method delivers superior clustering performance and efficiency, characterized by fast convergence and robustness on standard datasets.

Lin Shi, Wangjie Chen, Yi Liu, Lihua Zhuang, Guangqi Jiang
SLRL: Structured Latent Representation Learning for Multi-view Clustering

In recent years, Multi-View Clustering (MVC) has attracted increasing attention for its potential to reduce the annotation burden associated with large datasets. The aim of MVC is to exploit the inherent consistency and complementarity among different views, thereby integrating information from multiple perspectives to improve clustering outcomes. Despite extensive research in MVC, most existing methods focus predominantly on harnessing complementary information across views to enhance clustering effectiveness, often neglecting the structural information among samples, which is crucial for exploring sample correlations. To address this gap, we introduce a novel framework, termed Structured Latent Representation Learning based Multi-View Clustering method (SLRL). SLRL leverages both the complementary and structural information. Initially, it learns a common latent representation for all views. Subsequently, to exploit the structural information among samples, a k-nearest neighbor graph is constructed from this common latent representation. This graph facilitates enhanced sample interaction through graph learning techniques, leading to a structured latent representation optimized for clustering. Extensive experiments demonstrate that SLRL not only competes well with existing methods but also sets new benchmarks in various multi-view datasets.

Zhangci Xiong, Meng Cao
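
The k-nearest neighbor graph mentioned in the SLRL abstract can be sketched as follows. This is a brute-force construction over Euclidean distance, assuming the common latent representation is already available as a list of vectors; the function name, the choice of distance, and the toy latent points are assumptions and do not reflect the authors’ implementation.

```python
import math


def knn_graph(points, k):
    """Map each sample index to the indices of its k nearest neighbors."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    graph = {}
    for i, p in enumerate(points):
        # Distances from sample i to every other sample, nearest first.
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in distances[:k]]
    return graph


# Toy 2-D latent vectors forming two well-separated groups (illustrative only).
latent = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(latent, k=1)
```

Edges in this graph stay within each group of nearby points, which is the sample-level structure that a graph learning step can then exploit. The brute-force search costs O(n²) distance computations, so a large dataset would call for an approximate neighbor search instead.
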
Backmatter
Metadata
Title
Pattern Recognition and Computer Vision
Edited by
Zhouchen Lin
Ming-Ming Cheng
Ran He
Kurban Ubul
Wushouer Silamu
Hongbin Zha
Jie Zhou
Cheng-Lin Liu
Copyright year
2025
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9784-87-5
Print ISBN
978-981-9784-86-8
DOI
https://doi.org/10.1007/978-981-97-8487-5