2023 | Book

Artificial Neural Networks and Machine Learning – ICANN 2023

32nd International Conference on Artificial Neural Networks, Heraklion, Crete, Greece, September 26–29, 2023, Proceedings, Part II

Edited by: Lazaros Iliadis, Antonios Papaleonidas, Plamen Angelov, Chrisina Jayne

Publisher: Springer Nature Switzerland

Book Series: Lecture Notes in Computer Science

About this Book

The 10-volume set LNCS 14254-14263 constitutes the proceedings of the 32nd International Conference on Artificial Neural Networks and Machine Learning, ICANN 2023, which took place in Heraklion, Crete, Greece, during September 26–29, 2023.

The 426 full papers, 9 short papers and 9 abstract papers included in these proceedings were carefully reviewed and selected from 947 submissions. ICANN is a dual-track conference, featuring tracks in brain-inspired computing on the one hand, and machine learning on the other, with strong cross-disciplinary interactions and applications.

Table of Contents

Frontmatter
A Data Augmentation Based ViT for Fine-Grained Visual Classification

Fine-grained visual classification (FGVC) is a fundamental and longstanding problem aiming to recognize objects belonging to different subclasses accurately. Unfortunately, since categories are often easily confused, this task is genuinely challenging. Most previous methods solve this problem in two main ways: adding more annotations or constructing more complex structures. These approaches, however, require expensive labels or sophisticated designs. To alleviate these constraints, in this work we propose a simple but efficient method called DA-ViT, which uses only data augmentations to supervise the model. Specifically, we adopt a vision transformer as the backbone. Then, we introduce highly interpretable visual heatmaps to guide the targeted data augmentations, and three methods (local area enlargement, flipping, and cutout) are created based on the high-response areas. Furthermore, the margins among confusing classes can be increased by simply using label smoothing. Extensive experiments conducted on three popular fine-grained benchmarks demonstrate that we achieve SOTA performance. Meanwhile, our method incurs a lower computational burden during inference.
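
As a rough illustration of the heatmap-guided augmentation idea described above, the sketch below applies a cutout to the image region with the strongest heatmap response. The function name, the patch size, and the choice of cutout over enlargement or flipping are assumptions for illustration, not the authors' exact procedure.

import numpy as np

def heatmap_guided_cutout(image, heatmap, patch=32):
    # Zero out the image patch centered on the strongest heatmap response.
    h, w = heatmap.shape
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Scale heatmap coordinates to image resolution.
    cy = int(cy * image.shape[0] / h)
    cx = int(cx * image.shape[1] / w)
    y0, y1 = max(cy - patch // 2, 0), min(cy + patch // 2, image.shape[0])
    x0, x1 = max(cx - patch // 2, 0), min(cx + patch // 2, image.shape[1])
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out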

Shuozhi Yuan, Wenming Guo, Fang Han
A Detail Geometry Learning Network for High-Fidelity Face Reconstruction

In this paper, we propose a Detail Geometry Learning Network (DGLN) approach to investigate the problem of self-supervised high-fidelity face reconstruction from monocular images. Unlike existing methods that rely on detail generators to produce “pseudo-details”, in which most of the reconstructed detail geometries are inconsistent with real faces, our DGLN ensures face personalization and also correctly learns more local face details. Specifically, our method includes two stages: the personalization stage and the detailization stage. In the personalization stage, we design a multi-perception interaction module (MPIM) to adaptively calibrate the weighted responses by interacting with information from different receptive fields to extract distinguishable and reliable features. To further enhance the geometric detail information, in the detailization stage, we develop a multi-resolution refinement network module (MrNet) to estimate the refined displacement map with features from different layers and different domains (i.e., coarse displacement images and RGB images). Finally, we design a novel normal smoothing loss that improves the realism of the reconstructed details. Extensive experiments demonstrate the superiority of our method over previous work.

Kehua Ma, Xitie Zhang, Suping Wu, Boyang Zhang, Leyang Yang, Zhixiang Yuan
A Lightweight Multi-Scale Large Kernel Attention Hierarchical Network for Single Image Deraining

Many current single image deraining methods focus on increasing network depth and width, which results in higher computational overhead and more parameters. To address this issue, we propose a lightweight multi-scale large kernel attention hierarchical network (LMANet). Our approach combines multi-scale processing and Large Kernel Attention (LKA) to create Multi-Scale Large Kernel Attention (MSLKA), where large kernel decomposition can effectively decouple large kernel convolution operations to capture long-term dependencies at different granularity levels in a lightweight manner. We then use the Channel Attention Feed-Forward Network (CAFN), which employs channel attention and gating mechanisms to correct channel information and reduce redundant features, thus further reducing the network size. Finally, we enhance the inter-layer information learning of the feature maps by adding top-down Self Attention Distillation (SAD) to the network training process, thus accelerating training and reducing the network parameters by 38%. Experimental results on benchmark datasets show that LMANet achieves satisfactory performance while significantly reducing the number of parameters.
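
For context, here is a minimal PyTorch sketch of the large kernel decomposition that LKA builds on: a depth-wise convolution, a depth-wise dilated convolution, and a 1x1 convolution whose output modulates the input. This follows the LKA design from prior work; the paper's multi-scale variant (MSLKA) and the specific kernel sizes below are assumptions for illustration.

import torch
import torch.nn as nn

class LKA(nn.Module):
    # Decomposed large kernel attention: depth-wise conv + depth-wise
    # dilated conv + point-wise conv, used as a multiplicative attention map.
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn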

Xin Wang, Chen Lyu
A Multi-scale Method for Cell Segmentation in Fluorescence Microscopy Images

Accurate segmentation of cells in fluorescent microscopy images plays a key role in high-throughput applications such as the quantification of protein expression and the study of cell function. Existing cell segmentation methods have drawbacks in terms of inaccurate location of segmentation boundary, misidentification, and inaccurate segmentation of overlapping cells. To address these issues, a novel multi-scale method for cell segmentation in fluorescence microscopy images (MMCS) is proposed in this paper. Our motivation to adopt multi-scale image analysis in the cell segmentation task originates from the basic observation that cells on fluorescence microscope images are often composed of different structures at different scales. In our proposed MMCS, three scales are exploited. At the high scale, noise effects are sufficiently suppressed, and the cell contour is fully smoothed. Then, scale fusion is further performed, that is, the cell contours obtained by segmentation at high, medium, and low scales are averaged, to improve the location accuracy of contour segmentation. To solve the problems of misidentification and cell overlapping, an improved Bradley technique with constraints based on shape and intensity features and region-based fitting of overlapping ellipses technique are also developed and embedded in our multi-scale approach for extracting cell contours at each single scale. The experimental results obtained on a large number of fluorescence microscope images from two data sets show that the proposed MMCS can outperform state-of-the-art methods by a large margin.

Yating Fang, Baojiang Zhong
Adaptive Interaction-Based Multi-view 3D Object Reconstruction

This paper introduces an end-to-end deep learning framework for multi-view 3D object reconstruction. The algorithm constructs an adaptive multi-view combination module in the 2D encoding stage by calculating the feature correlation of pixel points across views, allowing each view to contain the feature information of the other views. It addresses the issue of inconsistent object reconstruction resulting from input images being presented in different orders. Additionally, a voxel refinement loss is employed to produce a comprehensive 3D voxel volume, and an adaptive voxel discrimination module is established for 3D voxel calibration. This reduces the production of superfluous voxels and enhances the completeness of the reconstruction. Extensive validation using the ShapeNet synthetic dataset and the Pix3D real-world dataset demonstrates that the proposed algorithm outperforms existing methods.

Jun Miao, Yilin Zheng, Jie Yan, Lei Li, Jun Chu
An Auxiliary Modality Based Text-Image Matching Methodology for Fake News Detection

Owing to the national network “clearing up” action, it has become increasingly important to detect false information with deep learning technology. As social networks gradually exhibit multimodal properties, many scholars have turned to multimodal fake news detection. However, current multimodal work mainly focuses on fusion modeling between texts and images, while the study of their consistency is still in its infancy. This paper concentrates on how to extract effective features from texts and images and how to match modalities more precisely, and subsequently proposes a novel fake news detection method. Specifically, BERT, VGG, and Optical Character Recognition (OCR) models are adopted to capture the textual features, the visual features, and the corresponding content embedded in the attached image, respectively. The overall model framework consists of four components: one fusion module and three matching modules, where the former joins text and image features and the latter three compute the corresponding similarities among the textual, visual, and auxiliary modalities. By aligning them with different weights and connecting them to a classifier, the model determines whether the news is fake or real. Comparative experiments demonstrate the effectiveness of our model, which reaches 88.1% accuracy on the Chinese Weibo dataset and 91.7% accuracy on the English Twitter dataset.

Ying Guo, Bingxin Li, Hong Ge, Chong Di
An Improved Lightweight YOLOv5 for Remote Sensing Images

Achieving real-time accurate detection in remote sensing images, which exhibit features such as high resolution, small targets, and complex backgrounds, remains challenging due to the substantial computational demands of existing object detection models. In this paper, we propose an improved remote sensing image small object detection method based on YOLOv5. In order to preserve high-resolution features, we remove the Focus module from the YOLOv5 network structure and introduce RepGhostNet as a feature extraction network to enhance both accuracy and speed. We adopt the BiFormer prediction head for more flexible computational allocation and content perception, and employ the Normalized Wasserstein Distance (NWD) metric to alleviate IoU’s sensitivity to small objects. Experimental results show that our proposed method achieves mAP scores of 75.54% and 75.65% on the publicly available VEDAI and DIOR remote sensing image datasets, respectively, with significantly fewer parameters and FLOPs. Our approach effectively balances accuracy and speed compared to other models.

Shihao Hou, Linwei Fan, Fan Zhang, Bingchen Liu
An Improved YOLOv5 with Structural Reparameterization for Surface Defect Detection

Surface defects produced by the manufacturing process directly degrade the quality of industrial materials such as hot-rolled steel. However, existing methods for detecting surface defects cannot meet the requirements in terms of speed and accuracy. Based on structural re-parameterization, the coordinate attention (CA) mechanism, and an additional detection head, we propose an improved YOLOv5 model for detecting surface defects of steel plates. Firstly, using the structural re-parameterization technique in RepVGGBlock, the multi-branch structure of the training backbone network is converted into a single-branch structure for inference. This allows the network to speed up inference while maintaining detection accuracy. Secondly, CA is integrated into the detection head to further improve detection accuracy. Finally, a detection head layer is added at the end of the network to focus on detecting small targets. Experimental results on the Northeastern University (NEU) surface defect database show that our model is superior to state-of-the-art detectors, such as the original YOLOv5 and Fast-RCNN, in both accuracy and speed.

Yixuan Han, Liying Zheng
ASP Loss: Adaptive Sample-Level Prioritizing Loss for Mass Segmentation on Whole Mammography Images

Alarming statistics on the mortality rate for breast cancer are a clear indicator of the significance of computer vision tasks related to cancer identification. In this study, we focus on mass segmentation, which is a crucial task for cancer identification as it preserves critical properties of the mass, such as shape and size, vital for identification tasks. While achieving promising results, existing approaches are mostly hindered by pixel class imbalance and various mass sizes that are inherent properties of masses in mammography images. We propose to alleviate this limitation on segmentation methods via a novel modification of the common hybrid loss, which is a weighted sum of the cross entropy and dice loss. The proposed loss, termed Adaptive Sample-Level Prioritizing (ASP) loss, leverages the higher-level information presented in the segmentation mask for customizing the loss for every sample, to prioritize the contribution of each loss term accordingly. As one of the variations of U-Net, AU-Net is selected as the baseline approach for the evaluation of the proposed loss. The ASP loss could be integrated with other existing mass segmentation approaches to enhance their performance by providing them with the ability to address the problems associated with the pixel class imbalance and diverse mass sizes specific to the domain of breast mass segmentation. We tested our method on two publicly available datasets, INbreast and CBIS-DDSM. The results of our experiments show a significant boost in the performance of the baseline method while outperforming state-of-the-art mass segmentation methods.
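
To make the “weighted sum of the cross entropy and dice loss” concrete, here is a per-sample weighted hybrid loss in PyTorch. The rule that derives each sample's weight from its mask's foreground ratio is a hypothetical stand-in for the paper's ASP prioritization, and all names are illustrative.

import torch
import torch.nn.functional as F

def hybrid_loss_per_sample(logits, masks, eps=1e-6):
    # logits, masks: (B, 1, H, W); masks are binary ground truth as floats.
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(
        logits, masks, reduction="none").mean(dim=(1, 2, 3))          # (B,)
    inter = (probs * masks).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + masks.sum(dim=(1, 2, 3))
    dice = 1.0 - (2 * inter + eps) / (union + eps)                    # (B,)
    # Hypothetical sample-level prioritization: small masses (low
    # foreground ratio) lean more on the Dice term.
    w = (1.0 - masks.mean(dim=(1, 2, 3))).clamp(0.0, 1.0)
    return ((1.0 - w) * bce + w * dice).mean()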

Parvaneh Aliniya, Mircea Nicolescu, Monica Nicolescu, George Bebis
Cascaded Network-Based Single-View Bird 3D Reconstruction

Existing single-view bird 3D reconstruction methods mostly cannot recover local geometry such as feet and wing tips well, and the resulting 3D models often look implausible when viewed from a new perspective. We thus propose a new method that requires only images and their silhouettes to accurately predict the shape of birds, as well as to obtain a reasonable appearance from new perspectives. The key to the method lies in the introduction of a cascaded structure in the shape reconstruction network. This allows for the gradual generation of the 3D shape of birds from coarse to fine, enabling better capturing of local geometric features. Meanwhile, we recover the texture, lighting and camera pose with attention-enhanced encoders. To further improve the plausibility of the reconstructed 3D bird in novel views, we introduce a Multi-view Cycle Consistency loss to train the proposed method. We compare our method with state-of-the-art methods and demonstrate its superiority both qualitatively and quantitatively.

Pei Su, Qijun Zhao, Fan Pan, Fei Gao
CLASPPNet: A Cross-Layer Multi-class Lane Semantic Segmentation Model Fused with Lane Detection Module

Multi-class lane semantic segmentation is a crucial technology in traffic violation detection systems. However, existing models for multi-class lane semantic segmentation suffer from low segmentation accuracy for special lanes (e.g., ramps, emergency lanes) and lane lines. To address this problem, we propose a cross-layer multi-class lane semantic segmentation model called CLASPPNet (Cross-Layer Atrous Spatial Pyramid Pooling Network) fused with a lane detection module. We first design a Cross-Layer Atrous Spatial Pyramid Pooling (CLASPP) structure to integrate the deep and shallow features in the image and enhance the integrity of the lane segmentation. Additionally, we integrate the lane detection module into the cross-layer structure during training, which improves the model’s ability to extract lane line features. We evaluate CLASPPNet on an expressway dataset based on aerial views, and the experimental results show that our model significantly improves the segmentation performance for special lanes and lane lines. It achieves the highest mIoU (mean Intersection over Union) of 86.4% while having 28.9M parameters.

Chao Huang, Zhiguang Wang, Yongnian Fan, Kai Liu, Qiang Lu
Classification-Based and Lightweight Networks for Fast Image Super Resolution

Lightweight image super-resolution (SR) networks are of great significance for practical applications. Presently, there are several deep-learning-based SR methods with excellent performance, but their memory and computation costs hinder practical applications. In this paper, we propose a down-up sampling continuous mutual affine super-resolution network (DUSCMAnet) to solve the above problems. Moreover, we propose a classification-based SR algorithm based on image statistical features (TSClassSR-DUSCMAnet) for accelerating SR networks on large images (2K–8K). The proposed algorithm first decomposes a large image into small sub-images, then uses a Class-Module to classify the sub-images into different classes according to the difficulty of reconstruction, and then uses an SR-Module to perform SR for the different classes. The Class-Module is composed of a support vector machine (SVM) based on image statistical features, and the SR-Module is composed of our proposed DUSCMAnet, a lightweight SR network. After classification, a majority of the sub-images pass through lighter networks, so the computational cost can be significantly reduced. Experiments show that our DUSCMAnet is superior to existing lightweight SR models in terms of time performance and also has competitive SR performance. Our TSClassSR-DUSCMAnet can help DUSCMAnet save up to 63% of FLOPs on DIV8K datasets.
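
A minimal sketch of the classify-then-route idea described above: hand-crafted statistics feed a difficulty classifier, which picks an SR branch per sub-image. The feature choices, the svm and sr_nets arguments, and the assumption of three difficulty classes are placeholders, not the paper's actual Class-Module or SR-Module.

import numpy as np

def classify_and_route(sub_images, svm, sr_nets):
    # svm: a fitted classifier with .predict(); sr_nets: dict mapping a
    # predicted difficulty class (e.g. 0/1/2) to an SR callable.
    outputs = []
    for patch in sub_images:
        feats = np.array([patch.mean(), patch.std(),
                          np.abs(np.diff(patch, axis=0)).mean(),
                          np.abs(np.diff(patch, axis=1)).mean()])
        difficulty = svm.predict(feats[None, :])[0]
        outputs.append(sr_nets[difficulty](patch))
    return outputs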

Xueliang Zhong, Jianping Luo
CLN: Complementary Learning Network for 3D Face Reconstruction and Alignment

Under complex and changeable unconstrained conditions, 3D Morphable Model (3DMM) parameter regression methods lack the ability to express local details and struggle to enrich geometric details. In this paper, we propose a complementary learning network (CLN), which aims to improve the ability to extract global discriminative features and capture local details through complementary learning between global and local information. Specifically, we first elaborately design a cross-domain self-shuffling data augmentation method to simulate the case of inconspicuous or obscured face information under unconstrained conditions, thus increasing the sensitivity of the network to local information. Different from other methods, our augmented data contains global data that can guide the network in reasoning about inconspicuous areas of the face under special lighting and partially obscured areas of the face in large poses. To better achieve complementarity between global information and local details, we design a complementary learning network in which one stream extracts global discriminative features from the original input image, while another stream extracts local detailed features from the shuffled image. Then, we adopt a coordinate attention transformer module to strengthen our network’s ability to capture the correlation between local information. Finally, we linearly fuse the global information that constrains the geometric shape of the face with the local information carrying rich geometric details to improve the regression accuracy of the network. Extensive experiments on AFLW, AFLW2000-3D, and other datasets demonstrate that our method is superior to previous methods.

Kangbo Wu, Xitie Zhang, Xing Zheng, Suping Wu, Yongrong Cao, Zhiyuan Zhou, Kehua Ma
Combining Edge-Guided Attention and Sparse-Connected U-Net for Detection of Image Splicing

The use of image-splicing technologies has detrimental effects on the security of multimedia information. Hence, it is necessary to develop effective methods for detecting and locating such tampering. Previous studies have mainly focused on the supervisory role of the mask on the model, while the mask edges, which contain rich complementary signals that help to fully understand the image, are usually ignored. In this paper, we propose a new network named EAU-Net to detect and locate spliced regions in an image. The proposed network consists of two parts: an edge-guided SegFormer and a Sparse-connected U-Net (SCU). Firstly, the feature extraction module captures local detailed cues and global context information, which are used by SegFormer to infer the initial location of the affected regions. Secondly, a Sobel-based edge-guided module (EGM) is proposed to guide the network to explore the complementary relationship between splicing regions and their boundaries. Thirdly, in order to achieve more precise localization results, the SCU is used as post-processing for removing false-alarm pixels outside the focus regions. In addition, we propose an adaptive loss weight adjustment algorithm to supervise the network training, through which the weights of the mask and the mask edge can be automatically adjusted. Extensive experimental results show that the proposed method outperforms state-of-the-art splicing detection and localization methods in terms of detection accuracy and robustness.

Lin Wan, Lichao Su, Huan Luo, Xiaoyan Li
Contour-Augmented Concept Prediction Network for Image Captioning

Semantic information in images is essential for image captioning. However, previous works leverage the pre-trained object detector to mine semantics in an image, making the model unable to accurately capture visual semantics, and further making the generated descriptions irrelevant to the content of the given image. Thus, in this paper, we propose a Contour-augmented Concept Prediction Network (CCP-Net), which leverages two additional aspects of visual information, including high-level features (concepts) and low-level features (contours) in an end-to-end manner, to encourage the contribution of visual content in description generation. Furthermore, we propose a contour-augmented visual feature extraction module and equip it with elegantly designed feature fusion. Utilizing homogeneous contour features can better enhance visual feature extraction and further promote visual concept prediction. Extensive experimental results on MS COCO dataset demonstrate the effectiveness of our method and each proposed module, which can obtain 40.6 BLEU-4 and 135.6 CIDEr scores. Code will be released in the final version of the paper.

Ting Wang, Weidong Chen, Jingyu Li, Yixing Peng, Zhendong Mao
Contrastive Knowledge Amalgamation for Unsupervised Image Classification

Knowledge amalgamation (KA) aims to learn a compact student model that handles the joint objective of multiple teacher models, each specialized for its own task. Current methods focus on coarsely aligning teachers and students in the common representation space, making it difficult for the student to learn proper decision boundaries from a set of heterogeneous teachers. Besides, the KL divergence used in previous works only minimizes the probability distribution difference between teachers and the student, ignoring the intrinsic characteristics of the teachers. Therefore, we propose a novel Contrastive Knowledge Amalgamation (CKA) framework, which introduces contrastive losses and an alignment loss to achieve intra-class cohesion and inter-class separation. Intra- and inter-model contrastive losses are designed to widen the distance between representations of different classes. The alignment loss is introduced to minimize the sample-level distribution differences between teacher and student models in the common representation space. Furthermore, the student learns heterogeneous unsupervised classification tasks through soft targets efficiently and flexibly via task-level amalgamation. Extensive experiments on benchmarks demonstrate the generalization capability of CKA in the amalgamation of specific tasks as well as multiple tasks. Comprehensive ablation studies provide further insight into our CKA.

Shangde Gao, Yichao Fu, Ke Liu, Yuqiang Han
Cross Classroom Domain Adaptive Object Detector for Student’s Heads

Training on a label-rich dataset and testing on another, label-scarce dataset usually leads to poor performance because of the domain shift. Unsupervised domain adaptation has proved effective on this problem in recent research. Unsupervised domain adaptive object detection of students’ heads between different classrooms has become an important task with the development of the Smart Classroom. However, few cross-classroom models for students’ heads have been proposed despite the rapid development of domain adaptive object detection. In this paper, we propose two adaptations that focus on the challenges of domain adaptive object detection of students’ heads between different classrooms: an adaptation based on the number of students and an adaptation based on the locations of students. Built on the Unbiased Mean Teacher framework, our Cross Classroom Domain Adaptive Object Detector achieves an average precision of 50.2% on the cross-classroom students’ heads dataset SCUT_HEAD, which outperforms existing state-of-the-art methods.

Chunhui Li, Haoze Yang, Kunyao Lan, Liping Shen
Diffusion-Adapter: Text Guided Image Manipulation with Frozen Diffusion Models

Research on vision-language models has seen rapid development, enabling natural language-based processing for image generation and manipulation. Existing text-driven image manipulation is typically implemented by GAN inversion or by fine-tuning diffusion models. The former is limited by the inversion capability of GANs, which fail to reconstruct pictures with novel poses and perspectives. The latter methods require expensive optimization for each input, and fine-tuning remains a complex process. To mitigate these problems, we propose a novel approach, dubbed Diffusion-Adapter, which performs text-driven image manipulation using frozen pre-trained diffusion models. In this work, we design an Adapter architecture to modify the target attributes without fine-tuning the pre-trained models. Our approach can be applied to diffusion models in any domain and only takes a few examples to train the Adapter, which can successfully edit images from unknown data. Compared with previous work, Diffusion-Adapter preserves a maximal amount of detail from the original image without unintended changes to the input content. Extensive experiments demonstrate the advantages of our approach over competing baselines, and we make a novel attempt at text-driven image manipulation.

Rongting Wei, Chunxiao Fan, Yuexin Wu
DWA: Differential Wavelet Amplifier for Image Super-Resolution

This work introduces Differential Wavelet Amplifier (DWA), a drop-in module for wavelet-based image Super-Resolution (SR). DWA invigorates an approach recently receiving less attention, namely Discrete Wavelet Transformation (DWT). DWT enables an efficient image representation for SR and reduces the spatial area of its input by a factor of 4, the overall model size, and computation cost, framing it as an attractive approach for sustainable ML. Our proposed DWA model improves wavelet-based SR models by leveraging the difference between two convolutional filters to refine relevant feature extraction in the wavelet domain, emphasizing local contrasts and suppressing common noise in the input signals. We show its effectiveness by integrating it into existing SR models, e.g., DWSR and MWCNN, and demonstrate a clear improvement in classical SR tasks. Moreover, DWA enables a direct application of DWSR and MWCNN to input image space, reducing the DWT representation channel-wise since it omits traditional DWT.
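
The core operation described above, taking the difference of two parallel convolutions so that shared noise cancels and local contrasts are emphasized, can be sketched in a few lines. This is a minimal stand-in, not the published DWA module, and the kernel size is an assumption.

import torch
import torch.nn as nn

class DiffConv(nn.Module):
    # Difference of two learned convolutions: components common to both
    # branches (e.g. shared noise) cancel, local contrasts are amplified.
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, k, padding=k // 2)
        self.conv_b = nn.Conv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        return self.conv_a(x) - self.conv_b(x)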

Brian B. Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, Andreas Dengel
Dynamic Facial Expression Recognition in Unconstrained Real-World Scenarios Leveraging Dempster-Shafer Evidence Theory

Dynamic facial expression recognition (DFER) has garnered significant attention due to its critical role in various applications, including human-computer interaction, emotion-aware systems, and mental health monitoring. Nevertheless, addressing the challenges of DFER in real-world scenarios remains a formidable task, primarily due to the severe class imbalance problem, leading to suboptimal model performance and poor recognition of minority class expressions. Recent studies in facial expression recognition (FER) for class imbalance predominantly focus on spatial features analysis, while the capacity to encode temporal features of spontaneous facial expressions remains limited. To tackle this issue, we introduce a novel dynamic facial expression recognition in real-world scenarios (RS-DFER) framework, which primarily comprises a spatiotemporal features combination (STC) module and a multi-classifier dynamic participation (MCDP) module. Our extensive experiments on two prevalent large-scale DFER datasets from real-world scenarios demonstrate that our proposed method outperforms existing state-of-the-art approaches, showcasing its efficacy and potential for practical applications.

Zhenyu Liu, Tianyi Wang, Shuwang Zhou, Minglei Shu
End-to-End Remote Sensing Change Detection of Unregistered Bi-temporal Images for Natural Disasters

Change detection based on remote sensing images has been a prominent area of interest in the field of remote sensing. Deep networks have demonstrated significant success in detecting changes in bi-temporal remote sensing images and have found applications in various fields. Given the degradation of natural environments and the frequent occurrence of natural disasters, accurately and swiftly identifying damaged buildings in disaster-stricken areas through remote sensing images holds immense significance. This paper aims to investigate change detection specifically for natural disasters. Considering that existing public datasets used in change detection research are registered, which does not align with the practical scenario where bi-temporal images are not matched, this paper introduces an unregistered end-to-end change detection synthetic dataset called xBD-E2ECD. Furthermore, we propose an end-to-end change detection network named E2ECDNet, which takes an unregistered bi-temporal image pair as input and simultaneously generates the flow field prediction result and the change detection prediction result. It is worth noting that our E2ECDNet also supports change detection for registered image pairs, as registration can be seen as a special case of non-registration. Additionally, this paper redefines the criteria for correctly predicting a positive case and introduces neighborhood-based change detection evaluation metrics. The experimental results have demonstrated significant improvements.

Guiqin Zhao, Lianlei Shan, Weiqiang Wang
E-Patcher: A Patch-Based Efficient Network for Fast Whole Slide Images Segmentation

UNeXt is a leading medical image segmentation method that employs a convolutional and multi-layer perceptron (MLP) structure for its segmentation network. It outperforms other image recognition algorithms, such as MedT and TransUNet, in terms of computation speed and has shown great potential for clinical applications. However, its reliance on limited pixel-neighborhood information for pixel-level segmentation of large pathological images may lead to inaccurate segmentation of the entire image and overlooked features, resulting in suboptimal segmentation results. To this end, we designed a lightweight and universal patch-level plug-and-play block that considers local and global features simultaneously, named the “Digging” and “Filling” ViT (DF-ViT) block. Specifically, a “Digging” operation is introduced to randomly select sub-blocks from each sub-patch. Multi-Head Attention (MHA) is then applied to integrate global information into these sub-blocks. The resulting sub-blocks with global semantic features are reassembled into the original feature map, and feature fusion is performed to combine the local and global features. This approach achieves global representation while maintaining a low computational complexity of 0.1424G. Compared to UNeXt, it improves mIoU by 1.07%. Moreover, it reduces the parameter count by 58.50% and the computational workload by 68.09%. Extensive experiments on the PAIP 2019 WSI dataset demonstrate that the DF-ViT block significantly enhances computation efficiency while maintaining a high level of accuracy.

Xiaoshuang Huang, Shuo Wang, Jinze Huang, Yaoguang Wei, Xinhua Dai, Yang Zhao, Dong An, Xiang Fang
Exploiting Multi-modal Fusion for Robust Face Representation Learning with Missing Modality

Current RGB-D-T face recognition methods are able to alleviate the sensitivity to facial variations, posture, occlusions, and illumination by incorporating complementary information, while they rely heavily on the availability of complete modalities. Given the likelihood of missing modalities in real-world scenarios and the fact that current multi-modal recognition models perform poorly when faced with incomplete data, robust multi-modal models for face recognition that can handle missing modalities are highly desirable. To this end, we propose a multi-modal fusion framework for robustly learning face representations in the presence of missing modalities, using a combination of RGB, depth, and thermal modalities. Our approach effectively blends these modalities together while also alleviating the semantic gap among them. Specifically, we put forward a novel modality-missing loss function to learn the modality-specific features that are robust to missing-modality data conditions. To project various modality features to the same semantic space, we exploit a joint modality-invariant representation with a central moment discrepancy (CMD) based distance constraint training strategy. We conduct extensive experiments on several benchmark datasets, such as VAP RGBD-T and Lock3DFace, and the results demonstrate the effectiveness and robustness of the proposed approach under uncertain missing-modality conditions compared with all baseline algorithms.

Yizhe Zhu, Xin Sun, Xi Zhou
Extraction Method of Rotated Objects from High-Resolution Remote Sensing Images

In recent years, the rapid development of remote sensing technology has made intelligent interpretation possible. However, compared with natural images, remote sensing images have arbitrary object orientations, small object sizes and complex backgrounds, and these problems make accurate object detection difficult. In this study, several classical object detection methods are first used to compare detection results on a self-made high-resolution remote sensing image rotation detection dataset. Then, a dual-mode rotation regression network, namely DRRN, is designed to solve the problem of arbitrary object orientation in remote sensing images. DRRN mainly consists of two parts: a dual-mode region proposal network and a rotation regression network. Meanwhile, a combined regression loss with intersection over union is proposed to improve the traditional smooth L1 loss. Next, a bi-directional cross-layer connected feature pyramid network, namely bc-FPN, is proposed to solve the problem of small objects, and a supervised hybrid attention mechanism, namely SHAM, is proposed to solve the problem of complex backgrounds; both modules are portable and plug-and-play. Experiments show that the proposed methods can effectively improve the detection of objects in remote sensing images.

Tao Sun, Kun Liu, Jiechuan Shi
Few-Shot NeRF-Based View Synthesis for Viewpoint-Biased Camera Pose Estimation

Recently, several works have paid attention to view synthesis by neural radiance fields (NeRF) to improve camera pose estimation. Among them, LENS and Direct-PoseNet synthesize novel views from pre-trained NeRF and then train the pose regression convolutional network using real observations and the augmented synthetic views for better localization. Therefore, the performance depends on the three-dimensional (3D) consistency and the image quality of novel views. Especially, localization tends to fail if a diverse and high-quality training set is unavailable. To solve this issue, we tackle the problem of learning camera pose regressor from the viewpoint-biased and limited training set. We propose augmenting the regressor’s training set using a few-shot NeRF instead of an original NeRF, which is employed in the previous frameworks. We can render high-quality novel views with a consistent 3D structure for stable training of the regressor. The experiments show that few-shot NeRF is an effective data augmenter for camera pose estimation under the viewpoint-biased limited training set.

Sota Ito, Hiroaki Aizawa, Kunihito Kato
Ga-RFR: Recurrent Feature Reasoning with Gated Convolution for Chinese Inscriptions Image Inpainting

Inscriptions were a primary means of recording historical events and literary works in ancient times, and remain an important part of Chinese ancient civilization. However, due to their large quantity and prolonged exposure to the natural environment, inscriptions have suffered significant damage. Traditional manual restoration methods are time-consuming and labor-intensive, making it necessary to explore new restoration techniques. In this study, we introduce a Ga-RFR network that uses gated convolution to replace ordinary convolution. This technique reduces feature redundancy in generated feature maps and minimizes the production of unnecessary information, thereby enhancing restoration effectiveness. We also compare our method with other advanced image restoration algorithms, and our results demonstrate that our approach outperforms other current methods.

Long Zhao, Yuhao Lou, Zonglong Yuan, Xiangjun Dong, Xiaoqiang Ren, Hongjiao Guan
Generalisation Approach for Banknote Authentication by Mobile Devices Trained on Incomplete Samples

Reliable banknote authentication is critical for economic stability. Regarding everyday use, recent studies have implemented successful techniques using banknote images taken by mobile phone cameras. One challenge in mobile banknote authentication is that it is impossible to collect images from all series/brands of mobile phones. In this study, classification models are implemented that are able to generalize to samples from a wide range of mobile phone series even though they are trained with samples from a small group of series. Existing state-of-the-art banknote authentication approaches train a separate model per sub-image of a banknote, using the extracted features of that sub-image. A new approach that trains a single global model on the concatenated features of all the sub-images is presented. Furthermore, ensemble models that combine Linear Discriminant Analysis and Deep Neural Networks are employed in order to maximize accuracy. The implemented techniques were able to reach an F1-score of up to 0.99914 on a Euro banknote data set which contains images from 16 different mobile-phone series. The results also indicate that the new global model approach can improve the accuracy of existing banknote authentication techniques in the case of model training with images from restricted/incomplete phone series and brands.
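
The single-global-model idea, concatenating per-sub-image features and combining an LDA model with a neural network, could look roughly like the sketch below. Feature extraction is assumed to happen upstream, and the probability-averaging ensemble and MLP size are illustrative choices rather than the paper's exact configuration.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

def train_global_model(sub_image_features, labels):
    # sub_image_features: list of (n_samples, n_feats) arrays, one per sub-image.
    X = np.concatenate(sub_image_features, axis=1)   # one global vector per banknote
    lda = LinearDiscriminantAnalysis().fit(X, labels)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, labels)

    def predict(x):
        # Average the class probabilities of the two models (soft voting).
        p = (lda.predict_proba(x) + mlp.predict_proba(x)) / 2.0
        return p.argmax(axis=1)

    return predict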

Barış Gün Sürmeli, Eugen Gillich, Helene Dörksen
Image Caption with Prior Knowledge Graph and Heterogeneous Attention

Currently, most image description models are limited in their ability to generate descriptions that reflect personal experiences and subjective perspectives. This makes it difficult to produce relevant and engaging descriptions that truly capture the essence of the image. To address this issue, we propose a novel approach called Subject-awareness-driven Heterogeneous Attention (SCHA). SCHA leverages users’ knowledge and expertise to generate content-adaptive image descriptions that are more human-like and reflective of personal experiences. Our approach involves a carefully designed heterogeneous cascade annotation model that captures scene information from multiple perspectives. We also incorporate a prior knowledge graph with textual information to enhance the richness and relevance of the generated descriptions. Our method has great potential for industrial production detection and can open up new possibilities for increasing the flexibility and variety of detection steps. In comparisons on the MSCOCO and Visual Genome datasets, our approach produces richer and more adaptive descriptions than widely used baseline models.

Junjie Wang, Wenfeng Huang
Image Captioning for Nantong Blue Calico Through Stacked Local-Global Channel Attention Network

Nantong blue calico, a Chinese folk hand-made printing and dyeing craft, has become one of the intangible cultural heritages (ICHs) of China. To inherit and promote the ICH of Nantong blue calico, this study applies image captioning technology to explaining blue-calico images. For this purpose, a novel image captioning method, called the stacked local-global channel attention network (SLGCAN), is proposed. This new network focuses on extracting important features from blue-calico images so that it can generate more accurate captions for them. SLGCAN contains three parts: a residual network (ResNet), a stacked local-global channel attention module (SLGCAM), and a Transformer. First, the pre-trained ResNet-101 model is used to extract rough features from blue-calico images, and then SLGCAM is used to obtain fine-grained information from the rough image features. Eventually, SLGCAN adopts the Transformer to encode and decode the fine-grained information of blue-calico images to predict the word information for generating accurate image captions. Experiments are conducted on a collected blue-calico image dataset. In experiments, we compare our SLGCAN with baseline models and show that the proposed model is feasible and effective.

Chenyi Guo, Li Zhang, Xiang Yu
Improving Image Captioning with Feature Filtering and Injection

Image captioning represents a challenging multimodal task, requiring the generation of corresponding textual descriptions for complex input images. Existing methods usually leverage object detectors to extract visual features of images, and thus utilize text generators for learning. However, the features extracted by these methods lack focus and tend to ignore the relationship between objects and background information. To solve the aforementioned problems, we exploit both region features and grid features of the image to fully leverage the information encapsulated within the images. Specifically, we first propose an Object Filter Module (OFM) to extract the primary visual objects. Furthermore, we introduce a Global Injection Cross Attention (GICA) to inject the global context of the image into the filtered primary objects. The experimental results substantiate the efficacy of our model. Our model’s effectiveness and immense potential have been demonstrated through extensive experimentation on the widely-used benchmark COCO dataset. It outperforms previous methods on the image captioning task, achieving a CIDEr score of 136.1.

Menghao Guo, Qiaohong Chen, Xian Fang, Jia Bao, Shenxiang Xiang
In Silico Study of Single Synapse Dynamics Using a Three-State Kinetic Model

In this study, we validate a single synapse neural mass model based on a 3-state kinetic framework. Our model implements the ligand-gated neurotransmitter receptors mediated by excitatory alpha-amino-3-hydroxy-5-methyl-4-isoxazolepropionic-acid (AMPA) and N-methyl-D-aspartate (NMDA), and inhibitory gamma-amino-butyric acid subtype A (GABA-A) synapses. Our results show that the 3-state model equipped with a desensitization state can assay the synaptic transmission processes recorded in both in vivo and in silico studies. Overall, we present a computationally light, single synapse kinetic model that can adapt as new molecular studies evolve. The fundamental behavioural study presented here also lays the foundation for building larger in silico models that are computationally efficient and capable of replicating physiological and experimental observations as well as making testable predictions.

Swapna Sasi, Basabdatta Sen Bhattacharya
Interpretable Image Recognition by Screening Class-Specific and Class-Shared Prototypes

Convolutional neural networks (CNNs) have shown impressive performance in various domains, but their lack of interpretability remains an important issue. Prototype-based methods have been proposed to address this problem. Prototype-based methods assign a fixed number of prototypes to categories. But prototype networks are limited by the non-learnable relationship between prototypes and categories, which restricts each prototype to only one category. Furthermore, the large number of prototypes used in these methods often leads to poor prototype distribution. To address these limitations, we propose a deep learning approach with an active learning concept inspired by the associative function of the human brain. We introduce the Prototype Screening Matrix (PSM). We optimize PSM by label smoothing to describe the relationship between categories and prototypes, so that it can dynamically filter prototypes and retain prototypes that are more suitable for concept learning. PSM enables similar prototypes to be shared among different classes, which significantly reduces the number of prototypes and leads to a more rational distribution of prototypes. We experimentally validate the effectiveness of our proposed method on the CUB-200-2011 and Stanford Cars datasets and show that it achieves higher accuracy than existing methods. Our method is more interpretable, uses fewer prototypes, and has a simpler structure, advancing the state-of-the-art in interpretable and efficient prototype-based image classification methods. The code is available at https://github.com/Lixiaomemg/PSMnet .
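
As a point of reference for the label-smoothing step mentioned above, here is the standard smoothing of one-hot targets; the epsilon value and how the smoothed targets enter the PSM objective are assumptions, since the abstract does not specify them.

import torch

def smooth_labels(targets, num_classes, eps=0.1):
    # targets: (B,) long tensor of class indices. Keep 1 - eps probability
    # mass on the true class and spread eps uniformly over all classes.
    one_hot = torch.zeros(targets.size(0), num_classes)
    one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
    return one_hot * (1.0 - eps) + eps / num_classes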

Xiaomeng Li, Jiaqi Wang, Liping Jing
Joint Edge-Guided and Spectral Transformation Network for Self-supervised X-Ray Image Restoration

X-rays are widely utilized in the security inspection field due to their ability to penetrate objects and visualize intricate details and structural features. However, X-ray images often suffer from degradation issues, such as heavy noise and artifacts, which can adversely affect the accuracy of subsequent high-level tasks. Therefore, X-ray image restoration plays a critical role in the applications of X-ray images. Existing supervised restoration methods depend on numerous noisy-clean image pairs for training, which restricts their application to X-ray images. Although there have been a few attempts to train models with single noisy images, they ignore the unique prior knowledge of X-ray images. This results in poor performance, with artifacts and inadequate denoising. To tackle these challenges, we propose a novel self-supervised restoration method called the Joint Edge-guided and Spectral Transformation Network (ESTNet), which integrates edge guidance and spectral transformation techniques to restore color X-ray images. Specifically, ESTNet leverages an adaptive edge guidance module to emphasize edge details. In addition, to achieve a balance between noise suppression and detail preservation in image restoration, we propose spatial spectral blocks that enable the network to capture both global and local contextual information. Extensive experiments on real-world images confirm the superiority of ESTNet over state-of-the-art methods in terms of quantitative metrics and visual quality.

Shasha Huang, Wenbin Zou, Hongxia Gao, Weipeng Yang, Hongsheng Chen, Shicheng Niu, Tian Qi, Jianliang Ma
Lightweight Human Pose Estimation Based on Densely Guided Self-Knowledge Distillation

Current human pose estimation networks are difficult to deploy on lightweight devices due to their large number of parameters. An effective solution is knowledge distillation, but there still exist problems of insufficient learning ability in the student network: (1) there is an error avalanche problem in multi-teacher distillation; (2) there is noise in the heatmaps generated by teachers, which causes model degradation; (3) the effect of self-knowledge distillation is ignored; (4) pose estimation is usually treated purely as a regression problem, ignoring that it is also a classification problem. To address the above problems, we propose a densely guided self-knowledge distillation framework named DSKD to solve the error avalanche problem, propose a binarization operation to reduce the noise of the teachers’ heatmaps, and add a classification loss to the total loss to guide the student’s learning. Experimental results show that our method effectively improves the performance of different lightweight models.

Mingyue Wu, Zhong-Qiu Zhao, Jiajun Li, Weidong Tian
MCAPR: Multi-modality Cross Attention for Camera Absolute Pose Regression

Absolute camera pose regression typically estimates the position and orientation of a camera solely from the captured image, using a convolutional backbone with multilayer perceptron heads trained for a single reference scene only. Recently, leading pose regression results have been achieved on multiple datasets by extending this approach to learn multiple scenes and by leveraging data from different modalities, especially by fusing RGB and point cloud data. In this work, we propose to use cross-attention Transformers to learn multi-scene absolute camera pose regression, where cross-attention modules aggregate activation maps with self-attention across data modalities and convert latent features and a scene index into candidate pose predictions. This mechanism allows our model to focus more on general localization features. We evaluate our approach on the popular indoor benchmark dataset 7-Scenes and compare it against both state-of-the-art (SOTA) single-scene and multi-scene absolute pose regression models. Our main result is that, compared to existing multi-scene methods, our method improves rotation accuracy markedly while position accuracy improves only slightly.
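
The cross-modal attention fusion described above can be illustrated with PyTorch's built-in multi-head attention, with RGB tokens as queries and point-cloud tokens as keys and values. The class name, head count, and this particular query/key assignment are illustrative assumptions rather than the MCAPR architecture itself.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Queries come from one modality, keys/values from the other, so each
    # RGB token attends over the point-cloud features.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, pcd_tokens):
        fused, _ = self.attn(rgb_tokens, pcd_tokens, pcd_tokens)
        return fused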

Qiqi Shu, Zhaoliang luan, Stefan Poslad, Marie-Luce Bourguet, Meng Xu
MC-MLP: A Multiple Coordinate Frames MLP-Like Architecture for Vision

In deep learning, Multi-Layer Perceptrons (MLPs) have once again garnered attention from researchers. This paper introduces MC-MLP, a general MLP-like backbone for computer vision that is composed of a series of fully-connected (FC) layers. In MC-MLP, we propose that the same semantic information has varying levels of difficulty in learning, depending on the coordinate frame of features. To address this, we perform an orthogonal transform on the feature information, equivalent to changing the coordinate frame of features. Through this design, MC-MLP is equipped with multi-coordinate frame receptive fields and the ability to learn information across different coordinate frames. Experiments demonstrate that MC-MLP outperforms most MLPs in image classification tasks, achieving better performance at the same parameter level. The code will be available at: https://github.com/ZZM11/MC-MLP .
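
A small sketch of the “change of coordinate frame” idea from the abstract: features are rotated by a fixed orthogonal matrix, mixed by fully-connected layers, and rotated back. The random QR-based orthogonal matrix, the layer sizes, and the class name are assumptions; the paper's actual orthogonal transforms are not specified in the abstract.

import torch
import torch.nn as nn

class OrthogonalFrameMLP(nn.Module):
    # Mix features in an alternative orthogonal coordinate frame.
    def __init__(self, dim, hidden):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.register_buffer("Q", q)   # fixed orthogonal basis (a coordinate frame)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):              # x: (batch, tokens, dim)
        y = x @ self.Q                 # rotate into the new frame
        y = self.mlp(y)                # fully-connected mixing
        return y @ self.Q.T            # rotate back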

Zhimin Zhu, Jianguo Zhao, Tong Mu, Yuliang Yang, Mengyu Zhu
Medical Image Segmentation and Saliency Detection Through a Novel Color Contextual Extractor

Image segmentation is a critical step in computer-aided system diagnosis. However, many existing segmentation methods are designed for single-task driven segmentation, ignoring the potential benefits of incorporating multi-task methods, such as salient object detection (SOD) and image segmentation. In this paper, we propose a novel dual-task framework for the detection and segmentation of white blood cells and skin lesions. Our method comprises three main components: hair removal preprocessing for skin lesion images, a novel color contextual extractor (CCE) module for the SOD task, and an improved adaptive threshold (AT) paradigm for the image segmentation task. We evaluate the effectiveness of our proposed method on three medical image datasets, demonstrating superior performance compared to representative approaches.

Xiaogen Zhou, Zhiqiang Li, Tong Tong
MedNet: A Dual-Copy Mechanism for Medical Report Generation from Images

Generating medical reports from images is a complex task in the healthcare domain. Existing approaches often rely on predefined templates to retrieve sentences and do not take into account the hierarchical structure of medical reports. Additionally, they overlook the selective copying of input sequences to output sequences. To address these challenges, we propose MedNet, a generation-based model that employs a dual-copy mechanism to automatically generate medical reports from images. Our methodology involves first extracting features from images using an image encoder. Next, we use a dual-copy mechanism as the sequence encoder for retrieved reports to combine word generation in the decoder with bi-copying subsequences from the input sequence and placing them appropriately. Finally, a language decoder generates coherent and informative reports. We evaluate our MedNet on two public datasets, Open-I and MIMIC-CXR, and demonstrate that it outperforms current state-of-the-art methods. Our approach not only improves the quality of the generated reports but also allows for flexible generation, making it well-suited for a variety of healthcare applications. The proposed dual-copy mechanism enables the utilization of both integral tokens and sub-tokens, enhancing the accuracy and relevance of generated reports. Our work represents a significant step forward in the field of automated medical report generation from images.

Peng Nie, Xinbo Liu
Ms-AMPool: Down-Sampling Method for Dense Prediction Tasks

In recent years, CNN-based neural networks have continuously proven to perform well in the field of computer vision. For dense prediction tasks (object detection and image segmentation), which are fundamental tasks in computer vision, the CNN models currently designed rely mainly on the construction of multi-scale information. The pooling layer, as a low-cost down-sampling component for building such models, effectively saves model parameters and computational effort. However, existing mainstream pooling methods not only lose a lot of effective feature information during down-sampling, but also do not perform multi-scale fusion of feature information during down-sampling, which greatly affects the final performance of the network. To this end, we propose an adaptive mixed pooling method. It is designed to let the network achieve an adaptive mix of pooling methods by learning the relevant weight parameters, so that the network can retain as much valid feature information as possible when down-sampling. Further, we construct a multi-scale adaptive mixed pooling method that ensures the network performs adaptive fusion of multi-scale feature information when down-sampling. We have conducted experiments on YOLO-series object detection networks as well as the UNet and UNeXt image segmentation networks, and the results show the superiority of our method over other mainstream pooling methods in balancing performance and parameters in dense prediction tasks.
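
A minimal sketch of the adaptive mixing idea at a single scale: a learned weight blends max pooling and average pooling. The sigmoid parameterization, kernel size, and class name are assumptions; the published method additionally fuses several scales.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMixedPool(nn.Module):
    # Convex combination of max pooling and average pooling with a
    # learned mixing weight.
    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))   # mixing logit, learned
        self.k, self.s = kernel_size, stride

    def forward(self, x):
        a = torch.sigmoid(self.alpha)
        return a * F.max_pool2d(x, self.k, self.s) + (1 - a) * F.avg_pool2d(x, self.k, self.s)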

Shukai Yang, Xiaoqian Zhang, Yufeng Chen, Lei Pu
Multi-frame Tilt-angle Face Recognition Using Fusion Re-ranking

Tilt-angle face recognition is a common problem in public video surveillance. Given the complementarity among multi-frame tilt-angle faces captured from far to near, fusing them is a possible way to improve recognition performance. However, the feature fusion used in existing multi-frame frontal face recognition methods is unsuitable for tilt-angle faces with large changes in resolution and angle. To solve this issue, a multi-frame tilt-angle face recognition approach based on fusion re-ranking is proposed, which obtains a more accurate final similarity list by weighted fusion of the initial similarity lists given by different tilt-angle faces. To obtain person-specific and angle-specific fusion weights, an angle-guided adaptive weight prediction network is designed. Moreover, we propose a weighted ranking loss to train the network, which makes the gallery face with the same identity as the probe face rank higher in the fused list. Experimental results on tilt-angle face datasets demonstrate the effectiveness of our method.
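
The weighted fusion step can be written compactly as below; the weights would come from the paper's angle-guided weight prediction network, but here they are simply passed in, and the function name is illustrative.

import numpy as np

def fuse_similarities(sim_lists, weights):
    # sim_lists: one similarity vector over the gallery per probe frame.
    sims = np.stack(sim_lists, axis=0)          # (n_frames, n_gallery)
    w = np.asarray(weights, dtype=float)[:, None]
    fused = (w * sims).sum(axis=0) / w.sum()
    return np.argsort(-fused)                   # gallery indices, best match first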

Wenqin Song, Zhen Han, Kangli Zeng, Zhongyuan Wang
Multi-scale Field Distillation for Multi-task Semantic Segmentation

Semantic segmentation in computer vision has developed rapidly in recent years, because the development of autonomous driving technology requires increasingly accurate semantic segmentation models. However, with increasing demand, the use of deep models for semantic segmentation has reached a bottleneck: image information and the correlations between images are not fully mined during model operation, and segmentation accuracy can only be improved by deepening the model. To this end, we propose a multi-task semantic segmentation model based on multi-scale feature fusion and distillation to fully mine complementary information between related tasks. When building the multi-task learning framework, we first extract multi-scale image information from network layers at different depths, using feature fusion and skip connections at different scales to compensate for the spatial information lost in the down-sampling process. A distillation module constructs intermediate auxiliary tasks that roughly process features, so that high-quality features are obtained in the middle of training, alleviating the computational pressure on the decoders and improving the accuracy of each task. The proposed method uses three task-specific decoders trained for segmentation and two estimation tasks. Experiments on the NYUv2 and Cityscapes data sets show that adding multi-scale information improves the performance of semantic segmentation and the two estimation tasks, which demonstrates the effectiveness of the proposed method.

Aimei Dong, Sidi Liu

Open Access

Neural Field Conditioning Strategies for 2D Semantic Segmentation

Neural fields are neural networks that map coordinates to a desired signal. When a neural field is to jointly model multiple signals rather than memorize a single one, it must be conditioned on a latent code that describes the signal at hand. Despite its importance, there has been little research on conditioning strategies for neural fields. In this work, we explore the use of neural fields as decoders for 2D semantic segmentation. For this task, we compare three conditioning methods (simple concatenation of the latent code, Feature-wise Linear Modulation (FiLM), and Cross-Attention) in conjunction with latent codes that describe either the full image or only a local region of it. Our results show considerable performance differences between the examined conditioning strategies. Furthermore, we show that conditioning via Cross-Attention achieves the best results and is competitive with a CNN-based decoder for semantic segmentation.
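A minimal sketch of FiLM-style conditioning for a coordinate-based decoder, one of the three strategies compared above: a latent code is mapped to per-feature scale and shift parameters that modulate the field's hidden activations. The class name, latent dimension, and number of classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FiLMNeuralField(nn.Module):
    def __init__(self, latent_dim=256, hidden=128, num_classes=21):
        super().__init__()
        self.inp = nn.Linear(2, hidden)                    # (x, y) coordinates
        self.film = nn.Linear(latent_dim, 2 * hidden)      # -> gamma, beta
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, coords, z):
        gamma, beta = self.film(z).chunk(2, dim=-1)
        h = torch.relu(self.inp(coords))
        h = gamma * h + beta                               # feature-wise linear modulation
        return self.out(torch.relu(h))

field = FiLMNeuralField()
# 4096 query coordinates, one image-level latent code -> per-coordinate class logits
logits = field(torch.rand(1, 4096, 2), torch.randn(1, 1, 256))
```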

Martin Gromniak, Sven Magg, Stefan Wermter
Neurodynamical Model of the Visual Recognition of Dynamic Bodily Actions from Silhouettes

For social species, including primates, the recognition of dynamic body actions is crucial for survival. However, the detailed neural circuitry underlying this process is currently not well understood. In monkeys, body-selective patches in the visual temporal cortex may contribute to this processing. We propose a physiologically-inspired neural model of the visual recognition of body movements, which combines an existing image-computable model (‘ShapeComp’) that produces high-dimensional shape vectors of object silhouettes, with a neurodynamical model that encodes dynamic image sequences exploiting sequence-selective neural fields. The model successfully classifies videos of body silhouettes performing different actions. At the population level, the model reproduces characteristics of macaque single-unit responses from the rostral dorsal bank of the Superior Temporal Sulcus (Anterior Medial Upper Body (AMUB) patch). In the presence of time gaps in the stimulus videos, the predictions made by the model match the data from real neurons. The underlying neurodynamics can be analyzed by exploiting the framework of neural field dynamics.

Prerana Kumar, Nick Taubert, Rajani Raman, Anna Bognár, Ghazaleh Ghamkhari Nejad, Rufin Vogels, Martin A. Giese
PACE: Point Annotation-Based Cell Segmentation for Efficient Microscopic Image Analysis

Cells are essential to life: they provide the functional, genetic, and communication mechanisms that living organisms need to operate properly. Cell segmentation is pivotal for validating and analyzing biological hypotheses, i.e., for gaining insights into cell behavior and function and into diagnosis and treatment. Deep learning-based segmentation methods achieve high precision but require full segmentation masks annotated manually by experts for every cell, which is laborious and costly. Many approaches have been developed to reduce this manual annotation effort, and although they produce good results, a noticeable performance gap to fully supervised methods remains. To close that gap, we present PACE, a weakly supervised approach that uses only a point annotation and a bounding box for each cell to perform cell instance segmentation. The proposed approach not only reaches 99.8% of the fully supervised performance, but also surpasses the previous state of the art by a margin of more than 4%.

Nabeel Khalid, Tiago Comassetto Froes, Maria Caroprese, Gillian Lovell, Johan Trygg, Andreas Dengel, Sheraz Ahmed
Pie-UNet: A Novel Parallel Interaction Encoder for Medical Image Segmentation

Most early deep learning methods for medical image segmentation adopt a fully convolutional structure, but the fixed size of the convolutional window limits the modeling of long-range dependencies. ViT has powerful global modeling capabilities, yet it represents low-level feature detail poorly. To address these problems, we propose a novel encoder structure and design a new U-shaped network for medical image segmentation, called Pie-UNet. First, since ViT lacks localization and CNNs lack global perception, we let the two complement each other by encoding global and local information separately in a parallel, interactive manner. We also propose a locally structure-aware ViT, called the Rwin Transformer, to strengthen the local detail representation of ViT itself. To further refine the local representation, we construct a focal modulator based on large kernels. Finally, we propose a pre-fusion approach to optimize the information exchange between the heterogeneous structures. Experimental results demonstrate that the proposed Pie-UNet achieves more accurate segmentation results than several existing medical image segmentation methods.
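A rough sketch of the parallel-interaction idea only, assuming a convolutional branch for local detail and an attention branch for global context fused before being passed on; the layers below are generic placeholders, not the paper's Rwin Transformer or focal modulator.

```python
import torch
import torch.nn as nn

class ParallelInteractionBlock(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU())
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)    # fuse the two branches

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.local(x)                                # local (CNN) branch
        tokens = x.flatten(2).transpose(1, 2)                # (b, h*w, c) for attention
        glob, _ = self.attn(tokens, tokens, tokens)          # global (attention) branch
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))

block = ParallelInteractionBlock()
y = block(torch.randn(1, 64, 32, 32))                        # same spatial size out
```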

Youtao Jiang, Xiaoqian Zhang, Yufeng Chen, Shukai Yang, Feng Sun
Prior-SSL: A Thickness Distribution Prior and Uncertainty Guided Semi-supervised Learning Method for Choroidal Segmentation in OCT Images

Choroid structure is crucial for the diagnosis of ocular diseases, and deep supervised learning (SL) techniques have been widely applied to segment the choroidal structure in OCT images. However, SL requires massive annotated data, which is difficult to obtain. Researchers have explored semi-supervised learning (SSL) methods based on consistency regularization and achieved strong performance, but these methods suffer from heavy computational burdens and introduce noise that hinders training. To address these issues, we propose a thickness-distribution-prior and uncertainty-aware pseudo-label selection SSL framework (Prior-SSL) for OCT choroidal segmentation. Specifically, we compute the instance-level uncertainty of each pseudo-label candidate, which significantly reduces the computational burden of uncertainty estimation. In addition, we exploit the physiological characteristics of the choroid by using the choroidal thickness distribution as prior knowledge in the pseudo-label selection procedure, thereby obtaining more reliable and accurate pseudo-labels. Finally, the two branches are combined via a Modified AND-Gate (MAG) that assigns confidence levels to pseudo-label candidates. We achieve state-of-the-art performance on the choroidal segmentation task on the GOALS and NIDEK OCT datasets, and ablation studies verify the effectiveness of Prior-SSL in selecting high-quality pseudo-labels.
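A hedged illustration of how such a selection rule could look: a pseudo-label candidate is kept only when its instance-level uncertainty is low and the choroidal thickness it implies is plausible under a prior thickness distribution. This is an interpretation of the described gating, not the paper's exact MAG; all thresholds and names are assumptions.

```python
import numpy as np

def select_pseudo_label(prob_map: np.ndarray,
                        prior_mean: float, prior_std: float,
                        unc_thresh: float = 0.2, z_thresh: float = 2.0) -> bool:
    """prob_map: per-pixel foreground probabilities for one candidate mask."""
    mask = prob_map > 0.5
    # instance-level uncertainty: mean binary entropy over the predicted region
    p = np.clip(prob_map[mask], 1e-6, 1 - 1e-6)
    uncertainty = float(np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p)))
    # thickness check: mean column-wise height of the mask vs. the prior
    thickness = mask.sum(axis=0).mean()
    z_score = abs(thickness - prior_mean) / prior_std
    return uncertainty < unc_thresh and z_score < z_thresh   # AND-style gate

keep = select_pseudo_label(np.random.rand(256, 512), prior_mean=80.0, prior_std=15.0)
```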

Huihong Zhang, Xiaoqing Zhang, Yinlin Zhang, Risa Higashita, Jiang Liu
PSR-Net: A Dual-Branch Pyramid Semantic Reasoning Network for Segmentation of Remote Sensing Images

Long-range context plays an important role in improving the performance of semantic segmentation networks for remote sensing images (RSIs). However, in large RSIs the interaction between local and global information is limited. To address this problem, we propose a dual-branch pyramid semantic reasoning segmentation network. The network consists of a global branch and a local branch: the global branch employs a traditional CNN, while a lightweight multi-scale hierarchical feature aggregation (MHFA) module is introduced into the local branch. In addition, a Feature Semantic Reasoning (FSR) module is proposed to strengthen valuable features and suppress uninformative ones, improving the semantic representation of RSIs, and a dual-branch transformer is then embedded. An ablation experiment on the Beijing Land-Use (BLU) dataset confirms the effectiveness of the added modules, and comparisons with other traditional networks confirm the superiority of the proposed network, which achieves better segmentation accuracy on large-scale RSI datasets.

Lijun Wang, Bicao Li, Bei Wang, Chunlei Li, Jie Huang, Mengxing Song
Backmatter
Metadata
Title
Artificial Neural Networks and Machine Learning – ICANN 2023
Edited by
Lazaros Iliadis
Antonios Papaleonidas
Plamen Angelov
Chrisina Jayne
Copyright Year
2023
Electronic ISBN
978-3-031-44210-0
Print ISBN
978-3-031-44209-4
DOI
https://doi.org/10.1007/978-3-031-44210-0