
Computer Vision and Image Processing

9th International Conference, CVIP 2024, Chennai, India, December 19–21, 2024, Revised Selected Papers, Part VI

  • 2026
  • Book

About this book

The six-volume proceedings set CCIS 2473–2478 constitutes the refereed proceedings of the 9th International Conference on Computer Vision and Image Processing, CVIP 2024, held in Chennai, India, during December 19–21, 2024.

The 178 full papers presented were carefully reviewed and selected from 647 submissions. The papers focus on various important and emerging topics in image processing, computer vision applications, deep learning, and machine learning techniques in the domain.

Table of Contents

Frontmatter
Do Not Look so Locally to Fish Skins: Improved YOLOv7 for Fish Disease Detection with Transformers
Abstract
Aquaculture production significantly influences overall fish production, yet it is often adversely affected by various fish diseases. These diseases can be effectively identified by analyzing the condition of the fish’s skin. Consequently, there is a growing demand for automated fish skin disease detection methods. By implementing such automated approaches, the efficiency and accuracy of disease detection can be enhanced, leading to better management of fish health and, ultimately, more sustainable aquaculture practices. In this work, we propose a novel Transformer-based modified YOLO approach for the detection of five different fish skin diseases. We propose a Transformer feature extraction module (TFEM) to effectively capture long-range dependencies from the input image. The proposed TFEM is incorporated into the YOLOv7 backbone for efficient feature learning. We assessed the performance of our proposed TFEM by comparing it with various YOLOvX approaches to confirm its effectiveness. Both qualitative and quantitative results demonstrate that our method is highly capable of accurately detecting five distinct fish diseases. The source code is available at: https://github.com/shrutiphutke/Fish_disease_detection_YOLO_transformer
Shruti Phutke, Amit Shakya, Chetan Gupta, Rupesh Kumar, Tsuyoshi Kuroda, Lalit Sharma
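To make the idea of a Transformer feature-extraction module concrete, here is a minimal PyTorch sketch of self-attention applied to a convolutional feature map, the way a TFEM-style block might be dropped into a detection backbone. The class name TFEMBlock, the layer sizes, and the placement are illustrative assumptions, not the authors' implementation (see their repository above for the actual code).

```python
# Hedged sketch: a self-attention block over flattened CNN feature-map positions.
import torch
import torch.nn as nn

class TFEMBlock(nn.Module):
    """Self-attention across spatial positions of a feature map (illustrative)."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.attn(tokens)              # long-range spatial mixing
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Usage: drop the block between two convolutional backbone stages.
backbone_stage = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
    TFEMBlock(64),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
)
features = backbone_stage(torch.randn(1, 3, 64, 64))
print(features.shape)  # torch.Size([1, 128, 16, 16])
```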
MDDAMFN: Mixed Dual-Direction Attention Mechanism to Enhance Facial Expression
Abstract
Facial expression recognition (FER) plays an important role in human-computer interaction (HCI). Building on the current state of the art, the Dual-Direction Attention Mixed Feature Network (DDAMFN), we propose the Mixed Dual-Direction Attention Mixed Feature Network (MDDAMFN), incorporating a novel Mixed Dual-Direction Attention (MDDA) mechanism to address limitations in the original architecture. This new approach captures a wider range of information, from very local to global, mimicking the human perception of facial expressions. The MDDA mechanism enhances the model’s ability to identify better attention regions, significantly improving inter-class and intra-class predictions. Experimental results on the AffectNet, CAER-S, and FERPlus datasets show that MDDAMFN not only maintains the lightweight and robust characteristics of its predecessor (DDAMFN) but also achieves superior performance compared to existing models, making MDDAMFN a state-of-the-art model in the field of FER.
Srajan Chourasia, Sanskar Dethe, Shitala Prasad
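A hedged PyTorch sketch of the general dual-direction idea: attention computed separately along the height and width axes and then mixed with a residual connection. All names and kernel sizes below are assumptions, not the MDDAMFN architecture itself.

```python
# Illustrative dual-direction attention: separate vertical and horizontal maps, then mixed.
import torch
import torch.nn as nn

class DualDirectionAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.h_conv = nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0))
        self.w_conv = nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3))
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        h_att = torch.sigmoid(self.h_conv(x))   # attention along the vertical direction
        w_att = torch.sigmoid(self.w_conv(x))   # attention along the horizontal direction
        mixed = self.mix(torch.cat([x * h_att, x * w_att], dim=1))
        return x + mixed                        # residual mixing of both directions

feat = torch.randn(2, 64, 14, 14)               # e.g. a pooled facial feature map
print(DualDirectionAttention(64)(feat).shape)   # torch.Size([2, 64, 14, 14])
```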
A Brief Review of State-of-the-Art Classification Methods on Benchmark Peripheral Blood Smears Datasets
Abstract
White blood cells (WBCs) play a crucial role in the immune system, with their morphology and subtype counts serving as key indicators for diagnosing conditions like anemia and leukemia. However, manual WBC classification in peripheral blood smears is time-consuming, highlighting the need for automated WBC classification systems. Recent advancements in deep learning, including convolutional neural networks and vision transformers, have demonstrated significant potential in medical imaging by effectively extracting meaningful features. This paper surveys state-of-the-art techniques, examining relevant datasets and WBC types. We conduct a comprehensive performance analysis of nine models on two benchmark datasets, BCCD and PBC. Our findings indicate that ConvNeXt achieves a weighted average accuracy (WAA) of 89.58% and an F1-Score of 90.00% on the BCCD dataset, while DenseNet demonstrates superior performance on the PBC dataset, with a WAA of 98.88% and an F1-Score of 98.88%.
Muhammad Suhaib Kanroo, Hadia Showkat Kawoosa, Tanushri, Medha Aggarwal, Puneet Goyal
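For readers who want to reproduce this kind of comparison, here is a minimal sketch of fine-tuning torchvision backbones (two of the nine surveyed, ConvNeXt-Tiny and DenseNet-121) on a WBC classification dataset. The class count and the metric snippet are assumptions about the setup, not the paper's exact protocol.

```python
# Hedged sketch: swap the classification head of pretrained backbones for WBC classes.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # e.g. the PBC dataset has 8 cell classes; adjust per dataset

def build_model(name: str) -> nn.Module:
    if name == "convnext":
        m = models.convnext_tiny(weights="IMAGENET1K_V1")
        m.classifier[2] = nn.Linear(m.classifier[2].in_features, NUM_CLASSES)
    elif name == "densenet":
        m = models.densenet121(weights="IMAGENET1K_V1")
        m.classifier = nn.Linear(m.classifier.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unknown backbone: {name}")
    return m

# After training, weighted metrics can be computed with scikit-learn, e.g.:
# from sklearn.metrics import accuracy_score, f1_score
# f1_score(y_true, y_pred, average="weighted")
```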
Detection and Monocular Depth Estimation of Ghost Nets
Abstract
Marine debris has a detrimental impact on marine habitats, human health, and the economy. Among the various types of marine debris, abandoned, lost and discarded fishing gear (ALDFG), primarily ghost nets, causes the most damage to the environment, fisheries, and shipping. Autonomous underwater vehicles and swarm robotics can be used to clean up and manipulate ghost nets. This requires effective robotic perception for robots to work in such a challenging environment. In this work, we use binary object detection with transfer learning and evaluate the effectiveness of popular YOLOv5 models for real-time detection. Furthermore, we evaluate different monocular depth estimation techniques on ghost nets and couple YOLOv5 with MiDaS for real-time detection and depth estimation.
Mohammed Ayaan, R. Naveen Raj
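A rough sketch of the detection-plus-depth coupling described above, using the public torch.hub entry points for YOLOv5 and MiDaS. The specific model variants, the input file name, and the per-box depth summary are assumptions; the paper's fine-tuned ghost-net weights are not used here.

```python
# Hedged sketch: YOLOv5 detections combined with MiDaS relative depth per detection.
import cv2
import numpy as np
import torch

detector = torch.hub.load("ultralytics/yolov5", "yolov5s")     # fine-tuned weights in practice
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

img = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)  # placeholder frame

# 1) detect ghost-net bounding boxes
boxes = detector(img).xyxy[0]                                   # (N, 6): x1, y1, x2, y2, conf, cls

# 2) estimate relative depth for the whole frame
with torch.no_grad():
    depth = midas(transform(img))
    depth = torch.nn.functional.interpolate(
        depth.unsqueeze(1), size=img.shape[:2], mode="bicubic",
        align_corners=False).squeeze().numpy()

# 3) report the median relative depth inside each detection
for x1, y1, x2, y2, conf, cls in boxes.tolist():
    patch = depth[int(y1):int(y2), int(x1):int(x2)]
    print(f"conf={conf:.2f}, median relative depth={np.median(patch):.1f}")
```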
DiffMamba: Leveraging Mamba for Effective Fusion of Noise and Conditional Features in Diffusion Models for Skin Lesion Segmentation
Abstract
Effective skin lesion segmentation is crucial for dermatological care, as it enables the early identification and accurate diagnosis of skin cancer. Denoising Diffusion Probabilistic Models (DDPMs) have recently become a major focus in computer vision. Their applications in image generation, such as Stable Diffusion, Latent Diffusion Models, and Imagen, have showcased remarkable abilities in creating high-quality generative outputs. Recent research highlights that DDPMs also perform exceptionally well in medical image analysis, specifically in medical image segmentation tasks. Even though a U-Net backbone initially served as the foundation for these models, there is a promising opportunity to boost their performance by incorporating other mechanisms. Recent research includes transformer-based frameworks for diffusion models, but this advancement comes with the challenge of inherent quadratic complexity. Research has shown that state space models (SSMs), such as Mamba, efficiently capture long-range dependencies while maintaining linear computational complexity. Due to these benefits, Mamba outperforms many of the mainstream foundational architectures. However, we found that simply merging Mamba with diffusion results in suboptimal performance. To truly harness the power of these two advanced technologies for medical image segmentation, a more effective integration is required. We therefore formulate a novel Mamba-based diffusion framework, called DiffMamba, for skin lesion segmentation. We assess its performance on the ISIC 2018 dataset for skin lesion segmentation, and our method outperforms existing state-of-the-art techniques. The code is available at: https://github.com/amit-shakya-28/DiffMamba
Amit Shakya, Shruti Phutke, Chetan Gupta, Rupesh Kumar, Lalit Sharma, Chetan Arora
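A schematic sketch of the conditional diffusion training objective that such a segmentation framework builds on: noise is added to the ground-truth mask and a denoiser conditioned on the skin image predicts that noise. The denoiser (the Mamba-based fusion network in the paper) is left abstract, and the schedule values are generic DDPM defaults, not the authors' settings.

```python
# Hedged sketch of a conditional DDPM training step for segmentation masks.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, image, mask):
    """image: (B, 3, H, W) conditioning input; mask: (B, 1, H, W) scaled to [-1, 1]."""
    b = mask.size(0)
    t = torch.randint(0, T, (b,), device=mask.device)
    noise = torch.randn_like(mask)
    a_bar = alphas_cumprod.to(mask.device)[t].view(b, 1, 1, 1)
    noisy_mask = a_bar.sqrt() * mask + (1 - a_bar).sqrt() * noise
    # The denoiser fuses noise features with conditional image features
    # (a Mamba/SSM-based network in the paper; any nn.Module works here).
    pred_noise = denoiser(torch.cat([noisy_mask, image], dim=1), t)
    return F.mse_loss(pred_noise, noise)
```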
UDC-Mamba: Deep State Space Model for Under Display Camera Image Restoration
Abstract
Images captured with Under Display Camera (UDC) technology often experience various quality issues due to the inherent limitations of the capturing mechanism. For UDC image restoration, deep architectures utilizing CNNs and transformers frequently struggle to produce high-quality reconstructions because these networks often cannot effectively manage large receptive fields due to their inherent constraints. The recently proposed Mamba architecture, which employs State Space Models, has demonstrated promising results across various vision tasks, including image restoration applications such as denoising and super-resolution. The model efficiently manages large receptive fields with linear-time computational complexity. In this study, we evaluate the Mamba model’s performance on UDC image restoration tasks after introducing a UDC-specific additional module into the base architecture, namely MambaIR. The proposed model, named UDC-Mamba, consists of a shallow restoration module, a novel hybrid deep enhancement module, and a selective scan module for high-quality reconstruction. In our proposed hybrid deep enhancement module, convolutional blocks with multiple kernel sizes are used in conjunction with state space blocks. Experiments reveal that this model effectively restores UDC images, achieving notably superior perceptual quality compared to existing state-of-the-art methods. Our code is available at https://github.com/J-Karthik-palaniappan/UDC_Mamba
Aniruth Sundararajan, MS Levin, Karthik Palaniappan, Jiji CV
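A minimal sketch of the multi-kernel convolutional part of a hybrid enhancement module as described above; the state-space branch is deliberately omitted here, and the block name and channel counts are assumptions rather than the UDC-Mamba design.

```python
# Hedged sketch: parallel convolutions with different kernel sizes, fused residually.
import torch
import torch.nn as nn

class MultiKernelConvBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return x + self.fuse(torch.cat(feats, dim=1))   # residual fusion of all scales

shallow_features = torch.randn(1, 48, 128, 128)          # e.g. output of a shallow restoration stage
print(MultiKernelConvBlock(48)(shallow_features).shape)  # torch.Size([1, 48, 128, 128])
```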
Walking Direction Estimation Using Silhouette and Skeletal Representations
Abstract
Walking direction is vital for applications such as surveillance, security, traffic safety systems, and health monitoring. Silhouettes and skeleton joint coordinates are commonly used gait representation modalities, each with its own advantages. Silhouettes are rich in representative features, while skeleton coordinates are more robust to noise. In this paper, we explore these two popular modalities for walking direction estimation. We leverage temporal sequence modeling to capture the correlation between consecutive frames and extract gait-relevant motion features. Despite the challenges posed by variations in pose, clothing, and carrying conditions, which lead to high intra-class variability, our approach utilizes sequence-based deep architectures to address these issues. These architectures, with their ability to generalize across different conditions and learn hierarchical and rich feature representations, demonstrate effective representation learning and provide a baseline for comparing the two input modalities. We propose a novel method that utilizes a deep architecture with residual blocks across the two modalities to extract direction-relevant features. We introduce two different training and testing settings to evaluate our model on the CASIA-B dataset and provide an ablation study on the effect of triplet loss in training. Experimental results show that our proposed methods achieve impressive performance, with an average Rank-1 accuracy of 97.41% on the CASIA-B dataset, and average Rank-1 accuracies of 96.15% and 96.30% on the OU-MVLP silhouette and pose datasets, respectively.
A. V. Vishnuram, Raj Hilton, Rahul Raman
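A hedged sketch of a residual sequence model for the skeleton modality: keypoint sequences are projected, passed through residual 1D blocks, temporally pooled, and classified into discrete walking directions. The joint count, network width, and the 11-view output (as in CASIA-B) are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: residual temporal model over 2D keypoint sequences.
import torch
import torch.nn as nn

class Residual1D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.net(x))

class DirectionClassifier(nn.Module):
    """Input: (B, T, 2*J) keypoint sequences; output: logits over walking directions."""
    def __init__(self, joints=17, num_views=11, width=64):
        super().__init__()
        self.proj = nn.Conv1d(2 * joints, width, kernel_size=1)
        self.blocks = nn.Sequential(Residual1D(width), Residual1D(width))
        self.head = nn.Linear(width, num_views)

    def forward(self, seq):
        x = self.proj(seq.transpose(1, 2))     # (B, width, T)
        x = self.blocks(x).mean(dim=2)         # temporal average pooling
        return self.head(x)

logits = DirectionClassifier()(torch.randn(4, 60, 34))  # 60 frames, 17 joints
print(logits.shape)                                     # torch.Size([4, 11])
```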
Realizing GAN Potential for Image Generation and Image-To-Image Translation Using Pix2Pix
Abstract
Generative Adversarial Networks (GANs) have demonstrated significant potential in the field of image synthesis, especially in image generation and image-to-image translation. In this work, we explore the capabilities of GANs through two distinct implementations. First, we utilize a regular GAN to generate synthetic images belonging to the MNIST and CIFAR-10 datasets, illustrating the network’s proficiency in creating diverse and realistic images from different data distributions. Second, we apply the Pix2Pix GAN model for image-to-image (I2I) translation, focusing on converting satellite images into maps, a task that highlights the model’s ability to learn intricate mappings between paired datasets. Our results underscore the versatility of GANs in generating high-quality images and performing complex image translation tasks. In this study, for generating images of MNIST handwritten digits, the discriminator has a loss of 0.57 and an accuracy of 82%, while the generator has a loss of 0.52. For the CIFAR-10 dataset, the discriminator’s loss is 0.629 on real images and 0.51 on fake images, with the loss of the generator (essentially a CNN) at 0.897. In the image translation task using Pix2Pix on the maps dataset, the discriminator’s loss is 0.6 for real images and 0.4 for fake images, while the generator’s loss is 0.61. In each of the GAN architectures, the discriminator, the generator, and the combined GAN model are carefully designed and trained accordingly. These findings demonstrate the effectiveness of GANs in both generating diverse, high-quality images and performing precise image-to-image translation. The results validate the robustness and adaptability of GAN architectures across different datasets and tasks, reinforcing their potential for advanced image synthesis applications.
Sumera, T. S. Subashini, K. Vaidehi
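A compact sketch of the adversarial training step underlying both experiments; the generator and discriminator definitions are omitted, and the discriminator output shape (one logit per sample) is an assumption. For Pix2Pix, the standard objective additionally includes an L1 reconstruction term, noted in a comment.

```python
# Hedged sketch of one GAN optimization step (discriminator then generator).
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, z):
    # --- discriminator: push real samples toward 1, generated samples toward 0 ---
    fake = generator(z).detach()
    d_loss = (F.binary_cross_entropy_with_logits(
                  discriminator(real), torch.ones(real.size(0), 1)) +
              F.binary_cross_entropy_with_logits(
                  discriminator(fake), torch.zeros(fake.size(0), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator: fool the discriminator ---
    fake = generator(z)
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake), torch.ones(fake.size(0), 1))
    # For Pix2Pix, z is the input satellite image and the standard objective adds
    # an L1 term against the paired map, e.g. g_loss += 100 * F.l1_loss(fake, target).
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```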
DSFF-Net: Depthwise Separable U-Net with Feature Fusion for Polyp Segmentation Towards Hardware Deployment
Abstract
Colorectal cancer (CRC) can be prevented with early detection and removal of colorectal polyps. The accurate removal of polyps depends on their efficient segmentation. In power-constrained environments such as embedded devices, less complex deep learning models are needed to perform segmentation efficiently. This work proposes a lightweight segmentation model to be deployed on FPGA. The proposed architecture uses depthwise separable convolution efficiently, making the architecture perform well with lower computational complexity. The issue of polyp size variation is successfully overcome with feature fusion within the encoder section of the architecture. The proposed architecture was successfully evaluated on the CVC-ClinicDB and Kvasir-SEG datasets with an mIoU of 0.9365 and 0.8501, respectively. It delivers a throughput of 44.82 FPS with an energy efficiency of 2.5011 when deployed on the FPGA ZCU104. The proposed model outperforms the other state-of-the-art (SOTA) models considered.
Debaraj Rana, Bunil Kumar Balabantaray, Rangababu Peesapati
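A short sketch showing why depthwise separable convolution keeps such a model light: the block below factorizes a standard convolution into a depthwise and a pointwise step, and the parameter-count comparison makes the saving explicit. The channel sizes are arbitrary examples, not the DSFF-Net configuration.

```python
# Hedged sketch: depthwise separable convolution vs. a standard convolution.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # per-channel spatial filter
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)              # channel mixing
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

separable = DepthwiseSeparableConv(64, 128)
standard = nn.Conv2d(64, 128, 3, padding=1)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(separable), "vs", count(standard))  # roughly 9k vs 74k parameters
```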
Cattle Identification Through Multi-biometric Features and Edge Device
Abstract
Cattle individuality recognition has emerged as a critical aspect of contemporary precision livestock farming. Biometric identifiers, specifically muzzle and facial features, are gaining traction as key components in this domain. This paper proposes a novel multi-biometric approach for enhanced cattle individuality recognition in precision livestock farming. The system leverages advanced object detection models, specifically YOLOv8, to identify cattle based on muzzle and facial features. Pre-processing techniques and data augmentation strategies are employed to improve model robustness. The proposed method is implemented as a real-time edge device application, demonstrating its potential for practical agricultural use. A meticulously curated dataset (https://github.com/RahulRaman2/Indian-Cattle-Biometric-Database) exceeding 5,000 cattle face and muzzle images is utilized for model training, achieving an accuracy of 90.39%. Further improvements in accuracy can be achieved through continued refinement of the training dataset, optimization of the model parameters, and exploration of ensemble learning techniques.
Apurba Roy, Rahul Raman
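A minimal sketch of training and deploying a YOLOv8 detector with the ultralytics package, as the pipeline above implies; the dataset YAML name, hyperparameters, and ONNX export target are assumptions, not details from the paper.

```python
# Hedged sketch: YOLOv8 training, validation, export, and inference.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # nano variant suits edge devices
model.train(data="cattle_biometrics.yaml",      # hypothetical dataset config (images + labels)
            epochs=100, imgsz=640)
metrics = model.val()                           # mAP and related metrics
model.export(format="onnx")                     # deployable on an edge runtime

results = model("muzzle_frame.jpg")             # inference on a single placeholder image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)
```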
Fast Sparse SAR Image Reconstruction Using Sparsity Independent Regularized Pursuit
Abstract
Synthetic Aperture Radar (SAR) imaging produces high-resolution images of the Earth’s surface irrespective of weather conditions. SAR images require large bandwidths for transmission to ground stations, and the received data amounts to huge volumes. To overcome this challenge, in this work the sparse nature of the SAR image is exploited in the Fourier domain, and the complete signal is reconstructed using the fast Sparsity-Independent Regularized Pursuit (SIRP) reconstruction algorithm. The SIRP algorithm is suitably derived for SAR image reconstruction. It improves SAR image recovery compared to traditional CS recovery algorithms, as it uses an optimized regularization strategy and does not depend on sparsity. It also lessens the computational load using parallel estimation, which is crucial for quickly handling extensive amounts of SAR data. The proposed work is validated on the ERS-2 dataset, which is of size 4912 × 29750. SIRP proves its improvement in reconstruction by showing a significant gain in PSNR with less computational time compared to OMP and CoSaMP, thereby enhancing remote sensing capabilities.
Boddu Bharadwaj, J. Sheeba Rani
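To ground the compressed-sensing setting, here is a small numpy sketch that recovers a Fourier-sparse signal from a few random measurements using OMP, one of the baselines the paper compares against; SIRP itself is not reproduced here, and all sizes are toy values.

```python
# Hedged sketch: OMP recovery of a Fourier-sparse signal from random measurements.
import numpy as np

rng = np.random.default_rng(0)

def omp(A, y, n_iter):
    """Orthogonal Matching Pursuit: greedy sparse recovery of x from y = A x."""
    residual, support, x = y.copy(), [], np.zeros(A.shape[1], complex)
    for _ in range(n_iter):
        support.append(int(np.argmax(np.abs(A.conj().T @ residual))))   # most correlated atom
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)        # refit on the support
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x

n, m, k = 256, 80, 8                         # signal length, measurements, sparsity level
x_true = np.zeros(n, complex)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
F = np.fft.fft(np.eye(n)) / np.sqrt(n)       # sparsifying (Fourier) basis
A = rng.standard_normal((m, n)) @ F          # random measurements of a Fourier-sparse signal
x_hat = omp(A, A @ x_true, n_iter=k)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # small relative error
```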
Space Varying Motion Blur Degradation Dataset and Model for Semantic Segmentation
Abstract
This paper proposes an efficient degradation model addressing the Space Varying Motion Blur (SVMB) challenge in semantic image segmentation. SVMB distorts object boundaries in motion-captured images, and current State-of-the-Art (SotA) Deep Learning (DL) models require annotated datasets containing SVMB degradation to handle such issues. However, annotating acquired blurred images is often impractical, since the original information is heavily distorted. This work presents a simple and effective technique for generating synthetic SVMB image segmentation data using the Cityscapes benchmark dataset. We leverage the ground truth annotations from the dataset and the Connected Components Algorithm (CCA) to separate the foreground object information. Our experiments show that U-Net, trained on our SVMB image segmentation dataset augmented with the original Cityscapes dataset, demonstrates superior performance in segmenting synthetic and real-captured blurred image data.
Bhargav Reddy, S. Sree Rama Vamsidhar, Sahana Prabhu, Rama Krishna Gorthi
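A short OpenCV/numpy sketch of the data-generation idea: isolate foreground objects via connected components on the annotation mask and blur each object with its own motion kernel. The kernel lengths and the horizontal-only kernel are simplifying assumptions, not the paper's exact degradation model.

```python
# Hedged sketch: per-object (space-varying) motion blur from a foreground mask.
import cv2
import numpy as np

def motion_kernel(length: int) -> np.ndarray:
    k = np.zeros((length, length), np.float32)
    k[length // 2, :] = 1.0 / length          # horizontal motion; rotate for other directions
    return k

def apply_svmb(image: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """image: HxWx3 uint8; fg_mask: HxW binary foreground from the ground-truth labels."""
    out = image.copy()
    num, labels = cv2.connectedComponents(fg_mask.astype(np.uint8))
    for obj_id in range(1, num):               # label 0 is the background
        obj = labels == obj_id
        length = np.random.randint(7, 25)       # per-object blur strength
        blurred = cv2.filter2D(image, -1, motion_kernel(length))
        out[obj] = blurred[obj]                 # paste the blurred pixels of this object only
    return out
```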
Multi-class Classification of Gastrointestinal Disease Detection Using Vision Transformers
Abstract
Gastrointestinal (GI) diseases are among the most commonly occurring diseases of the human digestive system, with a significant mortality rate. Early diagnosis plays a crucial role in the effective treatment of these diseases. Accurate evaluation of endoscopic images is essential in the decision-making process for patient treatment. While many deep learning models, such as Convolutional Neural Networks (CNNs), have been employed for the detection of these diseases, they struggle to handle large datasets effectively. To overcome these limitations, we propose the use of an advanced Vision Transformer (ViT) model, which has shown superior performance on large-scale data. The proposed model is evaluated on the GastroVision dataset, which consists of 8,600 images categorized into 27 different classes. Our proposed approach achieved a recall of 85.75%, a precision of 86.41%, and an F1-score of 85.57%, and also outperformed the DenseNet-121 model.
Jagadeesh Kakarla, R. Usha Rani, Vemakoti Krishnamurty, Ruvva Pujitha
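A hedged sketch of fine-tuning a Vision Transformer for 27-class endoscopic image classification with torchvision; the backbone variant, preprocessing, and hyperparameters are illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch: ViT-B/16 with a 27-class head and a minimal training step.
import torch
import torch.nn as nn
from torchvision import models, transforms

model = models.vit_b_16(weights="IMAGENET1K_V1")
model.heads.head = nn.Linear(model.heads.head.in_features, 27)  # 27 GastroVision classes

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```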
MGC: Music Genre Classification Using a Hybrid CNN-LSTM Model with MFCC Input
Abstract
In this paper, we propose an architecture for music genre classification leveraging a Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN). The motivation behind this study is the importance of discerning essential spectral features for accurate genre classification in audio data. Focusing on the GTZAN dataset comprising ten music genres, our methodology involves intricate feature extraction, emphasizing Mel Frequency Cepstral Coefficients (MFCC). This transformation captures essential spectral features crucial for genre discernment. Beyond traditional methods, we incorporate the Short-Term Fourier Transform (STFT) with advanced activations and signal processing techniques to enhance feature extraction. The CNN-LSTM model effectively captures spatial and temporal complexities in audio data, contributing significantly to the domain of music genre classification. The outcomes underscore the performance of our proposed model, showcasing its potential for practical applications in music genre classification, and we compare the results with state-of-the-art methods.
Dattatreya N. Halyal, Mahamadshiraj A. Bichhunavar, Manjunath C. Pati, Ramesh A. Tabib, Basawaraj, Uma Mudenagudi
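A minimal sketch of the pipeline described above: MFCC extraction with librosa followed by a small CNN-LSTM classifier over the ten GTZAN genres. The audio file path, MFCC count, and layer sizes are placeholders, not the authors' settings.

```python
# Hedged sketch: MFCC features fed to a compact CNN-LSTM genre classifier.
import librosa
import torch
import torch.nn as nn

y, sr = librosa.load("blues.00000.wav", duration=30)     # placeholder GTZAN clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, T)
x = torch.tensor(mfcc, dtype=torch.float32)[None, None]   # (1, 1, 20, T)

class CNNLSTM(nn.Module):
    def __init__(self, n_genres=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.lstm = nn.LSTM(input_size=64 * 5, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, n_genres)

    def forward(self, x):                       # x: (B, 1, 20, T)
        f = self.cnn(x)                         # (B, 64, 5, T // 4)
        f = f.permute(0, 3, 1, 2).flatten(2)    # time-major sequence: (B, T // 4, 320)
        _, (h, _) = self.lstm(f)
        return self.head(h[-1])                 # classify from the last hidden state

print(CNNLSTM()(x).shape)                       # torch.Size([1, 10])
```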
DBTC-Net: Dual-Branch Transformer-CNN Network for Brain Tumor Segmentation
Abstract
Accurately segmenting brain tumors is vital for diagnosing and treating the disease. Recently, convolutional neural networks (CNNs) have obtained notable performances in segmenting brain tumors. Nevertheless, they have limited capability to utilize long-range (or global) dependencies. In contrast, Transformers can successfully model global dependencies but cannot adequately capture local dependencies. It is crucial to utilize both local and global dependencies to perform accurate brain tumor segmentation. Consequently, several studies have attempted to combine the benefits of CNN and Transformer. However, effectively capturing local and global information remains challenging. Thus, we introduce a new 3D U-Net variant termed Dual-Branch Transformer-CNN Network (DBTC-Net), in which the encoder contains two branches built using Swin Transformer and CNN to effectively utilize global and local information. Furthermore, we design a Swin Transformer Channel Attention (SwinTCA) block by modifying the Swin Transformer blocks to enable it to capture location-wise channel dependencies in addition to global spatial dependencies. Moreover, we propose a Transformer Convolution Feature Combination (TCFC) block that effectively combines the complementary global and local features from the Transformer and CNN encoder blocks to improve the feature representation capability. In addition, a Multi-Scale Context Combination (MCC) block is introduced in the bottleneck to handle the variations in tumor size by utilizing multi-scale contextual features and forcing the network to concentrate on the tumor area. Extensive experimentation on the BraTS 2021 and 2020 benchmark datasets proves the success of the introduced components. The results reveal that DBTC-Net outperformed the CNN- and Transformer-based state-of-the-art networks.
Indrajit Mazumdar, Jayanta Mukhopadhyay
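A rough sketch of what combining complementary global (Transformer-branch) and local (CNN-branch) features at one 3D encoder stage could look like; the gating scheme and channel sizes are assumptions, not the paper's TCFC block.

```python
# Hedged sketch: gated fusion of Transformer-branch and CNN-branch 3D features.
import torch
import torch.nn as nn

class FeatureCombination(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=1), nn.Sigmoid())
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_transformer, f_cnn):
        both = torch.cat([f_transformer, f_cnn], dim=1)
        g = self.gate(both)                                  # where to trust which branch
        return self.fuse(torch.cat([g * f_transformer, (1 - g) * f_cnn], dim=1))

f_t = torch.randn(1, 32, 16, 16, 16)            # Swin-Transformer branch features
f_c = torch.randn(1, 32, 16, 16, 16)            # CNN branch features
print(FeatureCombination(32)(f_t, f_c).shape)   # torch.Size([1, 32, 16, 16, 16])
```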
Backmatter
Title
Computer Vision and Image Processing
Edited by
Jagadeesh Kakarla
R. Balasubramanian
Subrahmanyam Murala
Santosh Kumar Vipparthi
Deep Gupta
Copyright Year
2026
Electronic ISBN
978-3-031-93703-3
Print ISBN
978-3-031-93702-6
DOI
https://doi.org/10.1007/978-3-031-93703-3

The PDF files of this book were produced in accordance with the PDF/UA-1 standard to improve accessibility. This includes support for screen readers, described non-text content (images, graphics), bookmarks for easy navigation, keyboard-friendly links and forms, and searchable and selectable text. We recognize the importance of accessibility and welcome enquiries about the accessibility of our products. If you have questions or accessibility needs, please contact us at accessibilitysupport@springernature.com.
