Computer Vision and Robotics

Proceedings of CVR 2025, Volume 2

  • 2026
  • Book

About this book

This book presents a collection of high-quality research articles in the field of computer vision and robotics, presented at the International Conference on Computer Vision and Robotics (CVR 2025), organized by the National Institute of Technology, Goa, India, during 25–26 April 2025. The book discusses applications of computer vision and robotics in fields such as medical science, defence, and smart city planning, and presents recent work from researchers, academicians, industry, and policy makers.

Table of Contents

  1. Frontmatter

  2. Automating Medical Report Summarization: A Generative AI Approach for Enhanced Decision Support and Workflow Efficiency in Healthcare

    Palak Hajare, Mallika Hariharan, Snehal V. Laddha
    Abstract
    In today’s healthcare systems, managing the vast and growing volume of clinical text, particularly pathological reports, remains a pressing challenge. To address this, we introduce an automated summarization framework designed to distill essential information from lengthy medical documents. The proposed system combines a Transformer-based encoder-decoder architecture with a Generative Adversarial Network (GAN) to enhance the accuracy and fluency of generated summaries. Prior to modeling, the input text undergoes rule-based preprocessing and Named Entity Recognition (NER) to identify and retain critical medical terms while eliminating irrelevant data. The Transformer module effectively captures complex contextual relationships within the document, while the GAN discriminator improves the summary’s coherence through adversarial refinement. We evaluated our model on a clinical dataset using standard summarization metrics, including ROUGE-1, ROUGE-2, and ROUGE-L. Comparative analysis with existing models such as BERTSUM and TextRank indicates that our approach yields more relevant and concise summaries. This solution aims to support healthcare professionals by streamlining the review of clinical texts and facilitating faster decision-making.
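    As a rough illustration of the ROUGE scores mentioned above (not the authors’ evaluation code), ROUGE-1 can be computed from clipped unigram overlap between a candidate summary and a reference; the sentences below are hypothetical:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N precision, recall, and F1 from clipped n-gram overlap."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())          # n-grams shared by both texts
    p = overlap / max(sum(cand.values()), 1)      # precision: overlap / candidate size
    r = overlap / max(sum(ref.values()), 1)       # recall: overlap / reference size
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = rouge_n("the patient shows no anomaly",
                   "patient shows no sign of anomaly", n=1)
```

    ROUGE-2 and higher orders follow by changing `n`; ROUGE-L (longest common subsequence) needs a separate dynamic-programming routine.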
  3. Generating Machine-Style Handwriting: A Diffusion Based Latent Generation with VAE Decoding

    Phani Kumar Nyshadham, Prasanna Biswas, Archie Mittal
    Abstract
    In this paper, we introduce the Style-Calligraphy model, an innovative architecture designed to generate high-fidelity images of text in specified machine styles, conditioned on a given text input. Our approach leverages the strengths of Variational Autoencoders (VAEs) and Latent Diffusion Models (LDMs) to address the challenges of latent space representation and efficient image generation. The VAE encoder-decoder framework is employed to learn structured latent spaces, mitigating the limitations of traditional autoencoders by incorporating Kullback-Leibler divergence alongside image reconstruction loss. This ensures a continuous and feasible latent space for sampling. The LDM is trained as a denoiser with text-based conditioning, utilizing a Markov chain to model the noise addition process and employing cross-attention mechanisms to enhance spatial character relationships. We introduce a novel sliding cross-attention technique using duplets and triplets to capture intricate dependencies between characters, significantly improving the model’s performance. Furthermore, we propose a stand-alone image decoder to address noise sensitivity, trained on both clean and noisy latent representations, resulting in a substantial increase in image quality. A key innovation of our work is the repurposing of a single LDM across multiple machine styles, drastically reducing training costs by isolating style-specific training to the image decoder. Our comprehensive training pipeline, optimized for efficiency, demonstrates the model’s capability to generate accurate and stylistically coherent text images, achieving a 99.5% success rate in high-quality sample generation on seen data.
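    A minimal sketch of the VAE objective described above, combining image reconstruction loss with the closed-form Kullback-Leibler divergence term for a diagonal Gaussian posterior (illustrative only; dimensions and inputs are hypothetical):

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Per-sample VAE objective: reconstruction error plus KL(q(z|x) || N(0, I))."""
    recon = np.mean((x - x_recon) ** 2)                        # image reconstruction loss (MSE)
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))  # closed-form Gaussian KL
    return recon + kl

# A latent code whose posterior matches the prior exactly contributes zero KL.
loss = vae_loss(np.ones(4), np.ones(4), mu=np.zeros(2), logvar=np.zeros(2))
```

    The KL term is what keeps the latent space continuous and feasible for sampling, as the abstract notes.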
  4. A Comparative Study of Image Synthesis Models: Stack GANs and Diffusion Based Text to Image Generation

    Tarushi Khattar, Sara Bare, Tanya, Sakshi Kuyate, Vaishali Wangikar
    Abstract
    In today’s world, where visuals communicate more effectively than words, text to image synthesis plays a crucial role across various sectors, helping them grow with a creative, image centric approach. Machine learning has contributed significantly to this field through simple techniques, but deep learning has introduced robust models that can automatically generate realistic images from input text. Recent advancements in generative models have led to the development of several techniques for text to image synthesis. Although multiple models exist, our project primarily focuses on exploring two: the Stack GANs, which has maintained a strong position, and the recently emerged Stable Diffusion model. This study includes an exploration of the literature on text generative models, providing a deeper understanding of these models. Furthermore, the research extends to training different related models on various common datasets, such as LAION5B, CUB2002011 and Oxford 102 Flower evaluating both the accuracy and quality of the generated images. Finally, we present the findings and results of our study show that the diffusion model, with an accuracy of 76%, out performed the GANs Model, which had an accuracy of 56%, leaving room for future enhancements.
  5. Optimized Humidity Prediction: A Random Forest and Aquila Optimizer Approach

    Sandeep Samantaray
    Abstract
    The weather dynamics of Relative Humidity (RH) are notoriously nonlinear, contain outliers, and exhibit asymmetric error distributions, all of which hinder accurate RH prediction by conventional models. This study addresses these limitations by introducing a random forest optimized with the Aquila Optimizer (RF-AO), a novel hybrid machine learning model. The Aquila Optimizer improves generalization and noise robustness by adapting the RF hyperparameters, such as tree depth, node splits, and ensemble size, to the noise present in meteorological data. Run on daily RH data (2015–2018) from Pahalgam, India, provided by the IMD, the RF-AO model reduced the Mean Absolute Error (MAE) to 0.1764 (vs. 8.8863 for standalone RF) and achieved a Willmott’s Index (WI) of 0.9901 and an R² of 0.9843 during testing. These improvements stem from the AO’s ability to balance exploration and exploitation during optimization, which mitigates overfitting and outlier sensitivity. The results demonstrate the model’s suitability for real-time applications in irrigation planning, HVAC control, and climate resilience strategy. Proposing a scalable framework for global climatic regions, this work integrates metaheuristic optimization into ensemble forecasting for RH, making it more robust.
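    The reported MAE and Willmott’s Index follow standard definitions and can be sketched as follows; the RH values below are hypothetical, not the Pahalgam data:

```python
import numpy as np

def mae(obs, pred):
    """Mean Absolute Error between observed and predicted values."""
    return np.mean(np.abs(obs - pred))

def willmott_index(obs, pred):
    """Willmott's Index of agreement: 1 means a perfect match."""
    obar = obs.mean()
    num = np.sum((obs - pred) ** 2)
    den = np.sum((np.abs(pred - obar) + np.abs(obs - obar)) ** 2)
    return 1 - num / den

obs = np.array([60.0, 65.0, 70.0, 75.0])    # hypothetical observed RH (%)
pred = np.array([61.0, 64.0, 71.0, 74.0])   # hypothetical predicted RH (%)
err = mae(obs, pred)
wi = willmott_index(obs, pred)
```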
  6. Diabetic Retinopathy Classification using Transformer Models: A Comprehensive Survey

    S. Suvalakshmi, B. Vinoth Kumar
    Abstract
    Diabetic Retinopathy (DR), a leading cause of blindness and visual impairment, arises from prolonged diabetes mellitus with poor glycemic control, leading to structural damage in the retina. DR is becoming a critical medical challenge, affecting individuals’ vision and overall health. While ophthalmologists can diagnose DR manually, this approach is labor-intensive and time-consuming, particularly in today’s high-demand clinical environments. Early detection and prevention of DR require an automated, precise, and personalized approach using deep learning. Various deep learning techniques have been explored for DR severity classification, with Convolutional Neural Networks (CNNs) being the predominant choice. However, CNNs have limitations in capturing long-range dependencies within retinal images. Recently, transformers, which first demonstrated superior performance in natural language processing, have gained prominence in computer vision. Transformers utilize multi-head self-attention mechanisms to model complex contextual interactions between image pixels, addressing the shortcomings of CNNs. This study proposes a transformer-based approach for DR classification, leveraging self-attention mechanisms to enhance feature extraction and improve diagnostic accuracy. Fundus images are segmented into non-overlapping patches, which are then flattened into sequences and processed through a linear projection and positional embedding technique to retain spatial information. These sequences are subsequently fed into multiple layers of transformer attention mechanisms to generate the final feature representation. In practical clinical applications, transformer-based models can provide ophthalmologists with rapid, precise, and individualized diagnostic insights, facilitating timely medical interventions and improving patient outcomes.
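    The patching, flattening, linear projection, and positional embedding steps described above can be sketched as a toy NumPy version (hypothetical dimensions and random weights, not the study’s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=16):
    """Split an H x W x C image into non-overlapping patches and flatten each."""
    h, w, c = image.shape
    rows = [image[i:i + patch, j:j + patch].reshape(-1)
            for i in range(0, h, patch) for j in range(0, w, patch)]
    return np.stack(rows)                  # (num_patches, patch * patch * c)

image = rng.random((224, 224, 3))          # stand-in for a fundus image
tokens = patchify(image)                   # 14 * 14 = 196 patches of length 768
W = rng.random((16 * 16 * 3, 64))          # learnable linear projection (hypothetical dim 64)
pos = rng.random((tokens.shape[0], 64))    # positional embeddings retain spatial order
embedded = tokens @ W + pos                # token sequence fed to the transformer layers
```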
  7. VisionAid: A Real-Time System for Object Detection, Text Reading, and Voice Alerts for Visually Impaired Individuals

    Saanvi Sanjay, N. Shivani, Soham M Karia, Vaishnavi Mendon, B. V. Poornima
    Abstract
    This manuscript introduces VisionAid, an innovative assistive system designed to enhance the independence of visually impaired individuals by facilitating navigation and environmental interaction. VisionAid integrates cutting-edge technologies, including real-time object detection, Optical Character Recognition (OCR), and Text-to-Speech (TTS), to provide dynamic audio feedback. We conducted extensive experiments using a model pretrained on the COCO (Common Objects in Context) dataset, which contains thousands of real-world images. For object detection, VisionAid leverages YOLOv8 (You Only Look Once), a state-of-the-art deep learning model known for its high accuracy and low-latency performance. This enables the system to accurately detect and identify objects in real time, ensuring reliable feedback for the user. The system also incorporates Tesseract OCR for text recognition, allowing users to access printed or digital text seamlessly. The recognized text is then converted into natural speech using TTS technology, ensuring both visual and textual information are communicated effectively. By combining these capabilities, VisionAid offers an intuitive, accessible means for visually impaired users to interact with and understand their surroundings through auditory feedback.
  8. Computation of Fetal Heart Rate Variability from Abdominal ECG Using Adaptive Filtering and Independent Component Analysis

    Sanghamitra Subhadarsini Dash, Ashish Biju Varghese, Malaya Kumar Nath
    Abstract
    Investigating the fetal electrocardiogram (fECG) is of critical importance for pregnant women, as it allows fetal health and well-being to be studied. Its extraction is generally preferred from abdominal ECG (aECG) recordings, which consist of the fECG, the maternal ECG (mECG), and noise (such as power-line disturbances, motion artifacts, uterine contractions, baseline wander, and high-frequency noise). Accompanying noise in the ECG causes loss of critical information and leads to misdiagnosis. The work presented in this paper extracts a clean fECG from the aECG using independent component analysis (ICA) and adaptive filtering (AF). ICA is a blind source separation (BSS) technique used for estimating multivariate data as a linear combination of statistically independent non-Gaussian signals (i.e., source signals). It is also a non-parametric technique and is independent of pattern averaging, making it an efficient algorithm for identifying atypical heartbeats in an ECG signal. FastICA (FICA) is a fixed-point iterative algorithm that estimates the independent components (ICs) with maximum non-Gaussianity by minimizing the similarity between them. These ICs are subjected to adaptive filtering, with a direct fECG as the reference signal, to extract a clean fECG. This filtering helps estimate the fECG signal lost during acquisition by canceling the background noise. In this work, an optimally converging least mean square (LMS) algorithm is used with proper selection of the step size. The fECG obtained from the filtering process is post-processed with a Savitzky-Golay filter, followed by a \(3^{rd}\) order band-pass filter, a derivative filter, and a P-point moving average filter for clear identification of R-peaks. From the R-peak locations, obtained using the Pan-Tompkins algorithm, heart rate variability (HRV) is computed to predict fetal heart abnormalities.
    The method is validated on the publicly available PhysioNet (ADFECG) database and obtains an F1-score of 92.68%. The estimated heart rate for the extracted fECG is found to be 78 bpm.
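    Computing heart rate and a simple HRV statistic from detected R-peak locations, the final step described above, can be sketched like this (illustrative sample indices, not the ADFECG data):

```python
import numpy as np

def heart_rate_and_hrv(r_peaks, fs):
    """Heart rate (bpm) and an SDNN-style HRV measure from R-peak sample indices.

    r_peaks: sample indices of detected R-peaks; fs: sampling rate in Hz."""
    rr = np.diff(r_peaks) / fs          # RR intervals in seconds
    hr = 60.0 / rr.mean()               # mean heart rate in beats per minute
    sdnn = np.std(rr * 1000.0)          # HRV as std of RR intervals (ms)
    return hr, sdnn

# Peaks 0.4 s apart at fs = 250 Hz correspond to 150 bpm.
hr, sdnn = heart_rate_and_hrv(np.array([0, 100, 200, 300]), fs=250)
```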
  9. Evaluation of Novel In-Shoe Strain Gauge Device for Gait Analysis via Data Processing Techniques

    Ayaan Shankta, Rejin Jacob, Reetu Jain
    Abstract
    Gait analysis is the assessment of walking patterns through the coordination and balance of muscles in the body. It is essential in the diagnosis of neurological disorders and in monitoring patient progress during rehabilitation. Conventional gait analysis relies heavily on force plates to measure the Ground Reaction Forces (GRF) that a person exerts. However, these systems are constrained by high costs, constant maintenance, and the need for repeated foot strikes to ensure accurate data. This study presents a novel alternative: an in-shoe device that uses strain gauges to measure a person’s GRF. It addresses the key limitations of force plates while maintaining the accuracy and precision of the measurements. The device integrates four 3D-printed strain gauge mounts positioned within the sole of the shoe, replicating the functionality of force plates by capturing real-time GRF data while walking. Adjustments to the strain gauge positioning allowed for optimized force distribution. The device demonstrates an accuracy of approximately 95%, supported by quantitative metrics such as a high correlation coefficient and low error rates. Beyond its empirical accuracy, the participant’s comfort while wearing the shoe was a critical consideration in the design. The device’s portability, affordability, and non-invasive design make it an ideal alternative to traditional force plates, particularly for clinical studies, rehabilitation, and remote diagnosis of disorders.
  10. Pothole Detection Using YOLOv8 with an Integrated Notification System

    Shanaya Karkhanis, Shreyash Nadgouda, Archana Lakhe
    Abstract
    Urban roads, especially in populated cities like Mumbai, experience daily wear and tear and are damaged quickly, leading to the formation of potholes. Conventional methods for detecting potholes, such as manual inspection and laser-based systems, are labor-intensive, time-consuming, and costly, making deep learning models an efficient and cost-effective solution for automating pothole detection and road maintenance. This research presents a pothole detection system using YOLOv8 (You Only Look Once, Version 8), a deep learning model that performs well in real-time object detection by balancing detection speed and accuracy. It improves upon previous versions such as YOLOv2, YOLOv3, YOLOv4, YOLOv5, and YOLOv7: the earlier versions achieved a Mean Average Precision (mAP) of 85–90%, while later iterations such as YOLOv5 and YOLOv7 improved the mAP to 94%. We have also integrated a notification system that sends an SMS (Short Message Service) alert with the Global Positioning System (GPS) coordinates, i.e., the longitude and latitude, of a pothole once it is detected, indicating the system’s potential to contribute to road safety improvements and more streamlined infrastructure maintenance processes. While the current system sends notifications to a personal contact number for demonstration purposes, it is designed to create real-world impact by sending the alert to a designated government portal contact number, enabling faster intervention and more efficient road maintenance. As a whole, the pothole detection system achieves a precision of 92.7% and a recall of 87.5%, ensuring that potholes are detected with minimal false positives.
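    The precision and recall figures reported above follow the standard detection-metric definitions; the counts below are hypothetical, chosen only to reproduce values close to those reported:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts
    (a true positive is a detection matching ground truth, e.g. by IoU threshold)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts roughly matching the reported 92.7% precision / 87.5% recall.
p, r, f1 = detection_metrics(tp=875, fp=69, fn=125)
```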
  11. A Performance Analysis of RC Filter for the Application of Analog Devices

    Rajulapati Sudha, P. Ramesh, Rushitha Reddy Golamaru
    Abstract
    This study explores the characteristics and applications of RC (resistor-capacitor) low-pass and high-pass filters, focusing on their gain versus frequency response, advantages, limitations, and practical significance in electronic circuits. A detailed theoretical analysis is conducted to understand the role of these filters in signal processing, emphasizing their ability to attenuate specific frequency components and shape signal waveforms. The research also includes an experimental investigation, where an RC circuit is implemented using discrete components and analyzed with the ADALM1000 active learning module and PixelPulse2 software. The cutoff frequency and filtering behavior are examined through direct measurements and the results are compared with MATLAB-based simulations. The experimental results align closely with theoretical predictions, confirming the effectiveness of RC filters in noise reduction, frequency selection, and signal conditioning. By bridging theoretical concepts with practical validation, this study underscores the importance of RC filters in modern electronics. The insights gained from this research provide valuable guidelines for optimizing filter design in applications such as audio processing, communication systems, and embedded electronics.
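    The cutoff frequency and gain-versus-frequency behavior examined above follow the standard first-order RC relations, sketched here with hypothetical component values:

```python
import math

def rc_cutoff_hz(r_ohms, c_farads):
    """Cutoff (-3 dB) frequency of a first-order RC filter: f_c = 1 / (2*pi*R*C)."""
    return 1.0 / (2 * math.pi * r_ohms * c_farads)

def lowpass_gain(f, fc):
    """Magnitude response |H(f)| = 1 / sqrt(1 + (f/fc)^2) of an RC low-pass filter."""
    return 1.0 / math.sqrt(1 + (f / fc) ** 2)

fc = rc_cutoff_hz(1e3, 159e-9)        # ~1 kHz for R = 1 kOhm, C = 159 nF
gain_at_fc = lowpass_gain(fc, fc)     # 1/sqrt(2), i.e. -3 dB at the cutoff
```

    The high-pass counterpart swaps the roles of passband and stopband: \(|H(f)| = (f/f_c)/\sqrt{1+(f/f_c)^2}\) with the same cutoff.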
  12. Ensemble Simulation Model-Based Animal Intrusion Detection System

    B. N. Lohith Kumar, N. Manish, N. V. Uma Reddy, S. Sreejith
    Abstract
    Human-animal collisions are becoming more common, and there is a rising need for efficient systems that can promptly alert drivers to potential animal collisions. Street dogs, cats, cattle, pigs, and other animals are commonly seen on our streets and are a major cause of accidents. Peacocks have also been reported to cause major accidents for two-wheeler riders, since they fly at low altitudes. Potholes and speed breakers that are difficult to notice also contribute to major road accidents. The severity of these accidents is especially high at night, when animal crossings are very difficult to spot. Hence, an effective model is required to quickly and accurately detect animal crossings and alert the driver. In this paper, we present an effective ensemble simulation model that detects animals using thermal images.
  13. Large Language Model Interface for Manipulator Control

    N. Preeti, Hema Srivarshini Chilakala, A. A. Nippun Kumaar
    Abstract
    Language models are now a prominent research topic in Artificial Intelligence (AI); they are trained to comprehend humans’ mode of communication and converse back in the same way. A large language model (LLM) is an improved version with greater learning capacity that also absorbs sophisticated language structure. Robots are being utilized in every domain to automate processes, and a pivotal challenge is the technical expertise needed to communicate with a robot. This study’s main objective is to integrate LLMs into manipulator control systems, i.e., to facilitate the input of human-language instructions, which are then seamlessly translated into precise robotic arm tasks. The study addresses challenges like interpreting vague inputs, inferring reference frames, and ensuring usability through simple queries without requiring technical expertise. The proposed method is implemented using LLM models and the Robot Operating System (ROS) and tested on multiple manipulators, both in the Gazebo simulator and on real-time hardware. The model was tested with a series of prompts and achieved a success rate of 87.33%, highlighting the LLM’s effective understanding of human commands and the corresponding performance of the robotic system.
  14. Efficient Detection of Vehicles on Indian Roads: A Comparative Performance Analysis of YOLOv8, V9, and V10

    Preet Kanwal, Anjan R. Prasad, Prasad B. Honnavalli
    Abstract
    Object detection, an essential task in computer vision, involves identifying and locating objects within images and/or video frames, and has seen significant advancements through models like YOLO (You Only Look Once). This study presents a comparison of object detection models trained on a custom dataset consisting of auto-rickshaws and license plates. Using the YOLOv8, YOLOv9, and YOLOv10 models, the study evaluates the performance of their various versions in recognizing and localizing objects. Each model was trained under identical conditions to ensure an unbiased comparison, and the results were analyzed based on performance metrics such as precision, recall, mean average precision, and model complexity. The results highlight that the YOLOv10 models, the medium (YOLOv10m) and the balanced (YOLOv10b), achieve better results, with the former achieving a mAP@50 of 0.791 and an F1 score of 0.768 for auto-rickshaw detection. The latter achieved a mAP@50 of 0.739 and an F1 score of 0.731 for license plate detection, with fewer parameters (YOLOv10m: 16.49M; YOLOv10b: 20.45M) than the other models considered. Based on this, the YOLOv10m and YOLOv10b models are recommended for future research in object detection tasks.
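    Metrics such as mAP@50 match each detection to ground truth by Intersection over Union (IoU); a minimal sketch with hypothetical boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 10, 10), (5, 5, 15, 15))   # overlap 25, union 175
```

    At mAP@50, a detection counts as a true positive when this score is at least 0.5.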
  15. DL Based Approach for Assessing the Severity of DR from Retinal Fundus Images

    Sidharth Jeyaraj, Malaya Kumar Nath
    Abstract
    Diabetic retinopathy (DR) is a severe consequence of diabetes and a major cause of vision impairment globally, affecting many individuals during their lifetime. Early detection and timely treatment can significantly prevent vision loss in many individuals with DR. Once DR symptoms are identified, the disease’s severity can be assessed to determine the most suitable course of treatment. This manuscript focuses on classifying DR from fundus images by severity level using ResNet, MobileNet, GoogLeNet, and VGG16. These models are considered due to their proven effectiveness in image classification tasks, robustness in feature extraction, and efficiency in handling medical imaging datasets: ResNet’s deep residual connections preserve detailed information, MobileNet’s lightweight architecture optimizes speed, and GoogLeNet’s inception modules and VGG16’s simple convolutional layers make them well suited for DR classification. The models are trained with an experimentally determined learning rate, optimizer, and loss function to achieve higher accuracy. They have been tested on the APTOS 2019 dataset, consisting of 5593 retinal images across 5 classes, and obtained an overall accuracy of 95.89%.
  16. Music Recommendation System Based on Facial Emotion Recognition

    Kreesha Iyer, Neha Grandhi, Bhagyashree Birje, Priyanka Verma
    Abstract
    Music recommender systems have become an important application of personalised technology, aimed at tailoring content to users’ preferences. However, most past systems have relied almost exclusively on users’ past interactions and similarity in content, rather than adjusting recommendations in real time based on input from the user. This project introduces a facial-emotion-based recommendation system that uses a Convolutional Neural Network (CNN) to recognise the user’s facial emotion, creating a more immersive and contextually relevant experience. The system then employs clustering and content-based recommendation methods to predict and recommend songs to users based on their mood.
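    A content-based recommendation step of the kind described above might rank songs by cosine similarity between a mood-derived feature vector and song feature vectors; the features and values below are hypothetical, not the project’s:

```python
import numpy as np

def recommend(user_vec, song_vecs, top_k=2):
    """Rank songs by cosine similarity to a mood/audio-feature vector."""
    sims = song_vecs @ user_vec / (
        np.linalg.norm(song_vecs, axis=1) * np.linalg.norm(user_vec))
    return np.argsort(-sims)[:top_k]          # indices of the top_k closest songs

# Hypothetical 3-feature song vectors (valence, energy, tempo), scaled to [0, 1].
songs = np.array([[0.9, 0.8, 0.7],   # upbeat track
                  [0.1, 0.2, 0.3],   # mellow track
                  [0.8, 0.9, 0.6]])  # upbeat track
happy_user = np.array([1.0, 0.9, 0.8])        # vector derived from a "happy" emotion
picks = recommend(happy_user, songs)
```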
  17. Indian Sign Language Recognition Using CNN-LSTM Architecture for Enhanced Gesture Prediction

    Anshara Beigh, Smriti Kumari, Rebekah Russel, Ali Imam Abidi
    Abstract
    The development of precise automated recognition systems for Indian Sign Language (ISL) faces significant difficulties because ISL gestures exhibit high variability together with complex patterns. Classic neural networks fail to grasp both the spatial and the temporal properties of these gestures appropriately. Our proposed model uses Convolutional Neural Networks (CNNs) to extract spatial features and a Long Short-Term Memory (LSTM) network with a percentage-based attention system to analyze temporal elements. The system analyzes frames through the CNN and employs Attention-LSTM temporal processing to achieve 99% accurate ISL gesture recognition on a complete dataset.
    This research presents a CNN-Percentage-Based-Attention-LSTM model structure that effectively retrieves the spatial and motion characteristics of gestures while achieving better accuracy than traditional approaches. The attention mechanism embedded in the model helps it focus on essential gesture features, enhancing recognition accuracy on both complex and subtle gestural movements. The scalability and robustness of real-time ISL gesture recognition enable the model to serve as a promising communication aid for hearing-impaired individuals in educational, social, and professional domains. The obtained results show how this methodology can address the present challenges of standard ISL recognition techniques while leading to new developments in this field.
  18. A Modified Aggregation Operator and Score Function for Solving Multicriteria Decision Making Problem Under Neutrosophic Environment

    Ritu, Tarun Kumar, M. K. Sharma
    Abstract
    Neutrosophic sets are characterized by membership, non-membership, and indeterminacy functions that provide a robust method for modeling the incomplete or inconsistent data commonly encountered in real-world decision-making scenarios. This paper proposes a multicriteria decision-making (MCDM) approach for handling uncertain and vague information in decision problems using aggregation operators and score functions within the framework of Neutrosophic Sets (NS). The proposed approach combines aggregation operators, to fuse the multiple criteria and alternatives, with score functions, to rank and evaluate the best alternatives. The aggregation operators combine the neutrosophic sets associated with each criterion into a single comprehensive evaluation, while the score function helps derive a crisp ranking of alternatives. A realistic example illustrates the approach's efficacy, showcasing its applicability to complex decision problems under uncertainty and imprecision. A comparison table evaluates several aggregation functions on the basis of score functions and criteria. The suggested score function is contrasted with existing approaches from Sachin, Garg, and Nafei et al.: the values of 0.233 and 0.333 produced by the suggested score function are similar to those of Sachin and Garg but lower than those of Nafei et al.
    Three criteria (\({C}_{1},{C}_{2},{C}_{3}\)) are included in the comparison of the aggregation functions for two alternatives (\({A}_{1},{A}_{2}\)). The aggregated values for \({A}_{1}\) are 0.3268, 0.2000, and 0.3881 when using the current aggregation operator (Ye [9]), whereas the suggested aggregation operator produces 0.1796, 0.1056, and 0.1634, a decrease in values. Likewise, for \({A}_{2}\), the values are 0.5627, 0.1414, and 0.2000 under the current aggregation operator. These results suggest that this method offers a flexible and effective tool for decision makers in situations involving incomplete or conflicting information.
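    For illustration only: a commonly used score function for single-valued neutrosophic numbers is \(s(T,I,F)=(2+T-I-F)/3\). The sketch below uses this standard form with hypothetical values, not the modified score function proposed in the chapter:

```python
def svn_score(t, i, f):
    """A commonly used score function for a single-valued neutrosophic number
    (T = truth, I = indeterminacy, F = falsity); higher means a better alternative.
    Standard form from the literature, not the chapter's modified function."""
    return (2 + t - i - f) / 3

# Ranking two hypothetical aggregated alternatives:
a1, a2 = svn_score(0.7, 0.2, 0.1), svn_score(0.5, 0.3, 0.3)
best = "A1" if a1 > a2 else "A2"
```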
Title
Computer Vision and Robotics
Editors
Harish Sharma
Abhishek Bhatt
Chirag Modi
Andries Engelbrecht
Copyright Year
2026
Electronic ISBN
978-3-032-06253-6
Print ISBN
978-3-032-06252-9
DOI
https://doi.org/10.1007/978-3-032-06253-6

PDF files of this book have been created in accordance with the PDF/UA-1 standard to enhance accessibility, including screen reader support, described non-text content (images, graphs), bookmarks for easy navigation, keyboard-friendly links and forms and searchable, selectable text. We recognize the importance of accessibility, and we welcome queries about accessibility for any of our products. If you have a question or an access need, please get in touch with us at accessibilitysupport@springernature.com.
