
2025 | Book

Pattern Recognition and Computer Vision

7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part VII

Edited by: Zhouchen Lin, Ming-Ming Cheng, Ran He, Kurban Ubul, Wushouer Silamu, Hongbin Zha, Jie Zhou, Cheng-Lin Liu

Publisher: Springer Nature Singapore

Book series: Lecture Notes in Computer Science


About this book

This 15-volume set LNCS 15031-15045 constitutes the refereed proceedings of the 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024, held in Urumqi, China, during October 18–20, 2024.

The 579 full papers presented were carefully reviewed and selected from 1526 submissions. The papers cover various topics in the broad areas of pattern recognition and computer vision, including machine learning, pattern classification and cluster analysis, neural network and deep learning, low-level vision and image processing, object detection and recognition, 3D vision and reconstruction, action recognition, video analysis and understanding, document analysis and recognition, biometrics, medical image analysis, and various applications.

Table of Contents

Frontmatter

Character Recognition

Frontmatter
Scene Text Recognition Via k-NN Attention-Based Decoder and Margin-Based Softmax Loss

To handle complex backgrounds and diverse text image shapes, this paper proposes an encoder-decoder-based scene text recognition model named E2D-Rec that improves the recognition of irregular text and achieves stronger generalization. First, a text rectification network is introduced to transform irregular texts, such as curved and skewed texts, into relatively regular ones; the model iteratively learns several control points of the text regions, and these control points drive a TPS interpolation transformation to produce rectified text images. Then, a modeling network based on the encoder-decoder architecture predicts text sequences in an auto-regressive manner. The visual encoder generates image patch embeddings from the text image, and the visual-textual decoder learns the correlation between word embeddings and image patch embeddings via k-NN attention selection. Finally, during training, a loss function based on an inter-class penalty is adopted as the objective. By widening the boundaries between classes in the final label-space mapping layer, the model learns deep features with high discriminative power. Experimental results validate the proposed model’s improved recognition performance on the Union14M-Benchmark and six commonly used datasets.

Hongxia Zhang, Minqiang Xu, Liang He
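For orientation, the margin-based softmax loss described above widens decision boundaries by penalizing the target-class logit. The sketch below assumes an additive-margin formulation (in the spirit of AM-Softmax); the paper's exact margin form, scale, and dimensions are not specified here and are illustrative.

```python
# Minimal sketch of an inter-class-penalty (additive-margin) softmax loss.
# All shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmaxLoss(nn.Module):
    def __init__(self, feat_dim, num_classes, margin=0.35, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, features, labels):
        # Cosine similarity between L2-normalized features and class weights.
        logits = F.linear(F.normalize(features), F.normalize(self.weight))
        # Subtract a margin from the target-class logit to widen class boundaries.
        one_hot = F.one_hot(labels, logits.size(1)).float()
        logits = self.scale * (logits - self.margin * one_hot)
        return F.cross_entropy(logits, labels)

# Usage: loss = MarginSoftmaxLoss(512, 97)(decoder_features, char_labels)
```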
Real-Time Text Detection with Multi-level Feature Fusion and Pixel Clustering

Recent segmentation techniques for scene text detection have attracted significant attention for their ability to flexibly handle texts of various shapes and orientations. These techniques benefit from the high extensibility of pixel-level representations, but they face challenges due to complex network designs and slow post-processing steps, which hinder inference speeds and lead to suboptimal detection of unusual text shapes. We present a real-time detector for text of arbitrary shapes, named the Multi-Level Feature Fusion and Pixel Clustering (MFFPC) Network, to address these challenges. This method utilizes a lightweight feature extraction network, enhanced with specially designed Feature Enhancement Module (FEM) and Feature Filter Module (FFM) to improve feature representation. MFFPC enhances visual context understanding and refines lower-level feature maps using high-level features, effectively modeling text through a lightweight segmentation head and GPU-accelerated parallel post-processing. Additionally, an auxiliary training branch, inspired by clustering algorithms, further increases segmentation accuracy. The performance of MFFPC on three benchmark datasets validates its effectiveness. Specifically, on the challenging Total-Text dataset, it achieves an F-measure of 88.7% and processes at a speed of 66.8 frames per second (FPS).

Lu Xu, Zhufeng Jiang, Xingyu Han, Hui Wang, Zizhu Fan
Refined and Locality-Enhanced Feature for Handwritten Mathematical Expression Recognition

Many studies have been conducted on handwritten mathematical expression recognition (HMER) based on the encoder-decoder architecture. However, previous methods fail to predict accurate results on low-quality images affected by blur, complex backgrounds, and distortion. In addition, ambiguous or subtle symbols caused by different handwriting styles are often recognized incorrectly. In this paper, we propose an efficient method for HMER to deal with these issues. Specifically, we propose a Dual-branch Refinement Module (DRM) to handle such challenging disturbances. For ambiguous or subtle symbols, we believe that combining local and global information is beneficial to recognizing them. Therefore, we design a Local Feature Enhancement Module (LFEM) to enhance local features, which cooperates with the global information extracted by the subsequent transformer decoder. Extensive experimental results on the CROHME and HME100K datasets verify the effectiveness of our method.

Liu Yu, Xiangcheng Du, Ziang Liu, Daoguo Dong, Liang He
Learning Fine-Grained and Semantically Aware Mamba Representations for Tampered Text Detection in Images

Tampered text detection in images, a task focused on detecting manipulated or forged text within image documents or signage, has increasingly attracted attention due to the widespread use of image editing software and CNN synthesis techniques. The difficulty of perceiving subtle differences in tampered text images lies in the gap between a model’s ability to capture global fine-grained information and what the task actually demands. In this work, we propose a robust detection method, Tampered Text Detection with Mamba (TTDMamba). It achieves linear complexity without sacrificing global spatial contextual information, offering significant advancements over the limitations of the Transformer architecture. In particular, we utilize the advanced VMamba architecture as the encoder and incorporate the proposed High-frequency Feature Aggregation to enhance the visual feature set by integrating additional signals. This aggregation guides Mamba’s attention toward capturing fine-grained forgery information. Additionally, we integrate Disentangled Semantic Axial Attention into the stacked Visual State Space block of the VMamba architecture. This integration allows us to incorporate the inherent high-level semantic attributes of the tampered image into a pretrained hierarchical converter. As a result, we obtain a tamper map that is more reliable and accurate. Extensive experiments on the T-SROIE, T-IC13, and DocTamper datasets demonstrate that TTDMamba not only surpasses existing state-of-the-art methods in detection accuracy but also shows superior robustness in pixel-level forgery localization, marking a significant contribution to the domain of text tampering detection.

Hao Sun, Jie Cao, Zhida Zhang, Tao Wu, Kai Zhou, Huaibo Huang
Dual Feature Enhanced Scene Text Recognition Method for Low-Resource Uyghur

Scene text recognition is the task of identifying text in natural scene images. Popular scene text recognition technologies mostly employ Transformer-based encoder-decoder methods tailored for resource-rich languages. However, when training data is insufficient, Transformer-based methods perform poorly compared to CNN-based methods. Nonetheless, CNN-based methods cannot process global information and often fall short in capturing contextual information between characters. This paper proposes a Dual Feature Enhanced Scene Text Recognition Method based on the CNN encoder-decoder designed for low-resource Uyghur. In the encoder, we employ a dynamic attention enhancement technique to strengthen the model’s learning capability of features in both spatial and channel dimensions, reducing the model’s dependency on large-scale training data. In the decoder, we introduce a novel global feature enhancement strategy that associates features from the encoder globally, mitigating the convolutional neural network’s lack of global information processing ability. Additionally, we construct two Uyghur language scene datasets, named U1 and U2. Comparative experimental results demonstrate the outstanding performance of our method on the U1 and U2 datasets. Compared to baseline methods, our approach achieves a respective increase in accuracy of 5.2% and 3.2% while reducing model parameters.

Miaomiao Xu, Jiang Zhang, Lianghui Xu, Yanbing Li, Wushour Silamu
Segmentation-Free Todo Mongolian OCR and its Public Dataset

Due to the severe scarcity of training data and the challenges in compiling dictionaries for Todo Mongolian, there have been no text recognition tools specifically designed for Todo Mongolian to date. This study aims to fill the research gap in Optical Character Recognition (OCR) for Todo Mongolian and introduces the first publicly available Todo Mongolian OCR dataset. We have developed a novel segmentation-free recognition method for entire lines of Todo Mongolian text, which draws on traditional Mongolian text recognition techniques. Data synthesis and enhancement techniques were used to expand the dataset and alleviate the issues of data scarcity. As part of the research, we have compiled and released a database containing 150,000 lines of generated Todo Mongolian text images, each meticulously annotated. The database, along with the scripts used for generating synthetic images and the data generation code, will be made freely available to the academic community to support further research. This method has been experimentally validated and achieved a word-level error rate of 15.27%. This work not only provides an initial solution to the OCR challenges of Todo Mongolian but also offers a valuable data resource for researchers in the field.

Weiqi Wang, Feilong Bao, Hui Zhang
Hybrid Encoding Method for Scene Text Recognition in Low-Resource Uyghur

Current advanced methods for scene text recognition are predominantly based on Transformer architecture, focusing primarily on resource-rich languages such as Chinese and English. However, Transformer-based architectures heavily rely on annotated data and their performance on low-resource data is not satisfactory. This paper proposes a Hybrid Encoding Method (HEM) for Scene Text Recognition in Low-Resource Uyghur, aiming to equip the network with both the long-range association capability of the Transformer for global image context and the function of CNN for capturing local detailed information. Simultaneously, by combining the strengths of CNN and Transformer encodings, the model’s learning capacity can be enhanced in low-resource settings, bolstering its ability to comprehend images while reducing its reliance on annotated data. On the other hand, we construct two Uyghur scene text datasets, namely U1 and U2. Experimental results demonstrate that the proposed hybrid encoding method achieves outstanding performance in low-resource Uyghur scene text recognition, improving accuracy by 15% compared to baseline methods.

Miaomiao Xu, Jiang Zhang, Lianghui Xu, Yanbing Li, Wushour Silamu
ROBC: A Radical-Level Oracle Bone Character Dataset

Oracle bone characters (OBCs) serve as vital resources for the in-depth study of Chinese history and the development of writing systems. The recognition of OBCs holds immense significance in the realm of oracle research. Despite the growing adoption of deep learning techniques for OBC recognition, their widespread implementation has been hindered by challenges such as category imbalance and inaccurate labeling in existing datasets. We construct a radical-level oracle bone character dataset (ROBC) in response to these challenges. To mitigate the issue of inaccurate labeling, we rigorously clean the existing dataset based solely on glyph criteria. Moreover, we pioneer the annotation of radical-level information in the oracle bone character dataset. Through statistical analysis, we demonstrate the efficacy of radical-level annotations in alleviating the class imbalance issue prevalent in existing OBC datasets. In addition, we conduct closed and open set recognition tasks on the ROBC using multiple baseline models and achieve considerable results, demonstrating the versatility and robustness of the ROBC dataset and laying the foundation for future research in the OBC recognition. The ROBC dataset is temporarily available at https://github.com/ycfang-lab/ROBC .

Zhengchen Li, Xintong Li, Kaiwen Qian, Yuchun Fang
Integrated Recognition of Arbitrary-Oriented Multi-line Billet Number

In billet production scenarios, the billet number often appears as a sequence composed of multi-line characters that may have arbitrary orientations, and its recognition is a particular industrial OCR (Optical Character Recognition) task. Currently, mainstream OCR methods primarily operate at the character or line level. When using these OCR methods for multi-line text recognition, the recognition results are first obtained for individual characters or single lines of text, and then the results are integrated into the sequence in top-to-bottom and left-to-right order. However, these recognition methods are designed for horizontally arranged multi-line text and are unsuitable for accurately identifying skewed or reversed billet numbers. In such cases, the sequential integration order becomes ineffective, and skewed characters are susceptible to misidentification, resulting in erroneous recognition. To solve these problems, we propose to treat the arbitrary-oriented multi-line sequence of characters as an integrated text unit and recognize the entire text unit to avoid additional sequential integration. To achieve this, we propose the integration attention to simultaneously focus on individual character features, positional information between characters, and global features of the sequence. This way, characters as well as sequence features can be adaptively extracted and recognized. We also propose the Cascaded Sample Module (CSM) to extract multiscale global features, alleviating the misidentification of skewed characters. Extensive experiments on multiple datasets demonstrate that our method can achieve state-of-the-art performance with more than 30% accuracy improvement.

Zhongjie Hu, Qi Liu, Song-Lu Chen, Yan Liu, Feng Chen, Xu-Cheng Yin
Improving Scene Text Recognition with Counting-Aware Contrastive Learning and Attention Alignment

Contrastive learning for the scene text recognition (STR) task greatly relieves the reliance on large amounts of synthetic or labeled data for training. Most previous STR methods using contrastive learning divide the visual features from the encoder into a fixed number of instances. However, fixing the number of instances may cause mismatches between positive and negative samples in contrastive learning, especially for data with extreme geometric transformations. To tackle this problem, we introduce weakly labeled text length information to restrict the projection heads of the visual features. Moreover, we found that the attention used for time-step character prediction differs for the same text rendered in different shapes. This inspires us to extract shape-variation features of scene text, which amounts to making the attention consistent for the same text with different shapes during the decoding stage. In this paper, we propose a Counting-Aware Contrastive Learning Model (CALM) and a plug-in Attention Alignment Module (AAM) to overcome these challenges for STR. CALM uses the text length information as guidance to determine the number of projection heads for different scene text images during encoder pre-training. AAM constrains the character attention at each time step regardless of the shape and length of the text in the decoding stage. To validate the method, we synthesize scene text data with severe transformations. Extensive experiments demonstrate the effectiveness of CALM and AAM for the STR task. The comparison results also show that our proposed method achieves state-of-the-art performance on public benchmarks.

JunJie Yang, Bo Zhou, Anna Zhu
GridMask: An Efficient Scheme for Real Time Curved Scene Text Detection

Real-time curved scene text detection remains challenging due to varied backgrounds and diverse text shapes. Existing methods that predict at full scale are time-consuming, whilst low-scale methods cannot handle text in complex scenes. To resolve this problem, we propose a new quarter-scale detection scheme named GridMask. GridMask efficiently models each 4×4-pixel block and avoids post-processing. It formulates text detection as a grid classification and regression task, enabling fast execution. A comprehensive set of experiments on curved and multi-oriented text from four datasets, including ICDAR 2015, CTW1500, Total-Text and MSRA-TD500, demonstrates that GridMask achieves state-of-the-art execution speed in scene text detection. GridMask also achieves state-of-the-art accuracy on the CTW1500 and Total-Text datasets, which implies that GridMask is superior to prior studies from both perspectives. The source code and the trained model are available.

Zhonghong Ou, Yiqun Zhang, Siyuan Yao, Meina Song
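As a rough illustration of the grid formulation described above, the sketch below downsamples a full-resolution text mask into one classification label per 4×4 pixel block; the regression branch and the actual GridMask network are not reproduced, and the threshold is an assumption.

```python
# Minimal sketch: quarter-scale grid classification targets from a text mask.
import torch
import torch.nn.functional as F

def grid_targets(text_mask, block=4, thresh=0.5):
    """text_mask: (B, 1, H, W) binary map -> (B, H//block, W//block) block labels."""
    pooled = F.avg_pool2d(text_mask.float(), kernel_size=block)  # fraction of text pixels per block
    return (pooled.squeeze(1) > thresh).long()                    # 1 = text block, 0 = background

# Usage: labels = grid_targets(torch.randint(0, 2, (2, 1, 64, 64)))
```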
Tibetan Handwriting Recognition Method Based on Structural Re-Parameterization ViT and Vertical Attention

Tibetan handwritten text recognition holds significant importance in the fields of Tibetan office automation and ancient document preservation. Addressing the diverse characteristics of Tibetan handwritten characters, this paper proposes a Tibetan handwriting recognition method based on a structural re-parameterization Vision Transformer (ViT) and a vertical attention mechanism. The contributions of this paper lie in three aspects. Firstly, a structural re-parameterization technique is leveraged to construct a feature extraction network that globally models image features, and an improved vertical attention mechanism unfolds text feature lines; the model is trained with the standard Connectionist Temporal Classification (CTC) loss function. Secondly, in terms of data augmentation, an Adaptive Random Masking (AdaRM) method is proposed based on Tibetan characteristics, effectively enhancing model performance and training efficiency. Lastly, in terms of evaluation metrics, a Tibetan Syllable Error Rate (TSER) metric is proposed based on Tibetan characteristics. Experimental results demonstrate that the proposed method achieves a Character Error Rate (CER) of 4.19% and a TSER of 5.92% on the Tibetan_HW text line recognition test set, and a CER of 3.56% and a TSER of 4.86% on the Tibetan_HW paragraph recognition test set. Compared to baseline models, the recognition error rates are significantly reduced.

Binglin Li, Jie Zhu, Dongcai Zhao
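The abstract above trains its recognizer with the standard CTC loss. The sketch below shows that objective in PyTorch with illustrative tensor shapes and a hypothetical vocabulary size; it is not the paper's network.

```python
# Minimal sketch of CTC training for a text-line recognizer (shapes illustrative).
import torch
import torch.nn as nn

vocab_size = 100                      # hypothetical character set size, index 0 = blank
T, B, S = 120, 4, 20                  # time steps, batch size, target length
log_probs = torch.randn(T, B, vocab_size, requires_grad=True).log_softmax(2)  # encoder output
targets = torch.randint(1, vocab_size, (B, S), dtype=torch.long)               # label indices (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                        # gradients flow back to the recognizer
```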

Document Analysis and Recognition

Frontmatter
MFH: Marrying Frequency Domain with Handwritten Mathematical Expression Recognition

Handwritten mathematical expression recognition (HMER) suffers from complex formula structures and character layouts in sequence prediction. In this paper, we incorporate frequency domain analysis into HMER and propose a method that marries frequency domain with HMER (MFH), leveraging the discrete cosine transform (DCT). We emphasize the structural analysis assistance of frequency information for recognizing mathematical formulas. When implemented on various baseline models, our network exhibits a consistent performance enhancement, demonstrating the efficacy of frequency domain information. Experiments show that our MFH-CoMER achieves noteworthy accuracy rates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets. The source code is available at https://github.com/Hryxyhe/MFH .

Huanxin Yang, Qiwen Wang
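For context, the MFH idea above injects DCT-based frequency information alongside visual features. The sketch below only shows a generic block-wise 2D DCT feature extraction; block size, coefficient truncation, and the fusion with the recognizer are assumptions, not the paper's design.

```python
# Minimal sketch: block-wise 2D DCT features from a grayscale image.
import numpy as np
from scipy.fftpack import dct

def block_dct_features(gray_image, block=8, keep=16):
    """Split an HxW grayscale image into `block` x `block` tiles and keep the
    first `keep` DCT coefficients of each tile in row-major (low-frequency-first rows) order."""
    H, W = gray_image.shape
    H, W = H - H % block, W - W % block          # drop ragged borders
    feats = []
    for y in range(0, H, block):
        for x in range(0, W, block):
            patch = gray_image[y:y + block, x:x + block].astype(np.float32)
            coeffs = dct(dct(patch.T, norm='ortho').T, norm='ortho')  # separable 2D DCT
            feats.append(coeffs.flatten()[:keep])
    return np.stack(feats)                        # (num_blocks, keep)

# Usage: feats = block_dct_features(np.random.rand(64, 256))
```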
Leveraging Structure Knowledge and Deep Models for the Detection of Abnormal Handwritten Text

Currently, the destruction of the sequence structure in handwritten text has become one of the main bottlenecks for the recognition task. Typical cases include additional specific markers (the text-swapping modification) and text overlap caused by character modifications such as deletion, replacement, and insertion. In this paper, we propose a two-stage detection algorithm that combines structure knowledge and deep models for such abnormal text. First, different structure prototypes are roughly located in handwritten text images. Based on the detection results of the first stage, we adopt different strategies in the second stage. Specifically, a shape regression network trained with a novel semi-supervised contrastive training strategy is introduced, and the positional relationships between characters are fully exploited. Experiments on two handwritten text datasets show that the proposed method greatly improves detection performance. The new dataset is available at https://github.com/Wukong90.

Zi-Rui Wang
OCR-Aware Scene Graph Generation Via Multi-modal Object Representation Enhancement and Logical Bias Learning

Scene Graph Generation (SGG) is the task of mapping an image or a video into a semantic structural scene graph automatically for better scene understanding. It requires detecting objects and building their relations. Current SGG methods ignore an essential element of the scene images, i.e., scene text. To better utilize this information for more comprehensive image understanding, we introduce it into the SGG task and propose an OCR-aware Scene Graph Generation (OSGG) baseline approach. To solve the training bias in both SGG and OSGG tasks, we present a novel learning strategy based on causal inference to remove the bad bias and make the prediction process more rational. The feature representations of objects are one of the keys to these tasks but are generally extracted from bounding boxes, which are coarse. To obtain more fine-grained object features, we propose a visual feature enhancement module that fuses linguistic modality and integrates cross-modal attention. For evaluation, we provide a new OCR-aware dataset, TextCaps-SG, to benchmark the performance. Experimental results on this dataset and the Visual Genome (VG) dataset demonstrate the effectiveness of each designed module and verify the superiority of our proposed method over other state-of-the-art methods. Moreover, we apply our generated OSG to cross-modal retrieval tasks. Experiments conducted on COCO TextCaps (CTC) and TextCaps-SG further illustrate that our method significantly outperforms the previous SG-based retrieval methods and could achieve competitive or better results than some large-scale models.

Xinyu Zhou, Zihan Ji, Anna Zhu
Enhancing Transformer-Based Table Structure Recognition for Long Tables

Extensive research has demonstrated the effectiveness of Image-to-Sequence (img2seq) approaches for the table structure recognition (TSR) task. However, when dealing with long tables, these methods suffer from the inherent limitations of their attention mechanism, which hinders further progress along this technical route. Based on the repetitive nature of table body structures, we present a novel approach for compressing table HTML code, reducing the overall length of the HTML code by more than 65%. Additionally, to address the relatively complex structure of table heads and the dense arrangement of rows and cells in long tables, we integrate coverage information into the transformer decoder. Experiments show that our method achieves performance comparable to state-of-the-art methods with a simple model structure. Moreover, our method effectively mitigates the performance degradation commonly observed in img2seq methods when the length of the target sequence increases.

Ziyi Zhu, Wenqi Zhao, Liangcai Gao
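To make the compression idea above concrete, here is a toy sketch that run-length encodes repeated empty cells in a table-body row; the paper's actual compression scheme and the `<td*N>` token are assumptions made purely for illustration.

```python
# Toy sketch: shorten repetitive table-body HTML before sequence decoding.
import re

def compress_row(row_html: str) -> str:
    """Collapse runs of '<td></td>' into a single '<td*N>' token."""
    return re.sub(r'(?:<td></td>)+',
                  lambda m: f'<td*{m.group(0).count("<td></td>")}>',
                  row_html)

def decompress_row(row_html: str) -> str:
    """Expand '<td*N>' tokens back to N empty cells."""
    return re.sub(r'<td\*(\d+)>', lambda m: '<td></td>' * int(m.group(1)), row_html)

row = '<tr>' + '<td></td>' * 12 + '</tr>'
short = compress_row(row)              # '<tr><td*12></tr>' -- far fewer decoder steps
assert decompress_row(short) == row    # lossless round trip
```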
Show Exemplars and Tell Me What You See: In-Context Learning with Frozen Large Language Models for TextVQA

Modern Large Visual Language Models (LVLMs) transfer the powerful abilities of Large Language Models (LLMs) to visual domains by combining LLMs with a pre-trained visual encoder, and can also leverage in-context learning originating from LLMs to achieve remarkable performance on the Text-based Visual Question Answering (TextVQA) task. However, the alignment process between vision and language requires a significant amount of training resources. This study introduces SETS (Show Exemplars and Tell me what you See), a straightforward yet effective in-context learning framework for TextVQA. SETS consists of two components: an LLM for reasoning and decision-making, and a set of external tools that extract visual entities in scene images, including scene text and objects, to assist the LLM. More specifically, SETS selects visual entities relevant to questions, constructs their spatial relationships, and customizes task-specific instructions. Furthermore, given these instructions, a two-round inference strategy is applied to automatically choose the final predicted answer. Extensive experiments on three widely used TextVQA datasets demonstrate that SETS enables frozen LLMs, such as Vicuna and LLaMA2, to achieve superior performance compared with LVLM counterparts.

Yan Zhang, Gangyan Zeng, Huawen Shen, Can Ma, Yu Zhou
MLR-NET: An Arbitrary Skew Angle Detection Algorithm for Complex Layout Document Images

Intelligent document processing techniques are sensitive to image skew, so scanned, digitized document images with complex layouts (CLDImge) must be rectified by calculating the skew angle. Traditional machine learning methods based on image characterization can only detect skew angles between −45° and +45°. Meanwhile, deep learning-based classification models can only detect discretized skew angles at a fixed scale, which constrains the accuracy of angle detection. Therefore, to address these limitations of detection range and granularity, we propose a multivariate linear regression network (MLR-NET) for detecting arbitrary skew angles in CLDImge, which realizes high-precision computation of arbitrary skew angles from −180° to 180°. MLR-NET improves the linear relationship between skew angles and features by fitting multiple mapping values generated by a preset continuous periodic mapping function, instead of directly fitting the actual skew angle in the multiple linear regression layer. Further, considering that regressing the mapping values suffers from uneven gradient propagation and even local extreme points during training, we propose an optimization method based on a rotating coordinate system. The average error of MLR-NET is only 0.0515 on the constructed dataset CLDIMGE-DATA, and the model's effectiveness is also verified on the public dataset Tobacco600.

Peisen Wang, Bo Wang, Xixi Nie, Chunyi Guo, Kaijiang Li
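The abstract above regresses continuous periodic mapping values rather than the raw angle, which avoids the discontinuity at ±180°. The sketch below uses a sine/cosine encoding as one common such mapping; this particular choice is an assumption and may differ from the paper's mapping function.

```python
# Minimal sketch: regress a skew angle through a continuous periodic mapping.
import math
import torch

def angle_to_targets(angle_deg: torch.Tensor) -> torch.Tensor:
    """Map angles in (-180, 180] to smooth (sin, cos) regression targets."""
    rad = angle_deg * math.pi / 180.0
    return torch.stack([torch.sin(rad), torch.cos(rad)], dim=-1)

def targets_to_angle(pred: torch.Tensor) -> torch.Tensor:
    """Recover the angle in degrees from predicted (sin, cos) pairs."""
    return torch.atan2(pred[..., 0], pred[..., 1]) * 180.0 / math.pi

gt = torch.tensor([-170.0, 3.5, 91.0])                 # ground-truth skew angles
targets = angle_to_targets(gt)                         # what the regression head fits
pred = targets + 0.01 * torch.randn_like(targets)      # stand-in for network output
loss = torch.nn.functional.mse_loss(pred, targets)     # no jump at the ±180° boundary
recovered = targets_to_angle(pred)                     # back to degrees
```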
TextViTCNN: Enhancing Natural Scene Text Recognition with Hybrid Transformer and Convolutional Networks

In the field of scene text recognition (STR), this research presents TextViTCNN, an innovative architecture that merges the benefits of CNNs and ViT. By integrating features from CNNs and ViT, TextViTCNN provides a powerful solution to the inherent complexity of STR. Our model is particularly adept at handling diverse and irregular English and self-constructed Uyghur texts, and it significantly improves recognition accuracy by effectively merging local and global features through a learning-based feature fusion layer. The decoder employs a strategy that incorporates mask and substitution context learning and integrates word length information through the training process of a pre-trained language model (PLM), allowing TextViTCNN to achieve state-of-the-art performance in our experiments.

Elham Eli, Wenting Xu, Alimjan Aysa, Hornisa Mamat, Kurban Ubul
Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning

Recently, leveraging large language models (LLMs) for visually-rich document information extraction has made significant progress. Previous studies have simplified the task of visual information extraction into a document visual question answering task. This task involves a question-answer session that yields a single entity result at a time, serving as a means of validating the document understanding capabilities of LLMs. However, these methods encounter significant challenges in computational efficiency and cost when addressing the document digitization requirement of extracting multiple entities from a single document, a scenario common in practical applications of visual information extraction. This paper builds upon a large language model and incorporates document layout information through a document layout modeling branch. We also design a layout-aware and task-specific instruction set. To further enhance the model’s proficiency in learning document layout information, we first augment the tokenizer’s vocabulary. Subsequently, the entire model undergoes fine-tuning to ensure improved adaptability to the expanded vocabulary and effective extraction of document layout features. By harnessing the exceptional language comprehension capabilities of LLMs, our model is capable of executing comprehensive entity extraction for an entire document in a single pass. Benefiting from the characteristics of generative large language models, we can accomplish multiple downstream visual information extraction tasks with a single model. Our experimental results demonstrate consistent improvement over the baseline model across a range of document visual information extraction tasks.

Teng Li, Jiapeng Wang, Lianwen Jin
SFENet: Arbitrary Shapes Scene Text Detection with Semantic Feature Extractor

Arbitrary shaped scene text detection is a very challenging task and has become a research hotspot in the field of computer vision. Existing methods mostly use CNN to extract multi-scale features, followed by segmentation-based approaches to obtain probability maps and convert them into text boxes. However, due to the limitation of the CNN receptive field and the lack of global semantic information, the text detection network cannot model the correlation between different text instances, which limits the further improvement of text detection performance. In this paper, we propose a text detection network with a semantic feature extractor (SFENet), which can robustly detect irregular text in scene images. Firstly, we use Vision Transformer as the semantic feature extraction module to extract global semantic information and capture relationships between different text instances. Secondly, through the fusion module, semantic information is injected into multi-scale features and adaptively fused, resulting in accurate probability maps and text boxes of text regions. Finally, we conduct experiments on three public datasets: ICDAR2015, Total-Text and MSRA-TD500, and compare with other algorithms. The results demonstrate that SFENet achieves good detection results for arbitrary shape scene text, and performs well on all three public datasets.

Hongwei Chen, Mengxi Cheng, Tianshun Cheng, Yun Xiao
Improving Zero-Shot Image Captioning Efficiency with Metropolis-Hastings

Image captioning, a crucial aspect of natural language processing, aims to convert visual content into textual representations through technological advancements. Zero-shot learning, a technique that has gained widespread attention in recent research, performs tasks without relying on domain-specific training datasets. However, current zero-shot image captioning methods mainly depend on non-autoregressive language models, which often suffer from operational inefficiencies, resulting in prolonged captioning times and limiting their practical applications. To address this limitation, this study introduces an efficient zero-shot image captioning method, MHIC, leveraging the Metropolis-Hastings sampling algorithm. MHIC significantly improves computational speed while maintaining caption quality. Specifically, we optimize the caption generation process using the Metropolis-Hastings algorithm, effectively reducing the number of iterations required for word generation while preserving caption quality by minimizing the number of generated words. This approach markedly enhances the speed of image captioning. To validate the effectiveness of MHIC, we conducted rigorous experiments on two publicly available datasets. The results demonstrate that our proposed algorithm achieves an average speedup of 2 times compared to state-of-the-art methods, highlighting its significant potential in improving computational efficiency.

Dehu Du, Yujia Wu
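MHIC above builds on the Metropolis-Hastings sampling algorithm. The sketch below is a generic, self-contained MH sampler over a 1D target density to show the accept/reject mechanics; it is not the paper's captioning pipeline, and the proposal and step size are illustrative.

```python
# Minimal, generic Metropolis-Hastings sampler (toy 1D example).
import math
import random

def metropolis_hastings(log_target, init, steps=1000, step_size=0.5):
    x = init
    samples = []
    for _ in range(steps):
        proposal = x + random.gauss(0.0, step_size)      # symmetric random-walk proposal
        log_accept = log_target(proposal) - log_target(x)
        if math.log(random.random() + 1e-12) < log_accept:  # accept with prob min(1, ratio)
            x = proposal
        samples.append(x)
    return samples

# Example: draw samples from a standard normal (log density up to a constant).
samples = metropolis_hastings(lambda v: -0.5 * v * v, init=0.0)
```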
Improving Text Classification Performance Through Multimodal Representation

Traditionally, text classification research has predominantly focused on extracting single text features, with limited exploration of integrating other modal information (such as speech and images) to enhance classification performance. To address this research gap, we propose the Multimodal Representation for Text Classification (MRTC) framework. This framework aims to boost text classification performance by incorporating speech, image, and text features. Specifically, we employ advanced text-to-speech models to convert text content into audio features. Simultaneously, we retrieve images closely associated with the text content and extract their visual features to further enrich the information dimension of text representation. Subsequently, we utilize an efficient triplet structure network to fuse the speech, image, and text features, thereby constructing a multimodal feature representation for application in text classification tasks. The proposed MRTC framework achieves high-precision text classification across multiple datasets without requiring additional multimodal annotated data. This characteristic not only reduces the cost of data annotation but also enhances the model’s practical flexibility and scalability. To validate the effectiveness of the MRTC framework, we conduct experiments on six distinct text classification tasks. The experimental results demonstrate the significant effectiveness of our MRTC framework across various text classification tasks.

Yujia Wu, Xuan Zhang, Hong Ren
A Multi-feature Fusion Approach for Words Recognition of Ancient Mongolian Documents

Ancient Mongolian documents are valuable repositories of historical information and cultural significance. Analyzing these documents effectively demands specialized recognition research, as the absence of some words in current lexicons makes recognizing out-of-vocabulary (OOV) words crucial. To better recognize ancient Mongolian documents, an end-to-end approach based on multi-feature fusion called Ancient Mongolian Documents Recognition Unit (AMDRU) is proposed in this paper. This approach improves the ability of the model to understand images in ancient documents by leveraging information from word images at different scales. AMDRU receives word images and processes them through a custom-designed feature extractor to capture multi-scale structural details. These features are then input into an encoder utilizing the efficient additive attention mechanism, enabling superior understanding and representation of essential information. The encoded features are passed to a Transformer decoder to convert image data into text. The final output is a prediction of the corresponding strings. To address the uneven data distribution in ancient documents and enhance the learning of rare word images, the asymmetric loss is utilized, which significantly improves the model’s ability to learn from word images and boosts recognition performance. Experimental results demonstrate that our proposed approach can capture the structural features of characters in ancient Mongolian documents more accurately, and its recognition performance outperforms existing methods. It shows particularly better performance in the challenging task of recognizing OOV words.

Shiwen Sun, Hongxi Wei, Yiming Wang, Chao He
TableRocket: An Efficient and Effective Framework for Table Reconstruction

Table reconstruction (TR) aims to extract cell contents and logical structure from table images. Existing table reconstruction methods are superior in recognizing logical structure, but they are inhibited by slow decoding speed, error accumulation in long sequences, and varying cell sizes in table images, which are critical for subsequent table reconstruction. To address these problems, we propose TableRocket, a table reconstruction framework that includes an end-to-end layout cell instance segmentation module based on multi-stage query mechanisms and a non-autoregressive (NAR) logical structure generation module. TableRocket employs an innovative method for model layout cell segmentation by introducing dynamic candidate boxes and features to more accurately address the problem of varying cell sizes in table images, and has developed an algorithm specifically for logical structure generation based on layout cell bounding boxes. Extensive experimental results demonstrate that TableRocket achieves significantly faster inference speeds, while maintaining comparable performance to the auto-regressive model.

Liucheng Pang, Yaping Zhang, Cong Ma, Yang Zhao, Yu Zhou, Chengqing Zong
Not All Texts Are the Same: Dynamically Querying Texts for Scene Text Detection

In recent years, scene text detection has witnessed considerable advancements. However, such methods do not dynamically mine diverse text characteristics within each image to adaptively adjust model parameters, resulting in suboptimal detection performance. To address this issue, we propose a simple yet effective segmentation-based model named Text Query Detector (TQD), inspired by the recently popular transformer. TQD implicitly queries textual information and flexibly generates convolution parameters with the global receptive field. In addition, we decouple the features for parameter generation and dynamic convolution to maximize the benefits of both transformer and convolution. Extensive experiments demonstrate that our approach strikes an ideal tradeoff in terms of both accuracy and speed on prevalent benchmarks. Especially on MSRA-TD500 and ICDAR2015, our TQD achieves state-of-the-art results while maintaining high speed. Code is available at: https://github.com/TangLinJie/TQD .

Linjie Tang, Pengfei Yi, Mingrui Chen, MingKun Yang, Dingkang Liang
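To illustrate the dynamic-parameter idea above, the sketch below lets a per-image text query generate the weights of a 1×1 convolution that is then applied to the feature map. Shapes, the single output channel, and the per-sample loop are illustrative assumptions, not TQD's actual head.

```python
# Minimal sketch: query-conditioned (dynamic) 1x1 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvHead(nn.Module):
    def __init__(self, query_dim=256, feat_dim=64):
        super().__init__()
        self.feat_dim = feat_dim
        # Generate a (feat_dim -> 1 channel) kernel plus a bias from the query.
        self.param_gen = nn.Linear(query_dim, feat_dim + 1)

    def forward(self, query, feat):                 # query: (B, Dq), feat: (B, C, H, W)
        B, C, H, W = feat.shape
        params = self.param_gen(query)              # (B, C + 1)
        weight = params[:, :C].reshape(B, 1, C, 1, 1)
        bias = params[:, C]
        out = []
        for b in range(B):                          # per-sample conv with generated kernel
            out.append(F.conv2d(feat[b:b + 1], weight[b], bias[b:b + 1]))
        return torch.cat(out, dim=0)                # (B, 1, H, W) text logits

# Usage: mask_logits = DynamicConvHead()(torch.randn(2, 256), torch.randn(2, 64, 32, 32))
```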
Multi-Modal Attention Based on 2D Structured Sequence for Table Recognition

Table Structure Recognition (TSR) aims to extract two-part information from a table image: a 2D structured language sequence and a bounding box sequence. Image-to-text (i2t) methods have received more attention among recent TSR methods. However, recent i2t methods use (1) a dual-decoder framework for the two-part output, which is complex and hard to design, and (2) the vanilla attention layer for predicting 2D structured sequences, which is inappropriate because the vanilla attention is designed for 1D sequences. To address these problems, (1) we are the first to propose a novel encoder-decoder framework based on i2t methods, which discards the old dual-decoder framework, making it more uniform and easier to design. Our encoder-decoder architecture is composed of three modules: Multi-Modal Encoder, Multi-Modal Mid-Block, and Multi-Modal Decoder. (2) We also propose a novel 2D attention layer, which explicitly models the 2D features of the structured language. Finally, we test our method on public datasets and achieve significant improvements in visual prediction and comparable results in text prediction.

Yiming Zhang, Yaping Zhang, Lu Xiang, Yu Zhou

Action Recognition

Frontmatter
A Two-Stream Hybrid CNN-Transformer Network for Skeleton-Based Human Interaction Recognition

Human Interaction Recognition (HIR) is the process of identifying and understanding interactive actions and activities between multiple participants in a specific environment or situation. Many single Convolutional Neural Network (CNN) models have issues such as the inability to capture global instance-interaction features or difficulty in training, leading to ambiguity in action semantics. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of CNNs and models global dependencies through the Transformer. The CNN and Transformer streams simultaneously model the entity, temporal, and spatial relationships between interacting entities, respectively. Multi-grained information modelling is employed to enhance the accuracy and robustness of the action recognition system. Experimental results on diverse and challenging datasets, such as NTU-RGBD, H2O, and Assembly101, demonstrate that the proposed method can better comprehend and infer the meaning and context of various actions, outperforming state-of-the-art methods.

Ruoqi Yin, Jianqin Yin
Language-Skeleton Pre-training to Collaborate with Self-Supervised Human Action Recognition

Multi-modal representation learning for skeleton-based action recognition has developed with the advance of contrastive learning methods in recent years. In this work, to better excavate high-level semantic information, we leverage LLMs’ profound linguistic knowledge to provide semantic compensation for skeleton embeddings. Specifically, we present a Contrastive Language-Skeleton Pre-training framework (SkeletonCLSP), which employs cross-modal multivariate feature integration and auxiliary correction of anomalous distributions for self-supervised action representation. In the preprocessing phase, a Distributed Difference Perception (DDP) module is first proposed to emend anomalous fusion features by simulating the distribution of skeleton features in the fused sample; this distribution is then used to estimate anomalies at the inference stage. Furthermore, to implement parallel sequence modeling of text and skeleton embeddings, we incorporate a pre-fusion encoder called Text-Skeleton Vision Encode (TSVE), encouraging a richer exchange of information between the multi-modal data. Finally, Text Context Broadcasting (TCB) is proposed to infuse dense rather than sparse interactions into the skeleton spatial-temporal representation by giving each sample uniform attention. Extensive experiments on NTU60, NTU120, PKU-MMD, and UAV-Human show that the proposed method achieves remarkable action recognition performance. Related code will be made available.

Yi Liu, Ruyi Liu, Wentian Xin, Qiguang Miao, Yuzhi Hu, Jiahao Qi
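Frameworks like the one above rest on a symmetric language-skeleton contrastive objective. The sketch below shows that generic CLIP-style loss between paired skeleton and text embeddings; the DDP/TSVE/TCB modules from the abstract are not reproduced, and the temperature is an assumption.

```python
# Minimal sketch: symmetric contrastive loss between skeleton and text embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(skel_emb, text_emb, temperature=0.07):
    skel = F.normalize(skel_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = skel @ text.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching skeleton/text pairs sit on the diagonal; both directions are supervised.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: loss = clip_style_loss(torch.randn(16, 256), torch.randn(16, 256))
```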
Spatio-Temporal Contrastive Learning for Compositional Action Recognition

The task of compositional action recognition holds significant importance in the field of video understanding; however, the issue of static bias severely limits the generalization capability of models. Existing models often overly rely on sensitive features in videos, such as object appearance and background morphology, for action recognition, without fully leveraging true temporal action features, leading to recognition errors when faced with novel object-action combinations. To address this issue, this paper proposes an innovative framework for compositional action recognition, utilizing Spatio-Temporal contrastive learning to construct a three-branch architecture that distinguishes appearance and spatiotemporal features at the feature extraction stage. The model is encouraged to contrast features that predict factual probabilities with those that predict biased probabilities through contrastive learning, thereby reducing the direct and indirect reliance on sensitive features and enhancing the accuracy and generalization of recognition. Experimental results show that this method achieves state-of-the-art performance on the Something-Else dataset, validating its effectiveness in composite action recognition tasks. Furthermore, it achieves comparable or superior results to state-of-the-art methods on standard action recognition datasets such as Something-Something-V2, UCF101, and HMDB51.

Yezi Gong, Mingtao Pei
Path-Guided Motion Prediction with Multi-view Scene Perception

Compared to the prediction for individuals, motion prediction in 3D scenes remains a challenging task, often requiring guidance on both historical motion and the surrounding environment. However, the issue of how to effectively introduce scene context into human motion prediction remains unexplored. In this paper, we propose a novel scene-aware motion prediction method that formulates the RGBD scene data through multi-view perception, while predicting the human motion that matches the scene. First, from the top view, we perform a global path planning for motion trajectory based on scene context information. Then, from the 2D view, the semantic features of the scene are extracted from image sequences and fused with the human motion features to learn the potential interaction between the scene and the motion intention. Finally, a path-guided motion prediction framework is proposed to infer the final motion of human in the 3D view. We evaluate the effectiveness of the proposed method on two challenging datasets, including both synthetic and real environments. Experimental results demonstrate that the proposed method achieves the state-of-the-art motion prediction performance in complex scenes.

Zongyun Li, Yang Yang, Xuehao Gao
Privacy-Preserving Action Recognition: A Survey

With the rapid development of deep learning in the field of action recognition, there is growing concern about the invasion of privacy due to the need for large amounts of training data and the collection of personal data. To alleviate these concerns while ensuring the advancement of research on action recognition, research on privacy-preserving action recognition (PPAR) has emerged. PPAR requires the model to remove the private information contained in videos and recognize the actions performed in the resulting privacy-free videos. Many efforts have been made to solve this problem from different aspects. In this work, we first outline the common framework for training PPAR models and formulate the goals of PPAR. Then, we review the commonly used datasets in PPAR research and explain in detail the different definitions and evaluations of privacy protection in these datasets. The demand for PPAR datasets is also discussed. Furthermore, we categorize existing PPAR methods according to the way they remove private information. To inspire insights into potential future research directions, we comprehensively review each category of existing PPAR methods. An objective analysis of existing methods’ impressive improvements as well as their inevitable drawbacks is provided.

Xiao Li, Yu-Kun Qiu, Yi-Xing Peng, Ling-An Zeng, Wei-Shi Zheng
Attention-Based Spatio-Temporal Modeling with 3D Convolutional Neural Networks for Dynamic Gesture Recognition

Dynamic gesture recognition has indeed become a focal point in the field of human-computer interaction. Despite advancements, contemporary approaches fall short in considering the interdependence between frame images, and the complexity of real-world application backgrounds poses a significant obstacle to enhancing gesture recognition accuracy. In this paper, we propose a concatenated spatio-temporal attention with 3D convolutional network (CPTA3DNet) for gesture recognition. A stackable spatio-temporal attention module (STAM) is designed to effectively capture the dynamics of gestures over time and space. This module operates sequentially, initially emphasizing temporal attention followed by spatial attention. To evaluate the effectiveness of our method, we rigorously tested it on two large-scale public gesture recognition datasets: the Jester dataset and EgoGesture dataset. Focusing on the RGB modality, our experiments led to groundbreaking results, with our method attaining recognition accuracy of 94.79% on the Jester dataset and 94.36% on the EgoGesture dataset, which performs better than existing methods. Additionally, we performed comprehensive ablation studies to substantiate the impact of STAM in capturing both temporal and spatial dynamics.

Yutong Hu
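As a rough illustration of the sequential temporal-then-spatial attention ordering described above, the sketch below weights frames first and locations second over a 5D video feature tensor; the module design, kernel sizes, and shapes are assumptions rather than the paper's STAM.

```python
# Minimal sketch: temporal attention followed by spatial attention on (B, C, T, H, W).
import torch
import torch.nn as nn

class TemporalThenSpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.temporal = nn.Sequential(nn.Conv1d(channels, channels, 3, padding=1),
                                      nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3),
                                     nn.Sigmoid())

    def forward(self, x):                               # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        # Temporal attention: weight each frame per channel.
        t_desc = x.mean(dim=(3, 4))                     # (B, C, T)
        t_att = self.temporal(t_desc).view(B, C, T, 1, 1)
        x = x * t_att
        # Spatial attention: weight each location within every frame.
        s_in = x.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        s_att = self.spatial(s_in).reshape(B, T, 1, H, W).permute(0, 2, 1, 3, 4)
        return x * s_att

# Usage: out = TemporalThenSpatialAttention(64)(torch.randn(2, 64, 8, 28, 28))
```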
MIT: Multi-cue Injected Transformer for Two-Stage HOI Detection

Transformers have demonstrated potential in leveraging features for two-stage human-object interaction (HOI) detection, but a considerable performance gap persists compared to one-stage methods. We attribute this discrepancy to the limited granularity in the coverage of first-stage features. In this paper, we introduce a multi-cue injected Transformer (MIT), specifically devised for two-stage HOI detection. MIT efficiently utilizes multi-granularity information, encompassing cues related to instances, bounding boxes, 3D poses, and global context. Initially, MIT associates instances, such as humans and objects, that have potential interactive relationships using bounding box cues, subsequently fusing these instances with 3D pose to derive a fused embedding for each human-object pair. These embeddings are then refined by querying on global context feature maps. Through the hierarchical integration of these diverse cues, MIT substantially enhances HOI detection performance. Extensive experiments validate MIT’s effectiveness and its superiority to state-of-the-art methods.

Weilong Peng, Qingfeng Chen, Keke Tang, Zhihao Yang, Meng Xing, Meie Fang
DIDA: Dynamic Individual-to-integrateD Augmentation for Self-supervised Skeleton-Based Action Recognition

Self-supervised action recognition plays a crucial role by enabling machines to understand and interpret human actions without the need for numerous human-annotated labels. Contrastive learning, which compels the model to focus on discriminative features by constructing positive and negative sample pairs, is a highly effective method for achieving self-supervised action recognition. In contrastive learning, existing models focus on designing various augmentation methods and simply apply a fixed combination of these augmentations to generate the sample pairs. Nevertheless, there are two primary concerns with these methods: (1) contentious strong augmentations could distort the structure of skeleton data and lead to semantic distortion; (2) existing methods often apply augmentations uniformly, ignoring the unique characteristics of each augmentation technique. To address these problems, we propose the Dynamic Individual-to-integrateD Augmentation (DIDA) framework. This framework is designed with an innovative dual-phase structure. In the first phase, a closed-loop feedback structure handles each augmentation separately and adjusts its intensity dynamically based on immediate results. In the second phase, an individual-to-integrated augmentation strategy with multi-level contrastive learning is designed to further enhance the feature representation ability of the model. Extensive experiments show that the proposed DIDA outperforms current state-of-the-art methods on the NTU60 and NTU120 datasets.

Haobo Huang, Jianan Li, Hongbin Fan, Zhifu Zhao, Yangtao Zhou
Multi-scale Spatial and Temporal Feature Aggregation Graph Convolutional Network for Skeleton-Based Action Recognition

In the field of deep learning, skeleton data is widely used for action recognition. Currently, skeleton-based action recognition with Graph Convolutional Networks (GCNs) has become the dominant approach and has achieved remarkable results. However, existing methods are not sufficiently expressive with respect to temporal and spatial features. Therefore, we propose a Multi-scale Spatial and Temporal Feature Aggregation Graph Convolutional Network (MSTA-GCN) for skeleton-based action recognition, which can effectively aggregate features along the spatial and temporal dimensions using a hierarchical structure. Specifically, we integrate the topology learning strategy with the edge convolution module to aggregate global and fine-grained features in the spatial dimension. On this basis, a multi-scale temporal convolution based on a temporal attention module is proposed to aggregate node features that change within frames while preserving global temporal features. Finally, the feature refinement module for skeleton data is improved to enhance the network's ability to represent spatial features. Our proposed MSTA-GCN outperforms most mainstream methods and achieves satisfactory performance on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.

Yifei Du, Mingliang Zhang, Bin Li
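For context, the abstracts in this section build on the spatial graph convolution at the core of GCN-based skeleton recognition. The sketch below shows that basic operation with a learnable adjacency; joint count and the combination of fixed and learned topology are illustrative, not MSTA-GCN itself.

```python
# Minimal sketch: spatial graph convolution over skeleton joints.
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, num_joints=25):
        super().__init__()
        # Learnable adjacency (initialized as identity) standing in for the skeleton graph.
        self.adj = nn.Parameter(torch.eye(num_joints))
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, T, V), V = number of joints
        x = self.proj(x)                        # per-joint feature transform
        return torch.einsum('bctv,vw->bctw', x, self.adj)  # aggregate over neighboring joints

# Usage: out = SpatialGraphConv(3, 64)(torch.randn(8, 3, 64, 25))
```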
Improving Video Representation of Vision-Language Model with Decoupled Explicit Temporal Modeling

Vision-Language Pre-trained (VLP) models have shown significant ability in many video tasks. For action recognition, recent studies predominantly use meticulously designed prompt tokens or positional encodings to adapt VLP models to video domains, consequently leading to a reliance on designing and learning processes. Moreover, in mainstream fine-tuning settings, models are guided by downstream tasks, which is a coarse-grained objective for temporal modeling. To address these issues, we propose an Explicit Temporal Modeling (ETM) method that mainly consists of two key designs and is decoupled from the image model. To add temporal supervision, we focus on frame-sequential order and design a temporal-related task in a contrastive manner. To reduce dependence on the quality of design and learning when modeling temporality, we propose a module with temporality-aware computation approaches and make it compatible with the newly added task. Extensive experiments are conducted on real-world datasets, demonstrating that our proposed ETM can improve VLP models’ performance on action recognition tasks. Besides, our model also demonstrates generalization ability in few-/zero-shot tasks. Code and supplementary material are available at https://github.com/lyxwest/ETM .

Yuxi Liu, Wenyu Zhang, Sihong Chen, Xinming Zhang
KS-FuseNet: An Efficient Action Recognition Method Based on Keyframe Selection and Feature Fusion

Addressing the challenge of effectively capturing features in contemporary video tasks, we propose an action recognition approach grounded in keyframe selection and feature fusion. Our method comprises two core modules. The keyframe screening module employs an attention mechanism to separate the input depth feature map sequence into two distinct tensors, effectively reducing spatial redundancy computation and enhancing key feature capture. The spatio-temporal and action feature module consists of two branches with different structures, extracting spatio-temporal and action features from the differentiated features produced by the previous module. Through these closely linked modules, our approach effectively discerns and extracts meaningful video features for subsequent classification tasks. We construct an end-to-end deep learning model using established frameworks, training and validating it on a generic video dataset, and confirm its efficacy through comparison and ablation experiments. Experiments conducted on this dataset demonstrate that our model surpasses the majority of prior works.

Keming Mao, Yilong Xiao, Xin Jing, Zepeng Hu, Yi Ping
Dynamic Skeleton Association Transformer for Dyadic Interaction Action Recognition

Since GCNs were proposed to represent skeleton data as graphs, they have been the primary method for skeleton-based human action recognition. However, when dealing with interaction skeleton sequences, current GCN-based methods do not consider dynamically updating the connections between the skeleton points of two persons and cannot extract interaction features well. The self-attention module of the Transformer can focus well on the correlation between skeleton sequences. We propose a novel method called Dynamic Skeleton Association Transformer (DSAT) for dyadic interaction action recognition, which dynamically updates the interaction-relationship adjacency matrix by combining the spatial attention features and geometric spatial distances of two skeleton sequences to capture the spatial interaction between the two persons' skeleton sequences. Then, we use spatial self-attention to extract the interaction relationships between different individuals and within the same individual. We also improve the temporal self-attention module according to the density of interactive events to extract the correlation of the same skeleton point across different frames. With this strategy, our model can more effectively recognize interactive behaviors that are dense in time and space. We have conducted extensive experiments on the SBU, NTU-RGB+D, and NTU-RGB+D 120 interaction subsets to verify the effectiveness of our method.

Zixian Liu, Longfei Zhang, Xiaokun Zhao, Yixuan Wang
Species-Aware Guidance for Animal Action Recognition with Vision-Language Knowledge

Species diversity is one of the major differences between animal action recognition and human action recognition, resulting in a series of challenges, e.g., action manifestation diversity, concurrent actions, and long-tailed distribution in datasets. As the same action can be manifested significantly differently among animal species due to their physiological differences, it is crucial for models to distinctively learn various visual content under the same label with species-aware perspectives. However, previous works mainly applied single-species recognition methods to animal datasets, without considering species diversity to address animal action recognition. To fill this gap, we propose a novel animal action recognition approach with specific species guidance by exploring pre-trained vision-language knowledge, namely Species-Aware Guidance (SAG). Firstly, we add word-level species semantics to visual embeddings as guidance, leading the model to focus on relevant regions of target animals in subsequent visual understanding. Then, we apply spatiotemporal modeling in both global and local granularity via a two-branch module to obtain a cross-modal video representation. Finally, sentence-level species-aware semantics is fused with action labels as an overall query, guiding the video representation to output the final action label via the decoder. On two widely used public benchmarks of animal action recognition, for both single-label and multi-label scenarios, SAG achieves state-of-the-art performance, e.g., on Animal Kingdom (↑5.0%) and Mammalnet (↑27.0%) compared to existing methods, and it especially alleviates the problem of long-tailed distributions, demonstrating the effectiveness of species guidance under limited training data.

Zhen Zhai, Hailun Zhang, Qijun Zhao, Keren Fu
Backmatter
Metadata
Title
Pattern Recognition and Computer Vision
Edited by
Zhouchen Lin
Ming-Ming Cheng
Ran He
Kurban Ubul
Wushouer Silamu
Hongbin Zha
Jie Zhou
Cheng-Lin Liu
Copyright Year
2025
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9785-11-7
Print ISBN
978-981-9785-10-0
DOI
https://doi.org/10.1007/978-981-97-8511-7