2025 | Book

Pattern Recognition and Computer Vision

7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part VI

Edited by: Zhouchen Lin, Ming-Ming Cheng, Ran He, Kurban Ubul, Wushouer Silamu, Hongbin Zha, Jie Zhou, Cheng-Lin Liu

Publisher: Springer Nature Singapore

Book Series: Lecture Notes in Computer Science

About this book

This 15-volume set LNCS 15031-15045 constitutes the refereed proceedings of the 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024, held in Urumqi, China, during October 18–20, 2024.

The 579 full papers presented were carefully reviewed and selected from 1526 submissions. The papers cover various topics in the broad areas of pattern recognition and computer vision, including machine learning, pattern classification and cluster analysis, neural network and deep learning, low-level vision and image processing, object detection and recognition, 3D vision and reconstruction, action recognition, video analysis and understanding, document analysis and recognition, biometrics, medical image analysis, and various applications.

Table of Contents

Frontmatter

3D Vision and Reconstruction

Frontmatter
Visual Harmony: LLM’s Power in Crafting Coherent Indoor Scenes from Images

Indoor scene generation has recently attracted significant attention, as it is crucial for the metaverse, 3D animation, visual effects in movies, and virtual/augmented reality. Existing image-based indoor scene generation methods often produce scenes that are not realistic enough, with issues such as floating objects, incorrect object orientations, and incomplete scenes that only include the parts of the scene captured by the input image. To address these challenges, we propose Visual Harmony, a method that leverages the powerful spatial imagination capabilities of a Large Language Model (LLM) to generate corresponding indoor scenes based on the input image. Specifically, we first extract information from the input image through depth estimation and panorama segmentation, reconstructing a semantic point cloud. Using this reconstructed semantic point cloud, we extract a scene graph that describes only the objects in the image. We then leverage the strong spatial imagination capabilities of the LLM to complete the scene graph, forming a representation of a complete room scene. Based on this refined scene graph, we can generate an entire indoor scene that includes both the parts captured by the input image and those that are not. Extensive experiments demonstrate that our method can generate realistic, plausible, and highly relevant complete indoor scenes related to the input image.

Genghao Zhang, Yuxi Wang, Chuanchen Luo, Shibiao Xu, Yue Ming, Junran Peng, Man Zhang
Superpixel Cost Volume Excitation for Stereo Matching

In this work, we concentrate on exciting the intrinsic local consistency of stereo matching through the incorporation of superpixel soft constraints, with the objective of mitigating inaccuracies at the boundaries of predicted disparity maps. Our approach capitalizes on the observation that neighboring pixels are predisposed to belong to the same object and exhibit closely similar intensities within the probability volume of superpixels. By incorporating this insight, our method encourages the network to generate consistent probability distributions of disparity within each superpixel, aiming to improve the overall accuracy and coherence of predicted disparity maps. Experimental evaluations on widely used datasets validate the efficacy of our proposed approach, demonstrating its ability to help cost volume-based matching networks achieve competitive performance.
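
To make the superpixel soft constraint concrete, the sketch below shows one possible form of such a consistency term in PyTorch: each pixel's disparity probability distribution is pulled toward the mean distribution of its superpixel. The function name, tensor layout, and the use of a KL divergence are illustrative assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def superpixel_consistency_loss(prob, sp_labels, num_superpixels):
    """Illustrative soft constraint: each pixel's disparity distribution should
    match the mean distribution of the superpixel it belongs to.

    prob:      (N, D) per-pixel disparity probability distributions (rows sum to 1)
    sp_labels: (N,)   superpixel id of each pixel, values in [0, num_superpixels)
    """
    n, d = prob.shape
    # Accumulate per-superpixel sums and counts, then take the mean distribution.
    sums = torch.zeros(num_superpixels, d, device=prob.device).index_add_(0, sp_labels, prob)
    counts = torch.zeros(num_superpixels, device=prob.device).index_add_(
        0, sp_labels, torch.ones(n, device=prob.device))
    means = sums / counts.clamp(min=1.0).unsqueeze(1)
    # KL divergence between each pixel's distribution and its superpixel mean.
    target = means[sp_labels].detach()
    return F.kl_div(prob.clamp_min(1e-8).log(), target, reduction="batchmean")
```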

Shanglong Liu, Lin Qi, Junyu Dong, Wenxiang Gu, Liyi Xu
Multi-view Depth Estimation with Adaptive Feature Extraction and Region-Aware Depth Prediction

Multi-view depth estimation is an essential task for 3D reconstruction, which aims to obtain a depth map from multi-view images via multi-view stereo (MVS) techniques. However, when the input images contain challenging regions with occlusion and low texture, existing MVS methods may fail to perform well. To tackle these problems, this paper proposes a multi-view depth estimation framework that consists of adaptive feature extraction and region-aware depth prediction. To obtain better pixel feature matching, adaptive feature extraction is constructed with a CNN-based Adaptive Local Feature Extractor (ALFE) and a Transformer-based Global Feature Extractor (GFE) to capture representative and robust features in challenging regions. To obtain a better depth map output, region-aware depth prediction is constructed with the Region-Aware Depth Refinement Module (RA-DRM), which iteratively refines the depth map guided by the extracted features in different regions. Qualitative and quantitative experiments were conducted on the DTU and BlendedMVS datasets. The results show the effectiveness of the proposed ALFE, GFE, and RA-DRM modules, and the comparisons indicate that our learning-based MVS method is superior to related methods.

Chi Zhang, Lingyu Liang, Jijun Zhou, Yong Xu
3D Data Augmentation for Driving Scenes on Camera

Driving scenes are so diverse and complicated that it is impossible to collect all cases with human effort alone. While data augmentation is an effective technique to enrich the training data, existing methods for camera data in autonomous driving applications are confined to the 2D image plane, which may not optimally increase data diversity in 3D real-world scenarios. To this end, we propose a 3D data augmentation approach termed Drive-3DAug, aiming at augmenting the driving scenes on camera in 3D space. We first utilize Neural Radiance Fields (NeRF) to reconstruct the 3D models of background and foreground objects. Then, augmented driving scenes can be obtained by placing the 3D objects with adapted locations and orientations in pre-defined valid regions of the backgrounds. As such, the training database can be effectively scaled up. However, the 3D object modeling is constrained by the image quality and the limited viewpoints. To overcome these problems, we modify the original NeRF by introducing a geometric rectified loss and a symmetric-aware training strategy. We evaluate our method on the camera-only monocular 3D detection task on the Waymo and nuScenes datasets. The proposed data augmentation approach contributes to a gain of 1.7% and 1.4% in detection accuracy on Waymo and nuScenes, respectively. Furthermore, the constructed 3D models serve as digital driving assets and can be recycled for different detectors or other 3D perception tasks.

Wenwen Tong, Jiangwei Xie, Tianyu Li, Yang Li, Hanming Deng, Bo Dai, Lewei Lu, Hao Zhao, Junchi Yan, Hongyang Li
A Pose-Aware Auto-Augmentation Framework for 3D Human Pose and Shape Estimation from Partial Point Clouds

This work mainly addresses the challenges in 3D human pose and shape estimation from real partial point clouds. Existing 3D human estimation methods from point clouds usually have limited generalization ability on real data due to factors such as self-occlusion, random noise, and the domain gap between real and synthetic data. In this paper, we propose a pose-aware auto-augmentation framework for 3D human pose and shape estimation from partial point clouds. Specifically, we design an occlusion-aware module for the estimator network that can obtain refined features to accurately regress human pose and shape parameters from partial point clouds, even if the point clouds are self-occluded. Based on the pose parameters and global features of the point clouds from the estimator network, we carefully design a learnable augmentor network that can intelligently drive and deform real data to enrich data diversity during the training of the estimator network. To guide the augmentor network to generate challenging augmented samples, we adopt an adversarial learning strategy based on the error feedback of the estimator. The experimental results on real and synthetic data demonstrate that the proposed approach can accurately estimate 3D human pose and shape from partial point clouds and outperforms prior works in terms of reconstruction accuracy.

Kangkan Wang, Shihao Yin, Chenghao Fang
Efficient Emotional Talking Head Generation via Dynamic 3D Gaussian Rendering

The synthesis of talking heads with outstanding fidelity, lip synchronization, emotion control, and high efficiency has received lots of research interest in recent years. While some current methods can produce high-fidelity videos in real-time based on NeRF, they are still constrained by computational resources and struggle to achieve accurate emotion control. To tackle these challenges, we propose Emo-Gaussian, a method for generating talking heads based on 3D Gaussian Splatting. In our method, a Gaussian field is utilized to model a specific character. We condition the opacity and color information on audio and emotion inputs, dynamically rendering and optimizing the 3D Gaussians, thus effectively achieving the modeling of the dynamic variations of the talking head. As for the emotion input, we introduce an emotion control module, which utilizes a pre-trained CLIP model to extract emotional priors from images of individuals. These priors are then integrated with an attention mechanism to provide emotion guidance for the process of generating talking heads. Quantitative and qualitative experiments demonstrate the superiority of our method over previous approaches in terms of image quality, lip synchronization, and emotion control, meanwhile exhibiting high efficiency compared to previous state-of-the-art methods.

Tiantian Liu, Jiahe Li, Xiao Bai, Jin Zheng
Generalizable Geometry-Aware Human Radiance Modeling from Multi-view Images

Modeling an animatable human avatar in sparse view inputs is highly challenging, especially when synthesizing novel pose images different from the input views. Previous methods suffered from significant image blurring and a lack of clothing wrinkle details due to the spatial transformation process, along with rendering artifacts in the human body caused by self-occlusion issues. To address these issues, we introduce an efficient generalizable geometry-aware human radiance field method for synthesizing high-fidelity novel views and poses from sparse view inputs. To solve the inaccurate feature correspondence caused by human spatial transformation, we propose a human body geometric embedding derived from centroid mapping to provide accurate geometric prior information for guiding the neural radiance field’s learning. Furthermore, we use a geometry-aware attention mechanism consisting of two feature attention modules to address the issue of self-occlusion in sparse view inputs, resulting in improved body shape details and reduced blurriness. Qualitative and quantitative results on the ZJU-MoCap and Thuman datasets demonstrate that our method outperforms state-of-the-art approaches significantly in novel view and pose synthesis tasks.

Weijun Wu, Zhixiong Mo, Weihao Yu, Yizhou Cheng, Tinghua Zhang, Jin Huang
AG-NeRF: Attention-Guided Neural Radiance Fields for Multi-height Large-Scale Outdoor Scene Rendering

Existing neural radiance field (NeRF)-based novel view synthesis methods for large-scale outdoor scenes are mainly built on a single altitude. Moreover, they often require a priori knowledge of the camera shooting height and scene scope, leading to inefficient and impractical applications when the camera altitude changes. In this work, we propose an end-to-end framework, termed AG-NeRF, and seek to reduce the training cost of building good reconstructions by synthesizing free-viewpoint images based on varying altitudes of scenes. Specifically, to tackle the detail variation problem from low altitude (drone-level) to high altitude (satellite-level), a source image selection method and an attention-based feature fusion approach are developed to extract and fuse the most relevant features of the target view from multi-height images for high-fidelity rendering. Extensive experiments demonstrate that AG-NeRF achieves SOTA performance on the 56 Leonard and Transamerica benchmarks and only requires half an hour of training time to reach a competitive PSNR compared to the latest BungeeNeRF.

Jingfeng Guo, Xiaohan Zhang, Baozhu Zhao, Qi Liu
JPA: A Joint-Part Attention for Mitigating Overfocusing on 3D Human Pose Estimation

Recently, transformer-based solutions have exhibited remarkable success in 3D human pose estimation (3D-HPE) by computing pairwise relations between joints. However, we observed that the conventional self-attention mechanism in 3D-HPE tends to overly focus on a tiny fraction of joints. Moreover, these overfocused joints often lack relevance to the performed actions, resulting in models that struggle to generalize across poses. In this paper, we address this issue by incorporating prior information on the human body structure through a plug-and-play Joint-Part Attention (JPA) module. Firstly, we design a Part-aware Weighted Aggregation (PWA) module to merge different joints into distinct parts. Secondly, we introduce a Joint-Part Cross-scale Attention (JPCA) module to encourage the model to attend to more joints. This is achieved by configuring joint tokens to query part tokens across two scales. In our experiments, we apply JPA to various transformer-based methods, demonstrating its superiority on Human3.6M, MPI-INF-3DHP, and HumanEva datasets. We will release our code.
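
As a sketch of the joint-to-part attention idea, the snippet below lets joint tokens query a smaller set of part tokens with standard cross-attention. The module name, dimensions, and token counts are hypothetical; the paper's JPCA operates across two scales and is more elaborate than this single-scale illustration.

```python
import torch
import torch.nn as nn

class JointPartCrossAttention(nn.Module):
    """Joint tokens (queries) attend to coarser part tokens (keys/values)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joint_tokens, part_tokens):
        # joint_tokens: (B, J, C), e.g. 17 body joints; part_tokens: (B, P, C), e.g. 5 body parts.
        attended, _ = self.attn(query=joint_tokens, key=part_tokens, value=part_tokens)
        return self.norm(joint_tokens + attended)  # residual update of the joint tokens

# usage sketch
joints = torch.randn(2, 17, 256)
parts = torch.randn(2, 5, 256)
out = JointPartCrossAttention()(joints, parts)  # (2, 17, 256)
```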

Dengqing Yang, Zhenhua Tang, Jinmeng Wu, Shuo Wang, Lechao Cheng, Yanbin Hao
Realistic and Visually-Pleasing 3D Generation of Indoor Scenes from a Single Image

Artificial Intelligence Generated Content (AIGC) has experienced significant advancements, particularly in the areas of natural language processing and 2D image generation. However, the generation of three-dimensional (3D) content from a single image still poses challenges, particularly when the input image contains complex backgrounds. This limitation hinders the potential applications of AIGC in areas such as human-machine interaction, virtual reality (VR), and architectural design. Despite the progress made so far, existing methods face difficulties when dealing with single images that have intricate backgrounds. Their reconstructed 3D shapes tend to be incomplete, noisy, or lacking partial geometric structures. In this paper, we introduce a 3D generation framework for indoor scenes from a single image to generate realistic and visually-pleasing 3D geometry, without requiring point clouds, multi-view images, depth, or masks as input. The main idea of our method is clustering-based 3D shape learning and prediction, followed by a shape deformation. Since indoor scenes typically contain more than one object, our framework simultaneously generates multiple objects and predicts the layout with a camera pose, as well as 3D object bounding boxes, for holistic 3D scene understanding. We have evaluated the proposed framework on benchmark datasets including ShapeNet, SUN RGB-D, and Pix3D, and state-of-the-art performance has been achieved. We have also given examples to illustrate immediate applications in virtual reality.

Jie Li, Lei Wang, Gongbin Chen, Ang Li, Yuhao Qiu, Jiaji Wu, Jun Cheng
AttenPoint: Exploring Point Cloud Segmentation Through Attention-Based Modules

Similar to how humans perceive 3D objects, neural networks discern the class labels of point clouds by combining local and global features of their structures. Based on this, we reviewed the pipeline of few-shot point cloud semantic segmentation and identified three issues: the neglect of local information by the neural network, the lack of receptive field for point cloud features, and the domain gap between the support and query set data. To address them, we propose a novel network called AttenPoint. It incorporates three attention-based modules: attention pooling is used to extract local features accurately, attention feature enhancement aims to broaden the global feature receptive field, and the attention segmentation head aims to achieve transfer across domains with limited samples. Experimental results on the S3DIS and ScanNet datasets demonstrate that AttenPoint achieves state-of-the-art (SOTA) performance in the few-shot point cloud semantic segmentation task.

Xiaohan Yan, Nan Wang, Xiaowei Song, Gang Wei, Zhicheng Wang
MTFusion: Reconstructing Any 3D Object from Single Image Using Multi-word Textual Inversion

Reconstructing 3D models from single-view images is a long-standing problem in computer vision. The latest advances for single-image 3D reconstruction extract a textual description from the input image and further utilize it to synthesize 3D models. However, existing methods focus on capturing a single key attribute of the image (e.g., object type, artistic style) and fail to consider the multi-perspective information required for accurate 3D reconstruction, such as object shape and material properties. Besides, the reliance on Neural Radiance Fields hinders their ability to reconstruct intricate surfaces and texture details. In this work, we propose MTFusion, which leverages both image data and textual descriptions for high-fidelity 3D reconstruction. Our approach consists of two stages. First, we adopt a novel multi-word textual inversion technique to extract a detailed text description capturing the image’s characteristics. Then, we use this description and the image to generate a 3D model with FlexiCubes. Additionally, MTFusion enhances FlexiCubes by employing a special decoder network for Signed Distance Functions, leading to faster training and finer surface representation. Extensive evaluations demonstrate that our MTFusion surpasses existing image-to-3D methods on a wide range of synthetic and real-world images. Furthermore, the ablation study proves the effectiveness of our network designs.

Yu Liu, Ruowei Wang, Jiaqi Li, Zixiang Xu, Qijun Zhao
Multi-view 3D Reconstruction by Fusing Polarization Information

To address the shortcomings of current 3D reconstruction models, such as poor reconstruction quality and blurred edges when dealing with weakly textured and textureless objects, this paper fuses rich polarization spectral information into multi-view 3D reconstruction and presents the MP-mip-NeRF 360 model. We have constructed a multi-angle polarization dataset and completed systematic theoretical model validations on it. Compared with existing deep learning models, our model achieves better results in terms of accuracy, rendering more realistic scenes and obtaining more detailed depth maps.

Gaomei Hu, Haimeng Zhao, Qirun Huo, Jianfang Zhu, Peng Yang
Quat-DGNet: Enhancing 3D Dense Captioning with Quaternion-Based Spatial Offsets and Dynamic Neighborhood Graphs

3D dense captioning aims at generating more detailed and accurate descriptions for objects in a 3D scene. Since the one-stage (detect-and-describe) model does not have a detector to provide proposals as local information to the encoder, it suffers from an imbalance between global and local information in the encoding stage. To solve this problem, we propose Quat-DGNet, a novel model that compensates for the insufficient local information. Specifically, we propose Quat-B and DNG to capture positional offsets and to model local relationship graphs, respectively. Quat-B does this by constraining the point cloud coordinates to a quaternion space, as the quaternion representation is valid for parameterizing smooth rotations and spatial transformations in vector space. We design a loss function to more accurately describe the offset and make the point cloud move towards the object. DNG supplements local geometric features by constructing dynamic point cloud relationship graphs, which maintain alignment invariance and capture local geometric features, thus improving the diversity and quality of the model-generated descriptions. Comprehensive experiments demonstrate that our model outperforms existing efficient models in performance.

Shu Li, Xiangdong Su, Jiang Li, Fujun Zhang
Disparity Refinement Based on Cross-Modal Feature Fusion and Global Hourglass Aggregation for Robust Stereo Matching

Stereo matching is a critical research area in computer vision. The advancement of deep learning has led to the gradual replacement of cost-filtering methods by iterative optimization techniques, characterized by outstanding generalization performance. However, cost volumes constructed solely through recurrent all-pairs field transforms in iterative optimization methods lack adequate image information, making it challenging to resolve blurring issues in pathological regions such as illumination changes or similar textures. In this paper, we propose SCA-Stereo, a disparity refinement network aimed at further optimizing the initial disparity map generated by iteration. First, we introduce a high- and low-frequency feature extractor to delve deeper into the structural and fine feature information inherent in the image. Furthermore, we propose a cross-modal feature fusion module to facilitate the exchange and integration of diverse features, expanding the receptive field to enhance information flow. Finally, we design a global hourglass aggregation network to efficiently capture non-local interactions between fusion features. Extensive experiments conducted across Scene Flow, KITTI, Middlebury, and ETH3D demonstrate the effectiveness of SCA-Stereo in achieving state-of-the-art stereo matching performance.

Gang Wang, Jinlong Yang, Yinghui Wang
Trajectory-based Calibration for Optical See-Through Head-Mounted Displays Without Alignment

Optical see-through head-mounted display (OST-HMD) calibration is crucial for aligning virtual and real content in augmented reality applications. Most conventional OST-HMD calibration methods rely heavily on manual alignment of virtual and real content to compute the transformation parameters between the relevant coordinate systems. However, manual alignment is cumbersome and time-consuming, which limits the wide application of these methods. To address this issue, we propose a new alignment-free calibration method based on trajectory generation for robust OST-HMD calibration. The proposed method has three main steps: 1) obtain calibration image data from different coordinate spaces following predefined trajectories; 2) calculate the transformation parameters from the real world to the tracking camera and the observation camera, as well as the transformation parameter from the virtual content to the observation camera; 3) derive the OST-HMD calibration parameters from the above transformation parameters. The advantages of our method are: 1) combined with the trajectory calibration mechanism, the proposed method obtains more robust results than traditional approaches; 2) due to the precise geometric derivation from 3D-3D correspondences, it naturally avoids manual alignment and obtains accurate calibration results without requiring any internal parameters of the virtual camera or the display parameters of OST-HMD devices. Extensive experiments show that the proposed method achieves better calibration parameters than other methods in both accuracy and robustness.

Yongqi Wang, Shaohua Zhao, Wei Chen, ZhongChen Shi, Liang Xie, Ye Yan, ErWei Yin
Animatable Human Rendering from Monocular Video via Pose-Independent Deformation

Rendering animatable avatars from monocular videos has significant applications in the broader field of interactive entertainment. Previous methods based on Neural Radiance Fields (NeRF) struggle with long training times and tend to overfit on seen poses. To address this, we introduce PID-NeRF, a novel framework with a Pose-Independent Deformation (PID) module. Specifically, the PID module learns a multi-entity shared skinning prior and optimizes instance-level non-rigid offsets in UV-H space, which is independent of human motion. This pose-independence enables our model to unify the backward and forward human skeleton deformations in the same network parameters, increasing the generalizability of our skinning prior. Additionally, a bounded segment modeling (BSM) strategy is utilized with a window function to smooth overlapping regions of bounding boxes, balancing training speed and rendering quality. Extensive experiments demonstrate that our method achieves better results than the state-of-the-art methods in novel-view and novel-pose synthesis on multiple datasets.

Tong Duan, Zekai Jiang, Zipei Ma, Dongyu Zhang
Maximum Spanning Tree for 3D Point Cloud Registration

3D point cloud registration, a fundamental and challenging problem in computational vision, aims to find the best pose to align the point clouds of two 3D objects or scenes and transform them into the same coordinate system to achieve the best alignment. The key to point cloud registration is accurately finding the correspondence between point clouds. In this paper, we propose a purely geometric 3D point cloud registration method based on the maximum spanning tree (MST): 1) We extract the local features of the point clouds for initial alignment and construct the initial compatibility graph of the two 3D objects or scenes. 2) We sparsify the initial compatibility graph and find the maximum spanning trees in the sparse graph. 3) We select the maximum spanning tree with the most nodes and use the DFS algorithm to find all the node sets (each node set contains three nodes) in the selected maximum spanning tree. 4) We generate and evaluate the hypotheses from the node sets, and the hypothesis with the highest evaluation score yields the solved transformation matrix. We extensively tested our method on the 3DMatch and 3DLoMatch datasets. The experimental results show that the proposed MST method performs on par with or better than existing 3D point cloud registration methods.
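
The maximum spanning tree in step 2 can be obtained with a standard Kruskal-style routine on the compatibility graph, as sketched below; the edge-list input format is an assumption made for illustration and is independent of how the compatibility scores are computed.

```python
def maximum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm with heaviest-edge-first ordering gives a maximum spanning tree.

    edges: list of (weight, u, v) compatibility edges between correspondences.
    Returns the edges kept in the maximum spanning forest.
    """
    parent = list(range(num_nodes))

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest edges first
        ru, rv = find(u), find(v)
        if ru != rv:            # keep the edge only if it does not create a cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# toy usage: 4 correspondences with pairwise compatibility scores
print(maximum_spanning_tree(4, [(0.9, 0, 1), (0.2, 1, 2), (0.8, 2, 3), (0.5, 0, 3)]))
```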

Xin Zhao, Chengzhuan Yang, Zhonglong Zheng
Learning the Dynamic Spatio-Temporal Relationship Between Joints for 3D Human Pose Estimation

Graph Convolutional Networks (GCNs) have shown remarkable success in 3D human pose estimation. Nonetheless, effectively modeling the motion correlation among human joints remains a challenging task. This paper presents a novel Dynamic Learning Specific Motion Graph Convolutional Network (DLSM-GCN) block, aimed at adaptively capturing the kinematic relationships between human joints from diverse inputs. Specifically, it introduces two components: Dynamic Spatio-Temporal Graph Convolution (DSTG) and Dynamic Spatial Second-Order Connectivity Graph Convolution (DSOG). DSTG is devised to model latent motion correlations between joints, while DSOG focuses on capturing second-order connectivity relationships among joints exhibiting vigorous motion. Furthermore, we propose DLSMFormer, which integrates the DLSM-GCN block with the Temporal Transformer Block (TTB). DLSM-GCN and TTB respectively handle the spatial and temporal modeling of human pose in videos, effectively mitigating depth ambiguity and motion uncertainty. Extensive experiments conducted on various datasets demonstrate that DLSMFormer dynamically captures motion correlations among human joints from diverse inputs, thereby enhancing the accuracy of 3D human pose estimation.

Feiyi Xu, Ying Sun, Jin Qi, Yanfei Sun
MaskEditor: Instruct 3D Object Editing with Learned Masks

We introduce MaskEditor, an object-level 3D neural field editing method based on text instructions. Different from manipulating the whole scene, local editing needs accurate locating and proper field fusion to provide a realistic object-level replacement. We utilize a 3D mask grid to accurately localize the target object leveraging the 2D segmentation information provided by the Segment Anything Model (SAM). The whole scene is divided into the object field and background field based on the learned 3D mask. Subsequently, we apply the Variational Score Distillation (VSD) to optimize the object field and leave the background field unaltered, which achieves editing results aligned with text instructions. Furthermore, we implement composited rendering and coarse-to-fine editing strategy to enhance the editing quality and the consistency of the edited object with the original scene. Qualitative and quantitative evaluations confirm that MaskEditor achieves more precise and superior local editing compared to baselines.

Xinyao Liu, Kai Xu, Yuhang Huang, Renjiao Yi, Chenyang Zhu
DyGASR: Dynamic Generalized Gaussian Splatting with Surface Alignment for Accelerated 3D Mesh Reconstruction

Recent advancements in 3D Gaussian Splatting (3DGS), which lead to high-quality novel view synthesis and accelerated rendering, have remarkably improved the quality of radiance field reconstruction. However, extracting a mesh from a massive number of minute 3D Gaussian points remains a great challenge due to the large volume of Gaussians and the difficulty of representing sharp signals caused by their inherent low-pass characteristics. To address this issue, we propose DyGASR, which utilizes generalized Gaussians instead of traditional 3D Gaussians to decrease the number of particles and dynamically optimize the representation of the captured signal. In addition, we observe that reconstructing a mesh with generalized Gaussian splatting without modifications frequently leads to failures, since the Gaussian centroids may not precisely align with the scene surface. To overcome this, we further introduce a Generalized Surface Regularization (GSR), which reduces the smallest scaling vector of each Gaussian to zero and ensures normal alignment perpendicular to the surface, facilitating subsequent Poisson surface mesh reconstruction. Additionally, we propose a dynamic resolution adjustment strategy that utilizes a cosine schedule to gradually increase the image resolution from low to high during the training stage, thus avoiding constant full-resolution training, which significantly boosts the reconstruction speed. Our approach surpasses existing 3DGS-based mesh reconstruction methods, as evidenced by extensive evaluations on various scene datasets, demonstrating a 25% increase in speed and a 30% reduction in memory usage.
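
The dynamic resolution adjustment amounts to a cosine ramp of the training resolution; a small helper like the one below captures the idea. The endpoint resolutions and step count are placeholder values, not the paper's settings.

```python
import math

def resolution_at_step(step, total_steps, low_res=256, full_res=1600):
    """Cosine schedule that ramps the training image resolution from low_res to full_res."""
    t = min(max(step / total_steps, 0.0), 1.0)
    frac = 0.5 * (1.0 - math.cos(math.pi * t))  # smoothly increases from 0 to 1
    return int(round(low_res + frac * (full_res - low_res)))

# early iterations train on small renders, late iterations on full resolution
print([resolution_at_step(s, 30000) for s in (0, 15000, 30000)])  # [256, 928, 1600]
```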

Shengchao Zhao, Yundong Li
MMIDM: Generating 3D Gesture from Multimodal Inputs with Diffusion Models

Multimodal-driven gesture generation has received increasing attention recently. However, a new challenge is how to mine the relationship between multimodal conditional inputs and gestures in order to generate more diverse and realistic gestures. To address this challenge, we propose a novel framework, 3D gesture generation from MultiModal Inputs with Diffusion Models (MMIDM), that can effectively fuse multiple modalities (such as text, music, facial expressions, character information, and emotions) as the condition to guide gesture generation. Specifically, we design a multimodal self-evaluation fusion network to capture the features highly related to gestures and automatically evaluate the importance of different conditional inputs using a mixture-of-experts mechanism. Moreover, we find that a diffusion model guided by multimodal conditions can cause serious jitter problems in the generated gesture motions. To alleviate the jitter problem, we employ a novel timestep embedding strategy in which the timestep embedding is injected into each transformer block of the diffusion model. We evaluate the proposed method on the BEAT multimodal dataset. Experimental results demonstrate the effectiveness of our approach.

Ji Ye, Changhong Liu, Haocong Wan, Aiwen Jiang, Zhenchun Lei
Discriminative-Guided Diffusion-Based Self-supervised Monocular Depth Estimation

Self-supervised monocular depth estimation is a critical task in computer vision. Existing methods can be typically categorized into discriminative-based and generative-based methods according to different data modeling approaches. Discriminative-based methods are distinguished by high accuracy, while generative-based methods are notable for superior robustness. Given that images captured in real-world scenarios are inevitably influenced by various external factors, it is essential to develop a robust and accurate depth estimation algorithm. However, there is limited research on the balance of robustness and accuracy by exploring the interactions between discriminative and generative networks. We propose a generative diffusion-based self-supervised monocular depth estimation algorithm guided by discriminative networks and incorporate a depth interaction constraint. We utilize discriminative networks to optimize image-guided information for the denoising process within the diffusion model. This approach seamlessly combines great robustness with high accuracy. Additionally, to reduce the impact of low texture regions on the reprojection photometric loss, we design a texture-aware discriminatory mask module. This module strengthens the constraint capability of the photometric consistency. We conduct experiments on the KITTI and Make3D datasets. The results demonstrate that our method successfully balances accuracy and robustness.

Runze Liu, Guanghui Zhang, Dongchen Zhu, Lei Wang, Xiaolin Zhang, Jiamao Li
Multiview Light Field Angular Super-Resolution Based on View Alignment and Frequency Attention

The Light Field (LF) camera, renowned for its ability to capture both the intensity and directional aspects of light rays simultaneously, has garnered widespread attention. However, the plenoptic LF camera encounters a constraint in its field of view due to sensor limitations. To address this challenge, this paper introduces a multiview LF angular super-resolution method based on view alignment and frequency attention. Specifically, we first propose an alignment paradigm to acquire the targeted sparse LF image by warping its adjacent two sparse LF views to their target positions. Moreover, an angular frequency attention block is introduced to meticulously discern global high-frequency details within LF images. Subsequently, the structural information is extracted through the application of a gradient-guided method to guarantee the geometric consistency. Comprehensive experiments verify the effectiveness of our proposed method in both single LF angular reconstruction and multiview LF reconstruction tasks.

Deyang Liu, Yifan Mao, Youzhi Zhang, Xin Zheng, Yifan Zuo, Yuming Fang
MagicGS: Combining 2D and 3D Priors for Effective 3D Content Generation

Diffusion-based 3D generative models have seen significant progress recently. However, their further advancement is limited by issues like mode collapse and slow generation speed. In this paper, we present a coarse-to-fine 3D Gaussian generation method named MagicGS, which is capable of efficient and high-quality 3D generation from a single image. Our key contribution is to introduce the Combine-SDS strategy by leveraging both 2D diffusion and 3D diffusion priors, which can improve the optimization process and alleviate oversaturation effects. In the first stage, we optimize the 3D Gaussian with Combine-SDS to obtain rough shapes. In the second stage, we extract the mesh as a 3D representation and optimize it to generate high-quality, textured meshes. Through extensive experimentation, we demonstrate our superior performance in terms of both mesh quality and runtime compared to existing methods. Additionally, our method exhibits versatility by supporting multi-modal 3D generation tasks through integration with conditional diffusion models.

Jiayi Wang, Zhenqiang Li, Yangjie Cao, Jie Li
ESD-Pose: Enhanced Semantic Discrimination for Generalizable 6D Pose Estimation

Existing generalizable object pose estimation frameworks utilize a set of reference images to predict the complete pose of the target object in a query scene; they do not require textured CAD models to generate training data and can handle unseen novel objects during inference. However, current methods suffer from insufficient discriminative capability due to the template matching strategy. Both potential distractors and negative samples with similar appearance can be confused with the foreground, which limits the performance of precise pose estimation. To address these problems, we propose a novel method called ESD-Pose to enhance the discrimination capacity of the framework. Specifically, a semantic interaction aware (SIA) module is introduced to seek semantic consistency among reference images and discrepancies between reference-query pairs. This module mitigates problems related to model deception caused by distractors. To deal with slender objects robustly, we propose a dynamic scale weight learner to generate adaptive weights for multi-scale feature fusion, enabling reasonable utilization of semantic information at different levels. Finally, an IoU-guided loss is designed to align localization and scale prediction, thus facilitating accurate pose estimation. Comprehensive experiments on the LINEMOD and GenMOP datasets demonstrate that ESD-Pose outperforms existing advanced methods, further validating the effectiveness of our method.

Xingyuan Deng, Kangru Wang, Lei Wang, Dongchen Zhu, Jiamao Li
Trans-DONeRF for Transparent Object Rendering with Mixed Depth Prior

The advent of Neural Radiance Fields (NeRF) has significantly impacted the field of novel view synthesis, heralding a new era of methodological advancements. Depth Oracle NeRF (DONeRF), a depth-guided sampling methodology, achieves real-time rendering efficiency as well as compatibility with other NeRF acceleration methods. However, relying on the surface depth of transparent objects from synthetic datasets can result in blurry and chaotic renderings, as it mistakenly focuses on the surface position. The problem is that the observed radiance is not emitted from the transparent surface, but rather from the virtual image of the object behind it. In response to this challenge, we introduce Trans-DONeRF to improve the rendering fidelity of transparent objects whilst preserving DONeRF’s rendering efficiency. Trans-DONeRF incorporates a modular plug-and-play component, Multi-View Grounded-SAM (MV-GSAM), which autonomously segments transparent objects with multi-view consistency by exploiting textually and semantically aligned features. On this basis, we design a refined depth prior, Classified Mixed Depth, and an SDF-based Density Loss tailored to transparent surfaces for DONeRF training. Comprehensive experiments validate the superiority of our approach in enhancing the quality of multi-view segmentation and transparent object rendering. Furthermore, we release a synthetic dataset of glasses and mirrors to fill a gap in related research; it is available at Google Drive ( https://drive.google.com/drive/folders/1Fs1OeTY4-6x2ZuS7ZGvgA0ejWGGFP22v?usp=drive_link ).

Jiangnan Ye, Taoqi Bao, Lianrui Mu, Yuchen Yang, Jiedong Zhuang, Xiaoyu Liang, Jiayi Xu, Haoji Hu
SFDNeRF: A Semantic Feature-Driven Few-Shot Neural Radiance Field Framework with Hybrid Regularization

Few-shot 3D scene reconstruction remains a challenge due to the limited number of viewpoints available for rendering high-quality images. The proposed framework, SFDNeRF, addresses this by integrating semantic feature-driven constraints and hybrid regularization techniques to enhance Neural Radiance Fields (NeRF). By leveraging an improved CLIP model, Alpha-CLIP, for semantic feature extraction and applying a novel combination of frequency and geometric regularization, SFDNeRF significantly reduces the number of required input images to as few as 3 to 8 while still achieving state-of-the-art performance in scene reconstruction quality. Our method not only excels in synthesizing photorealistic and semantically coherent views but also demonstrates robustness against data scarcity. Extensive evaluations on the LLFF and NeRF Blender datasets show that SFDNeRF outperforms existing few-shot NeRF methods, establishing a new benchmark for few-shot 3D scene synthesis. Our ablation studies and qualitative analyses further attest to the individual and collective strengths of the framework’s components, advocating for its efficacy. SFDNeRF’s advancements pave the way for future research in dynamic scene reconstruction and video synthesis with limited data.

Xing Wang, Bin Zhang
TriEn-Net: Non-parametric Representation Learning for Large-Scale Point Cloud Semantic Segmentation

Large-scale point cloud semantic segmentation is crucial for various real-world applications such as autonomous driving and remote sensing. Existing methods either fail to explore large-range contextual information, require preprocessing steps, or cannot adequately exploit point cloud features. To address these challenges, we propose TriEn-Net, an efficient framework for large-scale point cloud semantic segmentation. TriEn-Net directly inputs the whole point clouds to capture large-range contextual information and distinguishes the point cloud representations as points and high-dimensional features. By developing the complementary Non-parametric Encoding Module and Non-parametric Attention Module to encode spatial coordinates and semantic features respectively, TriEn-Net captures rich information and enhances feature representation capability without introducing additional parameters. Comprehensive experiments on S3DIS and SemanticKITTI datasets demonstrate TriEn-Net’s superior performance even with fewer parameters, making it a promising solution for efficient and effective semantic segmentation on large-scale point clouds.

Yifei Wang, Jixiang Miao, Anan Du, Xiang Gu, Shuchao Pang
Decomposed Latent Diffusion Model for 3D Point Cloud Generation

Latent diffusion models have recently achieved significant success in point cloud generation, where the diffusion process is constructed in a low-dimensional but efficient latent space. However, existing methods usually overlook the differences between consistency information and offset information in point clouds, making it difficult to accurately learn both the overall shape and the offsets of points on the shape simultaneously. To address this issue, we propose a decomposed latent diffusion model that separately captures consistency information and offset information in the latent space with feature decoupling. To learn effective consistency information, a consistency constraint among different point clouds of the same shape is imposed in the latent space. Then, based on the decomposed features, we further design a geometry diffusion model. We predict key points with consistency information to guide the diffusion model. Therefore, the diffusion model can achieve comprehensive and strong geometry feature extraction. Experiments show that our method achieves state-of-the-art generation performance on the ShapeNet dataset.

Runfeng Zhao, Junzhong Ji, Minglong Lei
Learning Multi-Branch Attention Networks for 3D Face Reconstruction

The traditional 3D morphable model (3DMM) techniques generally regress model coefficients directly, neglecting the critical 2D spatial and semantic edge information. To address this limitation, we propose a multi-branch attention network (MBAN) designed to reconstruct 3D faces from monocular outdoor images, effectively mitigating edge feature loss. Our approach includes preprocessing through image cropping and adaptive edge feature learning using an edge facial feature extraction module. Additionally, we introduce a global feature extraction module based on an enhanced deformator, which enables the interaction between semantic and spatial information of global features for fine-grained representation. This study presents three attention mechanisms: the edge attention mechanism (EAM), the global attention mechanism (GAM), and the local attention mechanism (LAM). These mechanisms establish correlations among edge, global, and local features, enhancing feature extraction at various levels and facilitating multi-level interactive feature learning. This approach addresses the challenges of 3D face reconstruction under varying poses and complex environments, thereby improving the robustness of the reconstruction process. Extensive experiments on the AFLW2000-3D and AFLW datasets validate the effectiveness of our proposed method.

Lei Ma, Zhengwei Yang, Yange Wang, Xiangzheng Li
CP-VoteNet: Contrastive Prototypical VoteNet for Few-Shot Point Cloud Object Detection

Few-shot point cloud 3D object detection (FS3D) aims to identify and localise objects of novel classes from point clouds, using knowledge learnt from annotated base classes and novel classes with very few annotations. Thus far, this challenging task has been approached using prototype learning, but the performance remains far from satisfactory. We find that in existing methods, the prototypes are only loosely constrained and lack fine-grained awareness of the semantic and geometrical correlations embedded within the point cloud space. To mitigate these issues, we propose to leverage the inherent contrastive relationship within the semantic and geometrical subspaces to learn more refined and generalisable prototypical representations. To this end, we first introduce contrastive semantics mining, which enables the network to extract discriminative categorical features by constructing positive and negative pairs within training batches. Meanwhile, since point features representing local patterns can be clustered into geometric components, we further propose to impose the contrastive relationship at the primitive level. Through refined primitive geometric structures, the transferability of feature encoding from base to novel classes is significantly enhanced. The above designs and insights lead to our novel Contrastive Prototypical VoteNet (CP-VoteNet). Extensive experiments on two FS3D benchmarks, FS-ScanNet and FS-SUNRGBD, demonstrate that CP-VoteNet surpasses current state-of-the-art methods by considerable margins across different FS3D settings. Further ablation studies corroborate the rationale and effectiveness of our designs.
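
The contrastive semantics mining step builds positive and negative pairs inside a training batch; in spirit this is a supervised contrastive objective over proposal features. The sketch below is a generic version of such a loss, not CP-VoteNet's exact formulation, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def batch_contrastive_loss(features, labels, temperature=0.1):
    """Pull together features of proposals with the same class label, push apart the rest.

    features: (N, C) proposal embeddings from one training batch
    labels:   (N,)   class ids of the proposals
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                               # (N, N) scaled cosine similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye).float()
    sim = sim.masked_fill(eye, float("-inf"))                   # a sample never contrasts with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)                   # avoid 0 * (-inf) in the sum below
    denom = pos_mask.sum(1).clamp(min=1.0)
    return -(pos_mask * log_prob).sum(1).div(denom).mean()
```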

Xuejing Li, Weijia Zhang, Chao Ma
Cross Modality Fusion Network with Feature Alignment and Salient Object Exchange for Single Image 3D Shape Retrieval

Image-based 3D shape retrieval (IBSR) aims to retrieve 3D shapes that are similar to a query image. Most methods consider metric learning, which involves mapping images and 3D shapes to a low-dimensional space. This enables greater similarity between images and 3D shapes of the same instance, while images and 3D shapes of different instances remain dissimilar. However, most existing methods do not consider the fusion of information across modalities. By leveraging the complementary knowledge contained in different modalities, integrating data from different modalities into a single representation comprehensively represents the data, which enhances the data representation capability and thus facilitates retrieval. Therefore, we propose a new method that takes into account information across different modalities. Firstly, we introduce a cross modality fusion network, which is primarily an attention mechanism network. By employing this attention mechanism network to fuse modal information, the network can determine the probability of similarity between the input query image and a 3D shape. Secondly, to alleviate the difficulty of modal fusion, we propose a feature alignment module based on contrastive learning. This module includes instance discrimination and cross domain feature alignment modules, which align features before modal fusion. Finally, we propose salient object exchange, which further assists modal fusion. Experiments on three commonly used datasets, i.e., Pix3D, Stanford Cars, and CompCars, demonstrate the effectiveness of the proposed method.

Zhenyu Diao, Dongmei Niu, Xiaofan Han, Xiuyang Zhao
Enhanced Spatial Adaptive Fusion Network For Video Super-Resolution

In the field of Video Super-Resolution (VSR), recurrent structures are frequently used in the designed network architectures, yet they face significant challenges in effective information transmission and accurate alignment. To address this issue, this study introduces a concise and efficient network, named Enhanced Spatial Adaptive Fusion Network (ESAFN), to enhance information flow and accuracy in VSR. The core strategy of ESAFN involves a Coupled Propagation Spatially Adaptive Module (CPSAM) and an Implicit Alignment Mechanism (IAM), innovatively restructuring the classical BasicVSR architecture. The proposed method first promotes the effective integration of forward and backward information, then dynamically selects the most representative features across multiple scales, and further explores contextual information through Spatially Mixed Convolution (SMC) technology. With the implicit alignment strategy, our method significantly enhances the capability to restore high-frequency details in super-resolved videos, markedly surpassing current advanced technologies.

Boyue Li, Xin Zhao, Shiqian Yuan, Rushi Lan, Xiaonan Luo
Multi-3D Occlusion Mask Learning for Flexible Occlusion Removal in Neural Radiance Fields

As NeRF modeling becomes more widely available, there is an increasing demand for the ability to flexibly and conveniently exclude unnecessary obstructions during the modeling process. Existing methods generally adopt an “ignore” strategy for occlusions, which cannot conveniently and flexibly remove arbitrary occlusions in complex scenes. We propose a new method that only requires a small number of external occlusion annotations to model independent 3D masks for different occlusions in space. This “model first, remove later” occlusion removal strategy allows us to model the scene in a single process and obtain unobstructed images from any desired viewpoint, with any specific obstruction or multiple obstructions removed. Experimental results on existing datasets and our synthesized datasets validate the effectiveness of our method and strategy.

Zhuoyu Shi, Shuo Zhang, Song Chang, Youfang Lin
Sketch-Based 3D Shape Retrieval Via Cross-Modal Contrastive Learning and Difficulty-Aware Uncertainty Regularization

Sketch-based 3D shape retrieval (SBSR) aims to find, from a repository, the 3D shapes most similar to a user-drawn sketch. There are two key challenges for this task. Firstly, sketch and 3D shape are different modalities of object representation, and there exists a domain gap between them. Secondly, sketches tend to be noisy, since not all sketchers can be expected to have a consistently good level of drawing skill, which may cause overfitting. In this work, we propose approaches to effectively overcome these challenges. For cross-domain feature alignment, we employ cross-modal contrastive learning, which avoids the inefficiency and instability issues of the triplet-based training usually adopted by existing SBSR methods. Further, in order to mitigate the negative impact of noisy sketch data on network learning, we employ a relative Mahalanobis distance based metric to measure sketch sample difficulty and introduce difficulty-aware uncertainty regularization into the loss function. Experiments conducted on the SHREC’13 and SHREC’14 datasets demonstrate the state-of-the-art performance of our proposed model and the effectiveness of our proposed algorithmic components.
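
The sample-difficulty measure is based on a relative Mahalanobis distance, i.e. a class-conditional Mahalanobis distance minus a class-agnostic one; the sketch below illustrates that quantity. The statistics would be estimated from training features in practice, and the function names are hypothetical.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov_inv):
    d = x - mean
    return float(d @ cov_inv @ d)

def relative_mahalanobis(x, class_mean, class_cov_inv, global_mean, global_cov_inv):
    """Class-conditional distance minus class-agnostic distance; larger values can be
    read as a harder or more atypical sketch sample (illustrative interpretation)."""
    return mahalanobis_sq(x, class_mean, class_cov_inv) - mahalanobis_sq(x, global_mean, global_cov_inv)

# toy usage with 2-D features and statistics estimated from random data
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 2))
mu, cov_inv = feats.mean(0), np.linalg.inv(np.cov(feats.T))
x = np.array([2.0, -1.0])
print(relative_mahalanobis(x, mu + 0.5, cov_inv, mu, cov_inv))
```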

Wentao Hou, Zhenyu Diao, Jingliang Peng
Residual Hybrid Attention Enhanced Video Super-Resolution with Cross Convolution

In video super-resolution reconstruction, traditional methods often fall short in capturing details, particularly in edges and occluded areas, which affects the realism and clarity of the images. To address this issue, we propose a novel model, the Residual Hybrid Attention-Enhanced Video Super-Resolution model augmented with Cross Convolution techniques, denoted RCVSR. The model ingeniously integrates a residual hybrid attention mechanism, refining the learning of global and local features through parallel channel attention and self-attention mechanisms. Simultaneously, our model introduces overlapping cross-attention blocks to enhance dynamic interactions between frames, thereby boosting the model’s performance. Furthermore, the design of the cross-convolution blocks allows for parallel processing of vertical and horizontal gradient information in images, effectively extracting edge details. In multiple benchmark tests, the RCVSR model demonstrates excellent reconstruction quality and outstanding performance.

Shiqian Yuan, Boyue Li, Xin Zhao, Rushi Lan, Xiaonan Luo
SDFReg: Learning Signed Distance Functions for Point Cloud Registration

Learning-based point cloud registration methods can handle clean point clouds well, while it is still challenging to generalize to noisy, partial, and density-varying point clouds. To this end, we propose a novel point cloud registration framework for these imperfect point clouds. By introducing a neural implicit representation, we replace the problem of rigid registration between point clouds with a registration problem between the point cloud and the neural implicit function. We then propose to alternately optimize the implicit function and the registration between the implicit function and point cloud. In this way, point cloud registration can be performed in a coarse-to-fine manner. By fully capitalizing on the capabilities of the neural implicit function without computing point correspondences, our method showcases remarkable robustness in the face of challenges such as noise, incompleteness, and density changes of point clouds.

Leida Zhang, Zhengda Lu, Kai Liu, Yiqun Wang
Unfolding Gradient Graph Regularization for Point Cloud Color Denoising

Due to the cost and accuracy limitations of current point cloud sampling equipment, the obtained point color information is often corrupted by various kinds of noise. Existing point cloud denoising algorithms mainly focus on smoothness priors and convex optimization. Their performance highly depends on model parameters whose values are determined manually and fixed throughout the iterations. In this paper, we propose to unfold gradient graph regularization with deep neural networks for point cloud color denoising. This improves the robustness of the model for denoising on different kinds of datasets and across domains. Specifically, our approach first uses a point cloud extraction network to obtain effective features for gradient computation. Then, we construct a gradient graph Laplacian regularization (GGLR) as a signal smoothness prior for point cloud restoration. Finally, we introduce shallow neural networks for model parameter estimation to unfold GGLR. The proposed point cloud denoising framework is fully differentiable and can be trained end-to-end. Experiments show that the proposed algorithm-unfolding approach outperforms several existing point cloud color denoising techniques.
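
As background on the kind of prior being unfolded, the snippet below computes a plain graph Laplacian smoothness term, c^T L c summed over color channels, on a k-nearest-neighbour graph with Gaussian edge weights. This is only a simplified illustration; the paper's GGLR is built on gradient graphs and its parameters are learned rather than fixed.

```python
import numpy as np
from scipy.spatial import cKDTree

def graph_laplacian_prior(points, colors, k=8, sigma=0.05):
    """Smoothness prior: sum over color channels of c^T L c for a kNN graph Laplacian L."""
    n = len(points)
    dists, idx = cKDTree(points).query(points, k=k + 1)  # the first neighbour is the point itself
    w = np.exp(-(dists[:, 1:] ** 2) / (2 * sigma ** 2))  # Gaussian edge weights
    rows = np.repeat(np.arange(n), k)
    W = np.zeros((n, n))
    W[rows, idx[:, 1:].ravel()] = w.ravel()
    W = np.maximum(W, W.T)                               # symmetrise the graph
    L = np.diag(W.sum(1)) - W                            # combinatorial graph Laplacian
    return float(np.trace(colors.T @ L @ colors))        # small value = colors vary smoothly over the graph

# toy usage: random point positions and colors
pts = np.random.rand(200, 3)
cols = np.random.rand(200, 3)
print(graph_laplacian_prior(pts, cols))
```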

Hongtao Wang, Fei Chen, Wanling Liu, Xunxun Zeng
ER-SFM: Efficient and Robust Cluster-Based Structure from Motion

Structure from Motion (SfM) is a fundamental computer vision technique that recovers scene structure and camera motion from multi-view images. When facing large-scale scenarios, cluster-based methods are commonly employed to improve reconstruction efficiency. However, these methods currently face challenges regarding their limited robustness, redundant computation, and drift. To address these issues, we propose a unified pipeline called ER-SfM, which enhances the three key aspects of cluster-based SfM: image clustering, local reconstruction, and merging. In terms of image clustering, we propose a three-stage image clustering method to ensure adequate and reliable connections between clusters. In the local reconstruction stage, we expedite the reconstruction process by eliminating duplicate point cloud computation. In the final merging stage, we introduce a global merging algorithm without scale ambiguity to address the drift problem. Extensive experimental results demonstrate the superior performance of our method in terms of both robustness and efficiency compared to state-of-the-art methods.

Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou
Multimodal Token Fusion and Optimization for 3D Human Mesh Reconstruction with Transformers

3D human recovery has attracted great attention and shown potential in games and movies. Confronting challenges related to occlusion and depth blurring in 3D human reconstruction, transformer encoder architectures have made good progress in learning the connections between various parts of the human body. Nevertheless, for the input to the model, the differences between image tokens and vertex-joint tokens of different modalities still limit the reconstruction capability of the 3D human mesh. To overcome this limitation, we propose a module based on a multimodal cross-feature fusion mechanism that directly fuses 2D images and 3D spatial coordinates to reconstruct a better human mesh. Our approach employs a large-kernel attention strategy to improve the understanding of image features for spatial long-range relationships. We also design a token shift module for joints and vertices to learn interactions between vertices. Quantitative and qualitative experiments on large-scale human datasets such as 3DPW and Human3.6M show that our method achieves excellent reconstruction accuracy.

Yang Jiang, Sunli Wang, Mingyang Sun, Dongliang Kou, Qiangbin Xie, Lihuang Zhang
Backmatter
Metadata
Title
Pattern Recognition and Computer Vision
Edited by
Zhouchen Lin
Ming-Ming Cheng
Ran He
Kurban Ubul
Wushouer Silamu
Hongbin Zha
Jie Zhou
Cheng-Lin Liu
Copyright Year
2025
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-97-8508-7
Print ISBN
978-981-97-8507-0
DOI
https://doi.org/10.1007/978-981-97-8508-7