2025 | Book

Computer Vision – ECCV 2024

18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLIV


About this book

The multi-volume set of LNCS books with volume numbers 15059 up to 15147 constitutes the refereed proceedings of the 18th European Conference on Computer Vision, ECCV 2024, held in Milan, Italy, during September 29–October 4, 2024.

The 2387 papers presented in these proceedings were carefully reviewed and selected from a total of 8585 submissions. The papers deal with topics such as computer vision; machine learning; deep neural networks; reinforcement learning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3d reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; motion estimation.

Table of Contents

Frontmatter
ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion
Abstract
Recent progress in personalizing text-to-image (T2I) diffusion models has demonstrated their capability to generate images based on personalized visual concepts using only a few user-provided examples. However, these models often struggle with maintaining high visual fidelity, particularly when modifying scenes according to textual descriptions. To address this challenge, we introduce ComFusion, an innovative approach that leverages pretrained models to create compositions of user-supplied subject images and predefined text scenes. ComFusion incorporates a class-scene prior preservation regularization, which combines subject class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse-generated images to ensure alignment with both the instance images and scene texts, thereby achieving a delicate balance between capturing the subject’s essence and maintaining scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority.
Yan Hong, Yuxuan Duan, Bo Zhang, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang
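For the entry above, the class-scene prior preservation regularization builds on the familiar prior-preservation idea used in personalization methods such as DreamBooth. Below is a minimal, hedged sketch of that generic loss shape in PyTorch; the function name, arguments, and the way prior batches are formed are illustrative assumptions, not ComFusion's implementation.

```python
import torch
import torch.nn.functional as F

def prior_preservation_loss(eps_pred_inst, eps_inst,
                            eps_pred_prior, eps_prior, lam=1.0):
    """Generic prior-preservation objective (DreamBooth-style sketch).

    eps_pred_inst / eps_inst  : predicted vs. sampled noise on the
                                user-provided instance images.
    eps_pred_prior / eps_prior: predicted vs. sampled noise on images
                                generated from class/scene prompts by the
                                frozen pretrained model.
    """
    instance_term = F.mse_loss(eps_pred_inst, eps_inst)
    prior_term = F.mse_loss(eps_pred_prior, eps_prior)
    return instance_term + lam * prior_term

# dummy usage with random tensors shaped like latent noise
if __name__ == "__main__":
    shape = (2, 4, 64, 64)
    loss = prior_preservation_loss(torch.randn(shape), torch.randn(shape),
                                   torch.randn(shape), torch.randn(shape))
    print(loss.item())
```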
ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency
Abstract
Recent advances in point cloud registration mostly leverage geometric information. Although these methods have yielded promising results, they still struggle with problems of low overlap, thus limiting their practical usage. In this paper, we propose ML-SemReg, a plug-and-play point cloud registration framework that fully exploits semantic information. Our key insight is that mismatches can be categorized into two types, i.e., inter- and intra-class, after rendering semantic clues, and can be well addressed by utilizing multi-level semantic consistency. We first propose a Group Matching module to address inter-class mismatching, outputting multiple matching groups that inherently satisfy Local Semantic Consistency. For each group, a Mask Matching module based on Scene Semantic Consistency is then introduced to suppress intra-class mismatching. Benefiting from these two modules, ML-SemReg generates correspondences with a high inlier ratio. Extensive experiments demonstrate the excellent performance and robustness of ML-SemReg, e.g., in hard cases of the KITTI dataset, the Registration Recall of MAC increases by almost 34 percentage points when our ML-SemReg is equipped. Code is available at https://github.com/Laka-3DV/ML-SemReg.
Shaocheng Yan, Pengcheng Shi, Jiayuan Li
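The inter-class filtering idea in the entry above—keep only correspondences whose two endpoints carry the same semantic label, then group them by label—can be illustrated in a few lines of NumPy. This is a hedged sketch of the general principle, not the paper's Group Matching or Mask Matching modules.

```python
import numpy as np

def group_by_semantic_label(matches, labels_src, labels_tgt):
    """Keep correspondences whose endpoints share a semantic label and
    group them by that label (a sketch of 'local semantic consistency').

    matches    : (M, 2) int array of (src_idx, tgt_idx) putative matches
    labels_src : (Ns,) semantic label per source point
    labels_tgt : (Nt,) semantic label per target point
    """
    src_lab = labels_src[matches[:, 0]]
    tgt_lab = labels_tgt[matches[:, 1]]
    keep = src_lab == tgt_lab                      # drop inter-class mismatches
    groups = {}
    for lab in np.unique(src_lab[keep]):
        groups[lab] = matches[keep & (src_lab == lab)]
    return groups

# toy example: 5 putative matches, two semantic classes
matches = np.array([[0, 1], [1, 2], [2, 0], [3, 4], [4, 3]])
labels_src = np.array([0, 0, 1, 1, 1])
labels_tgt = np.array([1, 0, 0, 1, 1])
print(group_by_semantic_label(matches, labels_src, labels_tgt))
```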
Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation
Abstract
Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. Supervised approaches rely heavily on laborious annotations and exhibit hampered generalization ability due to the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages mask as supervision for unsupervised 3D pose estimation. With general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that exploit accurate pose information from coarse to fine. Compared with previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way, which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate our state-of-the-art pose estimation performance on the Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets also illustrate the capability to access more data to boost our model. Code will be available at https://github.com/Charrrrrlie/Mask-as-Supervision.
Yuchen Yang, Yu Qiao, Xiao Sun
MoVideo: Motion-Aware Video Generation with Diffusion Model
Abstract
While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the latter describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that exists or is generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan
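The flow-based building blocks mentioned in the entry above—warping one frame toward another with optical flow and deriving an occlusion mask from forward-backward consistency—are standard operations. The PyTorch sketch below shows that generic operation in pixel space; it is not the paper's latent-space pipeline, and the threshold value is an arbitrary assumption.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(src, flow):
    """Backward-warp `src` (B,C,H,W) with optical flow (B,2,H,W) in pixels."""
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(src.device)      # (2,H,W)
    coords = grid.unsqueeze(0) + flow                                # sample positions
    # normalize to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)            # (B,H,W,2)
    return F.grid_sample(src, grid_norm, align_corners=True)

def occlusion_mask(flow_fw, flow_bw, thresh=1.0):
    """Forward-backward consistency check: 1 = visible, 0 = occluded."""
    flow_bw_warped = warp_with_flow(flow_bw, flow_fw)
    err = (flow_fw + flow_bw_warped).norm(dim=1, keepdim=True)
    return (err < thresh).float()

frame = torch.rand(1, 3, 64, 64)
flow_fw = torch.zeros(1, 2, 64, 64)
flow_bw = torch.zeros(1, 2, 64, 64)
print(warp_with_flow(frame, flow_fw).shape, occlusion_mask(flow_fw, flow_bw).mean())
```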
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
Abstract
Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters while grappling with memory challenges during fine-tuning. To address this, memory-efficient transfer learning (METL) methods avoid backpropagating gradients through the large backbone. However, they compromise by exclusively relying on frozen intermediate outputs and limiting the exhaustive exploration of prior knowledge from pre-trained models. Moreover, the dependency and redundancy between cross-layer features are frequently overlooked, thereby submerging more discriminative representations and causing an inherent performance gap (vs. conventional PETL methods). Hence, we propose an innovative METL strategy called SHERL for resource-limited scenarios that decouples the entire adaptation into two successive and complementary processes. In the early route, intermediate outputs are consolidated via an anti-redundancy operation, enhancing their compatibility for subsequent interactions; in the late route, utilizing minimal late pre-trained layers alleviates the peak demand on memory overhead and regulates these fairly flexible features into more adaptive and powerful representations for new domains. Extensive ablations on vision-and-language and language-only tasks show that SHERL combines the strengths of both parameter- and memory-efficient techniques, performing on par or better across diverse architectures with lower memory during fine-tuning. Our code is publicly available at: https://github.com/Paranioar/SHERL.
Haiwen Diao, Bo Wan, Xu Jia, Yunzhi Zhuge, Ying Zhang, Huchuan Lu, Long Chen
MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection
Abstract
Monocular 3D object detection (Mono 3Det) aims to identify 3D objects from a single RGB image. However, existing methods often assume training and test data follow the same distribution, which may not hold in real-world test scenarios. To address the out-of-distribution (OOD) problems, we explore a new adaptation paradigm for Mono 3Det, termed Fully Test-time Adaptation, which aims to adapt a well-trained model to unlabeled test data by handling potential data distribution shifts at test time. However, applying this paradigm in Mono 3Det poses significant challenges due to OOD test data causing a remarkable decline in object detection scores. This decline conflicts with the pre-defined score thresholds of existing detection methods, leading to severe object omissions (i.e., rare positive detections and many false negatives). Consequently, the limited positive detections and plentiful noisy predictions cause test-time adaptation to fail in Mono 3Det. To handle this problem, we propose a novel Monocular Test-Time Adaptation (MonoTTA) method, based on two new strategies. 1) Reliability-driven adaptation: we empirically find that high-score objects are still reliable and the optimization of high-score objects can enhance confidence across all detections. Thus, we devise a self-adaptive strategy to identify reliable objects for model adaptation, which discovers potential objects and alleviates omissions. 2) Noise-guard adaptation: since high-score objects may be scarce, we develop a negative regularization term to exploit the numerous low-score objects via negative learning, preventing overfitting to noise and trivial solutions. Experimental results show that MonoTTA brings significant performance gains for Mono 3Det models in OOD test scenarios, approximately 190% gains on average on KITTI and 198% gains on nuScenes. The source code is now available at Hongbin98/​MonoTTA.
Hongbin Lin, Yifan Zhang, Shuaicheng Niu, Shuguang Cui, Zhen Li
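Negative learning on low-confidence predictions—pushing down the probability of detections the model is unsure about instead of trusting them as pseudo-labels—is a generic idea that the entry above builds on. The sketch below illustrates that general term only; the threshold is arbitrary and the paper's exact Noise-guard formulation is not reproduced here.

```python
import torch

def negative_learning_loss(scores, low_thresh=0.2, eps=1e-6):
    """Generic negative-learning regularizer on low-score detections.

    scores: (N,) per-detection confidence in [0, 1].
    Low-score detections are treated as likely-wrong positives, so we
    maximize log(1 - p), i.e. minimize -log(1 - p), for them.
    """
    low = scores[scores < low_thresh]
    if low.numel() == 0:
        return scores.new_zeros(())
    return -(torch.log(1.0 - low + eps)).mean()

print(negative_learning_loss(torch.tensor([0.9, 0.15, 0.05])))
```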
RangeLDM: Fast Realistic LiDAR Point Cloud Generation
Abstract
Autonomous driving demands high-quality LiDAR data, yet the cost of physical LiDAR sensors presents a significant scaling-up challenge. While recent efforts have explored deep generative models to address this issue, they often consume substantial computational resources with slow generation speeds while suffering from a lack of realism. To address these limitations, we introduce RangeLDM, a novel approach for rapidly generating high-quality range-view LiDAR point clouds via latent diffusion models. We achieve this by correcting range-view data distribution for accurate projection from point clouds to range images via Hough voting, which has a critical impact on generative learning. We then compress the range images into a latent space with a variational autoencoder, and leverage a diffusion model to enhance expressivity. Additionally, we instruct the model to preserve 3D structural fidelity by devising a range-guided discriminator. Experimental results on KITTI-360 and nuScenes datasets demonstrate both the robust expressiveness and fast speed of our LiDAR point cloud generation.
Qianjiang Hu, Zhimin Zhang, Wei Hu
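The range-view representation used in the entry above—projecting a LiDAR point cloud onto a spherical (azimuth, inclination) grid—is a standard preprocessing step. A hedged NumPy sketch follows, with illustrative field-of-view values and without the Hough-voting distribution correction the paper describes.

```python
import numpy as np

def pointcloud_to_range_image(points, h=64, w=1024,
                              fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N,3) LiDAR points to an (h,w) range image by spherical coords.
    FOV values are illustrative (roughly KITTI-like), not calibrated."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1) + 1e-8
    yaw = np.arctan2(y, x)                          # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                        # inclination
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = 0.5 * (1.0 - yaw / np.pi) * w               # column index
    v = (fov_up - pitch) / (fov_up - fov_down) * h  # row index
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)
    img = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-r)                          # keep the closest return per pixel
    img[v[order], u[order]] = r[order]
    return img

pts = np.random.randn(1000, 3) * 10
print(pointcloud_to_range_image(pts).shape)
```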
Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation
Abstract
In this paper, we propose a unified framework aimed at enhancing the diffusion priors for 3D generation tasks. Despite the critical importance of these tasks, existing methodologies often struggle to generate high-caliber results. We begin by examining the inherent limitations in previous diffusion priors. We identify a divergence between the diffusion priors and the training procedures of diffusion models that substantially impairs the quality of 3D generation. To address this issue, we propose a novel, unified framework that iteratively optimizes both the 3D model and the diffusion prior. Leveraging the different learnable parameters of the diffusion prior, our approach offers multiple configurations, affording various trade-offs between performance and implementation complexity. Notably, our experimental results demonstrate that our method markedly surpasses existing techniques, establishing a new state of the art in the realm of text-to-3D generation. Additionally, our framework yields insightful contributions to the understanding of recent score distillation methods, such as the VSD loss and CSD loss. Code: https://yangxiaofeng.github.io/demo_diffusion_prior.
Xiaofeng Yang, Yiwen Chen, Cheng Chen, Chi Zhang, Yi Xu, Xulei Yang, Fayao Liu, Guosheng Lin
Be-Your-Outpainter: Mastering Video Outpainting Through Input-Specific Adaptation
Abstract
Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA (Mastering Video Outpainting Through Input-Specific Adaptation), a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase involves conducting efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, as well as bridging the gap between standard generative processes and outpainting. The subsequent phase, pattern-aware outpainting, is dedicated to the generalization of these learned patterns to generate outpainting outcomes. Additional strategies are proposed to better leverage the diffusion model's generative prior and the acquired video patterns from source videos for inference. Extensive evaluations underscore MOTIA's superiority, outperforming existing state-of-the-art methods in widely recognized benchmarks. Notably, these advancements are achieved without necessitating extensive, task-specific tuning. More details are available at https://be-your-outpainter.github.io/.
Fu-Yun Wang, Xiaoshi Wu, Zhaoyang Huang, Xiaoyu Shi, Dazhong Shen, Guanglu Song, Yu Liu, Hongsheng Li
Physically Plausible Color Correction for Neural Radiance Fields
Abstract
Neural Radiance Fields have become the representation of choice for many 3D computer vision and computer graphics applications, e.g., novel view synthesis and 3D reconstruction. Multi-camera systems are commonly used as the image capture setup in NeRF-based multi-view tasks such as dynamic scene acquisition or realistic avatar animation. However, a critical issue that has often been overlooked in this setup is the evident differences in color responses among multiple cameras, which adversely affect the NeRF reconstruction performance. These color discrepancies among multiple input images stem from two aspects: 1) implicit properties of the scenes such as reflections and shadings, and 2) external differences in camera settings and lighting conditions. In this paper, we address this problem by proposing a novel color correction module that simulates the physical color processing in cameras to be embedded in NeRF, enabling the unified color NeRF reconstruction. Besides the view-independent color correction module for external differences, we predict a view-dependent function to minimize the color residual (including, e.g., specular and shading) to eliminate the impact of inherent attributes. We further describe how the method can be extended with a reference image as guidance to achieve aesthetically plausible color consistency and color translation on novel views. Experiments validate that our method is superior to baseline methods in both quantitative and qualitative evaluations of color correction and color consistency.
Qi Zhang, Ying Feng, Hongdong Li
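The view-independent part of a color-correction module like the one in the entry above can be pictured as a small learnable affine color transform applied per camera to the radiance field's RGB output. The module below is a hedged, generic sketch of that idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PerCameraColorCorrection(nn.Module):
    """Learnable 3x3 color matrix + bias per camera, applied to rendered RGB."""

    def __init__(self, num_cameras):
        super().__init__()
        eye = torch.eye(3).unsqueeze(0).repeat(num_cameras, 1, 1)
        self.matrix = nn.Parameter(eye)                 # initialized to identity
        self.bias = nn.Parameter(torch.zeros(num_cameras, 3))

    def forward(self, rgb, cam_idx):
        # rgb: (N, 3) colors rendered along rays; cam_idx: (N,) camera ids
        m = self.matrix[cam_idx]                        # (N, 3, 3)
        b = self.bias[cam_idx]                          # (N, 3)
        out = torch.bmm(m, rgb.unsqueeze(-1)).squeeze(-1) + b
        return out.clamp(0.0, 1.0)

cc = PerCameraColorCorrection(num_cameras=4)
rgb = torch.rand(8, 3)
print(cc(rgb, torch.randint(0, 4, (8,))).shape)
```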
Unifying 3D Vision-Language Understanding via Promptable Queries
Abstract
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representations and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input.
Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
Model Stock: All We Need Is Just a Few Fine-Tuned Models
Abstract
This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance. Breaking away from traditional practices that need a multitude of fine-tuned models for averaging, our approach employs significantly fewer models to achieve final weights yet yields superior accuracy. Drawing from key insights in the weight space of fine-tuned weights, we uncover a strong link between the performance and proximity to the center of weight space. Based on this, we introduce a method that approximates a center-close weight using only two fine-tuned models, applicable during or after training. Our innovative layer-wise weight averaging technique surpasses state-of-the-art model averaging methods such as Model Soup, utilizing only two fine-tuned models. This strategy can be aptly coined Model Stock, highlighting its reliance on selecting a minimal number of models to draw a more optimized-averaged model. We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures, achieving remarkable performance on both ID and OOD tasks on the standard benchmarks, all while barely bringing extra computational demands. Our code and pre-trained models are available at https://github.com/naver-ai/model-stock.
Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han
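The averaging step behind the entry above is easy to picture: a layer-wise average of two fine-tuned checkpoints, optionally pulled toward the pre-trained weights. The sketch below is a plain average with a fixed interpolation factor and does not implement the paper's geometry-derived per-layer ratio.

```python
import torch

def average_two_finetuned(state_a, state_b, state_pretrained=None, alpha=1.0):
    """Layer-wise average of two fine-tuned state dicts.

    If state_pretrained is given, the average is further interpolated toward
    the pre-trained weights by (1 - alpha); alpha=1.0 is a plain average.
    """
    merged = {}
    for k in state_a:
        avg = 0.5 * (state_a[k] + state_b[k])
        if state_pretrained is not None:
            avg = alpha * avg + (1.0 - alpha) * state_pretrained[k]
        merged[k] = avg
    return merged

# toy usage with two 'fine-tuned' linear layers
a = torch.nn.Linear(4, 2).state_dict()
b = torch.nn.Linear(4, 2).state_dict()
model = torch.nn.Linear(4, 2)
model.load_state_dict(average_two_finetuned(a, b))
```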
Motion-Guided Latent Diffusion for Temporally Consistent Real-World Video Super-Resolution
Abstract
Real-world low-resolution (LR) videos have diverse and complex degradations, imposing great challenges on video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, the diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process has randomness, making it hard to control the contents of restored images. This issue becomes more serious when applying diffusion models to VSR tasks because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. To ensure the content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss, ensuring that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert a temporal module into the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies. Codes and models are available at https://github.com/IanYeung/MGLD-VSR.
Xi Yang, Chenhang He, Jianqi Ma, Lei Zhang
PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control
Abstract
In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation following the control of flexible poses. Built upon Stable Diffusion and ControlNet, we carefully design an inference process to produce high-quality videos without the corresponding ground-truth frames. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequences to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation resulting from discrepancies between poses of training videos and inference poses, we implement simple latent editing through an affine transformation matrix involving facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter achieves superior results to baselines pre-trained on a vast collection of videos under 8 commonly used metrics. Besides, PoseCrafter can follow poses from different individuals or artificial edits and simultaneously retain the human identity in an open-domain training video. Our project page is available at https://ml-gsai.github.io/PoseCrafter-demo/.
Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Changwang Zhang, Chongxuan Li
MAD-DR: Map Compression for Visual Localization with Matchness Aware Descriptor Dimension Reduction
Abstract
3D-structure based methods remain the top-performing solution for long-term visual localization tasks. However, the dimension of existing local descriptors is usually high and the map takes huge storage space, especially for large-scale scenes. We propose an asymmetric framework which learns to reduce the dimension of local descriptors and match them jointly. We can compress existing local descriptors to 1/256 of their original size while maintaining high matching performance. Experiments on public visual localization datasets show that our pipeline obtains better results than existing map compression methods and non-structure based alternatives.
Qiang Wang
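Reducing a local descriptor's dimensionality with a small learned projection, followed by L2 normalization so matching scores stay comparable, is the generic operation behind the map compression in the entry above. A hedged sketch with arbitrary input/output sizes; the paper's asymmetric, matchness-aware training is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorReducer(nn.Module):
    """Project D-dim local descriptors to a much smaller d-dim code."""

    def __init__(self, in_dim=256, out_dim=8):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, out_dim),
        )

    def forward(self, desc):
        return F.normalize(self.proj(desc), dim=-1)   # unit-norm compact codes

reducer = DescriptorReducer()
codes = reducer(torch.randn(1000, 256))
print(codes.shape)                                    # (1000, 8)
```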
Benchmarking Object Detectors with COCO: A New Path Forward
Abstract
The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors of COCO are hindering its utility in reliably benchmarking further progress. In search for an answer, we inspect thousands of masks from COCO (2017 version) and uncover different types of errors such as imprecise mask boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to the prevalence of COCO, we choose to correct these errors to maintain continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner set of annotations with visibly better mask quality than COCO-2017. We evaluate fifty object detectors and find that models that predict visually sharper masks score higher on COCO-ReM, affirming that they were being incorrectly penalized due to errors in COCO-2017. Moreover, our models trained using COCO-ReM converge faster and score higher than their larger variants trained using COCO-2017, highlighting the importance of data quality in improving object detectors. With these findings, we advocate using COCO-ReM for future object detection research. Our dataset is available at https://cocorem.xyz.
Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, Karan Desai
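Benchmarking a detector's mask quality on either annotation set from the entry above follows the usual pycocotools workflow; the file paths below are placeholders, and this is a generic evaluation sketch rather than the authors' evaluation code.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# placeholder paths: a COCO/COCO-ReM style annotation file and detector output
ann_file = "instances_val.json"        # ground-truth annotations (hypothetical path)
res_file = "detections_segm.json"      # detector results in COCO result format

coco_gt = COCO(ann_file)
coco_dt = coco_gt.loadRes(res_file)

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")  # mask AP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                  # prints AP / AP50 / AP75, etc.
```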
Adaptive High-Frequency Transformer for Diverse Wildlife Re-identification
Abstract
Wildlife ReID involves utilizing visual technology to identify specific individuals of wild animals in different scenarios, holding significant importance for wildlife conservation, ecological research, and environmental monitoring. Existing wildlife ReID methods are predominantly tailored to specific species, exhibiting limited applicability. Although some approaches leverage extensively studied person ReID techniques, they struggle to address the unique challenges posed by wildlife. Therefore, in this paper, we present a unified, multi-species general framework for wildlife ReID. Given that high-frequency information is a consistent representation of unique features in various species, significantly aiding in identifying contours and details such as fur textures, we propose the Adaptive High-Frequency Transformer model with the goal of enhancing high-frequency information learning. To mitigate the inevitable high-frequency interference in the wilderness environment, we introduce an object-aware high-frequency selection strategy to adaptively capture more valuable high-frequency components. Notably, we unify the experimental settings of multiple wildlife datasets for ReID, achieving superior performance over state-of-the-art ReID methods. In domain generalization scenarios, our approach demonstrates robust generalization to unknown species. Code is available at https://github.com/JigglypuffStitch/AdaFreq.git.
Chenyue Li, Shuoyi Chen, Mang Ye
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
Abstract
Segmenting and recognizing diverse object parts is crucial in computer vision and robotics. Despite significant progress in object segmentation, part-level segmentation remains underexplored due to complex boundaries and scarce annotated data. To address this, we propose a novel Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM, built on the large-scale pre-trained vision foundation model, Segment Anything Model (SAM). WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions. During its training phase, it only uses weakly supervised labels in the form of bounding boxes or points. Extensive experiments demonstrate that, through exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations. Specifically, WPS-SAM achieves 68.93% mIOU and 79.53% mACC on the PartImageNet dataset, surpassing state-of-the-art fully supervised methods by approximately 4% in terms of mIOU.
Xin-Jian Wu, Ruisong Zhang, Jie Qin, Shijie Ma, Cheng-Lin Liu
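The capability that the entry above builds on—prompting a frozen Segment Anything Model with a weak bounding-box prompt and receiving a pixel-level mask—looks roughly like the snippet below. The checkpoint path, image file, and box coordinates are placeholders, and this is plain SAM inference, not the WPS-SAM training pipeline.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# hypothetical checkpoint path and input image
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

box = np.array([50, 40, 220, 200])         # x0, y0, x1, y1 (weak box prompt)
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)                 # (1, H, W) boolean mask and its score
```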
Lane Graph as Path: Continuity-Preserving Path-Wise Modeling for Online Lane Graph Construction
Abstract
Online lane graph construction is a promising but challenging task in autonomous driving. Previous methods usually model the lane graph at the pixel or piece level, and recover the lane graph by pixel-wise or piece-wise connection, which breaks down the continuity of the lane and results in suboptimal performance. Human drivers focus on and drive along continuous and complete paths instead of considering lane pieces. Autonomous vehicles also require path-specific guidance from the lane graph for trajectory planning. We argue that the path, which indicates the traffic flow, is the primitive of the lane graph. Motivated by this, we propose to model the lane graph in a novel path-wise manner, which well preserves the continuity of the lane and encodes traffic information for planning. We present a path-based online lane graph construction method, termed LaneGAP, which learns the path end-to-end and recovers the lane graph via a Path2Graph algorithm. We qualitatively and quantitatively demonstrate the superior accuracy and efficiency of LaneGAP over conventional pixel-based and piece-based methods on the challenging nuScenes and Argoverse2 datasets under controllable and fair conditions. Compared to the recent state-of-the-art piece-wise method TopoNet on the OpenLane-V2 dataset, LaneGAP still outperforms it by 1.6 mIoU, further validating the effectiveness of path-wise modeling. Abundant visualizations in the supplementary material show LaneGAP can cope with diverse traffic conditions. Code is released at https://github.com/hustvl/LaneGAP.
Bencheng Liao, Shaoyu Chen, Bo Jiang, Tianheng Cheng, Qian Zhang, Wenyu Liu, Chang Huang, Xinggang Wang
DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency
Abstract
Diffusion models usher in a new era of video editing, flexibly manipulating video contents with text prompts. Despite the widespread application demand in editing human-centered videos, these models face significant challenges in handling complex objects like humans. In this paper, we introduce DeCo, a novel video editing framework specifically designed to treat humans and the background as separate editable targets, ensuring global spatial-temporal consistency by maintaining the coherence of each individual component. Specifically, we propose a decoupled dynamic human representation that utilizes a parametric human body prior to generate tailored humans while preserving motions consistent with the original video. In addition, we consider the background as a layered atlas to apply text-guided image editing approaches on it. To further enhance the geometry and texture of humans during the optimization, we extend the calculation of score distillation sampling into normal space and image space. Moreover, we tackle inconsistent lighting between the edited targets by leveraging a lighting-aware video harmonizer, a problem previously overlooked in decompose-edit-combine approaches. Extensive qualitative and numerical experiments demonstrate that DeCo outperforms prior video editing methods in human-centered videos, especially in longer videos.
Xiaojing Zhong, Xinyi Huang, Xiaofeng Yang, Guosheng Lin, Qingyao Wu
Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing
Abstract
Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by retraining diffusion models and the extensive sampling steps during inference limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI²D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI²D avoids re-training diffusion models and the iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods.
Zizheng Yang, Hu Yu, Bing Li, Jinghao Zhang, Jie Huang, Feng Zhao
Uncertainty-Aware Sign Language Video Retrieval with Probability Distribution Modeling
Abstract
Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude direct applications of these techniques. Previous methods achieve mapping between sign language videos and text through fine-grained modal alignment. However, due to the scarcity of fine-grained annotations, the uncertainty inherent in sign language videos is underestimated, limiting further development of sign language retrieval tasks. To address this challenge, we propose a new Uncertainty-aware Probability Distribution Retrieval (UPRet), which conceptualizes the mapping process of sign language videos and texts in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%). Our source code is available at https://github.com/xua222/UPRet.
Xuan Wu, Hongxiang Li, Yuanjiang Luo, Xuxin Cheng, Xianwei Zhuang, Meng Cao, Keren Fu
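One common way to make retrieval embeddings distributional, as described in the entry above, is to predict a diagonal Gaussian per video or text and compare distributions with a closed-form distance such as the 2-Wasserstein metric. The sketch below illustrates that generic idea only; the paper's exact probabilistic formulation may differ.

```python
import torch

def wasserstein2_diag_gauss(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    For N(mu1, diag(sigma1^2)) and N(mu2, diag(sigma2^2)):
        W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2
    """
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)

# toy retrieval scoring: smaller distance = better video-text match
mu_v, sig_v = torch.randn(5, 128), torch.rand(5, 128)   # 5 candidate videos
mu_t, sig_t = torch.randn(128), torch.rand(128)          # one text query
dist = wasserstein2_diag_gauss(mu_v, sig_v, mu_t, sig_t)
print(dist.argmin())   # index of the best-matching video
```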
NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
Abstract
Predicting accurate future human poses from historically observed motions remains a challenging task due to the spatial-temporal complexity and continuity of motions. Previous historical-value methods typically interpret the motion as discrete consecutive frames, which neglects the continuous temporal dynamics and impedes the capability of handling incomplete observations (with missing values). In this paper, we propose a novel implicit Neural Representation method for the task of human Motion prediction, dubbed NeRMo, which represents the motion as a continuous function parameterized by a neural network. The core idea is to explicitly disentangle the spatial-temporal context and output the corresponding 3D skeleton positions. This separate and flexible treatment of space and time allows NeRMo to combine the following advantages. It extrapolates at arbitrary temporal locations; it can learn from both complete and incomplete observed past motions; it provides a unified framework for repairing missing values and forecasting future poses using a single trained model. In addition, we show that NeRMo exhibits compatibility with meta-learning methods, enabling it to effectively generalize to unseen time steps. Extensive experiments conducted on classical benchmarks have confirmed the superior repairing and prediction performance of our proposed method compared to existing historical-value baselines.
Dong Wei, Huaijiang Sun, Xiaoning Sun, Shengxiang Hu
Bridging Synthetic and Real Worlds for Pre-Training Scene Text Detectors
Abstract
Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. In contrast, in this work we propose FreeReal, a real-domain-aligned pre-training paradigm that enables the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored for text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with partial annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 1.97%, 3.90%, 3.85%, and 4.56% in improving the performance of FCENet, PSENet, PANet, and DBNet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code will be available at https://github.com/SJTU-DeepVisionLab/FreeReal.
Tongkun Guan, Wei Shen, Xue Yang, Xuehui Wang, Xiaokang Yang
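The compositing step described in the entry above—lifting character structures from a synthetic image and pasting them as graffiti-like units onto a real image so that synthetic labels carry over—can be illustrated with simple mask-based blending. A hedged NumPy sketch; the character mask here is just a crude binarization, not the paper's glyph delineation.

```python
import numpy as np

def paste_glyphs(real_img, synth_img, synth_text_mask):
    """Composite synthetic character pixels onto a real image.

    real_img, synth_img : (H, W, 3) uint8 arrays of the same size
    synth_text_mask     : (H, W) boolean mask of character pixels
    Returns the mixed image and the inherited text mask (partial labels).
    """
    mixed = real_img.copy()
    mixed[synth_text_mask] = synth_img[synth_text_mask]
    return mixed, synth_text_mask

# toy example with random images and a stand-in "glyph" mask
real = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
synth = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
mask = synth.mean(axis=-1) > 200       # stand-in for glyph binarization
mixed, label = paste_glyphs(real, synth, mask)
print(mixed.shape, label.sum())
```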
VLAD-BuFF: Burst-Aware Fast Feature Aggregation for Visual Place Recognition
Abstract
Visual Place Recognition (VPR) is a crucial component of many visual localization pipelines for embodied agents. VPR is often formulated as an image retrieval task aimed at jointly learning local features and an aggregation method. The current state-of-the-art VPR methods rely on VLAD aggregation, which can be trained to learn a weighted contribution of features through their soft assignment to cluster centers. However, this process has two key limitations. Firstly, the feature-to-cluster weighting does not account for over-represented repetitive structures within a cluster, e.g., shadows or window panes; this phenomenon is also referred to as the ‘burstiness’ problem, classically solved by discounting repetitive features before aggregation. Secondly, feature-to-cluster comparisons are compute-intensive for state-of-the-art image encoders with high-dimensional local features. This paper addresses these limitations by introducing VLAD-BuFF with two novel contributions: i) a self-similarity based feature discounting mechanism to learn Burst-aware features within end-to-end VPR training, and ii) Fast Feature aggregation by reducing local feature dimensions specifically through PCA-initialized learnable pre-projection. We benchmark our method on 9 public datasets, where VLAD-BuFF sets a new state of the art. Our method is able to maintain its high recall even for 12× reduced local feature dimensions, thus enabling fast feature aggregation without compromising on recall. Through additional qualitative studies, we show how our proposed weighting method effectively downweights the non-distinctive features. Source code: https://github.com/Ahmedest61/VLAD-BuFF/.
Ahmad Khaliq, Ming Xu, Stephen Hausler, Michael Milford, Sourav Garg
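Standard soft-assignment VLAD aggregation—the starting point that the entry above augments with burst-aware discounting and a learnable pre-projection—can be sketched as follows. This is a generic NetVLAD-style aggregation written for clarity, not the paper's module; the temperature and sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def soft_vlad(features, centers, temperature=1.0):
    """Soft-assignment VLAD aggregation.

    features : (N, D) local descriptors of one image
    centers  : (K, D) cluster centers
    returns  : (K * D,) L2-normalized VLAD vector
    """
    # soft assignment of each feature to each cluster
    sim = -torch.cdist(features, centers) / temperature      # (N, K)
    assign = F.softmax(sim, dim=1)                            # (N, K)
    # weighted residuals to each center, summed over features
    resid = features.unsqueeze(1) - centers.unsqueeze(0)      # (N, K, D)
    vlad = (assign.unsqueeze(-1) * resid).sum(dim=0)          # (K, D)
    vlad = F.normalize(vlad, dim=1)                           # intra-normalization
    return F.normalize(vlad.flatten(), dim=0)

feats = torch.randn(500, 128)
centers = torch.randn(64, 128)
print(soft_vlad(feats, centers).shape)                        # (8192,)
```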
DSA: Discriminative Scatter Analysis for Early Smoke Segmentation
Abstract
Early smoke segmentation (ESS) plays a crucial role in accurately locating the source of smoke, facilitating prompt fire rescue operations and gas leak detection. Unlike regular objects, which are typically rigid, opaque, and have clear boundaries, ESS presents challenges due to the large areas of high transparency in early smoke. This leads to a significant similarity between smoke features and the surrounding background features. The key solution is to obtain a discriminative embedding space. Some distance-based methods have pursued this goal by using specific loss functions (e.g., pair-based Triplet loss and proxy-based NCA loss) to constrain the feature extractor. In this paper, we propose a novel approach called discriminative scatter analysis (DSA). Instead of solely measuring Euclidean distance, DSA assesses the compactness and separation of the embedding space from a sample scatter perspective. DSA is performed on both pixel-proxy scatter (IOS) and proxy-proxy scatter (OOS), and a unified loss function is designed to optimize the feature extractor. DSA can be easily integrated with regular segmentation methods. It is applied only during training and incurs no additional computational cost during inference. Extensive experiments have demonstrated that DSA can consistently improve the performance of various models in ESS.
Lujian Yao, Haitao Zhao, Jingchao Peng, Zhongze Wang, Kaijie Zhao
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Abstract
Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address the aforementioned issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Fig. 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% annotations, our model SafaRi achieves 59.31 and 48.26 mIoUs as compared to 58.93 and 48.19 mIoUs obtained by the fully-supervised SOTA method SeqTR, respectively, on the RefCOCO+@testA and RefCOCO+@testB datasets. SafaRi also outperforms SeqTR by 11.7% (on RefCOCO+@testA) and 19.6% (on RefCOCO+@testB) in a fully-supervised setting and demonstrates strong generalization capabilities in unseen/zero-shot tasks. Our project page can be found at https://sayannag.github.io/safari_eccv2024/.
Sayan Nag, Koustava Goswami, Srikrishna Karanam
Backmatter
Metadata
Title
Computer Vision – ECCV 2024
Editors
Aleš Leonardis
Elisa Ricci
Stefan Roth
Olga Russakovsky
Torsten Sattler
Gül Varol
Copyright Year
2025
Electronic ISBN
978-3-031-72784-9
Print ISBN
978-3-031-72783-2
DOI
https://doi.org/10.1007/978-3-031-72784-9
