2025 | Book

Computer Vision – ECCV 2024

18th European Conference, Milan, Italy, September 29 – October 4, 2024, Proceedings, Part XXXV

About this book

The multi-volume set of LNCS books with volume numbers 15059 up to 15147 constitutes the refereed proceedings of the 18th European Conference on Computer Vision, ECCV 2024, held in Milan, Italy, during September 29–October 4, 2024.

The 2387 papers presented in these proceedings were carefully reviewed and selected from a total of 8585 submissions. They deal with topics such as computer vision; machine learning; deep neural networks; reinforcement learning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3D reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; and motion estimation.

Table of Contents

Frontmatter
Visual Prompting via Partial Optimal Transport
Abstract
Visual prompts represent a lightweight approach that adapts pre-trained models to downstream tasks without modifying the model parameters. They strategically transform the input and output through prompt engineering and label mapping, respectively. Yet, existing methodologies often overlook the synergy between these components, leaving the intricate relationship between them underexplored. To address this, we propose an Optimal Transport-based Label Mapping strategy (OTLM) that effectively reduces distribution migration and lessens the modifications required by the visual prompts. Specifically, we reconceptualize label mapping as a partial optimal transport problem and introduce a novel transport cost matrix. Through the optimal transport framework, we establish a connection between output-side label mapping and input-side visual prompting. We also analyze frequency-based label mapping methods within this framework and demonstrate the superiority of our OTLM method. Our experiments across multiple datasets and various model architectures demonstrate significant performance improvements, which prove the effectiveness of the proposed method.
Mengyu Zheng, Zhiwei Hao, Yehui Tang, Chang Xu
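To make the label-mapping-as-transport idea above concrete, here is a minimal sketch that assigns downstream classes to pre-trained labels with a generic entropic Sinkhorn solver. It is an illustration only: it uses balanced OT, a random stand-in cost matrix, and uniform marginals rather than the paper's partial-OT formulation and learned transport cost.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, n_iters=200):
    """Entropic OT between marginals a (rows) and b (cols) for a given cost matrix."""
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                   # column scaling
        u = a / (K @ v)                     # row scaling
    return u[:, None] * K * v[None, :]      # transport plan

# Toy example: map 5 downstream classes onto 10 pre-trained labels.
rng = np.random.default_rng(0)
cost = rng.random((5, 10))                  # stand-in for the paper's transport cost matrix
a = np.full(5, 1 / 5)                       # downstream class marginal
b = np.full(10, 1 / 10)                     # pre-trained label marginal
plan = sinkhorn(cost, a, b)
label_map = plan.argmax(axis=1)             # pre-trained label assigned to each downstream class
print(label_map)
```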
Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model
Abstract
Modeling the trajectories of intelligent vehicles is an essential component of a traffic-simulating system. However, such trajectory predictors are typically trained to imitate the movements of human drivers. The imitation models often fall short of capturing safety-critical events residing in the long-tail end of the data distribution, especially under complex environments involving multiple drivers. In this paper, we propose a game-theoretic perspective to resolve this challenge by modeling the competitive interactions of vehicles in a general-sum Markov game and characterizing these safety-critical events with the correlated equilibrium. To achieve this goal, we pretrain a generative world model to predict the environmental dynamics of self-driving scenarios. Based on this world model, we probe the action predictor to identify the Coarse Correlated Equilibrium (CCE) by incorporating both optimistic Bellman update and magnetic mirror descent into the objective function of the Multi-Agent Reinforcement Learning (MARL) algorithm. We conduct extensive experiments to demonstrate that our algorithm outperforms other baselines in terms of efficiently closing the CCE-gap and generating meaningful trajectories under competitive autonomous driving environments. The code is available at: https://github.com/qiaoguanren/MARL-CCE.
Guanren Qiao, Guorui Quan, Rongxiao Qu, Guiliang Liu
Tendency-Driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation
Abstract
Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting prediction issue but also effectively mitigates the inherent challenge of incremental learning - catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, allowing for concurrent execution with model parameter updating via the resolution of a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.
Chongjie Si, Xuehui Wang, Xiaokang Yang, Wei Shen
AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection
Abstract
Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: static and dynamic. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at https://​github.​com/​caoyunkang/​AdaCLIP.
Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, Giacomo Boracchi
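As a rough picture of the hybrid prompts described above, the sketch below combines static learnable tokens shared across all images with dynamic tokens generated per image from its features. The module name, token counts, and embedding size are assumptions for illustration, not AdaCLIP's actual implementation.

```python
import torch
import torch.nn as nn

class HybridPrompts(nn.Module):
    """Illustrative hybrid prompting: static tokens shared by all images plus
    dynamic tokens produced from each image's features (all sizes are assumptions)."""
    def __init__(self, embed_dim=768, n_static=4, n_dynamic=4):
        super().__init__()
        self.static = nn.Parameter(torch.randn(n_static, embed_dim) * 0.02)
        self.to_dynamic = nn.Linear(embed_dim, n_dynamic * embed_dim)
        self.n_dynamic, self.embed_dim = n_dynamic, embed_dim

    def forward(self, image_feat):              # image_feat: (B, embed_dim), e.g. a CLIP [CLS] feature
        b = image_feat.shape[0]
        dyn = self.to_dynamic(image_feat).view(b, self.n_dynamic, self.embed_dim)
        sta = self.static.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([sta, dyn], dim=1)      # (B, n_static + n_dynamic, embed_dim)

prompts = HybridPrompts()(torch.randn(2, 768))
print(prompts.shape)                             # torch.Size([2, 8, 768])
```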
Pathformer3D: A 3D Scanpath Transformer for 360° Images
Abstract
Scanpath prediction in 360° images can help realize rapid rendering and better user interaction in Virtual/Augmented Reality applications. However, existing scanpath prediction models for 360° images perform prediction on the 2D equirectangular projection plane, which often results in large computational errors owing to the 2D plane's distortion and coordinate discontinuity. In this work, we perform scanpath prediction for 360° images in a 3D spherical coordinate system and propose a novel 3D scanpath Transformer named Pathformer3D. Specifically, a 3D Transformer encoder is first used to extract a 3D contextual feature representation for the 360° image. Then, the contextual feature representation and historical fixation information are input into a Transformer decoder to output the current time step's fixation embedding, where the self-attention module is used to imitate the visual working memory mechanism of the human visual system and directly model the time dependencies among fixations. Finally, a 3D Gaussian distribution is learned from each fixation embedding, from which the fixation position can be sampled. Evaluation on four panoramic eye-tracking datasets demonstrates that Pathformer3D outperforms the current state-of-the-art methods. Code is available at https://github.com/lsztzp/Pathformer3D.
Rong Quan, Yantao Lai, Mengyu Qiu, Dong Liang
TransFusion – A Transparency-Based Diffusion Model for Anomaly Detection
Abstract
Surface anomaly detection is a vital component in manufacturing inspection. Current discriminative methods follow a two-stage architecture composed of a reconstructive network followed by a discriminative network that relies on the reconstruction output. Currently used reconstructive networks often produce poor reconstructions that either still contain anomalies or lack details in anomaly-free regions. Discriminative methods are robust to some reconstructive network failures, suggesting that the discriminative network learns a strong normal appearance signal that the reconstructive networks miss. We reformulate the two-stage architecture into a single-stage iterative process that allows the exchange of information between the reconstruction and localization. We propose a novel transparency-based diffusion process where the transparency of anomalous regions is progressively increased, restoring their normal appearance accurately while maintaining the appearance of anomaly-free regions using localization cues of previous steps. We implement the proposed process as TRANSparency DifFUSION (TransFusion), a novel discriminative anomaly detection method that achieves state-of-the-art performance on both the VisA and the MVTec AD datasets, with an image-level AUROC of 98.5% and 99.2%, respectively. Code: https://​github.​com/​MaticFuc/​ECCV_​TransFusion
Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
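The transparency-based idea above can be pictured as a blending step in which anomalous pixels are faded toward a normal-appearance estimate while anomaly-free pixels are left untouched. The snippet below is a schematic of one such step under that reading, not the paper's diffusion formulation.

```python
import torch

def transparency_step(x_t, normal_est, anomaly_mask, alpha):
    """One schematic step: raise the 'transparency' of anomalous pixels by blending
    them toward the current normal-appearance estimate, leaving normal pixels intact.
    x_t, normal_est: (B, C, H, W); anomaly_mask: (B, 1, H, W) in [0, 1]; alpha in [0, 1]."""
    blend = alpha * anomaly_mask
    return (1.0 - blend) * x_t + blend * normal_est

x = torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.9).float()   # toy anomaly localization
x_next = transparency_step(x, torch.rand_like(x), mask, alpha=0.2)
```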
SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection
Abstract
Sparse 3D detectors have received significant attention since the query-based paradigm embraces low latency without explicit dense BEV feature construction. However, these detectors achieve worse performance than their dense counterparts. In this paper, we find the key to bridging the performance gap is to enhance the awareness of rich representations in two modalities. Here, we present a high-performance fully sparse detector for end-to-end multi-modality 3D object detection. The detector, termed SparseLIF, contains three key designs, which are (1) Perspective-Aware Query Generation (PAQG) to generate high-quality 3D queries with perspective priors, (2) RoI-Aware Sampling (RIAS) to further refine prior queries by sampling RoI features from each modality, (3) Uncertainty-Aware Fusion (UAF) to precisely quantify the uncertainty of each sensor modality and adaptively conduct final multi-modality fusion, thus achieving great robustness against sensor noises. By the time of paper submission, SparseLIF achieves state-of-the-art performance on the nuScenes dataset, ranking 1st on both validation set and test benchmark, outperforming all state-of-the-art 3D object detectors by a notable margin.
Hongcheng Zhang, Liu Liang, Pengxin Zeng, Xiao Song, Zhe Wang
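One way to picture the Uncertainty-Aware Fusion (UAF) design above is as a reliability weighting of per-modality query features. The sketch below uses a softmax over negative predicted uncertainties as an illustrative stand-in; the paper's exact fusion rule is not reproduced here.

```python
import torch

def uncertainty_aware_fusion(feat_lidar, feat_cam, u_lidar, u_cam):
    """Schematic fusion: weight each modality's query feature by a softmax over
    negative predicted uncertainties, so noisier sensors contribute less.
    feats: (N, C); uncertainties: (N, 1), larger = less reliable."""
    w = torch.softmax(torch.cat([-u_lidar, -u_cam], dim=1), dim=1)   # (N, 2) weights
    return w[:, :1] * feat_lidar + w[:, 1:] * feat_cam

fused = uncertainty_aware_fusion(torch.randn(4, 256), torch.randn(4, 256),
                                 torch.rand(4, 1), torch.rand(4, 1))
print(fused.shape)   # torch.Size([4, 256])
```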
3D Gaussian Parametric Head Model
Abstract
Creating high-fidelity 3D human head avatars is crucial for applications in VR/AR, telepresence, digital human interfaces, and film production. Recent advances have leveraged morphable face models to generate animated head avatars from easily accessible data, representing varying identities and expressions within a low-dimensional parametric space. However, existing methods often struggle with modeling complex appearance details, e.g., hairstyles and accessories, and suffer from low rendering quality and efficiency. This paper introduces a novel approach, 3D Gaussian Parametric Head Model, which employs 3D Gaussians to accurately represent the complexities of the human head, allowing precise control over both identity and expression. Additionally, it enables seamless face portrait interpolation and the reconstruction of detailed head avatars from a single image. Unlike previous methods, the Gaussian model can handle intricate details, enabling realistic representations of varying appearances and complex expressions. Furthermore, this paper presents a well-designed training framework to ensure smooth convergence, providing a guarantee for learning the rich content. Our method achieves high-quality, photo-realistic rendering with real-time efficiency, making it a valuable contribution to the field of parametric head models.
Yuelang Xu, Lizhen Wang, Zerong Zheng, Zhaoqi Su, Yebin Liu
RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields
Abstract
Recent advances in Neural Fields mostly rely on developing task-specific supervision which often complicates the models. Rather than developing hard-to-combine and specific modules, another approach generally overlooked is to directly inject generic priors on the scene representation (also called inductive biases) into the NeRF architecture. Based on this idea, we propose the RING-NeRF architecture which includes two inductive biases: a continuous multi-scale representation of the scene and an invariance of the decoder’s latent space over spatial and scale domains. We also design a single reconstruction process that takes advantage of those inductive biases and experimentally demonstrates on-par performances in terms of quality with dedicated architecture on multiple tasks (anti-aliasing, few view reconstruction, SDF reconstruction without scene-specific initialization) while being more efficient. Moreover, RING-NeRF has the distinctive ability to dynamically increase the resolution of the model, opening the way to adaptive reconstruction. Project page can be found at: https://​cea-list.​github.​io/​RING-NeRF
Doriand Petit, Steve Bourgeois, Dumitru Pavel, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Abstract
Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models were developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition, and mathematical expression recognition). However, such specialist models usually cannot effectively generalize across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on tremendous data in a unified way, have shown enormous potential in reading text in various scenarios, but with the drawbacks of limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: being able to recognize text of various forms with a single unified architecture, while achieving excellent accuracy and high efficiency. To better exploit the advantage of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at AdvancedLiterateMachinery.
Peng Wang, Zhaohai Li, Jun Tang, Humen Zhong, Fei Huang, Zhibo Yang, Cong Yao
Structured-NeRF: Hierarchical Scene Graph with Neural Representation
Abstract
We present Structured Neural Radiance Field (Structured-NeRF) for indoor scene representation, based on a novel hierarchical scene graph structure to organize the neural radiance field. Existing object-centric methods focus only on the inherent characteristics of objects, while overlooking the semantic and physical relationships between them. Our scene graph is adept at managing the complex real-world correlations between objects within a scene, enabling functionality beyond novel view synthesis, such as scene re-arrangement. Based on the hierarchical structure, we introduce an optimization strategy based on semantic and physical relationships, thus simplifying the operations involved in scene editing and ensuring both efficiency and accuracy. Moreover, we conduct shadow rendering on objects to further intensify the realism of the rendered images. Experimental results demonstrate that our structured representation not only achieves state-of-the-art (SOTA) performance in object-level and scene-level rendering, but also advances downstream applications in conjunction with LLMs/VLMs, such as automatic and instruction/image-conditioned scene re-arrangement, thereby extending NeRF to interactive editing conveniently and controllably.
Zhide Zhong, Jiakai Cao, Songen Gu, Sirui Xie, Liyi Luo, Hao Zhao, Guyue Zhou, Haoang Li, Zike Yan
EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation
Abstract
We introduce EGIC, an enhanced generative image compression method that allows traversing the distortion-perception curve efficiently from a single model. EGIC is based on two novel building blocks: i) OASIS-C, a conditional pre-trained semantic segmentation-guided discriminator, which provides both spatially and semantically-aware gradient feedback to the generator, conditioned on the latent image distribution, and ii) Output Residual Prediction (ORP), a retrofit solution for multi-realism image compression that allows control over the synthesis process by adjusting the impact of the residual between an MSE-optimized and GAN-optimized decoder output on the GAN-based reconstruction. Together, EGIC forms a powerful codec, outperforming state-of-the-art diffusion and GAN-based methods (e.g., HiFiC, MS-ILLM, and DIRAC-100), while performing almost on par with VTM-20.0 on the distortion end. EGIC is simple to implement, very lightweight, and provides excellent interpolation characteristics, which makes it a promising candidate for practical applications targeting the low bit range.
Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller
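The Output Residual Prediction (ORP) control described above can be read as adding a scaled residual between the MSE-optimized and GAN-optimized decoder outputs onto the GAN reconstruction. The snippet below is a minimal sketch under that reading; the blending factor beta and its range are illustrative assumptions. Sweeping beta from 0 to 1 then traverses from the perception-oriented output toward the distortion-oriented one.

```python
import torch

def output_residual_prediction(x_gan, x_mse, beta):
    """Schematic multi-realism control: add a scaled residual between the
    MSE-optimized and GAN-optimized decoder outputs onto the GAN reconstruction.
    beta = 0 keeps the full GAN output; beta = 1 moves to the MSE (low-distortion) output."""
    return x_gan + beta * (x_mse - x_gan)

x_gan, x_mse = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
reconstructions = [output_residual_prediction(x_gan, x_mse, beta) for beta in (0.0, 0.5, 1.0)]
```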
Plug-and-Play Learned Proximal Trajectory for 3D Sparse-View X-Ray Computed Tomography
Abstract
Plug-and-Play algorithms (PnP) have recently emerged as a powerful framework for solving inverse problems in imaging. They leverage the power of Gaussian denoising algorithms to solve complex optimization problems. This work focuses on the challenging task of 3D sparse-view X-ray computed tomography (CT). We propose to replace the Gaussian denoising network in Plug-and-Play with a restoration network, i.e., a network trained to remove arbitrary artifacts. We show that using a restoration prior tailored to the specific inverse problem improves the performance of Plug-and-Play algorithms. Besides, we show that plugging a basic restoration network into a PnP scheme is not sufficient to obtain good results. Thus, we propose a procedure to train the restoration network to be a robust approximation of a proximal operator along a pre-defined optimization trajectory. We demonstrate the effectiveness and scalability of our approach on two 3D Cone-Beam CT datasets and outperform state-of-the-art methods in terms of PSNR. Code is available at https://github.com/romainvo/pnp-learned-proximal-trajectory.
Romain Vo, Julie Escoda, Caroline Vienne, Étienne Decencière
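A generic Plug-and-Play proximal-gradient loop with a restoration network in place of the usual Gaussian denoiser looks roughly as follows. The operators and the identity "restorer" in the toy call are placeholders, and the paper's trajectory-aware training of the restoration network is not shown.

```python
import torch

def pnp_iterations(x0, forward_op, adjoint_op, y, restorer, step=1.0, n_iters=20):
    """Schematic Plug-and-Play proximal-gradient loop: a gradient step on the data
    fidelity 0.5 * ||Ax - y||^2 followed by a learned restoration network acting as
    an approximate proximal operator."""
    x = x0
    for _ in range(n_iters):
        grad = adjoint_op(forward_op(x) - y)   # gradient of the data-fidelity term
        x = restorer(x - step * grad)          # learned restoration replaces the prox
    return x

# Toy usage with identity operators and an identity "restorer".
y = torch.rand(1, 1, 32, 32)
x_hat = pnp_iterations(torch.zeros_like(y), lambda x: x, lambda r: r, y, lambda z: z)
```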
PPAD: Iterative Interactions of Prediction and Planning for End-to-End Autonomous Driving
Abstract
We present a new interaction mechanism of prediction and planning for end-to-end autonomous driving, called PPAD (Iterative Interaction of Prediction and Planning Autonomous Driving), which considers the timestep-wise interaction to better integrate prediction and planning. An ego vehicle performs motion planning at each timestep based on the trajectory prediction of surrounding agents (e.g., vehicles and pedestrians) and its local road conditions. Unlike existing end-to-end autonomous driving frameworks, PPAD models the interactions among ego, agents, and the dynamic environment in an autoregressive manner by interleaving the Prediction and Planning processes at every timestep, instead of a single sequential process of prediction followed by planning. Specifically, we design ego-to-agent, ego-to-map, and ego-to-BEV interaction mechanisms with hierarchical dynamic key objects attention to better model the interactions. The experiments on the nuScenes benchmark show that our approach outperforms state-of-the-art methods. Project page at https://​github.​com/​zlichen/​PPAD.
Zhili Chen, Maosheng Ye, Shuangjie Xu, Tongyi Cao, Qifeng Chen
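The interleaved, autoregressive coupling of prediction and planning can be sketched as a rollout that alternates the two at every timestep. The function names and toy state below are illustrative stand-ins, not PPAD's actual modules.

```python
# Schematic of an interleaved prediction-planning rollout; predict_agents and
# plan_ego stand in for learned modules and are assumptions.
def rollout(state, predict_agents, plan_ego, horizon=6):
    ego_plan, agent_preds = [], []
    for _ in range(horizon):
        agents_t = predict_agents(state)            # forecast surrounding agents for this step
        ego_t = plan_ego(state, agents_t)           # plan the ego motion given that forecast
        state = {"ego": ego_t, "agents": agents_t}  # roll the scene state forward
        ego_plan.append(ego_t)
        agent_preds.append(agents_t)
    return ego_plan, agent_preds

plan, preds = rollout({"ego": (0.0, 0.0), "agents": []},
                      predict_agents=lambda s: s["agents"],
                      plan_ego=lambda s, a: (s["ego"][0] + 1.0, s["ego"][1]))
```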
Test-Time Stain Adaptation with Diffusion Models for Histopathology Image Classification
Abstract
Stain shifts are prevalent in histopathology images and are typically dealt with by normalization or augmentation. Considering that training-time methods are limited in dealing with unseen stains, we propose a test-time stain adaptation method (TT-SaD) with diffusion models that achieves stain adaptation by solving a nonlinear inverse problem during testing. TT-SaD is promising in that it only needs a single domain for training but can adapt well to other domains during testing, avoiding model retraining whenever new data become available. For tumor classification, stain adaptation by TT-SaD outperforms state-of-the-art diffusion model-based test-time methods. Moreover, TT-SaD beats training-time methods when testing on data that are inaccessible during training. To our knowledge, test-time stain adaptation with diffusion models remains relatively unexplored.
Cheng-Chang Tsai, Yuan-Chih Chen, Chun-Shien Lu
Beyond MOT: Semantic Multi-object Tracking
Abstract
Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., “where”) in videos. Yet, knowing merely “where” is insufficient in many crucial applications. In comparison, semantic understanding such as fine-grained behaviors, interactions, and overall summarized captions (i.e., “what”) from videos, associated with “where”, is highly-desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), that aims to estimate object trajectories and meanwhile understand semantic details of associated trajectories including instance captions, instance interactions, and overall video captions, integrating “where” and “what” for tracking. In order to foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for semantic tracking of humans. BenSMOT provides annotations for the trajectories of targets, along with associated instance captions in natural language, instance interactions, and overall caption for each video sequence. To our best knowledge, BenSMOT is the first publicly available benchmark for SMOT. Besides, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting “where” and “what” for SMOT, opening up a new direction in tracking for video understanding. We will release BenSMOT and SMOTer at https://​github.​com/​Nathan-Li123/​SMOTer.
Yunhao Li, Qin Li, Hao Wang, Xue Ma, Jiali Yao, Shaohua Dong, Heng Fan, Libo Zhang
Temporal Event Stereo via Joint Learning with Stereoscopic Flow
Abstract
Event cameras are dynamic vision sensors inspired by the biological retina, characterized by their high dynamic range, high temporal resolution, and low power consumption. These features make them capable of perceiving 3D environments even in extreme conditions. Event data is continuous across the time dimension, which allows a detailed description of each pixel’s movements. To fully utilize the temporally dense and continuous nature of event cameras, we propose a novel temporal event stereo, a framework that continuously uses information from previous time steps. This is accomplished through the simultaneous training of an event stereo matching network alongside stereoscopic flow, a new concept that captures all pixel movements from stereo cameras. Since obtaining ground truth for optical flow during training is challenging, we propose a method that uses only disparity maps to train the stereoscopic flow. The performance of event-based stereo matching is enhanced by temporally aggregating information using the flows. We have achieved state-of-the-art performance on the MVSEC and the DSEC datasets. The method is computationally efficient, as it stacks previous information in a cascading manner. The code is available at https://​github.​com/​mickeykang16/​TemporalEventSte​reo.
Hoonhee Cho, Jae-Young Kang, Kuk-Jin Yoon
SAM-COD: SAM-Guided Unified Framework for Weakly-Supervised Camouflaged Object Detection
Abstract
Most Camouflaged Object Detection (COD) methods heavily rely on mask annotations, which are time-consuming and labor-intensive to acquire. Existing weakly-supervised COD approaches exhibit significantly inferior performance compared to fully-supervised methods and struggle to simultaneously support all the existing types of camouflaged object labels, including scribbles, bounding boxes, and points. Even for Segment Anything Model (SAM), it is still problematic to handle the weakly-supervised COD and it typically encounters challenges of prompt compatibility of the scribble labels, extreme response, semantically erroneous response, and unstable feature representations, producing unsatisfactory results in camouflaged scenes. To mitigate these issues, we propose a unified COD framework in this paper, termed SAM-COD, which is capable of supporting arbitrary weakly-supervised labels. Our SAM-COD employs a prompt adapter to handle scribbles as prompts based on SAM. Meanwhile, we introduce response filter and semantic matcher modules to improve the quality of the masks obtained by SAM under COD prompts. To alleviate the negative impacts of inaccurate mask predictions, a new strategy of prompt-adaptive knowledge distillation is utilized to ensure a reliable feature representation. To validate the effectiveness of our approach, we have conducted extensive empirical experiments on three mainstream COD benchmarks. The results demonstrate the superiority of our method against state-of-the-art weakly-supervised and even fully-supervised methods.
Huafeng Chen, Pengxu Wei, Guangqian Guo, Shan Gao
Just a Hint: Point-Supervised Camouflaged Object Detection
Abstract
Camouflaged Object Detection (COD) demands models to expeditiously and accurately distinguish objects which conceal themselves seamlessly in the environment. Owing to the subtle differences and ambiguous boundaries, COD is not only a remarkably challenging task for models but also for human annotators, requiring huge efforts to provide pixel-wise annotations. To alleviate the heavy annotation burden, we propose to fulfill this task with the help of only one point supervision. Specifically, by swiftly clicking on each object, we first adaptively expand the original point-based annotation to a reasonable hint area. Then, to avoid partial localization around discriminative parts, we propose an attention regulator to scatter model attention to the whole object through partially masking labeled regions. Moreover, to solve the unstable feature representation of camouflaged objects under only point-based annotation, we perform unsupervised contrastive learning based on differently augmented image pairs (e.g. changing color or doing translation). On three mainstream COD benchmarks, experimental results show that our model outperforms several weakly-supervised methods by a large margin across various metrics.
Huafeng Chen, Dian Shao, Guangqian Guo, Shan Gao
ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation
Abstract
Performing language-conditioned robotic manipulation tasks in unstructured environments is in high demand for general-purpose intelligent robots. Conventional robotic manipulation methods usually learn a semantic representation of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics needed to complete human goals. In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. Specifically, we first formulate the dynamic Gaussian Splatting framework that infers semantic propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate that our framework can outperform the state-of-the-art methods by 13.1% in average success rate (Project page: https://guanxinglu.github.io/ManiGaussian/).
Guanxing Lu, Shiyi Zhang, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang
Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
Abstract
Open-Vocabulary Detection (OVD) is the task of detecting all interesting objects in a given scene without predefined object classes. Extensive work has been done to deal with the OVD for 2D RGB images, but the exploration of 3D OVD is still limited. Intuitively, lidar point clouds provide 3D information, both object level and scene level, to generate trustful detection results. However, previous lidar-based OVD methods only focus on the usage of object-level features, ignoring the essence of scene-level information. In this paper, we propose a Global-Local Collaborative Scheme (GLIS) for the lidar-based OVD task, which contains a local branch to generate object-level detection result and a global branch to obtain scene-level global feature. With the global-local information, a Large Language Model (LLM) is applied for chain-of-thought inference, and the detection result can be refined accordingly. We further propose Reflected Pseudo Labels Generation (RPLG) to generate high-quality pseudo labels for supervision and Background-Aware Object Localization (BAOL) to select precise object proposals. Extensive experiments on ScanNetV2 and SUN RGB-D demonstrate the superiority of our methods. Code is released at https://​github.​com/​GradiusTwinbee/​GLIS.
Xingyu Peng, Yan Bai, Chen Gao, Lirong Yang, Fei Xia, Beipeng Mu, Xiaofei Wang, Si Liu
Learning High-Resolution Vector Representation from Multi-camera Images for 3D Object Detection
Abstract
The Bird’s-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. The presented high-resolution vector representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at a high resolution through our two novel modules: vector scattering and gathering. To this end, the learned vector representation with richer scene contexts can serve as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we investigate query-BEV-based methods incorporated with our proposed vector representation and observe a consistent performance improvement. Project page at https://​github.​com/​zlichen/​VectorFormer.
Zhili Chen, Shuangjie Xu, Maosheng Ye, Zian Qian, Xiaoyi Zou, Dit-Yan Yeung, Qifeng Chen
View-Consistent 3D Editing with Gaussian Splatting
Abstract
The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations. Currently, diffusion-based 2D editing models are harnessed to modify multi-view rendered images, which then guide the editing of 3DGS models. However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS. To this end, we introduce View-consistent Editing (VcEdit), a novel framework that seamlessly incorporates 3DGS into image editing processes, ensuring multi-view consistency in edited guidance images and effectively mitigating mode collapse issues. VcEdit employs two innovative consistency modules: the Cross-attention Consistency Module and the Editing Consistency Module, both designed to reduce inconsistencies in edited images. By incorporating these consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency, facilitating high-quality 3DGS editing across a diverse range of scenes. Further code and video results are released at http://​yuxuanw.​me/​vcedit/​.
Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang
E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation
Abstract
Accurately estimating energy expenditure (EE) is crucial for optimizing athletic training, monitoring daily activity levels, and preventing sports-related injuries. Estimating energy expenditure based on video (E³V) is an appealing research direction. This paper introduces E3V-K5, an authentic dataset of sports videos that significantly enhances the accuracy of EE estimation. The dataset comprises 16,526 video clips from various categories and intensities of sports with continuous calorie readings obtained from the COSMED K5 indirect calorimeter, recognized as the most reliable standard in sports research. Augmented with the heart rate and physical attributes of each subject, the volume, diversity, and authenticity of E3V-K5 surpass all previous video datasets in E³V, making E3V-K5 a cornerstone in this field and facilitating future research. Furthermore, we propose E3SFormer, a novel approach specifically designed for the E3V-K5 dataset, focusing on EE estimation using human skeleton data. E3SFormer consists of two Transformer branches for simultaneous action recognition and EE regression. The attention of joints from the action recognition branch is utilized to assist the EE regression branch. Extensive experimentation validates E3SFormer's effectiveness, demonstrating its superior performance over existing skeleton-based action recognition models. Our dataset and code are publicly available at https://github.com/zsxm1998/E3V.
Shengxuming Zhang, Lei Jin, Yifan Wang, Xinyu Wang, Xu Wen, Zunlei Feng, Mingli Song
GeoGaussian: Geometry-Aware Gaussian Splatting for Scene Rendering
Abstract
During the Gaussian Splatting optimization process, the scene geometry can gradually deteriorate if its structure is not deliberately preserved, especially in non-textured regions such as walls, ceilings, and furniture surfaces. This degradation significantly affects the rendering quality of novel views that deviate significantly from the viewpoints in the training data. To mitigate this issue, we propose a novel approach called GeoGaussian. Based on the smoothly connected areas observed from point clouds, this method introduces a novel pipeline to initialize thin Gaussians aligned with the surfaces, where the characteristic can be transferred to new generations through a carefully designed densification strategy. Finally, the pipeline ensures that the scene geometry and texture are maintained through constrained optimization processes with explicit geometry constraints. Benefiting from the proposed architecture, the generative ability of 3D Gaussians is enhanced, especially in structured regions. Our proposed pipeline achieves state-of-the-art performance in novel view synthesis and geometric reconstruction, as evaluated qualitatively and quantitatively on public datasets.
Yanyan Li, Chenyu Lyu, Yan Di, Guangyao Zhai, Gim Hee Lee, Federico Tombari
URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields
Abstract
In this paper, we propose a novel rolling shutter bundle adjustment method for neural radiance fields (NeRF), which utilizes the unordered rolling shutter (RS) images to obtain the implicit 3D representation. Existing NeRF methods suffer from low-quality images and inaccurate initial camera poses due to the RS effect in the image. Furthermore, the previous method that incorporates RS images into NeRF requires strict sequential data input, thus limiting its widespread applicability. In contrast, our method recovers the physical formation of RS images by estimating camera poses and velocities, thereby removing the input constraints on sequential data. Moreover, we adopt a coarse-to-fine training strategy, in which the RS epipolar constraints of the pairwise frames in the scene graph are used to detect the camera poses that fall into local minima. The poses detected as outliers are corrected by the interpolation method with neighboring poses. The experimental results validate the effectiveness of our method over state-of-the-art works and demonstrate that the reconstruction of 3D representations is not constrained by the requirement of video sequence input.
Bo Xu, Ziao Liu, Mengqi Guo, Jiancheng Li, Gim Hee Lee
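The correction of outlier poses by interpolation from neighboring poses, as described above, can be illustrated with a generic scheme: spherical linear interpolation (SLERP) on rotations and linear interpolation on translations. The helper below is a sketch under that assumption; the paper's exact interpolation may differ.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(t, t0, t1, q0, p0, q1, p1):
    """Correct an outlier pose at time t from its neighbours at t0 and t1:
    SLERP on the rotations (quaternions, xyzw order) and linear interpolation
    on the translations. A generic scheme, not the paper's exact method."""
    slerp = Slerp([t0, t1], Rotation.from_quat([q0, q1]))
    alpha = (t - t0) / (t1 - t0)
    q_t = slerp([t]).as_quat()[0]
    p_t = (1 - alpha) * np.asarray(p0) + alpha * np.asarray(p1)
    return q_t, p_t

q_t, p_t = interpolate_pose(
    0.5, 0.0, 1.0,
    q0=[0, 0, 0, 1], p0=[0, 0, 0],                                    # identity rotation at t0
    q1=Rotation.from_euler("z", 90, degrees=True).as_quat(), p1=[1, 1, 1])
print(q_t, p_t)
```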
Backmatter
Metadata
Title
Computer Vision – ECCV 2024
Editors
Aleš Leonardis
Elisa Ricci
Stefan Roth
Olga Russakovsky
Torsten Sattler
Gül Varol
Copyright Year
2025
Electronic ISBN
978-3-031-72761-0
Print ISBN
978-3-031-72760-3
DOI
https://doi.org/10.1007/978-3-031-72761-0
