2025 | Book

Computer Vision – ECCV 2024

18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LVII

About this book

The multi-volume set of LNCS books with volume numbers 15059 up to 15147 constitutes the refereed proceedings of the 18th European Conference on Computer Vision, ECCV 2024, held in Milan, Italy, during September 29–October 4, 2024.

The 2387 papers presented in these proceedings were carefully reviewed and selected from a total of 8585 submissions. They deal with topics such as computer vision; machine learning; deep neural networks; reinforcement learning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3d reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; motion estimation.

Table of Contents

Frontmatter
ST-LLM: Large Language Models Are Effective Temporal Learners
Abstract
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLM? Surprisingly, this simple approach yields significant improvements in video understanding. Building on this, we propose ST-LLM, an effective video-LLM baseline with Spatial-Temporal sequence modeling inside the LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we have also designed a global-local input module to balance efficiency and effectiveness. Consequently, we harness the LLM for proficient spatial-temporal modeling while upholding efficiency and stability. Extensive experimental results attest to the effectiveness of our method. Through a more concise model and training pipeline, ST-LLM establishes a new state-of-the-art result on VideoChatGPT-Bench and MVBench. Code is available at https://github.com/TencentARC/ST-LLM.
Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li
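A minimal sketch of the core idea in the ST-LLM abstract above, assuming PyTorch and illustrative tensor shapes; the tiny transformer is only a stand-in for a pretrained LLM, and the random token drop is a crude proxy for the paper's dynamic masking:

```python
# Sketch (not the authors' code): delegate spatio-temporal modeling to the language model
# by flattening all frame tokens and prepending them to the text tokens.
import torch
import torch.nn as nn

B, T, N, D_vis, D_llm = 2, 8, 16, 512, 768      # batch, frames, tokens/frame, dims (assumed)

visual_tokens = torch.randn(B, T, N, D_vis)      # e.g. frozen image-encoder features
text_embeds   = torch.randn(B, 32, D_llm)        # embedded dialogue/instruction tokens

proj = nn.Linear(D_vis, D_llm)                   # map visual tokens into the LLM space
vis_seq = proj(visual_tokens.reshape(B, T * N, D_vis))   # [B, T*N, D_llm]

# Crude stand-in for dynamic masking: keep a random 70% of the video tokens.
keep = torch.rand(B, T * N).argsort(dim=1)[:, : int(0.7 * T * N)]
vis_seq = torch.gather(vis_seq, 1, keep.unsqueeze(-1).expand(-1, -1, D_llm))

llm_stand_in = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_llm, 8, batch_first=True), 2)
hidden = llm_stand_in(torch.cat([vis_seq, text_embeds], dim=1))   # joint sequence modeling
print(hidden.shape)
```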
Exact Diffusion Inversion via Bidirectional Integration Approximation
Abstract
Recently, various methods have been proposed to address the inconsistency issue of DDIM inversion to enable image editing, such as EDICT [40] and Null-text inversion [24]. However, the above methods introduce considerable computational overhead. In this paper, we propose a new technique, bidirectional integration approximation (BDIA), to perform exact diffusion inversion with negligible computational overhead. We consider a family of second order integration algorithms obtained by averaging forward and backward DDIM steps. The resulting approach estimates the next diffusion state as a linear combination of the estimated Gaussian noise at the current step and the previous and current diffusion states. This allows for exact backward computation of previous state given the current and next ones, leading to exact diffusion inversion. We perform a convergence analysis for BDIA-DDIM that includes the analysis for DDIM as a special case. It is demonstrated with experiments that BDIA-DDIM is effective for (round-trip) prompt-driven image editing. Our experiments further show that BDIA-DDIM produces markedly better image sampling quality than DDIM and EDICT for text-to-image generation and conventional image sampling. BDIA can also be applied to improve the performance of other ODE solvers in addition to DDIM. In particular, it is found that applying BDIA to the EDM sampling procedure produces consistently better performance over four pre-trained models (see Alg. 4 and Table 3 in Appendix C).
Guoqiang Zhang, J. P. Lewis, W. Bastiaan Kleijn
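The BDIA abstract describes the next diffusion state as a linear combination of the current noise estimate and the previous and current states, which is what makes the backward pass exact. The toy sketch below illustrates that property with placeholder coefficients and a dummy noise predictor; it is not the paper's actual update rule:

```python
import torch

def eps_theta(z, t):               # stand-in for the diffusion model's noise prediction
    return torch.tanh(z) * 0.1 * (t + 1)

a, b, c = 1.0, -0.5, 0.2           # hypothetical BDIA-style coefficients

def forward_step(z_prev, z_cur, t):
    # next state as a linear combination of previous state, current state, current noise
    return a * z_prev + b * z_cur + c * eps_theta(z_cur, t)

def backward_step(z_next, z_cur, t):
    # the same relation solved for the previous state -> exact inversion
    return (z_next - b * z_cur - c * eps_theta(z_cur, t)) / a

z_prev, z_cur = torch.randn(4), torch.randn(4)
z_next = forward_step(z_prev, z_cur, t=3)
z_rec  = backward_step(z_next, z_cur, t=3)
print(torch.allclose(z_rec, z_prev, atol=1e-6))   # True up to floating-point error
```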
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Abstract
In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries are regarded as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we suggest three regularization losses to improve the efficacy of tqdm by aligning visual and textual features. By utilizing our method, the model can comprehend inherent semantic information for classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5→Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.
Byeonghyun Pak, Byeongju Woo, Sunghwan Kim, Dae-hwan Kim, Hoseong Kim
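A rough sketch of the "textual object queries" idea from the tqdm abstract, with random tensors standing in for CLIP text embeddings and backbone pixel features; the shapes and the single cross-attention layer are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

K, D, H, W = 19, 256, 64, 128                        # classes, feature dim, feature map size
text_queries = torch.randn(1, K, D)                  # e.g. projected CLIP text embeddings
pixel_feats  = torch.randn(1, H * W, D)              # dense visual features

# Class-wise text embeddings act as object queries of a mask-transformer-style decoder.
cross_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
refined, _ = cross_attn(query=text_queries, key=pixel_feats, value=pixel_feats)

# Masks from query-pixel dot products: the grouping basis is the (domain-invariant) text.
mask_logits = torch.einsum("bkd,bpd->bkp", refined, pixel_feats).view(1, K, H, W)
pred = mask_logits.argmax(dim=1)                     # per-pixel class map
print(pred.shape)
```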
EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head
Abstract
We present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still suffer from a lack of multi-view consistency and emotional expressiveness. To address these issues, we collect the EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. Training on the EmoTalk3D dataset, we propose a ‘Speech-to-Geometry-to-Appearance’ mapping framework that first predicts a faithful 3D geometry sequence from the audio features; the appearance of a 3D talking head, represented by 4D Gaussians, is then synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos, and fused to render free-view talking head animation. Moreover, our model enables controllable emotion in the generated talking heads and can be rendered across a wide range of views. Our method exhibits improved rendering quality and stability in lip motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads. The code and EmoTalk3D dataset are released at https://nju-3dv.github.io/projects/EmoTalk3D.
Qianyun He, Xinya Ji, Yicheng Gong, Yuanxun Lu, Zhengyu Diao, Linjia Huang, Yao Yao, Siyu Zhu, Zhan Ma, Songcen Xu, Xiaofei Wu, Zixiao Zhang, Xun Cao, Hao Zhu
Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors
Abstract
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we first describe a strong baseline for AVSR by putting together three variants of elementary building blocks: 1) a flow-guided recurrent unit that aggregates spatiotemporal information from previous frames, 2) a flow-refined cross-attention unit that selects spatiotemporal information from future frames, and 3) a hyper-upsampling unit that generates scale-aware and content-independent upsampling kernels. We then introduce ST-AVSR by equipping our baseline with a multi-scale structural and textural prior computed from the pre-trained VGG network. This prior has proven effective in discriminating structure and texture across different locations and scales, which is beneficial for AVSR. Comprehensive experiments show that ST-AVSR significantly improves super-resolution quality, generalization ability, and inference speed over the state-of-the-art. The code is available at https://github.com/shangwei5/ST-AVSR.
Wei Shang, Dongwei Ren, Wanying Zhang, Yuming Fang, Wangmeng Zuo, Kede Ma
Object-Centric Diffusion for Efficient Video Editing
Abstract
Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion to fix generation artifacts and further reduce latency by allocating more computation to foreground edited regions, which are arguably more important for perceptual quality. We achieve this through two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient and background regions and spends most of them on the former, and ii) Object-Centric Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10× for comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian
Single-Mask Inpainting for Voxel-Based Neural Radiance Fields
Abstract
3D inpainting is a challenging task in computer vision and graphics that aims to remove objects and fill in missing regions with a visually coherent and complete representation of the background. A few methods have been proposed to address this problem, yielding notable inpainting results. However, these methods have not overcome the limitation of relying on a mask for each view. Obtaining masks for every view can be time-consuming and reduces quality, especially in scenarios with a large number of views or complex scenes. To address this limitation, we propose an innovative approach that eliminates the need for per-view masks and uses a single mask from a selected view. We focus on improving the quality of forward-facing scene inpainting. By unprojecting the single 2D mask into NeRF space, we define the regions that require inpainting in three dimensions. We introduce a two-step optimization process. First, we utilize 2D inpainters to generate color and depth priors for the selected view, providing rough supervision for the area to be inpainted. Second, we incorporate a 2D diffusion model to enhance the quality of the inpainted regions, reducing distortions and elevating the overall visual fidelity. Through extensive experiments, we demonstrate the effectiveness of our single-mask inpainting framework. The results show that our approach successfully inpaints complex geometry and produces visually plausible and realistic outcomes.
Jiafu Chen, Tianyi Chu, Jiakai Sun, Wei Xing, Lei Zhao
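The single-mask idea hinges on lifting one 2D mask into 3D. Below is a small NumPy sketch of that unprojection step, under assumed pinhole intrinsics, a constant depth map, and an identity camera pose; the actual pipeline operates on a trained voxel-based NeRF:

```python
import numpy as np

H, W = 120, 160
fx = fy = 100.0
cx, cy = W / 2, H / 2                                 # assumed pinhole intrinsics
depth = np.full((H, W), 2.0)                          # per-pixel depth for the chosen view
mask = np.zeros((H, W), dtype=bool)
mask[40:80, 60:100] = True                            # 2D inpainting mask on that view

v, u = np.nonzero(mask)                               # pixel coordinates inside the mask
z = depth[v, u]
x = (u - cx) / fx * z
y = (v - cy) / fy * z
points_cam = np.stack([x, y, z], axis=1)              # 3D points in camera coordinates

c2w = np.eye(4)                                       # camera-to-world pose (identity here)
points_world = (c2w[:3, :3] @ points_cam.T).T + c2w[:3, 3]
print(points_world.shape)                             # these points bound the 3D inpaint region
```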
McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction
Abstract
Iso-surface extraction from an implicit field is a fundamental process in various applications of computer vision and graphics. When dealing with geometric shapes with complicated geometric details, many existing algorithms suffer from high computational costs and memory usage. This paper proposes McGrids, a novel approach to improve the efficiency of iso-surface extraction. The key idea is to construct adaptive grids for iso-surface extraction rather than using a simple uniform grid as prior art does. Specifically, we formulate the problem of constructing adaptive grids as a probability sampling problem, which is then solved by a Monte Carlo process. We demonstrate McGrids’ capability with extensive experiments on both analytical SDFs computed from surface meshes and implicit fields learned from real multi-view images. The experimental results show that our McGrids can significantly reduce the number of implicit field queries, resulting in significant memory reduction, while producing high-quality meshes with rich geometric details.
Daxuan Ren, Hezi Shi, Jianmin Zheng, Jianfei Cai
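A simplified illustration of Monte Carlo-driven adaptive sampling in the spirit of McGrids, using a unit-sphere SDF and a rejection-sampling rule chosen for illustration rather than the paper's algorithm: samples concentrate near the iso-surface instead of being spent on a dense uniform grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def sdf(p):                                   # analytic SDF of a unit sphere (stand-in field)
    return np.linalg.norm(p, axis=-1) - 1.0

def sample_adaptive_points(n_target, tau=0.05, batch=4096):
    accepted = []
    while sum(len(a) for a in accepted) < n_target:
        cand = rng.uniform(-1.5, 1.5, size=(batch, 3))
        prob = np.exp(-np.abs(sdf(cand)) / tau)        # acceptance decays with |SDF|
        keep = rng.random(batch) < prob
        accepted.append(cand[keep])
    return np.concatenate(accepted)[:n_target]

pts = sample_adaptive_points(20000)
print(pts.shape, np.abs(sdf(pts)).mean())              # samples cluster around |SDF| ~ 0
```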
Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval
Abstract
In this paper, we delve into the intricate dynamics of Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) by addressing a critical yet overlooked aspect – the choice of viewpoint during sketch creation. Unlike photo systems that seamlessly handle diverse views through extensive datasets, sketch systems, with limited data collected from fixed perspectives, face challenges. Our pilot study, employing a pre-trained FG-SBIR model, highlights the system’s struggle when query-sketches differ in viewpoint from target instances. Interestingly, however, a questionnaire shows that users desire autonomy, with a significant percentage favouring view-specific retrieval. To reconcile this, we advocate for a view-aware system that seamlessly accommodates both view-agnostic and view-specific tasks. Overcoming dataset limitations, our first contribution leverages multi-view 2D projections of 3D objects, instilling cross-modal view awareness. The second contribution introduces a customisable cross-modal feature through disentanglement, allowing effortless mode switching. Extensive experiments on standard datasets validate the effectiveness of our method.
Aneeshan Sain, Pinaki Nath Chowdhury, Subhadeep Koley, Ayan Kumar Bhunia, Yi-Zhe Song
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
Abstract
For a general-purpose robot to operate in reality, executing a broad range of instructions across various environments is imperative. Central to reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models, such as CLIP, have shown remarkable performance in the domain of deep learning, paving the way for open-domain visual recognition. However, collecting data on robots executing various language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, using only robot video data from a minimal number of tasks in a single environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model’s ability to distinguish between successful and failed robot executions, we cluster failure video features to enable the model to identify patterns within them. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
Yanting Yang, Minghao Chen, Qibo Qiu, Jiahao Wu, Wenxiao Wang, Binbin Lin, Ziyu Guan, Xiaofei He
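A hedged sketch of the failure-prompt idea from the Adapt2Reward abstract: failure-video features are clustered and each cluster receives its own learnable prompt embedding. The feature dimensions, the clustering backend, and the toy reward at the end are assumptions, with random vectors standing in for video-language features:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

D, n_fail, n_clusters, prompt_len = 512, 200, 4, 8
failure_feats = torch.randn(n_fail, D)                       # features of failed rollouts

# Cluster failure videos so each cluster can represent one failure mode.
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(failure_feats.numpy())

# One learnable prompt per failure cluster, to be prepended to the text encoder's input.
failure_prompts = nn.Parameter(torch.randn(n_clusters, prompt_len, D) * 0.02)

# Toy reward: similarity to the success text minus the best-matching failure prompt
# (mean-pooled here purely for simplicity).
video_feat   = torch.randn(D)
success_text = torch.randn(D)
fail_embeds  = failure_prompts.mean(dim=1)                   # [n_clusters, D]
cos = nn.functional.cosine_similarity
reward = cos(video_feat, success_text, dim=0) - cos(video_feat.expand_as(fail_embeds), fail_embeds).max()
print(float(reward))
```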
Diffusion for Natural Image Matting
Abstract
Existing natural image matting algorithms inevitably have flaws in their predictions on difficult cases, and their one-step prediction manner cannot further correct these errors. In this paper, we investigate a multi-step iterative approach for the first time to tackle the challenging natural image matting task, and achieve excellent performance by introducing a pixel-level denoising diffusion method (DiffMatte) for alpha matte refinement. To improve iteration efficiency, we design a lightweight diffusion decoder as the only iterative component to directly denoise the alpha matte, saving the huge computational overhead of repeatedly encoding matting features. We also propose an ameliorated self-aligned strategy to consolidate the performance gains brought about by the iterative diffusion process. This allows the model to adapt to various types of errors by aligning the noisy samples used in training and inference, mitigating performance degradation caused by sampling drift. Extensive experimental results demonstrate that DiffMatte not only reaches the state-of-the-art level on the mainstream Composition-1k test set, surpassing the previous best methods by 8% and 15% in the SAD and MSE metrics respectively, but also shows stronger generalization ability on other benchmarks. The code is open-sourced for follow-up research and applications at https://github.com/YihanHu-2022/DiffMatte.
Yihan Hu, Yiheng Lin, Wei Wang, Yao Zhao, Yunchao Wei, Humphrey Shi
Agglomerative Token Clustering
Abstract
We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
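Agglomerative token clustering is simple enough to sketch directly; the snippet below merges ViT-style tokens by bottom-up hierarchical clustering and mean-pools each cluster. It is illustrative only: scikit-learn's AgglomerativeClustering and the cosine/average linkage choice stand in for the paper's exact procedure.

```python
import torch
from sklearn.cluster import AgglomerativeClustering

def agglomerative_merge(tokens: torch.Tensor, keep_rate: float) -> torch.Tensor:
    """tokens: [N, D] -> merged tokens: [round(N * keep_rate), D], no learnable parameters."""
    n_keep = max(1, int(round(tokens.shape[0] * keep_rate)))
    labels = AgglomerativeClustering(n_clusters=n_keep, linkage="average",
                                     metric="cosine").fit_predict(tokens.numpy())
    labels = torch.as_tensor(labels)
    # Replace each cluster by the mean of its member tokens.
    return torch.stack([tokens[labels == c].mean(dim=0) for c in range(n_keep)])

tokens = torch.randn(196, 384)          # e.g. one image's patch tokens from a ViT block
print(agglomerative_merge(tokens, keep_rate=0.25).shape)   # -> torch.Size([49, 384])
```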
CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection
Abstract
Point cloud data, representing the precise 3D layout of the scene, quickly drives the research of 3D object detection. However, the challenge arises due to the rapid iteration of 3D sensors, which leads to significantly different distributions in point clouds. This, in turn, results in subpar performance of 3D cross-sensor object detection. This paper introduces a Cross Mechanism Dataset, named CMD, to support research tackling this challenge. CMD is the first domain adaptation dataset, comprehensively encompassing diverse mechanical sensors and various scenes for 3D object detection. In terms of sensors, CMD includes 32-beam LiDAR, 128-beam LiDAR, solid-state LiDAR, 4D millimeter-wave radar, and cameras, all of which are well-synchronized and calibrated. Regarding the scenes, CMD consists of 50 sequences collocated from different scenarios, ranging from campuses to highways. Furthermore, we validated the effectiveness of various domain adaptation methods in mitigating sensor-based domain differences. We also proposed a DIG method to reduce domain disparities from the perspectives of Density, Intensity, and Geometry, which effectively bridges the domain gap between different sensors. The experimental results on the CMD dataset show that our proposed DIG method outperforms the state-of-the-art techniques, demonstrating the effectiveness of our baseline method. The dataset and the corresponding code are available at https://​github.​com/​im-djh/​CMD.
Jinhao Deng, Wei Ye, Hai Wu, Xun Huang, Qiming Xia, Xin Li, Jin Fang, Wei Li, Chenglu Wen, Cheng Wang
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Abstract
Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress of text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on the MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLM-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.
Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
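A rough sketch of a patch-wise cross-modal mix-up in the spirit of PCM: patches that align poorly with a salient concept's text feature are blended towards that text feature. The gating rule and all tensors are illustrative stand-ins for CLIP features, not the paper's exact mechanism:

```python
import torch
import torch.nn.functional as F

P, D = 49, 512
patch_feats  = F.normalize(torch.randn(P, D), dim=-1)     # patch-wise visual features
concept_text = F.normalize(torch.randn(D), dim=-1)        # text feature of a salient concept

sim   = patch_feats @ concept_text                        # [P] image-text similarity per patch
alpha = torch.sigmoid(5 * (sim - sim.mean()))             # high where the patch matches the text

# Low-similarity (presumably defective) patches lean more on the textual feature.
mixed = alpha.unsqueeze(-1) * patch_feats + (1 - alpha).unsqueeze(-1) * concept_text
mixed = F.normalize(mixed, dim=-1)                        # mixed-up feature map fed onward
print(mixed.shape)
```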
ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition
Abstract
3D decomposition/segmentation remains a challenge as large-scale 3D annotated data is not readily available. Existing approaches typically leverage 2D machine-generated segments, integrating them to achieve 3D consistency. In this paper, we propose ClusteringSDF, a novel approach achieving both segmentation and reconstruction in 3D via the neural implicit surface representation, specifically the Signed Distance Function (SDF), where the segmentation rendering is directly integrated with the volume rendering of neural implicit surfaces. Although based on ObjectSDF++, ClusteringSDF no longer requires ground-truth segments for supervision while maintaining the capability of reconstructing individual object surfaces, relying purely on the noisy and inconsistent labels from pre-trained models. As the core of ClusteringSDF, we introduce a highly efficient clustering mechanism for lifting 2D labels to 3D. Experimental results on the challenging scenes from ScanNet and Replica datasets show that ClusteringSDF  can achieve competitive performance compared to the state-of-the-art with significantly reduced training time.
Tianhao Wu, Chuanxia Zheng, Qianyi Wu, Tat-Jen Cham
NAMER: Non-autoregressive Modeling for Handwritten Mathematical Expression Recognition
Abstract
Recently, Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding. Current methods typically approach HMER as an image-to-sequence generation task within an autoregressive (AR) encoder-decoder framework. However, these approaches suffer from several drawbacks: 1) a lack of overall language context, limiting information utilization beyond the current decoding step; 2) error accumulation during AR decoding; and 3) slow decoding speed. To tackle these problems, this paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER. NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD). Initially, the VAT tokenizes visible symbols and local relations at a coarse level. Subsequently, the PGD refines all tokens and establishes connectivities in parallel, leveraging comprehensive visual and linguistic contexts. Experiments on the CROHME 2014/2016/2019 and HME100K datasets demonstrate that NAMER not only outperforms the current state-of-the-art (SOTA) methods on ExpRate by 1.93%/2.35%/1.49%/0.62%, but also achieves significant speedups of 13.7× in decoding time and 6.7× in overall FPS, proving the effectiveness and efficiency of NAMER.
Chenyu Liu, Jia Pan, Jinshui Hu, Baocai Yin, Bing Yin, Mingjun Chen, Cong Liu, Jun Du, Qingfeng Liu
GIVT: Generative Infinite-Vocabulary Transformers
Abstract
We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate vector sequences with real-valued entries instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a β-VAE. In class-conditional image generation, GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.
Michael Tschannen, Cian Eastwood, Fabian Mentzer
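The two GIVT modifications are concrete enough to sketch. Below, a linear projection replaces the token-embedding lookup and a Gaussian-mixture head replaces the vocabulary logits; the toy sizes and the encoder-with-causal-mask stand-in for a decoder-only backbone are assumptions:

```python
import torch
import torch.nn as nn

d_in, d_model, n_mix, L, B = 16, 256, 4, 32, 2

in_proj  = nn.Linear(d_in, d_model)                       # replaces the token-embedding table
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
# Per mixture component: weight logit, mean (d_in), and log-variance (d_in).
gmm_head = nn.Linear(d_model, n_mix * (1 + 2 * d_in))     # replaces the vocabulary logits

x = torch.randn(B, L, d_in)                               # unquantized VAE latent sequence
causal = nn.Transformer.generate_square_subsequent_mask(L)
h = backbone(in_proj(x), mask=causal)

params   = gmm_head(h).view(B, L, n_mix, 1 + 2 * d_in)
weights  = params[..., 0].softmax(dim=-1)                 # [B, L, n_mix]
means    = params[..., 1 : 1 + d_in]                      # [B, L, n_mix, d_in]
log_vars = params[..., 1 + d_in :]                        # [B, L, n_mix, d_in]
print(weights.shape, means.shape, log_vars.shape)
```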
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
Abstract
While existing image-text alignment models reach high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanations of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image and corresponding textual explanations and visual indicators. We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines on both the binary alignment classification and the explanation generation tasks. Our code and human-curated test set are available at https://github.com/MismatchQuest/MismatchQuest.
Brian Gordon, Yonatan Bitton, Yonatan Shafir, Roopal Garg, Xi Chen, Dani Lischinski, Daniel Cohen-Or, Idan Szpektor
Regulating Model Reliance on Non-robust Features by Smoothing Input Marginal Density
Abstract
Trustworthy machine learning necessitates meticulous regulation of model reliance on non-robust features. We propose a framework to delineate and regulate such features by attributing model predictions to the input. Within our approach, robust feature attributions exhibit a certain consistency, while non-robust feature attributions are susceptible to fluctuations. This behavior allows identification of a correlation between model reliance on non-robust features and the smoothness of the marginal density of the input samples. Hence, we uniquely regularize the gradients of the marginal density w.r.t. the input features for robustness. We also devise an efficient implementation of our regularization (our code is available at https://github.com/ypeiyu/input_density_reg) to address the potential numerical instability of the underlying optimization process. Moreover, we analytically reveal that, as opposed to our marginal density smoothing, the prevalent input gradient regularization smooths the conditional or joint density of the input, which can cause limited robustness. Our experiments validate the effectiveness of the proposed method, providing clear evidence of its capability to address the feature leakage problem and mitigate spurious correlations. Extensive results further establish that our technique enables the model to exhibit robustness against perturbations in pixel values, input gradients, and density.
Peiyu Yang, Naveed Akhtar, Mubarak Shah, Ajmal Mian
Multi-modal Video Dialog State Tracking in the Wild
Abstract
We present MST-MIXER – a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major aspects: (1) They either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of real-world in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities, further refining its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.
Adnen Abdessaied, Lei Shi, Andreas Bulling
Factorized Diffusion: Perceptual Illusions by Noise Decomposition
Abstract
Given a factorization of an image into a sum of linear components, we present a zero-shot method to control each individual component through diffusion model sampling. For example, we can decompose an image into low and high spatial frequencies and condition these components on different text prompts. This produces hybrid images, which change appearance depending on viewing distance. By decomposing an image into three frequency subbands, we can generate hybrid images with three prompts. We also use a decomposition into grayscale and color components to produce images whose appearance changes when they are viewed in grayscale, a phenomenon that naturally occurs under dim lighting. And we explore a decomposition by a motion blur kernel, which produces images that change appearance under motion blurring. Our method works by denoising with a composite noise estimate, built from the components of noise estimates conditioned on different prompts. We also show that for certain decompositions, our method recovers prior approaches to compositional generation and spatial control. Finally, we show that we can extend our approach to generate hybrid images from real images. We do this by holding one component fixed and generating the remaining components, effectively solving an inverse problem.
Daniel Geng, Inbum Park, Andrew Owens
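A sketch of the composite-noise-estimate idea for the low/high-frequency hybrid case: the low-pass component of one prompt's noise estimate is combined with the high-pass component of another's. The average-pooling blur and the dummy noise predictor are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def low_pass(x, k=9):                      # crude blur via average pooling (stand-in filter)
    return F.avg_pool2d(x, k, stride=1, padding=k // 2, count_include_pad=False)

def eps_model(x_t, prompt_id):             # stand-in for a text-conditioned noise predictor
    torch.manual_seed(prompt_id)
    return x_t * 0.1 + torch.randn_like(x_t) * 0.01

x_t = torch.randn(1, 3, 64, 64)            # current noisy latent/image

eps_a = eps_model(x_t, prompt_id=0)        # noise estimate conditioned on "prompt A"
eps_b = eps_model(x_t, prompt_id=1)        # noise estimate conditioned on "prompt B"

# Low frequencies follow prompt A, high frequencies follow prompt B.
eps_composite = low_pass(eps_a) + (eps_b - low_pass(eps_b))
print(eps_composite.shape)                 # used in place of a single estimate in the sampler
```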
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy to Generate Unsafe Images ... For Now
Abstract
The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigate the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic classification abilities of DMs to simplify the creation of adversarial prompts, thereby eliminating the need for auxiliary classification or diffusion models. Through extensive benchmarking, we evaluate the robustness of widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. Our results demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the state-of-the-art adversarial prompt generation method and reveal the lack of robustness of current safety-driven unlearning techniques when applied to DMs. Code is available at https://github.com/OPTML-Group/Diffusion-MU-Attack.
WARNING: There exist AI generations that may be offensive.
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
Abstract
Large multimodal models (LMMs) excel in adhering to human instructions. However, self-contradictory instructions may arise due to the increasing trend of multimodal interaction and context length, which is challenging for language beginners and vulnerable populations. We introduce the Self-Contradictory Instructions benchmark to evaluate the capability of LMMs in recognizing conflicting commands. It comprises 20,000 conflicts, evenly distributed between language and vision paradigms. It is constructed by a novel automatic dataset creation framework, which expedites the process and enables us to encompass a wide range of instruction forms. Our comprehensive evaluation reveals that current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. Hence, we propose Cognitive Awakening Prompting to inject cognition from external sources, greatly enhancing dissonance detection. Our website, dataset, and code are available online.
Jin Gao, Lei Gan, Yuankai Li, Yixin Ye, Dequan Wang
StereoGlue: Robust Estimation with Single-Point Solvers
Abstract
We propose StereoGlue, a method designed for joint feature matching and robust estimation that effectively reduces the combinatorial complexity of these tasks using single-point minimal solvers. StereoGlue is applicable to a range of problems, including but not limited to relative pose and homography estimation, determining absolute pose with 2D-3D correspondences, and estimating 3D rigid transformations between point clouds. StereoGlue starts with a set of one-to-many tentative correspondences, iteratively forms tentative matches, and estimates the minimal sample model. This model then facilitates guided matching, leading to consistent one-to-one matches, whose number serves as the model score. StereoGlue is superior to the state-of-the-art robust estimators on real-world datasets on multiple problems, improving upon a number of recent feature detectors and matchers. Additionally, it shows improvements in point cloud matching and absolute camera pose estimation. The code is available at https://github.com/danini/stereoglue.
Daniel Barath, Dmytro Mishkin, Luca Cavalli, Paul-Edouard Sarlin, Petr Hruby, Marc Pollefeys
Boosting Transferability in Vision-Language Attacks via Diversification Along the Intersection Region of Adversarial Trajectory
Abstract
Vision-language pre-training (VLP) models exhibit remarkable capabilities in comprehending both images and text, yet they remain susceptible to multimodal adversarial examples (AEs). Strengthening attacks and uncovering vulnerabilities, especially common issues in VLP models (e.g., highly transferable AEs), can advance reliable and practical VLP models. A recent work (i.e., the Set-level guidance attack) indicates that augmenting image-text pairs to increase AE diversity along the optimization path enhances the transferability of adversarial examples significantly. However, this approach predominantly emphasizes diversity around the online adversarial examples (i.e., AEs in the optimization period), leading to the risk of overfitting the victim model and affecting the transferability. In this study, we posit that the diversity of adversarial examples towards the clean input and online AEs are both pivotal for enhancing transferability across VLP models. Consequently, we propose using diversification along the intersection region of the adversarial trajectory to expand the diversity of AEs. To fully leverage the interaction between modalities, we introduce text-guided adversarial example selection during optimization. Furthermore, to further mitigate potential overfitting, we direct the adversarial text to deviate from the last intersection region along the optimization path, rather than the adversarial images as in existing methods. Extensive experiments affirm the effectiveness of our method in improving transferability across various VLP models and downstream vision-and-language tasks. Code is available at https://github.com/SensenGao/VLPTransferAttack.
Sensen Gao, Xiaojun Jia, Xuhong Ren, Ivor Tsang, Qing Guo
Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction
Abstract
In autonomous driving, the high-definition (HD) map plays a crucial role in localization and planning. Recently, several methods have facilitated end-to-end online map construction in DETR-like frameworks. However, little attention has been paid to exploring the potential of the query mechanism for map elements. This paper introduces MapQR, an end-to-end method with an emphasis on enhancing query capabilities for constructing online vectorized maps. To probe desirable information efficiently, MapQR utilizes a novel query design, called scatter-and-gather query, which is explicitly modelled by separate content and position parts. The base map instance queries are scattered to different reference points and added with positional embeddings to probe information from BEV features. Then these scattered queries are gathered back to enhance information within each map instance. Together with a simple and effective improvement of a BEV encoder, the proposed MapQR achieves the best mean average precision (mAP) and maintains good efficiency on both nuScenes and Argoverse 2. In addition, integrating our query design into other models can boost their performance significantly. The source code is available at https://github.com/HXMap/MapQR.
Zihao Liu, Xiaoyu Zhang, Guangwei Liu, Ji Zhao, Ningyi Xu
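A rough sketch of a scatter-and-gather query: one content query per map instance is scattered to several reference points, each copy gets a positional embedding and probes BEV features, and the copies are pooled back per instance. Plain cross-attention and the learned positional embeddings stand in for the paper's actual BEV probing, and all shapes are assumptions:

```python
import torch
import torch.nn as nn

n_inst, n_pts, D, n_bev = 10, 20, 256, 50 * 25

content = torch.randn(1, n_inst, D)                         # one content query per map element
ref_pos_embed = nn.Parameter(torch.randn(1, n_inst, n_pts, D) * 0.02)   # per reference point

scattered = content.unsqueeze(2) + ref_pos_embed            # scatter: [1, n_inst, n_pts, D]
scattered = scattered.view(1, n_inst * n_pts, D)

bev = torch.randn(1, n_bev, D)                              # flattened BEV features
attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
probed, _ = attn(query=scattered, key=bev, value=bev)       # each copy probes the BEV map

gathered = probed.view(1, n_inst, n_pts, D).mean(dim=2)     # gather back per instance
print(gathered.shape)                                       # [1, n_inst, D] enhanced instance queries
```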
Robust Zero-Shot Crowd Counting and Localization With Adaptive Resolution SAM
Abstract
Existing crowd counting models require extensive training data, which is time-consuming to annotate. To tackle this issue, we propose a simple yet effective crowd counting method that utilizes the Segment-Everything-Everywhere Model (SEEM), an adaptation of the Segment Anything Model (SAM), to generate pseudo-labels for training crowd counting models. However, our initial investigation reveals that SEEM’s performance in dense crowd scenes is limited, primarily due to the omission of many persons in high-density areas. To overcome this limitation, we propose an adaptive resolution SEEM to handle the scale variations, occlusions, and overlapping of people within crowd scenes. Alongside this, we introduce a robust localization method, based on Gaussian Mixture Models, for predicting the head positions in the predicted people masks. Given the mask and point pseudo-labels, we propose a robust loss function, which is designed to exclude uncertain regions based on SEEM’s predictions, thereby enhancing the training process of the counting network. Finally, we propose an iterative method for generating pseudo-labels. This method aims at improving the quality of the segmentation masks by identifying more tiny persons in high-density regions, which are often missed in the first pseudo-labeling iteration. Overall, our proposed method achieves the best unsupervised performance in crowd counting, while also being comparable to some classic fully supervised methods. This makes it a highly effective and versatile tool for crowd counting, especially in situations where labeled data is not available.
Jia Wan, Qiangqiang Wu, Wei Lin, Antoni Chan
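An illustrative fragment of the GMM-based localization step: a Gaussian Mixture Model is fit to the foreground pixel coordinates of a toy mask and its component means are read off as point pseudo-labels. The toy mask and the fixed component count are assumptions; the actual method operates on SEEM's predicted people masks:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

H, W = 64, 96
mask = np.zeros((H, W), dtype=bool)
mask[10:30, 10:25] = True                          # toy "person" blobs standing in for SEEM output
mask[35:60, 50:70] = True

coords = np.stack(np.nonzero(mask), axis=1).astype(float)   # (row, col) of mask pixels

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(coords)
points = gmm.means_                                 # candidate point pseudo-labels
print(np.round(points, 1))
```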
Backmatter
Metadata
Title
Computer Vision – ECCV 2024
Editors
Aleš Leonardis
Elisa Ricci
Stefan Roth
Olga Russakovsky
Torsten Sattler
Gül Varol
Copyright Year
2025
Electronic ISBN
978-3-031-72998-0
Print ISBN
978-3-031-72997-3
DOI
https://doi.org/10.1007/978-3-031-72998-0
