
2025 | Book

Computer Vision – ECCV 2024

18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXXI

Edited by: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher: Springer Nature Switzerland

Book series: Lecture Notes in Computer Science


About this book

The multi-volume set of LNCS volumes 15059 through 15147 constitutes the refereed proceedings of the 18th European Conference on Computer Vision, ECCV 2024, held in Milan, Italy, during September 29–October 4, 2024.

The 2387 papers presented in these proceedings were carefully reviewed and selected from a total of 8585 submissions. They deal with topics such as computer vision; machine learning; deep neural networks; reinforcement learning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3D reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; motion estimation.

Table of Contents

Frontmatter
Few-Shot Class Incremental Learning with Attention-Aware Self-adaptive Prompt
Abstract
Few-Shot Class-Incremental Learning (FSCIL) models aim to incrementally learn new classes with scarce samples while preserving knowledge of old ones. Existing FSCIL methods usually fine-tune the entire backbone, leading to overfitting and hindering the potential to learn new classes. On the other hand, recent prompt-based CIL approaches alleviate forgetting by training prompts with sufficient data in each task. In this work, we propose a novel framework named Attention-aware Self-adaptive Prompt (ASP). ASP encourages task-invariant prompts to capture shared knowledge by reducing task-specific information at the attention level. Additionally, self-adaptive task-specific prompts in ASP provide specific information and transfer knowledge from old classes to new classes with an Information Bottleneck learning objective. In summary, ASP prevents overfitting on the base task and does not require enormous data in few-shot incremental tasks. Extensive experiments on three benchmark datasets validate that ASP consistently outperforms state-of-the-art FSCIL and prompt-based CIL methods in terms of both learning new classes and mitigating forgetting. Source code is available at https://github.com/DawnLIU35/FSCIL-ASP.
Chenxi Liu, Zhenyi Wang, Tianyi Xiong, Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Abstract
In this study, we identify the inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach than that used for textual data. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV’s ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance on a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. FastV can compress the FLOPs of a 13B-parameter model below the cost of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for the deployment of LVLMs on edge devices and in commercial models. Code is released at github.com/pkunlp-icler/FastV.
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
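The abstract above describes pruning visual tokens after an early layer according to how much attention they receive. Below is a minimal, hedged sketch of that idea; the function name, the averaged attention-score input, and the fixed keep ratio are illustrative assumptions, not FastV's actual implementation.

```python
# Hedged sketch of attention-based visual-token pruning in the spirit of FastV.
# Assumptions (not from the paper's released code): `attn` is the attention mass each
# token receives at the chosen early layer (averaged over heads and queries),
# `vis_start`/`vis_end` delimit the visual-token span, and `keep_ratio` sets how many
# visual tokens survive into the deeper layers.
import torch

def prune_visual_tokens(hidden: torch.Tensor,
                        attn: torch.Tensor,
                        vis_start: int,
                        vis_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """hidden: (seq_len, dim) hidden states after the early layer.
    attn: (seq_len,) attention received per token. Returns hidden states with
    low-attention visual tokens dropped; text tokens are left untouched."""
    vis_scores = attn[vis_start:vis_end]
    k = max(1, int(keep_ratio * vis_scores.numel()))
    keep_local = torch.topk(vis_scores, k).indices.sort().values  # keep original token order
    device = hidden.device
    keep = torch.cat([
        torch.arange(0, vis_start, device=device),            # tokens before the image span
        (keep_local + vis_start).to(device),                   # retained visual tokens
        torch.arange(vis_end, hidden.size(0), device=device),  # tokens after the image span
    ])
    return hidden[keep]
```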
Generalizable Symbolic Optimizer Learning
Abstract
Existing automated symbolic optimizer design methods necessitate the use of proxies, often resulting in significant performance degradation when transferring to a target domain. In this paper, we propose a learning-based model called Symbolic Optimizer Learner (SOL) that can discover high-performance symbolic optimizers directly on the target. SOL is integrated with symbols and can be directly transformed into a symbolic optimizer. In addition, an unrolled optimization approach is introduced for SOL training. SOL can be embedded into the training process of neural networks, optimizing the target directly without any proxies. Our extensive experiments demonstrate the strong performance and high generalizability of SOL across diverse tasks, ranging from image classification to adversarial attacks, and from GNN to NLP tasks. On image classification, SOL achieved a ~5× speedup and a ~3% accuracy gain. On adversarial attacks, SOL achieved the best attack success rate across seven SOTA defense models. On GNN training, SOL-discovered optimizers outperform Adam on three different datasets. On BERT fine-tuning, SOL also outperformed AdamW on five benchmarks. The source code is available at https://github.com/songxt3/SOL.
Xiaotian Song, Peng Zeng, Yanan Sun, Andy Song
Online Continuous Generalized Category Discovery
Abstract
With the advancement of deep neural networks in computer vision, artificial intelligence (AI) is widely employed in real-world applications. However, AI still faces limitations in mimicking high-level human capabilities, such as novel category discovery, for practical use. While some methods utilizing offline continual learning have been proposed for novel category discovery, they neglect the continuity of data streams in real-world settings. In this work, we introduce Online Continuous Generalized Category Discovery (OCGCD), which considers the dynamic nature of data streams where data can be created and deleted in real time. Additionally, we propose a novel method, DEAN (Discovery via Energy guidance and feature AugmentatioN), which can discover novel categories in an online manner through energy-guided discovery and facilitate discriminative learning via an energy-based contrastive loss. Furthermore, DEAN effectively pseudo-labels unlabeled data through variance-based feature augmentation. Experimental results demonstrate that our proposed DEAN achieves outstanding performance in the proposed OCGCD scenario. The implementation code is available at https://github.com/KHU-AGI/OCGCD.
Keon-Hee Park, Hakyung Lee, Kyungwoo Song, Gyeong-Moon Park
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
Abstract
Text-to-image generation has made significant advancements with the introduction of text-to-image diffusion models. These models typically consist of a language model that interprets user prompts and a vision model that generates corresponding images. As language and vision models continue to progress in their respective domains, there is great potential in replacing components of text-to-image diffusion models with more advanced counterparts. A broader research objective is therefore to investigate the integration of any two unrelated language and generative vision models for text-to-image generation. In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and plug-and-play approach without requiring modifications to the original weights of the language and vision models. Our pipeline is compatible with various language models and generative vision models, accommodating different structures. Within this framework, we demonstrate that incorporating superior modules, such as more advanced language models or generative vision models, yields notable improvements in capabilities like text alignment or image quality. Extensive evaluations have been conducted to verify the effectiveness of LaVi-Bridge. Code is available at https://github.com/ShihaoZhaoZSH/LaVi-Bridge.
Shihao Zhao, Shaozhe Hao, Bojia Zi, Huaizhe Xu, Kwan-Yee K. Wong
Tackling Structural Hallucination in Image Translation with Local Diffusion
Abstract
Recent developments in diffusion models have advanced conditioned image generation, yet they struggle to reconstruct out-of-distribution (OOD) images, such as unseen tumors in medical images, causing “image hallucination” and risking misdiagnosis. We hypothesize that such hallucinations result from local OOD regions in the conditional images. We verify that partitioning the OOD region and conducting separate image generations alleviates hallucinations in several applications. From this, we propose a training-free diffusion framework that reduces hallucination with multiple Local Diffusion processes. Our approach involves OOD estimation followed by two modules: a “branching” module that generates locally both within and outside OOD regions, and a “fusion” module that integrates these predictions into one. Our evaluation shows that our method mitigates hallucination over baseline models quantitatively and qualitatively, reducing misdiagnosis by 40% and 25% on real-world medical and natural image datasets, respectively. It also demonstrates compatibility with various pre-trained diffusion models. Code is available at https://github.com/edshkim98/LocalDiffusion-Hallucination.
Seunghoi Kim, Chen Jin, Tom Diethe, Matteo Figini, Henry F. J. Tregidgo, Asher Mullokandov, Philip Teare, Daniel C. Alexander
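A minimal, heavily simplified sketch of the “branching then fusion” idea in the abstract above: run separate local generations inside and outside the estimated OOD region, then merge them. The `diffuse` callable, the mask source, and the masking scheme are placeholders, not the paper's actual modules.

```python
# Sketch only: branch the conditional generation by the estimated OOD mask, then fuse.
import torch

def branch_and_fuse(diffuse, condition: torch.Tensor, ood_mask: torch.Tensor) -> torch.Tensor:
    """condition: (1, C, H, W) conditional image; ood_mask: (1, 1, H, W) with values in {0, 1}."""
    gen_ood = diffuse(condition * ood_mask)         # branching: generate within the OOD region
    gen_id = diffuse(condition * (1 - ood_mask))    # branching: generate outside the OOD region
    return ood_mask * gen_ood + (1 - ood_mask) * gen_id   # fusion into a single prediction
```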
Hierarchical Separable Video Transformer for Snapshot Compressive Imaging
Abstract
Transformers have achieved state-of-the-art performance on the inverse problem of video Snapshot Compressive Imaging (SCI), whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack insight into this degradation and thus have limited performance and efficiency. In this work, we tailor an efficient reconstruction architecture without temporal aggregation in early layers and with the Hierarchical Separable Video Transformer (HiSViT) as its building block. HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections, each of which operates on a separate channel portion at a different scale, enabling multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias of attending more within frames than between frames while saving computational overhead. GSM-FFN further enhances locality via a gated mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by >0.5 dB with comparable or fewer parameters and lower complexity. The source code and pretrained models are released at https://github.com/pwangcs/HiSViT.
Ping Wang, Yulun Zhang, Lishun Wang, Xin Yuan
Unified Medical Image Pre-training in Language-Guided Common Semantic Space
Abstract
Vision-Language Pre-training (VLP) has shown its merits for analysing medical images. It efficiently learns visual representations by leveraging supervision from the corresponding reports, and in turn facilitates analysis and interpretation of intricate imaging data. However, this has predominantly been validated on single-modality data (mostly 2D images like X-rays); adapting VLP to learn unified representations for medical images in real-world scenarios remains an open challenge. This is because medical images span a variety of modalities, especially modalities with different dimensions (e.g., 3D images like Computed Tomography), and paired data across dimensions are scarce. To overcome these challenges, we propose a Unified Medical Image Pre-training framework, UniMedI, which uses diagnostic reports as a common semantic space to create unified representations for diverse modalities of medical images (especially 2D and 3D images). Under the text’s guidance, UniMedI effectively selects text-related 2D slices from complex 3D volumes, which act as pseudo-pairs bridging 2D and 3D data and ultimately enhancing consistency across medical imaging modalities. To demonstrate the effectiveness and versatility of UniMedI, we evaluate its performance on both 2D and 3D images across several datasets, covering a wide range of medical image tasks such as classification, segmentation, and retrieval. UniMedI demonstrates superior performance on downstream tasks, showcasing its effectiveness in establishing a universal medical visual representation.
Xiaoxuan He, Yifan Yang, Xinyang Jiang, Xufang Luo, Haoji Hu, Siyun Zhao, Dongsheng Li, Yuqing Yang, Lili Qiu
On the Vulnerability of Skip Connections to Model Inversion Attacks
Abstract
Skip connections are fundamental architectural designs in modern deep neural networks (DNNs) such as CNNs and ViTs. While they help improve model performance significantly, we identify a vulnerability associated with skip connections to Model Inversion (MI) attacks, a type of privacy attack that aims to reconstruct private training data through abusive exploitation of a model. In this paper, as pioneering work towards understanding how DNN architectures affect MI, we study the impact of skip connections on MI. We make the following discoveries: 1) Skip connections reinforce MI attacks and compromise data privacy. 2) Skip connections in the last stage are the most critical to attacks. 3) RepVGG, an approach that removes skip connections from inference-time architectures, cannot mitigate the vulnerability to MI attacks. 4) Based on our findings, we propose MI-resilient architecture designs for the first time. Without bells and whistles, we show in extensive experiments that our MI-resilient architectures can outperform state-of-the-art (SOTA) defense methods in MI robustness. Furthermore, our MI-resilient architectures are complementary to existing MI defense methods. Our project is available at https://Pillowkoh.github.io/projects/RoLSS/.
Koh Jun Hao, Sy-Tuyen Ho, Ngoc-Bao Nguyen, Ngai-Man Cheung
Adversarial Robustification via Text-to-Image Diffusion Models
Abstract
Adversarial robustness has conventionally been considered a challenging property to encode into neural networks, requiring plenty of training data. In the recent paradigm of adopting off-the-shelf models, however, access to their training data is often infeasible or impractical, while most such models were not originally trained with adversarial robustness in mind. In this paper, we develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data. Our intuition is to view recent text-to-image diffusion models as “adaptable” denoisers that can be optimized for target tasks. Based on this, we propose: (a) initiating a denoise-and-classify pipeline that offers provable guarantees against adversarial attacks, and (b) leveraging a few synthetic reference images generated by the text-to-image model to enable novel adaptation schemes. Our experiments show that our data-free scheme, applied to the pre-trained CLIP, improves the (provable) adversarial robustness of its diverse zero-shot classification derivatives (while maintaining their accuracy), significantly surpassing prior approaches that use the full training data. Beyond CLIP, we also demonstrate that our framework is easily applicable to robustifying other visual classifiers efficiently. Code is available at https://github.com/ChoiDae1/robustify-T2I.
Daewon Choi, Jongheon Jeong, Huiwon Jang, Jinwoo Shin
Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection
Abstract
Selecting proper clients to participate in each federated learning (FL) round is critical to effectively harness a broad range of distributed data. Existing client selection methods simply consider the mining of distributed uni-modal data, but their effectiveness may diminish in multi-modal FL (MFL), as the modality imbalance problem not only impedes collaborative local training but also leads to a severe global modality-level bias. We empirically reveal that local training with a certain single modality may contribute more to the global model than training with all local modalities. To effectively exploit the distributed multiple modalities, we propose a novel Balanced Modality Selection framework for MFL (BMSFed) to overcome the modal bias. On the one hand, we introduce a modal enhancement loss during local training to alleviate local imbalance based on the aggregated global prototypes. On the other hand, we propose a modality selection scheme that selects subsets of local modalities with high diversity while simultaneously achieving global modal balance. Our extensive experiments on audio-visual, colored-gray, and front-back datasets showcase the superiority of BMSFed over baselines and its effectiveness in multi-modal data exploitation.
Yunfeng Fan, Wenchao Xu, Haozhao Wang, Fushuo Huo, Jinyu Chen, Song Guo
Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector
Abstract
As deep vision models’ popularity rapidly increases, there is a growing emphasis on explanations for model predictions. Inherently explainable attribution methods aim to enhance the understanding of model behavior by identifying the important regions in images that significantly contribute to predictions. This is achieved by cooperatively training a selector (generating an attribution map to identify important features) and a predictor (making predictions using the identified features). Despite many advancements, existing methods suffer from the incompleteness problem, where discriminative features are masked out, and the interlocking problem, where the non-optimized selector initially selects noise, causing the predictor to fit on this noise and perpetuate the cycle. To address these problems, we introduce a new objective that discourages the presence of discriminative features in the masked-out regions, thus enhancing the comprehensiveness of feature selection. A pre-trained detector is introduced to detect discriminative features in the masked-out region. If the selector selects noise instead of discriminative features, the detector can observe this and break the interlocking situation by penalizing the selector. Extensive experiments show that our model makes predictions with higher accuracy than the regular black-box model, and produces attribution maps with high feature coverage, localization ability, fidelity, and robustness. Our code will be available at https://github.com/Zood123/COMET.
Xianren Zhang, Dongwon Lee, Suhang Wang
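An illustrative sketch of the objective described in the abstract above: the predictor is trained on the selected region, while a pre-trained detector penalizes class evidence left in the masked-out region, discouraging incomplete masks. The loss weight and the exact form of the penalty are assumptions, not necessarily the paper's formulation.

```python
# Sketch of a selector/predictor objective with a detector penalty on the masked-out region.
import torch
import torch.nn.functional as F

def attribution_loss(predictor, detector, image, mask, label, lam=1.0):
    """mask: (batch, 1, H, W) attribution map in [0, 1] produced by the selector.
    `detector` is assumed frozen (requires_grad_(False)); gradients still flow through
    its input, so the selector is penalized for leaving evidence outside the mask."""
    cls_loss = F.cross_entropy(predictor(image * mask), label)
    det_probs = detector(image * (1.0 - mask)).softmax(dim=-1)
    leak = det_probs.gather(1, label.unsqueeze(1)).mean()  # true-class evidence outside the mask
    return cls_loss + lam * leak
```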
Reinforcement Learning via Auxiliary Task Distillation
Abstract
We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to solve long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward, without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves 2.3× higher success than the previous state-of-the-art baseline on the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.
Abhinav Narayan Harish, Larry Heck, Josiah P. Hanna, Zsolt Kira, Andrew Szot
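A hedged sketch of a weighted behavior-distillation loss in the spirit of the abstract above: the main-task policy is pulled toward the action distributions of auxiliary-task policies. The per-task weights and the KL direction here are illustrative assumptions, not the paper's exact loss.

```python
# Sketch: weighted KL distillation from auxiliary-task policies to the main-task policy.
import torch
import torch.nn.functional as F

def aux_distill_loss(main_logits, aux_logits_list, weights):
    """main_logits: (batch, n_actions) from the main-task policy.
    aux_logits_list: list of (batch, n_actions) tensors from auxiliary-task policies.
    weights: list of scalars reflecting each auxiliary task's relevance to the main task."""
    log_p_main = F.log_softmax(main_logits, dim=-1)
    loss = main_logits.new_zeros(())
    for w, aux_logits in zip(weights, aux_logits_list):
        p_aux = F.softmax(aux_logits.detach(), dim=-1)  # auxiliary teacher is not updated
        loss = loss + w * F.kl_div(log_p_main, p_aux, reduction="batchmean")
    return loss
```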
DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation
Abstract
Weakly-supervised semantic segmentation (WSS) ensures high-quality segmentation with limited data and excels when employed as input seed masks for large-scale vision models such as Segment Anything. However, WSS faces challenges with minor classes, since these are overlooked in images containing multiple adjacent classes, a limitation originating from the overfitting of traditional expansion methods like Random Walk. We first address this by employing unsupervised and weakly-supervised feature maps instead of conventional methodologies, allowing for hierarchical mask enhancement. This method distinctly categorizes higher-level classes and subsequently separates their associated lower-level classes, ensuring all classes are correctly restored in the mask without losing minor ones. Our approach, validated through extensive experimentation, significantly improves WSS across five benchmarks (VOC: 79.8%, COCO: 53.9%, Context: 49.0%, ADE: 32.9%, Stuff: 37.4%), reducing the gap with fully supervised methods by over 84% on the VOC validation set. Code is available at https://github.com/shjo-april/DHR.
Sanghyun Jo, Fei Pan, In-Jae Yu, Kyungsu Kim
Pre-trained Visual Dynamics Representations for Efficient Policy Learning
Abstract
Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and inherently contain a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder the use of videos for RL pre-training. To address this challenge, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. By adopting video prediction as the pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations. The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective form of video pre-training for promoting policy learning.
Hao Luo, Bohan Zhou, Zongqing Lu
View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields
Abstract
Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of “coarse” or “fine” granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales simply by using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which, unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint consistency. We additionally provide qualitative examples of our model’s 3D hierarchical segmentations in real-world scenes. The code and dataset are available at https://github.com/hardyho/ultrametric_feature_fields.
Haodi He, Colton Stearns, Adam W. Harley, Leonidas J. Guibas
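A minimal sketch of how thresholding distances in an ultrametric feature space reveals a segmentation hierarchy, as described in the abstract above. Single-linkage clustering (whose merge distances form an ultrametric) stands in for the learned feature field; the random features are a placeholder.

```python
# Sketch: coarse-to-fine segment labels from distance thresholds in an ultrametric space.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_labels(features: np.ndarray, thresholds) -> dict:
    """features: (n_points, d) features rendered for 3D points or pixels.
    Returns {threshold: per-point segment labels}; larger thresholds merge the finer
    segments produced by smaller ones, giving a consistent hierarchy."""
    Z = linkage(features, method="single")  # single linkage induces an ultrametric
    return {t: fcluster(Z, t=t, criterion="distance") for t in thresholds}

# Example with placeholder features: coarse, medium, and fine groupings.
labels = hierarchical_labels(np.random.rand(200, 8), thresholds=[0.6, 0.3, 0.1])
```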
Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception
Abstract
Sharing intermediate neural features enables agents to effectively see through occlusions. Due to agent diversity, some pioneering works have studied domain adaptation for heterogeneous neural features. Nevertheless, these works all partially replace agents’ private neural networks with newly trained components, which breaks the model integrity and bidirectional compatibility of agents. In this paper, we consider an open challenge: how to learn non-destructive domain adapters for heterogeneous legacy models to achieve collaborative perception, while remaining compatible with continually emerging new agent models? To overcome this challenge, we propose the first plug-and-play domain adapter (PnPDA) for heterogeneous collaborative perception. PnPDA builds a semantic calibrator based on contrastive learning to supervise domain-gap bridging without altering the original models. A semantic converter is learned to transform the semantic space of features, while a semantic enhancer is used to enhance feature representations. By specifying standard semantics, new models with PnPDA can easily join existing collaborations. Extensive experiments on the OPV2V dataset show that PnPDA non-destructively bridges the domain gap and outperforms SOTA by 9.13%. The code is available at https://github.com/luotianyou349/PnPDA.
Tianyou Luo, Quan Yuan, Guiyang Luo, Yuchen Xia, Yujia Yang, Jinglin Li
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models
Abstract
Video Anomaly Detection (VAD) is crucial for applications such as security surveillance and autonomous driving. However, existing VAD methods provide little rationale behind detection, hindering public trust in real-world deployments. In this paper, we approach VAD with a reasoning framework. Although Large Language Models (LLMs) have shown revolutionary reasoning ability, we find that their direct use falls short for VAD. Specifically, the implicit knowledge pre-trained in LLMs focuses on general context and thus may not apply to every specific real-world VAD scenario, leading to inflexibility and inaccuracy. To address this, we propose AnomalyRuler, a novel rule-based reasoning framework for VAD with LLMs. AnomalyRuler comprises two main stages: induction and deduction. In the induction stage, the LLM is fed few-shot normal reference samples and then summarizes these normal patterns to induce a set of rules for detecting anomalies. The deduction stage follows the induced rules to spot anomalous frames in test videos. Additionally, we design rule aggregation, perception smoothing, and robust reasoning strategies to further enhance AnomalyRuler’s robustness. AnomalyRuler is the first reasoning approach for the one-class VAD task, which requires only few-normal-shot prompting without full-shot training, thereby enabling fast adaptation to various VAD scenarios. Comprehensive experiments across four VAD benchmarks demonstrate AnomalyRuler’s state-of-the-art detection performance and reasoning ability. AnomalyRuler is open-source and available at https://github.com/Yuchen413/AnomalyRuler.
Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, Shao-Yuan Lo
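A hedged sketch of the induction/deduction flow described in the abstract above. The `llm` callable and the prompt wording are placeholders, not AnomalyRuler's actual prompts or API; rule aggregation, perception smoothing, and robust reasoning are omitted.

```python
# Sketch: induce rules from few normal samples, then deduce per-frame anomaly judgments.
from typing import Callable, List

def induce_rules(llm: Callable[[str], str], normal_descriptions: List[str]) -> str:
    """Induction: summarize few-shot normal frame descriptions into normality/anomaly rules."""
    prompt = (
        "The following descriptions come from normal video frames:\n"
        + "\n".join(f"- {d}" for d in normal_descriptions)
        + "\nSummarize a numbered list of rules describing what is normal and what would "
          "count as an anomaly in this scene."
    )
    return llm(prompt)

def deduce_is_anomaly(llm: Callable[[str], str], rules: str, frame_description: str) -> bool:
    """Deduction: judge a test frame against the induced rules."""
    prompt = (
        f"Rules:\n{rules}\n\nFrame description: {frame_description}\n"
        "Answer 'anomaly' or 'normal' and cite the rule you used."
    )
    return "anomaly" in llm(prompt).lower()
```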
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
Abstract
We introduce SAM4MLLM, an innovative approach that integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that can effectively find prompt points for SAM to perform segmentation based on the MLLM. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner without additional computational overhead in learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.
Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, Chu-Song Chen
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
Abstract
We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP embeddings prioritizing one specific tag in image-text relationships. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. We first extract all image-relevant tags from text based on their similarity to the nearest pixels. Then, we distill a combined mask containing the extracted tags’ content to a text-derived mask. This approach ensures the unbiased image-text alignment of CLIP-based models using only image-text pairs, without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code and data are available at https://github.com/shjo-april/TTD.
Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim
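A hedged sketch of the first step described in the abstract above: keep only the tags whose text embedding matches at least one image patch strongly. The embedding sources and the similarity threshold are assumptions, not the paper's exact recipe.

```python
# Sketch: filter candidate tags by their nearest-pixel (patch) similarity.
import torch
import torch.nn.functional as F

def select_relevant_tags(tag_embeds: torch.Tensor,
                         patch_embeds: torch.Tensor,
                         tags: list,
                         threshold: float = 0.25) -> list:
    """tag_embeds: (n_tags, d) text embeddings of candidate tags extracted from the caption.
    patch_embeds: (n_patches, d) dense image embeddings from the same model.
    Returns the tags whose best nearest-patch cosine similarity exceeds the threshold."""
    tag_embeds = F.normalize(tag_embeds, dim=-1)
    patch_embeds = F.normalize(patch_embeds, dim=-1)
    sim = tag_embeds @ patch_embeds.T   # (n_tags, n_patches) cosine similarities
    best = sim.max(dim=-1).values       # nearest-pixel score per tag
    return [t for t, s in zip(tags, best.tolist()) if s > threshold]
```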
Learning Quantized Adaptive Conditions for Diffusion Models
Abstract
The curvature of ODE trajectories in diffusion models hinders their ability to generate high-quality images within a small number of function evaluations (NFE). In this paper, we propose a novel and effective approach to reduce trajectory curvature by utilizing adaptive conditions. By employing an extremely lightweight quantized encoder, our method adds only 1% more training parameters and eliminates the need for extra regularization terms, yet achieves significantly better sample quality. Our approach accelerates ODE sampling while preserving the downstream image-editing capabilities of SDE techniques. Extensive experiments verify that our method can generate high-quality results under extremely limited sampling costs. With only 6 NFE, we achieve 5.14 FID on CIFAR-10, 6.91 FID on FFHQ 64×64, and 3.10 FID on AFHQv2.
Yuchen Liang, Yuchan Tian, Lei Yu, Huaao Tang, Jie Hu, Xiangzhong Fang, Hanting Chen
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay
Abstract
Test-time adaptation (TTA) aims to address the distribution shift between training and test data using only unlabeled data at test time. Existing TTA methods often focus on improving recognition performance specifically for test data associated with classes in the training set. However, during open-world inference, there are inevitably test instances from unknown classes, commonly referred to as outliers. This paper addresses the problem of performing both sample recognition and outlier rejection during inference when outliers exist. To address this problem, we propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch. In particular, the memory bank is dynamically updated by selecting low-entropy and label-consistent samples in a class-balanced manner. In addition, we develop a self-weighted entropy minimization strategy that assigns higher weight to low-entropy samples. Extensive results demonstrate that STAMP outperforms existing TTA methods in terms of both recognition and outlier detection performance. The code is released at https://github.com/yuyongcan/STAMP.
Yongcan Yu, Lijun Sheng, Ran He, Jian Liang
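An illustrative sketch of a self-weighted entropy-minimization objective as described in the abstract above: lower-entropy (more confident) samples replayed from the memory bank receive larger weights. The exact weighting function (exp(-entropy)) is an assumption, not necessarily the one used by STAMP.

```python
# Sketch: per-sample entropy weighted by the sample's own confidence.
import torch

def self_weighted_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, n_classes) predictions for samples drawn from the memory bank."""
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-sample entropy
    weights = torch.exp(-entropy).detach()                         # low entropy -> high weight
    return (weights * entropy).sum() / weights.sum()
```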
RePLAy: Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry
Abstract
3D sensing is a fundamental task for autonomous vehicles (AVs). Its deployment often relies on aligned RGB cameras and LiDAR. Despite meticulous synchronization and calibration, systematic misalignment persists in LiDAR-projected depthmaps, due to the physical baseline distance between the two sensors. The artifact often manifests as background LiDAR points incorrectly projected onto foreground objects, such as cars and pedestrians. The KITTI dataset uses stereo cameras as a heuristic solution to remove these artifacts. However, most AV datasets, including nuScenes, Waymo, and DDAD, lack stereo images, making the KITTI solution inapplicable. We propose RePLAy, a parameter-free analytical solution to remove the projective artifacts. We construct a binocular vision system between a hypothesized virtual LiDAR camera and the RGB camera. We then remove the projective artifacts by determining epipolar occlusion with the proposed analytical solution. We show consistent improvements in state-of-the-art (SoTA) monocular depth estimators and 3D object detectors when using the artifact-free depthmaps.
Shengjie Zhu, Girish Chandar Ganesan, Abhinav Kumar, Xiaoming Liu
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention
Abstract
Understanding road geometry is a critical component of the autonomous vehicle (AV) stack. While high-definition (HD) maps can readily provide such information, they suffer from high labeling and maintenance costs. Accordingly, many recent works have proposed methods for estimating HD maps online from sensor data. The vast majority of recent approaches encode multi-camera observations into an intermediate representation, e.g., a bird’s eye view (BEV) grid, and produce vector map elements via a decoder. While this architecture is performant, it discards much of the information encoded in the intermediate representation, preventing downstream tasks (e.g., behavior prediction) from leveraging it. In this work, we propose exposing the rich internal features of online map estimation methods and show how they enable more tightly integrating online mapping with trajectory forecasting (code: https://github.com/alfredgu001324/MapBEVPrediction). In doing so, we find that directly accessing internal BEV features yields up to 73% faster inference and up to 29% more accurate predictions on the real-world nuScenes dataset.
Xunjiang Gu, Guanyu Song, Igor Gilitschenski, Marco Pavone, Boris Ivanovic
High-Fidelity Modeling of Generalizable Wrinkle Deformation
Abstract
This paper proposes a generalizable model to synthesize high-fidelity clothing wrinkle deformation in 3D by learning from real data. Given the complex deformation behaviors of real-world clothing, this task presents significant challenges, primarily due to the lack of accurate ground-truth data: obtaining high-fidelity 3D deformations requires special equipment such as a multi-camera system, which is not easily scalable. To address this challenge, we decompose the clothing into a base surface and fine wrinkles, and introduce a new method that can generate wrinkles as high-frequency 3D displacements from coarse clothing deformation. Our method is conditioned on a Green-Lagrange strain field, a local rotation-invariant measurement that is independent of body and clothing topology, enhancing its generalizability. Using limited real data (e.g., 3K garment meshes), we train a diffusion model that can generate high-fidelity wrinkles from a coarse clothing mesh, conditioned on its strain field. In practice, we obtain the coarse clothing mesh using a body-conditioned VAE, ensuring compatibility of the deformation with the body pose. In our experiments, we demonstrate that our generative wrinkle model outperforms existing methods by synthesizing high-fidelity wrinkle deformation from novel body poses and clothing while preserving quality comparable to that of the training data.
Jingfan Guo, Jae Shin Yoon, Shunsuke Saito, Takaaki Shiratori, Hyun Soo Park
Instruction Tuning-Free Visual Token Complement for Multimodal LLMs
Abstract
As the open community of large language models (LLMs) matures, multimodal LLMs (MLLMs) have promised an elegant bridge between vision and language. However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To this end, we propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features and thus improve response accuracy. Specifically, our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens to enrich the original visual input. Moreover, an iterative strategy is further designed to extract more visual information by iteratively using the visual selector without any additional training. Notably, the training pipeline requires no additional image-text pairs, resulting in a desired instruction tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.
Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, Hanwang Zhang
Backmatter
Metadata
Title
Computer Vision – ECCV 2024
Edited by
Aleš Leonardis
Elisa Ricci
Stefan Roth
Olga Russakovsky
Torsten Sattler
Gül Varol
Copyright year
2025
Electronic ISBN
978-3-031-73004-7
Print ISBN
978-3-031-73003-0
DOI
https://doi.org/10.1007/978-3-031-73004-7