Last updated on June 24, 2025. This conference program is tentative and subject to change
Technical Program for Monday June 23, 2025

MoA1 Regular Session, Plenary Room
Oral 1
Chair: López, Antonio M. | Universitat Autònoma De Barcelona |
Co-Chair: Nashashibi, Fawzi | INRIA |

09:15-09:33, Paper MoA1.1
UDA4Inst: Unsupervised Domain Adaptation for Instance Segmentation
Guo, Yachan | Universitat Autònoma De Barcelona |
Xiao, Yi | Computer Vision Center, Universitat Autònoma De Barcelona |
Xue, Danna | Computer Vision Center, Universitat Autònoma Barcelona |
Gomez Zurita, Jose Luis | Computer Vision Center (CVC) |
López, Antonio M. | Universitat Autònoma De Barcelona |
Keywords: Techniques for Dataset Domain Adaptation, Instance and Panoptic Segmentation Techniques, Data Annotation and Labeling Techniques
Abstract: Instance segmentation is crucial for autonomous driving but is hindered by the lack of annotated real-world data due to expensive labeling costs. Unsupervised Domain Adaptation (UDA) offers a solution by transferring knowledge from labeled synthetic data to unlabeled real-world data. While UDA methods for synthetic to real-world domains (synth-to-real) excel in tasks such as semantic segmentation and object detection, their application to instance segmentation for autonomous driving remains underexplored and often relies on suboptimal baselines. We introduce UDA4Inst, a powerful framework for synth-to-real UDA in instance segmentation. Our framework enhances instance segmentation through Semantic Category Training and Bidirectional Mixing Training. Semantic Category Training groups semantically related classes for separate training, improving pseudo-label quality and segmentation accuracy. Bidirectional Mixing Training combines instance-wise and patch-wise data mixing, creating realistic composites that enhance generalization across domains. Extensive experiments show UDA4Inst sets a new state-of-the-art on the SYNTHIA→Cityscapes benchmark (mAP 31.3) and introduces results on novel datasets, using UrbanSyn and Synscapes as sources and Cityscapes and KITTI360 as targets. Code and models are available at https://github.com/gyc-code/UDA4Inst.
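The Bidirectional Mixing Training described above combines instance-wise and patch-wise data mixing; the instance-wise half amounts to copy-pasting selected source instances onto a target image. A minimal sketch of that idea (the function name and interface are illustrative assumptions, not the released UDA4Inst API):

```python
import numpy as np

def instance_mix(src_img, src_inst_mask, tgt_img, instance_ids):
    """Paste the selected source instances onto the target image.

    src_inst_mask holds one integer instance id per pixel; pixels whose
    id is in `instance_ids` are copied from the source onto the target,
    producing a cross-domain composite.
    """
    assert src_img.shape == tgt_img.shape
    mixed = tgt_img.copy()
    sel = np.isin(src_inst_mask, instance_ids)  # boolean pixel mask
    mixed[sel] = src_img[sel]
    return mixed
```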

09:33-09:51, Paper MoA1.2
Adaptive Semantic Segmentation of Traffic Scenes Via Frequency Domain Analysis
Zhang, Tengwen | Xi'an Jiaotong University |
Li, Yaochen | Xi'an Jiaotong University |
Zou, Runlin | Xi'an Jiaotong University |
Gao, Yuan | Xi'an Jiaotong University |
Qiu, Chao | Xi'an Jiaotong University |
Ni, Hong | Xi'an Jiaotong University |
He, Ziyuan | Xi'an Jiaotong University |
Keywords: Techniques for Dataset Domain Adaptation
Abstract: High-precision semantic segmentation is an important research topic in the communities of computer vision and intelligent transportation. The existing unsupervised domain adaptation methods based on image translation often lead to artifacts and structural distortions. To overcome this problem, a novel adaptive semantic segmentation method for traffic scenes via frequency domain analysis is proposed. Firstly, we leverage the frequency domain space to decouple style and semantic features. The Fast Fourier Transform is applied to achieve structure-preserving style alignment. Subsequently, a content enhancement module is proposed based on the Wavelet transform, which utilizes the original source images to correct and enhance high-frequency structural and semantic details. Furthermore, a convolutional enhancement attention module is proposed, which utilizes depthwise separable convolution to capture more local details. Experiments on the GTA5→Cityscapes and SYNTHIA→Cityscapes tasks attain state-of-the-art mIoU scores of 76.4 and 67.7, respectively, convincingly demonstrating the effectiveness of the method.
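The structure-preserving style alignment via the Fast Fourier Transform can be pictured as swapping the low-frequency amplitude spectrum (style) while keeping the phase (structure), in the spirit of Fourier-based domain adaptation. This is a sketch of the general technique, not the paper's exact module; the swapped band width `beta` is a hypothetical hyperparameter:

```python
import numpy as np

def fft_style_align(source, target, beta=0.1):
    """Transfer the low-frequency amplitude (style) of `target` onto
    `source` while keeping the source phase (structure/content)."""
    fs = np.fft.fft2(source, axes=(0, 1))
    ft = np.fft.fft2(target, axes=(0, 1))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)
    # Centre the spectra so low frequencies sit in the middle.
    amp_s = np.fft.fftshift(amp_s, axes=(0, 1))
    amp_t = np.fft.fftshift(amp_t, axes=(0, 1))
    h, w = source.shape[:2]
    bh, bw = int(h * beta), int(w * beta)
    ch, cw = h // 2, w // 2
    # Replace only the low-frequency band of the source amplitude.
    amp_s[ch - bh:ch + bh, cw - bw:cw + bw] = \
        amp_t[ch - bh:ch + bh, cw - bw:cw + bw]
    amp_s = np.fft.ifftshift(amp_s, axes=(0, 1))
    out = np.fft.ifft2(amp_s * np.exp(1j * pha_s), axes=(0, 1))
    return np.real(out)

src = np.random.rand(64, 64)
tgt = np.random.rand(64, 64)
stylised = fft_style_align(src, tgt)
print(stylised.shape)  # (64, 64)
```

With `beta=0` no amplitude is swapped and the input is reconstructed unchanged, which makes the structure-preserving property easy to verify.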

09:51-10:09, Paper MoA1.3
Map-Free Trajectory Prediction Via Deformable Attention in Bird’s-Eye View Space
Kong, Minsang | Kookmin University |
Kim, Myeong jun | Kookmin University |
Sung, Jinwook | Kookmin University |
Kang, Sang Gu | Kookmin University |
Park, Kyu min | Kookmin University |
Park, Minseo | Kookmin University |
Jeong, Dahun | Kookmin University |
Lee, Sang Hun | Kookmin University |
Keywords: Advanced Multisensory Data Fusion Algorithms, Motion Forecasting, Deep Learning Based Approaches
Abstract: In autonomous driving, trajectory prediction is crucial for safe navigation. While many recent methods rely on pre-built high definition (HD) maps, these are limited to specific regions and cannot reflect real-time changes. We propose a novel framework that constructs bird's-eye view representations from real-time sensor data and selectively extracts critical features using deformable attention, eliminating the need for HD maps. We also introduce a sparse goal candidate proposal module for fully end-to-end prediction without post-processing. Experiments demonstrate that our model achieves competitive performance compared to HD map-based methods.

10:09-10:27, Paper MoA1.4
TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge for Interpretability and Kinematic Feasibility
Abouelazm, Ahmed | FZI Research Center for Information Technology |
Baden, Marius | Karlsruhe Institute of Technology |
Hubschneider, Christian | FZI Research Center for Information Technology |
Wu, Yin | Karlsruhe Institute of Technology |
Slieter, Daniel | CARIAD SE |
Zöllner, J. Marius | FZI Research Center for Information Technology; KIT Karlsruhe In |
Keywords: Predictive Trajectory Models and Motion Forecasting, Motion Forecasting, Trust and Acceptance of Autonomous Technologies
Abstract: Trajectory prediction is crucial for autonomous driving, enabling vehicles to navigate safely by anticipating the movements of surrounding road users. However, current deep learning models often lack trustworthiness as their predictions can be physically infeasible and illogical to humans. To make predictions more trustworthy, recent research has incorporated prior knowledge, like the social force model for modeling interactions and kinematic models for physical realism. However, these approaches focus on priors that suit either vehicles or pedestrians and do not generalize to traffic with mixed agent classes. We propose incorporating interaction and kinematic priors of all agent classes (vehicles, pedestrians, and cyclists) with class-specific interaction layers to capture agent behavioral differences. To improve the interpretability of the agent interactions, we introduce DG-SFM, a rule-based interaction importance score that guides the interaction layer. To ensure physically feasible predictions, we propose suitable kinematic models for all agent classes with a novel pedestrian kinematic model. We benchmark our approach on the Argoverse 2 dataset, using the state-of-the-art transformer HPTR as our baseline. Experiments demonstrate that our method improves interaction interpretability, revealing a correlation between incorrect predictions and divergence from our interaction prior. Even though incorporating the kinematic models causes a slight decrease in accuracy, they eliminate infeasible trajectories found in the dataset and the baseline model. Thus, our approach fosters trust in trajectory prediction as its interaction reasoning is interpretable, and its predictions adhere to physics.

10:27-10:45, Paper MoA1.5
HotShot: A Loss-Guided Data Augmentation and Curriculum Learning Technique for the Task of Semantic Segmentation
Frickenstein, Lukas | BMW AG |
Thoma, Moritz | BMW AG |
Mori, Pierpaolo | BMW AG |
Balamuthu Sampath, Shambhavi | BMW AG |
Fasfous, Nael | BMW AG |
Vemparala, Manoj Rohit | BMW AG |
Frickenstein, Alexander | BMW AG |
Unger, Christian | BMW Group |
Passerone, Claudio | Dipartimento Di Elettronica E Telecomunicazioni Politecnico Di |
Stechele, Walter | Technical University of Munich (TUM) |
Keywords: Data Augmentation Techniques Using Neural Networks, Semantic Segmentation Techniques, Deep Learning Based Approaches
Abstract: Semantic segmentation is an important computer vision task that requires costly pixel-level annotations to train deep neural networks (DNNs). Especially for applications like autonomous driving, precise pixel-level understanding of scenes is a decisive factor between success and failure of the application. It follows that every labeled sample of an existing dataset is highly valuable and should be optimally used during training to maximize its value. This is achieved using (1) augmentation of the same labeled sample to help the model learn it in different ways, and (2) curriculum learning to introduce training samples to the model in a strategic order to ease the learning process. In this work, we present HotShot, a loss-guided cropping technique that assesses the DNN’s prediction capability during training to derive probability scores of potential cropping regions. This effectively combines augmentation and curriculum learning in one technique, where a single sample is cropped (augmentation) in regions selected based on the DNN’s loss throughout the training (curriculum learning). For UperNet using a ConvNeXt-tiny backbone and DeepLabV3+ architecture using a ResNet-50 backbone, applying HotShot provides a +0.41 p.p. and +0.43 p.p. mIoU improvement over randomly cropping regions on the Cityscapes and BDD100K datasets, respectively. More interestingly, the analysis shows HotShot primarily boosts the classes that are most challenging for the model. For example, the rider and motorcycle classes on the BDD100K dataset improve by 163% and 129% using DeepLabV3+ with a ResNet-50 backbone. HotShot achieves improved mIoU in almost all cases and normalizes imbalances in learning challenging classes in datasets.
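HotShot's loss-guided cropping is described as deriving probability scores for candidate crop regions from the DNN's loss during training. A generic sketch of that idea, assuming a per-pixel loss map is available; the integral-image scoring and the `temperature` knob are illustrative, not the paper's exact procedure:

```python
import numpy as np

def loss_guided_crop(loss_map, crop, rng=None, temperature=1.0):
    """Sample a crop window with probability proportional to the mean
    per-pixel loss inside it, so high-loss (hard) regions are cropped
    more often as training progresses."""
    rng = rng or np.random.default_rng()
    h, w = loss_map.shape
    ch, cw = crop
    # Window sums for every valid top-left corner via an integral image.
    ii = np.pad(loss_map, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    sums = ii[ch:, cw:] - ii[:-ch, cw:] - ii[ch:, :-cw] + ii[:-ch, :-cw]
    scores = (sums / (ch * cw)) ** (1.0 / temperature)
    probs = scores.ravel() / scores.sum()
    idx = rng.choice(probs.size, p=probs)
    y, x = np.unravel_index(idx, sums.shape)
    return y, x  # top-left corner of the sampled crop
```

Windows with zero accumulated loss receive zero probability, so sampling concentrates on regions the model currently gets wrong, which is the curriculum aspect of the technique.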

MoBT1 Poster Session, Caravaggio Room
Poster 1.1 >> Planning, Trajectory Prediction & Motion Forecasting
Chair: Betz, Johannes | Technical University of Munich |
Co-Chair: Malis, Ezio | INRIA |

11:15-12:30, Paper MoBT1.1
Hybrid Machine Learning Model with a Constrained Action Space for Trajectory Prediction
Fertig, Alexander | Technische Hochschule Ingolstadt |
Balasubramanian, Lakshman | MoiiAi |
Botsch, Michael | Technische Hochschule Ingolstadt |
Keywords: End-to-End Neural Network Architectures and Techniques, Predictive Trajectory Models and Motion Forecasting, Safety Verification and Validation Techniques
Abstract: Trajectory prediction is crucial for advancing autonomous driving and improving safety and efficiency. Although end-to-end models based on deep learning have great potential, they often do not consider vehicle dynamic limitations, leading to unrealistic predictions. To address this problem, this work introduces a novel hybrid model that combines deep learning with a kinematic motion model. It is able to predict object attributes such as acceleration and yaw rate and generate trajectories based on them. A key contribution is the incorporation of expert knowledge into the learning objective of the deep learning model. This results in the constraint of the available action space, thus enabling the prediction of physically feasible object attributes and trajectories, thereby increasing safety and robustness. The proposed hybrid model facilitates enhanced interpretability, thereby reinforcing the trustworthiness of deep learning methods and promoting the development of safe planning solutions. Experiments conducted on the publicly available real-world Argoverse dataset demonstrate realistic driving behaviour, with benchmark comparisons and ablation studies showing promising results.
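The hybrid model predicts object attributes such as acceleration and yaw rate and turns them into trajectories through a kinematic motion model with a constrained action space. A minimal sketch of that rollout, assuming a unicycle model and illustrative actuation bounds (`a_max`, `yaw_rate_max` are assumed values, not the paper's):

```python
import numpy as np

def rollout_kinematic(state, actions, dt=0.1,
                      a_max=8.0, yaw_rate_max=0.6):
    """Integrate predicted (acceleration, yaw rate) pairs with a simple
    unicycle model, clipping each action to physically feasible bounds
    so the resulting trajectory cannot violate vehicle dynamics."""
    x, y, yaw, v = state
    traj = []
    for a, yr in actions:
        a = np.clip(a, -a_max, a_max)          # constrain the action space
        yr = np.clip(yr, -yaw_rate_max, yaw_rate_max)
        v = max(v + a * dt, 0.0)               # no reversing in this sketch
        yaw += yr * dt
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        traj.append((x, y))
    return np.array(traj)
```

Because the network's raw outputs are clipped before integration, every emitted trajectory is feasible by construction, which is the core of the hybrid approach.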

11:15-12:30, Paper MoBT1.2
PPP: Planning with Path-Informed Prediction for Autonomous Driving
Xi, Ning | Wuhan University of Technology |
Chu, Duanfeng | Wuhan University of Technology |
Deng, Zejian | University of Waterloo |
Cao, Yongxing | Wuhan University |
Feng, Feng | Wuhan University of Technology |
Huang, Yanjun | Tongji University |
Wang, Jinxiang | Southeast University |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Predictive Trajectory Models and Motion Forecasting, Deep Learning Based Approaches
Abstract: With the rapid advancement of end-to-end autonomous driving, the integration of prediction and planning has increasingly become a research focus in the field of autonomous driving. However, most existing methods do not adequately consider the robustness of driving trajectories during trajectory generation, making them less effective in handling complex driving scenarios. To address this issue, this paper introduces Planning with Path-Informed Prediction for Autonomous Driving (PPP), which constructs a prediction-decision module that fuses multi-dimensional information by integrating the ego vehicle's potential multimodal future paths with environmental features. Moreover, we introduce a multi-stage trajectory evaluation mechanism during the trajectory generation process, which significantly enhances the system's performance in dynamic environments, thereby achieving improvements in both accuracy and robustness in complex driving scenarios. Through experiments on the nuPlan dataset, our method demonstrates exceptional competitiveness in closed-loop tests. Notably, in complex scenario tests, PPP outperforms learning-based and hybrid methods. Code will be available at https://github.com/Keria0812/PPP.

11:15-12:30, Paper MoBT1.3
Dynamic Intent Queries for Motion Transformer-Based Trajectory Prediction
Demmler, Tobias | Robert Bosch GmbH |
Hartung, Lennart | Robert Bosch GmbH |
Tamke, Andreas | Bosch |
Dang, Thao | University of Applied Sciences, Esslingen |
Hegai, Alexander | Robert Bosch GmbH |
Haug, Karsten | Robert Bosch GmbH |
Mikelsons, Lars | Augsburg University |
Keywords: Motion Forecasting, Predictive Trajectory Models and Motion Forecasting
Abstract: In autonomous driving, accurately predicting the movements of other traffic participants is crucial, as it significantly influences a vehicle's planning processes. Modern trajectory prediction models strive to interpret complex patterns and dependencies from agent and map data. The Motion Transformer (MTR) architecture and subsequent work define the most accurate methods in common benchmarks such as the Waymo Open Motion Benchmark. The MTR model employs pre-generated static intention points as initial goal points for trajectory prediction. However, the static nature of these points frequently leads to misalignment with map data in specific traffic scenarios, resulting in unfeasible or unrealistic goal points. Our research addresses this limitation by integrating scene-specific dynamic intention points into the MTR model. This adaptation of the MTR model was trained and evaluated on the Waymo Open Motion Dataset. Our findings demonstrate that incorporating dynamic intention points has a significant positive impact on trajectory prediction accuracy, especially for predictions over long time horizons. Furthermore, we analyze the impact on ground truth trajectories which are not compliant with the map data or are illegal maneuvers.

11:15-12:30, Paper MoBT1.4
Negotiating Cooperative Ordering Problems with Bimodal Planning
Wenzel, Raphael | HRI Europe GmbH; TU Darmstadt |
Probst, Malte | Honda Research Institute Europe |
Puphal, Tim | Honda Research Institute Europe GmbH |
Amann, Markus | Honda Research Institute Europe GmbH |
Eggert, Julian | Honda Research Institute Europe GmbH |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Level 4-5 Autonomous Driving Systems Architecture, Multi-Objective Planning Approaches
Abstract: In Automated Driving (AD), traffic scenarios where two agents must resolve an ordering without knowing each other's intention are critical for expanding the operational design domain of automated vehicles to urban environments. These scenarios require negotiation to determine who passes first through an interaction zone. We present a novel agreement measure and negotiation approach to resolve these ordering problems across a wide range of common scenarios. Our method emphasizes detecting and deciding when to switch between potential negotiation outcomes. Our approach extends existing behavior planners to cope with bimodal cooperative interactions, where two potentially desirable outcomes need to be considered. We evaluate our approach by providing both an illustrative scenario and extensive statistical experiments across various geometries, including oncoming narrow passages, crossing and merging scenarios. The results demonstrate that our system considerably improves the behavior in cooperative ordering scenarios compared to the baseline. Furthermore, it is also robust in the sense that it effectively handles dynamic situations where the other agent's intentions change during the negotiation process.

11:15-12:30, Paper MoBT1.5
Efficient Data Representation for Motion Forecasting: A Scene-Specific Trajectory Set Approach
Vivekanandan, Abhishek | FZI Research Center for Information Technology; KIT Karlsruhe In |
Zöllner, J. Marius | FZI Research Center for Information Technology; KIT Karlsruhe In |
Keywords: Profile Extraction and Discovery from Datasets, Techniques for Dataset Domain Adaptation, Integration Methods for HD Maps and Onboard Sensors
Abstract: Representing diverse and plausible future trajectories is critical for motion forecasting in autonomous driving. However, efficiently capturing these trajectories in a compact set remains challenging. This study introduces a novel approach for generating scene-specific trajectory sets tailored to different contexts, such as intersections and straight roads, by leveraging map information and actor dynamics. A deterministic goal sampling algorithm identifies relevant map regions, while our Recursive In-Distribution Subsampling (RIDS) method enhances trajectory plausibility by condensing redundant representations. Experiments on the Argoverse 2 dataset demonstrate that our method achieves up to a 45% improvement in Driving Area Compliance (DAC) compared to baseline methods while maintaining competitive displacement errors. Our work highlights the benefits of mining such scene-aware trajectory sets and how they could capture the complex and heterogeneous nature of actor behavior in real-world driving scenarios.
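The paper's Recursive In-Distribution Subsampling (RIDS) condenses redundant trajectory representations; the abstract does not give its exact procedure, but the general idea of condensing a trajectory set can be sketched with a greedy nearest-neighbour subsampling. This is entirely illustrative and not the authors' algorithm:

```python
import numpy as np

def condense_trajectory_set(trajs, k):
    """Greedily drop the most redundant trajectory (the one closest to
    its nearest neighbour) until only k representatives remain.

    trajs: array of shape (N, T, 2) — N candidate trajectories of T
    waypoints each; returns a (k, T, 2) condensed set.
    """
    keep = list(range(len(trajs)))
    flat = trajs.reshape(len(trajs), -1)
    while len(keep) > k:
        sub = flat[keep]
        d = np.linalg.norm(sub[:, None] - sub[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nearest = d.min(axis=1)
        # The trajectory whose nearest neighbour is closest contributes
        # the least diversity, so it is removed first.
        keep.pop(int(np.argmin(nearest)))
    return trajs[keep]
```

Near-duplicate trajectories are eliminated first, so the surviving set covers the behavior space with fewer, more distinct candidates.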

11:15-12:30, Paper MoBT1.6
Reliability Comparison of Vessel Trajectory Prediction Models Via Probability of Detection
Rastin, Zahra | Chair of Dynamics and Control, University of Duisburg-Essen, Dui |
Donandt, Kathrin | University of Duisburg-Essen |
Soeffker, Dirk | University of Duisburg-Essen |
Keywords: Predictive Trajectory Models and Motion Forecasting, Safety Verification and Validation Techniques, Deep Learning Based Approaches
Abstract: This contribution addresses vessel trajectory prediction (VTP), focusing on the evaluation of different deep learning-based approaches. The objective is to assess model performance in diverse traffic complexities and compare the reliability of the approaches. While previous VTP models overlook the specific traffic situation complexity and lack reliability assessments, this research uses a probability of detection analysis to quantify model reliability in varying traffic scenarios, thus going beyond common error distribution analyses. All models are evaluated on test samples categorized according to their traffic situation during the prediction horizon, with performance metrics and reliability estimates obtained for each category. The results of this comprehensive evaluation provide a deeper understanding of the strengths and weaknesses of the different prediction approaches, along with their reliability in terms of the prediction horizon lengths for which safe forecasts can be guaranteed. These findings can inform the development of more reliable vessel trajectory prediction approaches, enhancing safety and efficiency in future inland waterway navigation.

11:15-12:30, Paper MoBT1.7
Toward Unified Practices in Trajectory Prediction Research on Bird's-Eye-View Datasets
Westny, Theodor | Linköping University |
Olofsson, Björn | Linköping University |
Frisk, Erik | Linköping University |
Keywords: Motion Forecasting, Predictive Trajectory Models and Motion Forecasting, UAV Datasets
Abstract: The availability of high-quality datasets is crucial for developing behavior prediction algorithms in autonomous vehicles. This paper highlights the need to standardize the use of certain datasets for motion forecasting research to simplify comparative analysis and proposes a set of tools and practices to achieve this. Drawing on extensive experience and a comprehensive review of current literature, we summarize our proposals for preprocessing, visualization, and evaluation in the form of an open-sourced toolbox designed for researchers working on trajectory prediction problems. The clear specification of necessary preprocessing steps and evaluation metrics is intended to alleviate development efforts and facilitate the comparison of results across different studies. The toolbox is available at: https://github.com/westny/dronalize.

11:15-12:30, Paper MoBT1.8
A Generalized Waypoint Loss for End-To-End Autonomous Driving (I)
Stelzer, Malte | Technische Universität Braunschweig |
Bartels, Timo | Technische Universität Braunschweig |
Bickerdt, Jan | Volkswagen AG |
Schomerus, Volker Patricio | Volkswagen AG |
Piewek, Jan | Volkswagen AG |
Bagdonat, Thorsten | Volkswagen AG |
Fingscheidt, Tim | Technische Universität Braunschweig |
Keywords: End-to-End Neural Network Architectures and Techniques, Deep Learning Based Approaches, Level 4-5 Autonomous Driving Systems Architecture
Abstract: Many approaches in autonomous driving generate future waypoints to form trajectories, which are then used to derive driving commands. During imitation learning for end-to-end autonomous driving, these trajectories are typically learned using a straightforward L1 loss, which compares the model's predictions to those of an expert. In this paper, we propose a separation of longitudinal and lateral components of the L1 loss that weights these independently, thereby aligning more closely with the separate handling of longitudinal and lateral control by PID controllers in the model pipeline. We employ this novel generalized waypoint loss with the TransFuser architecture in the CARLA simulator and show that we can control and improve on certain infraction types, without a performance loss in any other metric. Additionally, we investigate a novel ensemble technique that produces a more cautious ensemble, reducing infractions while maintaining overall performance. For future work, our novel loss formulation enables the definition of a time-variant loss tailored to specific traffic scenarios in the training data.
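The proposed loss separates the longitudinal and lateral components of the L1 waypoint error and weights them independently. A sketch of that decomposition (in NumPy rather than a deep learning framework; the ego heading is assumed to define the longitudinal axis):

```python
import numpy as np

def generalized_waypoint_loss(pred, target, heading,
                              w_long=1.0, w_lat=1.0):
    """L1 waypoint loss split into longitudinal and lateral components
    w.r.t. the ego heading, each weighted independently.

    pred, target: (N, 2) arrays of waypoints in the ego frame.
    With w_long == w_lat == 1 this reduces to the plain per-axis L1 loss.
    """
    err = pred - target                                   # (N, 2) errors
    lon_dir = np.array([np.cos(heading), np.sin(heading)])
    lat_dir = np.array([-np.sin(heading), np.cos(heading)])
    e_long = err @ lon_dir                                # along heading
    e_lat = err @ lat_dir                                 # across heading
    return w_long * np.abs(e_long).mean() + w_lat * np.abs(e_lat).mean()
```

Raising `w_lat`, for instance, penalizes lane-deviation errors more than speed-tracking errors, which is how the loss can target specific infraction types.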

11:15-12:30, Paper MoBT1.9
Time-Efficient Dynamic Urban Global Planner (I)
Arquero, Juan | Universidad Politécnica De Madrid |
Naranjo, Jose | Universidad Politecnica De Madrid |
Molinos, Eduardo | Karlsruher Institut Für Technologie |
Milanés, Vicente | Renault |
Valle, Alfredo | Universidad Politécnica De Madrid |
Jiménez, Felipe | Universidad Politécnica De Madrid |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Level 3 Driving Systems Architecture and Techniques, User Experience in Autonomous Vehicles
Abstract: This paper presents a global route planning algorithm designed as part of the navigation module embedded in an Autonomous Driving System (ADS). Unlike trajectory planners, which focus on local maneuvering and vehicle control, this algorithm determines optimal routes at a higher level, prioritizing dynamic adaptation to traffic conditions and regulatory elements. The planner aims to minimize travel time rather than merely reducing the total distance traveled, making it particularly effective in urban environments where traffic signals, vehicle interactions, and road regulations significantly impact journey duration. To achieve this, the algorithm dynamically adjusts to real-time variations in traffic flow and control measures. Additionally, it integrates risk-aware routing by imposing penalties on roads with higher pedestrian interaction, enhancing safety and increasing public acceptance of ADS technology. Designed for efficiency and scalability, the algorithm is lightweight enough to run on microcontroller-based embedded systems, ensuring feasibility for real-world deployment in constrained computing environments. The algorithm was tested using a Renault mass-production car, demonstrating its applicability in real-world driving scenarios.

11:15-12:30, Paper MoBT1.10
GripMap: An Efficient, Spatially Resolved Constraint Framework for Offline and Online Trajectory Planning in Autonomous Racing
Werner, Frederik | Technische Universität München |
Schwehn, Ann-Kathrin | Technical University of Munich |
Lienkamp, Markus | Technische Universität München |
Betz, Johannes | Technical University of Munich |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Adaptive Vehicle Control Techniques, Multi-Objective Planning Approaches
Abstract: Conventional trajectory planning approaches for autonomous vehicles often assume a fixed vehicle model that remains constant regardless of the vehicle's location. This overlooks the critical fact that the tires and the surface are the two force-transmitting partners in vehicle dynamics; while the tires stay with the vehicle, surface conditions vary with location. Recognizing these challenges, this paper presents a novel framework for spatially resolving dynamic constraints in both offline and online planning algorithms applied to autonomous racing. We introduce the GripMap concept, which provides a spatial resolution of vehicle dynamic constraints in the Frenet frame, allowing adaptation to locally varying grip conditions. This enables compensation for location-specific effects, more efficient vehicle behavior, and increased safety, unattainable with spatially invariant vehicle models. The focus is on low storage demand and quick access through perfect hashing. This framework proved advantageous in real-world applications in the presented form. Experiments inspired by autonomous racing demonstrate its effectiveness. In future work, this framework can serve as a foundational layer for developing future interpretable learning algorithms that adjust to varying grip conditions in real-time.

11:15-12:30, Paper MoBT1.11
A Glimpse into the Future: An Inverse Soft Q-Learning’s Soft Actor-Critic Approach for Pedestrian Path Prediction
Dietl, Laura | Technische Hochschule Ingolstadt |
Facchi, Christian | Technische Hochschule Ingolstadt |
Keywords: Predictive Trajectory Models and Motion Forecasting, Reinforcement Learning for Planning, Deep Learning Based Approaches
Abstract: Anticipating the future trajectories of pedestrians is an essential ability in autonomous vehicles to perform proactive actions and thus reduce dangerous encounters. However, predicting human motion is a task that is inherently challenging due to the influence of social and environmental factors, and the multimodality of future predictions based solely on partial history of their trajectory. The underlying scene and the past trajectory of a pedestrian provide useful indicators for predicting their future steps. Unlike other approaches that utilize supervised learning or generative modeling, Inverse Reinforcement Learning enables a model to learn the pedestrian's reward function that encodes their intentions. This work proposes a framework based on Inverse soft Q-Learning's Soft Actor-Critic Version. The framework utilizes the information about the scene and the past trajectory of a pedestrian together with an attention mechanism to learn the pedestrian's behavior policy. Quantitative and qualitative evaluation on existing pedestrian trajectory prediction benchmarks shows comparable results to state-of-the-art baselines.

11:15-12:30, Paper MoBT1.12
Multi-Rules Reachability Analysis for Road Agents Using Graph-Based Maps and Real-Time Kinematics
Fossati, Monica | Inria |
Malis, Ezio | INRIA |
Martinet, Philippe | INRIA |
Keywords: Integration Methods for HD Maps and Onboard Sensors, Decision Making
Abstract: Automated vehicles perform well in simple environments with clear rules, but urban traffic presents significant challenges due to the unpredictable behavior of road users, sometimes beyond traffic rules. Achieving full autonomy in such settings requires a systematic approach to modeling the possible actions of all agents. This paper presents a multi-rules reachability analysis framework that integrates graph-based maps with real-time perception data to dynamically characterize the surrounding space. By leveraging the semantic richness and modularity of Lanelet2 maps, our method provides a structured representation that enhances situational awareness. This allows for the extraction of navigation-relevant information, with the goal of supporting safer and more efficient decision-making in complex urban environments.

11:15-12:30, Paper MoBT1.13
Sampling-Based Motion Planning with Preordered Objectives
Halder, Patrick | ZF Friedrichshafen AG |
Althoff, Matthias | Technische Universität München |
Keywords: Multi-Objective Planning Approaches, Motion Planning Algorithms for Autonomous Vehicles, Decision Making
Abstract: Motion planning for cyber-physical systems requires addressing numerous system objectives and constraints, including satisfying physical limitations, ensuring safety, reaching goal areas, or reducing energy consumption. Typically, it is only possible to achieve some of the objectives simultaneously since they may conflict. The objectives are usually weighted to specify which plans are preferred in such situations, resulting in a cumbersome tuning process. In this work, we use a weight-free prioritization of the objectives through preorders and introduce a novel sampling-based motion planner designed to efficiently generate trajectories optimizing preordered objectives. We ensure that only the smallest number of required objectives is evaluated to reduce computational time. Our approach can holistically define and solve many types of multi-objective optimization problems, and its usefulness is demonstrated for a Mars rover and an autonomous vehicle.
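Weight-free prioritization through preorders can be illustrated by lazy lexicographic comparison: a higher-priority objective decides the outcome before any lower-priority one is evaluated, mirroring the paper's goal of evaluating only the smallest number of required objectives. The helper names are illustrative; the actual planner is sampling-based and more involved:

```python
import functools

def compare_preordered(a, b, objectives):
    """Compare two candidates under a strict priority order of
    objectives (lower cost is better). A higher-priority objective
    decides the comparison before any lower-priority one is computed."""
    for objective in objectives:          # highest priority first
        ca, cb = objective(a), objective(b)
        if ca != cb:
            return -1 if ca < cb else 1
    return 0                              # indifferent under all objectives

def best_candidate(candidates, objectives):
    """Pick the preferred candidate under the preordered objectives."""
    key = functools.cmp_to_key(
        lambda a, b: compare_preordered(a, b, objectives))
    return min(candidates, key=key)
```

Because the comparison returns as soon as one objective differs, a cheap safety objective placed first can settle most comparisons without ever evaluating an expensive comfort or energy objective.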

11:15-12:30, Paper MoBT1.14
Dynamic Objective MPC for Motion Planning of Seamless Docking Maneuvers
Schumann, Oliver | Ulm University |
Buchholz, Michael | Universität Ulm |
Dietmayer, Klaus | University of Ulm |
Keywords: Multi-Objective Planning Approaches, Motion Planning Algorithms for Autonomous Vehicles
Abstract: Automated vehicles and logistics robots used in warehouses and similar environments must often position themselves in narrow environments with high precision in front of a specific target, such as a package or their charging station. Often, these docking scenarios are solved in two steps: path following and rough positioning followed by a high-precision motion planning algorithm. This can generate suboptimal trajectories caused by bad positioning in the first phase and, therefore, prolong the time it takes to reach the goal. In this work, we propose a unified approach, which is based on a Model Predictive Control (MPC) that unifies the advantages of Model Predictive Contouring Control (MPCC) with a Cartesian MPC to reach a specific goal pose. This paper's main contributions are the handling of very narrow scenarios, the adaptation of the dynamic weight allocation method to reach path ends and goal poses inside driving corridors, and the development of the so-called dynamic objective MPC. The latter is an improvement of the dynamic weight allocation method, which can inherently switch, depending on the state, from an MPCC to a Cartesian MPC to solve the path-following problem and the high-precision positioning tasks independently of the location of the goal pose seamlessly with one algorithm. This leads to foresighted, feasible, and safe motion plans, which can decrease the mission time and result in smoother trajectories.
|
|
11:15-12:30, Paper MoBT1.15 | Add to My Program |
Safety Trajectory Planning for Autonomous Vehicles in Unstructured Narrow Environments: A Perception Error Compatible Approach |
|
Li, Zhaopeng | Beijing Institute of Technology |
Guo, Zijun | Beijing Institute of Technology |
Yu, Huilong | Beijing Institute of Technology |
Xi, JunQiang | Beijing Institute of Technology |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Collision Avoidance Algorithms
Abstract: Planning in unstructured narrow environments with perception errors is highly challenging, as inaccurately perceived obstacle positions may lead planners to generate collision-prone trajectories. Typically, constraints are applied to the collision probability of the potential trajectory to ensure safety. However, due to the oversimplified modeling of vehicles or obstacles when estimating collision probabilities, these methods are often unsuitable for unstructured environments. To address this issue, we propose a computationally efficient risk zone generation method and introduce a Gaussian error function-based evaluation method for collision risk assessment. The resulting risk-aware cost term is integrated into a trajectory optimization framework based on optimal control theory, effectively enabling obstacle avoidance by explicitly penalizing collision risk along the trajectory. Additionally, to address the inherent limitations of penalty-based obstacle avoidance, we introduce a geometric safety constraint rigorously derived from duality principles. In the simulation scenarios of the Trajectory Planning Competition for Automated Parking organized by the IEEE Intelligent Transportation Systems Conference 2022, we introduce perception uncertainty to better reflect real-world conditions. The results demonstrate that our algorithm significantly improves trajectory safety while maintaining an effective balance between safety and efficiency.
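A Gaussian error function-based risk term of the kind mentioned above can be illustrated generically (this is not the paper's exact formulation; the one-dimensional clearance model, threshold `d_safe`, and penalty weight are assumptions): if the clearance d to an obstacle is modeled as Gaussian with mean mu and standard deviation sigma, the probability of violating a safety margin is the normal CDF, expressible via erf:

```python
import math

def collision_risk(mu, sigma, d_safe=0.5):
    """P(d < d_safe) for clearance d ~ N(mu, sigma^2),
    using Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    z = (d_safe - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk_cost(mu, sigma, d_safe=0.5, weight=100.0):
    """Risk-aware penalty term that could be added to a trajectory
    optimization cost at each point along the trajectory."""
    return weight * collision_risk(mu, sigma, d_safe)
```

A perceived clearance exactly at the margin gives a risk of 0.5, and the risk shrinks rapidly as the mean clearance grows relative to the perception uncertainty.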
|
|
11:15-12:30, Paper MoBT1.16 | Add to My Program |
Adaptive Path Planning for Skill-Based Personalization in Parking Maneuvers |
|
Speidel, Piet | Robert Bosch GmbH |
Hilsch, Michael | Robert Bosch GmbH |
Alt, Benedikt | Robert Bosch GmbH |
Schildbach, Georg | University of Luebeck |
Keywords: Multi-Objective Planning Approaches, User-Centric Intelligent Vehicle Technologies, User Experience in Autonomous Vehicles
Abstract: This work introduces the idea of skill-based personalization for Advanced Driver Assistance Systems, aiming to address the limitations of traditional imitation-based personalization methods. To this end, an adaptive path planning approach is developed as a crucial intermediate step in realizing this concept, exemplified through automated parking. For this purpose, the Hybrid A∗ and Elastic-Band methods were integrated and modified to accommodate flexible target positions and to incorporate additional Key Performance Indicators (KPIs) relevant for personalized parking algorithms into their cost functions. Additionally, two novel shortcut algorithms are proposed to address some of the limitations in adjusting these KPIs. The result is a path planner capable of producing customized paths aligned with the user's specific skills and preferences.
|
|
11:15-12:30, Paper MoBT1.17 | Add to My Program |
A Trajectory Optimisation Approach for Motorcycles |
|
Abdallah, Mohammad | Loughborough University |
Hubbard, Peter | Loughborough University |
Fleming, James | Loughborough University |
Keywords: Motion Planning Algorithms for Autonomous Vehicles
Abstract: Trajectory planning and optimisation for motorcycles is an underdeveloped area in autonomous vehicle research. The Kinematic Bicycle Model (KBM) is widely studied as a tool for path planning, trajectory optimisation and control in autonomous vehicle applications. However, due to the nonminimum phase properties of motorcycle systems, it is unsuitable as a model for path planning in motorcycles. This paper therefore presents an alternative trajectory planner that uses a reduced-order approximation of the motorcycle dynamics to express the trajectory optimisation problem as a quadratic program that can be solved efficiently. This planner captures the nonminimum phase behaviour of motorcycles and enforces the dynamic constraints needed for safety guarantees. Tracking accuracy is assessed in a detailed non-linear motorcycle simulation, which shows a reduced root mean square error of 0.386 m compared to 1.091 m when using a KBM-based planner. Successful collision avoidance is also achieved during tracking, unlike with the KBM-based planner. This provides a proof of concept for future research into real-time trajectory optimisation and planning for motorcycles.
|
|
MoBT2 Poster Session, Leonardo + Lobby Left |
Add to My Program |
Poster 1.2 >> Sensing and Perception: Objects Detection & Tracking |
|
|
Chair: Sotelo Vázquez, Miguel Ángel | University of Alcalá |
Co-Chair: Brehar, Raluca | Technical University of Cluj-Napoca, Computer Science Department |
|
11:15-12:30, Paper MoBT2.1 | Add to My Program |
3D Shape Adaptation across Datasets for Weakly Supervised Monocular 3D Object Detection |
|
Zhang, Xiaoning | Xi’an Jiaotong University |
Su, Yuanqi | Xi'an Jiaotong University |
Lu, HaoAng | Xi'an Jiaotong University |
Zhang, Chi | Xi'an Jiaotong University |
Liu, Yuehu | Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University |
Keywords: Static and Dynamic Object Detection Algorithms, Techniques for Dataset Domain Adaptation
Abstract: Monocular 3D object detection (M3D) is a key yet challenging task that usually involves extensive and expensive manual annotation of 3D boxes. To eliminate the dependence on 3D box labels, weakly supervised M3D (WM3D) has recently been explored using only 2D annotations, which necessitates the use of extra resources such as LiDAR data, stereo images, and video sequences. However, the strict correspondence and complex calibration between the target image and additional resources limit their applicability. In this work, we propose a simple yet effective framework, 3D Shape Adaptation across datasets for Weakly supervised Monocular 3D Detection (SAWM3D). We observed that directly applying a source-dataset detector to the target dataset results in a significant domain gap, primarily attributable to the 3D location, while orientation and dimensions have a smaller impact. This allows us to view WM3D as a 3D shape adaptation optimization on the target dataset. Directly scaling the predicted shape significantly reduces the adaptation gap; fine-tuning the 3D detector with only 2D annotations also yields impressive results. Experiments on the KITTI benchmark demonstrate the effectiveness of our strategies.
|
|
11:15-12:30, Paper MoBT2.2 | Add to My Program |
TinyCenterSpeed: Efficient Center-Based Object Detection for Autonomous Racing |
|
Reichlin, Neil | ETH |
Baumann, Nicolas | ETH |
Ghignone, Edoardo | ETH Zurich |
Magno, Michele | ETH Zurich |
Keywords: Deep Learning Based Approaches, Level 3 Driving Systems Architecture and Techniques, Real-Time Data Processing for UAVs
Abstract: Perception within autonomous driving is nearly synonymous with Neural Networks (NNs). Yet, in the domain of autonomous racing — often characterized by scaled, computationally limited robots used for cost-effectiveness and safety — opponent detection and tracking typically resort to traditional computer vision techniques due to computational constraints. This paper introduces TinyCenterSpeed, a streamlined adaptation of the seminal CenterPoint method, optimized for real-time performance on 1:10 scale autonomous racing platforms. This adaptation is viable even on OBCs powered solely by Central Processing Units (CPUs), as it incorporates the use of an external Tensor Processing Unit (TPU). We demonstrate that, compared to the Adaptive Breakpoint Detector (ABD), the current State-of-the-Art (SotA) in scaled autonomous racing, TinyCenterSpeed not only improves detection and velocity estimation by up to 61.38% but also supports multi-opponent detection and estimation. It achieves real-time performance with an inference time of just 7.88 ms on the TPU, significantly reducing CPU utilization 8.3-fold.
|
|
11:15-12:30, Paper MoBT2.3 | Add to My Program |
Efficient Extrinsic Manual-Calibration Method for Vehicle-Mounted Surround View Cameras Using Relative Pose Estimation |
|
Nakashima, Hiroyuki | Honda Motor Co., Ltd |
Oshiyama, Hiroki | Artner Co., Ltd |
Saigusa, Shigenobu | Honda R&D Americas, Inc |
Keywords: Level 2 ADAS Control Techniques, 3D Scene Reconstruction Methods
Abstract: In this paper, we propose a simple and accurate manual calibration method for the extrinsics of vehicle-mounted surround view cameras, utilizing relative camera pose estimation. This method addresses the labor and complexity associated with manual calibration of multi-camera systems. Our proposed method mitigates the accumulation of calculation errors that occurs when sequentially estimating relative poses for multiple cameras.
|
|
11:15-12:30, Paper MoBT2.4 | Add to My Program |
MonoSORT3D: A Monocular Approach for Online Auxiliary-Free Multi-Object Tracking |
|
Khonsari, Rana | University of Saarland |
Eisemann, Leon | Porsche Engineering Group GmbH |
Vozniak, Igor | DFKI |
Müller, Christian | German Research Center for Artificial Intelligence |
Maucher, Johannes | Stuttgart Media University |
Keywords: Dynamic Object Tracking, Static and Dynamic Object Detection Algorithms, Data Annotation and Labeling Techniques
Abstract: Using a mono camera in ADAS as the primary sensor offers a significant advantage by minimizing system complexity, since no extra fusion step is needed. This approach also leverages recent advancements in computer vision and deep learning, which enable high levels of environmental understanding and scene analysis from visual data alone. As such, mono-camera setups hold promise for achieving reliable perception at scale and have led to growing interest in monocular approaches, particularly for detecting and tracking dynamic objects. However, traditional mono-camera methods are often dependent on auxiliary inputs from GPS or maps, which may be unreliable in complex terrain or areas with poor signal coverage. In this work, we introduce an effective monocular 3D multi-object tracking approach, called MonoSORT3D, which operates without requiring additional auxiliary inputs. Evaluation of our method on the KITTI and MOT17 datasets demonstrates competitive performance against state-of-the-art methods. Additionally, we provide an in-depth analysis of the MonoSORT3D architecture by conducting an ablation study on its components.
|
|
11:15-12:30, Paper MoBT2.5 | Add to My Program |
Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving |
|
Schröder, Cornelius | Technical University Munich (TUM) |
Schlüter, Marius-Raphael | Technical University Munich (TUM) |
Lienkamp, Markus | Lehrstuhl Für Fahrzeugtechnik, TU München |
Keywords: Deep Learning Based Approaches, Static and Dynamic Object Detection Algorithms
Abstract: In autonomous systems, precise object detection and uncertainty estimation are critical for self-aware and safe operation. This work addresses confidence calibration for the classification task of 3D object detectors. We argue that it is necessary to consider the calibration of the full predictive confidence distribution over all classes, and deduce a metric which captures the calibration of dominant and secondary class predictions. We propose two auxiliary regularizing loss terms which introduce either calibration of the dominant prediction or of the full prediction vector as a training goal. We evaluate a range of post-hoc and train-time methods for CenterPoint, PillarNet and DSVT-Pillar and find that combining our loss term, which regularizes for calibration of the full class prediction, with isotonic regression leads to the best calibration of CenterPoint and PillarNet with respect to both dominant and secondary class predictions. We further find that DSVT-Pillar cannot be jointly calibrated for dominant and secondary predictions using the same method.
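The isotonic regression mentioned above as a post-hoc calibration step has the pool-adjacent-violators algorithm at its core. A minimal pure-Python sketch (not the paper's implementation; in practice the fit maps predicted confidences, sorted by score, onto empirical correctness rates):

```python
def pool_adjacent_violators(y):
    """Fit a nondecreasing sequence to y in the least-squares sense,
    the core step of isotonic regression used for confidence calibration."""
    # Each block stores [sum, count]; merge backwards whenever the
    # running block means violate monotonicity.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)  # expand each block to its mean
    return out
```

For example, `pool_adjacent_violators([0.2, 0.1, 0.4, 0.3])` returns `[0.15, 0.15, 0.35, 0.35]`: each violating pair is pooled to its mean, yielding a monotone calibration map.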
|
|
11:15-12:30, Paper MoBT2.6 | Add to My Program |
Adversarial Attacked Teacher for Domain Adaptive Object Detection under Poor Visibility Conditions |
|
Wang, Kaiwen | Karlsruhe Institute of Technology |
Shen, Yinzhe | Karlsruhe Institute of Technology |
Lauer, Martin | Karlsruher Institut Für Technologie |
Keywords: Perception Algorithms for Adverse Weather Conditions, Techniques for Dataset Domain Adaptation
Abstract: Camera-based object detection encounters challenges in adverse weather, which can compromise the robustness of the perception module within autonomous driving systems. Cutting-edge domain adaptive object detection methods use the teacher-student framework and domain adversarial learning to generate domain-invariant pseudo-labels for self-training. However, the pseudo-labels generated by the teacher model often exhibit a bias toward the majority class, incorporating overconfident false positives and underconfident false negatives. We reveal that pseudo-labels vulnerable to adversarial attacks are more likely to be of low quality. To address this issue, we propose a simple yet effective framework named Adversarial Attacked Teacher (AAT) to improve pseudo-label quality. Specifically, we apply adversarial attacks on the teacher model, prompting it to generate adversarial pseudo-labels to correct bias, suppress overconfidence, and encourage underconfident proposals. We introduce an adaptive pseudo-label regularization to emphasize the influence of pseudo-labels with high certainty and reduce the negative impacts of uncertain predictions. Moreover, reliable minority pseudo-labels, verified by pseudo-label regularization, are oversampled to minimize dataset imbalance without introducing false positives. AAT establishes a new state-of-the-art, achieving 53.0 mAP on the Cityscapes to Foggy Cityscapes benchmark. The code is publicly available at https://github.com/KIT-MRT/AAT/.
|
|
11:15-12:30, Paper MoBT2.7 | Add to My Program |
FADet: A Multi-Sensor 3D Object Detection Network Based on Local Featured Attention |
|
Guo, Ziang | Skolkovo Institute of Science and Technology |
Yagudin, Zakhar | Skolkovo Institute of Science and Technology |
Asfaw, Selamawit | Skolkovo Institute of Science and Technology |
Lykov, Artem | Skolkovo Institute of Science and Technology |
Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
Keywords: Advanced Multisensory Data Fusion Algorithms, Static and Dynamic Object Detection Algorithms, Deep Learning Based Approaches
Abstract: Camera, LiDAR, and radar are common perception sensors for autonomous driving tasks. Robust prediction of 3D object detection is optimally based on the fusion of these sensors. Taking advantage of their abilities remains a challenge, because each of these sensors has its own characteristics. Specifically, different sensors present different scales in their corresponding extracted features. To address this problem, considering the feature alignment in different scales, in this paper, we propose FADet, a multi-sensor 3D detection network, which specifically studies the characteristics of different sensors across the dimensions of their data input based on our local featured attention modules. For camera images, we propose a dual-attention-based submodule. For LiDAR point clouds, the triple-attention-based submodule is utilized, while the mixed-attention-based submodule is applied for features of radar points. With local featured attention submodules, our FADet has effective detection results in long-tail and complex scenes from camera, LiDAR and radar input. In the NuScenes validation dataset, FADet achieves state-of-the-art performance on LiDAR-camera object detection tasks with 71.8% NDS and 69.0% mAP, at the same time, on radar-camera object detection tasks with 51.7% NDS and 40.3% mAP.
|
|
11:15-12:30, Paper MoBT2.8 | Add to My Program |
HiLO: High-Level Object Fusion for Autonomous Driving Using Transformers |
|
Osterburg, Timo | TU Dortmund |
Albers, Franz | Technical University of Dortmund |
Diehl, Christopher | Technische Universität Dortmund |
Pushparaj, Rajesh | TU Dortmund |
Bertram, Torsten | Technische Universität Dortmund |
Keywords: Deep Learning Based Approaches, Static and Dynamic Object Detection Algorithms
Abstract: The fusion of sensor data is essential for a robust perception of the environment in autonomous driving. Learning-based fusion approaches mainly use feature-level fusion to achieve high performance, but their complexity and hardware requirements limit their applicability in near-production vehicles. High-level fusion methods offer robustness with lower computational requirements. Traditional methods, such as the Kalman filter, dominate this area. This paper modifies the Adapted Kalman Filter (AKF) and proposes a novel transformer-based high-level object fusion method called HiLO. Experimental results demonstrate improvements of 25.9 percentage points in F1-score and 6.1 percentage points in mean IoU. Evaluation on a new large-scale real-world dataset demonstrates the effectiveness of the proposed approaches. Their generalizability is further validated by cross-domain evaluation between urban and highway scenarios. Code, data, and models are available at https://github.com/rst-tu-dortmund/HiLO
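The Kalman-filter baseline for high-level (object-list) fusion mentioned above can be illustrated by covariance-weighted averaging of two sensors' estimates of the same object state. This is a generic sketch of the classical approach, not HiLO or the paper's AKF; the scalar state is an assumption:

```python
def fuse(x1, var1, x2, var2):
    """Fuse two independent estimates of the same object state
    (e.g., longitudinal position) by inverse-variance weighting,
    the scalar form of the Kalman measurement update."""
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    x = (w1 * x1 + w2 * x2) / (w1 + w2)
    var = 1.0 / (w1 + w2)
    return x, var
```

For instance, `fuse(10.0, 1.0, 12.0, 3.0)` returns `(10.5, 0.75)`: the fused estimate lies closer to the more certain sensor and its variance is lower than either input's.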
|
|
11:15-12:30, Paper MoBT2.9 | Add to My Program |
Edge-Deployable Spatiotemporal Modeling Network for Vehicle Behavior Recognition |
|
Li, Gaojie | Xi'an Jiaotong University |
Li, Yaochen | Xi'an Jiaotong University |
Zhang, Ying | Xi'an Jiaotong University |
Wang, Yutong | Xi'an Jiaotong University |
Hao, Sibo | Xi'an Jiaotong University |
Su, Yuanqi | Xi'an Jiaotong University |
Keywords: End-to-End Neural Network Architectures and Techniques, Deep Learning Based Approaches
Abstract: Vehicle behavior recognition is essential for autonomous vehicles to quickly perceive and respond to their driving environment. In this paper, an edge-deployable spatiotemporal modeling network for vehicle behavior recognition is developed. Firstly, an Efficient SpatioTemporal Modeling (ESTM) Block is designed to extract both long-term evolution features and short-term motion information. Secondly, a Channel-Enhanced Spatial Modeling (CESM) Block is developed to capture the interdependencies among channels in spatial modeling. The proposed network can effectively process video input from onboard cameras while minimizing computational parameters and FLOPs. The combination of the ESTM and CESM blocks produces rich and effective features for vehicle behavior recognition on edge devices. The experimental results demonstrate the effectiveness of the proposed method in real-world driving scenarios.
|
|
11:15-12:30, Paper MoBT2.10 | Add to My Program |
SAM-Maps: Road Map Generation for Automated Vehicles in Urban Areas |
|
van Andel, Matthijs Pieter | Delft University of Technology |
Boekema, Hidde | TU Delft |
Gavrila, Dariu M. | TU Delft |
Keywords: Geometric vs. Semantic Mapping, Motion Forecasting, Foundation Models Based Approaches
Abstract: Automated Vehicles (AVs) rely on up-to-date map information to inform trajectory prediction and planning modules, but these maps are expensive to obtain and update as they are usually annotated by humans. We propose SAM-Maps, a method for automatically generating road maps from aerial images of urban areas that takes advantage of the power of foundation models, requiring no human annotation or additional training to map unseen areas. This method extracts a coarse road graph from the images and then estimates the geometry of the roads from this graph. We evaluate our model on the challenging road layouts of the recent View-of-Delft Prediction dataset by comparing the maps generated using our model to the human-annotated maps, achieving an IoU of 33.3% with our automatic method and an IoU of 56.1% with some human corrections in our method. We also evaluate a trajectory prediction model on our maps to test whether they are sufficiently accurate for downstream tasks. The performance of this model using the map from our automatic method is 37.9% better on the minADE6 metric than not using map data as input. To the best of our knowledge, this is the first method that extracts both the drivable area and road connections of European urban areas from aerial images. The code will be publicly released for research purposes.
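The IoU figures above compare generated road maps against human-annotated ones. For reference, intersection-over-union on rasterized masks can be computed as follows (a generic sketch under the assumption that both maps are rasterized to sets of occupied grid cells; not the authors' evaluation code):

```python
def mask_iou(pred_cells, gt_cells):
    """Intersection over union of two rasterized road masks,
    each given as a set of occupied (row, col) grid cells."""
    pred, gt = set(pred_cells), set(gt_cells)
    union = pred | gt
    if not union:
        return 1.0  # both masks empty: define as perfect agreement
    return len(pred & gt) / len(union)
```

For example, masks `{(0,0), (0,1), (1,0)}` and `{(0,0), (1,0), (1,1)}` share 2 cells out of 4 in their union, giving an IoU of 0.5.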
|
|
11:15-12:30, Paper MoBT2.11 | Add to My Program |
Beyond Object Detection with Existence Maps for Anchor-Based Deep Learning Models |
|
Ramos Ferreira, Filipa | FEUP |
Rossetti, Rosaldo | University of Porto - FEUP |
Keywords: Deep Learning Based Approaches, Static and Dynamic Object Detection Algorithms
Abstract: LiDAR-based object detection models have achieved impressive accuracy in autonomous driving benchmarks. However, despite improvements in efficiency, these models lack interpretability, and measured accuracies are heavily dataset-dependent, reflecting closed-set performance. Further, in the real world, many dynamic situations arise, inducing inevitable perception errors. To address these limitations, we propose a twofold approach: first, we introduce the Existence Map, a method to visualise the internal knowledge of deep learning models that suggests the existence of objects, and second, we propose a methodology to merge this information with the standard output, supplementing detections and calibrating final confidences, reducing the mean absolute error by 12.3%. Further, our experiments on the KITTI dataset demonstrate that the merging strategy can enhance precision and recall by 0.03 and 0.04, respectively, when evaluated across all ground-truth classes, despite the model being trained only on cars, pedestrians and cyclists. Additionally, we show that existence maps can help identify missed objects, reduce false positives, and capture location uncertainties, leading to improved performance and increased interpretability in safety-critical object detection applications.
|
|
11:15-12:30, Paper MoBT2.12 | Add to My Program |
Cross-Level Sensor Fusion with Object Lists Via Transformer for 3D Object Detection |
|
Liu, Xiangzhong | Fortiss GmbH Research Institute of the Free State of Bavaria Associated with Technical University of Munich |
Zhang, Jiajie | Technical University of Munich |
Shen, Hao | Fortiss GmbH |
Keywords: Advanced Multisensory Data Fusion Algorithms, Deep Learning Based Approaches, Cooperative Perception and Localization Techniques
Abstract: In automotive sensor fusion systems, smart sensors and Vehicle-to-Everything (V2X) modules are commonly utilized. Sensor data from these systems are typically available only as processed object lists rather than raw sensor data from traditional sensors. Instead of processing other raw data separately and then fusing them at the object level, we propose an end-to-end cross-level fusion concept with Transformer, which integrates highly abstract object list information with raw camera images for 3D object detection. Object lists are fed into a Transformer as denoising queries and propagated together with learnable queries through the latter feature aggregation process. Additionally, a deformable Gaussian mask, derived from the positional and size dimensional priors from the object lists, is explicitly integrated into the Transformer decoder. This directs attention toward the target area of interest and accelerates model training convergence. Furthermore, as there is no public dataset containing object lists as a standalone modality, we propose an approach to generate pseudo object lists from ground-truth bounding boxes by simulating state noise and false positives and negatives. As the first work to conduct cross-level fusion, our approach shows substantial performance improvements over the vision-based baseline on the nuScenes dataset. It demonstrates its generalization capability over diverse noise levels of simulated object lists and real detectors.
|
|
11:15-12:30, Paper MoBT2.13 | Add to My Program |
PDB-Eval: An Evaluation of Large Multimodal Models for Description and Explanation of Personalized Driving Behavior |
|
Wu, Junda | UCSD |
Echterhoff, Jessica | UC San Diego |
Han, Kyungtae | Toyota Motor North America |
Abdelraouf, Amr | Toyota North America R&D |
Gupta, Rohit | Toyota Motor North America R&D |
McAuley, Julian | UC San Diego |
Keywords: Synthetic Data Generation for Training, Foundation Models Based Approaches, Data Augmentation Techniques Using Neural Networks
Abstract: Understanding a driver's behavior and intentions is important for potential risk assessment and early accident prevention. Safety and driver assistance systems can be tailored to individual drivers' behavior, significantly enhancing their effectiveness. However, existing datasets are limited in describing and explaining general vehicle movements based on external visual evidence. This paper introduces a benchmark, PDB-Eval, for a detailed understanding of Personalized Driver Behavior, and aligning Large Multimodal Models (MLLMs) with driving comprehension and reasoning. Our benchmark consists of two main components, PDB-X and PDB-QA. PDB-X can evaluate MLLMs' understanding of temporal driving scenes. Our dataset is designed to find valid visual evidence from the external view to explain the driver's behavior from the internal view. To align MLLMs' reasoning abilities with driving tasks, we propose PDB-QA as a visual explanation question-answering task for MLLM instruction fine-tuning. As a generic learning task for generative models like MLLMs, PDB-QA can bridge the domain gap without harming MLLMs' generalizability. Our evaluation indicates that fine-tuning MLLMs on fine-grained descriptions and explanations can effectively bridge the gap between MLLMs and the driving domain, which improves zero-shot performance on question-answering tasks by up to 73.2%. We further evaluate the MLLMs fine-tuned on PDB-X in Brain4Cars' intention prediction and AIDE's recognition tasks. We observe up to 12.5% performance improvements on the turn intention prediction task in Brain4Cars, and consistent performance improvements up to 11.0% on all tasks in AIDE.
|
|
11:15-12:30, Paper MoBT2.14 | Add to My Program |
Weight Pruning to Mitigate Class-Specific Accuracy Degradation for LiDAR-Based 3D Object Detection |
|
Ito, Tenshi | Chubu University |
Hirakawa, Tsubasa | Chubu University |
Yamashita, Takayoshi | Chubu University |
Fujiyoshi, Hironobu | Chubu University |
Keywords: Static and Dynamic Object Detection Algorithms, Deep Learning Based Approaches
Abstract: The realization of autonomous driving systems requires efficient and accurate 3D object detection to identify objects such as vehicles, pedestrians, and cyclists in the driving environment from point cloud data. To achieve both high-speed processing and high accuracy, model size must be reduced with model compression techniques, such as pruning, while maintaining performance. However, pruning for 3D object detection tasks has not been studied extensively, and the effects of applying existing pruning methods to 3D object detection models remain unclear. In this paper, we clarify the problems of pruning 3D object detection models with existing methods through preliminary experiments and propose a pruning method for 3D object detection models that solves these problems. Our preliminary experiments reveal that existing pruning methods significantly degrade detection performance for specific object classes. To address this issue, we propose a pruning method that preserves class-specific knowledge, mitigating biased accuracy degradation across object classes. Experimental results on the KITTI dataset demonstrate that the proposed method can be combined with existing pruning methods without conflicts and achieves higher accuracy than existing methods.
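For context, the common baseline such work builds on is global magnitude pruning, which zeroes the smallest-magnitude weights. A minimal sketch (illustrative only, not the class-preserving method proposed in the paper; the flat weight list is an assumption — real models prune per-layer tensors):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest |w|
    (global magnitude pruning, a common compression baseline).
    Note: ties at the threshold magnitude are all pruned."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest |w|.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

For example, `magnitude_prune([0.5, -0.1, 0.3, -0.05], 0.5)` returns `[0.5, 0.0, 0.3, 0.0]`; applied uniformly across classes, such pruning can disproportionately remove weights important to rare classes, which is the degradation the paper targets.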
|
|
11:15-12:30, Paper MoBT2.15 | Add to My Program |
Infrastructure Based Detection, Tracking and Modelling of Traffic Participants for Realistic Digital Twin Representation and Behavior Prediction (I) |
|
Kocsis, Mihai | Heilbronn University |
Heinrich, Erik | Heilbronn University |
Schnepf, Florian | Heilbronn University of Applied Sciences |
Zöllner, Raoul | University of Heilbronn |
Keywords: Cooperative Perception and Localization Techniques, Dynamic Object Tracking, Synthetic Data Generation for Training
Abstract: Traffic congestion and inefficiencies in urban mobility remain significant challenges, necessitating advanced solutions that leverage real-world data and intelligent control strategies. To optimize traffic flows, e.g., through traffic light control, an analysis of real-world scenarios in urban hotspots is required. This research integrates data on real traffic participants and traffic light status recorded in the Test Field Autonomous Driving Baden-Württemberg (TAF-BW) into a digital twin. The focus lies on object detection and tracking in the test field, the integration of this data into a digital twin, and accurate modelling of pedestrians. The real and simulated data form the basis for training behavioral models of various traffic participants. Based on the intentions of these participants, traffic light signal control can be used to optimize traffic flow.
|
|
11:15-12:30, Paper MoBT2.16 | Add to My Program |
LiDAR-Guided Monocular 3D Object Detection for Long-Range Railway Monitoring (I) |
|
Domínguez Sánchez, Raul David | Technical University of Munich/SETLabs Research GmbH |
Díaz, Xavier | Setlabs Research GmbH |
Zhou, Xingcheng | Technical University of Munich |
Ronecker, Max Peter | SETLabs Research GmbH |
Karner, Michael | SETLabs Research GmbH |
Watzenig, Daniel | Virtual Vehicle Research Center |
Knoll, Alois | Technische Universität München |
Keywords: Advanced Multisensory Data Fusion Algorithms, Deep Learning Based Approaches, Lidar-Based Environment Mapping
Abstract: Railway systems, particularly in Germany, require high levels of automation to address legacy infrastructure challenges and increase train traffic safely. A key component of automation is robust long-range perception, essential for early hazard detection, such as obstacles at level crossings or pedestrians on tracks. Unlike automotive systems with braking distances of ~70 meters, trains require perception ranges exceeding 1 km. This paper presents a deep-learning-based approach for long-range 3D object detection tailored for autonomous trains. The method relies solely on monocular images, inspired by the Faraway-Frustum approach, and incorporates LiDAR data during training to improve depth estimation. The proposed pipeline consists of four key modules: (1) a modified YOLOv9 for 2.5D object detection, (2) a depth estimation network, and (3-4) dedicated short- and long-range 3D detection heads. Evaluations on the OSDaR23 dataset demonstrate the effectiveness of the approach in detecting objects up to 250 meters away. Results highlight its potential for railway automation and outline areas for future improvement.
|
|
11:15-12:30, Paper MoBT2.17 | Add to My Program |
MH-CDNet: Map and History-Aided Change Detection of Traffic Signs in High-Definition Maps |
|
Zhong, Yangyi | Wuhan University |
Guo, Yuxiang | Wuhan University |
Yue, Peng | Wuhan University |
Cai, Chuanwei | Wuhan University |
Li, Jian | State Key Laboratory of Intelligent Vehicle Safety Technology Cho |
Kai, Yan | Chongqing ChangAn Auto |
|
|
11:15-12:30, Paper MoBT2.18 | Add to My Program |
Q-Loc: Visual Cue-Based Ground Vehicle Localization Using Long Short-Term Memory |
|
Malinchock, Cole | North Carolina State University |
Yu, Jimin | North Carolina State University |
Thapa, Pratik | North Carolina State University |
Ungrupulithaya, Dhruva | North Carolina State University |
Yoon, Man-Ki | North Carolina State University |
Keywords: Continuous Localization Solutions, 3D Scene Reconstruction Methods, Dynamic Object Tracking
Abstract: Mobile autonomous systems are increasingly being deployed in controlled environments worldwide, with large fleets of ground robots performing tasks such as delivery and surveillance. These systems require reliable localization to navigate through such environments. While the Global Positioning System (GPS) is commonly implemented in these systems, urban environments can introduce inaccuracies due to signal blockages caused by large buildings and structures, or even complete signal loss. This paper proposes a rapid and cost-effective localization method using a sensor ubiquitous in autonomous systems: cameras. We introduce a system that uses vision-based machine learning techniques to detect common landmarks in camera streams and subsequently predict location. The system employs advanced object detection models for landmark identification and recurrent neural networks for vehicle localization based on the detected landmarks. We prototype these techniques on a small-scale autonomous vehicle platform to demonstrate the system’s capabilities and evaluate its accuracy and execution efficiency in real-world scenarios.
|
|
MoBT3 Poster Session, Raffaello + Lobby Right |
Add to My Program |
Poster 1.3 >> Datasets & Neural Scene Representation |
|
|
Chair: Hornauer, Sascha | MINES Paristech |
Co-Chair: López, Antonio M. | Universitat Autònoma De Barcelona |
|
11:15-12:30, Paper MoBT3.1 | Add to My Program |
Enhancing Data Efficiency for Training Object Detectors |
|
Höhne, Mirco Oliver | Robert Bosch GmbH |
Menke, Maximilian | Robert Bosch GmbH |
Bieshaar, Maarten | Robert Bosch GmbH |
Keywords: Profile Extraction and Discovery from Datasets, Data Annotation and Labeling Techniques, Deep Learning Based Approaches
Abstract: Deep learning has transformed object detection in autonomous driving and robotics. Yet, it requires training with large datasets, driving up costs and resource demands. While data reduction techniques offer a solution, most research has focused on image classification, leaving object detection largely unexplored. This paper introduces, adapts, and evaluates different data reduction strategies for 2D object detection. Experiments with Faster R-CNN on nuImages and BDD100K reveal that: (1) Reduction methods based on loss prove to be both simple and effective, achieving up to 40% dataset reduction while preserving model performance, (2) We introduce a novel predictive measure for dataset quality, leveraging intrinsic dimensionality to evaluate dataset diversity. This metric achieves up to 80% alignment (Spearman correlation) with the final performance on the full dataset, enabling efficient pre-evaluation of potential reductions and streamlining the reduction process, (3) We investigate the impact of label errors on data reduction, revealing their influence, especially at high dataset compression rates, and offering key insights for developing robust reduction strategies.
|
|
11:15-12:30, Paper MoBT3.2 | Add to My Program |
PedGT: Enhancing Pedestrian Intention Prediction Using a Skeleton-Based Graph-Transformer |
|
Riaz, Muhammad Naveed | Computer Vision Center (CVC), Universitat Autònoma De Barcelona |
Wielgosz, Maciej | Computer Vision Center (CVC) |
Xie, Chen | Jilin University |
López, Antonio M. | Universitat Autònoma De Barcelona |
Keywords: Synthetic Data Generation for Training, Data Annotation and Labeling Techniques, Vulnerable Road User Protection Strategies
Abstract: Accurately predicting pedestrian crossings in front of ego-vehicles is essential for intelligent transportation systems (ITS) to enhance road safety. Many existing approaches rely on multiple input modalities, such as scene images, segmentation maps, and trajectory data, which introduce complexity and inefficiencies, thus limiting real-time applicability. To address these challenges, we propose PedGT, a graph-based transformer model that integrates a graph convolutional network (GCN) for spatial feature extraction and a transformer encoder for temporal modeling. Unlike multi-modal methods, PedGT simplifies the pipeline by utilizing only pedestrian pose keypoints and bounding box center points, achieving superior performance on two benchmark datasets. On PIE, it achieves an F1 score of 91% and a recall of 93%, surpassing the previous best of 89% and 88% by PCPNet. On JAAD, PedGT improves F1 and recall to 70%, outperforming PedFormer’s 54% and 48%. Ablation studies highlight the impact of data normalization on accuracy, while frame importance analysis identifies keyframes influencing predictions. This work demonstrates that selecting optimal inputs and leveraging an efficient spatial-temporal model enable PedGT to outperform multi-modal solutions, providing a more streamlined and effective approach for pedestrian intention prediction.
|
|
11:15-12:30, Paper MoBT3.3 | Add to My Program |
Really, Pedestrian Trajectories: How Realistic Are the Datasets? |
|
Dietl, Laura | Technische Hochschule Ingolstadt |
Facchi, Christian | Technische Hochschule Ingolstadt |
Keywords: Automotive Datasets, Motion Forecasting
Abstract: The accurate prediction of pedestrian trajectories is a crucial ability for systems to safely navigate in real-world traffic scenarios. However, pedestrian trajectory prediction is challenging due to the social, environmental, and intra-personal context, as well as the multimodal nature of future paths. The effectiveness of path prediction models in accounting for these factors is highly dependent on the dataset on which they are trained. Moreover, in order for such models to be utilized in real-world traffic scenarios, the datasets need to reflect their characteristics, such as various infrastructure elements or pedestrian interactions with a range of road user classes. This work examines how realistic four common pedestrian trajectory datasets are, namely the BIWI Walking Pedestrians (ETH) dataset, the Crowds UCY/Zara (UCY) dataset, the Intersection Drone (inD) dataset and the Stanford Drone Dataset (SDD). To this end, a complexity classification scheme is defined that categorizes pedestrian trajectories based on a newly developed Social Attention (SA) factor, trajectory non-linearity, and the number of starts and stops in a trajectory. The datasets are evaluated based on these factors, the complexity classification, and the environmental context, and then compared to one another. This work provides a source of information about strengths and limitations of the datasets, the complexity of individual pedestrian paths, the behaviors that may be learnable, and the application areas for which they may be best suited.
|
|
11:15-12:30, Paper MoBT3.4 | Add to My Program |
IDD-CRS: A Comprehensive Video Dataset for Critical Road Scenarios in Unstructured Environments |
|
Mishra, Ravi Shankar | International Institute of Information Technology, Hyderabad |
Parikh, Chirag | International Institute of Information Technology, Hyderabad |
Subramanian, Anbumani | INAI, International Institute of Information Technology, Hyderabad |
Jawahar, C. V. | IIIT Hyderabad |
Sarvadevabhatla, Ravi Kiran | International Institute of Information Technology, Hyderabad |
Keywords: Data Annotation and Labeling Techniques, Vulnerable Road User Protection Strategies, Advanced Passive Safety Systems
Abstract: In this work, we present IDD-CRS, a large-scale dataset focused on critical road scenarios, captured using Advanced Driver Assistance Systems (ADAS) and dash cameras. Unlike existing datasets that predominantly emphasize pedestrian safety and vehicle safety separately, IDD-CRS incorporates both vehicle and pedestrian behaviors, offering a more comprehensive view of road safety. The dataset includes diverse scenarios, such as high-speed lane changes, unsafe vehicle approaches to pedestrians and cyclists, and complex interactions between ego vehicles and other road agents. Leveraging ADAS technology allows us to accurately define the temporal boundaries of actions, resulting in precise annotations and more reliable safety analysis. With 90 hours of video footage, consisting of 5400 one-minute-long videos and 135,000 frames, IDD-CRS introduces new vehicle-related classes and hard negative classes, establishing baselines for action recognition and long-tail action recognition tasks. Our benchmarks reveal the limitations of current models, pointing toward future advancements needed for improving road safety technology.
|
|
11:15-12:30, Paper MoBT3.5 | Add to My Program |
RealDriveSim: A Realistic Multi-Modal Multi-Task Synthetic Dataset for Autonomous Driving |
|
Jadon, Arpit | German Aerospace Center |
Wang, Haoran | Max-Planck Institut Für Informatik, Saarland Informatics Campus, |
Thomas, Phillip | Parallel Domain |
Stanley, Michael | Zoox, San Francisco, California |
Cibik, S. Nathaniel | Parallel Domain, San Francisco, California |
Laurat, Rachel | Parallel Domain, San Francisco, California |
Maher, Omar | Monta AI, Sacramento, California |
Hoyer, Lukas | ETH Zurich |
Unal, Ozan | Huawei Technologies, Zurich Research Center, Zurich |
Dai, Dengxin | Huawei Technologies, Zurich Research Center, Zurich |
Keywords: Automotive Datasets, Data Annotation and Labeling Techniques, Synthetic Data Generation for Training
Abstract: As perception models continue to develop, the need for large-scale datasets increases. However, data annotation remains far too expensive to effectively scale and meet the demand. Synthetic datasets provide a solution to boost model performance with substantially reduced costs. However, current synthetic datasets remain limited in scope and realism, and are designed for specific tasks and applications. In this work, we present RealDriveSim, a realistic multi-modal synthetic dataset for autonomous driving that not only supports popular 2D computer vision applications but also their LiDAR counterparts, providing fine-grained annotations for up to 64 classes. We extensively evaluate our dataset for a wide range of applications and domains, demonstrating state-of-the-art results compared to existing synthetic benchmarks. The dataset is publicly available at https://realdrivesim.github.io/.
|
|
11:15-12:30, Paper MoBT3.6 | Add to My Program |
ICF-Body: A Multimodal Sensor Fusion Dataset for In-Cabin Estimation of Occupant Body Pose and Anthropometric Measurements |
|
Preu, Victor | Volkswagen AG |
Dihora, Savan | Volkswagen AG |
Rygol, Tim | Volkswagen AG |
Pauer, Daniel | Volkswagen AG |
Almeida, Pedro | Volkswagen AG |
Hecker, Peter | Technische Universität Braunschweig |
Keywords: Automotive Datasets, Advanced Multisensory Data Fusion Algorithms, Advanced Passive Safety Systems
Abstract: We introduce ICF-Body, a multimodal in-cabin sensor fusion dataset. Our dataset aims to allow the development and benchmarking of in-cabin sensor fusion algorithms for estimating (a) occupant body pose and (b) anthropometric measurements. Both tasks are closely related, as knowledge of body dimensions and proportions ensures a consistent body model across different poses. ICF-Body is the first dataset that contains temporal data from a memory seat configuration sensor (SCS) and a belt webbing extraction sensor (WES), in addition to near-infrared (NIR) images, RGB images, ultra-wideband (UWB), and 60 GHz radar. A time-of-flight (ToF) camera was used for dynamic body pose ground truth labeling, providing thirteen 3D keypoints. Seven body measurements were obtained as static anthropometric ground truth labels for each participant. The availability of accurate estimates of (a) body pose and (b) anthropometric measurements will be crucial for future restraint systems to allow the precise adaptation of deployment characteristics.
|
|
11:15-12:30, Paper MoBT3.7 | Add to My Program |
Cross-Cultural Analysis of Car-Following Dynamics: A Comparative Study of Open-Source Trajectory Datasets |
|
Taourarti, Imane | Ensta Paris / Renault Group |
Tapus, Adriana | ENSTA ParisTech |
Monsuez, Bruno | Ecole Nationale Supérieure Des Techniques Avancées |
Ibanez Guzman, Javier | Renault S.A.S. |
Ramaswamy, Arunkumar | Renault |
Keywords: Level 2 ADAS Control Techniques, Automotive Datasets, Profile Extraction and Discovery from Datasets
Abstract: This study addresses the critical need for refined, reliable, and complete real-world trajectory data in the development of Advanced Driver Assistance Systems (ADAS), particularly for Adaptive Cruise Control (ACC) functions. We conducted a comprehensive comparison of car-following and deceleration scenarios across ten open-source datasets from multiple countries, encompassing both highway and urban environments. Focusing on key kinematic variables crucial for longitudinal behavior, we employed statistical measures and safety metrics to compare datasets across different driving regulations and road designs. Our findings reveal substantial overlaps in the distributions of logical parameters, despite the varied data sources and cultural contexts. However, we noted significant differences in safety-critical metrics, such as Time Headway and Time To Collision (TTC), highlighting culture-specific driving behaviors. Interestingly, Chinese datasets consistently exhibited the smallest distance headways across all scenarios, yet maintained high TTC values (around 16s) compared to other datasets, suggesting a unique approach to risk management. To quantify these differences, we calibrated the Intelligent Driver Model using U.S. data and evaluated its transferability, demonstrating remarkable performance degradation when applied to non-U.S. datasets. These results provide crucial insights for developing globally applicable, yet culturally sensitive safety assessment methodologies for next-generation automated vehicles, highlighting the need for adaptive ADAS technologies that can accommodate regional driving norms while maintaining consistent safety standards.
|
|
11:15-12:30, Paper MoBT3.8 | Add to My Program |
An Effective and Robust Driving Scenario Identification Framework Utilizing Unsupervised Covariance Clustering |
|
Zeng, Zifan | Huawei Technologies Duesseldorf GmbH; Technical University of Munich |
Liu, Shiming | Huawei Technologies Co., Ltd |
Bao, Zhenyu | Huawei Technologies Co., Ltd |
Zhang, Qunli | Huawei Technologies Duesseldorf GmbH |
Wang, Peng | Huawei Technologies Co., Ltd., RAMS Lab |
Hu, Zheng | Huawei |
Keywords: Profile Extraction and Discovery from Datasets, Data Annotation and Labeling Techniques, Safety Verification and Validation Techniques
Abstract: The technology of autonomous driving vehicles has made rapid progress over the last decade, but the challenge of proving these systems' safety still exists. Compared to conventional mile-based testing, scenario-based testing (SBT) is a more promising solution since scenarios covering diverse and rare driving conditions in real traffic can be simulated to assess the system's performance in safety-critical scenarios. Furthermore, understanding run-time scenarios is vital to trigger safety mechanisms designed for the Safety Of The Intended Functionality (SOTIF). However, a challenging task is to extract or generate scenarios in the design phase and recognize driving scenarios in the run-time phase due to the complexity and diversity of driving scenarios, especially the interaction with other driving agents. In this study, we propose a complete framework for offline extraction and online identification of all kinds of interaction scenarios. A covariance-clustering-based method is adopted to identify the meta-driving actions, which uses the Toeplitz matrix to achieve more interpretable clustering results than distance-based methods. Experiments demonstrate the effectiveness of our method by its robust identification results for cut-in and merge-in scenarios. With a lightweight design and a theoretically valid confidence estimation method, our approach is computationally efficient for reliable online applications.
|
|
11:15-12:30, Paper MoBT3.9 | Add to My Program |
BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving |
|
Brandstätter, Felix | University of Applied Science Munich |
Schuetz, Erik | Munich University of Applied Sciences |
Winter, Katharina | Munich University of Applied Sciences |
Flohr, Fabian | Munich University of Applied Sciences |
Keywords: Foundation Models Based Approaches, Feedback Systems for Driver Interaction, Automotive Datasets
Abstract: Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing the state of the art by up to 5% in BLEU scores. Additionally, we release two new datasets — nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) — to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.
|
|
11:15-12:30, Paper MoBT3.10 | Add to My Program |
Design and Development of a Digital Twin for Monitoring Railway Infrastructure |
|
Fuentes, Javier | University of Alcala |
Fierro, Franck | University of Alcala |
Barea, Rafael | University of Alcala |
López-Guillén, Elena | University of Alcalá |
Bergasa, Luis M. | University of Alcala |
Keywords: Synthetic Data Generation for Training, Dataset Augmentation Using Neural Field
Abstract: The aim of this work is to develop a digital twin application to ensure an optimal level of reliability when launching a larger project based on the identification of trains and detection of defects that allow for safe freight transport on Spanish trains. The digital twin framework consists of three parts: the “physical product”, which consists of a scanning camera placed on a track gantry; the “virtual product”, which includes a model based on real-time data representing the freight car detected by the perception system; and the data flow connections. The camera images will be post-processed through an artificial intelligence detection model (YOLOv8), trained to detect all the elements necessary for the safety of the vehicle and the cargo. Field studies have demonstrated the effectiveness of the proposed digital twin framework and its potential to identify railcars and detect defects in freight wagons.
|
|
11:15-12:30, Paper MoBT3.11 | Add to My Program |
How Hard Is Snow? A Paired Domain Adaptation Dataset for Clear and Snowy Weather: CADC+ |
|
Tang, Mei Qi | University of Waterloo |
Sedwards, Sean | University of Waterloo |
Huang, Chengjie | University of Waterloo |
Czarnecki, Krzysztof | University of Waterloo |
Keywords: Automotive Datasets, Techniques for Dataset Domain Adaptation, Data Annotation and Labeling Techniques
Abstract: Evaluating the impact of snowfall on 3D object detection requires a dataset with sufficient labelled data from both weather conditions, ideally captured in the same driving environment. Current datasets with LiDAR point clouds either do not provide enough labelled data in both domains, or rely on de-snowing methods to generate synthetic clear weather. Synthetic data often lacks realism and introduces an additional domain shift that confounds accurate evaluations. To address these challenges, we present CADC+, the first paired weather domain adaptation dataset for autonomous driving in winter conditions. CADC+ extends the Canadian Adverse Driving Conditions (CADC) dataset using clear weather data that was recorded on the same roads and in the same period as CADC. To create CADC+, we pair each CADC sequence with a clear weather sequence that matches the snowy sequence as closely as possible. CADC+ thus minimizes the domain shift resulting from factors unrelated to the presence of snow. We also present some preliminary results using CADC+ to evaluate the effect of snow on 3D object detection performance. We observe that snow introduces a combination of aleatoric and epistemic uncertainties, acting as both noise and a distinct data domain.
|
|
11:15-12:30, Paper MoBT3.12 | Add to My Program |
Methodology for Scalable LiDAR Datasets |
|
Sanchez Guitierrez-Cabello, Guillermo | Institute for Automobile Research (INSIA), Universidad Politécnica De Madrid |
Jiménez, Felipe | Universidad Politécnica De Madrid |
Talavera, Edgar | Universidad Politecnica De Madrid |
Keywords: Automotive Datasets, Deep Learning Based Approaches, Infrastructure Requirements for Automated Vehicles
Abstract: The use of Light Detection and Ranging (LiDAR) in intelligent transportation systems has gained increasing attention due to its ability to provide accurate three-dimensional representations of the environment and its robustness under adverse weather conditions compared to camera-based systems. However, traditional classification approaches, whether based on geometric features or deep learning, are highly dependent on sensor configuration and can suffer from occlusions or incomplete representations. This work proposes a classification- and retrieval-based methodology leveraging embedding vectors in a latent space generated through PointNet and a KNN-based label assignment strategy. A novel auto-labeling process is introduced, incorporating ID consistency filtering to ensure coherent label propagation across multiple captures of the same vehicle. Additionally, a combined proximity measure is used in KNN retrieval, integrating both latent space similarity and sensor distance correction, enhancing classification robustness, particularly for underrepresented categories. The proposed approach is validated on real-world LiDAR data captured from fixed infrastructure, addressing the scarcity of publicly available datasets in this configuration. Results demonstrate that this method provides an accurate, scalable, and sensor-independent classification framework, ensuring reliable label assignment across diverse traffic conditions.
|
|
11:15-12:30, Paper MoBT3.13 | Add to My Program |
TIAND-SLAM: A Multi-Modal SLAM Dataset for Autonomous Navigation |
|
Thakur, Abhishek | IIT Hyderabad |
S, Abhilash | Indian Institute of Technology, Hyderabad |
V, Samuktha | Indian Institute of Technology Hyderabad |
Pachamuthu, Rajalakshmi | Indian Institute of Technology, Hyderabad |
Keywords: Automotive Datasets
Abstract: Simultaneous Localization and Mapping (SLAM) is fundamental to autonomous navigation, relying on multimodal datasets to generate high-definition (HD) maps for accurate localization and path planning. However, most existing SLAM datasets primarily feature structured environments with abundant landmarks, such as urban areas with buildings and well-defined road infrastructures. In contrast, highways in semi-urban areas pose significant challenges due to the limited availability of prominent features. To bridge this gap, we introduce TIAND-SLAM (TiHAN-IITH Autonomous Navigation Dataset), a novel multi-modal dataset collected from highway roads, both under and above a flyover in Hyderabad, India, as well as from within the IIT Hyderabad (IITH) campus and the TiHAN testbed. TIAND-SLAM is designed to facilitate research on SLAM generalization in feature-scarce environments. The dataset includes 30 trajectories ranging from 50 meters to 2.5 kilometers, recorded using a sensor suite comprising a LiDAR, six cameras, radar, RTK-GNSS, and an Inertial Measurement Unit (IMU) sensor. Ground truth localization is obtained using RTK-GNSS, ensuring precise benchmarking. Additionally, we evaluate SLAM performance by generating maps using LiDAR data. TIAND-SLAM serves as a valuable resource for advancing SLAM research on highways and unstructured terrains, promoting robustness and adaptability in real-world autonomous navigation scenarios.
|
|
11:15-12:30, Paper MoBT3.14 | Add to My Program |
Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset |
|
Dhaouadi, Oussema | Technical University of Munich |
Meier, Johannes Michael | TU Munich |
Wahl, Luca | DeepScenario |
Kaiser, Jacques | DeepScenario GmbH |
Scalerandi, Luca | Technical University of Munich, DeepScenario |
Wandelburg, Nick | DeepScenario GmbH |
Zhuolun, Zhou | DeepScenario |
Berinpanathan, Nijanthan | DeepScenario GmbH |
Banzhaf, Holger | DeepScenario GmbH |
Cremers, Daniel | TU Munich |
Keywords: UAV Datasets, Automotive Datasets, Data Annotation and Labeling Techniques
Abstract: Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment in the close vicinity of the measurement vehicle only, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6-degrees-of-freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. The DSC3D dataset was captured at five locations in Europe and the United States: a parking lot, a crowded inner city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at app.deepscenario.com, facilitating research in motion prediction, behavior modeling, and safety validation.
|
|
11:15-12:30, Paper MoBT3.15 | Add to My Program |
Prediction of Occluded Pedestrians in Road Scenes Using Human-Like Reasoning: Insights from the OccluRoads Dataset |
|
Melo Castillo, Angie Nataly | University of Alcala |
Martin Serrano, Sergio | University of Alcala |
Salinas Maldonado, Carlota | University of Alcala |
Sotelo, Miguel A. | University of Alcala |
Keywords: Automotive Datasets, Static and Dynamic Object Detection Algorithms, Synthetic Data Generation for Training
Abstract: Pedestrian detection is a critical task in autonomous driving, aimed at improving safety and reducing risks on the road. In recent years, significant advancements have been made in detection performance. However, these achievements still fall short of human perception, particularly in cases involving occluded pedestrians, especially those entirely invisible. In this work, we present the Occlusion-Rich Road Scenes with Pedestrians (OccluRoads) dataset, a diverse collection of road scenes with partially and fully occluded pedestrians in both real-world and virtual environments. All scenes are meticulously labeled and enriched with contextual information that encapsulates human perception in such scenarios. Leveraging this dataset, we developed a pipeline to predict the presence of occluded pedestrians using Knowledge Graph (KG), Knowledge Graph Embedding (KGE), and a Bayesian inference process. Our approach achieves an F1 score of 0.91, representing an improvement of up to 42% compared to traditional machine learning models.
|
|
11:15-12:30, Paper MoBT3.16 | Add to My Program |
Synthetic Dataset Generation Using Logical Scenario Files for Automotive Perception Testing |
|
García, Mikel | Vicomtech |
Iglesias, Aitor | Fundación Vicomtech |
Sánchez, Martí | Fundación Vicomtech |
Naranjo, Ruben | Vicomtech |
Iñiguez de Gordoa, Jon Ander | Vicomtech |
Nieto, Marcos | Vicomtech |
Aginako Bengoa, Naiara | UPV/EHU |
Keywords: Data Annotation and Labeling Techniques, Automotive Datasets, Synthetic Data Generation for Training
Abstract: Conducting extensive recording campaigns to assess the safety of newly developed Automated Driving Systems (ADS) or perception algorithms has proven to be a costly and time-consuming process. This is one of the reasons why the automotive industry is adopting the scenario-based testing methodology to verify and validate the safety of the developed ADS in their expected operating domain. The exterior perception system is the first component in the sense-plan-act process of Connected Cooperative and Automated Vehicles (CCAVs). In this context, high-fidelity simulation engines are used to replicate sensor setups at reduced cost and higher scalability than driving and capturing data from real sensors. The use of logical automotive scenario descriptions allows defining certain parameter ranges, contexts and actions to execute simulations that fulfill the desired conditions. This work proposes a methodology for generating synthetic labelled datasets to test and validate automotive perception systems using logical scenario files. Decoupling the desired sensor setup from the simulation allows reproducing and testing the same situation under different sensor setups and conditions. We implement the methodology to validate three 3D LiDAR-based object detectors in three different sensor setups. The generated sample dataset will be made public here.
|
|
11:15-12:30, Paper MoBT3.17 | Add to My Program |
The DLR Urban Traffic Dataset (DLR-UT): A Comprehensive Traffic Dataset from an Urban Research Intersection |
|
Schicktanz, Clemens | German Aerospace Center (DLR), Institute of Transportation Systems |
Klitzke, Lars | German Aerospace Center (DLR), Institute of Transportation Systems |
Gimm, Kay | German Aerospace Center (DLR), Institute of Transportation Systems |
Rizzo, Giancarlo | German Aerospace Center (DLR) |
Liesner, Karsten | German Aerospace Center (DLR) |
Mosebach, Henning | German Aerospace Center |
Knake-Langhorst, Sascha | DLR |
Keywords: Automotive Datasets
Abstract: Current trajectory datasets of traffic participants often lack detailed environmental information, which is crucial for developing effective data-driven methods for future mobility solutions. To address this gap, we introduce the comprehensive DLR Urban Traffic dataset. The dataset includes 32,296 trajectories of traffic participants, along with traffic light data, local weather data, air quality data, and road condition data collected at a research intersection during a single day. A comparison with other publicly available datasets reveals that our dataset offers more comprehensive information about the traffic environment than existing alternatives. In addition, since version 1.2.0, the dataset includes metadata such as traffic volume per lane and the trajectory data in the OpenSCENARIO format, enabling data replay in simulation. An analysis of our dataset shows that trajectories of motorized road users (MRU) are available for all possible 16 routes at the intersection. Most interactions between MRU occur during unprotected left turns with oncoming traffic. However, there are also interactions between MRU and vulnerable road users, particularly during right turns. All in all, the dataset provides researchers with the resources needed to improve urban mobility solutions. Available for non-commercial use, the dataset can be directly downloaded from https://doi.org/10.5281/zenodo.14773161.
|
|
11:15-12:30, Paper MoBT3.18 | Add to My Program |
ViewpointDepth: A New Dataset for Monocular Depth Estimation under Viewpoint Shifts |
|
Pjetri, Aurel | Verizon Connect, University of Florence |
Caprasecca, Stefano | Verizon Connect |
Taccari, Leonardo | Verizon Connect |
Simoncini, Matteo | Verizon Connect |
Piñeiro Monteagudo, Henrique | Verizon Connect; University of Bologna |
Walter, Wallace | Retired |
Coimbra De Andrade, Douglas | Verizon Connect |
Sambo, Francesco | Verizon Connect |
Bagdanov, Andrew David | University of Florence |
Keywords: Automotive Datasets, 3D Scene Reconstruction Methods, Deep Learning Based Approaches
Abstract: Monocular depth estimation is a critical task for autonomous driving and many other computer vision applications. While significant progress has been made in this field, the effects of viewpoint shifts on depth estimation models remain largely underexplored. This paper introduces a novel dataset and evaluation methodology to quantify the impact of different camera positions and orientations on monocular depth estimation performance. We propose a ground truth strategy based on homography estimation and object detection, eliminating the need for expensive LIDAR sensors. We collect a diverse dataset of road scenes from multiple viewpoints and use it to assess the robustness of a modern depth estimation model to geometric shifts. After assessing the validity of our strategy on a public dataset, we provide valuable insights into the limitations of current models and highlight the importance of considering viewpoint variations in real-world applications.
|
|
11:15-12:30, Paper MoBT3.19 | Add to My Program |
Leveraging Bounding Box Annotations and Boolean Map Saliency for Traffic Light Detection in Foggy Night |
|
Tabassam, Nadra | University of Oldenburg |
Moulaeifard, Mohammad | University of Oldenburg |
Franzle, Martin | University of Oldenburg |
Fleck, Sven | Obsurver UG |
Keywords: Automotive Datasets, Deep Learning Based Approaches, Data Annotation and Labeling Techniques
Abstract: Object detection is a fundamental task in computer vision, relying heavily on bounding box (BB) annotations with ground truth labels to train deep learning models. This approach produces impactful results when the boundaries of objects are identifiable, but struggles in adverse weather conditions such as rain, fog, and snow, where object outlines are fuzzy. One such case is the detection of traffic lights (TLs) in fog at night, as these conditions cause the light to scatter in different directions, creating a halo effect. Creating BBs manually for TL annotation is therefore inaccurate. Dense fog makes it difficult for annotators to determine the state (such as red, green, or yellow) and exact location of TLs. Annotation tools are also not designed for blurred images and require annotators to manually adjust parameters like brightness and zoom. To address the challenges of manual annotation, a Boolean Map Saliency (BMS) method is employed to automatically generate annotations that highlight TLs, thereby improving object detection in scenarios where manual BBs are insufficient. Results based on automated BBs and manually annotated bounding boxes are compared using Faster-RCNN; the SHIFT dataset is used for both. Our proposed approach generates superior-quality BBs compared to previous approaches, along with an improved TL detection algorithm, especially on foggy nights when it is difficult for the sensors to detect the TLs.
|
|
11:15-12:30, Paper MoBT3.20 | Add to My Program |
Generating Synthetic Deviation Maps for Prior-Enhanced Vectorized HD Map Construction |
|
Xu, Haoming | University of Chinese Academy of Sciences |
Xiao, Yiyang | Institute of Computing Technology, Chinese Academy |
Li, Wei | Institute of Computing Technology, Chinese Academy of Sciences |
Hu, Yu | Institute of Computing Technology, Chinese Academy of Sciences |
Keywords: Synthetic Data Generation for Training
Abstract: High-definition (HD) maps are essential for autonomous driving, providing detailed and accurate environmental information. Recent advancements in online vectorized HD map construction have shown great promise, particularly methods that leverage existing maps as prior knowledge to improve performance. However, the robustness of these prior-enhanced methods under varying deviations between the priors and the real world remains a critical concern. This paper introduces a novel framework for generating synthetic maps, which allows controllable magnitudes of diverse deviations, including geometric distortions, topological errors, and semantic inconsistencies, simulating real-world scenarios where prior maps may be outdated or inaccurate. Furthermore, lane group constraints are designed to avoid positional conflicts when map elements are modified. The synthesis method can overcome the time-consuming challenge of collecting real road changes. We demonstrate the utility of the synthetic deviation maps by incorporating them into state-of-the-art prior-enhanced construction methods. The results reveal how different types and degrees of deviations affect the prediction accuracy, providing worthwhile insights into their robustness. Overall, this work contributes a data augmentation method and provides a valuable tool for developing more robust and reliable autonomous driving systems. The code is open-source and available at https://github.com/healenrens/Syn-D-maps.
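The controllable geometric deviations described above can be illustrated with a small sketch. The function name and the Gaussian-offset model are illustrative assumptions, not the paper's actual deviation generators:

```python
import numpy as np

def perturb_polyline(poly, magnitude, rng):
    """Apply a geometric deviation of controllable magnitude to a vectorized
    map element (illustrative stand-in for the paper's deviation generators;
    topological and semantic deviations would be handled separately)."""
    offset = rng.normal(scale=magnitude, size=poly.shape)  # per-vertex noise
    return poly + offset

# Hypothetical lane centerline as a 2-D polyline.
rng = np.random.default_rng(0)
lane = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
deviated = perturb_polyline(lane, magnitude=0.2, rng=rng)
```

Larger `magnitude` values emulate more outdated priors, which is the knob the robustness study turns.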
|
|
MoBT4 Poster Session, Bernini Room |
Add to My Program |
Poster 1.4 >> Localisation and Mapping |
|
|
Chair: Bonnifait, Philippe | University of Technology of Compiegne |
Co-Chair: Giosan, Ion | Technical University of Cluj-Napoca |
|
11:15-12:30, Paper MoBT4.1 | Add to My Program |
PL-RAS: A Robust Localization System with Real Time Protection Level Calculation and Adaptive Kernel for Enhanced Integrity (I) |
|
Maharmeh, Elias | Valeo |
Nashashibi, Fawzi | INRIA |
Alsayed, Zayed | Valeo - VMTC |
Keywords: Sensor Fusion for Accurate Localization, Fault Detection and Isolation (FDI) and Protection Level Determination
Abstract: Uncertainty in perception tasks, such as localization, is critical for autonomous systems. Many localization systems fail to ensure that their reported uncertainties encompass the true pose. This paper addresses this issue using the integrity framework. We focus on two main aspects: first, fault-tolerant localization through qualitative evaluation; second, quantitative estimation of error bounds using (horizontal) protection levels. We introduce PL-RAS (Protection Level-based Robust and Adaptive Solver). This solver improves robustness in non-linear least squares optimization, including factor graph-based localization systems. PL-RAS improves uncertainty awareness and enhances system integrity, strengthening both its qualitative and quantitative aspects. We test the approach on urban road data collected using an acquisition vehicle at Valeo’s Creteil VMTC site. The results confirm PL-RAS’s effectiveness. In one dataset, the integrity risks are 4.0×10⁻⁴ (lateral) and 34.0×10⁻³ (longitudinal). In a more challenging dataset, the lateral risk becomes 3.0×10⁻⁴, while the longitudinal risk increases to 92.3×10⁻³. These findings demonstrate PL-RAS’s robustness in fault tolerance and protection level estimation.
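Integrity risk figures of this kind are conventionally the empirical rate at which the position error exceeds the reported protection level. A minimal sketch of that bookkeeping (the data and function name are hypothetical, not the paper's implementation):

```python
def integrity_risk(errors, protection_levels):
    """Empirical integrity risk: fraction of epochs where the true position
    error exceeds the reported protection level (misleading information)."""
    assert len(errors) == len(protection_levels)
    violations = sum(1 for e, pl in zip(errors, protection_levels) if abs(e) > pl)
    return violations / len(errors)

# Hypothetical lateral errors (m) and protection levels (m) over ten epochs.
lat_err = [0.1, 0.2, 0.15, 0.9, 0.1, 0.05, 0.2, 0.3, 0.1, 0.12]
lat_pl  = [0.5] * 10
risk = integrity_risk(lat_err, lat_pl)  # one violation in ten epochs -> 0.1
```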
|
|
11:15-12:30, Paper MoBT4.2 | Add to My Program |
LoopNet: A Multitasking Few-Shot Learning Approach for Loop Closure in Large Scale SLAM |
|
Nakshbandi, Mohammad Maher | Transilvania University of Brasov |
Sharawy, Ziad | Transilvania University of Brasov |
Grigorescu, Sorin Mihai | Transilvania University of Brasov |
Keywords: Real-Time SLAM Algorithms for Dynamic Environments
Abstract: One of the main challenges in the Simultaneous Localization and Mapping (SLAM) loop closure problem is the recognition of previously visited places. In this work, we tackle the two main problems of real-time SLAM systems: 1) loop closure detection accuracy and 2) real-time computation constraints on the embedded hardware. Our LoopNet method is based on a multitasking variant of the classical ResNet architecture, adapted for online retraining on a dynamic visual dataset and optimized for embedded devices. The online retraining is designed using a few-shot learning approach. The architecture provides both an index into the queried visual dataset and a measurement of the prediction quality. Moreover, by leveraging DISK (DIStinctive Keypoints) descriptors, LoopNet surpasses the limitations of handcrafted features and traditional deep learning methods, offering better performance under varying conditions. Code is available at https://github.com/RovisLab/LoopNet. Additionally, we introduce a new loop closure benchmarking dataset, coined LoopDB, which is available at https://github.com/RovisLab/LoopDB.
|
|
11:15-12:30, Paper MoBT4.3 | Add to My Program |
Semantic SLAM with Rolling-Shutter Cameras and Low-Precision INS in Outdoor Environments |
|
Zhang, Yuchen | Beijing NavInfo Technology Co., Ltd |
Fan, Miao | NavInfo Co., Ltd |
Jiao, Yi | NavInfo Co. Ltd |
Xu, Shengtong | Autohome Inc |
Liu, Xiangzeng | Xidian University |
Xiong, Haoyi | Baidu Inc |
Keywords: Real-Time SLAM Algorithms for Dynamic Environments
Abstract: Accurate localization and mapping in outdoor environments remains challenging when using consumer-grade hardware, particularly with rolling-shutter cameras and low-precision inertial navigation systems (INS). We present a novel semantic SLAM approach that leverages road elements such as lane boundaries, traffic signs, and road markings to enhance localization accuracy. Our system integrates real-time semantic feature detection with a graph optimization framework, effectively handling both rolling-shutter effects and INS drift. Using a practical hardware setup which consists of a rolling-shutter camera (3840×2160@30fps), IMU (100Hz), and wheel encoder (50Hz), we demonstrate significant improvements over existing methods. Compared to state-of-the-art approaches, our method achieves higher recall (up to 5.35%) and precision (up to 2.79%) in semantic element detection, while maintaining mean relative error (MRE) within 10cm and mean absolute error (MAE) around 1m. Extensive experiments in diverse urban environments demonstrate the robust performance of our system under varying lighting conditions and complex traffic scenarios, making it particularly suitable for autonomous driving applications. The proposed approach provides a practical solution for high-precision localization using affordable hardware, bridging the gap between consumer-grade sensors and production-level performance requirements.
|
|
11:15-12:30, Paper MoBT4.4 | Add to My Program |
BALO: A Novel Point to Plane BAlanced Lidar Odometry |
|
Azzini, Matteo | INRIA |
Malis, Ezio | INRIA |
Martinet, Philippe | INRIA |
|
|
11:15-12:30, Paper MoBT4.5 | Add to My Program |
SD++: Enhancing Standard Definition Maps by Incorporating Road Knowledge Using LLMs |
|
Diwanji, Hitvarth | University of California, San Diego |
Liao, Jing-Yan | University of California San Diego |
Tumu, Akshar | University of California, San Diego |
Christensen, Henrik | UC San Diego |
Vazquez-Chanlatte, Marcell | University of California, Berkeley |
Tsuchiya, Chikao | Nissan North America |
Keywords: Crowdsourced Localization and Mapping, Geometric vs. Semantic Mapping
Abstract: High-definition maps (HD maps) are detailed and informative maps capturing lane centerlines and road elements. Although very useful for autonomous driving, HD maps are costly to build and maintain. Furthermore, access to these high-quality maps is usually limited to the firms that build them. On the other hand, standard definition (SD) maps provide road centerlines with an accuracy of a few meters. In this paper, we explore the possibility of enhancing SD maps by incorporating information from road manuals using LLMs. We develop SD++, an end-to-end pipeline to enhance SD maps with location-dependent road information obtained from a road manual. We suggest and compare several ways of using LLMs for such a task. Furthermore, we show the generalization ability of SD++ by showing results from both California and Japan.
|
|
11:15-12:30, Paper MoBT4.6 | Add to My Program |
Metro-Rail-SLAM for Automated Inspection Vehicles with Path-Aided Constraints |
|
Feng, Feng | Wuhan University of Technology |
Meng, Jie | Wuhan University of Technology |
Zhang, Jianan | Wuhan University of Technology |
Xiao, Hanbiao | Wuhan University of Technology |
Hu, Zhaozheng | Wuhan University of Technology |
Keywords: Continuous Localization Solutions, Sensor Fusion for Accurate Localization, 3D Scene Reconstruction Methods
Abstract: Automated vehicles mounted on rails are widely applied for metro infrastructure inspection and maintenance. However, the long-distance enclosed tunnel environment and severe feature degradation pose significant challenges to accurate vehicle localization. In this paper, we propose Metro-Rail-SLAM for automated inspection vehicles. The designed path constraints are modeled as a path-aided likelihood model (PA-LM) by applying Kernel Density Estimation to the designed path from the construction drawings. In addition, we detect emergency bays from LiDAR as landmarks. The PA-LM and landmarks, together with the odometry, are integrated into a particle filter to develop a fast, accurate, and robust SLAM. The proposed models and method were validated with a prototyped inspection robot in an actual metro construction scenario located in Chengdu. Experimental results demonstrate that the proposed Metro-Rail-SLAM has good localization and mapping capability.
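The path-aided likelihood idea can be sketched as a kernel density over designed-path waypoints: particle positions near the drawn rail path score high. The isotropic Gaussian kernel, bandwidth, and all names below are illustrative assumptions rather than the paper's PA-LM:

```python
import numpy as np

def path_likelihood(query_xy, path_xy, bandwidth=0.5):
    """Kernel density over designed-path waypoints, evaluated at a query
    position. A toy stand-in for a path-aided likelihood model."""
    diffs = path_xy - query_xy                      # (N, 2) offsets to waypoints
    sq = np.sum(diffs**2, axis=1)                   # squared distances
    kern = np.exp(-sq / (2.0 * bandwidth**2))       # Gaussian kernel per waypoint
    return kern.mean() / (2.0 * np.pi * bandwidth**2)

# Designed path: a straight track segment sampled every 0.1 m (hypothetical).
path = np.column_stack([np.arange(0, 10, 0.1), np.zeros(100)])
on_path  = path_likelihood(np.array([5.0, 0.0]), path)
off_path = path_likelihood(np.array([5.0, 3.0]), path)  # 3 m off the rails
```

In a particle filter, this likelihood would down-weight particles that drift away from the designed path.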
|
|
11:15-12:30, Paper MoBT4.7 | Add to My Program |
Advancements in Enhancing GNSS RTK Positioning Accuracy and Integrity for Automated Driving |
|
Schön, Steffen | Leibniz University Hannover |
Baasch, Kai-Niklas | Leibniz University Hannover |
Karimidoona, Ali | Leibniz University Hannover |
Kulemann, Dennis | Institut für Erdmessung, Leibniz Universität Hannover |
Ruwisch, Fabian | Leibniz Universität Hannover |
Schaper, Anat | Leibniz Universität Hannover |
Su, Jingyao | Leibniz University Hannover |
Keywords: Fault Detection and Isolation (FDI) and Protection Level Determination, Cooperative Perception and Localization Techniques, Sensor Fusion for Accurate Localization
Abstract: For safety-critical applications like autonomous driving, high trust in the navigation solution is essential, primarily measured by integrity. Multipath and other propagation-specific errors in GNSS observations present significant challenges, as they can only be partially corrected. To ensure high integrity in urban navigation, it is crucial to understand the signal propagation mechanisms and potential error sources in these complex environments. Our group has made recent progress in this area, conducting various experiments in urban areas to analyze GNSS positioning performance. Using ray tracing, GNSS channel models, and 3D city models, the signal propagation conditions can be classified and errors quantified. We create GNSS Feature Maps to analyse the spatio-temporal similarity of the geometry-related error features and develop a Feature Map aided robust GNSS RTK algorithm, yielding improved accuracy and fulfilling our newly defined alert limits for German roads. We show how collaborative positioning can further improve this situation.
|
|
11:15-12:30, Paper MoBT4.8 | Add to My Program |
A Concise Survey on Lane Topology Reasoning for HD Mapping |
|
Yao, Yi | NavInfo Co., Ltd |
Fan, Miao | NavInfo Co., Ltd |
Xu, Shengtong | Autohome Inc |
Xiong, Haoyi | Baidu Inc |
Liu, Xiangzeng | Xidian University |
Hu, Wenbo | Hefei University of Technology |
Huang, Wenbing | Renmin University of China |
Keywords: Crowdsourced Localization and Mapping, Geometric vs. Semantic Mapping
Abstract: Lane topology reasoning techniques play a crucial role in high-definition (HD) mapping and autonomous driving applications. While recent years have witnessed significant advances in this field, there has been limited effort to consolidate these works into a comprehensive overview. This survey systematically reviews the evolution and current state of lane topology reasoning methods, categorizing them into three major paradigms: procedural modeling-based methods, aerial imagery-based methods, and onboard sensors-based methods. We analyze the progression from early rule-based approaches to modern learning-based solutions utilizing transformers, graph neural networks (GNNs), and other deep learning architectures. The paper examines standardized evaluation metrics, including road-level measures (APLS and TLTS score), and lane-level metrics (DET and TOP score), along with performance comparisons on benchmark datasets such as OpenLane-V2. We identify key technical challenges, including dataset availability and model efficiency, and outline promising directions for future research. This comprehensive review provides researchers and practitioners with insights into the theoretical frameworks, practical implementations, and emerging trends in lane topology reasoning for HD mapping applications.
|
|
11:15-12:30, Paper MoBT4.9 | Add to My Program |
Range and Bird's Eye View Fused Cross-Modal Visual Place Recognition |
|
Peng, Jianyi | Tongji University |
Lu, Fan | Tongji University |
Li, Bin | Tongji University |
Huang, Yuan | Beijing Institute of Control Engineering |
Qu, Sanqing | Tongji University |
Chen, Guang | Tongji University |
Keywords: Global vs. Local Localization Techniques, Map-Matching Techniques, Lidar-Based Environment Mapping
Abstract: Image-to-point cloud cross-modal Visual Place Recognition (VPR) is a challenging task where the query is an RGB image, and the database samples are LiDAR point clouds. Compared to single-modal VPR, this approach benefits from the widespread availability of RGB cameras and the robustness of point clouds in providing accurate spatial geometry and distance information. However, current methods rely on intermediate modalities that capture either the vertical or horizontal field of view, limiting their ability to fully exploit the complementary information from both sensors. In this work, we propose an innovative initial retrieval + re-rank method that effectively combines information from range (or RGB) images and Bird's Eye View (BEV) images. Our approach relies solely on a computationally efficient global descriptor similarity search process to achieve re-ranking. Additionally, we introduce a novel similarity label supervision technique to maximize the utility of limited training data. Specifically, we employ the average distance between points to approximate appearance similarity and incorporate an adaptive margin, based on similarity differences, into the vanilla triplet loss. Experimental results on the KITTI dataset demonstrate that our method significantly outperforms state-of-the-art approaches.
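The adaptive-margin triplet loss described above can be sketched as follows; the margin scaling, similarity inputs, and all names are illustrative, not the authors' exact formulation:

```python
import numpy as np

def adaptive_margin_triplet(anchor, pos, neg, sim_pos, sim_neg,
                            base_margin=0.3, scale=1.0):
    """Triplet loss whose margin grows with the similarity gap between the
    positive and negative samples (sim_* in [0, 1], e.g. derived from
    average point distances). Toy version of an adaptive-margin loss."""
    margin = base_margin + scale * (sim_pos - sim_neg)  # adaptive margin
    d_pos = np.linalg.norm(anchor - pos)                # anchor-positive distance
    d_neg = np.linalg.norm(anchor - neg)                # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings: close positive, far negative.
a = np.array([0.0, 0.0]); p = np.array([0.1, 0.0]); n = np.array([1.0, 0.0])
# A very similar positive vs a dissimilar negative demands a larger margin.
loss = adaptive_margin_triplet(a, p, n, sim_pos=0.9, sim_neg=0.2)
```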
|
|
11:15-12:30, Paper MoBT4.10 | Add to My Program |
Infrastructure-Based Smart Positioning System for Automated Shuttles Using 3D Object Detection |
|
Araluce, Javier | TECNALIA Research & Innovation |
Justo, Alberto | TECNALIA Research & Innovation, Basque Research and Technology Alliance |
Rodriguez-Arozamena, Mario | TECNALIA Research & Innovation, Basque Research and Technology Alliance |
Sarabia, Joseba | University of the Basque Country; Tecnalia, Basque Research and Technology Alliance |
Matute, Jose | Virginia Tech |
Diaz Briceño, Sergio Enrique | Tecnalia, Basque Research and Technology Alliance |
Keywords: Static and Dynamic Object Detection Algorithms, Vehicle-to-Infrastructure (V2I) Communication, Cooperative Perception and Localization Techniques
Abstract: Automated vehicles need high positioning accuracy to execute driving maneuvers effectively. This accuracy is crucial for the viability of dependent systems such as planning, decision-making, and perception. However, achieving precise localization typically necessitates expensive onboard sensors that increase vehicle costs, complicate maintenance, and pose significant scalability challenges for large fleets of trucks or buses. To address these issues without compromising vehicle interoperability, this work proposes an infrastructure-based positioning system for critical areas. The system utilizes off-board sensors to collect data from a shuttle moving on a test track. The data collection is automated through a custom-designed labeling tool, eliminating the need for manual tagging. A deep learning model based on 3D object detection has been trained to localize the vehicle accurately during normal operation. Rigorous assessments have been conducted to evaluate localization performance, achieving an Average Trajectory Error of 0.17 m for position and 9.4 deg for rotation. To demonstrate real-world applicability, a complete architecture based on ROS2 was developed and tested with actual data, confirming its functionality in practical scenarios.
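A straightforward reading of the position part of an Average Trajectory Error is the mean Euclidean error over matched poses; a minimal sketch (the paper's exact metric definition may differ):

```python
import math

def avg_trajectory_error(est, gt):
    """Mean Euclidean position error between estimated and ground-truth
    2-D positions, one simple interpretation of 'Average Trajectory Error'."""
    assert len(est) == len(gt)
    return sum(math.dist(e, g) for e, g in zip(est, gt)) / len(est)

# Hypothetical estimated vs ground-truth positions (metres).
est = [(0.0, 0.0), (1.1, 0.0), (2.0, 0.2)]
gt  = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ate = avg_trajectory_error(est, gt)   # (0 + 0.1 + 0.2) / 3 = 0.1 m
```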
|
|
11:15-12:30, Paper MoBT4.11 | Add to My Program |
Offline Map Updating and Validation for Autonomous Driving Using Crowdsourced Data |
|
Moawad, Mark | Hamburg University of Technology |
Stührenberg, Jan | Hamburg University of Technology |
Tandon, Aditya | Hamburg University of Technology |
Abdulaaty, Omar AbdelAziz | IAV GmbH |
Mendoza, Ricardo Carillo | IAV GmbH |
Hussein, Ahmed | IAV GmbH |
Smarsly, Kay | Hamburg University of Technology |
Keywords: Crowdsourced Localization and Mapping, Real-Time SLAM Algorithms for Dynamic Environments, Advanced Multisensory Data Fusion Algorithms
Abstract: Autonomous driving promises safer and more comfortable transportation with less traffic congestion than human driving. Autonomous driving can be achieved using landmark-based maps, which allow for precise localization and collision-free path planning. Therefore, it is essential to keep the maps updated and validated. Traditional approaches towards map updating and validation often fail to robustly keep pace with environmental changes, causing localization errors. Current research addresses the map updating and validation problem using either graph-based methods or feature-based methods online, i.e. running while the vehicles are traversing the environment, which is computationally demanding and unscalable. In this paper, an offline map updating and validation framework is presented using crowdsourced data, which is abundantly available and ubiquitous. To integrate multiple observations and improve map accuracy and reliability, the framework couples data fusion techniques, including the density-based spatial clustering of applications with noise (DBSCAN) algorithm, the K-D tree data structure, and Dempster-Shafer theory. The framework is validated through multiple test scenarios, including adding new landmarks and removing deleted ones. As a result, the map updating and validation framework effectively integrates crowdsourced data, enhancing the accuracy and reliability of map updating and validation. The findings highlight the potential of crowdsourced data to improve map validation processes in autonomous driving.
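The Dempster-Shafer fusion step can be illustrated on a single landmark over the frame {exists, removed}, with ignorance mass assigned to the whole frame; the mass values and names are hypothetical, not the paper's implementation:

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination on the frame {E, R}, with 'ER'
    denoting mass on the whole frame (ignorance). Toy illustration of
    the evidence-fusion step for a single landmark."""
    # Conflict: one source says the landmark exists, the other that it was removed.
    k = m1["E"] * m2["R"] + m1["R"] * m2["E"]
    norm = 1.0 - k
    return {
        "E":  (m1["E"]*m2["E"] + m1["E"]*m2["ER"] + m1["ER"]*m2["E"]) / norm,
        "R":  (m1["R"]*m2["R"] + m1["R"]*m2["ER"] + m1["ER"]*m2["R"]) / norm,
        "ER": (m1["ER"]*m2["ER"]) / norm,
    }

# Two crowdsourced observations of the same landmark (hypothetical masses).
obs1 = {"E": 0.7, "R": 0.1, "ER": 0.2}
obs2 = {"E": 0.6, "R": 0.2, "ER": 0.2}
fused = dempster_combine(obs1, obs2)   # belief in 'exists' strengthens
```

In the framework, clustering (e.g. DBSCAN) would first group crowdsourced detections into landmark candidates before their evidence is combined.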
|
|
11:15-12:30, Paper MoBT4.12 | Add to My Program |
A Chef's KISS - Utilizing Semantic Information in Both ICP and SLAM Framework |
|
Ochs, Sven | FZI Research Center for Information Technology |
Heinrich, Marc | FZI Research Center for Information Technology |
Schörner, Philip | FZI Research Center for Information Technology |
Zofka, Marc René | FZI Research Center for Information Technology |
Zöllner, J. Marius | FZI Research Center for Information Technology; Karlsruhe Institute of Technology (KIT) |
Keywords: Real-Time SLAM Algorithms for Dynamic Environments, Geometric vs. Semantic Mapping, Global vs. Local Localization Techniques
Abstract: For the use of autonomous vehicles in urban areas, reliable localization is needed. Especially when HD maps are used, a precise and repeatable method has to be chosen. Therefore, accurate map generation as well as re-localization against these maps is necessary. Owing to its accurate 3D reconstruction of the surroundings, LiDAR has become a reliable modality for localization. The latest LiDAR odometry estimators are based on iterative closest point (ICP) approaches, namely KISS-ICP and SAGE-ICP. We extend the capabilities of KISS-ICP by incorporating semantic information into the point alignment process using a generalizable approach with minimal parameter tuning. This enhancement allows us to surpass KISS-ICP in terms of absolute trajectory error (ATE), the primary metric for map accuracy. Additionally, we improve the Cartographer mapping framework to handle semantic information. Cartographer facilitates loop closure detection over larger areas, mitigating odometry drift and further enhancing ATE accuracy. By integrating semantic information into the mapping process, we enable the filtering of specific classes, such as parked vehicles, from the resulting map. This filtering improves relocalization quality by addressing temporal changes, such as vehicles being moved.
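One simple way to fold semantic information into ICP point alignment is to gate nearest-neighbour correspondences on label agreement. The sketch below is a toy illustration of that idea, not KISS-ICP's actual weighting scheme:

```python
import numpy as np

def semantic_gate(src_pts, src_labels, tgt_pts, tgt_labels, max_dist=1.0):
    """Keep nearest-neighbour correspondences only when the semantic labels
    of the matched points agree (label-gated ICP association, toy version)."""
    pairs = []
    for i, p in enumerate(src_pts):
        d = np.linalg.norm(tgt_pts - p, axis=1)   # distances to all target points
        j = int(np.argmin(d))                     # nearest neighbour
        if d[j] <= max_dist and src_labels[i] == tgt_labels[j]:
            pairs.append((i, j))
    return pairs

# Hypothetical labelled scans: the 'car'/'building' mismatch is rejected.
src = np.array([[0.0, 0.0], [5.0, 5.0]])
tgt = np.array([[0.1, 0.0], [5.1, 5.0]])
pairs = semantic_gate(src, ["road", "car"], tgt, ["road", "building"])
```

Rejecting label-inconsistent matches is also what makes it possible to drop dynamic classes, such as parked vehicles, from the alignment.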
|
|
11:15-12:30, Paper MoBT4.13 | Add to My Program |
Onboard Train Localization Assisted by Surrounding Structure Identification Using One-Dimensional LiDAR Sensor |
|
Nagai, Kensuke | The University of Tokyo |
Chang, Haw-Shyang | The University of Tokyo |
Ohnishi, Wataru | The University of Tokyo |
Koseki, Takafumi | The University of Tokyo |
Setoguchi, Yusuke | Nippon Signal Co., Ltd |
Kiyosawa, Daichi | Nippon Signal Co., Ltd |
Morita, Shunji | Nippon Signal Co., Ltd |
Tanaka, Kazuhiro | Nippon Signal Co., Ltd |
Keywords: Continuous Localization Solutions, Sensor Fusion for Accurate Localization, Lidar-Based Environment Mapping
Abstract: Train localization is a crucial technology in the railway industry, with increasing demand for cost-effective methods that eliminate reliance on ground-based equipment to reduce both costs and maintenance requirements. In this study, we propose a versatile train localization method applicable across diverse environments, including high-speed railways, conventional lines, urban settings, and rural areas. By integrating a high-speed, cost-effective one-dimensional LiDAR sensor with GNSS, MEMS IMU, and a Tachometer generator, the system can effectively recognize the surrounding environment and accurately determine the train's position. The proposed method ensures low-cost and high-accuracy train localization even in areas with dense surrounding structures or open-sky environments. Experimental results conducted on operational railway lines demonstrated a high success rate of 97 % in recognizing the surrounding environment and detecting train location using this approach. Moreover, the accuracy of train localization achieved through this method was found to be comparable to that of a loosely-coupled GNSS approach.
|
|
11:15-12:30, Paper MoBT4.14 | Add to My Program |
Learning Explicit Uncertainty Estimation in Cross Modality Localization |
|
Schütte, Stefan | TU Dortmund University |
Bertram, Torsten | Technische Universität Dortmund |
Keywords: Map-Matching Techniques, Deep Learning Based Approaches
Abstract: Metric localization of automated vehicles using exteroceptive sensors involves finding reliable spatial features in both the sensor data and the map. In real-world scenarios, methods have to deal with unknown and changing environments and noisy sensor measurements, making feature selection considerably harder. If the map is created using a different sensor modality, localization methods also have to deal with the characteristics of the available sensor. Machine learning methods promise a solution to these problems by extracting the same features from maps created from different sensors. In this work, we compare an approach for learning-based cross-modality localization with a classical method on different types of maps. Furthermore, we enhance the learned model by estimating its uncertainty directly from the measurement data.
|
|
11:15-12:30, Paper MoBT4.15 | Add to My Program |
Deep Deterministic Policy Gradient Method for Autonomous Vehicle Maneuvering through Multimodal LiDAR and RADAR Sensor Fusion |
|
Lodhi, Shikhar Singh | Indian Institute of Technology, Roorkee |
Kumar, Neetesh | Indian Institute of Technology-Roorkee |
Sharma, Teena | University of Quebec at Chicoutimi, Saguenay, QC |
Keywords: Sensor Fusion for Accurate Localization, Continuous Localization Solutions, Real-Time SLAM Algorithms for Dynamic Environments
Abstract: Autonomous Vehicle (AV) driving involves complex maneuvers, often constrained by poor environmental perception. While Deep Reinforcement Learning (DRL) and advanced sensor technologies like LiDAR and RADAR have improved AV performance, high-dimensional sensor data poses challenges in critical tasks like lane changes and turns. To overcome these challenges, we propose a multimodal fusion of LiDAR and RADAR sensors with the Deep Deterministic Policy Gradient (DDPG) algorithm. Our approach preprocesses sensor data into low-dimensional representations, enhancing the RL agent's environmental perception and decision making. This fusion, combined with Temporal Difference (TD) updates in the actor-critic network, improves maneuvering efficiency in the CARLA simulator. Results show a 100% task completion rate with adequate speed and time, achieving a 25% higher peak reward compared to state-of-the-art methods. The simulation videos are available at https://www.youtube.com/playlist?list=PLnWGKVuAZgq1rdm21CEKW-S-Nr78qmkSX.
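At DDPG's core, the TD update computes the critic target from the target actor and target critic, while Polyak soft updates keep those targets slowly tracking the learned networks. A linear-approximator sketch of that machinery (all weights, shapes, and values are illustrative, standing in for the paper's deep networks):

```python
import numpy as np

def td_target(reward, next_state, w_q_target, w_mu_target, gamma=0.99):
    """TD target for the DDPG critic: y = r + gamma * Q'(s', mu'(s'))."""
    next_action = w_mu_target @ next_state              # deterministic target policy
    q_next = w_q_target @ np.concatenate([next_state, next_action])
    return reward + gamma * q_next

def soft_update(target, source, tau=0.005):
    """Polyak averaging of target-network parameters toward the learned ones."""
    return (1.0 - tau) * target + tau * source

s_next = np.array([0.5, -0.2])          # low-dimensional fused LiDAR/RADAR state
w_mu_t = np.array([[1.0, 0.0]])         # target actor weights (1-D action)
w_q_t  = np.array([0.2, 0.1, 0.3])      # target critic weights over (state, action)
y = td_target(reward=1.0, next_state=s_next, w_q_target=w_q_t, w_mu_target=w_mu_t)
```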
|
|
11:15-12:30, Paper MoBT4.16 | Add to My Program |
Enhanced LIO-Based Localization System with Online Map Update for Robust Mining Tunnel Operations |
|
Zhang, Zufeng | Department of Automation, Tsinghua University, Beijing |
Yin, Jialun | Suzhou Automobile Research Institute, Tsinghua University |
Tao, Qianwen | Wuhan University of Technology |
Chen, Feng | Tsinghua University |
Zhang, Xuefeng | Institute for Artificial Intelligence, Peking University, Beijing |
Keywords: Sensor Fusion for Accurate Localization, Continuous Localization Solutions
Abstract: Coal mines, crucial for energy in the face of rapid economic growth, grapple with challenges like unsafe manual mining in narrow tunnels, which significantly impact efficiency and safety. In response, recent research has intensified the development of unmanned mining technologies, with robust localization emerging as a foundational requirement for autonomous navigation. However, localization in underground mines remains difficult due to sparse and degraded geometric features, compounded by dynamic environmental changes such as route adjustments and structural modifications during mining operations. To address these challenges, we propose an integrated localization system that emphasizes robust state estimation, environment reconstruction, and long-term adaptability. The system combines multi-LiDAR sensing, with complementary field-of-view and resolution characteristics, and incorporates contextual infrastructure features commonly encountered in underground environments to enhance localization stability. Furthermore, we introduce a map inconsistency detection and correction module, which enables the system to adapt to long-term environmental changes and maintain map relevance over time. The efficacy of our proposed system is rigorously evaluated across various mine tunnel environments over an extended duration, affirming its reliability and performance.
|
|
11:15-12:30, Paper MoBT4.17 | Add to My Program |
Lidar Pole Detection Training Using Vector Maps for Localization (I) |
|
Noizet, Maxime | Université De Technologie De Compiègne |
Xu, Philippe | ENSTA, Institut Polytechnique De Paris |
Bonnifait, Philippe | University of Technology of Compiegne |
Keywords: Integration Methods for HD Maps and Onboard Sensors, Sensor Fusion for Accurate Localization, Map-Matching Techniques
Abstract: Autonomous navigation requires accurate and reliable localization. In urban environments, infrastructure such as buildings and bridges disrupts Global Navigation Satellite Systems (GNSS), which requires the implementation of robust perception systems combined with inertial navigation. Roadside poles like traffic signs or light poles can serve as stable landmarks for map-based localization. When georeferenced in high-definition vector maps, these features enable reliable localization through detection pipelines and data association methods. While lidar captures their 3D geometry, distinguishing mapped poles in raw point clouds remains challenging. To train pole detectors tailored to the specific map used, we propose an automatic annotation framework that integrates lidar data, a vector map, and offline semantic segmentation to generate precise labeled data. By combining annotated pole clusters from the map with semantic segmentation, annotation errors can be minimized. This enables the training of a map-specific classifier optimized to detect mapped poles while filtering out irrelevant structures. It eliminates the need for manual labeling and ensures adaptability to the map used for online localization. Using data acquired in real-world urban scenarios, we show that this approach significantly enhances localization accuracy.
|
|
11:15-12:30, Paper MoBT4.18 | Add to My Program |
Distance Estimation in Outdoor Driving Environments Using Phase-Only Correlation Method with Event Cameras |
|
Kobayashi, Masataka | Nagoya University |
Shiba, Shintaro | Woven by Toyota |
Kong, Quan | Woven by Toyota, Inc |
Kobori, Norimasa | Woven by Toyota Inc |
Shimizu, Tsukasa | Toyota Motor Corporation |
Lu, Shan | Nagoya University |
Yamazato, Takaya | Nagoya University |
Keywords: Continuous Localization Solutions, V2X Communication Protocols and Standards, Vehicle-to-Infrastructure (V2I) Communication
Abstract: This study focuses on event cameras, exploring their potential among various sensor technologies. Event cameras possess characteristics such as high dynamic range, low latency, and high temporal resolution, and they can also leverage visible light communication. This enables high visibility in low-light and backlit environments, as well as excellent performance in detecting pedestrian movements and acquiring traffic information between traffic lights and vehicles. These characteristics are particularly beneficial for autonomous driving systems. Furthermore, if distance estimation functionality can be integrated into event cameras, they can serve as a multi-functional sensor for autonomous vehicles, providing significant cost efficiency benefits. In this study, we achieved distance estimation based on triangulation using an event camera and two points on an LED bar installed along a road. Furthermore, by employing the phase-only correlation method, we achieved sub-pixel precision in estimating the distance between two points on the LED bar, enabling even more accurate distance estimation. This approach performed monocular distance estimation in outdoor driving environments at distances ranging from 20 to 60 meters, achieving a success rate of over 90% with errors of less than 0.5 meters. We aim to implement position estimation using our distance estimation technology. High-precision measurements will determine the vehicle’s position relative to ITS smart poles, enabling real-time localization and optimal route selection. This technology will contribute to smart urban transportation.
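The phase-only correlation (POC) named in this abstract is a standard Fourier-domain technique; the following is a minimal 1-D NumPy sketch of the idea, illustrative only and not the authors' implementation (the function name, signal length, and toy shift are assumptions):

```python
import numpy as np

def phase_only_correlation(f, g):
    """Phase-only correlation of two equal-length 1-D signals.
    The peak of the POC function sits at the translational shift
    between f and g; fitting around that peak is what gives the
    sub-pixel precision the paper exploits."""
    F = np.fft.fft(f)
    G = np.fft.fft(g)
    cross = np.conj(F) * G            # cross-power spectrum
    cross /= np.abs(cross) + 1e-12    # discard magnitude, keep phase only
    return np.real(np.fft.ifft(cross))

# toy example: g is f circularly shifted by 5 samples
rng = np.random.default_rng(0)
f = rng.standard_normal(256)
g = np.roll(f, 5)
poc = phase_only_correlation(f, g)
print(int(np.argmax(poc)))  # -> 5, with poc[5] close to 1.0
```

Because only the phase spectrum is kept, the correlation peak is sharp and largely insensitive to illumination changes, which suits outdoor driving scenes.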
|
|
11:15-12:30, Paper MoBT4.19 | Add to My Program |
Temperature-Dependent Baro-Aided INS for GNSS-Denied Intelligent Vehicle Applications |
|
Silva, Felipe | Federal University of Lavras |
Hernandez Villalobos, Guillermo | Technology Innovation Institute |
Souza Junior, Cristino | Technology Innovation Institute |
Keywords: Sensor Fusion for Accurate Localization, UAV Sensor Integration, Global vs. Local Localization Techniques
Abstract: Sensor fusion is of paramount importance for Intelligent Autonomous Vehicles (IAVs) nowadays. Global Navigation Satellite Systems (GNSSs), in particular, are present in most strategic applications, bounding the drift of Inertial Navigation Systems (INSs). When the former are not available, either due to signal blockage or deliberate jamming/spoofing, barometers are sensors that can maintain INS vertical channel accuracy in the long-term. In most baro-aided INS integrations currently seen in the literature, however, a standard constant-temperature gradient is assumed for the barometric atmospheric pressure model, which might provide less-than-optimal performance in environments subject to extreme temperature variations (such as deserts). As the main contribution of this paper, we propose an improved Tightly-Coupled (TC) Extended Kalman Filter (EKF)-based baro-INS integration that employs actual Outside Air Temperature (OAT) measurements from an external temperature probe. Results from experimental tests conducted in a desert area show that the proposed approach outperforms the traditional ones, particularly when the GNSS is unavailable for long periods of time, and the IAV is subject to large altitude excursions.
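The effect of feeding measured outside air temperature into the barometric model can be illustrated with the hypsometric equation (a generic sketch under an isothermal-layer assumption, not the paper's EKF integration; all numeric values are hypothetical):

```python
import math

R_AIR = 287.05   # specific gas constant of dry air [J/(kg*K)]
G0 = 9.80665     # standard gravity [m/s^2]

def baro_altitude(p, p0, t_kelvin):
    """Height above the reference level from static pressure via
    the hypsometric equation, using a measured outside air
    temperature instead of the ISA standard 15 degrees C."""
    return (R_AIR * t_kelvin / G0) * math.log(p0 / p)

p0, p = 101325.0, 95000.0                # reference / current static pressure [Pa]
h_isa = baro_altitude(p, p0, 288.15)     # standard-atmosphere assumption
h_oat = baro_altitude(p, p0, 318.15)     # 45 degrees C measured by an OAT probe
print(round(h_isa, 1), round(h_oat, 1))
```

For the same pressure ratio, the warm desert air column yields roughly a 10% larger altitude than the ISA assumption, which is the bias the proposed OAT-aided integration removes.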
|
|
MoC1 Regular Session, Plenary Room |
Add to My Program |
Oral 2 |
|
|
Chair: Martinet, Philippe | INRIA |
Co-Chair: Petrovai, Andra | Technical University of Cluj-Napoca |
|
13:30-13:48, Paper MoC1.1 | Add to My Program |
DOC-Depth: A Novel Approach for Dense Depth Ground Truth Generation |
|
de Moreau, Simon | Mines Paris - PSL & Valeo |
Corsia, Mathias | Exwayz |
Bouchiba, Hassan | Exwayz |
Almehio, Yasser | Valeo |
Bursuc, Andrei | Valeo |
El-Idrissi, Hafid | Valeo |
Moutarde, Fabien | MINES Paris - PSL |
Keywords: Data Annotation and Labeling Techniques, Static and Dynamic Object Detection Algorithms, 3D Scene Reconstruction Methods
Abstract: Accurate depth information is essential for many computer vision applications. Yet, no available dataset recording method allows for fully dense, accurate depth estimation in a large-scale dynamic environment. In this paper, we introduce DOC-Depth, a novel, efficient and easy-to-deploy approach for dense depth generation from any LiDAR sensor. After reconstructing a consistent dense 3D environment using LiDAR odometry, we address dynamic object occlusions automatically thanks to DOC, our state-of-the-art dynamic object classification method. Additionally, DOC-Depth is fast and scalable, allowing for the creation of datasets unbounded in size and time. We demonstrate the effectiveness of our approach on the KITTI dataset, improving its density from 16.1% to 71.2%, and release this new fully dense depth annotation to facilitate future research in the domain. We also showcase results using various LiDAR sensors and in multiple environments. All software components are publicly available for the research community at https://simondemoreau.github.io/DOC-Depth/
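Rendering a dense depth image from an aggregated, odometry-aligned point cloud amounts to a pinhole projection with nearest-depth selection per pixel; a simplified sketch (not the authors' code, and omitting the dynamic-object handling that DOC provides; intrinsics and sizes are made up):

```python
import numpy as np

def project_to_depth(points_cam, fx, fy, cx, cy, h, w):
    """Render a depth image from 3-D points given in the camera
    frame (pinhole model, z forward). Keeping the nearest point
    per pixel resolves the overlaps that appear once many
    odometry-aligned scans are aggregated."""
    depth = np.full((h, w), np.inf)
    x, y, z = points_cam.T
    valid = z > 0                                   # points in front of the camera
    u = np.round(fx * x[valid] / z[valid] + cx).astype(int)
    v = np.round(fy * y[valid] / z[valid] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], z[valid][inside]):
        depth[vi, ui] = min(depth[vi, ui], zi)      # simple z-buffering
    return depth

pts = np.array([[0.0, 0.0, 5.0], [0.12, 0.0, 4.0]])  # two toy points
depth = project_to_depth(pts, 100.0, 100.0, 32.0, 24.0, 48, 64)
print(depth[24, 32], depth[24, 35])
```

In practice the density gain reported in the abstract comes from aggregating many scans before this projection step, which is exactly why the occlusion handling for moving objects becomes necessary.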
|
|
13:48-14:06, Paper MoC1.2 | Add to My Program |
LiDPM: Rethinking Point Diffusion for Lidar Scene Completion |
|
Martyniuk, Tetiana | Valeo.ai, Inria |
Puy, Gilles | Valeo.ai |
Boulch, Alexandre | Valeo.ai |
Marlet, Renaud | Valeo |
De Charette, Raoul | INRIA |
Keywords: 3D Scene Reconstruction Methods
Abstract: Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. This contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is https://astra-vision.github.io/lidpm.
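The "well-chosen starting point" can be sketched with the standard DDPM forward-noising formula: instead of sampling pure white noise at step T, reverse diffusion starts at an intermediate step from a noised version of the known scene. This is an illustrative reading of the abstract, not the released code; `noised_start`, the schedule, and the array shapes are assumptions:

```python
import numpy as np

def noised_start(x_init, t_start, betas, seed=0):
    """Sample q(x_t | x_0) of a vanilla DDPM at t = t_start, using
    the (partially known) scene x_init as x_0, so that the reverse
    process can begin mid-way instead of from pure white noise."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t_start]    # cumulative noise level
    eps = np.random.default_rng(seed).standard_normal(x_init.shape)
    return np.sqrt(alpha_bar) * x_init + np.sqrt(1.0 - alpha_bar) * eps

betas = np.linspace(1e-4, 2e-2, 1000)   # a common linear beta schedule
x_init = np.ones(64)                    # stand-in for coarse scene coordinates
x_t = noised_start(x_init, t_start=500, betas=betas)
print(x_t.shape)
```

Starting mid-schedule preserves the global scene layout while leaving enough noise for the model to synthesize fine detail, which is the intuition behind replacing the local reformulation.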
|
|
14:06-14:24, Paper MoC1.3 | Add to My Program |
UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection |
|
Li, Wei | Hunan University |
Tang, Jiaman | Hunan University |
Li, Yang | Hunan University, College of Mechanical and Vehicle Engineering |
Xia, Beihao | Huazhong University of Science and Technology |
Tan, Ligang | Hunan University |
Qin, Hongmao | Hunan University |
Keywords: Remote Sensing Techniques for UAVs
Abstract: Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/Great
|
|
14:24-14:42, Paper MoC1.4 | Add to My Program |
Intersection Safety Modeling Using Semantic Scene Graph and Graph Neural Network |
|
Sarkar, Abhijit | Virginia Tech |
Sonth, Akash | Virginia Tech |
Abbott, Amos | Virginia Tech |
Keywords: Data Augmentation Techniques Using Neural Networks, Decision Making, Vulnerable Road User Protection Strategies
Abstract: Traffic intersections are critical zones where vehicle and pedestrian interactions significantly impact road safety. This study presents a novel graph-based approach to model and analyze intersection traffic dynamics, leveraging Graph Neural Networks (GNNs) for risk assessment. By representing traffic participants and road infrastructure as a structured graph, we capture spatial-temporal relationships that influence crash likelihood. Using real-world intersection video data, we construct semantic scene graphs to encode actor interactions and road topology, enabling a data-driven understanding of risk factors. Two GNN models, TransformerConv and GINEConv, are employed to assess safety risks, where TransformerConv captures dynamic interactions through adaptive attention weighting, and GINEConv models structured dependencies within the intersection network. Our findings demonstrate that this framework can effectively classify high-risk scenarios, assess the threat posed by each actor (node), characterize their interactions (edges), and provide near-real-time safety analysis with 79.8% accuracy. This provides a scalable method for proactive intersection safety monitoring.
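The node/edge reading of the scene graph can be made concrete with one message-passing layer: each actor aggregates features from its neighbours along interaction edges before a shared transformation. This is a deliberately minimal mean-aggregation stand-in, not the TransformerConv/GINEConv layers the paper uses:

```python
import numpy as np

def gnn_layer(node_feats, edges, w_self, w_msg):
    """One message-passing layer over a semantic scene graph:
    each actor (node) mean-aggregates features from neighbours
    whose edges encode interactions, then a shared linear map
    with ReLU is applied."""
    n = node_feats.shape[0]
    agg = np.zeros_like(node_feats)
    deg = np.zeros(n)
    for src, dst in edges:            # accumulate neighbour messages
        agg[dst] += node_feats[src]
        deg[dst] += 1
    agg /= np.maximum(deg, 1.0)[:, None]
    return np.maximum(node_feats @ w_self + agg @ w_msg, 0.0)

# two actors with one directed interaction 0 -> 1, identity weights
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
out = gnn_layer(feats, [(0, 1)], np.eye(2), np.eye(2))
print(out)  # node 1 has absorbed node 0's features
```

TransformerConv replaces the uniform mean with learned attention weights over neighbours, and GINEConv injects edge features into the messages; both refine this same node-aggregation skeleton.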
|
|
14:42-15:00, Paper MoC1.5 | Add to My Program |
SpikingRTNH: Spiking Neural Network for 4D Radar Object Detection |
|
Paek, Dong-Hee | Korea Advanced Institute of Science and Technology |
Kong, Seung-Hyun | Korea Advanced Institute for Science and Technology |
Keywords: Radar Object Detection and Tracking, Static and Dynamic Object Detection Algorithms, Deep Learning Based Approaches
Abstract: Recently, 4D Radar has emerged as a crucial sensor for 3D object detection in autonomous vehicles, offering both stable perception in adverse weather and high-density point clouds for object shape recognition. However, processing such high-density data demands substantial computational resources and energy consumption. We propose SpikingRTNH, the first spiking neural network (SNN) for 3D object detection using 4D Radar data. By replacing conventional ReLU activation functions with leaky integrate-and-fire (LIF) spiking neurons, SpikingRTNH achieves significant energy efficiency gains. Furthermore, inspired by human cognitive processes, we introduce biological top-down inference (BTI), which processes point clouds sequentially from higher to lower densities. This approach effectively utilizes points with lower noise and higher importance for detection. Experiments on the K-Radar dataset demonstrate that SpikingRTNH with BTI significantly reduces energy consumption by 78% while achieving comparable detection performance to its ANN counterpart (51.1% AP 3D, 57.0% AP BEV). These results establish the viability of SNNs for energy-efficient 4D Radar-based object detection in autonomous driving systems. All codes are available at https://github.com/kaist-avelab/k-radar.
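The ReLU-to-LIF substitution at the heart of this paper can be sketched for a single neuron over discrete time steps (a textbook LIF formulation, not the authors' code; the decay factor, threshold, and input currents are illustrative assumptions):

```python
def lif_forward(currents, decay=0.5, v_th=1.0):
    """Discrete leaky integrate-and-fire neuron, the drop-in
    replacement for ReLU described in the abstract: the membrane
    potential leaks and integrates the input, and the neuron emits
    a binary spike (then hard-resets) on crossing the threshold.
    Binary spikes turn dense multiply-accumulates into sparse
    additions, which is where the energy savings come from."""
    v = 0.0
    spikes = []
    for i_t in currents:
        v = decay * v + i_t          # leaky integration of input current
        if v >= v_th:
            spikes.append(1)
            v = 0.0                  # hard reset after spiking
        else:
            spikes.append(0)
    return spikes

print(lif_forward([0.6] * 6))  # -> [0, 0, 1, 0, 0, 1]
```

A constant sub-threshold input thus produces a periodic spike train whose rate encodes the input magnitude, the usual rate-coding view of SNN activations.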
|
|
MoDT1 Poster Session, Caravaggio Room |
Add to My Program |
Poster 2.1 >> Planning, Trajectory Prediction & Motion Forecasting |
|
|
Chair: Stevanovic, Aleksandar | University of Pittsburgh |
Co-Chair: Atoui, Hussam | Valeo |
|
15:00-16:15, Paper MoDT1.1 | Add to My Program |
A Roadmap towards Dynamic Conflict Management for Autonomous Traffic Agents |
|
Schwammberger, Maike | Karlsruhe Institute of Technology |
Keywords: Multi-Agent Coordination Strategies, Decision Making, Trust and Acceptance of Autonomous Technologies
Abstract: Semi-automated vehicles and driver assistance systems promise improvements in traffic safety, the sustainability of transportation systems, and road comfort. For a desirable future with the autonomous traffic agents (ATAs) that steer these automated mobility systems, it is of paramount importance to draw our attention to dynamic consistency management. A dynamic inconsistency is a run-time conflict, where an agent cannot choose an action without violating existing traffic rules or central safety goals. We suggest a step-wise engineering methodology to enable ATAs to cope with such run-time conflicts. For this, known run-time conflicts must first be identified and formalised. We propose to sort similar conflicts into conflict clusters. For each conflict cluster, a conflict resolution strategy must be derived. Finally, we discuss explainability as a means to justify a conflict resolution strategy to involved stakeholders.
|
|
15:00-16:15, Paper MoDT1.2 | Add to My Program |
Boundary-Guided Trajectory Prediction for Road Aware and Physically Feasible Autonomous Driving |
|
Abouelazm, Ahmed | FZI Research Center for Information Technology |
Liu, Mianzhi | Karlsruhe Institute of Technology |
Hubschneider, Christian | FZI Research Center for Information Technology |
Wu, Yin | Karlsruhe Institute of Technology |
Slieter, Daniel | CARIAD SE |
Zöllner, J. Marius | FZI Research Center for Information Technology; KIT Karlsruhe In |
Keywords: Predictive Trajectory Models and Motion Forecasting, Motion Forecasting, Trust and Acceptance of Autonomous Technologies
Abstract: Accurate trajectory prediction is essential for safe and efficient autonomous driving. While deep learning models have improved performance, challenges remain in preventing off-road predictions and ensuring kinematic feasibility. Existing methods incorporate road-awareness modules and enforce kinematic constraints but lack plausibility guarantees and often introduce trade-offs in complexity and flexibility. This paper proposes a novel framework that formulates trajectory prediction as a constrained regression guided by permissible driving directions and their boundaries. Using the agent’s current state and an HD map, our approach defines the valid boundaries and ensures on-road predictions by training the network to learn superimposed paths between left and right boundary polylines. To ensure feasibility, the model predicts acceleration profiles that determine the vehicle’s travel distance along these paths while adhering to kinematic constraints. We evaluate our approach on the Argoverse-2 dataset against the HPTR baseline. Our approach shows a slight decrease in benchmark metrics compared to HPTR but notably improves final displacement error and eliminates infeasible trajectories. Moreover, the proposed approach has a superior generalization to less prevalent maneuvers and unseen out-of-distribution scenarios, reducing the off-road rate under adversarial attacks from 66% to just 1%. These results highlight the effectiveness of our approach in generating feasible and robust predictions.
|
|
15:00-16:15, Paper MoDT1.3 | Add to My Program |
Human-Aided Trajectory Planning for Automated Vehicles through Teleoperation and Arbitration Graphs |
|
Le Large, Nick | KIT |
Brecht, David | Technical University of Munich |
Poh, Willi | Karlsruhe Institute of Technology |
Pauls, Jan-Hendrik | Karlsruhe Institute of Technology (KIT) |
Lauer, Martin | Karlsruher Institut Für Technologie |
Diermeyer, Frank | Technische Universität München |
Keywords: Teleoperation Control Systems for Vehicles, Decision Making, Motion Planning Algorithms for Autonomous Vehicles
Abstract: Teleoperation enables remote human support of automated vehicles in scenarios where the automation is not able to find an appropriate solution. Remote assistance concepts, where operators provide discrete inputs to aid specific automation modules like planning, are gaining interest due to their reduced workload on the human remote operator and improved safety. However, these concepts are challenging to implement and maintain due to their deep integration and interaction with the automated driving system. In this paper, we propose a solution to facilitate the implementation of remote assistance concepts that intervene on the planning level and extend the operational design domain of the vehicle at runtime. Using arbitration graphs, a modular decision-making framework, we integrate remote assistance into an existing automated driving system without modifying the original software components. Our simulative implementation demonstrates this approach in two use cases, allowing operators to adjust planner constraints and enable trajectory generation beyond nominal operational design domains.
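The property that remote assistance can be spliced in "without modifying the original software components" can be caricatured with a priority arbitrator, the basic building block of arbitration graphs (a deliberately tiny sketch, not the actual framework API; all behaviour names are hypothetical):

```python
def arbitrate(behaviors):
    """Priority-based arbitrator: behaviours are ordered by
    priority, and the first one whose invocation condition holds
    wins. A remote-assistance behaviour can be inserted at a
    higher priority at runtime without touching the nominal
    entries below it."""
    for name, applicable, command in behaviors:
        if applicable:
            return name, command
    raise RuntimeError("no applicable behavior")

# remote operator has not intervened, so the nominal planner wins
stack = [
    ("remote_assistance", False, "operator corridor"),
    ("nominal_planner", True, "nominal trajectory"),
    ("emergency_stop", True, "full brake"),
]
print(arbitrate(stack))
```

When the operator provides an input, only the `remote_assistance` entry's applicability flips; the nominal planner and fallback remain untouched, which is the modularity argument the paper makes.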
|
|
15:00-16:15, Paper MoDT1.4 | Add to My Program |
A Human-Like Trajectory Learning Approach Fusing Unstructured Scene Feature Extraction with Predictive Goal Point Guidance |
|
Chen, Sien | Beijing Institute of Technology |
Zhao, Lifei | Beijing Institute of Technology |
Li, Shihao | Beijing Institute of Technology |
Zhang, Xiao | Beijing Institute of Technology |
Wang, Boyang | Beijing Institute of Technology |
Liu, Haiou | Beijing Institute of Technology |
Keywords: Motion Planning Algorithms for Autonomous Vehicles
Abstract: The essence of human-like trajectory learning is to construct correspondences between scene elements and temporal trajectory points. Extracting key scene features and setting proper guidance during the learning process are crucial to improving the accuracy of human-like trajectory learning. Therefore, this paper proposes a graph feature extraction method for unstructured scene elements combined with a learning-based two-stage trajectory planner for human-like trajectory generation. The construction of the graph structure considers environmental, trajectory, and waypoint features, with environmental features specifically constructed through pixel clustering and motion compensation to enhance efficiency. In the first stage of the two-stage trajectory planning, feature extraction is performed using Spatial-Temporal Graph Convolutional Networks (ST-GCN), followed by proposal trajectory generation with a sequence-to-sequence (Seq2Seq) network. In the second stage, the proposed trajectory from the first stage serves as the input, with predicted goal points obtained through a Multilayer Perceptron (MLP) network. The final trajectory is then generated by fusing graph and guidance features. The results demonstrate that the proposed scene graph structure effectively reduces the complexity of the learning network, thereby improving algorithm efficiency. Additionally, heatmap-guided features, jointly generated with the learned predicted goal points and the regularization method, effectively guide trajectory generation and improve the accuracy of human-like trajectory generation.
|
|
15:00-16:15, Paper MoDT1.5 | Add to My Program |
LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction |
|
Yan, Yixin | Hunan University |
Li, Yang | Hunan University, College of Mechanical and Vehicle Engineering |
Wang, Yuanfan | Hunan University |
Zhou, Xiaozhou | Hunan University |
Xia, Beihao | Huazhong University of Science and Technology |
Hu, Manjiang | Hunan University |
Qin, Hongmao | Hunan University |
Keywords: Predictive Trajectory Models and Motion Forecasting
Abstract: Modeling the complex spatio-temporal dependencies among agents for trajectory prediction has long been a challenge. Since each state of an agent is closely related to its states at adjacent time steps, capturing local temporal dependencies is beneficial for prediction, yet most studies often overlook it. Moreover, learning higher-order motion state attributes is expected to enhance spatial interaction modeling, but this is rarely seen in previous work. To address this, we propose a lightweight framework, i.e., LTMSformer, to extract spatio-temporal interaction features for multimodal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism that captures local temporal dependencies by leveraging a convolutional attention mechanism with hierarchical local time boxes …
|
|
15:00-16:15, Paper MoDT1.6 | Add to My Program |
Lane-Level Navigation: A Local Drive Guide Sitting by the Roadside |
|
Li, Hongchen | Tongji University |
Lei, Mingyue | Tongji University |
Lin, Weimeng | COSCO SHIPPING Ports Limited |
Hu, Jia | Tongji University |
Keywords: Decision Making, Vehicle-to-Infrastructure (V2I) Communication, Real-Time Control Strategies
Abstract: In this research, a lane-level navigation system is designed to enhance vehicle mobility at signalized intersections. A deep learning-based lane-level long-term speed prediction (LLSP) predictor was developed to forecast traffic conditions for the upcoming planning horizon. Additionally, a lane-level navigation with speed guidance (LNSG) planner was introduced to determine the optimal lane-level route and the recommended travel speed for the ego vehicle. The performance of the proposed system was assessed using a software-in-the-loop simulation platform, considering various scenarios such as different traffic demands, vehicle arrival times at the control area, and planning resolutions. The evaluation results demonstrate that the proposed navigation system effectively improves the mobility of the ego vehicle by providing optimal lane and speed recommendations. Compared to the lane-keeping strategy, the system can reduce travel time by up to 30.2% in various traffic conditions.
|
|
15:00-16:15, Paper MoDT1.7 | Add to My Program |
CommonRoad Global Planner: A Toolbox for Global Motion Planning on Roads with Formal Guarantees |
|
Mascetta, Tobias Falco Wolfgang | Technical University Munich |
Northoff, Kilian | Technical University of Munich |
Althoff, Matthias | Technische Universität München |
Keywords: Motion Planning Algorithms for Autonomous Vehicles
Abstract: Motion planning for autonomous driving depends on or greatly benefits from global information, such as routes, reference paths, and velocity profiles. Existing global planning toolboxes (1) do not use provably unique curvilinear coordinates, (2) are mostly limited to racing scenarios, and (3) are not compatible with large scenario benchmarks. We present CommonRoad Global Planner, an open-source toolbox within the CommonRoad framework for global motion planning on roads, comprising the CommonRoad Route Planner for generating routes and smooth reference paths as well as the CommonRoad Velocity Planner, which implements several algorithms for planning velocity profiles. Our contributions are threefold: (1) our toolbox uses reference paths with provably correct curvilinear coordinates and returns velocity profiles that meet user-specified constraints; (2) the implemented algorithms are compatible with the CommonRoad benchmark suite; and (3) to the best of our knowledge, our toolbox for global planning is the first which is evaluated on large-scale numerical experiments on an open-source benchmark.
|
|
15:00-16:15, Paper MoDT1.8 | Add to My Program |
Online Velocity Profile Generation and Tracking for Sampling-Based Local Planning Algorithms in Autonomous Racing Environments |
|
Langmann, Alexander | Technical University of Munich |
Ögretmen, Levent | Technical University of Munich |
Werner, Frederik | Technische Universität München |
Betz, Johannes | Technical University of Munich |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Multi-Objective Planning Approaches
Abstract: This work presents an online velocity planner for autonomous racing that adapts to changing dynamic constraints, such as grip variations from tire temperature changes and rubber accumulation. The method combines a forward-backward solver for online velocity optimization with a novel spatial sampling strategy for local trajectory planning, utilizing a three-dimensional track representation. The computed velocity profile serves as a reference for the local planner, ensuring adaptability to environmental and vehicle dynamics. We demonstrate the approach’s robust performance and computational efficiency in racing scenarios and discuss its limitations, including sensitivity to deviations from the predefined racing line and high jerk characteristics of the velocity profile.
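A forward-backward velocity solver of the kind referenced here is a classic construction: a forward pass caps acceleration, a backward pass caps braking, and both stay under a pointwise speed limit (e.g. from lateral-grip constraints). A generic 1-D sketch over discretised arc length, not the authors' adaptive-constraint implementation (all numbers are arbitrary):

```python
import math

def forward_backward_velocity(v_limit, ds, a_max, d_max, v0=0.0):
    """Forward-backward velocity profile: the forward pass enforces
    v[i]^2 <= v[i-1]^2 + 2*a_max*ds (acceleration limit), the
    backward pass v[i]^2 <= v[i+1]^2 + 2*d_max*ds (braking limit),
    both capped by the pointwise speed limit v_limit."""
    n = len(v_limit)
    v = list(v_limit)
    v[0] = min(v[0], v0)
    for i in range(1, n):                  # forward pass: acceleration
        v[i] = min(v[i], math.sqrt(v[i - 1] ** 2 + 2.0 * a_max * ds))
    for i in range(n - 2, -1, -1):         # backward pass: braking
        v[i] = min(v[i], math.sqrt(v[i + 1] ** 2 + 2.0 * d_max * ds))
    return v

# a slow corner (10 m/s limit) in the middle of a fast section
profile = forward_backward_velocity([30, 30, 10, 30, 30],
                                    ds=10.0, a_max=3.0, d_max=5.0, v0=20.0)
print([round(x, 2) for x in profile])
```

Online adaptability, as in the paper, then reduces to re-running these two cheap passes whenever the grip-dependent limits `v_limit`, `a_max`, or `d_max` change.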
|
|
15:00-16:15, Paper MoDT1.9 | Add to My Program |
Biasing the Driving Style of an Artificial Race Driver for Online Time-Optimal Maneuver Planning |
|
Taddei, Sebastiano | University of Trento - DII, Politecnico Di Bari - DEI |
Piccinini, Mattia | Technical University of Munich |
Biral, Francesco | University of Trento |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Real-Time Control Strategies, Adaptive Vehicle Control Techniques
Abstract: In this work, we present a novel approach to bias the driving style of an artificial race driver (ARD) for online time-optimal trajectory planning. Our method leverages a nonlinear model predictive control (MPC) framework that combines time minimization with exit speed maximization at the end of the planning horizon. We introduce a new MPC terminal cost formulation based on the trajectory planned in the previous MPC step, enabling ARD to adapt its driving style from early to late apex maneuvers in real-time. Our approach is computationally efficient, allowing for low replan times and long planning horizons. We validate our method through simulations, comparing the results against offline minimum-lap-time (MLT) optimal control and online minimum-time MPC solutions. The results demonstrate that our new terminal cost enables ARD to bias its driving style, and achieve online lap times close to the MLT solution and faster than the minimum-time MPC solution. Our approach paves the way for a better understanding of the reasons behind human drivers' choice of early or late apex maneuvers.
|
|
15:00-16:15, Paper MoDT1.10 | Add to My Program |
Enhanced DACER Algorithm with Multimodal Q-Value Distribution for Risk-Sensitive Stochastic Vehicle Environments |
|
Liu, Tong | Tsinghua University |
Song, Xujie | Tsinghua University |
Wang, Yinuo | Tsinghua University |
Zou, Wenjun | Tsinghua University |
Shuai, Bin | Tsinghua University |
Gao, Haoyu | Tsinghua University |
He, Weixian | Tsinghua University |
Duan, Jingliang | University of Science and Technology Beijing |
Li, Shengbo Eben | Tsinghua University |
Keywords: Adaptive Vehicle Control Techniques, Real-Time Control Strategies, Decision Making
Abstract: Reinforcement learning demonstrates strong capabilities in handling complex control tasks, especially in the field of autonomous driving, where vehicles cope with uncertain environments. Existing reinforcement learning methods attempt to model the value distribution as unimodal, but this modeling process loses a significant amount of the complete distribution information. In response to this problem, we propose DACER++, an online multimodal distributional RL algorithm. Characterizing the value distribution as multimodal enhances the accuracy of the value distribution representation and improves algorithm performance. We construct a quantile value network and use quantile regression to approximate the full quantile function of the state-action return distribution. This method allows for the precise modeling of multimodal distributions and formulates risk-sensitive policies adaptable to different environments. We then integrate the quantile value network with the actor-critic algorithm DACER. Experiments on multi-goal tasks and MuJoCo benchmarks show that DACER++ not only has multimodal policy representation capability but also achieves state-of-the-art performance. In stochastic vehicle-meeting environments, DACER++ can learn different multimodal value distributions according to various risk preferences, including conservative and aggressive driving styles.
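Quantile regression over a set of quantile fractions, as used for the quantile value network, typically minimizes a Huber-smoothed pinball loss; a generic QR-DQN-style sketch (not necessarily the exact DACER++ loss; shapes and values are illustrative):

```python
import numpy as np

def quantile_huber_loss(quantiles, target, taus, kappa=1.0):
    """Huber-smoothed pinball loss for a quantile value network:
    the asymmetric weight |tau - 1{u < 0}| bends each output
    toward its quantile fraction of the return distribution,
    which is what lets the network represent multimodal returns."""
    u = target - quantiles                      # residuals, one per quantile head
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    return np.mean(np.abs(taus - (u < 0.0).astype(float)) * huber)

taus = (np.arange(5) + 0.5) / 5.0               # 5 evenly spaced quantile fractions
loss = quantile_huber_loss(np.zeros(5), 1.0, taus)
print(loss)
```

Risk-sensitive policies then follow by weighting the learned quantiles asymmetrically, e.g. averaging only the lower quantiles for a conservative driving style.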
|
|
15:00-16:15, Paper MoDT1.11 | Add to My Program |
Learning to Predict Mixed-Traffic Trajectories in Urban Scenarios from Little Training Data with Refined Environment Modeling |
|
Prutsch, Alexander | Graz University of Technology |
Possegger, Horst | Graz University of Technology |
Keywords: Motion Forecasting, Predictive Trajectory Models and Motion Forecasting, Deep Learning Based Approaches
Abstract: Trajectory prediction for autonomous driving has been extensively studied using large-scale datasets from the US and Asia. These datasets typically have a strong bias toward predicting vehicle motion. Recently, the View-of-Delft Prediction (VoD-P) dataset introduced a collection of European urban mixed-traffic scenarios, posing unique challenges due to its diversity and relatively small dataset size. In this work, we conduct a detailed study on trajectory prediction on the VoD-P dataset. We show that state-of-the-art trajectory prediction models, which perform well on large-scale vehicle-biased datasets, struggle to generalize to these scenarios. To address this limitation, we propose a simple yet effective transformer-based trajectory prediction model, specifically designed to handle the challenges posed by diverse urban scenarios. Combining a strong baseline with refined environment modeling, our approach significantly outperforms all existing methods on the VoD-P dataset.
|
|
15:00-16:15, Paper MoDT1.12 | Add to My Program |
Validation of a POMDP Framework for Interaction-Aware Trajectory Prediction in Vehicle Safety |
|
Elter, Tim | Technische Hochschule Ingolstadt |
Dirndorfer, Tobias | CARIAD SE |
Botsch, Michael | Technische Hochschule Ingolstadt |
Utschick, Wolfgang | Technische Universität München |
Keywords: Collision Avoidance Algorithms, Predictive Trajectory Models and Motion Forecasting, Decision Making
Abstract: Predicting the motion of traffic participants accurately remains a challenging task in the field of automated driving. Especially interactions between traffic participants introduce high complexity and interdependencies into the environment prediction. This work presents the remarkable performance of a Partially Observable Markov Decision Process (POMDP) framework to stochastically predict and safely respond to an interacting environment. The framework is validated for its ability to increase the overall Ego-Vehicle safety by preemptively triggering a de-escalation maneuver. The performance of the framework is analyzed on a publicly available dataset with real-world traffic (Argoverse) and on highly critical simulation scenarios specified by Euro-NCAP for emergency braking functions. The results show quantitatively that the proposed framework significantly contributes to an early de-escalation of critical scenarios. Such an early de-escalation increases the safety and comfort of automated vehicles.
|
|
15:00-16:15, Paper MoDT1.13 | Add to My Program |
DI3: Dynamic Insertable Intention Interval Based Future Motion Prediction for Autonomous Driving |
|
Wen, Lu | University of Michigan, Ann Arbor |
D'sa, Jovin | Honda Research Institute, USA |
Chalaki, Behdad | Honda Research Institute USA Inc |
Nourkhiz Mahjoub, Hossein | Honda Research Institute, US |
Moradi-Pari, Ehsan | Honda Research Institute USA |
Keywords: Predictive Trajectory Models and Motion Forecasting, Motion Forecasting
Abstract: In this paper, we address the challenges of limited interpretability and scalability in traditional trajectory prediction models for autonomous driving decision-making. We present the Dynamic Insertable Intention Interval framework (DI3), which introduces a novel representation of driving intentions by accounting for dynamic interactions with the surrounding environment. Our hierarchical approach integrates intention queries within a motion decoder, enabling the generation of multimodal predictions that closely replicate human driving behavior. Through comprehensive experiments on the highway on-ramp merging scenario using the exiD dataset, we show that DI3 enhances trajectory prediction accuracy and reduces joint prediction overlap rates compared to the Motion Transformer (MTR) baseline, demonstrating its effectiveness in high-interaction scenarios. Our work lays the foundation for more reliable and interpretable prediction models that are valuable for decision-making in autonomous driving applications.
|
|
15:00-16:15, Paper MoDT1.14 | Add to My Program |
Safe and Efficient CAV Lane Changing Using Decentralised Safety Shields |
|
Hegde, Bharathkumar | School of Computer Science and Statistics, Trinity College Dublin
Bouroche, Melanie | School of Computer Science and Statistics, Trinity College Dublin
Keywords: Collision Avoidance Algorithms, Reinforcement Learning for Planning, Multi-Objective Planning Approaches
Abstract: Lane changing is a complex decision-making problem for Connected and Autonomous Vehicles (CAVs) as it requires balancing traffic efficiency with safety. Although traffic efficiency can be improved by using vehicular communication for training lane change controllers using Multi-Agent Reinforcement Learning (MARL), ensuring safety is difficult. To address this issue, we propose a decentralised Hybrid Safety Shield (HSS) that combines optimisation and a rule-based approach to guarantee safety. Our method applies control barrier functions to constrain longitudinal and lateral control inputs of a CAV to ensure safe manoeuvres. Additionally, we present an architecture to integrate HSS with MARL, called MARL-HSS, to improve traffic efficiency while ensuring safety. We evaluate MARL-HSS using a gym-like environment that simulates an on-ramp merging scenario with two levels of traffic density: light and moderate. The results show that HSS provides a safety guarantee by strictly enforcing a dynamic safety constraint defined on a time headway, even in moderate traffic density, which presents challenging lane change scenarios. Moreover, the proposed method learns stable policies compared to the baseline, a state-of-the-art MARL lane change controller without a safety shield. Further policy evaluation shows that our method achieves a balance between safety and traffic efficiency with zero crashes and comparable average speeds in light and moderate traffic densities.
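A time-headway constraint enforced by a shield of this kind can be illustrated with a one-dimensional control-barrier-function filter: require the headway barrier h = gap - tau*v to satisfy h_dot >= -alpha*h, which yields an upper bound on acceleration. The gains, headway `tau`, and braking bound below are invented for illustration, not taken from the paper:

```python
def safe_accel(a_des, gap, v_ego, v_lead, tau=1.5, alpha=0.5, a_min=-6.0):
    """Clip a desired longitudinal acceleration to keep the headway barrier safe."""
    h = gap - tau * v_ego                       # time-headway barrier value
    # h_dot = (v_lead - v_ego) - tau*a, so h_dot >= -alpha*h gives:
    a_max = ((v_lead - v_ego) + alpha * h) / tau
    return max(a_min, min(a_des, a_max))

# Large gap, matched speeds: the desired acceleration passes through unchanged.
print(safe_accel(a_des=2.0, gap=40.0, v_ego=20.0, v_lead=20.0))  # 2.0
# Tight headway, closing on a slower lead vehicle: braking is enforced.
print(safe_accel(a_des=2.0, gap=25.0, v_ego=20.0, v_lead=15.0))  # -5.0
```

The filter is decentralised in the same spirit as the abstract describes: it needs only the ego state and the measured gap and speed of the lead vehicle, not any coordination with other agents.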
|
|
15:00-16:15, Paper MoDT1.15 | Add to My Program |
Frenet Corridor Planner: An Optimal Local Path Planning Framework for Autonomous Driving |
|
Tariq, Faizan M. | Honda Research Institute USA, Inc |
Yeh, Zheng-Hang | Honda Research Institute |
Singh, Avinash | Honda Research Institute, USA |
Isele, David | Honda Research Institute USA |
Bae, Sangjae | Honda Research Institute, USA |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Multi-Objective Planning Approaches, Decision Making
Abstract: Motivated by the requirements for effectiveness and efficiency, path-speed decomposition-based trajectory planning methods have widely been adopted for autonomous driving applications. While a global route can be planned offline, real-time generation of adaptive local paths remains crucial. Therefore, we present the Frenet Corridor Planner (FCP), an optimization-based local path planning strategy for autonomous driving that ensures smooth and safe navigation around obstacles. Modeling the vehicles as safety-augmented bounding boxes and pedestrians as convex hulls in the Frenet space, our approach defines a drivable corridor by determining the appropriate deviation side for static obstacles. Thereafter, a modified space-domain bicycle kinematics model enables path optimization for smoothness, boundary clearance, and dynamic obstacle risk minimization. The optimized path is then passed to a speed planner to generate the final trajectory. We validate FCP through extensive simulations and real-world hardware experiments, demonstrating its efficiency and effectiveness.
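Path-speed decomposition methods like this operate in the Frenet frame. A self-contained sketch of the underlying Cartesian-to-Frenet projection onto a reference polyline follows; this is a generic textbook construction, not the paper's FCP, and the polyline discretization is an assumption:

```python
import math

def to_frenet(pt, ref):
    """Return (s, d): arc length of the closest reference point and signed lateral offset."""
    best = (float("inf"), 0.0, 0.0)   # (distance, s, d)
    s = 0.0
    for (x0, y0), (x1, y1) in zip(ref, ref[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg = math.hypot(dx, dy)
        # Projection parameter of pt onto this segment, clamped to [0, 1].
        t = max(0.0, min(1.0, ((pt[0] - x0) * dx + (pt[1] - y0) * dy) / seg**2))
        px, py = x0 + t * dx, y0 + t * dy
        dist = math.hypot(pt[0] - px, pt[1] - py)
        if dist < best[0]:
            # Sign of d from the cross product: left of the path is positive.
            side = dx * (pt[1] - py) - dy * (pt[0] - px)
            best = (dist, s + t * seg, math.copysign(dist, side))
        s += seg
    return best[1], best[2]

ref = [(0, 0), (10, 0), (20, 0)]
print(to_frenet((12.0, 2.0), ref))   # (12.0, 2.0): 12 m along the path, 2 m to the left
```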
|
|
15:00-16:15, Paper MoDT1.16 | Add to My Program |
Knowledge Integration Strategies in Autonomous Vehicle Prediction and Planning: A Comprehensive Survey |
|
Manas, Kumar | Freie Universität Berlin |
Paschke, Adrian | FU Berlin |
Keywords: Decision Making, Motion Planning Algorithms for Autonomous Vehicles, Safety Verification and Validation Techniques
Abstract: This comprehensive survey examines the integration of knowledge-based approaches in autonomous driving systems, specifically focusing on trajectory prediction and planning. We extensively analyze various methodologies for incorporating domain knowledge, traffic rules, and commonsense reasoning into autonomous driving systems. The survey categorizes and analyzes approaches based on their knowledge representation and integration methods, ranging from purely symbolic to hybrid neuro-symbolic architectures. We examine recent developments in logic programming, foundation models for knowledge representation, reinforcement learning frameworks, and other emerging technologies incorporating domain knowledge. This work systematically reviews recent approaches, identifying key challenges, opportunities, and future research directions in knowledge-enhanced autonomous driving systems. Our analysis reveals emerging trends in the field, including the increasing importance of interpretable AI, the role of formal verification in safety-critical systems, and the potential of hybrid approaches that combine traditional knowledge representation with modern machine learning techniques.
|
|
15:00-16:15, Paper MoDT1.17 | Add to My Program |
Graph-Based Path Planning with Dynamic Obstacle Avoidance for Autonomous Parking |
|
Savvas Sadiq Ali, Farhad Nawaz | University of Pennsylvania |
Sung, Minjun | University of Illinois Urbana Champaign |
Gadginmath, Darshan | University of California Riverside |
D'sa, Jovin | Honda Research Institute, USA |
Bae, Sangjae | Honda Research Institute, USA |
Isele, David | Honda Research Institute USA |
Figueroa, Nadia | University of Pennsylvania |
Matni, Nikolai | University of Pennsylvania |
Tariq, Faizan M. | Honda Research Institute USA, Inc |
Keywords: Motion Planning Algorithms for Autonomous Vehicles, Collision Avoidance Algorithms
Abstract: Safe and efficient path planning in parking scenarios presents a significant challenge due to the presence of cluttered environments filled with static and dynamic obstacles. To address this, we propose a novel and computationally efficient planning strategy that seamlessly integrates the predictions of dynamic obstacles into the planning process, ensuring the generation of collision-free paths. Our approach builds upon the conventional Hybrid A* algorithm by introducing a time-indexed variant that explicitly accounts for the predictions of dynamic obstacles during node exploration in the graph, thus enabling dynamic obstacle avoidance. We integrate the time-indexed Hybrid A* algorithm within an online planning framework to compute local paths at each planning step, guided by an adaptively chosen intermediate goal. The proposed method is validated in diverse parking scenarios, including perpendicular, angled, and parallel parking. Through simulations, we showcase our approach's potential to greatly improve efficiency and safety when compared to the state-of-the-art spline-based planning method for parking situations.
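The key idea of a time-indexed variant — carrying time in the search state so that node expansion can consult obstacle predictions — can be sketched on a grid. This is plain A* over (x, y, t) rather than the paper's Hybrid A* over continuous poses, and the occupancy function is a stand-in for the obstacle predictor:

```python
import heapq

def plan(start, goal, occupied_at, t_max=50):
    """A* over (x, y, t); occupied_at(cell, t) encodes predicted dynamic obstacles."""
    h = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])   # Manhattan heuristic
    open_set = [(h(start), 0, start, 0, [start])]             # (f, g, cell, t, path)
    seen = set()
    while open_set:
        f, g, cell, t, path = heapq.heappop(open_set)
        if cell == goal:
            return path
        if (cell, t) in seen or t >= t_max:
            continue
        seen.add((cell, t))
        # (0, 0) lets the planner wait in place until an obstacle clears.
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]:
            nxt = (cell[0] + dx, cell[1] + dy)
            if not occupied_at(nxt, t + 1):
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt, t + 1, path + [nxt]))
    return None

# A dynamic obstacle predicted to block (1, 0) only at t = 1: the planner waits it out.
blocked = lambda cell, t: cell == (1, 0) and t == 1
path = plan((0, 0), (2, 0), blocked)
print(path)  # [(0, 0), (0, 0), (1, 0), (2, 0)] — waits one step, then proceeds
```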
|
|
MoDT2 Poster Session, Leonardo + Lobby Left |
Add to My Program |
Poster 2.2 >> Safety, Criticality and Risk Awareness |
|
|
Chair: Bergasa, Luis M. | University of Alcala |
Co-Chair: Abuhadrous, Iyad | INRIA |
|
15:00-16:15, Paper MoDT2.1 | Add to My Program |
A Safety Margin-Based Automatic Emergency Braking Model |
|
Ji, Xin | Beihang University |
Lu, Guangquan | Beihang University |
Wang, Jinghua | Beihang University |
Liang, Jinhao | Southeast University |
Tang, RenJing | Beihang University |
Keywords: Collision Avoidance Algorithms, Level 2 ADAS Control Techniques
Abstract: The Automatic Emergency Braking (AEB) system is capable of assessing driving risks, alerting the driver to potential collision hazards, and, in the absence of driver response to the collision risk, autonomously activating braking to mitigate the occurrence of collision accidents. Most existing AEB systems rely on Time to Collision (TTC) for risk assessment and decision-making. However, TTC fails to account for the impact of absolute velocity on driving safety when assessing risk, leading to inaccurate risk descriptions, particularly in high-speed scenarios with minor speed differences. The Safety Margin (SM) takes into account key factors affecting driving risk, such as relative velocity and distance, and is capable of accurately quantifying driving risks. Based on the SM, this study proposes a full-speed range single-threshold AEB model. The model comprises two components: traffic environment risk quantification and road surface friction coefficient estimation. It is applicable to automatic emergency braking tasks under varying speeds and road surface conditions. Simulation experiments were conducted by constructing three typical scenarios: stationary lead vehicle, slow-moving lead vehicle, and braking lead vehicle, determining a braking threshold of 0.2. The safety performance of the proposed safety margin-based AEB model is evaluated by comparing it with the traditional TTC-based AEB model across the specified scenarios. The results demonstrate that the safety margin-based AEB model proposed in this study achieves 100% safe braking in all scenarios, successfully performing emergency braking and outperforming the TTC-based AEB model.
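The TTC limitation that motivates a safety margin can be shown numerically: two situations with identical TTC but very different absolute speeds leave very different stopping distances. The `stopping_margin` function below is an illustrative distance-and-velocity proxy, not the paper's SM definition:

```python
def ttc(gap, v_ego, v_lead):
    """Time to collision: gap over closing speed (infinite when not closing)."""
    rel = v_ego - v_lead
    return gap / rel if rel > 0 else float("inf")

def stopping_margin(gap, v_ego, v_lead, a_brake=7.0):
    """Distance left after both vehicles brake to a stop (illustrative SM proxy)."""
    return gap + v_lead**2 / (2 * a_brake) - v_ego**2 / (2 * a_brake)

# Same 20 m gap and 2 m/s closing speed => identical TTC of 10 s ...
print(ttc(20.0, 12.0, 10.0), ttc(20.0, 32.0, 30.0))
# ... but at high absolute speed the remaining stopping margin is much smaller.
print(stopping_margin(20.0, 12.0, 10.0), stopping_margin(20.0, 32.0, 30.0))
```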
|
|
15:00-16:15, Paper MoDT2.2 | Add to My Program |
A Generative Self-Diagnosis Disengagement Reporting System for Autonomous Shuttles |
|
Kong, Xiangrui | The University of Western Australia |
Liang, Li | The University of Western Australia |
Li, Jichunyang | The University of Western Australia |
Quirke-Brown, Kieran | The University of Western Australia |
Lai, Zhihui | The University of Western Australia |
Olaru, Doina | University of Western Australia |
Braunl, Thomas | The University of Western Australia |
Keywords: Self-Diagnostic Systems for Vehicle Safety, Semantic Segmentation Techniques, Real-World Testing Methodologies for Safety Systems
Abstract: The increasing presence of autonomous vehicles on public roads has highlighted the limitations of traditional incident reporting systems, which rely on human-generated tables and descriptions. To address this, we propose a generative reporting framework that integrates a large language model (LLM) with semantic scene generation models. This framework utilizes perception snapshots and self-diagnostic data to generate detailed incident reports, addressing environmental blind spots. Our 3D scene completion network, combining diffusion and state-space models, reconstructs blind zones undetected by exterior sensors, achieving IoU scores of 41.92 on SSCBench-KITTI360 and 44.13 on SemanticKITTI. Public road experiments validate the system's ability to improve incident report quality while maintaining performance.
|
|
15:00-16:15, Paper MoDT2.3 | Add to My Program |
A Necessary Criterion for Evaluating Scene-Level Criticality Metrics in Safety Verification of Autonomous Driving |
|
Cheng, Hao | Tsinghua University |
Ge, Qiang | Tsinghua University |
Jiang, Yanbo | Tsinghua University |
Li, Haoran | Suzhou Automotive Research Institute, Tsinghua University |
Chen, Keyu | Tsinghua University |
Wang, Jianqiang | Tsinghua University |
Zheng, Sifa | Tsinghua University |
Keywords: Safety Verification and Validation Techniques, Real-World Testing Methodologies for Safety Systems, Self-Diagnostic Systems for Vehicle Safety
Abstract: Effective, reliable, and efficient measurement of autonomous driving safety performance is essential for demonstrating its trustworthiness. Criticality metrics offer an objective assessment of autonomous driving safety. However, the wide variety of criticality metrics, each with distinct characteristics, lacks a unified standard for evaluation and selection. We contend that a criticality metric should accurately reflect the true danger level of vehicle pairs at risk. This paper focuses on scene-level criticality metrics and proposes a necessary criterion: a robust criticality metric should accurately distinguish between 'collision-unavoidable' and 'collision-avoidable' states. To achieve this, we employ Monte Carlo sampling to systematically explore the state space of two-vehicle conflict scenes (>10^6 samples) and use intention-sharing Distributed Model Predictive Control (DMPC) to determine the ground truth of collision states. We analyze failure cases of three classical and two state-of-the-art scene-level criticality metrics and quantify their performance using the Receiver Operating Characteristic (ROC) method. Our approach has the potential to establish a necessary standard for evaluating criticality metrics, facilitating accurate assessment, analysis, and enhancement of autonomous vehicle safety.
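The ROC-based quantification can be sketched as follows: score each sampled state with a criticality metric and measure how well the score separates the ground-truth 'collision-unavoidable' labels. The scores and labels below are synthetic stand-ins, not the paper's metrics or DMPC-derived ground truth:

```python
def roc_auc(scores, labels):
    """AUC = P(score of a positive > score of a negative); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: higher criticality score should mean "collision unavoidable" (label 1).
scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.1]
labels = [1,   1,   0,    1,   0,   0]
print(roc_auc(scores, labels))  # 8/9 ≈ 0.889: good but imperfect separation
```

An AUC of 1.0 would mean the metric always ranks unavoidable states above avoidable ones — exactly the necessary criterion the abstract proposes; 0.5 means the metric is no better than chance.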
|
|
15:00-16:15, Paper MoDT2.4 | Add to My Program |
Good Enough to Learn: LLM-Based Anomaly Detection in ECU Logs without Reliable Labels |
|
Bogdan, Bogdan Mihai | Porsche Engineering Romania SRL |
Cazacu, Arina Ioana | Porsche Engineering Romania SRL |
Vasilie, Laura Ana | Porsche Engineering Romania SRL |
Keywords: Safety Verification and Validation Techniques, Fault Detection and Isolation (FDI) and Protection Level Determination, Foundation Models Based Approaches
Abstract: Anomaly detection often relies on supervised or clustering approaches, with limited success in specialized domains like automotive communication systems where scalable solutions are essential. We propose a novel decoder-only Large Language Model (LLM) to detect anomalies in Electronic Control Unit (ECU) communication logs. Our approach addresses two key challenges: the lack of LLMs tailored for ECU communication and the complexity of inconsistent ground truth data. By learning from UDP communication logs, we formulate anomaly detection simply as identifying deviations in time from normal behavior. We introduce an entropy regularization technique that increases the model's uncertainty on known anomalies while maintaining consistency in similar scenarios. Our solution offers three novelties: a decoder-only anomaly detection architecture, a way to handle inconsistent labeling, and an adaptable LLM for different ECU communication use cases. By leveraging the generative capabilities of decoder-only models, we present a new technique that addresses the high cost and error-prone nature of manual labeling through a more scalable system that is able to learn from a minimal set of examples, while improving detection accuracy in complex communication environments.
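The entropy-regularization idea — reward predictive uncertainty on positions labeled as known anomalies while ordinary cross-entropy keeps normal behavior consistent — can be sketched with a toy softmax output. The `lam` weight and the three-logit example are illustrative assumptions, not the paper's architecture:

```python
import math

def softmax(logits):
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def loss(logits, target, is_anomaly, lam=0.1):
    """Cross-entropy, minus an entropy bonus on known-anomaly positions."""
    p = softmax(logits)
    ce = -math.log(p[target])
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    # Subtracting entropy rewards uncertainty where labels are unreliable,
    # so the model is not forced to commit to a possibly wrong ground truth.
    return ce - lam * entropy if is_anomaly else ce

confident = [5.0, 0.0, 0.0]
print(loss(confident, 0, is_anomaly=False))  # plain cross-entropy
print(loss(confident, 0, is_anomaly=True))   # lower: entropy bonus applied
```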
|
|
15:00-16:15, Paper MoDT2.5 | Add to My Program |
Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors |
|
Pavlitska, Svetlana | FZI Research Center for Information Technology |
Robb, Jamie | FZI Research Center for Information Technology |
Polley, Nikolai | Karlsruhe Institute of Technology |
Yazgan, Melih | FZI Research Center for Information Technology |
Zöllner, J. Marius | FZI Research Center for Information Technology; KIT Karlsruhe Institute of Technology
Keywords: Safety Verification and Validation Techniques, Deep Learning Based Approaches, Cybersecurity Measures for Connected Vehicles
Abstract: Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in the lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code will be made publicly available upon acceptance.
|
|
15:00-16:15, Paper MoDT2.6 | Add to My Program |
Evaluating Pedestrian Risks in Shared Spaces through Autonomous Vehicle Experiments on a Fixed Track (I) |
|
Del Re, Enrico | Johannes Kepler Universität Linz |
Certad, Novel | Department of Intelligent Transport Systems, Johannes Kepler University
Varughese, Joshua Cherian | Johannes Kepler University |
Olaverri-Monreal, Cristina | Johannes Kepler University Linz, Austria |
Keywords: Vulnerable Road User Protection Strategies, Trust and Acceptance of Autonomous Technologies, Collision Avoidance Algorithms
Abstract: The majority of research on safety in autonomous vehicles has been conducted in structured and controlled environments. However, there is a scarcity of research on safety in unregulated pedestrian areas, especially when interacting with public transport vehicles like trams. This study investigates pedestrian responses to an alert system in this context by replicating the real-world scenario in a controlled environment using an autonomous vehicle. The results show that safety measures from other contexts can be adapted to shared spaces with trams, where fixed tracks heighten risks in unregulated crossings.
|
|
15:00-16:15, Paper MoDT2.7 | Add to My Program |
Experimental Results in Cyber-Physical Transportation Systems: A Case Study in Cybersecurity |
|
Ha, Won Yong | New York University |
Chakraborty, Sayan | New York University |
Ozbay, Kaan | New York University |
Jiang, Zhong-Ping | New York University |
Keywords: Reinforcement Learning for Planning, Control Strategies for Autonomous UAVs, Cybersecurity Measures for Connected Vehicles
Abstract: This paper presents experimental results from a learning-based control framework for cyber-physical transportation systems. Building on theoretical guarantees that establish an upper bound on denial-of-service (DoS) attack durations to maintain closed-loop stability, we deploy a resilient learning-based lane-changing control algorithm on a remote-controlled (RC) autonomous vehicle equipped with GPS, IMU, and camera sensors, interfaced with an Nvidia Jetson AGX Xavier board. The algorithm leverages real-time sensor data to make suboptimal yet robust lane-change decisions while enduring intermittent DoS attacks that disrupt communication. Our experiments confirm the resilience of this learning-based approach, demonstrating safe and efficient maneuvers under adversarial conditions in obstacle-rich driving scenarios. By highlighting these experimental findings, this work underscores the importance of cybersecurity in next-generation vehicle control algorithms for autonomous transportation applications.
|
|
15:00-16:15, Paper MoDT2.8 | Add to My Program |
Monitoring Operational Design Domain Compliance in Intelligent Vehicles |
|
Charmet, Thibault | Renault, Université De Technologie De Compiègne |
Cherfaoui, Véronique | Universite De Technologie De Compiegne |
Ibanez Guzman, Javier | Renault S.A.S, |
Armand, Alexandre | Renault SA |
Keywords: Safety Verification and Validation Techniques, Real-World Testing Methodologies for Safety Systems, Self-Diagnostic Systems for Vehicle Safety
Abstract: Advanced driver assistance systems (ADAS) and automated driving functions are becoming integral to modern vehicles. Ensuring their safety and reliability requires validating their operation within well-defined Operational Design Domains (ODD). Monitoring ODD compliance is crucial to determine when these functions can operate safely. This paper presents a systematic approach to ODD monitoring using a formalized, machine-readable ODD description and fuzzy logic. The method evaluates compliance and provides explanations for non-compliance. The approach introduces a two-level hierarchical ODD representation, a membership score quantifying compliance, and an explanation mechanism for identifying the primary factors contributing to non-compliance. The monitoring results are integrated into a Conditional Activation Control System (CACS), which governs function activation based on ODD compliance. The proposed system was implemented within a production vehicle and validated using real-world data, demonstrating its feasibility for deployment. By enabling clear, real-time assessments of ODD adherence, this approach supports safer and more reliable automated driving, promoting user confidence and regulatory compliance.
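A fuzzy-logic ODD monitor of this shape reduces to per-attribute membership functions, an aggregated compliance score, and an explanation naming the weakest attribute. The attributes and trapezoid parameters below are invented for illustration, not the formalized ODD of the paper:

```python
def trapezoid(x, a, b, c, d):
    """Fuzzy membership: 0 below a, ramp up on [a, b], 1 on [b, c], ramp down on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def odd_compliance(state):
    """Min-aggregated compliance score plus the attribute that limits it."""
    members = {
        "speed":      trapezoid(state["speed_kph"], 0, 0, 110, 130),
        "visibility": trapezoid(state["visibility_m"], 50, 200, 1e9, 2e9),
        "rain":       trapezoid(state["rain_mmh"], -1, 0, 2, 8),
    }
    score = min(members.values())            # a single bad attribute drags it down
    worst = min(members, key=members.get)    # explanation: weakest contributor
    return score, worst

score, why = odd_compliance({"speed_kph": 90, "visibility_m": 120, "rain_mmh": 5})
print(score, why)  # degraded compliance, dominated by reduced visibility
```

A downstream activation gate (the CACS role in the abstract) would then compare `score` to a threshold and surface `why` to the driver or logs when the function must be declined.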
|
|
15:00-16:15, Paper MoDT2.9 | Add to My Program |
Diffusion Models for Safety Validation of Autonomous Driving Systems |
|
Wang, Juanran | Stanford University |
Schlichting, Marc René | Stanford University |
Delecki, Harrison | Stanford University |
Kochenderfer, Mykel | Stanford University |
Keywords: Safety Verification and Validation Techniques, Collision Avoidance Algorithms
Abstract: Safety validation of autonomous driving systems is extremely challenging due to the high risks and costs of real-world testing as well as the rarity and diversity of potential failures. To address these challenges, we train a denoising diffusion model to generate potential failure cases of an autonomous vehicle given any initial traffic state. Experiments on a four-way intersection problem show that in a variety of scenarios, the diffusion model can generate realistic failure samples while capturing a wide variety of potential failures. Our model does not require any external training dataset, can perform training and inference with modest computing resources, and does not assume any prior knowledge of the system under test, with applicability to safety validation for traffic intersections.
|
|
15:00-16:15, Paper MoDT2.10 | Add to My Program |
An ISO 26262-Derived Evaluation Methodology for Automated Fault Injection Test Case Generators |
|
Benkendorf, Nina | Technical University of Munich |
Ganahl, Carolin | Technical University of Munich |
Munaro, Tiziano | Fortiss |
Keywords: Real-World Testing Methodologies for Safety Systems, Safety Verification and Validation Techniques
Abstract: Fault Injection (FI) is a well-established method to assess the effect of failures within elements of a system under test. Where FI test cases can be executed automatically, such as in simulation-based or Hardware-in-the-Loop (HiL) FI, numerous test case generators (TCGs) have been proposed that aim to uncover more 'critical' test cases or to accomplish this using fewer resources. However, the evaluations of these approaches do not allow for direct comparisons: Experiments are often not reproducible, metrics are commonly specific to use cases, and key properties, such as test case distribution, are often not captured. Further, the authors are not aware of any suitable comparison frameworks. Hence, to support practitioners in selecting the most suitable FI TCG for their use case, test setup, and individual goal, this work introduces a set of use case-independent metrics derived from the ISO 26262 safety standard and identifies how these metrics can be applied and analyzed to capture decisive characteristics such as the distribution, criticality, and coverage of generated test cases. We incorporate these metrics and analyses in a start-to-finish methodology and provide their implementation as an open-source tool to effectively and reproducibly evaluate TCGs for automated FI. The evaluation methodology is assessed in a case study with an industry-oriented cyber-physical system, demonstrating its ability to support practitioners in making an informed decision about the TCG providing the most appropriate balance of coverage and efficiency for their particular use case.
|
|
15:00-16:15, Paper MoDT2.11 | Add to My Program |
Exploring Communication and Roadside Perception Requirements for Cooperative Warning Systems at Intersections |
|
Wang, Tinghan | University of Michigan |
Meng, Depu | University of Michigan |
Li, Boqi | Univ. of Michigan |
Zhang, Rusheng | University of Michigan |
Zuo, Yukun | Hunan University |
Shen, Shengyin | University of Michigan |
Hogue, Darian | Mcity - University of Michigan
Maile, Michael | Ivie Communications |
Shulman, Michael | Shulman Technology Consultants, LLC
Liu, Henry X. | University of Michigan |
Keywords: Collision Avoidance Algorithms, Vulnerable Road User Protection Strategies
Abstract: Infrastructure-based cooperative perception has been researched for several years, but few automotive warning or control applications using this information have been published. Infrastructure sensing with cameras or lidars, combined with a communication system, allows connected vehicles to receive information about all observed objects. An SAE standard, "V2X Sensor-Sharing for Cooperative and Automated Driving" (J3224), released in 2022, introduces the Sensor Data Sharing Message (SDSM) as the standard communication message for cooperative perception. This paper investigates the use of the SDSM for a vehicle application that provides warnings of potential collisions with vulnerable road users about to cross the street at the intersection. The application was tested in CARLA simulation under various roadside detection errors and communication conditions to assess the impact on the on-board application and estimate the minimum detection and communication requirements for effective use. In addition, the system was implemented and evaluated at the Mcity test facility. The results demonstrate that the proposed warning system can accurately and promptly warn the driver, given specific communication conditions, and show that the SDSM is viable for real-time on-board usage.
|
|
15:00-16:15, Paper MoDT2.12 | Add to My Program |
A Dynamic Priority-Based Batch Verification Scheme for V2X Communication in Vehicular Networks |
|
Yang, Yang | Beihang University |
Yu, Haiyang | Beihang University |
Fu, Xiang | Beihang University |
Ren, Yilong | Beihang University |
Zhao, Yanan | Beihang University |
Shi, Yuqi | Tongji University |
Keywords: Safety Verification and Validation Techniques, Vehicle-to-Infrastructure (V2I) Communication, Cybersecurity Measures for Connected Vehicles
Abstract: V2X technology facilitates real-time communication between vehicles, enabling collision avoidance systems, proactive hazard warnings, and cooperative maneuvers to prevent potential accidents. Due to the inherent openness of wireless communication channels, vehicular networks are highly susceptible to various security threats. Digital signatures have been widely adopted as an effective verification mechanism to ensure message integrity and authenticity. However, in high-density traffic environments, the sheer volume of messages imposes a significant computational burden on the verification process, leading to excessive delays and potential packet loss, which compromise the timeliness and reliability of safety-critical applications. To address this issue, we propose DPBV, a priority-aware signature verification scheme that dynamically prioritizes V2X messages based on their urgency and relevance. By leveraging clustering-based classification and batch verification techniques, the proposed approach optimizes the processing efficiency of safety messages while maintaining stringent security guarantees. Simulation results demonstrate that our scheme significantly reduces verification latency and improves message authentication throughput, making it well-suited for real-time V2X communication in high-density vehicular networks.
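The dynamic prioritization can be sketched as a priority queue feeding fixed-size verification batches. The message fields and the additive urgency-plus-relevance score are invented stand-ins for the paper's clustering-based classification:

```python
import heapq

class PriorityBatchVerifier:
    def __init__(self, batch_size=4):
        self.queue = []           # min-heap: lower priority value = more critical
        self.batch_size = batch_size
        self.counter = 0          # tie-breaker preserving arrival order

    def submit(self, msg, urgency, relevance):
        priority = urgency + relevance        # toy criticality score
        heapq.heappush(self.queue, (priority, self.counter, msg))
        self.counter += 1

    def next_batch(self):
        """Pop the most critical messages; their signatures are verified together."""
        return [heapq.heappop(self.queue)[2]
                for _ in range(min(self.batch_size, len(self.queue)))]

v = PriorityBatchVerifier(batch_size=2)
v.submit("beacon A", urgency=3, relevance=2)
v.submit("hard-brake warning", urgency=0, relevance=0)
v.submit("beacon B", urgency=2, relevance=3)
print(v.next_batch())  # ['hard-brake warning', 'beacon A'] — the warning jumps the queue
```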
|
|
15:00-16:15, Paper MoDT2.13 | Add to My Program |
Steering into Danger: Security Vulnerabilities in Steer-By-Wire and Steering Wheel-Less Vehicles |
|
Yedla Ravi, Bhagawat Baanav | University of Florida |
Ray, Sandip | University of Florida |
Keywords: Level 4-5 Autonomous Driving Systems Architecture, Vulnerable Road User Protection Strategies
Abstract: Steer-by-Wire (SbW) systems revolutionize automotive technology by eliminating the mechanical linkage between the steering wheel and tires, enhancing design flexibility and performance, especially in autonomous vehicles. However, this reliance on sensors and electronic data channels introduces critical security vulnerabilities. Exposed sensor locations in modern vehicles increase susceptibility to cyberattacks and physical interference, yet prior research has largely overlooked SbW-specific threats, particularly position encoders. This paper is the first to experimentally analyze SbW security vulnerabilities, presenting a novel attack methodology that disrupts SbW sensors. Our findings demonstrate how these vulnerabilities can compromise steering operations, posing severe risks to vehicle dynamics and occupant safety. As autonomous vehicles eliminate manual intervention, addressing these security risks becomes urgent. This study lays a foundation for developing more resilient SbW systems, ensuring safer and more secure automotive technologies.
|
|
15:00-16:15, Paper MoDT2.14 | Add to My Program |
Formalization and Online Monitoring of Right-Of-Way Laws for Autonomous Vehicles at Intersections |
|
Zhang, LingJun | Tsinghua University |
Zhao, Chengxiang | Beijing Institute of Technology |
Yang, Lei | Tsinghua University |
Song, Lei | Tsinghua University |
Song, Ziying | Beijing Jiaotong University |
Yu, Wenhao | Tsinghua University |
Wang, Hong | Tsinghua University |
Keywords: Smart City Mobility Integration Strategies, User-Centric Intelligent Vehicle Technologies, Collision Avoidance Algorithms
Abstract: With the rapid advancement of autonomous driving, safety concerns have become the primary barrier to its commercialization. Compliance with traffic laws is crucial for ensuring road safety. However, the current laws, formulated for human drivers, present challenges for autonomous systems due to ambiguous language, complicating accurate judgment and government monitoring. It is imperative to transform traffic laws into machine-interpretable logical frameworks while simultaneously resolving ambiguities in legal terminology to ensure clarity and precision. This study focuses on urban intersections, characterized by high traffic complexity and diverse participants. We propose a formalization method for right-of-way laws and develop a threshold analysis framework based on processed data from SIND, which rigorously defines the prioritization of right-of-way. The optimal compliance threshold is determined through sensitivity analysis, evaluated using the proposed Weighted TPN score (WTPNs). The threshold was then applied for online monitoring at intersections. The dataset is available online via: https://github.com/SOTIF-AVLab/SinD
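Formalizing a right-of-way law means turning prose into a machine-checkable predicate. A toy sketch follows; the rule set, compass encoding, and priority-to-the-right convention are invented for illustration, and the paper's actual thresholds are derived from SIND data:

```python
def right_of(approach):
    """Approach direction of a vehicle on the right of one coming from `approach`."""
    order = ["N", "E", "S", "W"]   # clockwise compass order of approach arms
    return order[(order.index(approach) - 1) % 4]

def has_priority(ego, other):
    """Unsignalized-intersection sketch: going straight beats turning;
    otherwise, the vehicle approaching from the other's right has priority."""
    if ego["maneuver"] == "straight" and other["maneuver"] != "straight":
        return True
    if ego["maneuver"] != "straight" and other["maneuver"] == "straight":
        return False
    # Both straight or both turning: priority-to-the-right convention.
    return ego["approach"] == right_of(other["approach"])

ego = {"maneuver": "straight", "approach": "N"}
other = {"maneuver": "left", "approach": "W"}
print(has_priority(ego, other))  # True: straight beats a turning vehicle
```

An online monitor would evaluate such predicates on tracked trajectories each frame and flag vehicles that proceed without holding priority.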
|
|
15:00-16:15, Paper MoDT2.15 | Add to My Program |
Model-Based Development of a Hardware-In-The-Loop Setup for Assessing Cybersecurity of Vehicle-To-Robot Communication |
|
Behrens, Theodor | Volkswagen AG |
Heinrich, Lukas | Volkswagen AG |
Pannek, Jürgen | Institute for Intermodal Transportation and Logistic System, Tec |
Keywords: User-Centric Intelligent Vehicle Technologies, Vehicle-to-Infrastructure (V2I) Communication, Cybersecurity Measures for Connected Vehicles
Abstract: In automotive engineering, the integration of the vehicle into mobility ecosystems adds a new dimension of complexity to development and testing. As a consequence, existing test environments such as Hardware-in-the-Loop (HiL) have to be adapted. In this paper, we propose a systematic approach to adapting the HiL testbench specification, ensuring that the test environment considers the new interactions within the vehicle and with its ecosystem. We illustrate this method by specifying a testbench able to assess the cybersecurity of a recently developed Universal Vehicle-to-Robot Communication Interface (UVCI).
|
|
15:00-16:15, Paper MoDT2.16 | Add to My Program |
ISO 34505 Based Test Evaluation Methodology for ADAS/AD |
|
Yetkin, Sarp Kaya | AVL Türkiye Research and Engineering |
Günaydın, Batuhan | AVL Research & Engineering Turkey |
Tomruk, Mert | AVL Research & Engineering Turkey |
Bahar, Saadet | AVL |
Azak, Kaan | AVL Research and Engineering Turkey |
Keywords: Safety Verification and Validation Techniques, Level 4-5 Autonomous Driving Systems Architecture, Synthetic Data Generation for Training
Abstract: The development of Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) technologies involves rigorous verification processes to enhance driver and passenger safety. Although Society of Automotive Engineers (SAE) Level 3+ (L3+) systems are promoted as improving safety, significant gaps remain in the test evaluation and validation processes. The International Organization for Standardization (ISO) 34505 standard provides methodologies for evaluating Automated Driving Systems (ADS); however, the defined methodology is not yet fully refined or practically applicable. This study systematically implements the evaluation steps defined in ISO 34505, with a particular focus on enhancing test prioritization, microscopic analysis, and simulation environment validation. Our approach is specifically tailored for L3+ systems, addressing key limitations in current validation techniques and improving the applicability of ISO 34505.
|
|
15:00-16:15, Paper MoDT2.17 | Add to My Program |
Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems |
|
Kirchner, Sven | TU München |
Knoll, Alois | Technische Universität München |
Keywords: Safety Verification and Validation Techniques, Level 3 Driving Systems Architecture and Techniques, User-Centric Intelligent Vehicle Technologies
Abstract: Developing safety-critical automotive software presents significant challenges due to increasing system complexity and strict regulatory demands. This paper proposes a novel framework integrating Generative Artificial Intelligence (GenAI) into the Software Development Lifecycle (SDLC). The framework uses Large Language Models (LLMs) to automate code generation in languages such as C++, incorporating safety-focused practices such as static verification, test-driven development and iterative refinement. A feedback-driven pipeline ensures the integration of test, simulation and verification for compliance with safety standards. The framework is validated through the development of an Adaptive Cruise Control (ACC) system. Comparative benchmarking of LLMs ensures optimal model selection for accuracy and reliability. Results demonstrate that the framework enables automatic code generation while ensuring compliance with safety-critical requirements, systematically integrating GenAI into automotive software engineering. This work advances the use of AI in safety-critical domains, bridging the gap between state-of-the-art generative models and real-world safety requirements.
|
|
15:00-16:15, Paper MoDT2.18 | Add to My Program |
Introducing Spatial Residual Risk for Information Degradation in Automated Driving |
|
Gehrke, Nils | Technische Universität München |
Diermeyer, Frank | Technische Universität München |
Keywords: Teleoperation Control Systems for Vehicles, Level 4-5 Autonomous Driving Systems Architecture, Vulnerable Road User Protection Strategies
Abstract: Misperception of surrounding objects and traffic participants can lead to critical situations. Autonomous driving systems must be able to assess the safety impact of a degraded sensing and perception pipeline at any time. This assessment should be based on an independent risk evaluation framework. This work introduces a residual risk that quantifies the potential risk originating from misperception compared to a response with non-degraded information. Evaluations are possible online at an average rate of 10 Hz. Furthermore, exemplary scenarios are analyzed and provided in this paper for discussion.
|
|
MoDT3 Poster Session, Raffaello + Lobby Right |
Add to My Program |
Poster 2.3 >> Perception: Segmentation & Scene Interpretation |
|
|
Chair: Fremont, Vincent | Ecole Centrale De Nantes, CNRS, LS2N, UMR 6004 |
Co-Chair: Petrovai, Andra | Technical University of Cluj-Napoca |
|
15:00-16:15, Paper MoDT3.1 | Add to My Program |
Rethinking SSIM-Based Optimization in Neural Field Training |
|
Zhang, Xiaoning | Xi’an Jiaotong University |
Su, Yuanqi | Xi'an Jiaotong University |
Lu, HaoAng | Xi'an Jiaotong University |
Zhang, Chi | Xi'an Jiaotong University |
Liu, Yuehu | Institute of Artificial Intelligence and Robotics, Xi'an Jiaoton |
Keywords: 3D Scene Reconstruction Methods, Scalable Neural Scene Representation
Abstract: The Structural Similarity (SSIM) index is a widely used metric for evaluating image quality, with broad applications in areas such as image restoration, 3D reconstruction, and novel view synthesis. A number of previous works have introduced SSIM-based optimization into neural field training to enhance the model's performance. Despite its widespread use, there has been limited research on how to effectively incorporate SSIM loss into the training process. In this work, we explore this gap and provide insights into the role of SSIM loss in neural field training. Our key finding is that SSIM loss is particularly beneficial during the early phase of training, before the model fully learns the luminance information. We show that SSIM loss acts as an effective "guidance" mechanism in the initial training phase, and removing it after the model has learned the luminance does not harm the final performance; in fact, it may improve it. Our experiments demonstrate the effectiveness of our strategy, offering new insights into how SSIM loss can be more efficiently used in neural field training. We believe these findings will not only enhance SSIM's application in neural field training but also inspire further research into more adaptive loss functions for deep learning models.
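The early-phase-only SSIM strategy described in the abstract can be sketched as a simple loss schedule. This is an illustrative sketch, not the paper's implementation: `ssim_global` uses a single global window (real SSIM slides a Gaussian window), and the `ssim_cutoff_step` parameter is an assumption.

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    # Simplified SSIM with one global window over the whole image;
    # production implementations use a sliding Gaussian window.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def combined_loss(pred, target, step, ssim_cutoff_step=5000):
    # The SSIM term guides only the early phase and is then dropped,
    # mirroring the finding that late-phase SSIM is unnecessary.
    l1 = np.abs(pred - target).mean()
    if step < ssim_cutoff_step:
        return l1 + (1.0 - ssim_global(pred, target))
    return l1
```

For identical images the SSIM term vanishes, so the schedule only changes the gradient signal while the reconstruction is still imperfect.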
|
|
15:00-16:15, Paper MoDT3.2 | Add to My Program |
3D Shape Transfer Learning for Enhanced Monocular 3D Object Detection |
|
Zhang, Xiaoning | Xi’an Jiaotong University |
Su, Yuanqi | Xi'an Jiaotong University |
Lu, HaoAng | Xi'an Jiaotong University |
Zhang, Chi | Xi'an Jiaotong University |
Wang, Xiangyu | Xi'an Jiaotong University |
Liu, Yuehu | Institute of Artificial Intelligence and Robotics, Xi'an Jiaoton |
Keywords: Static and Dynamic Object Detection Algorithms
Abstract: Monocular 3D object detection (M3D) is challenging due to the lack of depth information in the RGB image. To enhance detection performance, existing works resort to various additional resources, including depth information, LiDAR data, CAD models, stereo images, video sequences, and others. However, they often require close correspondence and strict synchronization between the target RGB image and extra resources, limiting their applicability and scalability. In this work, we propose a simple yet effective framework, 3D Shape Transfer Learning for Enhanced Monocular 3D Object Detection (STLM3D). It views M3D as 3D shape reconstruction and leverages 3D shape transfer learning (STL) across datasets to enhance its reconstruction capability, thereby enhancing M3D performance. In addition, we design a plug-and-play 3D detection branch that focuses on 3D attribute prediction and also facilitates 3D shape transfer learning. Experimental results on the KITTI benchmark demonstrate that our STLM3D leads to new state-of-the-art and surpasses existing methods by a significant margin.
|
|
15:00-16:15, Paper MoDT3.3 | Add to My Program |
Attention-Based Two-Stage 3D Lane Detection and Topological Prediction |
|
Fu, Xiaohan | Tongji University |
Han, Yi | Tongji University |
Tian, Wei | Tongji University |
Yu, Xianwang | Tongji University |
Keywords: Deep Learning Based Approaches, Static and Dynamic Object Detection Algorithms, Semantic Segmentation Techniques
Abstract: The increasing demand for accurate perception of static road information in autonomous driving systems has drawn significant attention to 3D lane detection and topology prediction. This paper introduces a two-stage 3D lane detection and topology prediction model based on an attention mechanism. The proposed model employs 3D Bézier curves to represent lane lines and an adjacency matrix to represent the topological relationships, which facilitates end-to-end learning of detection and topology prediction for 3D lanes. Experimental results on the OpenLaneV2 dataset demonstrate that the proposed method achieves improvements in both 3D lane detection and topology prediction compared to current leading methods.
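As a sketch of the lane representation above, a cubic 3D Bézier curve maps a parameter t in [0, 1] to points along the lane. The function name and the (4, 3) control-point layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bezier3d(ctrl, t):
    # Evaluate a cubic Bezier lane: ctrl is a (4, 3) array of 3D
    # control points, t an array of parameters in [0, 1];
    # returns a (len(t), 3) array of lane points.
    p0, p1, p2, p3 = np.asarray(ctrl, dtype=float)
    t = np.asarray(t, dtype=float)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)
```

Collinear, evenly spaced control points reduce the curve to a straight lane segment, which makes a convenient sanity check.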
|
|
15:00-16:15, Paper MoDT3.4 | Add to My Program |
Improving 3D Occupancy Estimation Using Driver Gaze Estimation |
|
Baltaxe, Michael | General Motors R&D |
Ben Ezra, Shahar | General Motors |
Tsimhoni, Omer | General Motors |
Telpaz, Ariel | General Motors R&D |
Levi, Dan | General Motors, Advanced Technical Center, Israel |
Celniker, Gershon | General Motors |
Hecht, Ron | General Motors |
Keywords: 3D Scene Reconstruction Methods, Advanced Multisensory Data Fusion Algorithms, Feedback Systems for Driver Interaction
Abstract: Camera-only 3D occupancy estimation aims to cost-effectively reconstruct the occupancy state of a grid of voxels in three-dimensional space, based on input from several cameras. One limitation of this approach is detecting occupied voxels of objects located far away, since the cameras' resolution at such distances is relatively low. In this work, we boost performance by introducing gaze map estimation. Specifically, we show that although no additional sensor is used, gaze map estimation is strong enough to enhance basic occupancy estimation networks, yielding better Chamfer distance (CD), F-score, and intersection over union (IoU) metrics. At long distances, we found an improvement over the baseline of more than 20% in CD, 24% in F-score, and 15% in IoU.
|
|
15:00-16:15, Paper MoDT3.5 | Add to My Program |
Empirical Spatial Error Bounds for Reliable Semantic Segmentation of Pedestrians and Riders (I) |
|
Bartels, Timo | Technische Universität Braunschweig |
Stelzer, Malte | Technische Universität Braunschweig |
Bickerdt, Jan | Volkswagen AG |
Schomerus, Volker Patricio | Volkswagen AG |
Piewek, Jan | Volkswagen AG |
Bagdonat, Thorsten | Volkswagen AG |
Fingscheidt, Tim | Technische Universität Braunschweig |
Keywords: Semantic Segmentation Techniques, Deep Learning Based Approaches, Vulnerable Road User Protection Strategies
Abstract: The mean intersection over union (mIoU) is a standard metric for evaluating semantic segmentation models. While steady improvements in mIoU have been achieved on automotive benchmarks like Cityscapes, their impact on reliably detecting vulnerable road users, such as pedestrians and riders, remains unclear. This study empirically analyzes 167 semantic segmentation models w.r.t. the spatial distribution of the false positive rate and false negative rate in the Cityscapes dataset. Our analysis reveals that many segmentation errors occur at object contours, which hardly influence driving decisions and road user safety. Accordingly, we propose to exclude such irrelevant errors. We define spatial error bounds within which models reliably detect pedestrians and riders. Since time-to-collision is strongly related to distance, and the vertical pixel position is roughly related to distance, the vertical position of segmentation errors provides an effective way to evaluate the reliability of semantic segmentation models on an entire dataset. Our evaluation of such empirical spatial error bounds reveals that strong models (w.r.t. mIoU) are related to an improved detection of existing pedestrians (false negative rate, FNR). On the other hand, mIoU in general is only weakly related to hallucinations of pedestrians and riders (false positive rate, FPR). Some models even exhibit a higher FPR despite having an 11.2% absolute higher mIoU.
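The row-wise error analysis described in the abstract can be sketched as follows. The binary per-class masks and the function name are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def rowwise_fnr(pred_mask, gt_mask):
    # Per-row false negative rate for one class (e.g. pedestrian).
    # The vertical pixel position acts as a coarse proxy for distance,
    # so rows near the image bottom correspond to close-by road users.
    fn = (~pred_mask & gt_mask).sum(axis=1)   # missed pixels per row
    pos = gt_mask.sum(axis=1)                 # ground-truth pixels per row
    return np.divide(fn, pos, out=np.zeros(len(pos)), where=pos > 0)
```

Aggregating this profile over a dataset gives the vertical band within which a model stays below a chosen error bound.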
|
|
15:00-16:15, Paper MoDT3.6 | Add to My Program |
Robust Checkpoint Selection by Exponential Moving Averaging for Domain Generalized Segmentation |
|
Bätje, Marc | University of Luebeck, Institute for Software Engineering and Pr |
Schwonberg, Manuel | CARIAD SE and Technische Universität Berlin |
Bohlke, Henrik | Volkswagen AG |
Leucker, Martin | University of Luebeck, Institute for Software Engineering and Pr |
Keywords: Data Augmentation Techniques Using Neural Networks, Semantic Segmentation Techniques, Deep Learning Based Approaches
Abstract: Deep learning has seen significant progress in applications such as autonomous driving and healthcare, with synthetic data playing an increasingly important role. However, models trained on synthetic data often suffer performance drops in real-world settings due to domain shifts. Domain Generalization (DG) aims to address this problem by developing models that can robustly generalize to unseen domains, with a key challenge being the selection of an appropriate checkpoint. For the checkpoint selection problem, we introduce a standardized approach that simultaneously enhances model robustness through the combination of Exponential Moving Average (EMA) and data augmentation. Our method reduces performance variability and improves generalization in 25 out of 26 experiments, highlighting EMA as a promising technique for more stable and reliable DG performance in real-world applications.
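One EMA step over a dictionary of weights can be sketched as below; the decay value and the dict-of-tensors layout are illustrative assumptions rather than the paper's exact setup.

```python
def ema_update(ema, params, decay=0.999):
    # Blend the running average toward the current weights. The EMA
    # copy, rather than the latest raw checkpoint, is evaluated and
    # selected, which smooths out per-iteration variability.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}
```

Applied after every optimizer step, this yields a shadow model whose validation score varies far less across checkpoints than the raw weights.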
|
|
15:00-16:15, Paper MoDT3.7 | Add to My Program |
Adaptive Neural Networks for Intelligent Data-Driven Development |
|
Shoeb, Youssef Omar | Technical University of Berlin, Continental AG |
Nowzad, Azarm | Continental |
Gottschalk, Hanno | Institute of Mathematics, TU Berlin |
Keywords: User-Centric Intelligent Vehicle Technologies, Deep Learning Based Approaches
Abstract: Advances in machine learning methods for computer vision tasks have led to their consideration for safety-critical applications like autonomous driving. However, effectively integrating these methods into the automotive development lifecycle remains challenging. Since the performance of machine learning algorithms relies heavily on the training data provided, the data and model development lifecycle play a key role in successfully integrating these components into the product development lifecycle. Existing models frequently encounter difficulties recognizing or adapting to novel instances not present in the original training dataset. This poses a significant risk for reliable deployment in dynamic environments. To address this challenge, we propose an adaptive neural network architecture and an iterative development framework that enables users to efficiently incorporate previously unknown objects into the current perception system. Our approach builds on continuous learning, emphasizing the necessity of dynamic updates to reflect real-world deployment conditions. Specifically, we introduce a pipeline with three key components: (1) a scalable network extension strategy to integrate new classes while preserving existing performance, (2) a dynamic OoD detection component that requires no additional retraining for newly added classes, and (3) a retrieval-based data augmentation process tailored for safety-critical deployments. The integration of these components establishes a pragmatic and adaptive pipeline for the continuous evolution of perception systems in the context of autonomous driving.
|
|
15:00-16:15, Paper MoDT3.8 | Add to My Program |
3D Segment-Based Road Boundary Extraction Method Via Spatio-Temporal Analysis |
|
Yang, Jimin | Tsinghua University |
Nan, Jiangang | Tsinghua University |
Wang, Jianqiang | Tsinghua University |
Xu, Shaobing | Tsinghua University |
Keywords: Static and Dynamic Object Detection Algorithms, Lidar-Based Environment Mapping, Representation Learning for Driving Scenarios
Abstract: Accurate and effective road boundary extraction plays a significant role in the navigation and decision-making processes of self-driving cars. Nevertheless, reliable detection of road boundaries via 3D LiDAR is particularly difficult due to uneven point cloud density and chaotic vegetation areas. Conventional methods often require time-consuming fitting or clustering algorithms to enhance performance. To this end, this paper presents a road boundary extraction approach that particularly focuses on curbs and vegetation areas, utilizing LiDAR data without fitting or clustering and enabling the generation of detailed road maps. We exploit both single and multiple frames of data in the design, which enables spatio-temporal feature extraction. This strategy is realized by integrating the road boundary detection algorithm with SLAM technology. The algorithm contains three stages: 1) Coarse Ground Segmentation (CGS), 2) Adaptive Spatial Feature Extraction (A-SFE), and 3) Iterative Multi-scale Refinement (I-MSR). Experiments on the KITTI dataset are conducted for verification. The proposed method not only outperforms traditional methods with an average of 85% in key metrics but also demonstrates comparable performance to state-of-the-art deep learning models.
|
|
15:00-16:15, Paper MoDT3.9 | Add to My Program |
Self-Supervised Pretraining for Aerial Road Extraction (I) |
|
Polley, Rupert | FZI Research Center for Information Technology |
Deenadayalan, Sai Vignesh Abishek | FZI Research Cen Ter for Information Technology, |
Zöllner, J. Marius | FZI Research Center for Information Technology; KIT Karlsruhe In |
Keywords: End-to-End Neural Network Architectures and Techniques, Data Augmentation Techniques Using Neural Networks, Semantic Segmentation Techniques
Abstract: Deep neural networks for aerial image segmentation require large amounts of labeled data, but high-quality aerial datasets with precise annotations are scarce and costly to produce. To address this limitation, we propose a self-supervised pretraining method that improves segmentation performance while reducing reliance on labeled data. Our approach uses inpainting-based pretraining, where the model learns to reconstruct missing regions in aerial images, capturing their inherent structure before being fine-tuned for road extraction. This method improves generalization, enhances robustness to domain shifts, and is agnostic to model architecture and dataset choice. Experiments show that our pretraining significantly boosts segmentation accuracy, especially in low-data regimes, making it a scalable solution for aerial image analysis.
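The inpainting pretext task can be sketched as random patch masking: the network receives the corrupted image and is trained to reconstruct the original. Patch size, patch count, and the function name are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def mask_random_patches(img, num_patches=8, patch=16, rng=None):
    # Zero out random square patches; the pretraining objective is to
    # reconstruct the original aerial image from this corrupted input.
    if rng is None:
        rng = np.random.default_rng(0)
    out = img.copy()
    h, w = img.shape[:2]
    for _ in range(num_patches):
        y = int(rng.integers(0, h - patch + 1))
        x = int(rng.integers(0, w - patch + 1))
        out[y:y + patch, x:x + patch] = 0
    return out
```

Because the corruption is synthetic, unlimited pretraining pairs can be generated from unlabeled aerial imagery before any road labels are needed.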
|
|
15:00-16:15, Paper MoDT3.10 | Add to My Program |
LiDAR and Camera Fusion for Joint Depth Completion and Panoptic Segmentation Tasks in a Unified Network for 3D Semantic Segmentation (I) |
|
Choi, Youn-ho | Chungbuk National University |
Bong, Eunjung | Chungbuk National University |
Kee, Seok-Cheol | Chungbuk National University |
Keywords: 3D Scene Reconstruction Methods, Advanced Multisensory Data Fusion Algorithms, Deep Learning Based Approaches
Abstract: Autonomous driving heavily relies on advanced perception systems composed of various sensor fusions, including LiDAR, cameras, GPS, and IMU. Among these, LiDAR excels at providing 3D information, making it a critical input for many autonomous driving algorithms. However, LiDAR faces challenges like data sparsity, limited long-range detection performance, and high computational requirements. In 3D object detection, achieving high accuracy for objects beyond 70 meters is typically difficult, as shown in Figure 1, making it challenging to respond to sudden appearances or abrupt stops of objects in real driving scenarios. Conversely, cameras offer rich texture and color information but lack depth estimation. To overcome the limitations of these sensors while leveraging their strengths, this paper proposes an integrated network architecture that harnesses the advantages of both LiDAR and cameras. In this study, we designed an efficient fusion network through the early fusion of the two sensors, enabling simultaneous high-precision camera-based Panoptic Segmentation and depth completion to complement the sparse LiDAR data. The lightweight network structure achieved an inference speed exceeding 9 FPS on a single RTX 4070 GPU. To implement real-time 3D Semantic Segmentation, we fuse the outputs from each task and implement the algorithm using ROS2 and Autoware. Our results demonstrate that the proposed model, while being lightweight, effectively handles multiple tasks simultaneously and can perform real-time inference, making it highly suitable for real-world applications in autonomous driving and 3D scene reconstruction.
|
|
15:00-16:15, Paper MoDT3.11 | Add to My Program |
FuseRoad: Enhancing Lane Shape Prediction through Semantic Knowledge Integration and Cross-Dataset Training |
|
Hsiao, Heng-Chih | National Chung Cheng University |
Cai, Yi-Chang | National Chung Cheng University |
Lin, Huei-Yung | National Taipei University of Technology |
Wei-Chen, Chiu | National Chiao Tung University |
Chan, Chiao-Tung | National Yang Ming Chiao Tung University |
Wang, Chieh-Chih | National Yang Ming Chiao Tung University |
Keywords: Semantic Segmentation Techniques, Automotive Datasets, Perception Algorithms for Adverse Weather Conditions
Abstract: The rapid evolution of advanced driver assistance systems (ADAS) has been driven by the advances of deep neural networks, and multi-tasking is essential for autonomous driving systems. This paper presents FuseRoad, a new multi-task model that leverages cross-dataset learning to address the dependency on specific multi-task datasets and reduce the annotation costs. It integrates semantic segmentation and lane detection into an end-to-end framework while providing an effective approach to utilize multiple single-task datasets. By incorporating a Semantic Road Knowledge Extractor (SRKE) to direct more attention to the roadway, FuseRoad enhances the accuracy and reliability of lane detection. The model also employs the logit normalization loss to address the issue of overconfidence commonly faced by conventional lane detection methods. In experiments, FuseRoad outperforms state-of-the-art approaches in both accuracy and F1 score. The evaluation on semantic segmentation metrics also demonstrates that the proposed technique is highly effective for multi-task road scene analysis. Code and datasets are available at https://github.com/HengChihHsiao/FuseRoad.
|
|
15:00-16:15, Paper MoDT3.12 | Add to My Program |
False Positive Sampling-Based Data Augmentation for Enhanced 3D Object Detection Accuracy |
|
Oh, Jiyong | Kookmin University |
Lee, Junhaeng | Kookmin University |
Woongchan, Byun | Kookmin University |
Kong, Minsang | Kookmin University |
Lee, Sang Hun | Kookmin University |
Keywords: Data Augmentation Techniques Using Neural Networks, Deep Learning Based Approaches
Abstract: 3D object detection plays a pivotal role in autonomous driving, and its accuracy has improved substantially in recent years. Among existing augmentation methods for training 3D object detection models, ground truth sampling significantly improves model performance by increasing the number of positive samples in a scene and alleviating class imbalance. However, our experiments reveal that ground truth sampling can excessively expand the model's decision boundary, leading to a notable increase in false positives. To address this issue, we propose a novel data augmentation technique called false positive sampling, which retrains the model using point clouds misclassified as positive during inference. This approach effectively reduces false positives without sacrificing the number of true positives, resulting in considerable performance gains. Furthermore, false positive sampling maximizes the class imbalance mitigation effect of ground truth sampling by leveraging these challenging examples. Our method also improves the model's ability to semantically understand difficult samples that typically cause confusion. Experimental results on standard 3D object detection benchmarks demonstrate the effectiveness of the proposed algorithm in achieving robust and accurate detection.
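The core banking step of false positive sampling can be sketched as matching detections against ground truth. As a simplification, the sketch uses 2D axis-aligned boxes in place of the paper's 3D point-cloud objects; the function names and IoU threshold are illustrative assumptions.

```python
def iou_2d(a, b):
    # Axis-aligned IoU on (x1, y1, x2, y2) boxes; a 2D stand-in for
    # the 3D overlap test used with point-cloud objects.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def collect_false_positives(detections, gt_boxes, iou_thresh=0.1):
    # Detections matching no ground-truth box are banked and later
    # pasted back into training scenes as hard negative samples.
    return [d for d in detections
            if all(iou_2d(d, g) < iou_thresh for g in gt_boxes)]
```

The banked false positives then play the same paste-into-scene role as ground truth sampling, but as negatives that tighten the decision boundary.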
|
|
15:00-16:15, Paper MoDT3.13 | Add to My Program |
Label Correction for Road Segmentation Using Roadside Cameras |
|
Toikka, Henrik | Aalto University |
Alamikkotervo, Eerik | Aalto University |
Ojala, Risto | Aalto University |
Keywords: Data Annotation and Labeling Techniques, Perception Algorithms for Adverse Weather Conditions, Semantic Segmentation Techniques
Abstract: Reliable road segmentation in all weather conditions is critical for intelligent transportation applications, autonomous vehicles, and advanced driver assistance systems. For robust performance, all weather conditions should be included in the training data of deep learning-based perception models. However, collecting and annotating such a dataset requires extensive resources. In this paper, existing roadside camera infrastructure is utilized for automatically collecting road data in varying weather conditions. Additionally, a novel semi-automatic annotation method for roadside cameras is proposed. For each camera, only one frame is labeled manually and the label is then transferred to other frames of that camera feed. The small camera movements between frames are compensated using frequency domain image registration. The proposed method is validated with roadside camera data collected from 927 cameras across Finland over a 4-month period during winter. Training on the semi-automatically labeled data boosted the segmentation performance of several deep learning segmentation models. Testing was carried out on two different datasets to evaluate the robustness of the resulting models: an in-domain roadside camera dataset and an out-of-domain dataset captured with a vehicle on-board camera. Code used for this study is available here: htoik.github.io/toikka2025label
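The frequency-domain registration step can be sketched with phase correlation: the estimated shift between two frames is then applied to the manually drawn road label. The function name and wrap-around handling are illustrative assumptions; the paper's implementation may differ in detail.

```python
import numpy as np

def phase_correlation_shift(ref, img):
    # Estimate the (dy, dx) translation that maps ref onto img via the
    # normalized cross-power spectrum; exact for circular shifts.
    cross = np.fft.fft2(img) * np.conj(np.fft.fft2(ref))
    cross /= np.abs(cross) + 1e-12
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap peaks in the upper half of each axis to negative shifts.
    return tuple(p - n if p > n // 2 else p
                 for p, n in zip(peak, corr.shape))
```

Because only small camera movements occur between frames, a pure-translation model is usually enough to keep the transferred label aligned.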
|
|
15:00-16:15, Paper MoDT3.14 | Add to My Program |
RAILS: Radar Range-Azimuth Map Estimation from Image, LiDAR and Semantic Descriptions |
|
Rangaraj, Pavan Aakash | NXP Semiconductors |
Alkanat, Tunc | NXP Semiconductors |
Pandharipande, Ashish | NXP Semiconductors |
Keywords: Synthetic Data Generation for Training, Automotive Datasets, Deep Learning Based Approaches
Abstract: Synthetic generation of radar range-azimuth (RA) maps remains challenging due to limited automotive radar datasets. While camera images are abundant across diverse driving environments, translating visual information directly to radar representations requires sophisticated multi-modal fusion techniques. In this paper, we introduce RAILS, a novel convolutional autoencoder architecture that synthesizes realistic RA maps by integrating RGB camera images, LiDAR depth information, and semantic scene segmentation. By leveraging a U-Net inspired architecture with convolutional block attention mechanisms, our approach transforms multi modal inputs into accurate radar representations. We demonstrate the model's effectiveness across various driving scenarios using the RADIal dataset, showing superior performance in target localization and scene reconstruction compared to existing methods. Experimental results highlight the potential of using auxiliary depth and semantic information to address the scarcity of radar training data, offering a promising approach for enhancing machine learning based radar perception in autonomous driving systems.
|
|
15:00-16:15, Paper MoDT3.15 | Add to My Program |
A Novel Beam Prediction Scheme Based on Multimodal Data with High Robustness |
|
Lei, Jiahao | Northwestern Polytechnical University |
Jin, Ziteng | Northwestern Polytechnical University |
Li, Xiang | Northwestern Polytechnical University |
Liu, Jiajia | Northwestern Polytechnical University |
Keywords: Advanced Multisensory Data Fusion Algorithms, Cooperative Perception and Localization Techniques, Perception Algorithms for Adverse Weather Conditions
Abstract: With the advancement of the Internet of Vehicles, accurate beam prediction is crucial for maintaining stable and high-quality wireless communication in dynamic environments. A large number of beam prediction schemes have been proposed, which can be broadly categorized into two types: schemes based on channel state and schemes based on side information. However, beam prediction schemes based on channel state require significant training overhead and computational complexity. In addition, most schemes based on side information neither make effective use of multimodal data (such as camera, LiDAR, and position), nor consider the impact of environmental noise on prediction accuracy. To address these weaknesses, we propose a novel beam prediction scheme with high robustness based on multimodal data. Specifically, the scheme first uses a variety of data augmentation methods to reduce the noise interference caused by adverse environments. Then, we employ ResNet to map heterogeneous data into a unified linear space to achieve effective feature alignment and correspondence. Finally, we exploit the distinctive multi-head attention mechanism of the Transformer model to guarantee that the fused features are both representative and informative. Extensive numerical results demonstrate that the proposed scheme offers both high robustness and accuracy across various scenarios.
|
|
15:00-16:15, Paper MoDT3.16 | Add to My Program |
RNOSMamba: Boosting Road Negative Obstacles Segmentation Via Vision Mamba from RGB and Depth Images |
|
Dai, Yuqi | Tsinghua University |
Cui, Zhoujuan | Tsinghua University |
Keywords: Advanced Multisensory Data Fusion Algorithms, Deep Learning Based Approaches, Semantic Segmentation Techniques
Abstract: The fusion of RGB and depth information holds significant potential for accurate road negative obstacle identification. However, effectively leveraging these multimodal data to distinguish fine-grained road surface defects, such as potholes and cracks, remains a challenge. Inspired by the recent progress of multimodal fusion in a variety of computer vision tasks, this paper proposes a novel Vision Mamba-based Road Negative Obstacle Segmentation framework (RNOSMamba) that leverages the complementary strengths of optical and depth images. Toward this end, optical and depth images are appropriately fused in the feature domain to boost the performance of road negative obstacle segmentation (RNOS). The hierarchical decoder incorporates Cross-Modality State Space (CMSS) blocks and Cross-Scale Feature Fusion (CSFF) modules to refine features and produce precise segmentation masks. Extensive experiments demonstrate that the proposed RNOSMamba achieves 68.5% mIoU and 80.7% mF1, highlighting its potential to significantly boost the accuracy of road negative obstacle segmentation.
|
|
15:00-16:15, Paper MoDT3.17 | Add to My Program |
Target-Driven and Student-Centered Knowledge Distillation for Traffic Object Tracking |
|
Ding, Zhicheng | Bowling Green State University |
Lan, Qizhen | University of Alabama at Birmingham |
Tian, Qing | University of Alabama at Birmingham |
Keywords: Dynamic Object Tracking, Deep Learning Based Approaches
Abstract: Visual Object Tracking is crucial for autonomous driving, enabling real-time monitoring of dynamic environments. While Transformer-based trackers achieve state-of-the-art performance by modeling long-range dependencies, their high computational cost limits deployment in real-world autonomous systems. To address this, we propose Target-Driven and Student-Centered Knowledge Distillation (TDSC-KD), a novel framework designed to improve the efficiency of Transformer-based trackers while maintaining accuracy. Our framework consists of (1) target-driven distillation, which leverages a ground-truth query to guide knowledge transfer toward relevant and consistent regions, filtering out background noise, and (2) student-centered distillation, which employs a mask-and-reconstruct mechanism to encourage more active student learning and reduce over-reliance on the teacher. Experiments on the LaSOT-Traffic dataset demonstrate our TDSC-KD's efficacy, narrowing the gap between the strong performance of Transformer trackers and the strict efficiency constraints of real-world deployment.
|
|
MoDT4 Poster Session, Bernini Room |
Add to My Program |
Poster 2.4 >> Cooperation & Communication |
|
|
Chair: Weidl, Galia | University of Applied Sciences Aschaffenburg |
Co-Chair: Tsukada, Manabu | The University of Tokyo |
|
15:00-16:15, Paper MoDT4.1 | Add to My Program |
Optimized Cooperative Car-Following through Lightweight Vehicle-To-Vehicle Intent Sharing |
|
Li, Hangyu | University of Wisconsin-Madison |
Oh, Juyoung | University of Wisconsin-Madison |
Ma, Ke | University of Wisconsin-Madison |
Liang, Zhaohui (Vito) | University of Wisconsin-Madison |
Zhang, Peng | University of Wisconsin-Madison |
Li, Xiaopeng | University of Wisconsin-Madison |
Keywords: Multi-Agent Coordination Strategies, Cooperative Planning Strategies in Vehicle Networks
Abstract: Cooperative driving systems are expected to enhance safety, mobility, and efficiency through vehicle connectivity technologies. Lower-level vehicle-to-vehicle (V2V) communication transmits high-frequency status information, such as location, velocity, and acceleration, between vehicles. This approach contributes little to prediction accuracy, requires high-frequency hardware, and is sensitive to communication delays. Recent studies have shown that intent sharing, which conveys planned trajectories, significantly improves prediction accuracy and control performance but requires higher bandwidth. However, mainstream vehicle communication methods struggle to balance cost and bandwidth for effective intent sharing: high-bandwidth wireless methods such as dedicated short-range communication (DSRC) and cellular vehicle-to-everything (C-V2X) require costly devices, while low-cost visible light communication (VLC) can hardly support the necessary bandwidth. To address this challenge, we propose a lightweight intent sharing approach that reduces data transmission volume while maintaining prediction accuracy. Specifically, intended velocity trajectories are represented by regressed polynomial functions over a fixed time period, requiring only the transmission of the polynomial coefficients and a timestamp for synchronization. The feasibility of this approach is demonstrated through simulations of car-following behavior using a Linear-Quadratic Regulator (LQR). Additionally, real vehicle experiments using a designated velocity cycle further validate the method. Results show that both planned and actual trajectories of the following vehicle closely align with those obtained under ideal intent sharing, at a significantly reduced communication data volume.
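The polynomial compression idea is concrete enough to sketch end to end. The quadratic degree, sampling grid, and speed values below are hypothetical choices, not the paper's actual parameters; the point is that the sender transmits only a few coefficients and a timestamp, and the receiver reconstructs the full intended velocity profile.

```python
def polyfit2(ts, vs):
    """Least-squares quadratic fit v(t) = c0 + c1*t + c2*t^2 via normal equations."""
    s = [sum(t ** k for t in ts) for k in range(5)]           # sums of t^0..t^4
    A = [[s[i + j] for j in range(3)] for i in range(3)]
    b = [sum(v * t ** i for t, v in zip(ts, vs)) for i in range(3)]
    for col in range(3):                                      # Gaussian elimination
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c2 in range(col, 3):
                A[r][c2] -= f * A[col][c2]
            b[r] -= f * b[col]
    c = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):                                       # back substitution
        c[r] = (b[r] - sum(A[r][k] * c[k] for k in range(r + 1, 3))) / A[r][r]
    return c

# Sender: compress a planned velocity profile into 3 coefficients + a timestamp.
ts = [0.0, 0.5, 1.0, 1.5, 2.0]
vs = [10.0, 10.575, 11.1, 11.575, 12.0]       # planned speeds in m/s
coeffs = polyfit2(ts, vs)                      # the only payload besides t0
# Receiver: reconstruct the intended speed anywhere in the horizon.
v_hat = lambda t: coeffs[0] + coeffs[1] * t + coeffs[2] * t ** 2
```

Three floats plus a timestamp replace the full sampled trajectory, which is what makes low-bandwidth channels such as VLC viable for intent sharing.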
|
|
15:00-16:15, Paper MoDT4.2 | Add to My Program |
Occlusion-Aware Planning for Connected and Automated Vehicles with Cooperative Perception at Unsignalized Intersection (I) |
|
Su, Hao | Osaka University |
Arakawa, Shin'ichi | Osaka University |
Murata, Masayuki | Osaka University |
Keywords: Cooperative Planning Strategies in Vehicle Networks, Behavior Assessment Using Cooperative Data, Data Sharing and Privacy in V2X Systems
Abstract: Achieving safe and efficient navigation in urban environments, where occlusions frequently occur, is a persistent challenge for autonomous driving. As a promising solution, cooperative perception (CP) has attracted significant attention due to its advantages in enhancing individual sensing capabilities. In this context, this paper proposes an occlusion-aware motion planner integrated with CP that assists individual vehicles in optimizing their speed, minimizing risk while ensuring swift arrival. Specifically, the proposed planning framework is designed as a sequential pipeline. At each timestamp, sensor features, vehicle motion, and map contextual information from the edge server are shared among vehicles for cooperative object tracking and occlusion analysis. Subsequently, the potential risk of each occluded area is represented by the appearance probability of phantom objects, which adapts to changes across multiple viewpoints. Next, an association module identifies correspondences between phantom and existent objects, enhancing risk assessment performance. Finally, predicted information from the observation and motion spaces is fed into the planning and control module for reference speed planning and vehicle action control. Simulated evaluations demonstrate that our approach delivers safer and more efficient driving policies in challenging occlusion scenarios than the baseline, which uses only onboard sensors, or methods that fuse only single-view perceptions.
|
|
15:00-16:15, Paper MoDT4.3 | Add to My Program |
Multi-Agent Service Migration Strategy under VEC |
|
Ye, Lei | Chongqing University |
Chen, Yulan | Chongqing University |
Han, Qingwen | Chongqing University |
Zeng, Lingqiu | Chongqing University |
Ling, Kaiwen | Chongqing University |
Keywords: Multi-Agent Coordination Strategies, V2X Communication Protocols and Standards, Vehicle-to-Infrastructure (V2I) Communication
Abstract: As Intelligent Transport Systems (ITS) advance, the Internet of Vehicles (IoV) improves traffic efficiency and safety, and Vehicle Edge Computing (VEC) provides strong computing and storage capabilities. However, high vehicle speeds require efficient service migration to maintain service continuity. This study proposes a real-time service migration strategy that optimizes Quality of Service (QoS) and response latency. To support dynamic decision-making in the VEC framework, this study introduces Priority Experience Replay and Four-Trajectory exploration (PERFT), recurrent neural networks, and attention mechanisms into the Proximal Policy Optimization (PPO) algorithm. The resulting PERFT-PPO addresses long-sequence handling, sparse rewards, and the limited exploration of long trajectories in single-agent scenarios. To address multi-agent competition in VEC, this study integrates Centralized Training and Distributed Execution (CTDE) into PERFT-PPO, creating the Prioritized Experience Replay Four-Trajectory Multi-Agent Proximal Policy Optimization (PERFT-MAPPO). Experimental results demonstrate significant improvements in real-time decision-making and overall system performance.
|
|
15:00-16:15, Paper MoDT4.4 | Add to My Program |
Evaluation of Mobile Environment for Vehicular Visible Light Communication Using Multiple LEDs and Event Cameras |
|
Soga, Ryota | Nagoya University |
Shiba, Shintaro | Woven by Toyota |
Kong, Quan | Woven by Toyota, Inc |
Kobori, Norimasa | Woven by Toyota Inc |
Shimizu, Tsukasa | Toyota Motor Corporation |
Lu, Shan | Nagoya University |
Yamazato, Takaya | Nagoya University |
Keywords: V2X Communication Protocols and Standards, Vehicle-to-Infrastructure (V2I) Communication, Data Sharing and Privacy in V2X Systems
Abstract: In the fields of Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD), sensors that serve as the "eyes" for sensing the vehicle's surrounding environment are essential. Traditionally, image sensors and LiDAR have played this role. However, a new type of vision sensor, the event camera, has recently attracted attention. Event cameras respond to changes in the surrounding environment (e.g., motion), exhibit strong robustness against motion blur, and perform well in high dynamic range environments, properties that are desirable in robotics applications. Furthermore, the asynchronous and low-latency principles of data acquisition make event cameras suitable for optical communication. By adding communication functionality to event cameras, it becomes possible to use I2V communication to immediately share information about forward collisions, sudden braking, and road conditions, thereby contributing to hazard avoidance. Additionally, receiving information such as signal timing and traffic volume enables speed adjustment and optimal route selection, facilitating more efficient driving. In this study, we construct a vehicular visible light communication system in which event cameras are the receivers and multiple LEDs are the transmitters. In driving scenes, the system tracks the transmitter positions and separates densely packed LED light sources using pilot sequences based on Walsh-Hadamard codes. Outdoor vehicle experiments demonstrate error-free communication at transmitter-receiver distances within 40 meters and a driving speed of 30 km/h (8.3 m/s).
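The Walsh-Hadamard pilot separation named in the abstract can be sketched with a toy superposition model: orthogonal code rows let a receiver untangle several LED sources that land on overlapping pixels. The code length, amplitudes, and noiseless channel are simplifying assumptions.

```python
def hadamard(n):
    """Sylvester construction of an n x n Walsh-Hadamard matrix (n a power of two)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

H = hadamard(4)
# Assign orthogonal pilot rows (skipping the all-ones row 0) to two LEDs.
pilot_a, pilot_b = H[1], H[2]
amp_a, amp_b = 3.0, 1.5            # per-LED intensities to recover

# The event camera observes the superposition of both modulated sources.
mixed = [amp_a * a + amp_b * b for a, b in zip(pilot_a, pilot_b)]

# Correlating with each pilot isolates its source, since the rows are orthogonal.
est_a = sum(m * p for m, p in zip(mixed, pilot_a)) / len(pilot_a)
est_b = sum(m * p for m, p in zip(mixed, pilot_b)) / len(pilot_b)
```

Because distinct Hadamard rows have zero inner product, each correlation cancels every other LED's contribution exactly; with noise, the estimate degrades gracefully instead.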
|
|
15:00-16:15, Paper MoDT4.5 | Add to My Program |
MAGNNET: Multi-Agent Graph Neural Network-Based Efficient Task Allocation for Autonomous Vehicles with Deep Reinforcement Learning |
|
Ratnabala, Lavanya | Student, Skolkovo Institute of Science and Technology |
Fedoseev, Aleksey | Skolkovo Institute of Science and Technology |
Peter vimalathas, Robinroy | Skoltech |
Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
Keywords: Cooperation between UAVs and Ground Vehicles, Multi-Agent Coordination Strategies, Reinforcement Learning for Planning
Abstract: This paper addresses the challenge of decentralized task allocation within heterogeneous multi-agent systems operating under communication constraints. We introduce a novel framework that integrates Graph Neural Networks (GNNs) with a centralized training and decentralized execution (CTDE) paradigm, further enhanced by a tailored Proximal Policy Optimization (PPO) algorithm for multi-agent deep reinforcement learning (MARL). Our approach enables unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to allocate tasks dynamically and efficiently in a 3D grid environment without central coordination. The framework minimizes total travel time while avoiding conflicts in task assignments. For cost calculation and routing, we employ reservation-based A* and R* path planners. Experimental results show that our method achieves a 92.5% conflict-free success rate, with only a 7.49% performance gap compared to the centralized Hungarian method, while outperforming a heuristic decentralized baseline based on a greedy approach. Additionally, the framework scales to 20 agents with an allocation processing time of 2.8 s and responds robustly to dynamically generated tasks, underscoring its potential for real-world applications in complex multi-agent scenarios.
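The paper's reservation-based A* is not specified in detail; a minimal time-expanded variant on a toy 2D grid, with a hypothetical reservation table, conveys the idea: cells reserved by other agents at a given time step are treated as transient obstacles.

```python
import heapq

def reserve_astar(grid, start, goal, reservations):
    """A* over (cell, time) states; a cell reserved at time t by another
    agent is treated as an obstacle at that step."""
    rows, cols = len(grid), len(grid[0])
    max_t = rows * cols * 4                    # crude horizon to guarantee termination
    h = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])   # Manhattan heuristic
    open_q = [(h(start), 0, start, [start])]
    seen = set()
    while open_q:
        f, t, cell, path = heapq.heappop(open_q)
        if cell == goal:
            return path
        if (cell, t) in seen or t >= max_t:
            continue
        seen.add((cell, t))
        r, c = cell
        for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)):  # moves + wait
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and ((nr, nc), t + 1) not in reservations:
                heapq.heappush(open_q, (t + 1 + h((nr, nc)), t + 1,
                                        (nr, nc), path + [(nr, nc)]))
    return None

grid = [[0, 0, 0],
        [0, 1, 0],    # 1 = static obstacle
        [0, 0, 0]]
# Another agent has reserved cell (0, 1) at time step 1.
path = reserve_astar(grid, (0, 0), (2, 2), {((0, 1), 1)})
```

The wait move lets an agent pause for a reservation to expire when waiting is cheaper than detouring, which is the essence of reservation-based planning.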
|
|
15:00-16:15, Paper MoDT4.6 | Add to My Program |
Sticky-PRE: A Sticky Proxy Re-Encryption Protocol for Persistent Vehicle Data Privacy |
|
Ashutosh, Ashish | University of Passau |
Hasan, Omar | INSA Lyon |
Baishnav, Pratik | University of Passau |
Kosch, Harald | University of Passau |
Brunie, Lionel | INSA Lyon |
Keywords: Data Sharing and Privacy in V2X Systems, V2X Communication Protocols and Standards, Decision Making
Abstract: Connected vehicles generate a vast amount of data and share it with external entities such as the cloud, neighboring vehicles, Road-Side Units (RSUs), and other third-party services in a Vehicle-to-Everything (V2X) setting. This data is vulnerable: its leakage can expose vehicle owners' personal information, such as driving habits and travel routes, and can enable identity theft. Moreover, with the implementation of the General Data Protection Regulation (GDPR), it becomes imperative to empower users with control over their data and the ability to choose whom they share it with. To this end, we present a protocol that combines sticky policies with a proxy re-encryption scheme. The protocol ensures that user-defined access controls on the data persist even when crossing organizational boundaries, and it addresses the confidentiality, integrity, and accountability of vehicle data. Furthermore, we assess our protocol under a semi-honest threat model and analyze its vulnerabilities. Lastly, we perform a quantitative analysis of the data flow model to observe the system's performance.
|
|
15:00-16:15, Paper MoDT4.7 | Add to My Program |
A Vehicle-Infrastructure Multi-Layer Cooperative Decision-Making Framework |
|
Cui, Yiming | Tongji University |
Fang, Shiyu | Tongji University |
Hang, Peng | Tongji University |
Sun, Jian | Tongji University |
Keywords: Cooperative Planning Strategies in Vehicle Networks, Multi-Agent Coordination Strategies, Vehicle-to-Infrastructure (V2I) Communication
Abstract: Autonomous driving has entered the testing phase, but due to the limited decision-making capabilities of individual vehicle algorithms, safety and efficiency issues have become more apparent in complex scenarios. With the advancement of connected communication technologies, autonomous vehicles equipped with connectivity can leverage vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications, offering a potential solution to the decision-making challenges faced from the individual vehicle's perspective. We propose a multi-level vehicle-infrastructure cooperative decision-making framework for complex conflict scenarios at unsignalized intersections. First, based on vehicle states, we define a method for quantifying vehicle impacts and their propagation relationships, using accumulated impact to group vehicles through motif-based graph clustering. Next, within and between vehicle groups, a pass-order negotiation process based on Large Language Models (LLMs) determines the vehicle passage order, resulting in planned vehicle actions. Simulation results from ablation experiments show that our approach reduces negotiation complexity and ensures safer, more efficient vehicle passage at intersections, aligning with natural decision-making logic.
|
|
15:00-16:15, Paper MoDT4.8 | Add to My Program |
Robust Collaborative Perception: Combining Adversarial Training with Consensus Mechanism for Enhanced V2X Security |
|
Poibrenski, Atanas | German Research Center for Artificial Intelligence (DFKI) |
Nozarian, Farzad | German Research Center for Artificial Intelligence (DFKI) |
Rezaeianaran, Farzaneh | Saarland University, DFKI - Saarbrücken |
Müller, Christian | German Research Center for Artificial Intelligence |
Keywords: Misbehavior Detection Using Shared Data and Messages, Cooperative Perception and Localization Techniques
Abstract: Collaborative perception enhances the robustness and accuracy of autonomous systems by leveraging shared perceptual data across agents, particularly through feature-level fusion, which balances communication efficiency with contextual preservation. However, this data-sharing introduces vulnerabilities, as adversaries can inject malicious perturbations, compromising system reliability in safety-critical scenarios. In this work, we address the adversarial robustness of feature-level fusion in collaborative perception under white-box untargeted attack settings. We propose a novel framework that combines adversarial training with a consensus mechanism, enhancing resilience to adversarial perturbations in a model-agnostic manner. Our approach not only improves robustness against attacks but also enhances performance on clean data, achieving at least 5% improvement in average precision. Extensive experiments on the V2XSet dataset with four adversarial attack types and two collaborative perception methods demonstrate the effectiveness of our method, outperforming consensus defense and adversarial training alone consistently under different adversarial perturbation magnitudes. These findings underscore the potential of our approach to advance secure and reliable collaborative perception systems.
|
|
15:00-16:15, Paper MoDT4.9 | Add to My Program |
Towards V2X HD Mapping for Autonomous Driving: A Concise Review |
|
Xiao, Xu | Navinfo Co., Ltd |
Yang, Suhui | Navinfo Co., LTD |
Fan, Miao | NavInfo Co., Ltd |
Xu, Shengtong | Autohome Inc |
Liu, Xiangzeng | Xidian University |
Hu, Wenbo | Hefei University of Technology |
Xiong, Haoyi | Baidu Inc |
Keywords: Cooperative Perception and Localization Techniques, Crowdsourced Localization and Mapping
Abstract: High-definition (HD) maps are fundamental components of autonomous driving systems, providing essential centimeter-level accuracy and lane-level semantic information. While traditional HD mapping methods have evolved into online learning approaches, current solutions face significant challenges due to sensor limitations and environmental constraints. This paper presents a systematic review of HD map construction methods, tracing their evolution from conventional techniques to advanced Vehicle-to-Everything (V2X) cooperative mapping enabled by edge computing and communication technologies. Through a comprehensive analysis of methodologies, algorithms, and datasets, we identify critical challenges in current HD mapping systems. Our review encompasses three key domains: traditional mapping methods, online learning approaches, and V2X cooperative construction of HD maps. We evaluate existing solutions against standardized metrics, compare their effectiveness, and outline promising directions for future research. This work provides researchers and practitioners with a structured understanding of the HD mapping landscape and highlights opportunities for advancing autonomous driving systems.
|
|
15:00-16:15, Paper MoDT4.10 | Add to My Program |
Systematic Derivation of Generic Scenarios for Cooperative Perception Systems |
|
Stang, Christopher | ZF Friedrichshafen AG |
Hay, Julian | ZF Friedrichshafen AG |
Bogenberger, Klaus | Technical University of Munich |
Weidl, Galia | University of Applied Sciences Aschaffenburg |
Keywords: Cooperative Perception and Localization Techniques, Safety Verification and Validation Techniques, Cooperative Planning Strategies in Vehicle Networks
Abstract: As one means of addressing present and future mobility challenges, intelligent transportation systems (ITS) have been developed over the years. The overall goal of such systems is to improve traffic efficiency and safety. As an integral part of ITS for fulfilling the safety objective, cooperative perception realized by Vehicle-to-Everything (V2X) communication is considered a main contributor to enhancing safety. In recent years, several studies have explored the potential of such systems in safety use cases. As with driving assistance and automation systems, a common way of testing cooperative driving systems relies on the scenario-based testing approach. However, a systematic strategy is necessary to identify scenarios covering the majority of situations relevant to such applications. To define an extensive set of scenarios, this study comprehensively analyzes the German In-Depth Accident Study (GIDAS) data set from the perspective of V2X relevance. The proposed methodology provides a systematic approach for identifying relevant accident types and deriving generic scenarios. In addition, this approach offers the possibility of defining standard test cases for future regulations.
|
|
15:00-16:15, Paper MoDT4.11 | Add to My Program |
Connected Vehicle Experiments on Virtual Rings: Unveiling Bistable Behavior |
|
Szaksz, Bence | Budapest University of Technology and Economics, Department of A |
Molnar, Tamas Gabor | Wichita State University |
Avedisov, Sergei | Toyota Motor North America R&D - InfoTech Labs |
Stepan, Gabor | Budapest University of Technology and Economics, Department of A |
Orosz, Gabor | University of Michigan |
Keywords: Cooperative Planning Strategies in Vehicle Networks, Multi-Agent Coordination Strategies, Behavior Assessment Using Cooperative Data
Abstract: The nonlinear dynamics of vehicles on a virtual ring are investigated. A vehicle chain is considered in which a connected automated vehicle (CAV) driving at the head of the chain receives the state of a connected human-driven vehicle (CHV) at the tail. The controller of the CAV is constructed so that the CHV is projected in front of it; this closes a virtual ring. We construct the corresponding mathematical model and analyze the effect of nonlinearities with numerical continuation. Then, we present real car experiments with two CHVs and one CAV. Both the theoretical results and the experiments show bistable behavior for certain control parameters. The results provide essential support for parameter tuning during the control design of CAVs.
|
|
15:00-16:15, Paper MoDT4.12 | Add to My Program |
Evaluation of Coordination Strategies for Underground Automated Vehicle Fleets in Mixed Traffic |
|
Mironenko, Olga | Örebro University |
Banaee, Hadi | Örebro University, Sweden |
Loutfi, Amy | Örebro University |
Keywords: Multi-Agent Coordination Strategies
Abstract: This study investigates the efficiency and safety outcomes of implementing different adaptive coordination models for automated vehicle (AV) fleets, managed by a centralized coordinator that dynamically responds to human-controlled vehicle behavior. The simulated scenarios replicate an underground mining environment characterized by narrow tunnels with limited connectivity. To address the unique challenges of such settings, we propose a novel metric, Path Overlap Density (POD), to predict the efficiency and, potentially, the safety performance of AV fleets. The study also explores the impact of map features on AV fleet performance. The results demonstrate that both AV fleet coordination strategies and underground tunnel network characteristics significantly influence overall system performance. While map features are critical for optimizing efficiency, adaptive coordination strategies are essential for ensuring safe operations.
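The precise definition of POD is given in the paper; purely as an illustration, one plausible reading (the fraction of traversed tunnel segments used by more than one vehicle's path, with hypothetical node labels) can be sketched as:

```python
from collections import Counter

def path_overlap_density(paths):
    """Hypothetical reading of POD: fraction of traversed segments that are
    used by more than one vehicle's path (higher = more potential conflict)."""
    use = Counter()
    for path in paths:
        for seg in set(zip(path, path[1:])):   # directed segments, deduplicated per path
            use[seg] += 1
    if not use:
        return 0.0
    shared = sum(1 for n in use.values() if n > 1)
    return shared / len(use)

# Two AV routes through a narrow tunnel network (nodes as labels).
p1 = ["A", "B", "C", "D"]
p2 = ["E", "B", "C", "F"]
pod = path_overlap_density([p1, p2])   # segment B->C is shared
```

In narrow single-lane tunnels, shared segments are exactly where vehicles must take turns, so a density of this kind plausibly correlates with throughput loss.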
|
|
15:00-16:15, Paper MoDT4.13 | Add to My Program |
Robust V2I Channel Prediction: A Generative Approach with Implicit State Evolution |
|
Jin, Ziteng | Northwestern Polytechnical University |
Liu, Jiajia | Northwestern Polytechnical University |
Keywords: Vehicle-to-Infrastructure (V2I) Communication, Motion Forecasting, Perception Algorithms for Adverse Weather Conditions
Abstract: Intelligent transportation plays a vital role in modern urban sustainability by enhancing traffic efficiency, ensuring safety, and mitigating environmental impact. Vehicle-to-Everything (V2X) technology, particularly Vehicle-to-Infrastructure (V2I), is fundamental to this vision, enabling seamless collaboration between vehicles and infrastructure. However, reliable communication in V2I faces challenges due to high-speed mobility, dynamic environments, and stringent latency requirements. While channel alignment methods such as scanning, tracking, and prediction offer some solutions, they struggle with efficiency and adaptability in real-world conditions. To address these limitations, this paper proposes a channel prediction model based on generative learning with implicit state evolution, capturing nonlinear mappings between channel state information and communication dynamics. Additionally, a side information module incorporating temporal and spatial data enhances adaptability to varying traffic and environmental conditions. Extensive evaluations across different weather and traffic densities demonstrate the model’s robustness and superior predictive accuracy. The proposed approach provides a reliable and efficient solution for dynamic V2I communication, offering new insights for future intelligent transportation systems.
|
|
15:00-16:15, Paper MoDT4.14 | Add to My Program |
V2X-Gaussians: Gaussian Splatting for Multi-Agent Cooperative Dynamic Scene Reconstruction |
|
Jagtap, Abhishek | Technische Hochschule Ingolstadt |
Song, Rui | Fraunhofer IVI |
Tiptur Sadashivaiah, Sanath | Fraunhofer IVI |
Festag, Andreas | Technische Hochschule Ingolstadt |
Keywords: Scalable Neural Scene Representation, Cooperative Perception and Localization Techniques, 3D Scene Reconstruction Methods
Abstract: Recent advances in neural rendering, such as NeRF and Gaussian Splatting, have shown great potential for dynamic scene reconstruction in intelligent vehicles. However, existing methods rely on a single ego vehicle, suffering from limited field-of-view and occlusions, leading to incomplete reconstructions. While V2X communication may provide additional information from roadside infrastructure, it often degrades reconstruction quality due to sparse overlapping views. In this paper, we propose V2X-Gaussians, the first framework integrating V2X communication into Gaussian Splatting. Specifically, by leveraging deformable Gaussians and an iterative V2X-aware cross-ray densification approach, we enhance infrastructure-aided neural rendering and address view sparsity in V2X scenarios. In addition, to support systematic evaluation, we introduce a standardized benchmark for V2X scene reconstruction. Experiments on real-world data show that our method outperforms state-of-the-art approaches by +2.09 PSNR with only 561.8 KB for periodic V2X-data exchange, highlighting the benefits of incorporating roadside infrastructure into neural rendering for intelligent transportation systems. Our code and benchmark are publicly available under an open-source license.
|
|
15:00-16:15, Paper MoDT4.15 | Add to My Program |
CoopScenes: Multi-Scene Infrastructure and Vehicle Data for Advancing Collective Perception in Autonomous Driving |
|
Voßhans, Marcel | University of Applied Science Esslingen |
Baumann, Alexander | University of Applied Science Esslingen |
Drüppel, Matthias | Baden-Wuerttemberg Cooperative State University (DHBW) |
Ait Aider, Omar | Université Clermont Auvergne |
Mezouar, Youcef | Institut Pascal |
Dang, Thao | University of Applied Sciences, Esslingen |
Enzweiler, Markus | Esslingen University of Applied Sciences |
Keywords: Automotive Datasets, Cooperative Perception and Localization Techniques, Behavior Assessment Using Cooperative Data
Abstract: The increasing complexity of urban environments has underscored the potential of effective collective perception systems. To address these challenges, we present CoopScenes, a large-scale, multi-scene dataset that provides synchronized sensor data from both the ego-vehicle and the supporting infrastructure. The dataset provides 104 minutes of spatially and temporally synchronized data at 10 Hz, resulting in 62,000 frames, and achieves competitive synchronization with a mean deviation of only 2.3 ms. Additionally, the dataset includes a novel procedure for precise registration of point cloud data from the ego-vehicle and infrastructure sensors, automated annotation pipelines, and an open-source anonymization pipeline for faces and license plates. Covering nine diverse scenes with 100 maneuvers, the dataset features scenarios such as public transport hubs, city construction sites, and high-speed rural roads across three cities in the Stuttgart region, Germany. The full dataset amounts to 527 GB and is provided in the .4mse format, making it easily accessible through our comprehensive development kit. By providing precise, large-scale data, CoopScenes facilitates research in collective perception, real-time sensor registration, and cooperative intelligent systems for urban mobility, including machine learning-based approaches.
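The reported 2.3 ms mean synchronization deviation can be understood through nearest-timestamp frame matching between the two streams; a toy version of such matching (timestamps and the matching rule are hypothetical, not the dataset's actual pipeline) might look like:

```python
import bisect

def match_frames(ego_ts, infra_ts):
    """Pair each ego timestamp with the nearest infrastructure timestamp
    and report the mean absolute deviation (both lists sorted, in seconds)."""
    pairs = []
    for t in ego_ts:
        i = bisect.bisect_left(infra_ts, t)
        cands = [infra_ts[j] for j in (i - 1, i) if 0 <= j < len(infra_ts)]
        pairs.append((t, min(cands, key=lambda c: abs(c - t))))
    mean_dev = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return pairs, mean_dev

# A 10 Hz ego stream vs. a slightly offset infrastructure stream.
ego = [0.0, 0.1, 0.2, 0.3]
infra = [0.002, 0.101, 0.199, 0.303]
pairs, dev = match_frames(ego, infra)      # dev is the mean deviation in seconds
```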
|
|
15:00-16:15, Paper MoDT4.16 | Add to My Program |
Towards Intelligent Control Centers: Case-Based Reasoning for Waypoint Assistance |
|
Gontscharow, Martin | FZI Research Center for Information Technology; KIT Karlsruhe In |
Orf, Stefan | FZI Research Center for Information Technology |
Schotschneider, Albert | FZI Research Center of Information Technologies |
Fleck, Tobias | FZI Research Center for Information Technology |
Zöllner, J. Marius | FZI Research Center for Information Technology; KIT Karlsruhe In |
Keywords: Teleoperation Control Systems for Vehicles, Multi-Agent Coordination Strategies
Abstract: Autonomous-driving research has traditionally focused on refining on-board intelligence for individual vehicles. Recent studies highlight the potential of intelligent control centers—equipped with machine learning capabilities—to complement on-board systems by sharing insights across an entire fleet and assisting with corner cases in real time. In this paper, we advance the concept of intelligent control centers through a case-based reasoning approach for remote waypoint assistance. Our system captures and reuses operator interventions, converting human-generated solutions for unforeseen obstacles into transferable cases. The remote operator remains in the loop to validate every suggested waypoint, ensuring safety before execution. Preliminary field tests with two autonomous shuttles demonstrate the feasibility of retrieving previous waypoint interventions under realistic conditions. Our initial small-scale results indicate that (i) the prototype can achieve performance comparable to fully manual interventions and (ii) solutions devised for one vehicle can effectively be transferred to another. Together, these outcomes lay a solid foundation for more sophisticated, learning-based control-center architectures.
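Case-based waypoint retrieval as described above can be sketched as a nearest-neighbor lookup over stored operator interventions; the feature vector (obstacle x/y offset and lane width) and the tiny case base below are entirely hypothetical, not the paper's representation, and a real system would validate the retrieved waypoints with the remote operator before execution.

```python
import math

# A tiny case base for waypoint assistance: each past operator intervention is
# stored as (scene_features, waypoints). Features here are hypothetical
# (obstacle x/y offset, lane width); real systems would use richer descriptors.
CASES = [
    ((2.0, 1.0, 3.5), [(0.0, 0.0), (1.0, -1.2), (4.0, 0.0)]),
    ((8.0, -0.5, 3.0), [(0.0, 0.0), (3.0, 1.0), (9.0, 0.0)]),
]

def retrieve(query, cases):
    """Return the stored waypoint solution whose scene features are closest
    (Euclidean distance) to the current situation."""
    return min(cases, key=lambda case: math.dist(case[0], query))[1]

# A new scene that closely resembles the first recorded intervention.
plan = retrieve((2.2, 0.9, 3.4), CASES)
```

Because retrieval only compares scene features, a solution recorded on one vehicle transfers to another as long as both describe scenes in the same feature space.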
|
|
15:00-16:15, Paper MoDT4.17 | Add to My Program |
Automatic Cause Determination in Road Scene Understanding Using Qualitative Reasoning and Four-Valued Logic (I) |
|
Belmecheri, Nassim | SIMULA Research Laboratory |
Gotlieb, Arnaud | Simula Research Laboratory |
Lazaar, Nadjib | LISN, CNRS, Paris-Saclay University |
Spieker, Helge | Simula Research Laboratory |
Keywords: Behavior Assessment Using Cooperative Data, User-Centric Intelligent Vehicle Technologies, Representation Learning for Driving Scenarios
Abstract: Road scene understanding in automated driving (AD) aims to build a comprehensive analysis of video sequences taken on the road by embedded or fixed cameras (e.g., mounted on vertical road signals). One goal is to identify the relevant actors in the scene; another is to determine the causes that triggered a specific action of the ego car (i.e., stop, slow down, turn left, etc.). In a complex urban environment, these causes can be multiple, confusing, possibly contradictory to other causes, and not easily expressible using simplistic reasoning. Still, accurate automatic cause determination supports a) user acceptance, by providing appropriate explanations to the car passengers and road users, and b) increased road safety, by providing detailed road scene understanding to traffic participants. In this paper, we propose using spatiotemporal reasoning and Belnap's four-valued logic to formulate complex causes of an AD action in a road scene. We compute these causes by analysing a Qualitative eXplainable Graph (QXG), an abstract representation of the road scene that captures spatiotemporal relations between road entities. Starting from a QXG, our approach, called CaIdLogic, determines complex causes of a selected AD action occurring in a specific frame of a road scene. The usefulness of CaIdLogic is demonstrated on several scenes extracted from the well-known nuScenes dataset.
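Belnap's four-valued logic itself is standard and can be sketched compactly by encoding each value as a pair of independent evidence bits; the road-scene variables below are hypothetical examples, not taken from the paper.

```python
# Belnap's four truth values encoded as (supports_true, supports_false):
# T = told true only, F = told false only,
# B = "both" (contradictory evidence), N = "neither" (no evidence).
T, F, B, N = (True, False), (False, True), (True, True), (False, False)

def AND(a, b):
    # Conjunction: true support needs both; false support needs either.
    return (a[0] and b[0], a[1] or b[1])

def OR(a, b):
    # Disjunction is the dual of conjunction.
    return (a[0] or b[0], a[1] and b[1])

def NOT(a):
    # Negation swaps the evidence bits; B and N are fixed points.
    return (a[1], a[0])

# Two sensors disagree on "pedestrian is crossing" (B), and nothing is known
# about the signal state (N); the conjoined cause evaluates to F.
crossing = B
signal_red = N
```

This pair encoding is why the logic handles the "multiple, confusing, possibly contradictory" causes mentioned above: contradictory (B) and missing (N) evidence are first-class values rather than errors.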
|
| |