ITSC 2025 Paper Abstract


Paper FR-LA-T43.1

Marques, Alexandre (Institute of Systems and Robotics), Ferreira, Pedro (Institute of Systems and Robotics), Silva, Bruno (Institute of Systems and Robotics, University of Coimbra), Batista, Jorge (DEEC-FCTUC, ISR-Coimbra, University of Coimbra, PORTUGAL)

SMAE-DIM: Vehicle-Centric Semantic Masked AutoEncoders Pre-Training by Distilling Multimodal Foundational Models

Scheduled for presentation during the Regular Session "S43c-Multi-Sensor Fusion and Perception for Robust Autonomous Driving" (FR-LA-T43), Friday, November 21, 2025, 16:00−16:20, Stradbroke

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 18, 2025

Keywords Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles, Advanced Sensor Fusion for Robust Autonomous Vehicle Perception

Abstract

In this work, we present SMAE-DIM, a vehicle-centric semantic Masked Autoencoder (MAE) pre-training framework that distills multimodal foundational models. Our approach integrates MAE-based Masked Image Modeling (MIM) with CLIP-style distillation, leveraging a curated, large-scale vehicle dataset (Automobile1M) and a visually grounded, unpaired text corpus. Unlike traditional vision-language models that require carefully curated image-caption pairs or large-scale aligned datasets, our method distills semantic knowledge from a pre-trained CLIP model without direct image-text alignment during training. We also introduce Automobile1M, a large-scale, in-domain dataset specifically curated for vehicle-centric pre-training, along with a dedicated distillation text corpus focused on visually grounded language. We employ specialized distillation losses that enhance open-vocabulary logits during vision-language distillation by leveraging image embeddings as auxiliary pseudo-text embeddings, thereby strengthening semantic alignment between the visual and textual modalities. Experiments demonstrate that SMAE-DIM transfers effectively to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural and semantic understanding within a single pre-training framework.
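The abstract combines three ingredients: a MAE-style masked reconstruction loss, a distillation loss against frozen CLIP teacher embeddings, and open-vocabulary logits in which image embeddings serve as auxiliary pseudo-text embeddings. The paper's exact loss formulations are not given in the abstract, so the following is only a minimal NumPy sketch of how those three pieces are commonly assembled; all function names, shapes, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize embeddings along the feature axis.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mim_loss(pred_patches, target_patches, mask):
    # MAE-style MIM: MSE reconstruction averaged over MASKED patches only.
    # pred/target: (batch, num_patches, patch_dim); mask: (batch, num_patches) in {0, 1}.
    per_patch = ((pred_patches - target_patches) ** 2).mean(axis=-1)
    return (per_patch * mask).sum() / mask.sum()

def clip_distill_loss(student_emb, teacher_emb):
    # CLIP-style distillation: push student embeddings toward the frozen
    # teacher's embeddings via (1 - cosine similarity), averaged over the batch.
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return (1.0 - (s * t).sum(axis=-1)).mean()

def open_vocab_logits(image_emb, text_embs, pseudo_text_emb, temperature=0.07):
    # Open-vocabulary logits: cosine similarity of one image embedding against
    # text embeddings AUGMENTED with an image embedding acting as an auxiliary
    # pseudo-text embedding (the mechanism the abstract describes).
    img = l2_normalize(image_emb)
    candidates = l2_normalize(np.concatenate([text_embs, pseudo_text_emb[None, :]], axis=0))
    return img @ candidates.T / temperature  # shape: (num_texts + 1,)
```

In such a setup, the total pre-training objective would typically be a weighted sum, e.g. `L = mim_loss(...) + lambda * clip_distill_loss(...)`, with the open-vocabulary logits feeding a cross-entropy term; the weighting is an assumption here, not stated in the abstract.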


All Content © PaperCept, Inc.
