ITSC 2025 Paper Abstract


Paper VP-VP.58

Liu, Pei (The Hong Kong University of Science and Technology (Guangzhou)), Zhang, Yiheng (Hohai University), Liu, Haipeng (Shanghai Li Auto Co., Ltd.), Liu, Xingyu (Shenyang Agricultural University), Meng, Huang (Jiangsu Ocean University), Chen, Junlan (Monash University), Shandong, Wang (Hohai University)

DSATFusion: Multi-Modal 3D Detection Via Vision-Text Synergy with Dynamic Sparse Attention Temporal Fusion

Scheduled for presentation during the Video Session "On-Demand Video Presentations" (VP-VP), Saturday, November 22, 2025, 08:00−18:00, On-Demand Platform

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change.

Keywords Advanced Sensor Fusion for Robust Autonomous Vehicle Perception, Real-time Object Detection and Tracking for Dynamic Traffic Environments, Multimodal Transportation Networks for Efficient Urban Mobility

Abstract

Sparse query-based paradigms have achieved remarkable success in multi-view 3D detection for autonomous driving. However, they remain constrained by two critical limitations: computational redundancy in complex scenes and ineffective detection of long-tail categories. To address these challenges, we propose DSATFusion, a novel dynamic sparse attention temporal fusion framework. DSATFusion employs a dynamic sparse attention network that adaptively identifies region-specific sparse features, significantly reducing computational overhead while keeping attention focused on critical regions. Furthermore, to harness temporal information effectively and tackle the challenges of long-term sequence modeling, we introduce a Recurrent Temporal Attention Fusion strategy that adaptively weights and fuses historical frame features, continuously updating target representations while emphasizing the most salient objects across diverse viewpoints. The integration of rich, long-term temporal information with an efficient fusion pipeline strengthens temporal sequence modeling, particularly under occlusion or in cluttered backgrounds. Finally, to improve the detection of long-tail categories, we incorporate a Vision-Language Model that establishes a similarity-driven alignment between the visual and textual domains, significantly improving fine-grained classification accuracy for rare-class objects. Extensive experiments on the nuScenes benchmark demonstrate the efficacy of DSATFusion, which achieves state-of-the-art performance across various camera-based 3D detection tasks.
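
To make the dynamic sparse attention idea concrete, below is a minimal PyTorch sketch, assuming a single-head formulation in which each query attends only to its top-k highest-scoring keys, so cost scales with k rather than with the full key set. The function name, tensor shapes, and top_k value are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def dynamic_sparse_attention(q, k, v, top_k=32):
    # q: (B, Nq, D) queries; k, v: (B, Nk, D) keys/values.
    # Score all query-key pairs, then keep only the top_k keys per query.
    scores = torch.einsum("bqd,bkd->bqk", q, k) / q.shape[-1] ** 0.5
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)        # (B, Nq, top_k)
    attn = F.softmax(topk_scores, dim=-1)                     # weights over selected keys only
    # Gather the selected value vectors and take the weighted sum.
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1])
    v_sel = v.unsqueeze(1).expand(-1, q.shape[1], -1, -1).gather(2, idx)
    return torch.einsum("bqk,bqkd->bqd", attn, v_sel)

# Example: 900 object queries attending over ~6000 image tokens.
q = torch.randn(2, 900, 256)
kv = torch.randn(2, 6000, 256)
out = dynamic_sparse_attention(q, kv, kv)   # (2, 900, 256)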
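
The Recurrent Temporal Attention Fusion strategy can be sketched in a similar spirit, under the assumption that current-frame object queries attend over buffered historical features and that a learned gate controls how strongly the fused history updates each target representation. The module below is an illustrative guess at such a design, not the paper's exact architecture.

import torch
import torch.nn as nn

class RecurrentTemporalFusion(nn.Module):
    # Hypothetical sketch: current-frame queries attend over buffered
    # historical features; a learned gate drives the recurrent update.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, curr, history):
        # curr: (B, N, D) current object queries;
        # history: (B, T*N, D) features stacked from T past frames.
        fused, _ = self.attn(curr, history, history)  # weight past frames by relevance
        g = torch.sigmoid(self.gate(torch.cat([curr, fused], dim=-1)))
        return g * fused + (1.0 - g) * curr           # gated update of target representation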
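
For the vision-text alignment, a common recipe, assumed here since the abstract does not give the exact formulation, is CLIP-style cosine similarity between projected object features and class-name text embeddings, which yields classification logits even for rare categories. All names and the temperature value below are hypothetical.

import torch
import torch.nn.functional as F

def vision_text_logits(obj_feats, text_embeds, temperature=0.07):
    # obj_feats: (N, D) object features projected into the text space;
    # text_embeds: (C, D) class-name embeddings from a frozen VLM (assumed).
    obj_feats = F.normalize(obj_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return obj_feats @ text_embeds.t() / temperature  # (N, C) cosine-similarity logits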

