ITSC 2025 Paper Abstract


Paper FR-LM-T43.5

Liu, Peng (South China University of Technology), Lin, Hongyi (Tsinghua University), Zhao, Yiyue (Tsinghua University), Liu, Yang (Tsinghua University), Qu, Xiaobo (Tsinghua University)

DesEAD: Enhancing End-To-End Autonomous Driving with Scene Descriptions

Scheduled for presentation during the Regular Session "S43a-Multi-Sensor Fusion and Perception for Robust Autonomous Driving" (FR-LM-T43), Friday, November 21, 2025, 11:50−12:10, Stradbroke

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 18, 2025

Keywords Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles, Advanced Sensor Fusion for Robust Autonomous Vehicle Perception, Real-time Motion Planning and Control for Autonomous Vehicles in ITS Networks

Abstract

Modular autonomous driving systems based on the perception–planning–control pipeline often suffer from redundant data formats and error accumulation across modules. In contrast, end-to-end autonomous driving (E2EAD) offers a streamlined alternative by directly mapping sensor inputs to driving actions through unified feature extraction. However, sensor-based perception alone lacks a rich environmental understanding comparable to that of human cognition. Fortunately, vision-language models (VLMs) have emerged as a promising complement, capable of extracting semantic knowledge from raw images through text descriptions. This paper presents DesEAD, a novel framework that integrates E2EAD with VLMs to enhance scene comprehension in autonomous driving. DesEAD uses VLMs to extract semantic information from multi-camera images and generate textual descriptions of driving scenes. These descriptions are fused with E2EAD features to improve environmental awareness. We further evaluate multiple fusion strategies between textual and feature-level information and analyze their effects on downstream tasks. Experiments on the nuScenes dataset show that DesEAD effectively addresses the limitations of traditional E2EAD systems by incorporating semantic context. The proposed method outperforms existing state-of-the-art approaches in perception, prediction, and planning tasks.
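The abstract mentions evaluating multiple strategies for fusing VLM-generated text embeddings with E2EAD visual features. As a minimal illustration only (the function names and the two fusion variants shown here are assumptions for exposition, not DesEAD's actual architecture), two common approaches are feature concatenation and a gated elementwise blend:

```python
# Hypothetical sketch of two text-feature fusion strategies, using plain
# Python lists as stand-ins for embedding vectors. Names and variants are
# illustrative assumptions; the paper's real fusion modules are not
# described in the abstract.

def concat_fusion(visual_feat, text_feat):
    # Concatenation: stack the text embedding onto the visual feature,
    # leaving a downstream layer to learn the interaction.
    return visual_feat + text_feat

def gated_fusion(visual_feat, text_feat, gate=0.5):
    # Gated blend: weighted elementwise mix of the two modalities.
    # Requires both embeddings to share the same dimensionality.
    assert len(visual_feat) == len(text_feat)
    return [gate * v + (1.0 - gate) * t
            for v, t in zip(visual_feat, text_feat)]

visual = [1.0, 2.0]   # toy per-scene visual feature
text = [3.0, 4.0]     # toy text-description embedding
print(concat_fusion(visual, text))   # [1.0, 2.0, 3.0, 4.0]
print(gated_fusion(visual, text))    # [2.0, 3.0]
```

Concatenation preserves both modalities intact but doubles the feature width; a gated blend keeps the dimensionality fixed at the cost of mixing the signals before downstream tasks see them.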

All Content © PaperCept, Inc.
