ITSC 2025 Paper Abstract


Paper FR-LM-T31.4

Zhou, Lanqi (Tongji University), Zhang, Lin (Tongji University), Chi, Zhaozhan (Tongji University), Wang, Han (Tongji University), Pan, Wei (Tongji University), Meng, Qiang (Tongji University), Chu, Hongqing (Tongji University), Chen, Hong (Tongji University)

Real-Time Vision-Language Model Guided Semantic-Aware Diffusion Navigation with Intelligent Vehicle Validation

Scheduled for presentation during the Regular Session "S31a-AI-Driven Motion Prediction and Safe Control for Autonomous Systems" (FR-LM-T31), Friday, November 21, 2025, 11:30−11:50, Southport 1

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change.

Keywords Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles, Real-time Motion Planning and Control for Autonomous Vehicles in ITS Networks, Verification of Autonomous Vehicle Sensor Systems in Real-world Scenarios

Abstract

In autonomous driving systems, enabling vehicles to comprehend and execute high-level semantic instructions from humans is a crucial direction for advancing natural interactive navigation. To this end, this paper proposes a vision-language model guided semantic-aware diffusion navigation method (VL-SAN) for semi-structured campus road scenarios, establishing a strategy for translating natural language instructions into vehicle motion control. The method takes natural language, raw images, and their high-dimensional visual features as joint inputs, leveraging a vision-language model (VLM) to generate semantically guided drivable region maps and thereby achieving multimodal information fusion and semantic alignment. Furthermore, this paper restructures the denoising diffusion probabilistic model (DDPM) to adapt it to the sequential trajectory generation task and introduces a lightweight convolutional neural network to streamline the encoding of image conditions. Using the current and historical frame sequences of semantic regions as conditional inputs, the model predicts a continuous sequence of the vehicle’s future trajectory points and performs motion control based on the current vehicle state. Experimental results show that our method produces trajectories more consistent with human navigation patterns, reduces inference time by 80.2% and navigation time by 27.9% compared to the baseline algorithm, and achieves a higher success rate, validating its effectiveness and real-time performance in real-world campus navigation scenarios.
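As a rough illustration of the conditioned trajectory denoising described above, the sketch below implements standard DDPM ancestral sampling for a flattened sequence of (x, y) trajectory points, conditioned on an embedding of stacked current and historical semantic region maps produced by a lightweight CNN. All module names, dimensions, step counts, and the noise schedule here are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of conditional DDPM sampling for trajectory points.
    # All shapes, names, and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn

    T = 100                                # number of diffusion steps (assumed)
    H, W, K = 64, 64, 4                    # semantic-map size, history length (assumed)
    N_PTS = 16                             # trajectory points to predict (assumed)

    betas = torch.linspace(1e-4, 0.02, T)  # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    class CondEncoder(nn.Module):
        """Lightweight CNN embedding current + historical semantic region maps."""
        def __init__(self, in_ch=K, dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, dim),
            )
        def forward(self, maps):           # maps: (B, K, H, W)
            return self.net(maps)          # (B, dim)

    class EpsNet(nn.Module):
        """Predicts the noise added to a flattened (x, y) trajectory sequence."""
        def __init__(self, n_pts=N_PTS, cond_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * n_pts + cond_dim + 1, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, 2 * n_pts),
            )
        def forward(self, x_t, t, cond):            # x_t: (B, 2*n_pts)
            t_feat = t.float().unsqueeze(-1) / T    # crude scalar timestep feature
            return self.mlp(torch.cat([x_t, t_feat, cond], dim=-1))

    @torch.no_grad()
    def sample_trajectory(eps_net, cond):
        """Standard DDPM ancestral sampling, conditioned on the map embedding."""
        B = cond.shape[0]
        x = torch.randn(B, 2 * N_PTS)               # start from pure Gaussian noise
        for t in reversed(range(T)):
            t_b = torch.full((B,), t, dtype=torch.long)
            eps = eps_net(x, t_b, cond)
            a, ab = alphas[t], alpha_bars[t]
            mean = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[t]) * noise
        return x.view(B, N_PTS, 2)                  # (B, N_PTS, 2) future (x, y) points

    # Usage: encode K semantic region frames, then denoise a trajectory.
    enc, eps_net = CondEncoder(), EpsNet()
    maps = torch.rand(1, K, H, W)                   # placeholder semantic region maps
    traj = sample_trajectory(eps_net, enc(maps))
    print(traj.shape)                               # torch.Size([1, 16, 2])

In this sketch the conditioning embedding is simply concatenated with the noisy trajectory and a scalar timestep feature; the paper's actual condition-encoding scheme and denoising network may differ.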

