ITSC 2025 Paper Abstract


Paper TH-EA-T26.1

Xing, Jiandong (Beihang University), Min, Shuai (Beihang University), Xie, Danmu (Beihang University), Wang, Xinyu (Beihang University), Kang, Letian (Beihang University), Ren, Yilong (Beihang University), Yu, Haiyang (Beihang University), Bai, Xuesong (Beihang University)

LoCo-VLM: End-To-End Autonomous Driving with a Loosely Coupled Vision-Language Model

Scheduled for presentation during the Regular Session "S26b-Motion Planning, Trajectory Optimization, and Control for Autonomous Vehicles" (TH-EA-T26), Thursday, November 20, 2025, 13:30−13:50, Broadbeach 1&2

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 18, 2025

Keywords: Real-time Motion Planning and Control for Autonomous Vehicles in ITS Networks

Abstract

Recently, end-to-end (E2E) autonomous driving methods that integrate vision-language models (VLMs) have achieved significant performance improvements in long-tail scenarios, primarily because VLMs enhance environmental understanding and reasoning. In practice, however, VLMs suffer from hallucinations and latency, producing erroneous outputs or prolonged response times. These issues can lengthen driving decision-making and cause misguidance through misinformation. To mitigate the negative impact of VLM instability, we propose LoCo-VLM, an end-to-end autonomous driving framework that is loosely coupled with the VLM through an event-triggered, parallel structure. After verifying the consistency of VLM decisions, we integrate them into the E2E system through Signal Temporal Logic (STL) and adjust the driving style to enhance autonomous driving capabilities. This design not only improves the effectiveness of VLM decision integration but also enhances the diversity and effectiveness of trajectories within a modality. Additionally, the VLM is training-free in our method, facilitating implementation and deployment. On the nuScenes dataset, our framework reliably plans trajectories with an accuracy of 0.57 m at 8.2 FPS on an RTX 4090 GPU.
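
To make the described coupling concrete, below is a minimal Python sketch of one way an event-triggered, parallel VLM query with a consistency gate could be wired up. It is an illustration only: all names (query_vlm, stl_always_safe, base_e2e_plan, the dict fields) are hypothetical stand-ins, not the paper's actual implementation, and the STL check is reduced to a toy predicate.

import queue
import threading

# Hypothetical sketch: the VLM runs in a background thread and is
# consulted only on trigger events, so its latency never blocks the
# E2E planner; its decision is applied only after a consistency check.

vlm_out = queue.Queue(maxsize=1)      # freshest VLM decision, if any
vlm_inflight = threading.Event()      # at most one query in flight

def query_vlm(snapshot):
    """Stub for a slow VLM call; may be delayed or hallucinate."""
    return {"action": "yield", "min_gap_m": 5.0}

def is_trigger_event(obs):
    """Stub event trigger: consult the VLM only in unusual scenes."""
    return obs.get("long_tail", False)

def stl_always_safe(decision, obs):
    """Toy stand-in for an STL check such as 'always keep the gap
    above min_gap_m'; inconsistent decisions are discarded."""
    return obs.get("gap_m", 0.0) >= decision.get("min_gap_m", 0.0)

def base_e2e_plan(obs, style):
    """Stub E2E planner; always produces a trajectory at frame rate."""
    return [("waypoint", i, style["behavior"]) for i in range(3)]

def _vlm_worker(snapshot):
    try:
        decision = query_vlm(snapshot)
        try:
            vlm_out.put_nowait(decision)   # keep only the newest result
        except queue.Full:
            pass
    finally:
        vlm_inflight.clear()

def planning_step(obs, style):
    # Event-triggered, parallel query: fire and forget.
    if is_trigger_event(obs) and not vlm_inflight.is_set():
        vlm_inflight.set()
        threading.Thread(target=_vlm_worker, args=(dict(obs),),
                         daemon=True).start()

    # Loose coupling: a pending VLM decision adjusts the driving style
    # only if it passes the check; otherwise the planner is unaffected.
    try:
        decision = vlm_out.get_nowait()
        if stl_always_safe(decision, obs):
            style = {**style, "behavior": decision["action"]}
    except queue.Empty:
        pass

    return base_e2e_plan(obs, style), style

if __name__ == "__main__":
    traj, style = planning_step({"long_tail": True, "gap_m": 8.0},
                                {"behavior": "normal"})
    print(traj)

Called once per frame, planning_step keeps emitting trajectories even while a VLM query is pending, which mirrors the decoupling of planning frequency from VLM latency that the abstract claims.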
