ITSC 2025 Paper Abstract

Paper WE-EA-T8.6

Yin, Wenyu (Tongji University), Zhang, ChongHao (Tongji University), Yu, Hao (Tongji University), Luo, Xiao (Tongji University), Huang, Jinyi (Tongji University), Zhang, Jialei (Tongji University), Qi, Shuwen (Tongji University)

Beyond Pixels: Vision-Language Models for Enhanced Street Environment Perception

Scheduled for presentation during the Regular Session "S08b-Intelligent Modeling and Prediction of Traffic Dynamics" (WE-EA-T8), Wednesday, November 19, 2025, 14:50−15:30, Coolangatta 2

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 19, 2025

Keywords: AI, Machine Learning for Real-time Traffic Flow Prediction and Management; Data Analytics and Real-time Decision Making for Autonomous Traffic Management

Abstract

With the continuous development of smart cities and human-oriented transportation, streets, as a key component of urban public space, significantly shape resident behavior and urban vitality. Traditional street perception methods based on semantic segmentation face challenges such as limited semantics, insufficient understanding of relationships, and poor task adaptability in open-world environments. Enhancing the accuracy and fine-grained capability of street environment perception is therefore particularly important. This study introduces Vision-Language Models to overcome the limitations of traditional frameworks. Using the Qwen2.5-VL-72B-Instruct model together with the MIT Place Pulse 2.0 dataset, experiments were carried out to verify the superiority of Vision-Language Models across six dimensions of street perception: safe, lively, clean, wealthy, depressing, and beautiful. The results indicate that Vision-Language Models not only enable fine-grained perception of street space but also flexibly adapt to various tasks through prompt-based approaches, further improving perception accuracy and personalization. Future work can optimize Vision-Language Model performance through low-parameter fine-tuning (e.g., LoRA), improve perception speed, and reduce computational cost, thereby promoting their application in urban space perception.
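
To make the prompt-based approach concrete, below is a minimal illustrative sketch (not the authors' released code) of how a single street-view image could be scored on the six perceptual dimensions with Qwen2.5-VL served behind an OpenAI-compatible endpoint such as vLLM. The endpoint URL, model identifier, 0-10 rating scale, and prompt wording are assumptions for illustration only.

# Hedged sketch: prompt a Qwen2.5-VL model (assumed to be served via an
# OpenAI-compatible API, e.g. a local vLLM server) to rate one street-view
# image on the six Place Pulse dimensions. Endpoint, model name, scale,
# and prompt wording are illustrative assumptions, not the paper's setup.
import base64
import json

from openai import OpenAI

DIMENSIONS = ["safe", "lively", "clean", "wealthy", "depressing", "beautiful"]

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local vLLM endpoint
    api_key="EMPTY",
)


def encode_image(path):
    # Read the image file and wrap it as a base64 data URL the API accepts.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"


def score_street_image(path):
    # Ask the model for a 0-10 score per dimension, returned as JSON.
    prompt = (
        "Rate this street-view image on a 0-10 scale for each of these "
        "dimensions: " + ", ".join(DIMENSIONS) + ". Reply with only a JSON "
        "object mapping each dimension to a number."
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(path)}},
            ],
        }],
        temperature=0.0,  # deterministic scoring for repeatability
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    print(score_street_image("street_view.jpg"))

Because the task definition lives entirely in the prompt, swapping dimensions or rating scales requires no retraining, which is the flexibility the abstract attributes to prompt-based approaches.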

