ITSC 2025 Paper Abstract


Paper WE-EA-T4.6

Dai, Zekai (Institute of Automation, Chinese Academy of Sciences), Dai, Xingyuan (Chinese Academy of Sciences; University of Chinese Academy of Sciences), Lv, Yisheng (Institute of Automation, Chinese Academy of Sciences), Pei, Xin (Tsinghua University), Wang, Xu (Xiong'an Institute of Innovation), Gong, Xiaoyan (Chinese Academy of Sciences), Liu, Yu-Liang (Huawei), Huang, Wu-Ling (Institute of AI for Industries, Chinese Academy of Sciences)

MS-VLMDet: Multi-Scale Feature Enhanced Vision-Language Model for Pedestrian Detection

Scheduled for presentation during the Regular Session "S04b-Intelligent Perception and Detection Technologies for Connected Mobility" (WE-EA-T4), Wednesday, November 19, 2025, 14:50–15:30, Surfers Paradise 1

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia


Keywords: Real-time Object Detection and Tracking for Dynamic Traffic Environments

Abstract

Pedestrian detection, a critical component of Intelligent Transportation Systems, faces persistent challenges: pedestrians often appear against complex backgrounds, occupy only small regions of the image, and adopt diverse poses. This research proposes MS-VLMDet (Multi-Scale Feature-Enhanced Vision-Language Model for Pedestrian Detection), which integrates multi-level visual representations from a Feature Pyramid Network into a large vision-language model, overcoming such models' limitations in precisely localizing small objects. MS-VLMDet contains three key modules: a multi-scale feature extraction module that captures pedestrian features at different resolutions; a feature fusion module that integrates the extracted features with the original image and text prompts; and a feature-enhanced vision-language model inference module that uses the fused features to guide the model's attention toward pedestrian regions. Experimental results demonstrate that MS-VLMDet outperforms existing deep learning and vision-language models in small-target pedestrian detection across various complex traffic scenarios, improving the F1 score by a factor of 4.6 over the original Qwen2.5-VL model on the CityPersons dataset.
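The abstract describes the three modules only at a high level. As a rough illustration of the fusion step, the sketch below shows one plausible way FPN-level feature maps could be pooled to a common grid, projected to a vision-language model's embedding width, and flattened into additional visual tokens. All names, channel counts, and dimensions here (MultiScaleFeatureFusion, fpn_channels, vlm_dim, grid) are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureFusion(nn.Module):
    """Hypothetical sketch: resize each FPN level to a shared grid,
    project it to the VLM embedding width with a 1x1 convolution,
    sum the levels, and flatten the result into a token sequence."""

    def __init__(self, fpn_channels=(256, 256, 256), vlm_dim=1024, grid=16):
        super().__init__()
        self.grid = grid
        # One 1x1 projection per pyramid level (channel counts are assumptions).
        self.proj = nn.ModuleList(nn.Conv2d(c, vlm_dim, 1) for c in fpn_channels)

    def forward(self, fpn_maps):
        # fpn_maps: list of (B, C_i, H_i, W_i) feature maps from an FPN.
        fused = 0
        for feat, proj in zip(fpn_maps, self.proj):
            # Bring every level to a common spatial resolution before summing.
            feat = F.interpolate(feat, size=(self.grid, self.grid),
                                 mode="bilinear", align_corners=False)
            fused = fused + proj(feat)
        # Flatten (B, vlm_dim, grid, grid) -> (B, grid*grid, vlm_dim) so the
        # result can be prepended to the VLM's image/text token stream.
        return fused.flatten(2).transpose(1, 2)

if __name__ == "__main__":
    maps = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
    tokens = MultiScaleFeatureFusion()(maps)
    print(tokens.shape)  # torch.Size([1, 256, 1024])

How the fused tokens then condition the VLM's attention toward pedestrian regions is not specified in the abstract; the sketch stops at producing a token sequence compatible with a standard transformer input.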

