ITSC 2025 Paper Abstract

Paper VP-VP.11

Li, Yaoning (Institute of Automation Chinese Academy of Sciences), Zhang, Zhang (NLPR, Institute of Automation Chinese Academy of Sciences), Li, Ming-Zhe (Beijing University of Posts and Telecommunications), Li, Da (NLPR, Institute of Automation Chinese Academy of Sciences), Wang, Liang (NLPR, Institute of Automation, Chinese Academy of Sciences)

Dynamic Gesture-Guided Pedestrian Intention Prediction Via Dual-Stream Information Fusion

Scheduled for presentation during the Video Session "On-Demand Video Presentations" (VP-VP), Saturday, November 22, 2025, 08:00−18:00, On-Demand Platform

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on April 2, 2026

Keywords: Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles, Protection Strategies for Vulnerable Road Users (Pedestrians, Cyclists, etc.)

Abstract

In autonomous driving, Pedestrian Intention Prediction (PIP) aims to determine whether a pedestrian intends to cross the street. Previous studies have explored various cues, such as the ego vehicle's speed, pedestrian bounding boxes, and human pose, to enhance PIP performance. However, most current advanced models rely primarily on spatio-temporal relationships between pedestrians and the ego vehicle and often neglect pedestrians' dynamic gestures. Omitting this fine-grained motion information can lead to underestimated crossing intentions and thus low recall, which makes such models impractical for safety-critical autonomous driving applications. To address this issue, we propose a Dynamic Gesture-Guided (DGG) PIP model, in which a dual-stream information fusion network integrates global spatio-temporal features and local gesture motions at two levels. At the feature level, a cross-attention module combines local motion features with the spatio-temporal features. At the score level, a logistic regression-based gating module adaptively selects the most reliable prediction from the dual-stream outputs. Extensive experimental results on the PIE and JAAD datasets demonstrate that the proposed DGG model significantly improves recall while maintaining high accuracy in pedestrian intention prediction.
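
The two fusion levels described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the function names (`cross_attention`, `gated_select`), the absence of learned projections, the residual connection, the gate inputs, and the 0.5 selection threshold are all assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Feature-level fusion sketch (hypothetical, no learned projections).

    queries: global spatio-temporal features, one vector per time step.
    keys/values: local gesture-motion features.
    Each query attends over all gesture features; the attended vector is
    added back to the query (an assumed residual connection), so values
    must have the same dimension as the queries.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Attention-weighted sum of the value vectors.
        attended = [sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_select(p_global, p_gesture, weights, bias):
    """Score-level fusion sketch (hypothetical gate design).

    A logistic-regression gate takes the two stream scores as inputs and
    estimates how reliable the gesture stream is; the gesture score is
    selected when the gate is at least 0.5, otherwise the global score.
    """
    gate = sigmoid(weights[0] * p_global + weights[1] * p_gesture + bias)
    return p_gesture if gate >= 0.5 else p_global


if __name__ == "__main__":
    global_feats = [[1.0, 0.0]]
    gesture_feats = [[0.0, 0.0], [0.0, 0.0]]
    gesture_vals = [[2.0, 0.0], [4.0, 2.0]]
    print(cross_attention(global_feats, gesture_feats, gesture_vals))
    print(gated_select(0.3, 0.8, (0.0, 0.0), 1.0))
```

In a trained model the gate weights and bias would be fitted on validation data, and the attention would use learned query, key, and value projections; both are omitted here to keep the sketch self-contained.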