ITSC 2025 Paper Abstract

Paper TH-EA-T29.4

Li, Yuanzhe (Technische Universität Berlin), Müller, Steffen (Technical University of Berlin)

ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction

Scheduled for presentation during the Regular Session "S29b-Human Factors and Human Machine Interaction in Automated Driving" (TH-EA-T29), Thursday, November 20, 2025, 14:30–14:50, Currumbin

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

Keywords: Protection Strategies for Vulnerable Road Users (Pedestrians, Cyclists, etc.), Human-Machine Interaction Systems for Enhanced Driver Assistance and Safety, Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles

Abstract

Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains a major challenge. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, grouped into three interaction pairs: (1) global semantic map and global optical flow, (2) local RGB image and local optical flow, and (3) ego-vehicle speed and pedestrian bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions of the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical-flow-guided attention. Within the motion interaction pair, cross-modal attention models the dynamics between the two signals, enabling effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step, and a Transformer-based temporal aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies further quantify the contribution of each module of ACIT. The source code will be released at https://github.com/lyzDE/ACIT-PIP.
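The abstract gives no implementation details, so the following PyTorch sketch is only one plausible reading of the dual-path attention block for a visual interaction pair, not the authors' code (see the linked repository for the actual implementation). The class name, feature dimensions, the choice of flow features as attention queries, and the residual fusion are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    class DualPathAttentionBlock(nn.Module):
        """Hypothetical dual-path attention block for one visual interaction pair."""

        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            # Path 1: intra-modal self-attention to enhance salient regions
            # within the primary modality (e.g., the local RGB features).
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Path 2: optical-flow-guided attention, here with flow features
            # as queries over the primary stream (an assumption; the abstract
            # does not specify the query/key/value assignment).
            self.flow_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, primary: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
            # primary, flow: (batch, num_tokens, dim) feature tokens at one
            # time step; both modalities are assumed to share a token count.
            sa_out, _ = self.self_attn(primary, primary, primary)
            x = self.norm1(primary + sa_out)        # residual + norm, path 1
            ca_out, _ = self.flow_attn(flow, x, x)  # flow-guided path 2
            return self.norm2(x + ca_out)           # fuse both paths

    # Toy usage: 7x7 feature maps flattened into 49 tokens per modality.
    block = DualPathAttentionBlock(dim=256, num_heads=8)
    rgb_tokens = torch.randn(2, 49, 256)
    flow_tokens = torch.randn(2, 49, 256)
    fused = block(rgb_tokens, flow_tokens)  # -> (2, 49, 256)

Under the same reading, the motion interaction pair (ego-vehicle speed and pedestrian bounding box) would use a single cross-modal attention between the two embedded 1-D signals, and the per-time-step fused features would then feed the Transformer-based temporal aggregation module.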
