ITSC 2024 Paper Abstract

Paper WeBT13.3

Elgazwy, Ahmed (Ontario Tech University), Elmoghazy, Somayya (Ontario Tech University), Elgazzar, Khalid (Ontario Tech University), Khamis, Alaa (General Motors Canada)

Pedestrian Crossing Intent Prediction Using Vision Transformers

Scheduled for presentation during the Poster Session "Transformer networks" (WeBT13), Wednesday, September 25, 2024, 14:30−16:30, Foyer

2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), September 24-27, 2024, Edmonton, Canada

This information is tentative and subject to change. Compiled on October 8, 2024

Keywords Driver Assistance Systems, Automated Vehicle Operation, Motion Planning, Navigation, Sensing, Vision, and Perception

Abstract

Predicting pedestrian intention is a crucial and one of the most challenging problems for self-driving vehicles. A fast, efficient, and robust vision-based model is therefore required to predict pedestrian crossing as early as possible and help prevent serious injuries or casualties. Transformers have rapidly replaced recurrent neural network (RNN)-based architectures owing to their better generalization and faster performance. The Vision Transformer (ViT), a variant of the transformer, has also proven efficient in image classification and has outperformed state-of-the-art convolutional neural networks (CNNs) when trained on large datasets. In this paper, a fully transformer-based architecture is presented to efficiently predict pedestrian intention with minimum latency. The proposed architecture is composed of two branches: the first handles the non-visual features, while the second handles the visual features. The model is trained on the Joint Attention in Autonomous Driving (JAAD) dataset, and different variants of the architecture are tested to find the optimal model. Experimental analysis shows that the proposed model outperforms all previous state-of-the-art techniques, achieving the highest accuracy (83%) and F1 score (64%) on the testing dataset while maintaining the lowest processing time.
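As a rough illustration of the two-branch idea described above (this is a hypothetical sketch with untrained random weights, not the authors' code or architecture details): one branch projects non-visual per-frame features such as bounding boxes into tokens, the other projects ViT-style image patches into tokens, and a single self-attention layer attends over the concatenated token sequence before pooling and classification. All dimensions, feature choices, and layer counts here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k=32):
    # Single-head scaled dot-product attention with random (untrained) weights.
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d_k)) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))
    return attn @ v

# Visual branch: flatten 16x16 RGB patches of a pedestrian crop and project
# them to the shared token dimension (ViT-style patch embedding).
patches = rng.normal(size=(9, 16 * 16 * 3))          # 9 patches
W_vis = rng.normal(scale=0.01, size=(patches.shape[1], 64))
visual_tokens = patches @ W_vis                       # (9, 64)

# Non-visual branch: per-frame bounding-box features [x, y, w, h].
boxes = rng.normal(size=(8, 4))                       # 8 observed frames
W_box = rng.normal(scale=0.1, size=(4, 64))
box_tokens = boxes @ W_box                            # (8, 64)

# Fuse both token sequences, attend jointly, mean-pool, and classify into
# crossing / not-crossing.
tokens = np.concatenate([visual_tokens, box_tokens], axis=0)  # (17, 64)
fused = self_attention(tokens)                                # (17, 32)
pooled = fused.mean(axis=0)                                   # (32,)
W_cls = rng.normal(scale=0.1, size=(32, 2))
p_cross = softmax(pooled @ W_cls)                             # (2,) probabilities
print(p_cross.shape)  # (2,)
```

A trained version would replace the random projections with learned weights, stack several attention layers per branch, and add positional embeddings; the sketch only shows how heterogeneous visual and non-visual tokens can share one attention mechanism.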

All Content © PaperCept, Inc.
