ITSC 2025 Paper Abstract

Paper FR-EA-T36.3

Wang, Sensen (Xi'an Jiaotong University), Duan, Yixuan (Southwest Jiaotong University), Zhang, Chi (Xi'an Jiaotong University), Liu, Yuehu (Institute of Artificial Intelligence and Robotics, Xi'an Jiaoton), Li, Li (Tsinghua University)

BiTPerceiver: Bidirectional Temporal Mamba for Online Driving Behavior Perception

Scheduled for presentation during the Regular Session "S36b-Behavior Modeling and Decision-Making in Traffic Systems" (FR-EA-T36), Friday, November 21, 2025, 14:10−14:30, Surfers Paradise 3

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 18, 2025

Keywords Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles, Real-time Object Detection and Tracking for Dynamic Traffic Environments

Abstract

Interpreting ongoing, still-incomplete driving actions and anticipating future driving intentions are formulated as Online Action Detection (OAD) and Online Action Anticipation (OAA), respectively. However, existing methods generally rely on unidirectional forward temporal modeling, whose forward-only constraint prevents subsequent frames from correcting earlier modeling errors, so misperceptions of driving behavior accumulate. To address this, we propose adding backward temporal modeling as a complementary step after forward temporal modeling. Backward temporal modeling mitigates early-stage misinterpretation by re-evaluating and reinterpreting earlier ambiguous cues in light of subsequent context. Building on this idea, we propose a unified model for OAD and OAA, named Bidirectional Temporal Perceiver (BiTPerceiver). Specifically, BiTPerceiver uses Transformer decoders to extract task-relevant information from online videos as video memory. Then, motivated by the recent success of Mamba in sequence modeling, BiTPerceiver uses Mamba to model forward-then-backward temporal dependencies in the video memory. The final modeling result includes current and future action representations. BiTPerceiver achieves state-of-the-art OAD and OAA performance on the Honda Research Institute Driving Dataset (HDD) (OAD: 41.3%, OAA: 27.3%), THUMOS'14 (OAD: 72.6%, OAA: 59.3%), and TVSeries (OAD: 89.8%, OAA: 83.6%).
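
The forward-then-backward idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not use Mamba: a plain linear recurrence stands in for the selective state-space layer, and the names `linear_scan`, `forward_then_backward`, and the `decay` parameter are hypothetical, chosen only for illustration. The sketch shows why a backward pass applied after the forward pass lets later context reach earlier memory slots.

```python
import numpy as np

def linear_scan(x, decay):
    """Causal recurrence h[t] = decay * h[t-1] + (1 - decay) * x[t].

    A simplified stand-in for a state-space layer (assumption: the real
    model uses Mamba, which has input-dependent parameters).
    x has shape (time, features).
    """
    h = np.zeros_like(x, dtype=float)
    state = np.zeros(x.shape[1], dtype=float)
    for t in range(x.shape[0]):
        state = decay * state + (1.0 - decay) * x[t]
        h[t] = state
    return h

def forward_then_backward(memory, decay=0.5):
    """Run a forward scan, then a backward scan over the forward output."""
    fwd = linear_scan(memory, decay)
    # Reverse time, scan causally, then restore the original order,
    # so each step now aggregates information from later steps.
    bwd = linear_scan(fwd[::-1], decay)[::-1]
    return fwd, bwd

# Toy "video memory": 4 time steps, 1 feature, evidence only at the last step.
memory = np.zeros((4, 1))
memory[3, 0] = 1.0
fwd, bwd = forward_then_backward(memory, decay=0.5)
print("forward only:         ", fwd[:, 0])
print("forward-then-backward:", bwd[:, 0])
```

In this toy input the forward-only output stays at zero for the first three steps, because a causal scan cannot see the late evidence, while the backward pass propagates that evidence to the earlier steps. The abstract attributes the correction of early ambiguous cues to this kind of backward flow.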