ITSC 2024 Paper Abstract


Paper ThBT2.2

Xiong, Hui (Beijing Idriverplus Technology Co., Ltd.), Zhang, Fang (Beijing Idriverplus Technology Co., Ltd.), Fang, Xuzhi (Zhongke Atomically Precise Manufacturing Technology Co., Ltd.), Huang, Heye (University of Wisconsin-Madison), Yuan, Quan (Tsinghua University), Pan, Qinggui (Beijing Idriverplus Technology Co., Ltd.), Zhang, Dezhao (Beijing Idriverplus Technology Co., Ltd.)

DTP-M3Net: Monocular Multiclass Multistage Vulnerable Road User Detection, Tracking and Prediction for Autonomous Driving

Scheduled for presentation during the Invited Session "Towards Human-Inspired Interactive Autonomous Driving II" (ThBT2), Thursday, September 26, 2024, 14:50−15:10, Salon 5

2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), September 24–27, 2024, Edmonton, Canada

This information is tentative and subject to change. Compiled on October 14, 2024

Keywords: Sensing, Vision, and Perception; Advanced Vehicle Safety Systems; Driver Assistance Systems

Abstract

In the autonomous driving environment, Vulnerable Road Users (VRUs), including pedestrians and various types of riders with diverse visual attributes and unpredictable movements, require prompt and precise motion intention estimation to ensure their safety. However, detection methods tend to misidentify VRUs, tracking methods struggle to maintain the identities of fast-moving objects, and prediction methods rely heavily on map topology, leading to unsatisfactory accuracy and reliability in VRU prediction. This paper introduces DTP-M3Net, a monocular multiclass multistage network for VRU-oriented detection, tracking and prediction. DTP-M3Net exploits basic location and classification information, as well as historical trajectories and coarse-to-fine semantic features, to achieve comprehensive perception. Key innovations include fast multi-object detection via a deep neural network, online multi-object tracking with ego-motion compensation, and joint trajectory prediction based on a sequence-to-sequence network with spatial-temporal predictive cues. Notably, the deep convolutional features produced by the detection stage are shared by the downstream tracking and prediction modules. The effectiveness of the proposed DTP-M3Net is validated on the public MOT Challenge and VRU-Track datasets, demonstrating improvements in VRU perception.
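
The abstract highlights online multi-object tracking with ego-motion compensation. As a minimal sketch of that general idea (not the authors' published method), a common approach warps the previous frame's track boxes into the current frame using a homography estimated from ego-motion before associating them with new detections; the function name, the homography-based interface, and the numbers below are illustrative assumptions.

import numpy as np

def compensate_ego_motion(prev_boxes, H):
    """Warp last-frame track boxes into the current frame with a
    homography H estimated from ego-motion, so that association with
    new detections is not biased by camera movement.
    (Hypothetical interface; the paper's exact formulation is not shown here.)"""
    warped_boxes = []
    for x1, y1, x2, y2 in prev_boxes:
        corners = np.array([[x1, x2], [y1, y2], [1.0, 1.0]])  # homogeneous box corners
        w = H @ corners                                       # apply ego-motion homography
        w = w[:2] / w[2]                                      # back to Cartesian coordinates
        warped_boxes.append((w[0, 0], w[1, 0], w[0, 1], w[1, 1]))
    return warped_boxes

# Example: a pure 5-pixel horizontal camera translation
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(compensate_ego_motion([(10, 20, 50, 80)], H))  # -> [(15.0, 20.0, 55.0, 80.0)]

Compensating for camera motion in this way keeps the data-association step focused on the VRUs' own motion rather than on the ego vehicle's movement.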

