ITSC 2024 Paper Abstract

Paper FrAT13.2

Yang, Xiao (Lanzhou University), Zhao, Rui (Lanzhou University), Zhi, Peng (Lanzhou University), Zhang, Yichi (Lanzhou University), Zhou, Qingguo (Lanzhou University), Liu, Gang (Lanzhou University, School of Information Science and Engineering)

Spatial Inception Pillars: Enhancing Perceptual Robustness for 3D Object Detection

Scheduled for presentation during the Poster Session "3D Object Detection" (FrAT13), Friday, September 27, 2024, 10:30−12:30, Foyer

2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), September 24-27, 2024, Edmonton, Canada

This information is tentative and subject to change. Compiled on April 25, 2025

Keywords Sensing, Vision, and Perception

Abstract

As a vital aspect of three-dimensional perception in autonomous driving, object detection leveraging point cloud data has garnered significant attention in recent years. Efficient real-time feature representation is crucial for 3D object detection with this data format. Current high-performing detectors owe their accuracy to three-dimensional voxel representations. However, they often fall short in computational efficiency and real-time performance, which limits their practicality in real-world applications. In contrast, detectors with pillar-based structures are more efficient, require fewer computational resources, and are easier to deploy, thereby meeting the demands of real-time applications. Unfortunately, their accuracy does not yet reach that of grid-based methods. In this paper, we present a lightweight and effective pillar-based single-stage 3D object detector named 'Spatial Inception Pillars' (SIP), which balances accuracy and efficiency. It consists of four parts: a pillar-based feature extraction encoder, a spatial feature extraction backbone network, a neck network that expands the receptive field of multi-scale features, and a versatile detection head. Specifically, in the backbone network, we introduce a Sparse Squeeze-and-Excitation (SE) network into the basic feature blocks for spatial feature enhancement and incorporate a model scaling factor that adjusts the model's depth and width for different environmental needs. In the neck network, we propose a receptive field enhancement module based on the basic inception network architecture. The network architecture utilizes multi-scale feature information at different stages, achieving high-performance detection. It is an elegant and efficient framework that does not require overly complex calculations and is easy to deploy.
Extensive experiments on the large-scale Waymo Open Dataset and nuScenes Dataset demonstrate that SIP surpasses baselines by a large margin and achieves competitive performance with real-time inference speed.
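
The abstract does not spell out how the Sparse Squeeze-and-Excitation block is built, so the following is only a minimal sketch of the general squeeze-and-excitation idea, not the authors' implementation. It assumes that "sparse" means the channel statistics are pooled over non-empty pillars only; the function name `sparse_se`, the bias-free two-layer gate, and the toy weights are all hypothetical and chosen for illustration.

```python
import math


def sparse_se(features, w1, w2):
    """Channel gating in the squeeze-and-excitation style (illustrative only).

    features: list of N rows, one per non-empty pillar, each with C floats.
    w1:       C x H weight matrix (channel reduction), hypothetical.
    w2:       H x C weight matrix (channel expansion), hypothetical.
    Returns (gated_features, gate).
    """
    n, c, h = len(features), len(features[0]), len(w2)

    # Squeeze: average every channel over the non-empty pillars only
    # (assumption: empty pillars are simply absent from `features`).
    squeezed = [sum(row[j] for row in features) / n for j in range(c)]

    # Excitation: linear layer + ReLU, then linear layer + sigmoid.
    hidden = [max(0.0, sum(squeezed[j] * w1[j][k] for j in range(c)))
              for k in range(h)]
    gate = [1.0 / (1.0 + math.exp(-sum(hidden[k] * w2[k][j] for k in range(h))))
            for j in range(c)]

    # Scale: reweight each channel of every pillar by its gate value.
    gated = [[row[j] * gate[j] for j in range(c)] for row in features]
    return gated, gate


# Toy input: 2 non-empty pillars, 2 channels, 1 hidden unit.
features = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[0.5], [0.5]]
w2 = [[1.0, -1.0]]
gated, gate = sparse_se(features, w1, w2)
print(gate)   # per-channel weights, each strictly between 0 and 1
print(gated)  # input features rescaled channel by channel
```

In this toy setup the first channel receives a gate above 0.5 and the second a gate below 0.5, so the block amplifies one channel relative to the other without changing the number of pillars or channels.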