ITSC 2024 Paper Abstract


Paper WeBT13.9

Petrovai, Andra (Technical University of Cluj-Napoca), Miclea, Vlad (Technical University of Cluj-Napoca), Nedevschi, Sergiu (Technical University of Cluj-Napoca)

Depth-Aware Panoptic Segmentation with Mask Transformers and Panoptic Bins for Autonomous Driving

Scheduled for presentation during the Poster Session "Transformer networks" (WeBT13), Wednesday, September 25, 2024, 14:30−16:30, Foyer

2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), September 24-27, 2024, Edmonton, Canada


Keywords: Sensing, Vision, and Perception

Abstract

Depth-aware panoptic segmentation reconstructs the 3D scene from a single image through the tasks of monocular depth estimation and panoptic segmentation. Recent works have designed unified networks in which both tasks are formulated as per-pixel prediction: depth is usually learned via regression, while segmentation is posed as per-pixel classification. However, panoptic segmentation benchmarks have recently been dominated by methods that employ an alternative formulation and solve instance and semantic segmentation in a unified manner with mask classification. Motivated by this insight and by the recent success of hybrid classification-regression depth estimation methods based on adaptive bins, we propose a novel unified network based on Mask Transformers for depth-aware panoptic segmentation. Instead of predicting a global depth distribution per image, we leverage cross-task information sharing and equip the network with novel panoptic bins that perform depth discretization at the level of panoptic masks. Consequently, the network gains a finer-grained understanding of the 3D structure of the scene, leading to depth predictions that are better aligned with the visible objects. Moreover, image embeddings are efficiently shared by both tasks, yielding synergy during training and improved performance. We perform extensive experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets and demonstrate that the proposed Mask Transformer network with panoptic adaptive bins achieves state-of-the-art results and more accurate predictions than per-pixel dense prediction methods.
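
The abstract does not spell out the panoptic-bins formulation, so the following is only a minimal, hypothetical PyTorch sketch of the general idea: AdaBins-style adaptive depth bins predicted per panoptic mask, with per-pixel depth decoded as a probability-weighted sum of that mask's bin centers and the per-mask depth maps blended by soft mask assignments. All function and tensor names (adaptive_bin_centers, mask_level_depth, bin_logits, bin_prob_logits, mask_logits) and the depth range are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def adaptive_bin_centers(bin_logits, d_min=0.1, d_max=80.0):
    # Turn per-mask bin logits into ordered depth-bin centers (AdaBins-style):
    # softmax gives normalized bin widths, cumulative sums give bin edges in [d_min, d_max].
    widths = F.softmax(bin_logits, dim=-1)                                # (N_masks, N_bins)
    edges = d_min + (d_max - d_min) * torch.cumsum(widths, dim=-1)
    edges = torch.cat([torch.full_like(edges[:, :1], d_min), edges], dim=-1)
    return 0.5 * (edges[:, :-1] + edges[:, 1:])                           # (N_masks, N_bins)

def mask_level_depth(bin_logits, bin_prob_logits, mask_logits):
    # bin_logits:      (N_masks, N_bins)        per-mask adaptive bin partition
    # bin_prob_logits: (N_masks, N_bins, H, W)  per-pixel logits over that mask's bins
    # mask_logits:     (N_masks, H, W)          panoptic mask logits
    centers = adaptive_bin_centers(bin_logits)                            # (N_masks, N_bins)
    bin_probs = F.softmax(bin_prob_logits, dim=1)                         # softmax over bins
    # Per-mask depth: probability-weighted sum of that mask's bin centers.
    per_mask_depth = (bin_probs * centers[:, :, None, None]).sum(dim=1)   # (N_masks, H, W)
    # Blend the per-mask depth maps with soft mask assignments into one dense map.
    mask_probs = F.softmax(mask_logits, dim=0)                            # (N_masks, H, W)
    return (mask_probs * per_mask_depth).sum(dim=0)                       # (H, W)

# Shape check with random tensors.
N, K, H, W = 8, 64, 96, 192
depth = mask_level_depth(torch.randn(N, K), torch.randn(N, K, H, W), torch.randn(N, H, W))
print(depth.shape)  # torch.Size([96, 192])

In the paper's actual network these quantities would presumably come from the Mask Transformer decoder and its shared image embeddings; the sketch only shows how mask-level bins can turn a classification-style output into a continuous depth map.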
