ITSC 2025 Paper Abstract


Paper FR-LM-T37.6

Chen, Jialei (Nagoya University), Li, Dongyue (Nagoya University), Yi, Chong (Nagoya University), Zheng, Xu (The Hong Kong University of Science and Technology - Guangzhou Campus), Ito, Seigo (Toyota Central R&D Labs., Inc.), Murase, Hiroshi (Nagoya University), Deguchi, Daisuke (Nagoya University)

Semantic-Driven Distillation for Semantic Segmentation with Unknown Classes

Scheduled for presentation during the Regular Session "S37a-Reliable Perception and Robust Sensing for Intelligent Vehicles" (FR-LM-T37), Friday, November 21, 2025, 12:10−12:30, Coolangatta 1

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 18, 2025.

Keywords: Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles

Abstract

Intelligent transportation systems (ITS) rely on semantic segmentation for dense scene understanding and safe decision-making. However, real-world ITS scenarios often involve rare objects that can significantly affect those decisions. To handle such cases, models must go beyond closed-set assumptions. Zero-shot semantic segmentation (ZSS) addresses this by allowing models to segment novel classes without labeled examples. To adapt models to these unseen classes, existing approaches typically rely on self-training strategies, where pseudo labels are generated for unlabeled regions based on high-confidence predictions. However, these methods often underutilize the semantic embeddings, which are merely employed to produce pseudo labels, thereby failing to fully exploit CLIP’s powerful vision-language alignment capabilities. To address this limitation, we propose Semantic-Driven Distillation (SDD). Specifically, SDD aggregates dense features from a segmentation model into a predicted CLS token via a weighted sum, where the weights are computed from each feature’s similarity to the original CLS token produced by the CLIP visual encoder. It then constructs, for both the predicted and the original CLS token, a probability distribution over the corresponding text embeddings, and aligns the two distributions using KL divergence. By leveraging semantic embeddings as a bridge, SDD enables the segmentation model to better align with the CLIP visual encoder, thereby inheriting CLIP’s strong vision-language matching capabilities. To further enhance the effectiveness of SDD, we introduce Region-aware Self-Training (RST), which first discovers potential object regions by clustering dense features extracted from CLIP. Within each region, high-confidence predictions are selected as pseudo labels for self-training.
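
To make the SDD mechanism concrete, the following PyTorch sketch shows one plausible reading of the abstract: dense segmentation features (assumed already projected into CLIP's embedding space) are pooled into a predicted CLS token with similarity-based weights, both CLS tokens are matched against the text embeddings, and the two resulting distributions are aligned with KL divergence. The function name, tensor shapes, and temperature tau are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def sdd_loss(dense_feats, cls_token, text_embeds, tau=0.07):
        """Semantic-Driven Distillation loss (sketch, not the authors' code).

        dense_feats: (B, N, D) dense features from the segmentation model,
                     assumed projected into CLIP's embedding space.
        cls_token:   (B, D) original CLS token from the CLIP visual encoder.
        text_embeds: (K, D) CLIP text embeddings for K class prompts.
        tau:         softmax temperature (illustrative value).
        """
        dense_feats = F.normalize(dense_feats, dim=-1)
        cls_token = F.normalize(cls_token, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)

        # Aggregate dense features into a predicted CLS token; the weights
        # come from each feature's similarity to the original CLS token.
        sim = torch.einsum('bnd,bd->bn', dense_feats, cls_token)      # (B, N)
        weights = sim.softmax(dim=-1)
        pred_cls = torch.einsum('bn,bnd->bd', weights, dense_feats)   # (B, D)
        pred_cls = F.normalize(pred_cls, dim=-1)

        # Probability distributions of both CLS tokens over the text embeddings.
        log_p = F.log_softmax(pred_cls @ text_embeds.t() / tau, dim=-1)
        q = F.softmax(cls_token @ text_embeds.t() / tau, dim=-1)

        # Align the segmentation model's distribution with CLIP's.
        return F.kl_div(log_p, q, reduction='batchmean')

Because the weighted aggregation is differentiable, the KL term backpropagates into the dense features themselves, which is how the text embeddings act as a bridge pulling the segmentation model toward the CLIP visual encoder.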
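
RST can be sketched in the same spirit. The abstract specifies only the two steps (clustering dense CLIP features into candidate regions, then selecting high-confidence predictions within each region), so the use of plain k-means, the number of regions, and the confidence threshold below are all assumptions.

    import torch

    def rst_pseudo_labels(clip_feats, probs, num_regions=8,
                          conf_thresh=0.9, iters=10):
        """Region-aware Self-Training pseudo-labeling (sketch).

        clip_feats: (N, D) dense features from the CLIP visual encoder,
                    one row per pixel or patch.
        probs:      (N, K) class probabilities from the segmentation model.
        Returns:    (N,) pseudo labels, with -1 where no label is assigned.
        """
        # Discover candidate object regions by clustering the CLIP features
        # (plain k-means here; the paper may use a different procedure).
        idx = torch.randperm(len(clip_feats))[:num_regions]
        centers = clip_feats[idx].clone()
        for _ in range(iters):
            assign = torch.cdist(clip_feats, centers).argmin(dim=1)   # (N,)
            for r in range(num_regions):
                mask = assign == r
                if mask.any():
                    centers[r] = clip_feats[mask].mean(dim=0)

        # Within each discovered region, keep only high-confidence predictions
        # as pseudo labels for self-training.
        conf, cls = probs.max(dim=1)
        pseudo = torch.full((len(clip_feats),), -1, dtype=torch.long)
        for r in range(num_regions):
            mask = (assign == r) & (conf > conf_thresh)
            pseudo[mask] = cls[mask]
        return pseudo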
