ITSC 2025 Paper Abstract

Paper VP-VP.84

Zha, Ruijian (Columbia University), Liu, Bojun (Columbia University), Fu, Yongjie (Columbia University), Di, Xuan (Columbia University)

SliDeR-VLM: Interpretable Collision Prediction Via Sliding-Window Depth-Enhanced Vision-Language Models

Scheduled for presentation during the Video Session "On-Demand Video Presentations" (VP-VP), Saturday, November 22, 2025, 08:00−18:00, On-Demand Platform

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on April 2, 2026

Keywords: Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles, AI, Machine Learning and Predictive Analytics for Traffic Incident Detection and Management

Abstract

Collision detection from long dashcam video footage is challenging. Vision-Language Models (VLMs) combine visual input with semantic reasoning, which makes them well suited to interpretable analysis. However, applying a VLM directly to an entire video sequence can produce hallucinated or unreliable explanations, because the model may overlook subtle but critical events in long footage. To address this, we introduce SliDeR-VLM (Sliding-Window Depth-Enhanced Reasoning Vision-Language Model), a model designed specifically for interpretable traffic risk analysis. SliDeR-VLM predicts collision risk using two components: depth-enhanced spatial context, in which depth is rendered as color gradients that indicate how close objects are, and a sliding-window segmentation strategy that breaks each video into shorter, overlapping segments to help the model focus and to reduce hallucinations. SliDeR-VLM is fine-tuned using Group Relative Policy Optimization (GRPO), which greatly improves its ability to understand and reliably predict risk. The model provides both clear explanations and precise timing, making it an effective tool for evaluating collision risk in large dashcam datasets and for supporting better traffic management decisions. On the Nexar Collision Prediction dataset, SliDeR-VLM achieves an F1 score of 0.671 and a Critical-Time-Window Recall (CTWR), a metric that evaluates how accurately the model predicts collisions within key timeframes, of 0.728.
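
The sliding-window segmentation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the window length, the overlap, or how the final partial segment is handled, so the function name `sliding_windows`, its parameters `window` and `stride`, and the choice to add a final window aligned with the end of the video are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index pairs, end exclusive, that cover a video.

    Hypothetical helper: window and stride values are illustrative assumptions,
    not values taken from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        # A stride larger than the window would leave frames uncovered.
        raise ValueError("stride must not exceed window")
    if num_frames <= 0:
        return []
    if num_frames <= window:
        # Short video: a single segment covers everything.
        return [(0, num_frames)]

    starts = list(range(0, num_frames - window + 1, stride))
    # Assumption: add one last window flush with the end of the video
    # so that trailing frames are never dropped.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    # 11 frames, 4-frame windows, 2-frame stride (consecutive windows overlap).
    for start, end in sliding_windows(11, 4, 2):
        print(start, end)
```

Each returned segment would then be passed to the model separately, so that a brief event occupies a larger share of the input than it would in the full video. Choosing `stride` smaller than `window` gives the overlap that the abstract mentions.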