ITSC 2025 Paper Abstract

Paper FR-LA-T42.2

Ataei, Denise (University of California, Santa Cruz), Paranjape, Ishaan (University of California, Santa Cruz), Whitehead, Jim (UC Santa Cruz)

Enhancing Autonomous Vehicle Test Scenario Reasoning in Language Models

Scheduled for presentation during the Regular Session "S42c-Safety and Risk Assessment for Autonomous Driving Systems" (FR-LA-T42), Friday, November 21, 2025, 16:20−16:40, Broadbeach 3

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 18, 2025

Keywords: Autonomous Vehicle Safety and Performance Testing, Multi-vehicle Coordination for Autonomous Fleets in Urban Environments, Digital Twin Modeling for ITS Infrastructure and Traffic Simulation

Abstract

Scenario-based testing is a promising approach to evaluating autonomous vehicles for safety because it can evaluate several components at once. Automated generation of these scenarios in simulation is needed to meet the scale and diversity that such testing requires. Large Language Models (LLMs) can address this need because of their world-modeling ability. However, these models reason ineffectively, which limits their ability to generate complex, dynamic vehicle interaction scenarios. In this paper, we present Cruzway Scenario Reasoner, an LLM-based system that enhances the reasoning capabilities of language models on complex vehicle interaction questions from the Waymo Open Motion Dataset - Reasoning dataset. The system consists of a suite of prompting approaches, including both chain-of-thought prompting and prompting based on model-based task planning in the Planning Domain Definition Language (PDDL). It also contains LLM-as-a-judge modules for evaluating generated responses and PDDL models. With this system, we improve the reasoning capabilities of the OpenAI GPT-4o mini model. We also provide an in-depth qualitative analysis of language model responses to 15 scenarios, categorized by the complexity of the information provided.
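To make the three kinds of modules named in the abstract more concrete, the following is a minimal, illustrative sketch of how chain-of-thought, PDDL-based, and judge prompts might be assembled. It is not the authors' implementation. The `ScenarioQuestion` class, the function names, the prompt wording, and the 1-to-10 rating scale are all assumptions made for illustration, and no LLM API is called.

```python
from dataclasses import dataclass


@dataclass
class ScenarioQuestion:
    """Hypothetical container for one vehicle-interaction question."""
    context: str   # natural-language description of the scene
    question: str  # what the model is asked to reason about


def chain_of_thought_prompt(q: ScenarioQuestion) -> str:
    """Build a prompt that asks for step-by-step reasoning before the answer."""
    return (
        f"Scene: {q.context}\n"
        f"Question: {q.question}\n"
        "Think through the interaction step by step, then give a final answer."
    )


def pddl_prompt(q: ScenarioQuestion) -> str:
    """Build a prompt that asks for a PDDL model of the scene first."""
    return (
        f"Scene: {q.context}\n"
        f"Question: {q.question}\n"
        "First write a PDDL domain and problem that model the agents and "
        "their possible actions. Then use that model to answer the question."
    )


def judge_prompt(q: ScenarioQuestion, response: str) -> str:
    """Build a prompt for a second model call that grades a response."""
    # The 1-10 scale is an assumption, not taken from the paper.
    return (
        f"Scene: {q.context}\n"
        f"Question: {q.question}\n"
        f"Candidate response: {response}\n"
        "Rate the response's correctness from 1 to 10 and explain the rating."
    )


if __name__ == "__main__":
    # Made-up example scene, not drawn from the dataset.
    q = ScenarioQuestion(
        context="The ego vehicle approaches an unprotected left turn while "
                "an oncoming car travels straight through the intersection.",
        question="Should the ego vehicle yield to the oncoming car?",
    )
    print(chain_of_thought_prompt(q))
    print(pddl_prompt(q))
    print(judge_prompt(q, "Yes, the ego vehicle should yield."))
```

Keeping prompt construction separate from any model call means each prompting approach can be swapped or compared on the same question without touching the rest of the pipeline.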