ITSC 2025 Paper Abstract

Paper TH-LA-T29.3

Suzuki, Hayato (Chubu University), Shimomura, Kota (Chubu University), Hirakawa, Tsubasa (Chubu University), Yamashita, Takayoshi (Chubu University), Fujiyoshi, Hironobu (Chubu University)

Enhancing Navigation Text Generation and Visual Explanation Using Spatio-Temporal Scene Graphs with Graph Attention Networks

Scheduled for presentation during the Regular Session "S29c-Human Factors and Human Machine Interaction in Automated Driving" (TH-LA-T29), Thursday, November 20, 2025, 16:40−17:00, Currumbin

2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), November 18-21, 2025, Gold Coast, Australia

This information is tentative and subject to change. Compiled on October 18, 2025

Keywords: Deep Learning for Scene Understanding and Semantic Segmentation in Autonomous Vehicles, Human-Machine Interaction Systems for Enhanced Driver Assistance and Safety, AI, Machine Learning for Real-time Traffic Flow Prediction and Management

Abstract

Navigation systems are widely used in modern vehicles. However, conventional approaches that rely on static map information struggle to adapt to dynamic changes in the surrounding environment. To address this limitation, human-like guidance has attracted increasing attention as a method that leverages image recognition to interpret driving scenes and generate natural-language navigation in a human-like manner. Scene graphs, which structurally represent relationships among objects, have proven effective for this task. However, existing methods often rely on high-dimensional visual features, which poses challenges for interpretability and scalability. In this study, we propose a novel approach for generating navigation text by constructing a spatio-temporal scene graph that uses only object positions and class labels as node information, yielding a more compact and interpretable graph representation. The proposed system generates natural-language navigation with a graph-to-text model based on Graph Attention Networks (GAT). Furthermore, we incorporate vehicle motion information at intersections into the graph and introduce mechanisms that strengthen attention to important nodes, enabling visual interpretation of the model's decision-making process through attention visualization. Experimental results show that the proposed method outperforms existing Convolutional Neural Network (CNN)- and Transformer-based approaches, particularly in integrating long-term temporal information for text generation.
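The abstract does not give implementation details of the GAT-based graph-to-text model. As a rough illustration only, a single graph-attention head in the style of Veličković et al. (2018) can be sketched as below; all array names, dimensions, and the use of NumPy are assumptions for this sketch, not the authors' implementation. The per-edge attention coefficients it returns are the kind of quantity that can be visualized to inspect which scene-graph nodes the model attends to.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gat_layer(H, adj, W, a, neg_slope=0.2):
    """One single-head graph attention layer (illustrative sketch).

    H:   (N, F)   node features, e.g. object position + one-hot class label
    adj: (N, N)   binary adjacency with self-loops
    W:   (F, Fp)  shared linear transform
    a:   (2*Fp,)  attention vector

    Returns (out, attn): updated node features (N, Fp) and the
    attention matrix (N, N), whose rows sum to 1 over each node's neighbors.
    """
    Z = H @ W                          # transformed node features, (N, Fp)
    N = Z.shape[0]
    out = np.zeros_like(Z)
    attn = np.zeros((N, N))
    for i in range(N):
        nbrs = np.nonzero(adj[i])[0]
        # attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
        logits = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        logits = np.where(logits > 0, logits, neg_slope * logits)  # LeakyReLU
        coefs = softmax(logits)        # normalize over the neighborhood
        attn[i, nbrs] = coefs
        out[i] = coefs @ Z[nbrs]       # attention-weighted aggregation
    return out, attn

# Toy example: 4 scene-graph nodes, fully connected with self-loops.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))
adj = np.ones((4, 4))
W = rng.normal(size=(5, 3))
a = rng.normal(size=(6,))
out, attn = gat_layer(H, adj, W, a)
```

In a full model, several such heads would be stacked and their outputs fed to a text decoder; this sketch only shows where interpretable attention weights come from.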


All Content © PaperCept, Inc.

