ITSC 2024 Paper Abstract

Paper WeAT13.7

Lin, Fei (Macau University of Science and Technology), Tian, Yonglin (Institute of Automation, Chinese Academy of Sciences), Wang, Yunzhe (Capital University of Economics and Business), Zhang, Tengchao (Macau University of Science and Technology), Zhang, Xinyuan (University of Chinese Academy of Sciences), Wang, Fei-Yue (Institute of Automation, Chinese Academy of Sciences)

AirVista: Empowering UAVs With 3D Spatial Reasoning Abilities Through A Multimodal Large Language Model Agent

Scheduled for presentation during the Poster Session "Large Language Models" (WeAT13), Wednesday, September 25, 2024, 10:30−12:30, Foyer

2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), September 24-27, 2024, Edmonton, Canada

This information is tentative and subject to change. Compiled on April 25, 2025

Keywords: Aerial, Marine and Surface Intelligent Vehicles; Sensing, Vision, and Perception; Multi-modal ITS

Abstract

In urban environments, demand for transportation across land, air, and sea is growing and becoming more complex. While land transportation and autonomous driving have advanced significantly, research on urban air mobility (UAM) systems is still in its early stages. This paper presents AirVista, an unmanned aerial vehicle (UAV) framework for urban airspace that is built on the Artificial Systems, Computational experiments, and Parallel execution (ACP) approach and integrated with a multimodal large language model (MLLM) agent. Because UAM tasks often require UAVs to have fine-grained spatial perception and reasoning capabilities, and existing MLLMs remain limited in exploring 3D space, this paper further proposes an instruction fine-tuning strategy that incorporates 3D spatial knowledge; experiments validate its effectiveness. Additionally, to better understand how efficiently MLLMs handle UAV tasks, this paper examines prompt fine-tuning templates tailored for UAV task decomposition. Experiments show that the prompt-tuned MLLM decomposes and executes complex UAV tasks efficiently.