Autonomous Vehicles meet Multimodal Foundation Models

Overview

Building safe and intelligent Autonomous Vehicles (AVs) capable of human-like reasoning is a challenging problem, pushing the limits of computer vision. Current AV systems struggle with diverse and unseen driving scenarios, necessitating a shift in research focus. Recently, multimodal large language models (MLLMs) have shown great promise in understanding human intent and solving complex, unstructured problems, and they scale gracefully with data and compute. This workshop explores leveraging MLLMs to tackle key challenges in AV.

Invited Speakers

Boris Ivanovic

NVIDIA

Hongyang Li

The University of Hong Kong & Shanghai AI Lab

Hang Zhao

Tsinghua University

Long Chen

Wayve

Katerina Fragkiadaki

Carnegie Mellon University

Schedule

Time	Event
13:50 - 14:00	Opening Remarks
14:00 - 14:30	Boris Ivanovic
14:30 - 15:00	Hongyang Li
15:00 - 15:30	Hang Zhao
15:30 - 16:00	Break
16:00 - 16:30	Long Chen
16:30 - 17:00	Katerina Fragkiadaki
17:00 - 17:30	Oral Session
17:30 - 18:00	Poster Session

Call for Papers

We welcome submissions in two formats: full-paper (8-14 pages) or short-abstract (2-4 pages). A full-paper should describe work that has not been published or accepted to another venue. A short-abstract can highlight work that has been published or accepted recently. Please use the ECCV 2024 paper template and follow the ECCV submission guidelines. Accepted papers will be posted on the website, but there will not be archival proceedings for this workshop.

Papers must be submitted through the CMT system: https://cmt3.research.microsoft.com/MLLMAV2024.

Topics

General system design of MLLMs for AV: The integration of MLLMs into AVs necessitates a reevaluation of data collection and usage, training and evaluation methodologies, and the overall system architecture.
Perception: How can we leverage MLLMs to build more robust and powerful perception models in AV? Can we have a systematic way to deal with "tail" examples that are hard for traditional methods but easy for a human driver?
Motion prediction: Can we use MLLMs to better understand the intents of other traffic participants and accurately forecast their movements?
Trajectory planning: Can we use MLLMs to enable more sophisticated and adaptable planning algorithms that account for a wider range of variables and scenarios, leading to safer and more efficient navigation?
Simulation and world models: Can MLLMs help generate more realistic and comprehensive simulation environments or build world models?
End-to-end solutions: Can MLLMs play a crucial role in end-to-end AV solutions?
Testing and safety: Can MLLMs make our AV system safer?

Important Dates

Submission Open: June 25, 2024
Submission Deadline: August 25, 2024, 11:59 PM Pacific Time (extended from August 15, 2024)
Acceptance Decision: September 3, 2024
Camera Ready Deadline: September 20, 2024

Accepted Papers

Title and Authors	Link
Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection Authors: Mehar Khurana, Neehar Peri, James Hays, Deva Ramanan	pdf
Think-Driver: From Driving-Scene Understanding to Decision-Making with Vision Language Models Authors: Qiming Zhang, Meixin Zhu, Frank Yang	pdf
T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning Authors: Weijie Wei, Fatemeh Karimi Nejadasl, Theo Gevers, Martin R. Oswald	pdf
Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models Authors: Yi Yang, Qingwen Zhang, Kei IKEMURA, Nazre Batool, John Folkesson	pdf
Distillation of Vision Language Models for Enhancing End-to-End Autonomous Driving Authors: Feng Tao, Abhirup Mallik, Chenbin Pan, Xin Ye, Yuliang Guo, Burhaneddin Yaman, Liu Ren	pdf