SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Li, Peizheng; Zhang, Zhenghao; Holtz, David; Yu, Hang; Yang, Yutong; Lai, Yuzhi; Song, Rui; Geiger, Andreas; Zell, Andreas

Peizheng Li¹,²,*, Zhenghao Zhang¹,⁴,*, David Holtz¹, Hang Yu¹,⁵, Yutong Yang¹,⁶, Yuzhi Lai², Rui Song⁷, Andreas Geiger²,³, Andreas Zell²

¹Mercedes-Benz AG ²University of Tübingen ³Tübingen AI Center
⁴TU Munich ⁵Karlsruhe Institute of Technology ⁶University of Stuttgart ⁷UCLA
* Equal contribution

Figure: Spatial awareness in VLM-based end-to-end autonomous driving. (a) Constrained by insufficient 3D pre-training and discrete token-wise encoding, existing VLM-based end-to-end planners struggle to precisely ground, associate, and predict 3D spatial positions, which limits their planning capabilities. (b) Our proposed SpaceDrive planner introduces a unified 3D coordinate encoding that replaces the original VLM's textual digit tokens and augments the visual features, explicitly associating 3D positions with 2D perspective semantics to enhance joint spatial reasoning for E2E planning. Among current VLM-based methods, it achieves state-of-the-art driving capability in the nuScenes open-loop evaluation and the second-best driving performance in the Bench2Drive closed-loop simulation.

Abstract

End-to-end autonomous driving methods built on vision language models (VLMs) have developed rapidly, driven by the universal visual understanding and strong reasoning capabilities that VLMs obtain from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed on the corresponding 2D visual tokens to augment them. They also serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs of the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and to regress trajectory coordinates directly rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that, among existing VLM-based methods, SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark. Code will be released upon acceptance.

Method

SpaceDrive is a spatial-aware framework that enhances end-to-end planning by explicitly injecting 3D information into the VLM architecture. Surround-view images are processed by a depth estimator to obtain absolute depths, which are converted into 3D positional encodings through a universal PE encoder. The visual tokens and their 3D PEs are then added element-wise, yielding spatially aware visual tokens that serve as inputs to the VLM.
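The data flow described above (depth, then a 3D coordinate, then a positional encoding added to the visual token) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, the sinusoidal form of the encoder, and the function names `backproject`, `sinusoidal_pe`, and `add_pe`. The actual PE encoder used by SpaceDrive may differ.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point.

    Assumes a simple pinhole camera; the intrinsics are illustrative.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def sinusoidal_pe(point, dim):
    """Hypothetical stand-in for the universal PE encoder.

    Encodes each of the three axes with dim // 6 sin/cos pairs.
    """
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (3 axes x sin/cos)")
    n_freq = dim // 6
    pe = []
    for coord in point:
        for k in range(n_freq):
            freq = 1.0 / (10000 ** (k / n_freq))
            pe.append(math.sin(coord * freq))
            pe.append(math.cos(coord * freq))
    return pe


def add_pe(token, pe):
    """Element-wise sum of a visual token and its 3D positional encoding."""
    if len(token) != len(pe):
        raise ValueError("token and PE must have the same dimension")
    return [t + p for t, p in zip(token, pe)]


# Usage: one visual token (all zeros here) located at the image centre, 12 m away.
point = backproject(u=800.0, v=450.0, depth=12.0,
                    fx=1200.0, fy=1200.0, cx=800.0, cy=450.0)
spatial_token = add_pe([0.0] * 12, sinusoidal_pe(point, dim=12))
```

In a real model the token and the encoding would be tensors of the VLM's hidden size and the addition would be batched over all image patches; plain lists are used here only to keep the sketch dependency-free.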

In addition, text prompts for the various reasoning tasks are fed into the VLM as text tokens. Notably, coordinates within these prompts are processed separately by the same PE encoder to generate universal PEs, which replace the corresponding original text tokens.
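As a rough illustration of the prompt handling described above, the sketch below pulls 3D coordinates out of a prompt and leaves a placeholder where each one stood, so that the extracted values could be passed to the PE encoder while the remaining text is tokenized as usual. The `(x, y, z)` prompt syntax, the `<PE>` placeholder, and the function name `split_coordinates` are hypothetical choices for this example, not details from the paper.

```python
import re

# A signed decimal number, e.g. "12.5", "-3", "+0.8".
NUM = r"[-+]?\d+(?:\.\d+)?"
# Assumed prompt syntax: a parenthesised triple "(x, y, z)".
COORD = re.compile(rf"\(\s*({NUM})\s*,\s*({NUM})\s*,\s*({NUM})\s*\)")


def split_coordinates(prompt, placeholder="<PE>"):
    """Replace each coordinate triple with a placeholder.

    Returns the rewritten prompt and the coordinates in order of
    appearance, so each placeholder can later be filled with an encoding.
    """
    coords = []

    def _swap(match):
        coords.append(tuple(float(g) for g in match.groups()))
        return placeholder

    text = COORD.sub(_swap, prompt)
    return text, coords


# Usage: the text goes to the tokenizer, the coordinates to the PE encoder.
text, coords = split_coordinates("What is the object at (12.5, -3.0, 0.8)?")
```

Keeping the coordinates as floats rather than digit tokens is what lets a single encoder handle positions from images, ego-states, and prompts alike.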

Quantitative Results

Open-loop planning results on the nuScenes benchmark.

SpaceDrive achieves state-of-the-art performance among all VLM-based methods in open-loop planning on nuScenes.

Closed-loop planning results on the Bench2Drive benchmark.

SpaceDrive achieves a Driving Score of 78.02 (second-best among VLM-based planners) in closed-loop simulation on Bench2Drive.

Qualitative Results

Construction site blocks one direction of a two-lane road.

A kid suddenly runs out from a parked car.

Give way to ambulance coming from behind.

Wait for cyclist crossing road right after turning.

Car door of a parked car blocks the lane.

Navigate around cyclists in heavy traffic at night.

Closed-loop visualization.

BibTeX

@article{li2025spacedrive,
  title={SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving},
  author={Li, Peizheng and Zhang, Zhenghao and Holtz, David and Yu, Hang and Yang, Yutong and Lai, Yuzhi and Song, Rui and Geiger, Andreas and Zell, Andreas},
  journal={arXiv preprint arXiv:2512.10719},
  year={2025}
}

More Works from Our Lab

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

PowerBEV: A Powerful Yet Lightweight Framework for Instance Prediction in Bird's-Eye View

SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving

TQD-Track: Temporal Query Denoising for 3D Multi-Object Tracking
