Can Self-Supervised Learning Drive Autonomous Cars Across Cities?

Autonomous driving models face significant challenges when moving to new cities, but self-supervised learning techniques offer a promising solution.
Autonomous driving models trained on multi-city datasets often fail to adapt when deployed in an unfamiliar city, particularly when they rely on standard supervised ImageNet-pretrained backbones. Geographic and cultural cues learned from the training cities can become crutches that do not transfer to new urban environments. This study examines zero-shot cross-city generalization for autonomous vehicles, asking whether self-supervised visual representations can bridge this gap.
The Challenge of City-Specific Cues
When training and evaluation cover mixed geographic data, models may inadvertently exploit city-specific cues, masking failures that only surface once the vehicle is deployed in a new location. This study investigates these challenges by integrating self-supervised backbones, specifically I-JEPA, DINOv2, and MAE, into trajectory planning frameworks.
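The integration point described above can be sketched as a planner that is agnostic to which backbone produced its features, so a supervised encoder or a self-supervised one (I-JEPA, DINOv2, MAE) can be swapped in. This is a minimal illustrative sketch, not the paper's actual architecture; the names `FeatureEncoder` and `TrajectoryPlanner` and the toy pooling/extrapolation logic are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# Illustrative sketch: the planner takes any encoder, so the backbone
# (supervised ResNet vs. self-supervised ViT) becomes a pluggable choice.
FeatureEncoder = Callable[[Sequence[float]], List[float]]

def mean_pool_encoder(pixels: Sequence[float]) -> List[float]:
    """Stand-in for a frozen backbone: maps an image to a feature vector.

    A real system would run e.g. a DINOv2 ViT here; we just mean-pool.
    """
    return [sum(pixels) / len(pixels)]

class TrajectoryPlanner:
    """Planning head that never needs to know which backbone it sits on."""

    def __init__(self, encoder: FeatureEncoder, horizon: int = 3):
        self.encoder = encoder
        self.horizon = horizon

    def plan(self, pixels: Sequence[float]) -> List[Tuple[float, float]]:
        feats = self.encoder(pixels)
        # Toy head: extrapolate (x, y) waypoints from the pooled feature.
        return [(t * feats[0], 0.0) for t in range(1, self.horizon + 1)]

planner = TrajectoryPlanner(mean_pool_encoder)
waypoints = planner.plan([0.2, 0.4, 0.6])  # 3 future (x, y) waypoints
```

Because only the `encoder` argument changes between runs, the comparison the study performs (supervised vs. self-supervised backbone, identical planner) stays controlled.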
Using strict geographic splits, tests were conducted on the nuScenes dataset in open-loop settings and on NAVSIM in closed-loop evaluations. The results highlight a significant generalization gap when transferring models between cities with varying road topologies and driving norms. Notably, moving from right-side to left-side driving environments exacerbates these issues. However, self-supervised representation learning appears to mitigate these challenges.
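A strict geographic split of the kind described above can be sketched as follows, assuming each scene record carries a city-identifying `location` field (nuScenes scene locations such as "boston-seaport" and "singapore-onenorth" do). The function name and dictionary layout here are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of a strict geographic split: train on one city,
# hold out every other city for zero-shot evaluation.

def geographic_split(scenes, train_city="boston"):
    """Partition scenes by city so held-out cities are never seen in training."""
    train, heldout = [], []
    for scene in scenes:
        if scene["location"].startswith(train_city):
            train.append(scene)
        else:
            heldout.append(scene)
    return train, heldout

scenes = [
    {"token": "a", "location": "boston-seaport"},
    {"token": "b", "location": "singapore-onenorth"},
    {"token": "c", "location": "singapore-queenstown"},
]
train, heldout = geographic_split(scenes, train_city="boston")
```

The point of the strict split is that nothing from the evaluation city leaks into training, so any success on the held-out city must come from transferable representations rather than memorized local cues.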
Numbers Don't Lie
Consider the data: in open-loop evaluations, a traditional supervised backbone transferred from Boston to Singapore showed an L2 displacement ratio inflated by 9.77 times and a collision ratio inflated by 19.43 times. With self-supervised pretraining, those ratios dropped to 1.20 and 0.75 respectively.
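To make the metric concrete: each inflation ratio is the cross-city error divided by the same model's in-city error. The raw error values below are hypothetical placeholders chosen to reproduce the article's reported ratios; only the ratios themselves (9.77 and 1.20 for L2 displacement) come from the text.

```python
# Inflation ratio = cross-city metric / in-city metric.
# A ratio near 1.0 means the model transfers cleanly.

def inflation_ratio(cross_city: float, in_city: float) -> float:
    return cross_city / in_city

IN_CITY_L2 = 0.8  # hypothetical in-city L2 error (meters), for illustration

supervised_l2_ratio = inflation_ratio(cross_city=9.77 * IN_CITY_L2,
                                      in_city=IN_CITY_L2)
ssl_l2_ratio = inflation_ratio(cross_city=1.20 * IN_CITY_L2,
                               in_city=IN_CITY_L2)
```

Reading the results this way normalizes away each model's absolute error and isolates how much performance degrades purely from the change of city.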
In closed-loop evaluations, self-supervised models improved PDMS by up to 4 percent across all single-city training scenarios. Together, these findings establish zero-shot geographic transfer as an essential test for evaluating autonomous driving systems, and they show that the choice of representation learning critically affects cross-city planning robustness.
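For readers unfamiliar with the closed-loop metric, PDMS (the NAVSIM PDM score) combines multiplicative safety gates with a weighted average of driving-quality sub-scores. The sketch below follows my understanding of NAVSIM's published formula (no at-fault collision NC and drivable-area compliance DAC as hard gates; ego progress EP, time-to-collision TTC, and comfort C averaged with weights 5/5/2); treat the exact weights as an assumption if working from a different benchmark version.

```python
# Hedged sketch of a PDMS-style aggregation (all sub-scores in [0, 1]).

def pdm_score(nc: float, dac: float, ep: float, ttc: float,
              comfort: float) -> float:
    """Multiplicative safety gates times a weighted quality average."""
    weighted_quality = (5 * ep + 5 * ttc + 2 * comfort) / 12
    return nc * dac * weighted_quality

# Example: a safe run (nc = 1.0) with minor drivable-area violations.
score = pdm_score(nc=1.0, dac=0.9, ep=0.8, ttc=0.9, comfort=1.0)
```

The multiplicative gates explain why cross-city collision failures are so costly in closed loop: a single at-fault collision drives `nc`, and with it the whole score, toward zero regardless of how smooth the trajectory was.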
Why It Matters
Why should we care about these results? Because they expose an important flaw in how we prepare autonomous systems for real-world conditions. The paper, published in Japanese, shows that reliance on supervised backbones can create blind spots that only surface in a new city. So why aren't more developers pivoting to self-supervised learning?
The data makes a compelling case: to achieve truly adaptable and safe autonomous vehicles, the industry must take these techniques seriously. As cities continue to evolve, so too must the systems navigating them.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Evaluation: The process of measuring how well an AI model performs on its intended task.
ImageNet: A massive image dataset containing over 14 million labeled images across 20,000+ categories.
Representation learning: The idea that useful AI comes from learning good internal representations of data.