Can Self-Supervised Learning Drive Autonomous Cars Across Cities?

Autonomous driving models face significant challenges when moving to new cities, but self-supervised learning techniques offer a promising solution.
Autonomous driving models trained on multi-city datasets often fail to adapt when deployed in an unfamiliar city, particularly when they rely on standard supervised ImageNet-pretrained backbones. Geographic and cultural cues learned from the training cities can become crutches that do not transfer to new urban environments. This study examines zero-shot cross-city generalization for autonomous vehicles, asking whether self-supervised visual representations can bridge this gap.
The Challenge of City-Specific Cues
When training and evaluation cover mixed geographic data, models may inadvertently exploit city-specific cues, masking failures that only surface once the vehicle is deployed in a new location. This study investigates these challenges by integrating self-supervised backbones, specifically I-JEPA, DINOv2, and MAE, into trajectory planning frameworks.
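The integration point described above can be sketched as a planner that is agnostic to which backbone produced its features, so a supervised encoder or a self-supervised one (I-JEPA, DINOv2, MAE) can be swapped in. This is a minimal illustrative sketch, not the paper's actual architecture; the names `FeatureEncoder` and `TrajectoryPlanner` and the toy pooling/extrapolation logic are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# Illustrative sketch: the planner takes any encoder, so the backbone
# (supervised ResNet vs. self-supervised ViT) becomes a pluggable choice.
FeatureEncoder = Callable[[Sequence[float]], List[float]]

def mean_pool_encoder(pixels: Sequence[float]) -> List[float]:
    """Stand-in for a frozen backbone: maps an image to a feature vector.

    A real system would run e.g. a DINOv2 ViT here; we just mean-pool.
    """
    return [sum(pixels) / len(pixels)]

class TrajectoryPlanner:
    """Planning head that never needs to know which backbone it sits on."""

    def __init__(self, encoder: FeatureEncoder, horizon: int = 3):
        self.encoder = encoder
        self.horizon = horizon

    def plan(self, pixels: Sequence[float]) -> List[Tuple[float, float]]:
        feats = self.encoder(pixels)
        # Toy head: extrapolate (x, y) waypoints from the pooled feature.
        return [(t * feats[0], 0.0) for t in range(1, self.horizon + 1)]

planner = TrajectoryPlanner(mean_pool_encoder)
waypoints = planner.plan([0.2, 0.4, 0.6])  # 3 future (x, y) waypoints
```

Because only the `encoder` argument changes between runs, the comparison the study performs (supervised vs. self-supervised backbone, identical planner) stays controlled.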
Using strict geographic splits, tests were conducted on the nuScenes dataset in open-loop settings and on NAVSIM in closed-loop evaluations. The results highlight a significant generalization gap when transferring models between cities with varying road topologies and driving norms. Notably, moving from right-side to left-side driving environments exacerbates these issues. However, self-supervised representation learning appears to mitigate these challenges.
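A strict geographic split of the kind described above can be sketched as follows, assuming each scene record carries a city-identifying `location` field (nuScenes scene locations such as "boston-seaport" and "singapore-onenorth" do). The function name and dictionary layout here are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of a strict geographic split: train on one city,
# hold out every other city for zero-shot evaluation.

def geographic_split(scenes, train_city="boston"):
    """Partition scenes by city so held-out cities are never seen in training."""
    train, heldout = [], []
    for scene in scenes:
        if scene["location"].startswith(train_city):
            train.append(scene)
        else:
            heldout.append(scene)
    return train, heldout

scenes = [
    {"token": "a", "location": "boston-seaport"},
    {"token": "b", "location": "singapore-onenorth"},
    {"token": "c", "location": "singapore-queenstown"},
]
train, heldout = geographic_split(scenes, train_city="boston")
```

The point of the strict split is that nothing from the evaluation city leaks into training, so any success on the held-out city must come from transferable representations rather than memorized local cues.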
Numbers Don't Lie
Consider the data: in open-loop evaluations, a traditional supervised backbone transferred from Boston to Singapore showed an L2 displacement ratio inflated by 9.77 times and a collision ratio inflated by 19.43 times. With self-supervised pretraining, those ratios dropped to 1.20 and 0.75 respectively.
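To make the metric concrete: each inflation ratio is the cross-city error divided by the same model's in-city error. The raw error values below are hypothetical placeholders chosen to reproduce the article's reported ratios; only the ratios themselves (9.77 and 1.20 for L2 displacement) come from the text.

```python
# Inflation ratio = cross-city metric / in-city metric.
# A ratio near 1.0 means the model transfers cleanly.

def inflation_ratio(cross_city: float, in_city: float) -> float:
    return cross_city / in_city

IN_CITY_L2 = 0.8  # hypothetical in-city L2 error (meters), for illustration

supervised_l2_ratio = inflation_ratio(cross_city=9.77 * IN_CITY_L2,
                                      in_city=IN_CITY_L2)
ssl_l2_ratio = inflation_ratio(cross_city=1.20 * IN_CITY_L2,
                               in_city=IN_CITY_L2)
```

Reading the results this way normalizes away each model's absolute error and isolates how much performance degrades purely from the change of city.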
In closed-loop evaluations, self-supervised models improved PDMS by up to 4 percent across all single-city training scenarios. Together, these findings establish zero-shot geographic transfer as an essential test for evaluating autonomous driving systems, and they show that the choice of representation learning critically affects cross-city planning robustness.
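For readers unfamiliar with the closed-loop metric, PDMS (the NAVSIM PDM score) combines multiplicative safety gates with a weighted average of driving-quality sub-scores. The sketch below follows my understanding of NAVSIM's published formula (no at-fault collision NC and drivable-area compliance DAC as hard gates; ego progress EP, time-to-collision TTC, and comfort C averaged with weights 5/5/2); treat the exact weights as an assumption if working from a different benchmark version.

```python
# Hedged sketch of a PDMS-style aggregation (all sub-scores in [0, 1]).

def pdm_score(nc: float, dac: float, ep: float, ttc: float,
              comfort: float) -> float:
    """Multiplicative safety gates times a weighted quality average."""
    weighted_quality = (5 * ep + 5 * ttc + 2 * comfort) / 12
    return nc * dac * weighted_quality

# Example: a safe run (nc = 1.0) with minor drivable-area violations.
score = pdm_score(nc=1.0, dac=0.9, ep=0.8, ttc=0.9, comfort=1.0)
```

The multiplicative gates explain why cross-city collision failures are so costly in closed loop: a single at-fault collision drives `nc`, and with it the whole score, toward zero regardless of how smooth the trajectory was.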
Why It Matters
Why should we care about these results? Because they expose an important flaw in how we prepare autonomous systems for real-world conditions. The paper, published in Japanese, shows that reliance on supervised backbones can create blind spots that only surface in a new city. So why aren't more developers pivoting to self-supervised learning?
The data makes a compelling case: to achieve truly adaptable and safe autonomous vehicles, the industry must take these techniques seriously. As cities continue to evolve, so too must the systems navigating them.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Evaluation: The process of measuring how well an AI model performs on its intended task.
ImageNet: A massive image dataset containing over 14 million labeled images across 20,000+ categories.
Representation learning: The idea that useful AI comes from learning good internal representations of data.