Reimagining GAN Hierarchies: CAT's Leap Over Traditional Barriers

Generative Adversarial Networks (GANs) have often been lauded for their hierarchical generation capabilities, supposedly refining outputs from coarse to fine stages. However, what's really happening under the hood might not be as aligned with this narrative as many believe. Can a GAN truly refine outputs across stages, or are we just seeing scale-wise image realism?

The Misalignment Challenge

Traditional multi-scale GANs, as they're often implemented, apply adversarial supervision at each scale, pushing every intermediate output toward realism at its own resolution. But here's the catch: this doesn't guarantee true hierarchical refinement of a single sample. Nothing ties one stage's output to the next, so each stage can drift toward a different image, leading to what researchers are now calling 'cross-scale trajectory misalignment.'

This misalignment is a fundamental flaw. When each stage isn't tasked with refining the last, you're not getting a coherent, refined trajectory from start to finish. Instead, each intermediate output may just as well be a different beast altogether.

Enter CAT: The Cross-scale Aligned Transformer

To tackle this, a fresh perspective emerges with the Cross-scale Aligned Transformer (CAT). CAT keeps the discriminator focused on scale-specific realism. However, its real innovation lies in enforcing a consistency regularization within the generator. By aligning intermediate outputs with the final result, it ensures a coherent refinement process.
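To make the idea concrete, here is a minimal sketch of what a cross-scale consistency term could look like. It is an illustration, not CAT's actual loss: the average-pool downsampling, the squared-error distance, and the names `downsample` and `cross_scale_consistency` are all assumptions chosen for simplicity.

```python
import numpy as np

def downsample(img, size):
    """Average-pool a square (H, W, C) image down to (size, size, C)."""
    h, w, c = img.shape
    assert h == w and h % size == 0, "sketch assumes square, evenly divisible sizes"
    f = h // size
    # Split each spatial axis into (size, f) blocks and average inside each block.
    return img.reshape(size, f, size, f, c).mean(axis=(1, 3))

def cross_scale_consistency(intermediates, final):
    """Mean squared gap between each intermediate output and the final
    output pooled down to that intermediate's resolution.
    (Hypothetical form of the regularizer, assumed for illustration.)"""
    gaps = [np.mean((x - downsample(final, x.shape[0])) ** 2) for x in intermediates]
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
final = rng.random((32, 32, 3))  # stand-in for the generator's final output

# Aligned trajectory: every stage is a coarse view of the same final image.
aligned = [downsample(final, s) for s in (8, 16)]
# Drifted trajectory: every stage is an unrelated image at its own scale.
drifted = [rng.random((s, s, 3)) for s in (8, 16)]

aligned_loss = cross_scale_consistency(aligned, final)
drifted_loss = cross_scale_consistency(drifted, final)
print(aligned_loss, drifted_loss)
```

The aligned trajectory scores exactly zero, while the drifted one does not, even though each drifted stage could look perfectly plausible to a per-scale discriminator. In a training loop, a term like this would presumably be weighted and added to the generator's adversarial loss; the weighting and the exact distance used by CAT are not given here.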

The results speak volumes. On class-conditional ImageNet-256, CAT-H/2 hit an impressive FID-50K of 1.56 with one-step inference after just 60 training epochs. That's not a modest improvement. It's a leap, outperforming established GAN and diffusion/flow models.
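For readers unfamiliar with the metric, the Fréchet Inception Distance compares the statistics of Inception-v3 features extracted from real and generated images. The standard definition is reproduced below; the symbols are the usual ones rather than anything specific to CAT, and the "50K" suffix conventionally means the score was computed over 50,000 generated samples.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features and $\mu_g, \Sigma_g$ those of the generated-image features. Lower is better, and a score of 0 means the two feature distributions have identical means and covariances.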

Why This Matters

Why should we care? Because this approach redefines how we think about multi-scale generation. It questions the very basis of what we've accepted for years in GAN hierarchies. If each stage isn't truly refining the last, then what's the point of calling it hierarchical at all? Slapping a model on a GPU rental isn't a convergence thesis. It's time to demand more from these models.

The intersection is real. Ninety percent of the projects aren't. But when they are, they redefine possibilities. With CAT, we're not just getting better numbers. We're getting a glimpse of what true hierarchical generation should look like. Show me the inference costs. Then we'll talk about what's truly groundbreaking.
