ZACH-ViT: Reshaping Medical Imaging with a Compact Vision Transformer

ZACH-ViT challenges conventional Vision Transformers by eliminating positional embeddings and the [CLS] token, performing well in medical imaging without relying on spatial layout.
Vision Transformers have made waves in natural image processing, aided in part by positional embeddings and class tokens. These components encode spatial priors, which serve structured imagery well. But what happens when those spatial cues carry little information? Enter the world of medical imaging, where traditional Vision Transformers can falter.
ZACH-ViT: A Fresh Approach
ZACH-ViT, short for Zero-token Adaptive Compact Hierarchical Vision Transformer, takes a bold step. By removing positional embeddings and the [CLS] token, ZACH-ViT achieves permutation-invariant processing of image patches through global average pooling. This architectural choice eliminates the need for the dedicated aggregation token while retaining patch tokens in their original form.
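The permutation-invariance claim is easy to check concretely. The numpy sketch below (weights and dimensions are illustrative stand-ins, not ZACH-ViT's actual parameters) runs a single self-attention layer with no positional embeddings and no [CLS] token, then aggregates patch tokens by global average pooling. Because attention without positional information is permutation-equivariant, shuffling the patch order leaves the pooled output unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8  # number of patches, embedding dim (illustrative sizes)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Random matrices standing in for trained attention projections.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def encode(X):
    # No positional embedding, no [CLS]: attention, then mean over patches.
    return attention(X).mean(axis=0)

X = rng.normal(size=(n, d))      # patch embeddings
perm = rng.permutation(n)

out1 = encode(X)
out2 = encode(X[perm])           # same patches, shuffled order
print(np.allclose(out1, out2))   # True: patch order does not matter
```

The mean over tokens is what buys the invariance; any order-sensitive aggregation (such as a positional embedding added before attention) would break the equality.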
Why does this matter? Medical images often present weaker spatial structure than natural images. ZACH-ViT adapts by forgoing traditional spatial priors and instead focusing on compact processing of patches, making it highly relevant in data-scarce scenarios.
Performance in Medical Datasets
Evaluating ZACH-ViT across seven MedMNIST datasets reveals intriguing results. Under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds), the model holds its own. With only 0.25 million parameters and trained from scratch, it excelled on BloodMNIST and remained competitive on PathMNIST. This supports the hypothesis that weaker spatial priors can sometimes be an advantage.
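A protocol like the one described above can be sketched in a few lines. The helper below (its name, the toy labels, and the class count are hypothetical, not from the paper) draws a fixed number of training indices per class under a given seed, which would then be repeated over the five seeds:

```python
import numpy as np

def few_shot_subset(labels, shots=50, seed=0):
    """Pick `shots` indices per class, mirroring a 50-shot protocol."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(labels):
        cls = np.flatnonzero(labels == c)
        idx.extend(rng.choice(cls, size=shots, replace=False))
    return np.array(idx)

# Toy labels for a 3-class dataset with 200 samples per class.
labels = np.repeat([0, 1, 2], 200)

# The protocol repeats training/evaluation across five seeds.
subsets = [few_shot_subset(labels, shots=50, seed=s) for s in range(5)]
print(len(subsets[0]))   # 150: 50 samples for each of 3 classes
```

Fixing hyperparameters across datasets and averaging over seeds, as the evaluation does, guards against cherry-picked configurations in the low-data regime.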
However, on datasets with stronger anatomical cues, like OCTMNIST and OrganAMNIST, its edge diminishes. This aligns with the idea that when spatial structure becomes pronounced, traditional methods might regain their advantage.
A Call for Architectural Alignment
Component and pooling ablations conducted during testing showed that positional support adds mild benefits as spatial structure increases. Yet reintroducing a [CLS] token proved consistently detrimental. This suggests that aligning the architecture with the structure of the data can beat one-size-fits-all designs.
Despite its minimal size and lack of pretraining, ZACH-ViT achieves results that challenge larger, more resource-intensive models. This raises a question: Are we over-relying on entrenched methods when simpler, tailored solutions could suffice?
For those in the medical imaging field, the implications are clear. ZACH-ViT presents a compelling case for reevaluating how we approach image processing, especially in resource-limited environments. The results tell the story: architectural simplicity coupled with strategic innovation often trumps sheer size or complexity.