Rethinking Label Selection for AI in Tabular Data

Choosing which instances to label is a critical hurdle in low-label tabular learning, and recent tabular foundation models such as TabPFN make the problem concrete. These models predict from a labeled context, so a poor choice of context can handicap predictive performance while a good one can boost it. Supervised oracle experiments show that a carefully curated labeled context can outperform a random selection at the same budget. But what about the cold-start setting? The literature has paid little attention to how to select instances before any labels exist.

The Geometry Problem

This selection issue is fundamentally geometric. In vision and language, models produce embedding spaces where simple geometric methods work well. In the tabular world, however, traditional selection methods operate in the raw feature space. That space lacks a natural metric: mixed data types and nonlinear interactions make distances unreliable. As budgets grow, raw-space selection often ends up worse than random on most datasets.
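
To see why raw tabular distances are shaky, here is a minimal sketch using a made-up three-row table (the columns, values, and the helper nearest_to_first are hypothetical, not from the LUCoS evaluation). Simply changing the unit of one column changes which row counts as the nearest neighbor.

```python
import numpy as np

# Hypothetical toy table. Columns: age (years), income, smoker flag (0/1).
in_dollars = np.array([
    [30.0, 50_000.0, 0.0],  # row 0: the reference row
    [30.0, 50_500.0, 1.0],  # row 1: same age, slightly higher income, smoker
    [65.0, 50_000.0, 0.0],  # row 2: same income, 35 years older
])

# The same table with income expressed in thousands of dollars.
in_thousands = in_dollars.copy()
in_thousands[:, 1] /= 1000.0


def nearest_to_first(table):
    """Index of the row closest to row 0 under plain Euclidean distance."""
    dists = np.linalg.norm(table[1:] - table[0], axis=1)
    return int(dists.argmin()) + 1


print(nearest_to_first(in_dollars))    # 2
print(nearest_to_first(in_thousands))  # 1
```

Nothing about the data changed between the two calls, only the income unit. Any selector built on these distances inherits that arbitrariness.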

Introducing LUCoS

Enter LUCoS, short for Latent Unsupervised Context Selection. It drops raw-feature geometry in favor of the latent geometry of embeddings from an unsupervised Prior-Fitted Network (PFN), and it picks representative medoids in that space as the instances to label. Evaluated on 67 OpenML-CC18 datasets under six low-label budgets, LUCoS leads in mean AUC, accuracy (ACC), and F1. These conclusions hold across metrics and robustness checks.
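
As a rough illustration of medoid-based selection, the sketch below picks a labeling budget's worth of medoids from an embedding matrix. Several things here are assumptions: the embeddings Z are synthetic stand-ins rather than real PFN outputs, and a simple greedy k-medoids routine (the hypothetical select_medoids) is swapped in, since the exact procedure LUCoS uses is not described here.

```python
import numpy as np


def select_medoids(Z, budget):
    """Greedy k-medoids: repeatedly add the row that most reduces the
    total distance from every row to its nearest selected row.
    Illustrative stand-in, not the LUCoS implementation."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    nearest = np.full(len(Z), np.inf)       # distance to nearest chosen row
    selected = []
    for _ in range(budget):
        # Total cost if each candidate column j were added to the selection.
        cost = np.minimum(nearest[:, None], D).sum(axis=0)
        if selected:
            cost[selected] = np.inf         # never pick the same row twice
        j = int(cost.argmin())
        selected.append(j)
        nearest = np.minimum(nearest, D[:, j])
    return selected


# Synthetic "embeddings": three well-separated groups of 20 rows each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
Z = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])

chosen = select_medoids(Z, budget=3)
print(sorted(i // 20 for i in chosen))  # [0, 1, 2]: one medoid per group
```

The returned indices are the rows that would be sent for labeling. Note that no labels are used anywhere, which is what makes this kind of selection usable in a cold-start setting.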

Mechanisms and Implications

What's the secret sauce for LUCoS? A gain decomposition reveals a straightforward mechanism. At minimal budgets, the primary advantage comes from ensuring coverage. As budgets climb, what matters is the representation space in which coverage is measured. LUCoS shows that reliable context selection depends less on selector complexity and more on defining representativeness in a meaningful representation geometry.
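
One simple way to make "coverage" concrete is the mean distance from every row to its nearest selected row. This is an assumed proxy chosen for illustration (the function coverage_cost and the toy data are hypothetical); the article does not specify how the gain decomposition measures coverage.

```python
import numpy as np


def coverage_cost(X, selected):
    """Mean distance from each row of X to its nearest selected row.
    Lower means the selection covers the data better."""
    diff = X[:, None, :] - X[selected][None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return float(dist.min(axis=1).mean())


# Two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

clumped = coverage_cost(X, [0, 1])  # both picks from the same pair
spread = coverage_cost(X, [0, 2])   # one pick from each pair

print(clumped)  # 5.0
print(spread)   # 0.5
```

The same function gives different answers depending on which representation X holds, raw features or latent embeddings, and that dependence is the point the decomposition makes at larger budgets.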

So, why should you care? Because this strategy outperforms traditional raw-space methods and offers a more effective path forward in low-label learning. The future of tabular learning lies not in clinging to raw-feature distances but in embracing the kind of latent geometry that models already exploit in vision and language.
