Cracking the Code on Startup Success: The Data Challenge
Predicting startup success remains a tough nut to crack. A new model using structured features and XGBoost shows promise, but data richness is the real hurdle.
Predicting which startups will succeed is a formidable challenge. Positive examples are scarce, with only 9% of founders achieving success. A recent study tackled the issue by engineering 28 structured features from raw JSON fields, focusing on jobs, education, and exits. The model also incorporated a rule layer with XGBoost boosted stumps.
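To make the feature-engineering step concrete, here is a minimal sketch of how nested JSON about jobs, education, and exits might be flattened into structured features. The field names, the title and degree lists, and the seven features shown are illustrative assumptions; the study's actual schema and its 28 features are not described here.

```python
import json

# Hypothetical raw record. The field names are assumptions for illustration,
# not the schema used in the study.
RAW = """
{
  "jobs": [
    {"title": "Software Engineer", "years": 3},
    {"title": "CTO", "years": 4}
  ],
  "education": [{"degree": "MSc", "field": "Computer Science"}],
  "exits": [{"type": "acquisition"}]
}
"""

# Assumed vocabularies; a real pipeline would need far more careful matching.
SENIOR_TITLE_TOKENS = {"cto", "ceo", "founder", "vp", "director"}
ADVANCED_DEGREES = {"msc", "mba", "phd"}


def extract_features(record: dict) -> dict:
    """Flatten one founder record into a flat dict of numeric features."""
    jobs = record.get("jobs", [])
    education = record.get("education", [])
    exits = record.get("exits", [])

    # Match on whole words so that, for example, "director" is not
    # mistaken for a title containing "cto".
    held_senior_role = any(
        SENIOR_TITLE_TOKENS & set(job.get("title", "").lower().split())
        for job in jobs
    )
    has_advanced_degree = any(
        edu.get("degree", "").lower() in ADVANCED_DEGREES for edu in education
    )

    return {
        "num_jobs": len(jobs),
        "total_years_experience": sum(job.get("years", 0) for job in jobs),
        "held_senior_role": int(held_senior_role),
        "num_degrees": len(education),
        "has_advanced_degree": int(has_advanced_degree),
        "num_exits": len(exits),
        "has_prior_exit": int(len(exits) > 0),
    }


features = extract_features(json.loads(RAW))
print(features)
```

A flat dictionary like this is the kind of input a tree-based model such as XGBoost can consume once the records are stacked into a table.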
Model Performance
The model achieved a validation F0.5 score of 0.3030, with precision at 0.3333 and recall at 0.2222. That's a notable 17.7 percentage point improvement over the zero-shot large language model (LLM) baseline. But let's not get too carried away. While the numbers are better, they're not groundbreaking.
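For readers unfamiliar with the metric, F0.5 is the F-beta score with beta set to 0.5, which weights precision more heavily than recall. The sketch below uses the standard F-beta formula to check that the reported precision and recall are consistent with the reported score. The function name is ours, not the study's.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # conventional value when both terms are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the article.
score = f_beta(precision=0.3333, recall=0.2222, beta=0.5)
print(f"{score:.4f}")  # prints 0.3030
```

Because beta is below 1, a model that raises precision at the cost of some recall is rewarded, which suits a setting where false positives (backing the wrong founder) are costly.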
Experimenting with LLM Features
The researchers also ran a controlled experiment by extracting nine features from the prose field using Claude Haiku. They tested this at 67% and at full dataset coverage. Surprisingly, the LLM features accounted for 26.4% of model importance but added no cross-validation signal (the score actually fell by 0.05 percentage points). Why? Because the anonymised prose is generated from the same JSON fields already used, making it a lossy re-encoding rather than a source of richer data.
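The "lossy re-encoding" argument can be illustrated with a toy example. If a new feature is a deterministic function of features the model already has, it cannot separate any two records the original features could not, so the best achievable fit does not improve. The data and the derived feature below are invented for illustration and have nothing to do with the study's dataset; the score is an in-sample ceiling, not a cross-validation result.

```python
from collections import Counter, defaultdict


def best_lookup_accuracy(rows, labels):
    """In-sample accuracy of the best lookup-table classifier:
    for each distinct feature tuple, predict its majority label."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row)][label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(labels)


# Toy dataset: two binary base features per record, plus a binary label.
base = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1), (0, 1), (1, 0)]
labels = [0, 0, 0, 1, 0, 1, 1, 0]


def derived(row):
    """A lossy re-encoding: collapses both base fields into one flag."""
    return int(row[0] + row[1] >= 1)


augmented = [row + (derived(row),) for row in base]
derived_only = [(derived(row),) for row in base]

acc_base = best_lookup_accuracy(base, labels)
acc_augmented = best_lookup_accuracy(augmented, labels)
acc_derived_only = best_lookup_accuracy(derived_only, labels)

print(acc_base, acc_augmented, acc_derived_only)  # prints 0.875 0.875 0.625
```

Adding the derived column leaves the ceiling unchanged, while using it alone does worse. A tree ensemble can still split on such a redundant column, so it can pick up a share of feature importance without contributing new signal, which is one plausible reading of the 26.4% figure.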
What's the Real Limitation?
The real limitation isn't the modeling. It's the dataset itself. The ceiling, a cross-validation score of around 0.25 and a validation score of about 0.30, reflects the dataset's inherent information content. This study serves as a diagnostic benchmark, highlighting where the data falls short. What would make the dataset richer? More varied and detailed founder experiences could be a start. Or better yet, a dataset that includes qualitative insights, not just quantitative metrics.
Why should we care about predicting startup success? The stakes are high. Investors, founders, and even employees benefit from insights into which startups have the potential to thrive. But if the data isn't there, even the best models can't provide the clarity we crave. Isn't it time we invest in gathering richer datasets rather than solely focusing on refining models?