Token Wars: Ukrainian Legal Texts Reveal AI Model Challenges

JUST IN: Tokenizers are making waves in the AI world, and the Ukrainian legal system is in the spotlight this time. Who would've thought?

When it comes to chewing through Ukrainian legal text, the differences between foundation models can cost you big time. A wild 1.6x variation in tokenizer fertility (the number of tokens a tokenizer produces per word of input) was found across models, and yet most practitioners don't even consider this in their model selection process. Insane, right?

Token Efficiency: The Unseen Cost

Sources confirm: Qwen 3 models devour 60% more tokens than Llama-family models on the same input. Since API usage is billed per token, that's a budget-busting difference. And just like that, the leaderboard shifts. If you're not looking at tokenizer efficiency when picking a model, you're leaving money on the table.
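Want to see how that math plays out? Here's a minimal sketch of a fertility check, assuming fertility means tokens per whitespace-separated word. The `chunk_tokenizer` helper is a hypothetical stand-in for a real tokenizer, not how Qwen or Llama actually split text, and the sample sentence is made up for illustration.

```python
from typing import Callable, List, Sequence


def fertility(texts: Sequence[str], tokenize: Callable[[str], Sequence]) -> float:
    """Average tokens per whitespace-separated word across a sample of texts."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("sample contains no words")
    return total_tokens / total_words


def cost_multiplier(fertility_a: float, fertility_b: float) -> float:
    """How many times more tokens model A needs than model B for the same text."""
    return fertility_a / fertility_b


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: cuts every word into pieces of `size` characters."""
    def tokenize(text: str) -> List[str]:
        return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]
    return tokenize


# Illustrative sample only; swap in real court decisions and real tokenizers.
sample = ["суд ухвалив рішення"]
coarse = fertility(sample, chunk_tokenizer(4))  # fewer, longer pieces
fine = fertility(sample, chunk_tokenizer(2))    # more, shorter pieces
print(f"coarse={coarse:.2f} fine={fine:.2f} ratio={cost_multiplier(fine, coarse):.2f}")
```

Pass any callable that maps a string to a sequence of tokens in place of the toy tokenizer. At the same per-token price, a ratio of 1.6 means roughly 60% more spend on the same documents.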

The NVIDIA Nemotron Super 3 (120B) is the new kid on the block, taking the top spot with an impressive score of 83.1. It outshines Mistral Large 3, which carries 5.6x the parameters, and does so at one-third the API cost. Bigger isn't always better, folks.

Few-Shot vs Zero-Shot: Choose Wisely

Few-shot prompting is supposed to be the next big thing, right? Not so fast. In Ukrainian, it actually tanks performance by up to 26 percentage points. The stratified and prompt-sensitivity ablations reveal that this issue is tied to the language itself. When your language is morphologically rich, sticking with zero-shot is the way to go.

Why aren't more practitioners all over this? If you're working with Ukrainian legal texts, start with zero-shot. It's not just an option. It's a necessity.
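Comparing the two setups yourself is cheap. Here's a rough sketch of a prompt builder that produces a zero-shot prompt by default and a few-shot prompt when you hand it examples. The `Text:` / `Label:` template and the function name are assumptions for illustration, not the prompts used in the study.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    text: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a classification prompt.

    With no examples this is zero-shot; each (text, label) pair in
    `examples` adds one solved demonstration before the real query.
    """
    parts = [instruction]
    for example_text, example_label in examples:
        parts.append(f"Text: {example_text}\nLabel: {example_label}")
    # The final block leaves the label open for the model to complete.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


zero_shot = build_prompt("Classify the court decision.", "decision text here")
few_shot = build_prompt(
    "Classify the court decision.",
    "decision text here",
    examples=[("an earlier decision", "civil")],  # hypothetical demonstration
)
```

Run both variants over the same held-out decisions and compare scores before you commit to either one.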

Time Travel Troubles: Old vs New Models

Here's a kicker: classifiers trained on court decisions from the pre-war era (2008-2013) drop a whopping 27.9 percentage points when applied to decisions from the invasion era (2022-2026). The forward-backward asymmetry is startling. Classifiers trained on the newer decisions adapt backward (+14.6 pp), but the ones trained on older decisions? They crash and burn with wartime legal language.
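If you want to run this kind of check on your own classifier, here's a minimal sketch of a temporal split, assuming each decision carries a year, a text, and a gold label. The `Decision` record and the helper names are hypothetical, and the era boundaries are the ones quoted above.

```python
from typing import Callable, List, NamedTuple


class Decision(NamedTuple):
    year: int
    text: str
    label: str


def in_era(decisions: List[Decision], start: int, end: int) -> List[Decision]:
    """Keep decisions whose year falls in [start, end], inclusive."""
    return [d for d in decisions if start <= d.year <= end]


def accuracy(predict: Callable[[str], str], decisions: List[Decision]) -> float:
    """Share of decisions where the predicted label matches the gold label."""
    if not decisions:
        raise ValueError("cannot score an empty era")
    hits = sum(1 for d in decisions if predict(d.text) == d.label)
    return hits / len(decisions)


def drop_pp(source_acc: float, target_acc: float) -> float:
    """Accuracy lost moving from the source era to the target era, in percentage points."""
    return (source_acc - target_acc) * 100
```

Train on one era, score on both with `accuracy`, and feed the two numbers to `drop_pp`. Swap the eras to measure the backward direction and see whether the asymmetry shows up in your setup too.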

What does this mean for AI practitioners? Never underestimate the impact of historical context. Model adaptability is critical, and your models need to be flexible enough to handle shifting legal landscapes.

To help tackle these challenges, a public dataset of 14,452 court decisions has been released, and it's a big step forward. Spanning 2008 to 2026, this data is gold for those trying to get to grips with how armed conflict shakes up the courts.