UA-Legal-Bench Challenges English-Centric Legal NLP

English has long dominated legal NLP. UA-Legal-Bench, a new benchmark, sets out to change that by shifting the focus to Ukrainian legal reasoning. It is noteworthy because it exposes the gaps that arise when non-English languages, especially those with non-Latin scripts and rich morphology, are overlooked in natural language processing.

Benchmark Composition

UA-Legal-Bench is built on the Unified State Register of Court Decisions (EDRSR), one of the largest open judicial corpora in the world, with 99.5 million decisions. The benchmark covers five distinct tasks: case-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction. With thousands of data points for each task, it offers a rigorous test for large language models (LLMs).

The tasks vary in complexity and class size. For example, case-type classification involves just four classes, while cause category prediction spans 22. The question is, can models trained primarily on English data really excel here?
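To give a sense of how class count alone changes the difficulty, the short sketch below computes the accuracy a uniform random guesser would be expected to reach on the two tasks whose class counts are given above. The function name is illustrative, and uniform guessing is an assumed reference point, not a baseline reported by the benchmark.

```python
def uniform_chance_accuracy(num_classes: int) -> float:
    """Expected accuracy when guessing uniformly at random among the classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes


# Class counts taken from the article; the other three tasks are omitted
# because their class counts are not stated.
class_counts = {
    "case-type classification": 4,
    "cause category prediction": 22,
}

baselines = {task: uniform_chance_accuracy(k) for task, k in class_counts.items()}

for task, acc in baselines.items():
    print(f"{task}: {acc:.1%} expected accuracy from uniform guessing")
```

A model has to clear 25% on the four-class task just to beat random guessing, while on the 22-class task the same floor is under 5%, so raw scores across tasks are not directly comparable.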

Model Evaluation

The authors evaluated 11 LLMs with parameter counts ranging from 3 billion to 675 billion, using zero-shot and 3-shot prompting via AWS Bedrock, for a total of 158,000 API calls. The results yield an essential insight: few-shot prompting can significantly improve performance, particularly in judgment form classification, where gains reached +38.6 percentage points. The same cannot be said for case-outcome prediction, where the effects were mixed.
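The difference between zero-shot and 3-shot prompting comes down to whether labeled examples precede the query. The article does not give the benchmark's actual prompt template, so the following is a minimal sketch under assumed names: `build_prompt`, the English instruction text, the label set, and the placeholder decisions are all hypothetical.

```python
def build_prompt(instruction, labels, query, examples=()):
    """Assemble a classification prompt.

    With no examples this is a zero-shot prompt; with three
    (text, label) pairs it is a 3-shot prompt.
    """
    lines = [instruction, "Allowed labels: " + ", ".join(labels), ""]
    for text, label in examples:
        lines += [f"Decision: {text}", f"Label: {label}", ""]
    lines += [f"Decision: {query}", "Label:"]
    return "\n".join(lines)


# Hypothetical label set and placeholder texts, for illustration only.
labels = ["ruling", "resolution", "decree"]
instruction = "Classify the judgment form of the court decision."
shots = [
    ("<text of example decision 1>", "ruling"),
    ("<text of example decision 2>", "resolution"),
    ("<text of example decision 3>", "decree"),
]

zero_shot = build_prompt(instruction, labels, "<text of the decision to classify>")
three_shot = build_prompt(
    instruction, labels, "<text of the decision to classify>", examples=shots
)
```

Both prompts end with an open `Label:` line for the model to complete; the 3-shot version simply shows three solved instances first, which is the kind of format cue that could plausibly explain large gains on a task like judgment form classification.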

Accuracy proved to be a deceptive metric on imbalanced tasks. The model with the highest case-outcome prediction accuracy, at 62%, was effectively a majority-class predictor with a macro-F1 score of just 23%. Meanwhile, the top-performing model achieved a macro-F1 score of only 44%. This underscores the need for metrics such as macro-F1, which weight every class equally, when legal tasks have skewed label distributions.
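The gap between accuracy and macro-F1 is easy to reproduce on toy data. The sketch below uses an invented three-class label distribution (the label names and counts are assumptions, not the benchmark's data) and a predictor that always outputs the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced data: 62 of 100 cases belong to the majority class.
y_true = ["granted"] * 62 + ["denied"] * 25 + ["partial"] * 13
# A degenerate predictor that always answers with the majority class.
y_pred = ["granted"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} macro-F1={mf1:.2f}")  # accuracy=0.62 macro-F1=0.26
```

The predictor scores 62% accuracy without ever identifying a minority class, and macro-F1 drops to about 26% because two of the three per-class F1 scores are zero. The exact macro-F1 depends on the number of classes and their frequencies, so this toy figure is not meant to match the 23% reported above.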

Scaling and Performance

Within-family scaling analysis showed that 8-billion-parameter models could match the performance of larger models on surface-level tasks. However, the scaling thresholds varied widely across model families, so more parameters do not necessarily translate into better performance in every context. Western coverage has largely overlooked this point.

Why should this matter to you? The introduction of UA-Legal-Bench highlights the importance of diversifying NLP benchmarks beyond English. It challenges the status quo and questions the one-size-fits-all approach often taken in language modeling. As the field evolves, will benchmarks like this become the new norm, providing a better reflection of our linguistically diverse world?
