PersLitEval: Testing LLMs with Persian Literature

Here's the thing: large language models (LLMs) are like the Swiss Army knives of AI. They're multilingual, versatile, and, let's be honest, impressive. But when it comes to literary knowledge in non-English languages, their abilities are a mixed bag. Enter PersLitEval, a new benchmark that tests these models on 4,514 Persian literature questions. The questions are sourced from the Konkur, Iran's university entrance exam, and they cover everything from grammar to literary devices.

What's in the Benchmark?

PersLitEval isn't just a random collection of questions. It's organized into eight categories, testing spelling, vocabulary, grammar, and more. Think of it this way: it's like asking a model to switch between chess and checkers without missing a beat. The models, however, show some interesting patterns. They're pretty good at tasks requiring a big-picture view, like conceptual understanding, yet they stumble on the nitty-gritty details, like spelling and word formation. If you've ever trained a model, you know these differences can be striking.
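To make the "good at the big picture, shaky on details" pattern concrete, here's a minimal sketch of how per-category accuracy could be computed. The record layout (category, predicted option, gold option) and the toy data are assumptions for illustration, not the benchmark's actual format or results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction correct} for (category, predicted, gold) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, predicted, gold in records:
        totals[category] += 1
        if predicted == gold:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


# Made-up toy records, NOT real PersLitEval results.
toy_records = [
    ("conceptual understanding", 2, 2),
    ("conceptual understanding", 1, 1),
    ("spelling", 3, 1),
    ("spelling", 4, 4),
]

scores = accuracy_by_category(toy_records)
for category, score in sorted(scores.items()):
    print(f"{category}: {score:.2f}")
```

Breaking the score out by category, rather than reporting a single overall number, is what lets a gap between conceptual and formal tasks show up at all.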

The Role of Prompting

How you ask a question matters a lot. Among the ten prompting strategies tested, showing the model worked examples that come with explanations (explained few-shot examples) yielded the best results. Especially in formal linguistic categories, these examples gave models the context they needed to perform better. But here's a question: should we be tailoring prompts for each task, or is this just a band-aid on a bigger problem?
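If you're wondering what an explained few-shot prompt might look like in practice, here's a rough sketch. The prompt wording, the multiple-choice layout, and the helper names `format_item` and `build_explained_few_shot_prompt` are all assumptions of mine, not the template used in the PersLitEval study.

```python
def format_item(question, options):
    """Render a multiple-choice question with numbered options."""
    lines = [f"Question: {question}"]
    lines += [f"{i}) {option}" for i, option in enumerate(options, start=1)]
    return "\n".join(lines)


def build_explained_few_shot_prompt(examples, question, options):
    """Prepend solved examples, each with an explanation, to the target question.

    Hypothetical template: each example shows the question, an explanation,
    and the answer, so the model sees the reasoning and not just the label.
    """
    parts = []
    for example in examples:
        parts.append(format_item(example["question"], example["options"]))
        parts.append(f"Explanation: {example['explanation']}")
        parts.append(f"Answer: {example['answer']}")
        parts.append("")  # blank line between examples
    parts.append(format_item(question, options))
    parts.append("Explanation:")  # the model continues from here
    return "\n".join(parts)


# Placeholder content only; a real prompt would use actual benchmark items.
demo_examples = [
    {
        "question": "<solved example question>",
        "options": ["<option 1>", "<option 2>", "<option 3>", "<option 4>"],
        "explanation": "<why option 2 is correct>",
        "answer": 2,
    }
]

prompt = build_explained_few_shot_prompt(
    demo_examples,
    "<target question>",
    ["<option 1>", "<option 2>", "<option 3>", "<option 4>"],
)
print(prompt)
```

The only difference from plain few-shot prompting in this sketch is the `Explanation:` line in each example. Dropping it gives you the ordinary few-shot baseline to compare against.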

Why This Matters

Here's why this matters for everyone, not just researchers: if LLMs struggle with something as fundamental as spelling in Persian, what about other languages or even dialects? The analogy I keep coming back to is testing a car's performance on different terrains. If it can't handle gravel, how confident are we it'll manage mud? The error analysis from PersLitEval identifies three main failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting errors. These aren't just academic concerns. They're the cracks in the foundation of multilingual AI applications.

Honestly, this kind of research is a wake-up call. It shows that while LLMs are powerful, they aren't infallible. If we want them to be truly global tools, we need to improve them for specific languages. That's not just about adding more data or tweaking algorithms. It's about understanding the cultural and linguistic nuances that make each language unique.

