Evaluating LLMs: When More Rules Mean Worse Results

Large Language Models (LLMs) bring a fresh set of challenges in evaluation. Unlike traditional software, their outputs are non-deterministic, can vary in meaning from run to run, and depend on both the prompt and the specific model. The Minimum Viable Evaluation Suite (MVES) is an approach designed for auditing LLM applications against those challenges.

Tracking Failure Modes

MVES maps application categories to potential failure modes, metrics, and validation evidence. The framework applies to general LLM applications, retrieval-augmented systems, and agentic workflows. It provides a reproducible evaluation method focused on three task suites: structured extraction, retrieval-augmented generation (RAG) citation/content compliance, and instruction-following accuracy.
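To make the idea of that mapping concrete, here is a minimal sketch of how such a table could be represented in code. The three category names come from the text above; everything else (the SuiteEntry type, the SUITE_MAP name, and the specific failure modes, metrics, and evidence strings) is a hypothetical illustration, not the structure MVES itself defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteEntry:
    """One row of the mapping: category -> failure modes, metrics, evidence."""
    category: str
    failure_modes: tuple[str, ...]
    metrics: tuple[str, ...]
    evidence: tuple[str, ...]


# Illustrative rows only: the failure modes, metrics, and evidence below are
# assumed examples, not taken from the MVES framework or its repository.
SUITE_MAP = (
    SuiteEntry(
        category="structured extraction",
        failure_modes=("output is not valid JSON", "required field missing"),
        metrics=("schema-valid rate",),
        evidence=("per-case parse logs",),
    ),
    SuiteEntry(
        category="RAG citation/content compliance",
        failure_modes=("claim without a citation", "citation to a source not retrieved"),
        metrics=("compliant cases out of total",),
        evidence=("per-case outputs with retrieved passages",),
    ),
    SuiteEntry(
        category="instruction following",
        failure_modes=("format constraint ignored",),
        metrics=("constraint pass rate",),
        evidence=("per-case constraint check results",),
    ),
)


def failure_modes_for(category: str) -> tuple[str, ...]:
    """Look up the failure modes tracked for one application category."""
    for entry in SUITE_MAP:
        if entry.category == category:
            return entry.failure_modes
    raise KeyError(category)
```

Keeping the mapping as plain data means a reviewer can read it without running anything, and a test harness can iterate over it to decide which checks to run for each category.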

In practical terms, using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, five prompt conditions were tested in ablations of 30 cases per suite. The outcomes are revealing. Generic prompt additions don't consistently improve results. In some scenarios, they even backfire. For instance, adding generic rules to a user prompt reduced Qwen 2.5's performance on the RAG suite from 26 of 30 passing cases to only 9 of 30.
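The size of that RAG drop is easier to appreciate as a pass rate. The short calculation below uses only the two counts reported above (26 of 30 and 9 of 30); the function and variable names are ours.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of cases in a suite that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


baseline = pass_rate(26, 30)     # Qwen 2.5 on RAG before the generic rules
with_rules = pass_rate(9, 30)    # same suite after adding the generic rules
drop_points = (baseline - with_rules) * 100  # difference in percentage points

print(f"{baseline:.1%} -> {with_rules:.1%} (drop of {drop_points:.1f} points)")
# prints: 86.7% -> 30.0% (drop of 56.7 points)
```

With only 30 cases per suite, small differences between conditions should be read cautiously, but a swing of 17 cases is hard to attribute to noise.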

The Risk of Prompt Changes

These findings underscore an essential point: prompt changes, often seen as innocuous, can be regression risks. They must be carefully evaluated against specific task suites before being rolled out. Nobody wants to discover after deployment that an update caused performance to plummet, and your customers will certainly notice.
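One way to treat a prompt change as a regression risk is to gate it on suite results, the same way a code change is gated on tests. The sketch below is an assumption about how such a gate might look, not the method used in the accompanying repository: SuiteResult, find_regressions, and the 5-point max_drop threshold are all hypothetical, and the only real numbers are the RAG counts quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Pass count for one task suite under one prompt condition."""
    suite: str
    passed: int
    total: int

    @property
    def rate(self) -> float:
        return self.passed / self.total


def find_regressions(baseline, candidate, max_drop=0.05):
    """Return suite names whose pass rate fell by more than max_drop.

    max_drop is an assumed tolerance (5 percentage points by default);
    pick a value that suits the suite size and the cost of a failure.
    """
    base_by_suite = {result.suite: result for result in baseline}
    regressions = []
    for cand in candidate:
        base = base_by_suite.get(cand.suite)
        if base is None:
            continue  # new suite with no baseline: nothing to compare against
        if base.rate - cand.rate > max_drop:
            regressions.append(cand.suite)
    return regressions


# RAG counts from the Qwen 2.5 example: 26/30 before, 9/30 after generic rules.
before = [SuiteResult("rag", 26, 30)]
after = [SuiteResult("rag", 9, 30)]

blocked = find_regressions(before, after)
if blocked:
    print("Prompt change blocked; regressed suites:", ", ".join(blocked))
```

Run in a CI job, a check like this would surface the regression before the prompt reaches production rather than after.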

Why should we care? If you're deploying LLMs in mission-critical environments, the cost of these 'hidden' errors can be immense. The ROI isn't in the model itself but in avoiding the pitfalls that can come with poorly thought-out prompt modifications.

Practical Implications

Is it time we re-evaluate our approach to LLM prompt modifications? Absolutely. Treating prompt changes with the same scrutiny as any system update isn't just advisable, it's essential. The accompanying repository offers all necessary resources for reproduction, including test suites and scripts. It's a toolkit for ensuring that your LLM applications don't just work, they work well.

In the field of enterprise AI, boring is effective. And with LLMs, sticking to what reliably works might just be the smartest move.