PDF Processing Frameworks: The Linchpin of RAG Accuracy

Retrieval-Augmented Generation (RAG) systems depend heavily on document preprocessing, yet the effect of PDF processing frameworks on their accuracy has rarely been assessed. Now, a comprehensive study fills that gap by evaluating four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, and DeepSeek OCR) across 19 different configurations for text extraction and content processing.

The Benchmarks

The study's findings rest on a manually curated benchmark of 50 questions over a corpus of 36 Portuguese administrative documents, totaling 1,706 pages or approximately 492,000 words. Two baselines bound the results: a naive PDFLoader pipeline at 86.9% accuracy as the floor, and manually curated Markdown at 97.1% as the ceiling. Notably, Docling with hierarchical splitting and added image descriptions achieved the highest automated accuracy, 94.1%.
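To make the bounding concrete, here is a small sketch of how much of the distance between the two baselines the best automated configuration covers. The helper name `gap_closed` is hypothetical and the calculation is simple arithmetic on the figures quoted above, not a metric taken from the study.

```python
def gap_closed(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap covered by `score`."""
    if ceiling == floor:
        raise ValueError("floor and ceiling must differ")
    return (score - floor) / (ceiling - floor)


# Accuracy figures quoted in the article, in percent.
NAIVE_PDFLOADER = 86.9    # lower baseline
CURATED_MARKDOWN = 97.1   # upper baseline
BEST_DOCLING = 94.1       # best automated configuration

share = gap_closed(BEST_DOCLING, NAIVE_PDFLOADER, CURATED_MARKDOWN)
print(f"Share of the baseline gap closed: {share:.1%}")
```

On these numbers, the best Docling configuration recovers roughly seven tenths of the 10.2-point gap between naive loading and hand-curated Markdown.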

Why Data Prep Matters

The data show that metadata enrichment and hierarchy-aware chunking contribute more to accuracy than the choice of conversion framework itself. Reconstructing the document hierarchy from fonts also consistently outperformed LLM-based methods. Together, these results suggest that the quality of data preparation is a more critical determinant of RAG accuracy than is commonly assumed.
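As an illustration of what hierarchy-aware chunking with metadata enrichment can look like, here is a minimal sketch that splits converted Markdown at headings and attaches the full heading path to each chunk. It is not the study's pipeline or any framework's API: the `Chunk` class, the `chunk_markdown_by_hierarchy` function, and the metadata keys are all hypothetical names chosen for this example, and the sketch ignores edge cases such as `#` characters inside fenced code.

```python
import re
from dataclasses import dataclass, field

# ATX-style Markdown heading: one to six '#' characters, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: list[str]                        # e.g. ["Decree", "Article 2"]
    text: str                                      # body text under that heading
    metadata: dict = field(default_factory=dict)   # enrichment for retrieval


def chunk_markdown_by_hierarchy(markdown: str, source: str) -> list[Chunk]:
    """Split Markdown at headings, keeping each chunk's ancestor headings."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []   # stack of (level, title)
    buffer: list[str] = []             # body lines of the current section

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            titles = [title for _, title in path]
            meta = {"source": source, "section": " > ".join(titles)}
            chunks.append(Chunk(titles, body, meta))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# Decree\nIntro text.\n## Article 1\nScope.\n## Article 2\nFees.\n"
for chunk in chunk_markdown_by_hierarchy(doc, source="decree.pdf"):
    print(chunk.metadata["section"], "|", chunk.text)
```

Because every chunk carries its section path, a retriever can index or display "Decree > Article 2" alongside the text "Fees.", context that a fixed-size splitter would discard.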

The Underwhelming GraphRAG

Meanwhile, an exploratory GraphRAG implementation scored a mere 82%, below even the naive PDFLoader baseline of 86.9%. This suggests that building naive knowledge graphs without a solid ontological foundation adds complexity without improving accuracy. Here lies an essential question: are we over-engineering solutions without tangible results? The benchmark results speak for themselves.

In an era where data preparation can make or break AI implementations, the emphasis shifts from merely choosing advanced frameworks to refining the preprocessing steps. So, the next time you're weighing which tools to use, remember that going back to basics, like properly preparing your data, might just outperform the latest flashy tech.