Epstein Bench

The Files Don’t Lie, But Your RAG Might

Overview

Epstein Bench is a RAG benchmark built on the Epstein Files, a collection of publicly released documents regarding Jeffrey Epstein and his associates. Much as the Enron Email Dataset became a standard for network analysis and NLP in the early 2000s, this corpus provides a noisy, entity-rich environment for stress-testing modern Retrieval-Augmented Generation (RAG) systems.

The Challenge

Real-world data is messy. This benchmark simulates enterprise environments with:

Extreme Noise: Scanned PDFs, handwritten notes, and messy OCR.
Complex Graph: Multi-hop reasoning across thousands of entities.
Needle-in-a-Haystack: Critical info buried in legalese.

The Framework

Based on the Auepora evaluation methodology, we decouple performance into:

Retrieval: Relevance (Recall@K) and Accuracy (MRR).
Generation: Correctness (ROUGE/BLEU) and Faithfulness (LLM Judges).
Robustness: Handling paraphrases and noise.
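
The retrieval metrics named above can be sketched as follows. This is a minimal illustration, not the benchmark's own evaluation code: the function names, the use of string document IDs, and the convention of returning 0.0 when a query has no relevant documents are all assumptions made for the example.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0  # assumption: a query with no relevant docs scores 0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical document IDs, for illustration only.
    ranked = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2"}
    print(recall_at_k(ranked, relevant, k=2))
    print(mean_reciprocal_rank([(ranked, relevant), (["d8"], {"d8"})]))
```

Recall@K rewards a retriever for surfacing every relevant document somewhere in the top K, while MRR only looks at how early the first relevant hit appears, which is why the two are reported separately.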

Performance Leaderboard

Live benchmark results from the latest system evaluations. Metrics are computed using the standard Auepora evaluation suite.

System Name	Dataset	Recall	Token F1	BERTScore	LLM Quality	Latency	Date (UTC)
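
For readers unfamiliar with the Token F1 column, one common definition is the token-overlap F1 used in extractive question answering. The sketch below assumes that definition, along with lowercasing and a simple word-character tokenizer; the normalization the benchmark actually applies may differ, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (an assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall against a reference answer."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings, for illustration only.
    print(token_f1("flight log 1997", "the flight log"))
```

Because it only counts shared tokens, this metric can score a fluent but unfaithful answer highly, which is one reason to read it alongside BERTScore and the LLM-judged quality column.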

Citation

If you use Epstein Bench in your research, please cite it as follows:

@software{epstein_bench_2025,
  title  = {Epstein Bench: The Files Don’t Lie, But Your RAG Might},
  author = {Conner Swann},
  year   = {2025},
  url    = {https://github.com/yourbuddyconner/epstein-bench}
}
