Epstein Bench is a Retrieval-Augmented Generation (RAG) benchmark built on the Epstein Files, a collection of publicly released documents concerning Jeffrey Epstein and his associates. Much as the Enron Email Dataset became a standard for network analysis and NLP in the early 2000s, this corpus provides a complex, noisy, and entity-rich environment for stress-testing modern RAG systems.
Real-world data is messy. This benchmark simulates enterprise environments with:
Based on the Auepora evaluation methodology, we decouple performance into:
Live benchmark results from the latest system evaluations. Metrics are computed using the standard Auepora evaluation suite.
| System Name | Dataset | Recall | Token F1 | BERTScore | LLM Quality | Latency | Date (UTC) |
|---|---|---|---|---|---|---|---|
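
To make two of the table's columns concrete, here is a minimal sketch of how Recall and Token F1 are commonly computed. It assumes a SQuAD-style token-overlap F1 on lowercased whitespace tokens and a set-based retrieval recall over document IDs. The Auepora suite's actual implementation is not shown here and may differ, for example in tokenization or answer normalization. The function names `token_f1` and `retrieval_recall` are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a reference answer.

    Assumption: lowercased whitespace tokenization, as in SQuAD-style scoring.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the relevant documents that appear among the retrieved ones."""
    if not relevant_ids:
        return 0.0  # Assumption: undefined recall is reported as 0.0.
    hits = len(set(retrieved_ids) & relevant_ids)
    return hits / len(relevant_ids)
```

Scoring retrieval separately from the generated answer is what lets a benchmark tell a retriever that missed the right document apart from a generator that misread it.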
If you use Epstein Bench in your research, please cite it as follows:
@software{epstein_bench_2025,
  title = {Epstein Bench: The Files Don’t Lie, But Your RAG Might},
  author = {Conner Swann},
  year = {2025},
  url = {https://github.com/yourbuddyconner/epstein-bench}
}