First-ever benchmark for evaluating multi-turn long-form question answering in knowledge-intensive domains.
Sep 26, 2025