BERDS: A Benchmark for Retrieval Diversity for Subjective Questions
Abstract
We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., would ChatGPT do more harm than good?).
First, we curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and a set of diverse perspectives on that question, sourced from survey questions and debate websites. This task diverges from most retrieval tasks, where document relevance can be evaluated by simple string matching against reference answers. To evaluate retrievers on this task, we build an automatic evaluator that decides whether each retrieved document contains a perspective.
Our experiments show that existing retrievers struggle to surface diverse perspectives. Re-ranking and query expansion approaches encourage retrieval diversity and achieve substantial gains over base performance. Yet retrieving diverse documents from a large, web-scale corpus remains challenging: existing retrievers cover all perspectives within the top 5 documents only about 30% of the time. Our work presents benchmark datasets and an evaluation framework, laying the foundation for future studies of retrieval diversity for complex queries.
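The automatic evaluator judges, for each retrieved document, whether it expresses a given perspective. The sketch below is illustrative only: it uses an off-the-shelf zero-shot NLI classifier as a stand-in judge, whereas the evaluator released with the benchmark may use a different model (e.g., a prompted LLM). The function name contains_perspective and the threshold are our own choices.

```python
# Illustrative sketch: judge whether a retrieved document contains a perspective,
# using a zero-shot NLI classifier as a stand-in for the benchmark's evaluator.
from transformers import pipeline

# A commonly used zero-shot entailment model (assumption, not the official judge).
_judge = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def contains_perspective(document: str, perspective: str, threshold: float = 0.5) -> bool:
    """Return True if the document is judged to express the given perspective."""
    result = _judge(
        document,
        candidate_labels=[perspective],
        hypothesis_template="This text supports the view that {}.",
    )
    # `scores` is aligned with `labels`; with a single label there is a single score.
    return result["scores"][0] >= threshold
```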
Baseline Performances
We evaluate the performance of existing retrievers on the BERDS benchmark, using BM25, DPR, and Contriever as baselines. MRecall @ k measures the percentage of questions for which all perspectives are covered by the top k retrieved documents. Precision @ k measures the percentage of retrieved documents that contain a perspective. (k = 5 in the table.)
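As a concrete reference for the two metrics, here is a minimal sketch of how MRecall @ k and Precision @ k could be computed from per-document perspective judgments (e.g., those produced by an evaluator like the one sketched above). The data structures and function names are our own illustration, not the benchmark's official implementation.

```python
from typing import List, Set

def mrecall_at_k(covered_per_doc: List[Set[str]], gold_perspectives: Set[str], k: int = 5) -> bool:
    """True if the top-k documents jointly cover every gold perspective for one question."""
    covered = set().union(*covered_per_doc[:k]) if covered_per_doc[:k] else set()
    return gold_perspectives <= covered

def precision_at_k(covered_per_doc: List[Set[str]], k: int = 5) -> float:
    """Fraction of the top-k documents that contain at least one perspective."""
    top = covered_per_doc[:k]
    return sum(1 for doc in top if doc) / max(len(top), 1)

# The benchmark-level MRecall @ k is the fraction of questions for which
# mrecall_at_k(...) returns True, averaged over all questions.
```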
If you would like to evaluate your own model, follow the instructions in the GitHub repository.