A Retrieval-Augmented Generation approach for efficient multi-hop question answering
Paper: FrugalRAG: Learning to retrieve and reason for multi-hop QA
Reinforcement learning (RL) based on the final answer’s reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains—often trailing supervised or prompting-only baselines. Instead, we argue that a viable path for RL in multi-hop QA is to use test-time scaling judiciously, for optimizing both the final answer accuracy and the efficiency in reaching that answer. We propose FrugalRAG, a two-stage finetuning framework that adaptively reduces the number of retrieval steps based on a question’s difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches requiring 100× more data, our method achieves competitive performance with only 1,000 examples. On HotPotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency–accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it generalizes zero-shot and surpasses SLM-based and other baselines. These results demonstrate the use of RL—not to increase reasoning steps but to optimize them—as an effective solution for scalable, efficient RAG.
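For intuition, the RL stage rewards a policy for answering correctly while using as few retrieval steps as possible. Below is a minimal sketch of such a frugality-aware reward; the exact reward shape and the trade-off coefficient are illustrative assumptions, not the paper's formulation:

```python
# Illustrative frugality-aware reward (not the paper's exact formulation).
# `lambda_cost` trades off answer correctness against retrieval cost.

def frugal_reward(answer_correct: bool, num_retrievals: int,
                  max_retrievals: int = 5, lambda_cost: float = 0.5) -> float:
    """Reward correct answers, discounted by how many retrieval steps were used."""
    correctness = 1.0 if answer_correct else 0.0
    # Normalized retrieval cost in [0, 1]; fewer searches -> smaller penalty.
    cost = num_retrievals / max_retrievals
    return correctness - lambda_cost * cost
```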
Requirements:
- Python 3.10+
- CUDA-compatible GPU
- 16GB+ GPU memory (recommended)
We recommend using Conda for environment management:
```bash
git clone https://github.com/microsoft/FrugalRAG.git
cd FrugalRAG
conda env create -n frag --file environment.yaml
conda activate frag
pip install vllm==0.8.3 --no-deps
```

See the dataset setup and training guide before evaluation.
Start the model server with vLLM:

```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct --gpu-memory-utilization 0.70 --tensor-parallel-size 1 --port 7501
```
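Once the server is up, you can optionally sanity-check the OpenAI-compatible endpoint. This snippet is only an illustrative check (not part of the repo) and assumes the model name and port from the command above:

```python
# Quick sanity check of the vLLM OpenAI-compatible server started above.
import requests

resp = requests.post(
    "http://localhost:7501/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```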
Next, start a retrieval server using one of the following options.

Option A: ColBERT (Abstracts)

```bash
CUDA_VISIBLE_DEVICES=1 PORT=8000 python -m src.search.serve_colbert --index_root ../data/index/ --index wiki17.nbits.local --colbert_path ../data/colbertv2.0 --collection_path ../data/wiki.abstracts.2017/collection.tsv
```

Option B: ColBERT (Abstracts and Body)

```bash
CUDA_VISIBLE_DEVICES=1 PORT=8000 python -m src.search.serve_colbert --index_root ../data/index/ --index wiki18.nbits.local --colbert_path ../data/colbertv2.0 --collection_path ../data/wiki.2018/collection.tsv
```

Option C: E5
```bash
# convert faiss flat index to pytorch shards for fast execution
python src/search/shard_embeddings.py

# start server
INDEX_DIR=../data/e5-base-v2/pytorch-shards/ E5_MODEL_NAME_OR_PATH="intfloat/e5-base-v2" TOP_K=5 uvicorn src.search.start_e5_server_main:app --port 8001
```
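For background, E5 retrieval scores passages by cosine similarity between normalized query and passage embeddings, with E5's "query: " / "passage: " prefixes. The sketch below is purely illustrative (using sentence-transformers) and may differ from how the repo's E5 server computes scores over the sharded index:

```python
# Illustrative E5 dense-retrieval scoring (not the repo's server code).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")

# e5 models expect "query: " / "passage: " prefixes.
query_emb = model.encode("query: who wrote the novel Dune?", normalize_embeddings=True)
passage_embs = model.encode(
    ["passage: Dune is a 1965 novel by Frank Herbert.",
     "passage: Arrakis is a fictional desert planet."],
    normalize_embeddings=True,
)
scores = util.cos_sim(query_emb, passage_embs)  # higher = more relevant
print(scores)
```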
Ensure all required services are running before evaluation.

Run the evaluation:
```bash
python -m src.evaluation.eval_mp --model_name_or_path [MODEL_PATH] --output_path [OUTPUT_PATH] --prompt_path [PROMPT_PATH] --answer_model [BASE_MODEL_NAME] --port 7501 7502 --search_port 8000 --dataset_name [DATASET_NAME] --input_file [DEV_FILE_PATH]

# extract the final answer with CoT prompt
python -m src.evaluation.eval_mp --model_name_or_path [MODEL_PATH] --output_path [OUTPUT_PATH] --prompt_path [PROMPT_PATH] --answer_model [BASE_MODEL_NAME] --port 7501 7502 --search_port 8000 --dataset_name [DATASET_NAME] --input_file [DEV_FILE_PATH] --answer_only True
```

Run MBE (ensure you set the path in the grade_all.py script):

```bash
python -m src.evaluation.grade_all
```
The evaluation framework automatically computes the following metrics (a sketch of the EM/F1 computation follows the list):
- Exact Match (EM): Binary accuracy for correct answers
- Match: Checks if gold answer is present in the generated answer
- F1 Score: Token-level overlap between predicted and gold answers
- Cost Efficiency: Retrieval operations per query
- Recall/Support F1: Retrieval performance
- MBE: LLM Judge Score
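As a reference point, EM and token-level F1 are conventionally computed after SQuAD-style answer normalization (lowercasing, stripping punctuation and articles). The sketch below illustrates that convention and may differ in detail from the repo's implementation:

```python
# Illustrative EM / token-level F1, SQuAD-style (may differ from the repo's exact code).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```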
Configuration files are organized in the configs/ directory:
```
configs/
├── create_data/                      # Data generation configs
│   └── colbert/
│       ├── hotpot_qwen7b_finish.json
│       └── hotpot_qwen7b_nofinish.json
├── sft/                              # Supervised fine-tuning configs
│   └── colbert/
│       ├── hotpot_qwen7b_m5_0.90.json
│       └── hotpot_qwen7b_m5_nofinish.json
├── grpo/                             # GRPO reinforcement learning configs
│   └── colbert/
│       └── hotpot_qwen7b_m5_0.90.json
└── default_config.yaml               # Accelerate configuration
```
Please set the correct paths and port numbers to ensure the models run smoothly.
| Parameter | Description | Example |
|---|---|---|
| `model_name_or_path` | Base model for training | `"Qwen/Qwen2.5-7B-Instruct"` |
| `search_port` | Retrieval server port | `8000` |
| `port` | Model server ports (reasoner, answer generator) | `[7501, 7502]` |
| `max_iters` | Maximum reasoning iterations | `5` |
| `ndocs` | Documents retrieved per iteration | `3` or `5` |
| `dataset_name` | Target dataset (hotpot, 2wiki, musique) | `"hotpot"` |
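For illustration, a config that ties these parameters together might look like the following. This is a hypothetical example, not one of the shipped config files, and real configs may contain additional fields:

```json
{
  "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
  "search_port": 8000,
  "port": [7501, 7502],
  "max_iters": 5,
  "ndocs": 5,
  "dataset_name": "hotpot"
}
```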
Issue: ImportError: cannot import name 'AdamW' from 'transformers'
Solution: Comment out the import in the relevant files. We use a newer transformers version.
Issue: RuntimeError: Expected all tensors to be on the same device
Solution: In the transformers package file loss_utils.py, line 38, use:
if reduction == "sum":
loss = loss / num_items_in_batch.to(loss.device)
return lossException: Call to collective_rpc method failed: Weight update group already initialized. Call close_communicator first.
Just rerun trl vllm-serve, sometimes it does not call close_communicator on its own.
- Memory Usage: Adjust `--gpu-memory-utilization` based on your GPU memory
- Training Speed: Use DeepSpeed ZeRO-3 for large model training
- Inference Speed: Use `--tensor-parallel-size` for multi-GPU inference
If you use this repository, please cite our paper:
```bibtex
@misc{java2025frugalraglearningretrievereason,
      title={FrugalRAG: Learning to retrieve and reason for multi-hop QA},
      author={Abhinav Java and Srivathsan Koundinyan and Nagarajan Natarajan and Amit Sharma},
      year={2025},
      eprint={2507.07634},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07634},
}
```