This repository provides the code for our paper "Scalable Best-of-N Selection for Large Language Models via Self-Certainty", in which we propose self-certainty, a metric designed to measure model confidence.
Self-certainty is calculated using the following formula:

$$
\text{Self-certainty}(y \mid x) = -\frac{1}{nV} \sum_{i=1}^{n} \sum_{j=1}^{V} \log\big(V \cdot p(j \mid x, y_{<i})\big)
$$

Where:
- $n$ = Number of tokens in one sentence.
- $V$ = Vocabulary size.
- $p(j \mid x, y_{<i})$ = Probability of token $j$ given the context $x$ and previous tokens $y_{<i}$.
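For intuition, here is a minimal sketch of how this quantity could be computed from a model's per-token logits with PyTorch. It is not the repository's implementation; the function name and tensor layout are assumptions for illustration.

```python
import math

import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor) -> float:
    """Average KL divergence from the uniform distribution to the predicted
    next-token distribution: -1/(nV) * sum_i sum_j log(V * p(j | x, y_<i)).

    `logits` has shape (n, V): one row of vocabulary logits per generated token.
    """
    V = logits.shape[-1]
    log_probs = F.log_softmax(logits, dim=-1)  # log p(j | x, y_<i)
    # log(V * p) = log(V) + log(p); average over all tokens and vocabulary entries.
    return -(math.log(V) + log_probs).mean().item()
```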
For more details, please refer to our paper. If you find this work useful, please consider citing it:
```bibtex
@article{kang2025scalable,
  title={Scalable Best-of-N Selection for Large Language Models via Self-Certainty},
  author={Kang, Zhewei and Zhao, Xuandong and Song, Dawn},
  journal={arXiv preprint arXiv:2502.18581},
  year={2025}
}
```
Ensure you have SymPy installed. You can install it via:

```bash
pip install sympy
```

Integrating Self-Certainty with ZeroEval
The code is an extension of the ZeroEval project. To integrate the Self-Certainty extension with ZeroEval, follow these steps:
- Clone this repository.
- Copy the necessary files into the appropriate directories using the following command:
```bash
cp -r Self-Certainty/src/* ZeroEval/src/
```

The confidence_list.py script computes a self-certainty score for a collection of model outputs based on a given input.
Example usage:
```bash
python3 src/confidence_list.py --input_file /path/to/input.json
```

Input File Requirements:
The JSON input file must include the following keys (an illustrative example follows the list):
- "generator" (optional): The path to the model used for generating responses.
- "output": An array containing the model’s responses.
- "input": The text input provided to the model.
Output Details:
By default, the script writes the self-certainty scores to a file named /path/to/input-confidence-list.json. If the --model_dir option is not specified, the script defaults to using the generator value from the first entry in the input file.
For Fixed-Answer Questions
The self_certainty_from_list.py script selects the answer with the highest self-certainty score from a list of outputs. Answers without extractable content are assigned a confidence score of -infinity.
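Conceptually, selection reduces to an argmax over the candidates' self-certainty scores. The sketch below illustrates the idea; the helper names and the toy answer extractor are assumptions, not the script's actual code.

```python
import math
import re


def extract_answer(response: str):
    """Toy extractor: return the text after 'The answer is', if present."""
    match = re.search(r"[Tt]he answer is\s*(.+)", response)
    return match.group(1).strip() if match else None


def best_of_n(responses: list[str], certainties: list[float], best_n: int) -> str:
    """Return the response with the highest self-certainty among the first
    best_n candidates; candidates with no extractable answer score -infinity."""
    candidates = responses[:best_n]
    scores = [
        c if extract_answer(r) is not None else -math.inf
        for r, c in zip(candidates, certainties[:best_n])
    ]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```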
Example usage:
```bash
python3 src/self_certainty_from_list.py --input_file /path/to/input.json --best_N 16
```

For Code Generation
The livecode_self_certainty_from_list.py script selects the answer with the highest confidence score and parses it into the LiveCode format ({"question_id", "code_list"}).
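As a rough illustration of that output shape (all values below are hypothetical), each record pairs a question ID with a list of extracted code strings:

```python
# Hypothetical record in the LiveCode-style output format.
record = {
    "question_id": "lcb_0001",
    "code_list": ["def solve(nums):\n    return sum(nums)"],
}
```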
Example usage:
```bash
python3 src/livecode_self_certainty_from_list.py --input_file /path/to/input.json --output_file /path/to/output.json --best_N 16
```

The voting_from_list.py script performs Borda voting on a list of outputs. The majority vote is equivalent to Borda voting with $p = 0$. This is supported for fixed-answer questions only.
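The sketch below illustrates rank-weighted (Borda-style) voting under the assumption that the candidate ranked $r$ (0-indexed, by descending self-certainty) contributes a weight of $(N - r)^p$ to its extracted answer; the script's exact weighting may differ. With $p = 0$ every candidate contributes a weight of 1, which recovers majority voting.

```python
from collections import defaultdict


def borda_vote(answers: list[str], certainties: list[float], power: float) -> str:
    """Rank-weighted voting over extracted answers.

    Candidates are ranked by self-certainty (highest first); the candidate at
    rank r (0-indexed) adds (N - r) ** power to its answer's score.
    With power = 0 every candidate adds 1, i.e. plain majority voting.
    NOTE: this weighting scheme is an assumption for illustration.
    """
    n = len(answers)
    order = sorted(range(n), key=lambda i: certainties[i], reverse=True)
    scores = defaultdict(float)
    for rank, idx in enumerate(order):
        scores[answers[idx]] += (n - rank) ** power
    return max(scores, key=scores.get)
```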
Example usage:
```bash
python3 src/voting_from_list.py --input_file /path/to/input.json --best_N 16 --power 0.5
```

The livecode_parsing.py script parses the first item in the output list of a JSON file into the LiveCode format.
Example usage:
```bash
python3 src/livecode_parsing.py --input_file /path/to/input.json --output_file /path/to/output.json
```

For USC generation, modify the dataset name (e.g., "gsm") in ZeroEval generation to "usc-N-path/to/file.json", where $N$ is the number of samples per question to be considered.
Example usage:
```bash
bash zero_eval_local.sh -d "usc-8-path/to/samples.json" -m model_path -p model-usc -s 2 -b 4
```

The usc_from_outputs.py script assists USC in selecting a specific output index. When --dataset_type is set to close, it helps USC choose the first extractable answer if the original answer is not extractable.
This project builds upon the following open-source repositories:
ZeroEval
- Repository: ZeroEval
- License: Apache License 2.0
- Description: A unified framework for evaluating instruction-tuned large language models on tasks like MMLU and GSM.
LiveBench
- Repository: LiveBench
- License: Apache License 2.0
- Description: A challenging, continuously updated benchmark that sources new questions monthly from various contemporary datasets.