This repository provides the code for our paper "Scalable Best-of-N Selection for Large Language Models via Self-Certainty", in which we propose self-certainty, a metric designed to measure model confidence.
Self-certainty is calculated using the following formula:

$$
\text{Self-certainty}(y \mid x) = -\frac{1}{nV} \sum_{i=1}^{n} \sum_{j=1}^{V} \log\big(V \cdot p(j \mid x, y_{<i})\big)
$$

Where:
- $n$ = Number of tokens in one sentence.
- $V$ = Vocabulary size.
- $p(j \mid x, y_{<i})$ = Probability of token $j$ given the context $x$ and previous tokens $y_{<i}$.
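For intuition, here is a minimal sketch of how this quantity could be computed from a model's per-token logits with PyTorch. It is not the repository's implementation; the function name and tensor layout are assumptions for illustration.

```python
import math

import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor) -> float:
    """Average KL divergence from the uniform distribution to the predicted
    next-token distribution: -1/(nV) * sum_i sum_j log(V * p(j | x, y_<i)).

    `logits` has shape (n, V): one row of vocabulary logits per generated token.
    """
    V = logits.shape[-1]
    log_probs = F.log_softmax(logits, dim=-1)  # log p(j | x, y_<i)
    # log(V * p) = log(V) + log(p); average over all tokens and vocabulary entries.
    return -(math.log(V) + log_probs).mean().item()
```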
For more details, please refer to our paper. If you find this work useful, please consider citing it:
```bibtex
@article{kang2025scalable,
  title={Scalable Best-of-N Selection for Large Language Models via Self-Certainty},
  author={Kang, Zhewei and Zhao, Xuandong and Song, Dawn},
  journal={arXiv preprint arXiv:2502.18581},
  year={2025}
}
```
Ensure you have SymPy installed. You can install it via:

```bash
pip install sympy
```

Integrating Self-Certainty with ZeroEval
The code is an extension of the ZeroEval project. To integrate the Self-Certainty extension with ZeroEval, follow these steps:
- Clone this repository.
- Copy the necessary files into the appropriate directories using the following command:
```bash
cp -r Self-Certainty/src/* ZeroEval/src/
```

The confidence_list.py script computes a self-certainty score for a collection of model outputs based on a given input.
Example usage:
```bash
python3 src/confidence_list.py --input_file /path/to/input.json
```

Input File Requirements:
The JSON input file must include the following keys (an illustrative example follows the list):
- "generator" (optional): The path to the model used for generating responses.
- "output": An array containing the model’s responses.
- "input": The text input provided to the model.
Output Details:
By default, the script writes the self-certainty scores to a file named /path/to/input-confidence-list.json. If the --model_dir option is not specified, the script defaults to using the generator value from the first entry in the input file.
For Fixed-Answer Questions
The self_certainty_from_list.py script selects the answer with the highest self-certainty score from a list of outputs. Answers without extractable content are assigned a confidence score of -infinity.
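Conceptually, selection reduces to an argmax over the candidates' self-certainty scores. The sketch below illustrates the idea; the helper names and the toy answer extractor are assumptions, not the script's actual code.

```python
import math
import re


def extract_answer(response: str):
    """Toy extractor: return the text after 'The answer is', if present."""
    match = re.search(r"[Tt]he answer is\s*(.+)", response)
    return match.group(1).strip() if match else None


def best_of_n(responses: list[str], certainties: list[float], best_n: int) -> str:
    """Return the response with the highest self-certainty among the first
    best_n candidates; candidates with no extractable answer score -infinity."""
    candidates = responses[:best_n]
    scores = [
        c if extract_answer(r) is not None else -math.inf
        for r, c in zip(candidates, certainties[:best_n])
    ]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```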
Example usage:
```bash
python3 src/self_certainty_from_list.py --input_file /path/to/input.json --best_N 16
```

For Code Generation
The livecode_self_certainty_from_list.py script selects the answer with the highest confidence score and parses it into the LiveCode format ({"question_id", "code_list"}).
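As a rough illustration of that output shape (all values below are hypothetical), each record pairs a question ID with a list of extracted code strings:

```python
# Hypothetical record in the LiveCode-style output format.
record = {
    "question_id": "lcb_0001",
    "code_list": ["def solve(nums):\n    return sum(nums)"],
}
```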
Example usage:
```bash
python3 src/livecode_self_certainty_from_list.py --input_file /path/to/input.json --output_file /path/to/output.json --best_N 16
```

The voting_from_list.py script performs Borda voting on a list of outputs. The majority vote is equivalent to Borda voting with $p = 0$. This is supported for fixed-answer questions only.
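The sketch below illustrates rank-weighted (Borda-style) voting under the assumption that the candidate ranked $r$ (0-indexed, by descending self-certainty) contributes a weight of $(N - r)^p$ to its extracted answer; the script's exact weighting may differ. With $p = 0$ every candidate contributes a weight of 1, which recovers majority voting.

```python
from collections import defaultdict


def borda_vote(answers: list[str], certainties: list[float], power: float) -> str:
    """Rank-weighted voting over extracted answers.

    Candidates are ranked by self-certainty (highest first); the candidate at
    rank r (0-indexed) adds (N - r) ** power to its answer's score.
    With power = 0 every candidate adds 1, i.e. plain majority voting.
    NOTE: this weighting scheme is an assumption for illustration.
    """
    n = len(answers)
    order = sorted(range(n), key=lambda i: certainties[i], reverse=True)
    scores = defaultdict(float)
    for rank, idx in enumerate(order):
        scores[answers[idx]] += (n - rank) ** power
    return max(scores, key=scores.get)
```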
Example usage:
```bash
python3 src/voting_from_list.py --input_file /path/to/input.json --best_N 16 --power 0.5
```

The livecode_parsing.py script parses the first item in the output list of a JSON file into the LiveCode format.
Example usage:
```bash
python3 src/livecode_parsing.py --input_file /path/to/input.json --output_file /path/to/output.json
```

For USC generation, modify the dataset name (e.g., "gsm") in ZeroEval generation to "usc-N-path/to/file.json", where $N$ is the number of samples per question to be considered.
Example usage:
```bash
bash zero_eval_local.sh -d "usc-8-path/to/samples.json" -m model_path -p model-usc -s 2 -b 4
```

The usc_from_outputs.py script assists USC in selecting a specific output index. When --dataset_type is set to close, it helps USC choose the first extractable answer if the original answer is not extractable.
This project builds upon the following open-source repositories:
ZeroEval
- Repository: ZeroEval
- License: Apache License 2.0
- Description: A unified framework for evaluating instruction-tuned large language models on tasks like MMLU and GSM.
LiveBench
- Repository: LiveBench
- License: Apache License 2.0
- Description: A challenging, continuously updated benchmark that sources new questions monthly from various contemporary datasets.