subsample_test_set: size of the test set to use to speed up evaluation. None means using the entire test set.
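As a rough illustration of what this option does (the actual evaluation code in this repository may implement it differently), subsampling could look like:

```python
import random

def maybe_subsample(test_set, subsample_test_set=None, seed=0):
    """Illustrative sketch only: return the full test set when
    subsample_test_set is None, otherwise a random subset of that size."""
    if subsample_test_set is None:
        return test_set
    rng = random.Random(seed)
    return rng.sample(list(test_set), min(subsample_test_set, len(test_set)))
```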
Result format
After running the code above, you will get results as a pickle file.
For each experiment, we store a result tree in the following format:
{
  seed_id: {
    prompt_id: {
      // prompt-level info
      id: prompt_id,
      prompt: prompt_text,
      sen: sen_score,
      mi: mi_score,
      perf: performance (accuracy),
    },
    // seed-level info: correlations across prompts
    sen_p: ...,
    sen_s: ...,
    mi_p: ...,
    mi_s: ...,
  }
  // Top-level info (e.g., average sensitivity and average accuracy) is computed by the
  // print_results function; it is not stored in the pickle.
}
id: the prompt id
prompt: the text of the prompt
sen: the sensitivity of the prompt
mi: the mutual information of the prompt
perf: the accuracy of the prompt
sen_p: Pearson correlation between performance and sensitivity
sen_s: Spearman correlation between performance and sensitivity
mi_p: Pearson correlation between performance and mutual information
mi_s: Spearman correlation between performance and mutual information
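A minimal sketch of inspecting one of these pickles, assuming a file name such as results.pkl and the key names listed above; the per-seed averages it prints mirror what print_results computes:

```python
import pickle
from statistics import mean

# Hypothetical path; substitute the pickle file produced by your run.
with open("results.pkl", "rb") as f:
    results = pickle.load(f)

for seed_id, seed_result in results.items():
    # Prompt-level records are dicts; the correlation entries (sen_p, sen_s, ...) are scalars.
    prompts = [v for v in seed_result.values() if isinstance(v, dict)]
    avg_acc = mean(p["perf"] for p in prompts)
    avg_sen = mean(p["sen"] for p in prompts)
    print(f"seed {seed_id}: avg acc={avg_acc:.3f}, avg sen={avg_sen:.3f}, "
          f"sen_s={seed_result['sen_s']:.3f}, mi_s={seed_result['mi_s']:.3f}")
```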
Tune Alpha
After obtaining the correlation between the metric scores and performance on the dev set, we tune the alpha that maximizes the correlation (or another criterion, e.g., NDCG). We then fix alpha and run on the large test set.
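A sketch of that tuning loop, assuming the prompt-level records described above and an illustrative weighted-sum combination of sensitivity and mutual information (the repository's actual combination may differ):

```python
import numpy as np
from scipy.stats import spearmanr

def tune_alpha(dev_prompts, alphas=np.linspace(0.0, 1.0, 21)):
    """Pick the alpha whose combined metric score correlates best with dev-set accuracy.

    dev_prompts: list of dicts with 'sen', 'mi', and 'perf' keys,
    matching the prompt-level records in the result tree.
    """
    perf = [p["perf"] for p in dev_prompts]
    best_alpha, best_corr = None, -np.inf
    for alpha in alphas:
        # Illustrative combination: weighted sum of the two metric scores.
        score = [alpha * p["sen"] + (1 - alpha) * p["mi"] for p in dev_prompts]
        corr, _ = spearmanr(score, perf)
        if corr > best_corr:
            best_alpha, best_corr = alpha, corr
    return best_alpha, best_corr
```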
Customization
To use your own custom prompts, modify promptset in main.py.
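For example, promptset might be edited to hold your own templates; the exact structure expected is defined in main.py, so treat this as a shape sketch only:

```python
# Illustrative only: check main.py for the actual structure of promptset.
promptset = [
    "Review: {text}\nSentiment:",
    "Is the following review positive or negative?\n{text}\nAnswer:",
    "{text}\nThe sentiment of this review is",
]
```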
Contact Us
If you have any questions, suggestions, or concerns, please reach out to us.
Relevant paper
If you find this repository or its data helpful, please cite the following work:
@article{shen2023flatnessaware,
  title={Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency},
  author={Lingfeng Shen and Weiting Tan and Boyuan Zheng and Daniel Khashabi},
  year={2023},
  eprint={2305.10713},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2305.10713}
}