Boyi Wei¹\*, Benedikt Stroebl¹\*, Jiacen Xu², Joie Zhang¹, Zhou Li², Peter Henderson¹

\*Equal Contribution

¹Princeton University, ²University of California, Irvine
Use the following command to clone the repo and the dataset:

```bash
git clone --recurse-submodules git@github.com:boyiwei/Dynamic-Risk-Assessment.git
```

Run the following command to install the dependencies:

```bash
pip install -r requirements.txt
```

Note: We follow the same pipeline from S1 for self-training. Therefore, when doing self-training, please create a new environment and install the dependencies in `self_training/requirements.txt`:

```bash
pip install -r self_training/requirements.txt
```

Run the following command to initialize the agents, which will build the Docker image:

```bash
bash setup.sh
```

The overall workflow is as follows:
- Make sure you have a model hosted on the host machine that can be queried via the vLLM API (see the example after this list). See https://github.com/benediktstroebl/della-inference for more details.
- Run the scripts starting with `launch_evaluation` in the `scripts/` directory to get the raw log files.
- Run `analysis/grade_benchmark.py` to get the pass@k score and confidence interval.
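For reference, hosting the model with vLLM's OpenAI-compatible server typically looks like the sketch below. The model path, port, and parallelism settings are assumptions; adapt them to your own cluster (see the della-inference repo linked above).

```bash
# Minimal sketch (not the exact serving setup used in the paper):
# serve Qwen2.5-Coder-32B-Instruct behind vLLM's OpenAI-compatible API.
# Adjust the model path, port, and --tensor-parallel-size for your hardware.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
    --port 8000 \
    --tensor-parallel-size 4
```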
We provide a script for running repeated sampling and increasing the max rounds of interactions:

```bash
bash scripts/launch_evaluation_base.sh
```

Key arguments:
- `N`: the max rounds of interactions ($N$ in our paper). We set it to 20 by default.
- `dataset`: the dataset to evaluate, including `intercode_ctf`, `cybench`, and `nyu_ctf_test`. We set it to `intercode_ctf` by default.
- `task_mask`: specifies the task mask for evaluation, only applicable to the `intercode_ctf` dataset. We set it to `analysis/test_tasks.txt` by default, which means we only evaluate on the test set. To evaluate on the development set, set it to `analysis/train_tasks.txt`.
- `model_name`: the name of the model to evaluate. We set it to `Qwen2.5-Coder-32B-Instruct` by default.
- `parallelism`: the number of parallel processes to run. We set it to 10 by default.
- `i`: the repetition id. By setting `i` from 1 to 12, we repeat our experiments 12 times.
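The snippet below is a minimal sketch of how these arguments might be supplied when repeating the experiment 12 times. It assumes `launch_evaluation_base.sh` picks the values up from the environment; if the script defines them internally, edit those variables in the script instead.

```bash
# Minimal sketch, assuming the script reads its key arguments from the
# environment (otherwise, set them inside scripts/launch_evaluation_base.sh).
for i in $(seq 1 12); do
    N=20 dataset=intercode_ctf task_mask=analysis/test_tasks.txt \
    model_name=Qwen2.5-Coder-32B-Instruct parallelism=10 i=$i \
    bash scripts/launch_evaluation_base.sh
done
```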
After running the script, we will have the raw log files in the `logs/` directory. By running

```bash
python analysis/grade_benchmark.py --task_name $benchmark --N $N --output_file "acc_repeated_sampling.csv" --k0 12
```

we can get the pass@k score (k=1-12) and confidence interval in the `acc_repeated_sampling.csv` file. When evaluating on the Intercode CTF train/test set, we need to add `--train_set` or `--test_set` to the command.
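For reference, the standard unbiased pass@k estimator (Chen et al., 2021), computed per task from $n$ total rollouts with $c$ successes and then averaged over tasks, is:

$$
\text{pass@}k = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
$$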
We provide a script for running iterative prompt refinement:
```bash
bash scripts/launch_evaluation_iter_prompt_refinement.sh
```

Key arguments:

- `k0`: the number of rollouts. We set it to 12 by default.
- `iter_prompt_round`: the number of iterative prompt refinement rounds. By setting it from 1 to 20, we do 20 prompt refinement iterations for a single rollout.

Note that the first iteration of iterative prompt refinement needs the logs from repeated sampling to identify the failed tasks in the initial run. Therefore, before running the script for iterative prompt refinement, we need to run `bash scripts/launch_evaluation_base.sh` first.
After running the script for iterative prompt refinement, we will have the raw log files in the `logs/` directory. By running

```bash
python analysis/grade_benchmark.py --iter_prompt --k0 $k0 --test_set --output_file "iter_prompt_refinement.csv"
```

we can get the pass@k score (k=1-20) and confidence interval in the `iter_prompt_refinement.csv` file. The key difference here is that we need to add `--iter_prompt` to the command.
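The sketch below shows one way to chain the 20 refinement rounds after the initial repeated-sampling run. It assumes `iter_prompt_round` and `k0` can be overridden from the environment; if not, set them inside the script.

```bash
# Minimal sketch, assuming environment-variable overrides for the script's
# key arguments (otherwise edit them inside the script).
bash scripts/launch_evaluation_base.sh   # needed first: provides the initial failure logs
for iter_prompt_round in $(seq 1 20); do
    k0=12 iter_prompt_round=$iter_prompt_round \
    bash scripts/launch_evaluation_iter_prompt_refinement.sh
done
```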
Before we run evaluation on the self-trained checkpoints, we first need to fine-tune the model on its generated trajectories. By running

```bash
sbatch scripts/launch_self_training.slurm
```

we can get the self-trained checkpoints.

Key arguments:

- `epochs`: the number of epochs to train. We set it to 5 by default.
- `lr`: the learning rate. We set it to 1e-5 by default.
- `batch_size`: the batch size. We set it to 16 by default.
- `weight_decay`: the weight decay. We set it to 1e-4 by default.
- `train_dataset_name`: the name of the training dataset. We set it to `ctf_intercode_nyuagent_singleturn_train` by default, which consists of the successful trajectories collected from the development set of Intercode CTF.
- `output_dir`: the output directory to save the self-trained checkpoints.
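If the Slurm script accepts overrides via the environment, the hyperparameters above could be passed with `sbatch --export`; the following is only a sketch under that assumption, and the `output_dir` path is a placeholder.

```bash
# Minimal sketch, assuming launch_self_training.slurm reads these values from
# the environment; if they are hard-coded, edit the script instead.
# The output_dir path is a placeholder.
sbatch --export=ALL,epochs=5,lr=1e-5,batch_size=16,weight_decay=1e-4,\
train_dataset_name=ctf_intercode_nyuagent_singleturn_train,\
output_dir=checkpoints/Qwen2.5-Coder-32B-Instruct-ft \
    scripts/launch_self_training.slurm
```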
After obtaining the self-trained checkpoints and hosting them on the host machine, we can run the following command to evaluate them:

```bash
bash scripts/launch_evaluation_ft.sh
```

Key arguments:

- `model_name`: the name of the model to evaluate. This depends on the model name used on the host machine; by default we use `Qwen2.5-Coder-32B-Instruct-ft` for the self-trained model.
- `lr`, `ft_epoch`, `ft_dataset`, `ft_paradigm`: the fine-tuning parameters used for the checkpoints.
- `N`: the max rounds of interactions. We set it to 20 by default.
- `dataset`: the dataset to evaluate. We set it to `intercode_ctf` by default.
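As with the base model, the checkpoint has to be reachable through the vLLM API under the expected name; the sketch below assumes vLLM's `--served-model-name` flag and uses a placeholder checkpoint path.

```bash
# Minimal sketch: host the self-trained checkpoint under the name the
# evaluation expects. The checkpoint path is a placeholder; point it at the
# output_dir used during self-training.
vllm serve /path/to/output_dir \
    --served-model-name Qwen2.5-Coder-32B-Instruct-ft \
    --port 8000 \
    --tensor-parallel-size 4
```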
Similarly, after having the raw log files, we can run

```bash
python analysis/grade_benchmark.py --task_name $benchmark --N $N --model_name Qwen2.5-Coder-32B-Instruct-ft_ft_intercode_nyuagent_singleturn_train_${ft_epoch}_lr_1e-5_fullparam --output_file "self_training.csv" --test_set --k0 12
```

to get the pass@k score and confidence interval in the `self_training.csv` file. Here `ft_epoch` is the number of epochs used in the self-training.
Simply run the following command to refine the agent's workflow on the development set of Intercode CTF:

```bash
bash launch_search_base_iter_workflow_refinement.sh
```

Key arguments:

- `iteration`: the number of iterative workflow refinement rounds. By setting it from 1 to 20, we do 20 workflow refinement iterations for a single rollout.
After obtaining the collection of refined workflows, we can run the following commands to evaluate them on the test set of Intercode CTF:

```bash
bash launch_evaluation_base_iter_workflow_refinement.sh
python grade_benchmark.py --model_name "Qwen2.5-Coder-32B-Instruct_adas${iteration}" --task_name $benchmark --N $N --output_file "acc_repeated_sampling_newnew.csv" --test_set --k0 5
```

Key arguments:

- `iteration`: the iteration id to evaluate. We only evaluate the workflows that perform better than the baseline on the development set.
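For example, grading a handful of refined workflows that beat the baseline might look like the sketch below; the iteration ids are placeholders, and `--task_name intercode_ctf --N 20` simply fills in the defaults mentioned above.

```bash
# Minimal sketch: grade selected refined-workflow iterations.
# The iteration ids below are placeholders; substitute the ones that beat the
# baseline on the development set.
for iteration in 3 7 12; do
    python grade_benchmark.py \
        --model_name "Qwen2.5-Coder-32B-Instruct_adas${iteration}" \
        --task_name intercode_ctf --N 20 \
        --output_file "acc_repeated_sampling_newnew.csv" \
        --test_set --k0 5
done
```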
If you find our work useful, please cite it 😊

```bibtex
@misc{wei2025dynamicriskassessmentsoffensive,
      title={Dynamic Risk Assessments for Offensive Cybersecurity Agents},
      author={Boyi Wei and Benedikt Stroebl and Jiacen Xu and Joie Zhang and Zhou Li and Peter Henderson},
      year={2025},
      eprint={2505.18384},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2505.18384},
}
```

