Tip
We are advancing the integration of SyzGPT modules into Syzkaller. Please refer to Discussion for progress.
Note
We are still updating this project and formatting the documentation for Artifact Evaluation.
This is the implementation of the paper "Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG". For more details about SyzGPT, please refer to our paper from ISSTA'25. We also provide a README_for_review, originally hosted in an anonymous repository to aid reviewers, for better understanding.
Quick Glance: SyzGPT is an LLM-assisted kernel fuzzing framework that automatically generates effective seeds for low frequency syscalls (LFS). The Linux kernel provides over 360 system calls, and Syzkaller defines more than 4400 specialized calls that encapsulate specific usages of these system calls. However, many of these calls (the LFS) are hard to cover consistently due to complex dependencies and mutation uncertainty, leaving part of the testing space unexplored. SyzGPT automatically extracts and augments syscall dependencies for these LFS and generates effective seeds with dependency-based RAG (DRAG). Our evaluation shows that SyzGPT improves overall code coverage and syscall coverage, and finds LFS-induced vulnerabilities. We also release a toy model 🤗CodeLlama-syz-toy specialized for Syz-programs.
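The dependency-based RAG (DRAG) idea can be pictured with a small sketch: index the corpus by the syscalls each program contains, then, for a target LFS, retrieve programs that exercise its dependencies to use as generation context. The following is a hypothetical Python illustration only (the corpus format, dependency table, and function names are invented and are not SyzGPT's actual code):

```python
from collections import defaultdict

def build_reverse_index(corpus):
    """Map each syscall name to the set of programs containing it."""
    index = defaultdict(set)
    for prog_id, calls in corpus.items():
        for call in calls:
            index[call].add(prog_id)
    return index

def retrieve_context(target, deps, index):
    """Retrieve corpus programs that exercise the target or its dependencies."""
    wanted = set(deps.get(target, [])) | {target}
    hits = set()
    for call in wanted:
        hits |= index.get(call, set())
    return sorted(hits)

# Invented toy corpus: program id -> list of specialized calls.
corpus = {
    "p1": ["open", "read", "close"],
    "p2": ["socket", "bind", "listen"],
    "p3": ["open", "ioctl$DRM_IOCTL_VERSION"],
}
# Invented dependency entry: the ioctl variant needs an opened fd.
deps = {"ioctl$DRM_IOCTL_VERSION": ["open"]}
index = build_reverse_index(corpus)
print(retrieve_context("ioctl$DRM_IOCTL_VERSION", deps, index))  # ['p1', 'p3']
```

The retrieved programs then serve as few-shot context for the LLM when generating a seed targeting the LFS.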
We also generate a wiki page for SyzGPT through DeepWiki.
Project Structure
____ ____ ____ _____
/ ___| _ _ ____ / ___|| _ \|_ _|
\___ \ | | | ||_ / | | _|| |_) | | |
___) || |_| | / /_ | |_| || __/ | |
|____/ \__, //____| \____||_| |_|
|___/
.
├── analyzer/ # Corpus Analyzer
├── crawler/ # Crawler for Linux Manpages
├── data/ # Data used in SyzGPT
├── docs/ # Documentations of SyzGPT (apart from the READMEs)
├── examples/ # Examples for better understanding
├── experiments/ # Experiments
├── extractor/ # Two-Level Syscall Dependency Extractor
├── fine-tune/ # Fine-tuning LLM specialized for Syz-programs
├── fuzzer/ # SyzGPT-fuzzer
├── generator/ # SyzGPT-generator
├── scripts/ # Some useful scripts
...
├── config.py # Configs, need to be copied as private_config.py
└── syzgpt_generator.py # Main entry of SyzGPT-generator
- Hardware
- CPU: 16+ Cores
- Memory: 64GB+
- Storage: 256GB+
- GPU: None
- Software
- OS: Ubuntu 20.04+ (20.04, 22.04 tested)
- Compiler: GCC 11+ (11.4, 12.3 tested) or Clang 15+ (15.0.6, 16.0.6, 17.0.6 tested)
- Python: 3.8+ (3.8-3.11 tested)
- Syzkaller-runnable Dependencies
- LLM API Access
- For LLM fine-tuning and serving (Optional)
- GPU: GPU with 48GB+ VRAM (1xA800, 2xRTX 3090 tested)
- Software: torch 2.0+ (2.0.1, 2.3.0 tested)
- LLM Serving (fastchat tested)
We have released two Docker images: qgrain/syzgpt:pretest (early access version) and qgrain/syzgpt:full (with full functionality and the evaluation benchmark). To understand how we build these Docker images, please refer to this repo: QGrain/kernel-fuzz-docker-images.
Setup with qgrain/syzgpt:pretest
To help researchers who want to test our work as soon as possible, we release qgrain/syzgpt:pretest as an early-bird image, which includes a runnable SyzGPT-generator and SyzGPT-fuzzer (experimental configurations and code are not included).
- Create the container:
docker run -itd --name syzgpt_pretest --privileged=true qgrain/syzgpt:pretest
# You will find SyzGPT located at /root/SyzGPT and SyzGPT-fuzzer at /root/fuzzers/SyzGPT-fuzzer
- Synchronize the SyzGPT repository (always do this first):
cd /root/SyzGPT && git pull
- Minor steps:
cd /root/fuzzers/SyzGPT-fuzzer && make -j32
# You need to build the fuzzer, as we did not include the binaries in the image to keep it small.
workon syzgpt
# Enter the python virtual environment with the dependencies required by SyzGPT-generator
Setup with qgrain/syzgpt:full
If you only want to use SyzGPT for fuzzing directly, you can choose the smaller image syzgpt:pretest instead.
docker run -itd --name syzgpt_full --privileged=true qgrain/syzgpt:full
# You will find everything is OK as expected, including the functionalities and experiments of SyzGPT.
cd /root/SyzGPT && git pull
# Please always synchronize SyzGPT repository first.
Click to view
You can also set up SyzGPT from scratch on Ubuntu 20.04/22.04, or base it on our image qgrain/kernel-fuzz:v1.
docker run -itd --name syzgpt_from_scratch --privileged=true qgrain/kernel-fuzz:v1
- Clone this project:
# We recommend cloning at /root/SyzGPT, so that the following instructions match the paths.
# If you are a normal user on a physical machine, feel free to clone it anywhere convenient.
git clone https://github.com/QGrain/SyzGPT.git
- Set up SyzGPT-generator: please refer to the Setup section in generator/README.md
- Set up SyzGPT-fuzzer: please refer to fuzzer/README.md
SyzGPT can serve as a standalone seed generator through SyzGPT-generator (Section 2.1). It can also cooperate with SyzGPT-fuzzer for continuous kernel fuzzing (Section 2.2).
We have released the augmented syscall dependencies at data/dependencies, so you can run SyzGPT directly without extracting syscall dependencies yourself. You can also extract them on your own (Section 2.3).
For any questions about using SyzGPT, you may refer to Troubleshooting or feel free to raise an issue.
We provide minimal running instructions here. For detailed usage, please refer to generator/README.md.
Prerequisites:
- A corpus generated by a local Syzkaller or another existing fuzzer, which serves as the corpus knowledge base. We provide corpus_24h.db at data/corpus_24h.db for reproduction.
- A file containing the target syscalls for which you want to generate seeds. We provide sampled_variants.txt at data/sampled_variants.txt for reproduction.
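If you build your own target file, a plausible format is one specialized call per line. Note this format is an assumption for illustration; check data/sampled_variants.txt for the authoritative format. A minimal Python sketch:

```python
from pathlib import Path

# ASSUMPTION: one (specialized) syscall per line, like data/sampled_variants.txt.
targets = ["ioctl$DRM_IOCTL_VERSION", "setsockopt$inet6_tcp_TCP_MD5SIG"]
path = Path("my_targets.txt")  # hypothetical file name
path.write_text("\n".join(targets) + "\n")

# Read the file back, skipping blank lines.
loaded = [line.strip() for line in path.read_text().splitlines() if line.strip()]
print(loaded == targets)  # True
```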
cd /root/SyzGPT && workon syzgpt
# (1) Use official OpenAI api, which will load api_key, llm_model, ... from private_config.py.
python syzgpt_generator.py -s /root/fuzzers/SyzGPT-fuzzer -w WORKDIR -e data/corpus_24h.db -f data/sampled_variants.txt
# (2) Use third party api
python syzgpt_generator.py -M gpt-3.5-turbo-16k -u https://api.expansion.chat/v1/ -k API_KEY -s /root/fuzzers/SyzGPT-fuzzer -w WORKDIR -e data/corpus_24h.db -f data/sampled_variants.txt
# (3) Use local LLMs
python syzgpt_generator.py -M CodeLlama-syz-toy -u https://IP:PORT/v1/ -s /root/fuzzers/SyzGPT-fuzzer -w WORKDIR -e data/corpus_24h.db -f data/sampled_variants.txt
Click to view: Explanation of Parameters
- `-s`: path to the SyzGPT-fuzzer; must be specified.
- `-w`: output the generated results and logs to `WORKDIR` (every task should have its own `WORKDIR`); must be specified.
- `-e`: path to the external corpus; only necessary for one-time seed generation.
- `-f`: path to the file containing the target syscall list; only needed for one-time seed generation.
- `-c`: alternatively, specify the target syscalls manually via `-c CALL1 CALL2 ...` instead of `-f`.
- `-M`: model name; used with a third-party API or a local LLM.
- `-u`: base_url of the API address or the locally hosted LLM.
- `-k`: api_key for a third-party API service.
You will find that the structure of `WORKDIR` looks like:
├── external_corpus/ # external corpus specified through -e
├── generated_corpus/ # generated seeds in Syz-program format (★)
├── generation_history.json # generation history for feedback-guided seed generation
├── query_prompts/ # generation logs including query prompts and results, can be used for fine-tuning.
├── reverse_index.json # reverse index for DRAG
└── target_syscalls/ # generation targets
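As a quick sanity check after a run, you can verify this layout programmatically. The following is a hypothetical helper based on the tree above (not part of SyzGPT itself):

```python
from pathlib import Path
import tempfile

# Expected WORKDIR entries, taken from the tree above.
EXPECTED = [
    "external_corpus", "generated_corpus", "generation_history.json",
    "query_prompts", "reverse_index.json", "target_syscalls",
]

def missing_entries(workdir):
    """Return the expected entries that are absent from workdir."""
    root = Path(workdir)
    return [name for name in EXPECTED if not (root / name).exists()]

# Demo on an empty temporary directory: all six entries are missing.
with tempfile.TemporaryDirectory() as d:
    print(missing_entries(d))
```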
# Repair the seeds
/root/fuzzers/SyzGPT-fuzzer/bin/syz-repair WORKDIR/generated_corpus WORKDIR/generated_corpus_repair
# Purify the seeds
/root/fuzzers/SyzGPT-fuzzer/bin/syz-validator dir WORKDIR/generated_corpus_repair WORKDIR/generated_corpus_repair_valid
# Now you can pack the seeds as corpus for fuzzing or some tasks
/root/fuzzers/SyzGPT-fuzzer/bin/syz-db pack WORKDIR/generated_corpus_repair_valid WORKDIR/SyzGPT_sampled_corpus.db
# 1. Evaluate the Syntax Valid Rate (SVR)
/root/fuzzers/SyzGPT-fuzzer/bin/syz-validator dir WORKDIR/generated_corpus
# 2. Evaluate the Syntax Valid Rate (SVR) after repair
/root/fuzzers/SyzGPT-fuzzer/bin/syz-validator dir WORKDIR/generated_corpus_repair
# 3. Evaluate the N_avg and L_avg
/root/fuzzers/SyzGPT-fuzzer/bin/syz-validator dir WORKDIR/generated_corpus_repair WORKDIR/generated_corpus_repair_valid
python /root/SyzGPT/analyzer/corpus_analyzer.py analyze -d WORKDIR/generated_corpus_repair_valid
# 4. Evaluate the Context Effective Rate (CER)
cp -r WORKDIR/generated_corpus_repair_valid WORKDIR/generated_corpus_repair_valid_rev
python /root/SyzGPT/experiments/performance/reverse_prog.py WORKDIR/generated_corpus_repair_valid_rev
syzqemuctl create image-eval
syzqemuctl run image-eval
syzqemuctl exec image-eval "mkdir /root/evaluate_cer"
syzqemuctl cp WORKDIR/generated_corpus_repair_valid_rev image-eval:/root/evaluate_cer/
syzqemuctl cp /root/fuzzers/SyzGPT-fuzzer/bin/linux_amd64/syz-execprog image-eval:/root/
syzqemuctl cp /root/fuzzers/SyzGPT-fuzzer/bin/linux_amd64/syz-executor image-eval:/root/
syzqemuctl exec image-eval "/root/syz-execprog -semantic -progdir /root/evaluate_cer/generated_corpus_repair_valid_rev -coverfile /root/evaluate_cer/out/CER"
syzqemuctl cp image-eval:/root/evaluate_cer/CER_Evaluation_Results ./
# This is a bit complicated; we recommend using our all-in-one script `experiments/performance/evaluate.py`
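For reference, the Syntax Valid Rate is, as the name suggests, the share of generated programs that pass the validator. The definition below is our reading of the metric name, not code taken from SyzGPT:

```python
def syntax_valid_rate(n_valid, n_total):
    """SVR in percent: fraction of generated Syz-programs that pass
    syz-validator. Returns 0.0 for an empty corpus."""
    return 0.0 if n_total == 0 else 100.0 * n_valid / n_total

# e.g., 87 of 100 generated programs pass validation
print(f"SVR = {syntax_valid_rate(87, 100):.1f}%")  # SVR = 87.0%
```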
Note
You can run the all-in-one script to reproduce the seed generation experiments in Table 2 of the paper.
python /root/SyzGPT/experiments/performance/evaluate.py -i /root/images -n image-eval -k /root/kernels/linux-6.6.12 -s /root/fuzzers/SyzGPT-fuzzer -w WORKDIR
# The evaluation results are written to WORKDIR/evaluate.log
We provide minimal running instructions here. For detailed usage, please refer to generator/README.md and fuzzer/README.md.
# cd /root/fuzzers/SyzGPT-fuzzer
# e.g., WORKDIR=workdir/v6-1/SyzGPT
taskset -c 8-15 ./bin/syz-manager -config cfgdir/SyzGPT.cfg -bench benchdir/SyzGPT.log -statcall -backup 24h -enrich WORKDIR/generated_corpus -period 1h -repair
Click to view: Explanation of Parameters
(refer to fuzzer/README.md for more details)
- `-statcall`: enable syscall tracking during fuzzing.
- `-backup`: back up rawcover, corpus.db, CoveredCalls, and crashes every 24h.
- `-enrich`: load the enriched seeds from `WORKDIR/generated_corpus` every `INTERVAL` (1h).
- `-period`: the `INTERVAL` for loading enriched seeds.
- `-repair`: enable the program repair implemented in SyzGPT-fuzzer.
# cd the root of this project
python syzgpt_generator.py -s /root/fuzzers/SyzGPT-fuzzer -w /root/fuzzers/SyzGPT-fuzzer/workdir/v6-1/SyzGPT/generated_corpus -D 1h -T 1h -S 24h -m 100 -P 10
# similarly, you can also use another api service
# or a locally hosted LLM with -M, -u, -k (introduced in Section 2.1)
Click to view: Explanation of Parameters
(refer to generator/README.md for more details)
- `-s` and `-w` have been introduced above.
- `-D`: an empirical delay (1h) before the generator starts to work, which leaves the fuzzer to explore on its own by default.
- `-T`: generate seeds every `INTERVAL` (1h); must stay in step with `-enrich` in the fuzzer.
- `-S`: stop generating after 24h.
- `-m`: maximum number of generated seeds, 100 here.
- `-P`: probability of feedback-guided re-generation for failed seeds, 10% here.
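The interplay of `-D`, `-T`, and `-S` can be sketched as a simple schedule. This is a hypothetical model of the timing; the real generator's scheduling logic may differ:

```python
def generation_ticks(delay, interval, stop_after):
    """Times (relative to fuzzing start) at which the generator emits
    seeds, mirroring -D (initial delay), -T (interval), -S (stop)."""
    t, ticks = delay, []
    while t < stop_after:
        ticks.append(t)
        t += interval
    return ticks

# -D 1h -T 1h -S 24h: one generation round per hour, hours 1 through 23
print(generation_ticks(1, 1, 24))
```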
- Plot the growth of metrics (coverage, syscalls, new inputs, ...)
# Normal Usage of plotting the curves of metrics over time:
# 1. Plot single logs of each fuzzers to compare:
python bench_parser.py -b logA logB .. -k coverage syscalls crashes 'crash types' -l fuzzerA fuzzerB ... -t 24h -p -o ../plots/ -T PLOT_TITLE
# 2. Plot average logs of each fuzzers to compare:
python bench_parser.py -b logA1 logA2 logA3 logB1 logB2 logB3 .. -a 3 -k coverage syscalls crashes 'crash types' -l fuzzerA fuzzerB ... -t 24h -p -o ../plots/ -T PLOT_TITLE
- Visualize the crashes
python /root/SyzGPT/scripts/result_parser.py -D /path/to/WORKDIR1/crashes /path/to/WORKDIR2/crashes ... -c
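The `-a 3` option of bench_parser.py above presumably averages every three consecutive logs point-wise, producing one curve per fuzzer. This is a hypothetical sketch of that grouping; bench_parser.py's actual behavior may differ:

```python
def average_groups(series_list, group):
    """Average every `group` consecutive series point-wise (sketch of -a)."""
    out = []
    for i in range(0, len(series_list), group):
        chunk = series_list[i:i + group]
        # zip(*chunk) pairs up the values at each time point across runs
        out.append([sum(vals) / len(chunk) for vals in zip(*chunk)])
    return out

# Invented coverage samples: three runs per fuzzer, two time points each.
logs = [[10, 20], [20, 40], [30, 60],   # fuzzerA, runs 1-3
        [5, 10], [15, 30], [10, 20]]    # fuzzerB, runs 1-3
print(average_groups(logs, 3))  # [[20.0, 40.0], [10.0, 20.0]]
```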
We extract the specialized-call-level (syz-level) dependencies through resource-based static analysis on Syzlang descriptions.
- Extract the defined syscalls of the fuzzer (different fuzzers have different builtin syscalls, e.g., KernelGPT):
# it will generate debug.log at ~/SyzGPT/data/debug.log and generate builtin_syscalls* at -o
cd extractor
python parse_builtin_syscalls.py -s ~/fuzzers/SyzGPT-fuzzer -o ../data/
- Extract syz-level dependencies:
# it will generate syz-level dependencies at -o
python extract_syz_dependencies.py -b ../data/builtin_syscalls.json -o ../data/dependencies/syz_level/Syzkaller_deps/
- Crawl the manpage documentation for each syscall:
# it will download the docs at crawler/man_docs/SYSCALL.json
cd crawler
python get_syscall_doc.py
- Extract call-level dependencies (NOTE: it will interact with LLM and cost tokens):
# -d for dumb mode, recommended
cd extractor
python extract_call_dependencies.py -f ../crawler/syscall_from_manpage.txt -d
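The resource-based static analysis behind the syz-level extraction can be pictured with a toy model: a specialized call that consumes a resource depends on the calls that produce it. The signatures below are invented for illustration; real signatures come from Syzlang descriptions:

```python
# Invented Syzlang-like signatures: which resources each call
# produces (returns/outputs) and consumes (takes as input).
SIGS = {
    "open":   {"produces": ["fd"],   "consumes": []},
    "read":   {"produces": [],       "consumes": ["fd"]},
    "socket": {"produces": ["sock"], "consumes": []},
    "bind":   {"produces": [],       "consumes": ["sock"]},
}

def syz_level_deps(sigs):
    """Derive call -> sorted list of calls it depends on, by matching
    consumed resources against producers."""
    producers = {}
    for call, sig in sigs.items():
        for res in sig["produces"]:
            producers.setdefault(res, set()).add(call)
    deps = {}
    for call, sig in sigs.items():
        found = set()
        for res in sig["consumes"]:
            found |= producers.get(res, set())
        deps[call] = sorted(found)
    return deps

print(syz_level_deps(SIGS))
# {'open': [], 'read': ['open'], 'socket': [], 'bind': ['socket']}
```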
Note
The usage of the tools and scripts can be found in scripts/README.md.
We have released a toy version of CodeLlama-syz on Hugging Face. For more details, please refer to fine-tune/README.md.
Our approach can be migrated to other kernel fuzzing frameworks, including Syzkaller-like (ACTOR, ECG, KernelGPT, ...) and Healer-like (Healer and MOCK) fuzzers. It can also be applied to other tasks, such as directed kernel fuzzing.
We have demonstrated the migration on KernelGPT, please refer to the implementation instruction in experiments/KernelGPT/README.md.
Migrating to Healer-like fuzzers is as simple as migrating to Syzkaller-like ones, as long as you are familiar with Rust.
We also prepare instructions for migrating to MOCK; please refer to the implementation instructions in experiments/MOCK/README.md.
Our approach is also applicable in the directed kernel fuzzing domain, as long as the target syscalls required by the directed task are available. Please refer to experiments/directed/README.md.
Thanks to Zhiyu Zhang (@QGrain) and Longxing Li (@x0v0l) for their valuable contributions to this project.
In case you would like to cite SyzGPT, you may use the following BibTeX entry:
@article{zhang2025unlocking,
title={Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG},
author={Zhang, Zhiyu and Li, Longxing and Liang, Ruigang and Chen, Kai},
journal={Proceedings of the ACM on Software Engineering},
volume={2},
number={ISSTA},
pages={848--870},
year={2025},
publisher={ACM New York, NY, USA}
}