guide : running gpt-oss with llama.cpp #15396
Replies: 48 comments · 100 replies
-
I can provide some numbers for the AMD part of the guide. My hardware is an RX 7900 XT (20GB VRAM) + Ryzen 9 5900X + 32GB of RAM, running on the latest Arch Linux with a locally built llama.cpp version 6194 (3007baf), built with ROCm 6.4.1-1 (from the official Arch repo). I pulled the gpt-oss-20b repository and converted it to GGUF with the conversion script.

The 7900 XT can load the full 20B model with full context without offloading MoE layers to the CPU (although barely, because it will fill up the whole VRAM), by running:

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa

With that, I get generation speeds (as reported by the llama.cpp webui) of ~94 tokens/second, slowly going down as the context fills up. I've also tested whether setting K/V cache quantization would help with model size or performance, but the result was... bad: performance was halved and the CPU got involved - is this because of the MXFP4 format of gpt-oss?

I'd also like to note that my PC likes to hang when I fill up my VRAM to the brim, so I've also checked how gpt-oss-20b behaves when I offload MoE layers to the CPU. When running with all MoE layers on the CPU, as below:

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -cmoe

my GPU VRAM usage (as reported by btop) is around 10GB, and RAM usage went up only ~2GB. However, performance took a major 80% hit: generation speed is now ~20 tok/s, with the CPU taking most of the load. If you have a better CPU and faster RAM (I'm still running dual-channel DDR4 @ 3200MHz CL16, mind you), you will probably get better results. I wonder how X3Ds behave in that case...

I assume that gpt-oss-20b has 24 MoE layers, so let's see how it behaves when I load only, let's say, 4 onto the CPU:

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 4

VRAM is at 18GB (previously it was at 19, as reported by btop, so there's a decrease), RAM usage went up by around 1.5GB, and generation speed is ~60 tok/s. Neat, this is usable. How about 8 layers?

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 8

In that case, I get 16GB VRAM usage, a ~1.5GB RAM bump as previously, and generation speed went down to 38 tokens/s. Still pretty usable. How about 16 layers?

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 16

VRAM: 13GB, RAM: as previously, not more than 2GB, generation speed: 25-27 tok/s - this is getting pretty bad.

As mentioned before, your results may vary. I'm not running current-gen top-tier hardware, and IIRC the largest performance bottleneck will be the RAM/PCIe link speed anyway. I'm pretty curious to see what the performance of this GPU is on a more recent platform, especially with an X3D CPU.
-
I had issues with a higher batch/ubatch size than the default, but I'm not seeing that problem anymore, so it was probably user error on my end. I believe you are likely hitting the case where the model needs the CoT from the past tool call but the client isn't sending it in, or there is a mismatch in the reasoning field. That is an open issue across all clients and inference servers/providers with GPT-OSS. If you can collect any dumps of this happening, I'm happy to dig in further.
-
Sure @aldehir, here's my config for this model, under "providers":

"providers": {
"llamacpp": {
"type": "openai",
"base_url": "https://steelph0enix.pc:51536/v1",
"name": "Llama.cpp",
"id": "llamacpp",
"models": [
{
"id": "gpt-oss-20b.auto",
"name": "GPT-OSS 20B",
"context_window": 131072,
"default_max_tokens": 51200,
"has_reasoning_efforts": true,
"can_reason": true,
"supports_attachments": false,
"default_reasoning_effort": "high",
"cost_per_1m_in": 0,
"cost_per_1m_in_cached": 0,
"cost_per_1m_out": 0,
"cost_per_1m_out_cached": 0
}
]
}
}
// ...

My llama-server invocation:

llama-server --ctx-size 0 --model gpt-oss-20b.auto.gguf --alias "gpt-oss-20b.auto" --jinja

I keep most of my llama-server settings in env vars, as follows:

export LLAMA_ARG_HOST="0.0.0.0"
export LLAMA_ARG_PORT="51536"
export LLAMA_ARG_BATCH=2048
export LLAMA_ARG_UBATCH=2048
export LLAMA_ARG_SWA_FULL=false
export LLAMA_ARG_KV_SPLIT=false
export LLAMA_SET_ROWS=1 # for ARG_KV_SPLIT=false to work
export LLAMA_ARG_FLASH_ATTN=true
export LLAMA_ARG_MLOCK=true
export LLAMA_ARG_NO_MMAP=false
export LLAMA_ARG_N_GPU_LAYERS=999
export LLAMA_OFFLINE=true
export LLAMA_ARG_ENDPOINT_SLOTS=true
export LLAMA_ARG_ENDPOINT_PROPS=true

I've opened my test project (a Rust app) with gpt-oss-20b as the chosen model in Crush and told it to initialize the project... and it seems to work just fine now!

I've tested Crush back on 0.6.0 (or 0.6.1) with gpt-oss, if not on 0.5.x, and I definitely had issues (for example, the chat description below the CRUSH logo contained gpt-oss chat template tags...), so you must've fixed it already - I just hadn't noticed :) Thanks! If you want, you may add my piece of config to the guide.
-
Yes, I think I did notice that on some other models; I've been mostly working with Qwen... If this happens again, how can I get some more logs/info? Oh, and one potential "issue" I've just noticed - I've set my reasoning in the model config to "high", but...
-
I can't say, but it likely has no effect, since llama-server only respects the reasoning effort it was started with.

I usually run mitmproxy in the background, but enabling verbose logging and searching for parse errors in the server log will likely show the root of the problem - if there is one.
-
btw @ggerganov I think you made a mistake labeling my test results - I don't have a Mac, and they sure don't use AMD GPUs anymore 😄
-
When starting a llama-server command, you can change the default sampling and reasoning settings like so:
The problem I see is that the command above does not override llama.cpp's default min-p, so it still applies.
So the above command is actually equivalent to one that also passes min-p 0.1, which seems quite a bit different from the actual recommendation from OpenAI. Notably, "min-p 0.1" will prune a lot of low-probability tokens, whereas the OpenAI recommendation is basically to follow the model's output probabilities. If you look at a lot of guides and settings for other SOTA LLMs, they all recommend min-p 0.01 or 0.00. Should the command line be changed to also pass --min-p 0.0?
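For reference, a command along those lines might look like this. It is illustrative only - the model, context and flash-attention flags are taken from the guide's general examples, not from the original post:

```bash
# Illustrative sketch: OpenAI-recommended sampling with min-p explicitly disabled
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -fa \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
```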
-
Regardless of the OpenAI recommendation, I still think it's a good idea to filter low-probability tokens (for example with Top-K or Min-P). For now, I've updated the guide with the following paragraph:
We can revisit if we determine that sampling from the full vocab is actually important. |
-
Not sure what you mean. The support was merged in #13264 |
-
Yes, that seems quite sensible. Note that the default will have both
Looks like a bug crept in there, I'll file an issue.
Yes, and if you put it last like llama.cpp does by default, you don't have some of the key problems that sampler is supposed to fix 😀 |
-
Using min-p 0.0 causes significant performance losses: from 57 tokens per second at min-p 0.1 down to 35 tokens per second at min-p 0.0. |
-
That's to be expected. I'd recommend |
-
To better cover the low-end CUDA edge cases, here are some benchmarks for gpt-oss-20B (both MXFP4 and the Unsloth UD quant) on 12GB VRAM. Optimal settings at a 16K context window:
(Leaves about 600 MB VRAM budget, with 67 tok/sec initial generation rate)
(Leaves about 600 MB VRAM budget, with 64 tok/sec initial generation rate) Optimal settings at 32K context window:
(Leaves about 1.2 GB VRAM budget, with 53 tok/sec initial generation rate)
(Leaves about 600 MB VRAM budget, with 56 tok/sec initial generation rate - this is too aggressive and will likely OOM before reaching the context limit.)
Similar results for the latest build: 6195. For completeness, when running with a small context window of 2048 tokens (when everything fits in the GPU, i.e. no offloading), the inference speed reaches 75 tok/sec. This is not irrelevant even for a reasoning model, because there are scenarios where this window is sufficient for one-off tasks.
-
The first section looks like it compares the former on MXFP4 and the latter on UD-Q4_K_XL - is this intentionally not a "controlled" experiment? Or are you showing the optimal settings in your testing for MXFP4 and UD-Q4_K_XL respectively? Just seeking clarification on what the pairs of results per context window size are for. Is this where you got the GGUF files from?
Lastly, how is it that the 32K context with UD-Q4_K_XL leaves 1.2 GB of VRAM free but 16K only leaves about 600 MB? I'm understanding this as the 32K context leaving more free VRAM than the 16K context.
-
Hi @ElliotK-IB,

The MXFP4 was either from Bartowski or ggml-org. The Unsloth URL is correct, I think.

On the second question: the command with 16k context uses a less aggressive regexp (layers 20 to 99), while the 32k context command would offload also the tensors from 10 to 99 (if they existed) of the up-projections of the feed-forward network, thus leaving more VRAM available. (Which is needed for the larger context - about 370 to 400 MB per 16k, if I'm not mistaken.) Looking at the layer structure of the LLM, blanketing everything up to 99 is actually not necessary (up to 29 would have sufficed).

On that note, a more aggressive regexp that leaves a few more of the up-projections in VRAM would be -ot ".(1[7-9]|2[0-4]).ffn_up_exps.=CPU". This offloads from layers 17 to 24, leaving about 600 MB of free VRAM (which would not be enough if the full 32k context window is to be used, so -ot ".(1[6-9]|2[0-4]).ffn_up_exps.=CPU" would be living on the absolute edge).

Bottom line: this somewhat convoluted text shows optimal settings (with enough VRAM left for the chosen context window, except maybe in the last case) for the hardware in question, and suggests that in most cases it is better to offload tensors, not whole expert layers, to the CPU/RAM.
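To make this concrete, here is a sketch of such an invocation - the model path, context size and the exact tensor regex are assumptions to adapt to your own VRAM budget:

```bash
# Keep the whole model on the GPU, but pin the ffn_up_exps tensors of layers 17-24
# to CPU RAM via --override-tensor (-ot), freeing VRAM for the KV cache.
llama-server -m ./gpt-oss-20b-mxfp4.gguf --ctx-size 32768 --jinja -ngl 99 -fa \
  -ot "blk\.(1[7-9]|2[0-4])\.ffn_up_exps\.=CPU"
```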
-
Interesting, I learned about offloading tensors vs layers thanks to your post, glad I asked further. Appreciate the detailed follow-up as well! I'll revisit this and this post I came across on r/LocalLLaMA for my own experiments. |
-
Several people are having issues with tool calling in Cline/Roo Code when using the gpt-oss models.
Passing this grammar in a file with --grammar-file works around it.
Is this something useful to include in the docs?
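For illustration, the workaround would be wired up roughly like this - the grammar file name is a placeholder for the grammar shared above:

```bash
# Assumed invocation: load a custom GBNF grammar so gpt-oss produces the
# XML-style (non-native) tool calls that Cline/Roo Code expect.
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -fa \
  --grammar-file cline.gbnf
```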
-
Could you ELI5 the difference between native and non-native tool calls? Or point me to a reference document. |
-
With native tool calls, the model invokes tools in its own syntax. The inference server is responsible for parsing it and exposing it via the API. For gpt-oss, it generates tool calls in its harmony format through the commentary channel
Other models may place them in tags such as <tool_call>.
With non-native tool calls, the client prompts the model to respond a certain way to perform a tool call. For example, Cline prompts the model to respond in an XML format.
gpt-oss-20b is adamant about producing a native tool call when told it has tools. By constraining the grammar to only produce content and not a tool call, you force it to do a non-native call that Cline/Roo Code expect. Hope that clears things up. |
-
Thanks. Expanded the "Tips" section with a link to this thread. |
-
@aldehir so if I want to use these models with RooCode + Openrouter via Cerebras or Grok, is it on the provider to fix this or is it on the RooCode developers? |
-
@maxiedaniels I doubt the providers will adopt this grammar; it really is more of a hack than a fix. I think the appropriate fix for Cline / Roo Code is to adopt native tool calling. Roo Code has an open PR that may be promising. The 120B model should work (mostly) with Cline/Roo Code. It seems to follow instructions quite well, but may fail occasionally. The 20B seems to always fail, and this grammar helps with that.
-
Are we sure tool calling is currently implemented correctly? OpenAI has released a test script (https://cookbook.openai.com/articles/gpt-oss/verifying-implementations) to test backend implementations, but it's currently failing for me with llama.cpp. Steps to run the test script:
Then edit the providers.ts file (edit the correct details in):
And then lastly start the test: These are the results I obtained:
Expected outcome according to the guide: Could anyone try replicating my findings? If they find the same, what should be done to fix this? |
-
I can't produce one right this moment, but if you duplicate the validResponse line and change |
-
It does look like they may have intended to pass the test if |
-
I believe there is some issue with the current implementation of tool calling. I used the openai/gpt-oss-20b model with both LM Studio (compatibility test: success) and llama-server (compatibility test: failed) [version: 6190 (ae532ea)] and logged the requests and responses.

Attaching the output of both for reference: output-llama-server.log, output-lm-studio.log

If you look at the logs, please let me know if I did something incorrectly or if I can provide any information that can be helpful in solving this. Also, I don't see any tool calls in the output when using LM Studio with the ggml-org/gpt-oss-20b-GGUF model, so I believe they have done some extra handling just for the openai model to support tool calling.
-
@0xshivamagarwal use |
-
@aldehir thanks for pointing it out. Tool calling is working perfectly. P.S. Let me know if I should delete the comments to avoid confusion for anyone seeing them in the future.
-
Maybe not relevant as the models are kinda large... but when I perf-tested CPU inferencing on an Ampere system a while back, --threads set to cores/2 was the sweet spot. Also, regarding --cache-reuse: what are the considerations for using it?
-
@ggerganov I've tested one more Apple Silicon machine. Here are the results for my MBP M1 Max 64GB:
🟢 Benchmarks on M1 Max (64GB) for `gpt-oss-20b`
build: 2e2b22b (6180)
-
@ggerganov MBP M3 Max 128GB:
build: e92734d (6250)
It only ran when I gave the full path to the model file, not with just the model name.
-
The
For example on my M4 Max (36GB) I get this:
build: 8b69686 (6293) |
-
Yes, heat throttling seems to be a really big issue with MacBook Pros, especially with the M3 Max SoC and lots of RAM.
-
A good benchmark data point seems to be this, with a Mac mini M4 Pro 64 GB:
build: 0fd90db (6280) |
-
Getting the following error when attempting to use
|
-
I'm not seeing this problem with Chatbox + @playwright/mcp@latest. Which client are you using?
-
I'm using the official OpenAI Go SDK. I can enable other MCP servers and internal tool implementations according to spec and they work fine. |
-
I'm afraid I'm stumped, as I cannot reproduce this with Chatbox or Crush. Neither produce a JSON schema with From what I can tell, |
-
Adding benchmarks for NVIDIA > 64 GB VRAM.
🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-20b`
llama-bench -m ./gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: f08c4c0 (6199)
🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-120b`
llama-bench -m ./gpt-oss-120b-mxfp4/gpt-oss-120b-mxfp4-00001-of-00003.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: f08c4c0 (6199)
-
Thanks for the data! P.S. I accidentally edited your comment instead of the guide - sorry about that :)
-
It's amazing how the continued performance improvements have added up, even within the short period of the past 3 weeks: 40-50% PP and 15% TG improvement. Incredible work.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 00681df (6445) |
-
-
The |
-
TL;DR I ran it with no Prompt
Generation
switching back for straight compare and adding First yes it definitely burns all GPUs at same time, and slight different loading Prompt
Generation
|
-
For giggles, I built latest vulkan: Same original prompt entry point: Prompt
Generation
is |
-
silly question @ggerganov - is it possible to use ROCm for prompt processing and Vulkan for token gen? |
-
Does it work seamlessly with --tensor-split ? |
-
More benchmarks for NVIDIA > 64 GB. This one is with the workstation edition of the RTX 6000 Pro Blackwell. Crazy how the difference in performance to the 300W Max-Q version is only around 15%. I should start running my GPU at 300W as well to save some energy. 😅
gpt-oss-20b
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: a6d3cfe (6205)
gpt-oss-120b
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: a6d3cfe (6205)
Edit: maybe worth noting that my GPU only draws around 390W out of the maximum 600W while running the benchmark. Probably hints at optimization opportunities.
-
Very interesting that it doesn't come anywhere near max power draw for this workload! For reference, the Max-Q version draws around ~250W during pp, and ~280W during tg for this benchmark, measured in nvtop. |
-
AMD Ryzen 7 7700
|
-
Thanks for the guide!
llama-server -hf ggml-org/gpt-oss-20b-GGUF \
--ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa \
--n-cpu-moe 2 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
Speed: 62.21 tokens per second
The 20B model itself is working surprisingly well. Has anyone managed to connect it to LangChain?
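LangChain (or any other OpenAI-compatible client) should only need the server's base URL, e.g. https://localhost:8080/v1. A quick sanity check of that endpoint can be done with plain curl - the port and payload below are illustrative:

```bash
# Minimal smoke test of the OpenAI-compatible endpoint exposed by llama-server
curl https://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```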
-
This is actually quite interesting. Are you sure about the -ub batch size? I also have 12GB of VRAM on a RTX3060 (12288 MB total with 378 MB reserved for the Linux driver) and I cannot fit in the VRAM with those settings at 32K context (including -ncmoe 2). Need to drop -ub to the default of 512 to have 300MB spare. When I do, I get 60 tokens/sec (Ryzen 7 5700X, 32 GB DDR4 RAM). I am surprised that the difference in hardware generations (extra bandwidth of 5070 vs 3060, DDR5 vs DDR4 memory bandwidth etc.) results in such a small performance difference. Could it be the 6 cores of Ryzen 5 9600X vs. the 8 cores of 5700X? (Assuming that the --threads default of -1 automatically chooses the number of cores of the CPU) Edit: CPU is Ryzen 7 5700X, sorry. In any case, the LLM works well, makes a good case to upgrade RAM to 64GB and load its 120B big brother. |
-
Great results! I'd say the CPU and DDR5 are the bottleneck. When I run the model entirely on VRAM (with a small 2048 context window), I get 128 t/s. |
-
You are right. The extra compute power and bandwidth of the 5070 over the 3060 shine when there are no GPU-to-CPU (RAM) data jumps. Inference is almost twice as fast (128 t/s vs. 75 t/s). You should try the 120B with that much RAM and, if you don't mind, post the results.
-
Not enough memory to run the 120B model reliably. I did manage to start it, though, and got 12t/s. I suppose it would run ok with more RAM. |
-
I bought 32GB of extra RAM and ran the Unsloth UD-Q4_K_XL quant of gpt-oss-120b with -ncmoe 32 (both mmapped and with --no-mmap). In either case I get 15 tok/s generation and 85 tok/s pp2048. The -ncmoe 32 setting leaves about 2.6 GB of VRAM available on the GPU, but RAM is under moderate pressure with --no-mmap (only 2 GB of RAM remains available). For lower RAM pressure, let llama-server mmap the file and you should be OK in most cases (lower your swappiness just in case, or try --mlock), assuming your total model size is 63GB (the aforementioned quant or the MXFP4 from ggml-org). I think this is a reliable way to run the model - a true mid-tier LLM generating at reading speed on a Linux machine with a low-end NVIDIA GPU. Thank you, llama.cpp developers!
-
Some numbers for AMD Ryzen AI 9 HX 370 with Radeon 890M (64GB allocated to VRAM out of 128GB total RAM) using Vulkan: Benchmarks for `gpt-oss-120b`
Benchmarks for `gpt-oss-20b`
Edit: re-ran benchmarks on latest build as of Sep 15 (tldr: almost 2x increase in t/s for prompt processing) Benchmarks for `gpt-oss-120b`
Benchmarks for `gpt-oss-20b`
|
-
On Phoronix benchmarks, the performance was much higher, what could be the difference? |
-
I got this running on an Intel AI PC (Intel Core Ultra 7 258V) with Vulkan! The new GPU driver can now change the GPU memory allocation, so it can easily fit all 25 layers in (shared) VRAM.
Video example: P.S. |
-
I see benchmarks for pre-Hopper hardware. And MXFP4. ELI13, pretty please? Software emulation? |
-
From my understanding, MXFP4 is done using table lookup (there are only 16 values) |
-
Edit: sorry, this is probably only relevant if you are using the /completions API! I wrote a grammar for guided generation of the Harmony response format. It ensures the output more closely follows the specification; for tool calls, it constrains to JSON and applies the JSON grammar. This massively improves generation precision for me, especially for tool calls. https://gist.github.com/dstoc/ab58a1829b3f504c64f08bee5e8c6ea6 I'm still new to GBNF, so there are some issues (I think it prevents the model from saying ...).
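For anyone following along, here is a minimal sketch of how a grammar can be supplied on the raw completion endpoint. The prompt and the grammar below are placeholders, not the contents of the gist:

```bash
# Assumed sketch: raw /completion request with a GBNF grammar passed in the "grammar" field.
# The prompt is a simplified harmony-style prompt, not a full chat-template rendering.
curl https://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "<|start|>user<|message|>Hello<|end|><|start|>assistant", "n_predict": 128, "grammar": "root ::= [^<]*"}'
```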
-
There is already a grammar applied when tools are passed in, and they constrain to the tool schema instead of generic JSON. If you're having issues with that, it's preferable to fix it for everyone. What kind of issues are you seeing? Do you happen to be using |
-
Where can I find that?
Ahh! Good to know! Can I get some pointers to this too? I'm not using the structured APIs or jinja templates, I'm using /completions directly with openai/harmony parser/renderer. |
-
Oh, I see. Yeah, that works if you want to roll your own parsing. I'm curious what the built-in parsing would give you there.
The grammar is dynamically generated from the tools passed in. You won't be able to do that with a grammar file, but you can generate the grammar and pass it in the request (Lines 1447 to 1506 in 4fd1242).
Your grammar is missing the recipient in the role section, e.g. |
-
Yes, prefill is one issue. Another is that tool preambles can't be properly represented in the /v1/chat/completions output format -- they would have to appear as either reasoning or content. I'm also trying to keep parsing at the token level to avoid special tokens appearing where they shouldn't.
Thanks! I think I have all the tools I need to build that out.
Hmm, I was following the examples from: https://cookbook.openai.com/articles/openai-harmony -- |
-
In the fine print:
https://cookbook.openai.com/articles/openai-harmony#receiving-tool-calls |
-
Hi! I figured I'd show my unusual Vulkan NVIDIA + AMD scenario for people who are on an RTX 5070 or any 12GB VRAM card and have something older lying around. I am using my old RX 6600 XT with no AI cores to run this, sitting in my secondary PCIe Gen3 x4 slot. Suffice to say I am pretty happy with how it's running - a lot better than offloading to CPU or trying to shuffle MoE layers between GPUs using the older regex-style command (~30-40 t/s on either of those options; Ryzen 5800X3D, B550, 32GB of DDR4-3600 tuned RAM). I stick to 32k context for the size/performance/tensor-split ratio. Yes, an 89/11 split crashes. Table below, and a screenshot from NVTOP while running 32k. Hope this helps someone :)
-
It seems the entire prompt gets reprocessed for gpt-oss when doing an inference. This slows the response from the second query onwards. This is not the behavior for LLMs such as Qwen3 - there, the response is quick because the KV cache is not reprocessed. What is the reason for the prompt reprocessing in gpt-oss? Is there a fix for it?
-
Increase the number of SWA KV cache checkpoints if you are getting cache misses, with --swa-checkpoints N. The default is 3; try a higher value. Alternatively, you can disable Sliding Window Attention (SWA) entirely through --swa-full.
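For example (the value is illustrative):

```bash
# Keep more SWA KV-cache checkpoints so earlier prompt prefixes can be reused
llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -fa --swa-checkpoints 8
```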
-
I have tried using --swa-full but it doesn't help much. I will try increasing --swa-checkpoints N. |
-
Maybe you are using an old version. This has been fixed shortly after the gpt-oss release, and I just double-checked that the prompt is not being reprocessed. So the problem is on your end. If it persists, open an issue with full logs and repro.
-
I am running on an Emerald Rapids Xeon CPU socket with the following command. The commit id of the code is c252ce6, which is a recent commit.

numactl -m 0 -C 0-59 llama-server --cpu-range 0-59 -m /cold_storage/ml_datasets/narendra/llama.cpp_models/gpt-oss/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4.gguf -t 60 --port 10000 --cont-batching --parallel 8 --ctx-size 65536 --numa numactl --reasoning-format none --jinja --mlock -fa auto --no-mmap --cache-reuse 256 --swa-checkpoints 10

This is the server log. As you can see, n_prompt_tokens goes from 86 to 1916 to 5877 across three queries. Here n_tokens == n_prompt_tokens, which is not the case for other models such as Qwen3.

srv update_slots: all slots are idle
-
I filed an issue on this - #15894 |
-
With flash attention enabled, I am able to fit the entire gpt-oss-20b model with a near-full context window size of 125k in my NVIDIA RTX 5060 Ti with 16GB of VRAM using:

llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 125000 --jinja -ub 2048 -b 2048 --n-cpu-moe 0 -fa on

Running an agentic refactoring with
and

And I can run with the full 131k context with just one MoE layer moved to the CPU:

llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 --n-cpu-moe 1 -fa on

Running that same (non-deterministic) agentic
So we do take a bit of a hit in speed going from 125k to 131k max context and putting one of the MoE layers on the CPU. You might as well just run with the 125k context and take the ~11% speedup from keeping the entire model on the GPU, in exchange for a 5% reduction in context window size. FYI: I wrote these little scripts to get this info out of the llama-server log output (which I send to a file):
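For reference, a rough one-liner along the same lines (not the author's scripts) could be:

```bash
# Pull the speed lines (prompt and generation eval) out of a saved llama-server log
grep "eval time" llama-server.log | grep -o "[0-9.]* tokens per second"
```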
-
Benchmarks on NVIDIA GeForce RTX 4060 Ti (16GB) for
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 261e6a2 (6471) |
-
Hi, thank you very much for your work! My setup:
gpt-oss:20b
❯ llama-bench -m gpt-oss-20b-GGUF -ngl 99 -fa 1 -b 2048,4096 -ub 2048,4096 -p 2048,8192,16384,32768 --split-mode none
❯ llama-batched-bench -m gpt-oss-20b-GGUF -c 132096 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1,2,4 --split-mode none
gpt-oss:120b
❯ llama-bench -m gpt-oss-120b-GGUF -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
|
-
Hi team, I'm mostly working with vLLM and TRT-LLM; trying out llama.cpp with 8 × H200, sharing my numbers:
gpt-oss:20b
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 72b24d9 (6602)
I didn't try out 120b, as the 20b performance is already bad - I would expect much higher tok/s on my system. Maybe I didn't use the correct configs for this benchmark (e.g. disabled tensor parallelism)?
-
You can enable pipeline parallelism by lowering the ubatch size - probably |
-
I am looking for a config for a system with 96G RAM and 8G VRAM GPU. |
-
./llama-bench -m gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -b 4096 -ub 4096 -p 2048,8192 --device cuda0
build: e6d65fb (6611) |
-
4070 @ 100W, 2x32GB DDR5-4800 ECC, 7800X3D, B650 TUF GAMING-PLUS and
4070 @ normal 200W:
|
-
From my testing, it seems like the Vulkan backend suffers much more from offloading compared to e.g. CUDA. Could you try to test it again with the CUDA backend? |
-
Unfortunately there's no Linux CUDA build. But if the difference is within 20% or so, idc and vulkan is fine (not a fan of downloading the huge CUDA build each time, that's why I switched to the 25MB vulkan build and |
-
I believe the difference is not 20%, but rather 2-3x (on prompt processing, tg looks fine). 4060 ti gets 3800 t/s pp. |
-
quick tests on h100: 1K tps 20gb, with full context length (consider this as a very rough estimate) docker run -d --name=gptoss20 |
-
I've got my Framework Desktop, and I've managed to build llama.cpp there, so I can provide some data.

Hardware specification

This is quite an unusual machine, as it's an APU with shared memory (architecture-wise, it's similar to Apple M-series APUs). My configuration rocks 128GB of DDR5 (V)RAM running @ 8000MT/s, with a theoretical throughput of 256GB/s (around 210-220GB/s in practice) - unfortunately it's soldered to the motherboard, but that's the price we have to pay for the performance we get from those modules. That memory can be fully used by both the CPU and the GPU (on Linux; on Windows you get up to 96GB of VRAM with 32GB of RAM, as you don't have GTT there).

ROCm (both the latest 6.x.x and 7.0.0rc1) is currently broken - it seems it can't use the memory on this APU correctly and reports completely wrong memory sizes, so in effect it cannot allocate more than 32GB of VRAM. While that's enough for stuff like embeddings and small models, it's not worth it at this point IMHO. Vulkan works great and uses GTT correctly, so it can give the GPU access to the whole memory, even beyond what's allocated in the BIOS, up to the limits configured via modprobe. In my case, it's ~120GB of memory configured with the following modprobe config:
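As an illustration only - the values below are assumptions (roughly 120GB expressed in 4KiB pages), not necessarily the author's exact settings:

```bash
# Raise the TTM/GTT limits so the iGPU can map ~120GB of system memory
echo "options ttm pages_limit=31457280 page_pool_size=31457280" | sudo tee /etc/modprobe.d/ttm.conf
```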
along with:
CPU: Ryzen AI MAX+ 395 (Strix Halo), 32 cores

gpt-oss-20b

This is the same model that I've tested on my RX 7900 XT here. Note that Vulkan reports no BF16 support, which may be the cause of the hindered performance on that version (IIRC this is MXFP4 mixed with BF16). I've checked other (u)batch sizes; 2048/512 is the optimal one.
gpt-oss-120b

My gpt-oss-120b GGUF is Unsloth's Q6_K_XL quant, which may be the cause of the better token generation performance compared to the BF16 20B model.
I've re-run the tests with -mmp 0 to check whether mmap affects performance on Vulkan, and I got slightly better results, so I recommend disabling it.
|
-
Setup is 2x AMD Instinct MI50 with 32GB each, rocm 6.3.4:
build: 3a002af (6698) |
-
With multiple GPUs, you can reduce the |
-
Is there a good place where I can read up about |
-
The original PR introducing pipeline parallelism and logical/physical batches should have some info: #6017
-
Hi everyone, I am testing GPT-OSS 20B on a:
Despite the warning, one sample from the llama-server logs:

eval time = 1016.55 ms / 79 tokens ( 12.87 ms per token, 77.71 tokens per second)

which is way more than I was expecting. I share here the benchmark run on the Q8_0 model:

$ build/bin/llama-bench -m ../MODELS/gpt-oss-20b/gpt-oss-20b-Q8_0.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
build: d2ee056 (6713) |
-
Note
This guide is a live document. Feedback and benchmark numbers are welcome - the guide will be updated accordingly.
Overview
This is a detailed guide for running the new gpt-oss models locally with the best performance using llama.cpp. The guide covers a very wide range of hardware configurations. The gpt-oss models are very lightweight, so you can run them efficiently even on surprisingly low-end configurations.
Obtaining `llama.cpp` binaries for your system
Make sure you are running the latest release of llama.cpp: https://github.com/ggml-org/llama.cpp/releases
Obtaining the `gpt-oss` model data (optional)
The commands used below in the guide will automatically download the model data and store it locally on your device. So this step is completely optional and provided for completeness.
The original models provided by OpenAI are here:
First, you need to manually convert them to GGUF format. For convenience, we host pre-converted models here in ggml-org.
Pre-converted GGUF models:
Tip
Running the commands below will automatically download the latest version of the model and store it locally on your device for later usage. A WebUI chat and an OAI-compatible API will become available on localhost.
Minimum requirements
Here are some hard memory requirements for the 2 models. These numbers could vary a little bit by adjusting the CLI arguments, but should give a good reference point.
Note
It is not necessary to fit the entire model in VRAM to get good performance. Offloading just the attention tensors and the KV cache in VRAM and keeping the rest of the model in the CPU RAM can provide decent performance as well. This is taken into account in the rest of the guide.
Relevant CLI arguments
Using the correct CLI arguments in your commands is crucial for getting the best performance for your hardware. Here is a summary of the important flags and their meaning:
- `-hf` : download the model (using curl) from the respective model repository
- `-c` : context size. The gpt-oss models have a maximum context of 128k tokens. Use -c 0 to set it to the model's default
- `-ub N -b N` : set the physical and logical batch sizes to N during processing. A larger size increases the size of the compute buffers, but can improve the performance in some cases
- `-fa` : enable Flash Attention
- `--n-cpu-moe N` : number of MoE layers N to keep on the CPU. This is used in hardware configs that cannot fit the models fully on the GPU. The specific value depends on your memory resources, and finding the optimal value requires some experimentation
- `--jinja` : tell llama.cpp to use the Jinja chat template embedded in the GGUF model file
Apple Silicon
Apple Silicon devices have unified memory that is seamlessly shared between the CPU and GPU. For optimal performance it is recommended to not exceed 70% of the total memory that your device has.
Tip
Install the latest llama.cpp package from Homebrew with:
brew install llama.cpp
Tip
To increase the amount of RAM available to the llama-server process, use the following command:
# on a 192GB machine, raise the limit from 154GB (default) to 180GB
sudo sysctl iogpu.wired_limit_mb=184320
✅ Devices with more than 96GB RAM
The M2 Max, M3 Max, M4 Max, M1 Ultra, M2 Ultra, M3 Ultra, etc. chips can run both models at full context:
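A representative invocation for such machines (flags assumed from the rest of this guide, not captured from the original text) is:

```bash
# Full offload with the model's default (full) context on a high-memory Apple Silicon machine
llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -fa
```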
🟢 Benchmarks on M3 Ultra (512GB, 80 GPU cores) for `gpt-oss-20b`
build: c8d0d14 (6310)
🟢 Benchmarks on M2 Ultra (192GB, 76 GPU cores) for `gpt-oss-20b`
build: 79c1160 (6123)
🟢 Benchmarks on M2 Ultra (192 GB, 76 GPU cores) for `gpt-oss-120b`
build: 79c1160 (6123)
✅ Devices with less than 96GB RAM
The small gpt-oss-20b model can run efficiently on Macs with at least 16GB RAM:
🟢 Benchmarks on M4 Max (36GB) for `gpt-oss-20b`
build: 79c1160 (6123)
🟢 Benchmarks on M1 Max (64GB) for `gpt-oss-20b`
build: 2e2b22b (6180)
🟢 Benchmarks on M1 Pro (32GB) for `gpt-oss-20b`
build: 79c1160 (6123)
✅ Devices with 16GB RAM
Macs don't allow the GPU to utilize the full 16GB of memory, so in this case you have to keep part of the layers on the CPU. Adjust --n-cpu-moe and -c as needed:
🟥 Devices with 8GB RAM
Unfortunately, you are out of luck. The gpt-oss models cannot run on Macs with that small amount of memory.
NVIDIA
✅ Devices with more than 64GB VRAM
With more than 64GB of VRAM, you can run both models by offloading everything (both the model and the KV cache) to the GPU(s).
🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
build: f08c4c0 (6199)
🟢 Benchmarks on RTX Pro 6000 (96GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
build: a6d3cfe (6205)
🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-120b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
build: f08c4c0 (6199)
🟢 Benchmarks on RTX Pro 6000 (96GB) for `gpt-oss-120b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
build: a6d3cfe (6205)
✅ Devices with less than 64GB VRAM
In this case, you can fit the small gpt-oss-20b model fully in VRAM for optimal performance.
🟢 Benchmarks on NVIDIA GeForce RTX 3090 (24GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, CUDA 12.4
build: a094f38 (6210)
🟢 Benchmarks on NVIDIA GeForce RTX 4090 (24GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, CUDA 12.6
build: a094f38 (6210)
🟢 Benchmarks on NVIDIA GeForce RTX 4080 SUPER (16GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
build: 009b709 (6316)
🟢 Benchmarks on NVIDIA GeForce RTX 5060 Ti (16GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
build: 9ef6b0b (6208)
🟢 Benchmarks on NVIDIA GeForce RTX 5070 Ti (16GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, CUDA 12.8
build: a094f38 (6210)
🟢 Benchmarks on NVIDIA GeForce RTX 5080 (16GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes, CUDA 12.8
build: a094f38 (6210)
🟢 Benchmarks on NVIDIA GeForce RTX 5090 (32GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, CUDA 12.8
build: a094f38 (6210)
The large model has to be partially kept on the CPU.
🟡 TODO: add commands for gpt-oss-120b
✅ Devices with 16GB VRAM
For example: NVIDIA V100
This config is just at the edge of fitting the full context of gpt-oss-20b in VRAM, so we have to restrict the maximum context down to 32k tokens.
🟢 Benchmarks on NVIDIA V100 (16GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
build: 228f724 (6129)
Running the large gpt-oss-120b model with 16GB of VRAM requires keeping some of the layers on the CPU, since it does not fit completely in VRAM:
✅ Devices with less than 16GB VRAM
For this config, it is recommended to tell llama.cpp to run the entire model on the GPU while keeping enough layers on the CPU. Here is a specific example with an RTX 2060 8GB machine:
Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too:
# gpt-oss-120b, 32k context, 35 layers on the CPU
llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 32768 --jinja -ub 2048 -b 2048 --n-cpu-moe 35
Tip
For more information about how to adjust the CPU layers, see the "Tips" section at the end of this guide.
AMD
Note
If you have AMD hardware, please provide feedback about running the
gpt-oss
models on it and the performance that you observe. See the sections above for what kind of commands to try and try to adjust respectively.With AMD devices, you can use either the ROCm or the Vulkan backends. Depending on your specific hardware, the results can vary.
✅ RX 7900 XT (20GB VRAM) using ROCm backend
🟢 Benchmarks for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 3007baf (6194)
More information: #15396 (comment)
✅ A few more low-end configurations
Tips
Determining the optimal number of layers to keep on the CPU
Good general advice for most MoE models would be to offload the entire model, and use
--n-cpu-moe
to keep as many MoE layers as necessary on the CPU. The minimum amount of VRAM to do this with the 120B model is about 8GB; below that you will need to start reducing the context size and the number of layers offloaded. You can get, for example, about 30 t/s at zero context on a 5090 with --n-cpu-moe 21.
Caveat: on Windows it is possible to allocate more VRAM than available, and the result will be slow swapping to RAM and very bad performance. Just because the model loads without errors, it doesn't mean you have enough VRAM for the settings that you are using. A good way to avoid this is to look at the "GPU Memory" in Task Manager and check that it does not exceed the GPU VRAM.
Example on 5090 (32GB):

good, --n-cpu-moe 21, GPU Memory < 32:

--n-cpu-moe 20
, GPU Memory > 32:Using `gpt-oss` + `llama.cpp` with coding agents (such as Claude Code, crush, etc.)
Setup the coding agent of your choice to look for a localhost OAI endpoint (see Tutorial: Offline Agentic coding with llama-server #14758)
Start
llama-server
like this:Sample usage with
crush
: guide : running gpt-oss with llama.cpp #15396 (reply in thread)Some agents such as Cline and Roo Code do not support native tool calls. A workaround is to use a custom grammar: guide : running gpt-oss with llama.cpp #15396 (comment)
Configure the default sampling and reasoning settings
When starting a
llama-server
command, you can change the default sampling and reasoning settings like so:Note that these are just the default settings and they could be overridden by the client connecting to the
llama-server
.Frequently asked questions
Q: Which quants to use?
Always use the original MXFP4 model files. The
gpt-oss
models are natively "quantized". I.e. they are trained in the MXFP4 format which is roughly equivalent toggml
'sQ4_0
. The main difference withQ4_0
is that the MXFP4 models get to keep their full quality. This means that no quantization in the usual sense is necessary.Q: What sampling parameters to use?
OpenAI recommends:
temperature=1.0 and top_p=1.0
.Do not use repetition penalties! Some clients tend to enable repetition penalties by default - make sure to disable those.
Q: Should I set a chat template file manually?
No. The
ggml-org/gpt-oss
models have a built-in chat template that is used by default. The only reasons to ever want to change the chat template manually are:Known issues
Some rough edges in the implementation are still being polished. Here is a list of issue to keep track of:
gpt-oss-120b
when using Vulkan #15274Beta Was this translation helpful? Give feedback.
All reactions