Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Abstract
Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current Large Language Models (LLMs), however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning.
Soft Thinking Pipeline
Soft Thinking replaces discrete tokens with abstract concept tokens, enabling reasoning in continuous concept space.
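To make the core operation concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how one Soft Thinking step could be implemented with a Hugging Face-style causal LM: the next-token logits are turned into a probability distribution, a concept token is formed as the probability-weighted mixture of the input embedding table, and that mixture is appended to the sequence as the next input embedding. The function name `soft_thinking_step` and the `temperature` argument are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def soft_thinking_step(model, inputs_embeds, attention_mask, temperature=1.0):
    """One illustrative Soft Thinking step (a sketch, not the official code).

    Instead of sampling a single discrete token, a "concept token" is formed:
    the probability-weighted mixture of all token embeddings under the
    model's next-token distribution.

    Assumes `model` is a Hugging Face-style causal LM that accepts
    `inputs_embeds` and exposes `get_input_embeddings()`.
    """
    out = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
    logits = out.logits[:, -1, :]                      # (batch, vocab)
    probs = F.softmax(logits / temperature, dim=-1)    # next-token distribution

    # Probability-weighted mixture of the embedding table:
    # concept_embed = sum_v p(v) * E[v]
    embed_table = model.get_input_embeddings().weight  # (vocab, hidden)
    concept_embed = probs @ embed_table                # (batch, hidden)

    # Append the concept token embedding so the next step can attend to it.
    new_embeds = torch.cat([inputs_embeds, concept_embed.unsqueeze(1)], dim=1)
    new_mask = torch.cat(
        [attention_mask, attention_mask.new_ones(attention_mask.shape[0], 1)],
        dim=1,
    )
    return new_embeds, new_mask, probs
```

A full pipeline would iterate this step through the reasoning phase and then switch back to ordinary discrete decoding for the final answer; stopping criteria and other numerical details are beyond this sketch.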
An example of Soft Thinking and CoT
A comparison between standard CoT and Soft Thinking on a multiplication problem. For readability and interpretability, we show the highest-probability token at each Soft Thinking step; the full distribution is visualized in the heatmap. Red text denotes repetitive, unhelpful words.
Probability Distribution of Soft Thinking at Each Step
An example illustrating the probability distribution of our proposed Soft Thinking method. At each step, top-k token candidates and their probabilities are shown. Red boxes indicate the selected tokens that form the final generated sequence for readability and interpretability.
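As a companion to this figure, the hypothetical helper below shows how stored per-step distributions (e.g., the `probs` returned by the sketch above) could be turned into the top-k candidate lists shown in the visualization, with the highest-probability token treated as the "selected" token for display. `step_probs` and `tokenizer` are assumed inputs, not part of the original release.

```python
def summarize_steps(step_probs, tokenizer, k=5):
    """Turn per-step concept-token distributions into readable top-k summaries.

    `step_probs` is a list of 1-D probability tensors (one per reasoning step);
    `tokenizer` is a Hugging Face-style tokenizer used to decode token ids.
    """
    lines = []
    for t, probs in enumerate(step_probs):
        top_p, top_ids = probs.topk(k)
        candidates = [
            f"{tokenizer.decode([i])!r}: {p:.2f}"
            for p, i in zip(top_p.tolist(), top_ids.tolist())
        ]
        # The highest-probability candidate is what the figure marks as "selected".
        lines.append(f"step {t}: " + ", ".join(candidates))
    return "\n".join(lines)
```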
Accuracy and Generation Length on Mathematical Datasets
| Method | Accuracy ↑ | | | | | Generation Length ↓ | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | MATH 500 | AIME 2024 | GSM8K | GPQA Diamond | Avg. | MATH 500 | AIME 2024 | GSM8K | GPQA Diamond | Avg. |
| QwQ-32B [1] | ||||||||||
| CoT Thinking | 97.66 | 76.88 | 96.67 | 64.17 | 83.84 | 4156 | 12080 | 1556 | 8095 | 6472 |
| CoT Thinking (Greedy) | 97.00 | 80.00 | 96.57 | 65.15 | 84.68 | 3827 | 11086 | 1536 | 7417 | 5967 |
| Soft Thinking | 98.00 | 83.33 | 96.81 | 67.17 | 86.32 | 3644 | 10627 | 1391 | 7213 | 5719 |
| DeepSeek-R1-Distill-Qwen-32B [2] | ||||||||||
| CoT Thinking | 94.50 | 72.08 | 95.61 | 63.10 | 81.32 | 3543 | 9347 | 875 | 6218 | 4995 |
| CoT Thinking (Greedy) | 93.00 | 63.33 | 95.30 | 59.09 | 77.68 | 3651 | 8050 | 1048 | 8395 | 5286 |
| Soft Thinking | 95.00 | 76.66 | 95.83 | 64.64 | 83.03 | 3373 | 6620 | 785 | 4722 | 3875 |
| DeepSeek-R1-Distill-Llama-70B [3] | ||||||||||
| CoT Thinking | 94.70 | 70.40 | 94.82 | 65.34 | 81.31 | 3141 | 8684 | 620 | 5500 | 4486 |
| CoT Thinking (Greedy) | 94.61 | 73.33 | 93.60 | 66.16 | 81.92 | 2877 | 9457 | 606 | 4443 | 4345 |
| Soft Thinking | 94.80 | 73.33 | 94.90 | 66.66 | 82.42 | 3021 | 6644 | 597 | 4470 | 3683 |
Table 1: Comparison of Soft Thinking and various baseline methods on accuracy and generation length of correct answers across mathematical datasets. Best results are highlighted in bold.
| Method | Accuracy ↑ | | | | Generation Length ↓ | | | |
|---|---|---|---|---|---|---|---|---|
| | HumanEval | MBPP | LiveCodeBench | Avg. | HumanEval | MBPP | LiveCodeBench | Avg. |
| QwQ-32B [1] | ||||||||
| CoT Thinking | 97.63 | 97.49 | 62.00 | 85.70 | 2557 | 2154 | 9986 | 4899 |
| CoT Thinking (Greedy) | 95.73 | 96.50 | 57.35 | 83.19 | 2396 | 2069 | 7034 | 3833 |
| Soft Thinking | 98.17 | 97.66 | 62.72 | 86.18 | 2638 | 2157 | 7535 | 4110 |
| DeepSeek-R1-Distill-Qwen-32B [2] | ||||||||
| CoT Thinking | 97.25 | 95.13 | 57.33 | 83.23 | 3095 | 2761 | 8376 | 4744 |
| CoT Thinking (Greedy) | 87.19 | 87.54 | 43.36 | 72.70 | 2294 | 1703 | 4702 | 2900 |
| Soft Thinking | 97.56 | 95.33 | 59.50 | 84.13 | 2713 | 2534 | 6255 | 3834 |
| DeepSeek-R1-Distill-Llama-70B [3] | ||||||||
| CoT Thinking | 97.71 | 94.77 | 56.94 | 83.14 | 2711 | 2386 | 8319 | 4472 |
| CoT Thinking (Greedy) | 92.07 | 91.82 | 48.02 | 77.30 | 2192 | 1979 | 5438 | 3203 |
| Soft Thinking | 98.17 | 94.94 | 58.42 | 83.84 | 2498 | 2214 | 6512 | 3741 |
Table 2: Comparison of Soft Thinking and various baseline methods on accuracy and generation length of correct answers across three coding datasets. Best results are highlighted in bold.
BibTeX
@misc{zhang2025softthinkingunlockingreasoning,
title={Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space},
author={Zhen Zhang and Xuehai He and Weixiang Yan and Ao Shen and Chenyang Zhao and Shuohang Wang and Yelong Shen and Xin Eric Wang},
year={2025},
eprint={2505.15778},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.15778},
}