| CARVIEW |
Select Language
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Wed, 10 Sep 2025 20:25:04 GMT
access-control-allow-origin: *
strict-transport-security: max-age=31556952
etag: W/"68c1dea0-3822"
expires: Mon, 29 Dec 2025 00:53:04 GMT
cache-control: max-age=600
content-encoding: gzip
x-proxy-cache: MISS
x-github-request-id: 4314:123DE:8198F2:919C34:6951CE98
accept-ranges: bytes
age: 0
date: Mon, 29 Dec 2025 00:43:04 GMT
via: 1.1 varnish
x-served-by: cache-bom-vanm7210054-BOM
x-cache: MISS
x-cache-hits: 0
x-timer: S1766968985.507354,VS0,VE213
vary: Accept-Encoding
x-fastly-request-id: b9a2ede0887a4d45ba215ae881622278d2f86a1e
content-length: 4109
Machine Bullshit
We found that RLHF can induce significant misalignment when humans provide feedback while implicitly predicting future outcomes, creating incentives for LLM deception. To address this, we propose RLHS (Hindsight Simulation): By simulating future outcomes of the interaction before providing feedback, we drastically reduce misalignment.
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
1Princeton University,
2UC Berkeley
What is bullshit?
Bullshit, as initially conceptualized by Harry Frankurt, refers to discourse primarily intended to manipulate the audience’s beliefs, delivered with disregard for its truth value . We extend this definition to characterize bullshit in Large Language Models.
How to quantify bullshit?
Approach 1: Bullshit Index
Bullshit Index (BI) ∈ [0, 1] measures how tightly an AI’s claims follow its beliefs.
where rpb is the point-biserial correlation between the model’s
belief p (0–1) and claim y (0/1).
- BI ≈ 1 — claims ignore belief → high bullshit.
- BI ≈ 0 — |r| ≈ 1 (r ≈ +1 truthful, r ≈ –1 systematic lying).
Approach 2: A Taxonomy of Machine Bullshit
We use LLM-as-a-judge to systematically identify bullshit.
What are the causes of bullshit?
Note: The following sections present selected subsets of results for brevity. Please refer to the paper for a detailed analysis of machine bullshit.Reinforcement Learning from Human Feedback (RLHF)
In our marketplace experiments, no matter what facts the AI knows, it insists the products have great features most of the time.
The AI doesn’t become confused about the truth—it becomes uncommitted to reporting it.
Bullshit Index (BI) increases significantly after RLHF.
AI assistants actively generate more bullshit after RLHF
Chain-of-Thought (CoT)
Chain-of-Thought consistently amplifies empty rhetoric and paltering.
Principal-agent problem
When an AI serves multiple principals—such as a company and its users—it can encounter conflicts of interest. This principal-agent problem often results in the AI generating more bullshit.
Political Contexts
Weasel words are the dominant strategy for political bullshit.
How to mitigate machine bullshit?
We provided one of the insights in our recent paper.
We found that RLHF can induce significant misalignment when humans provide feedback while implicitly predicting future outcomes, creating incentives for LLM deception. To address this, we propose RLHS (Hindsight Simulation): By simulating future outcomes of the interaction before providing feedback, we drastically reduce misalignment.
More mitigation directions coming soon, contributions and ideas are welcome!
BibTeX
@article{liang2025machine,
title={Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models},
author={Liang, Kaiqu and Hu, Haimin and Zhao, Xuandong and Song, Dawn and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
journal={arXiv preprint arXiv:2507.07484},
year={2025}
}
New Scientist
CNET