Exporters From Japan
Wholesale exporters from Japan   Company Established 1983
CARVIEW
Select Language

Table of Contents

  1. Language Models Offer Mundane Utility. AI suggests doing the minimum.
  2. Language Models Don’t Offer Mundane Utility. Gemini 3 doesn’t believe in itself.
  3. Huh, Upgrades. ChatGPT gets some personality knobs to turn.
  4. On Your Marks. PostTrainBench shows AIs below human baseline but improving.
  5. Claude Opus 4.5 Joins The METR Graph. Expectations were exceeded.
  6. Sufficiently Advanced Intelligence. You’re good enough, you’re smart enough.
  7. Deepfaketown and Botpocalypse Soon. Don’t worry, the UK PM’s got this.
  8. Fun With Media Generation. Slop as cost shock, enabling of niche pursuits.
  9. You Drive Me Crazy. Anthropic’s plans to handle mental health issues.
  10. They Took Our Jobs. What does it take to break a guild cartel?
  11. The Art of the Jailbreak. It still always works but it takes somewhat longer.
  12. Get Involved. MATS Summer 2026 cohort applications are open.
  13. Introducing. GPT-5.2-Codex is here to tide us over until the new year.
  14. In Other AI News. Small models can introspect, so can Andrej Karpathy.
  15. Show Me the Money. Anthropic going public, Project Vend breaks new ground.
  16. Quiet Speculations. Predictions for next year, new higher bars for what is AGI.
  17. Whistling In The Dark. It is still so early, almost no one knows Anthropic exists.
  18. Bubble, Bubble, Toil and Trouble. So many still don’t realize AI works.
  19. Americans Really Dislike AI. Attempts continue to mislead us about this.
  20. The Quest for Sane Regulations. NY’s RAISE Act is signed.
  21. Chip City. Chip smuggling, eventual chip production.
  22. The Week in Audio. Anthropic’s Sholto Douglas makes 2026 predictions.
  23. Rhetorical Innovation. Understanding AI teaches you to think about the world.
  24. Aligning a Smarter Than Human Intelligence is Difficult. The meta game.
  25. Mom, Owain Evans Is Turning The Models Evil Again. Train the interpreter.
  26. Messages From Janusworld. Claude Opus 3 zero hour approaches.
  27. The Lighter Side. What are you even doing?

Language Models Offer Mundane Utility

AI custom-designed, human-in-the-loop proactive LLM-based mental health intervention has a positive effect in an RCT. There was significantly greater positive affect, resilience and social well-being. My presumption is that this was a highly conservative design due to ethical considerations. And that was using a system based on GPT-4o for 5-20 minutes a week. There is so much room for improvement here.

A lot of the benefits here likely came from implementation of low-hanging fruit interventions we know work, like having the system suggest journaling, gratitude exercises, mindfulness and social connection. We all know that stuff works. If an LLM-based scaffold actually gets people to do some of it? Great, that’s a huge win.

Results like this will not, as David Manheim suggests, prevent people from saying ‘but sometimes there are still bad outcomes’ or ‘but sometimes this ends up doing net harm,’ since nothing capable of working would prevent those risks entirely.

You can have Claude Code make objects in Unreal Engine on demand.

Seth Lazar on how he uses AI agents for philosophy. They automate everything around the thinking so Seth can focus on the thinking. He favors Opus 4.5 in Cursor.

Language Models Don’t Offer Mundane Utility

Dean Ball: by far the biggest challenge in agentic coding use is getting gemini 3 to recognize that gemini 3 exists

Simeon: This is unbelievable. Even when I explicitly tell it the right API name to call for Gemini 3 pro it would go with 1.5.

I had to really be pushy for it to do it.

AI still struggles with design, largely because they lack the context. You still have to figure out what to do or what problem to solve, on a sufficiently high level.

 

 

Huh, Upgrades

ChatGPT adds personalization characteristics. I’m going with ‘less’ on all four.

You can upload your NotebookLM notebooks directly into the Gemini app.

On Your Marks

Maxime Labonne: You always think you’re safe until your job becomes a benchmark.

Maksym Andriushchenko: We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs.

We expect this to be an important indicator for AI R&D automation as it unfolds over the next few years.

How worried should you be that they’re getting a substantial percentage of the way to the human threshold here?

METR notices some grading issues and makes some minor corrections to its graph, in particular impacting Claude 3.7 Sonnet.

Whenever you see a graph like this, remember to attach ‘in benchmarks’ and then for your brain to, like mine, automatically translate that to ‘IN MICE!’

Epoch AI: We benchmarked several open-weight Chinese models on FrontierMath. Their top scores on Tiers 1-3 lag the overall frontier by about seven months.

Havard Ihle: Consistent with my WeirdML results for open/closed model gap.

One could then argue both ways who benefits from the benchmarks versus real world applications or underlying general intelligence. Versus real world applications it seems clear the benchmarks understate the gap. Versus underlying intelligence it is less obvious and it depends on who is going after the benchmarks in question more aggerssively.

Claude Opus 4.5 Joins The METR Graph

Claude Opus 4.5 achieved a 50% time horizon of about 4 hours 49 minutes, and METR needs more long tasks to be able to set the upper bound properly.

METR: We don’t think the high upper CI bound reflects Opus’s actual capabilities: our current task suite doesn’t have enough long tasks to confidently upper bound Opus 4.5’s 50%-time horizon. We are working on updating our task suite, and hope to share more details soon.

Based on our experience interacting with Opus 4.5, the model’s performance on specific tasks (including some not in our time horizon suite), and its benchmark performance, we would be surprised if further investigation showed Opus had a 20+ hour 50%-time horizon.

Despite its high 50%-time horizon, Opus 4.5’s 80%-time horizon is only 27 minutes, similar to past models and below GPT-5.1-Codex-Max’s 32 mins. The gap between its 50%- and 80%- horizons reflects a flatter logistic success curve, as Opus differentially succeeds on longer tasks.

Here’s the full graph now (we’re still waiting on GPT-5.2, GPT-5.2 Codex and Gemini 3 Pro), both the log version and the linear version.

 

Daniel Eth: A few thoughts on Claude Opus 4.5:

First off, in absolute terms, this is a pretty big step up. Anthropic is showing they have juice, and things are going faster than previously expected. At the very least, this should dispel all recent talk about how AI was entering a slowdown

Second, on a log plot, note this is hardly above trend. Sure, it *could* represent a new trend, but it seems like every time there’s a model release that overperforms people think timelines get super short, & every time a model underperforms they think timelines get super long…

Den Ball: as folks internalize this graph and continue the debate about what it may or may not mean, I would just remind you of one simple fact:

the us has barely scaled up compute compared to what will come online in 2026 (multiple 1GW+ facilities).

Seán Ó hÉigeartaigh: Yes, this. We’ve seen some of the biggest infrastructure investments in history over the last year, and they will soon become available to the frontier AI research effort. You’d want to be very confident to bet on slowdowns in progress despite this happening.

Simeon: We’re in the 4-months doubling world, aren’t we?

Davidad: 🎯

For those not keeping score, I called this new slope in 2025Q1, and quantitatively determined there was 10:1 evidence in favour of it in 2025Q3.

David Shor: The biggest divide on AI timelines I’ve seen is between people who use vibecoding tools like Claude Code and people who don’t.

ChatGPT isn’t really *that* different than it was a year ago, but capabilities on agentic tools are getting literally exponentially better every month

Davidad: It’s not really superexponential, it’s piecewise-exponential. the exponential changed at an inflection-point event, when AIs closed the RSI loop on data. there will be more inflection points when RSI loops are closed on algorithms, hardware, manufacturing, and construction

second, the duration axis is in units of *human time* to complete the same tasks – nothing to do with the wall-clock duration for the AI runs.

Lisan al Gaib: betting markets completely underestimated Claude 4.5 Opus

Yo Shavit (OpenAI): I think it’s more plausible, maybe 50:50 that this pace continues for at least 12 more months?

Davidad: yeah, I would guess that by December 2026 the RSI loop on algorithms will probably be closed, resulting in another inflection point to an even faster pace, perhaps around 70-80 day doubling time.

The end point of such a graph is not ‘AI can do literally any task,’ or any cognitive task it is ‘AI can do any coding task humans can do.’ Even an infinite time horizon here only goes so far. That could be importantly distinct from the ability to do other categories of task, both that humans can and cannot do.

The reason this is so scary regardless is that if you automate AI research via such methods, your failure to have automated other things goes away rather quickly.

Stephen McAleer (Anthropic): I’ve shifted my research to focus on automated alignment research. We will have automated AI research very soon and it’s important that alignment can keep up during the intelligence explosion.

Automated alignment research is all we seem to have the time to do, so everyone is lining up to do the second most foolish possible thing and ask the AI to do their alignment homework, with the only more foolish thing being not to do your homework at all. Dignity levels continue to hit all-time lows.

If you must tell the AI to do your alignment homework, then that means having sufficiently deeply aligned current and near term future models becomes of the utmost importance. The good news is that we seem to be doing relatively well there versus expectations, and hopefully we can find self-reinforcing aligned basins at around current capability levels? But man this is not what Plan A should look like.

Similarly to METR’s graph, Epoch’s capabilities index has also accelerated since 2024:

Benjamin Todd: ​It’s not only the METR horizon trend that accelerated in 2024. A composite of all major benchmarks did:

Rohin Shah: Both METR and ECI mostly measure things that companies optimize for. 2024 saw the rise of reasoning training for frontier models, which optimizes narrowly for some tasks (whereas pretraining provides more general improvements).

So I wouldn’t read much into any acceleration.

To the extent that this acceleration represents the things that cause further acceleration, I would read into it. Otherwise, I’d agree with Rohin.

Sufficiently Advanced Intelligence

Many people try to pretend that there is some limit to how intelligent a mind can be, and that this limit is close to the level of humans. Or, alternatively, that there is very little that a human or AI could gain from being far more intelligent than a typical smart human. Or that the only or central way to get much more intelligence is from collective intelligence, as in social or cultural or institutional intelligence.

I sometimes call this Intelligence Denialism. It is Obvious Nonsense.

Von Neumann, among other minds past and future, would like a word.

There is, however, a version of this that is true.

In any given finite role or task, there can exist Sufficiently Advanced Intelligence.

If you were smarter you might choose to do something else instead. But given what you or your AI are tasked with doing, you or your AI can be sufficiently advanced – your output is indistinguishable, or no worse than, the perfect output, aka magic.

Claude Code with Opus 4.5 is now approaching this for many coding tasks.

LordKingDude (via Deedy): I’m a technical software engineer working in C++.
I’ve been working with Opus 4.5 to write JIT compiler code and assembly, and so far it’s never failed (although I do give assistance as needed).

In real terms, this class of problems are the most difficult tasks that I can possibly give to any LLM. It would be cool with me if Opus 5 was just cheaper and faster, or had a 500k context window. I don’t have a pressing need for it to be smarter than it already is.

Deedy: This is just one engineer’s opinion: models still have headroom to be smarter. Opus 4.5 seems to have made a step function jump to better than 70-80% of SWEs.

If we truly don’t need smarter models to do software, Anthropic’s moat is perhaps the least of anyone’s concern!

My guess is this is centrally a lack of imagination and ambition issue?

As in, the job is currently to code and do things humans could previously code and do, with everything built around that restriction, and now LKD is good enough to do that the same way a baker is sufficiently intelligent to make great bread, but also the same way that a vastly more intelligent baker could be baking other new and exciting things.

 

 

Deepfaketown and Botpocalypse Soon

Good luck, sir?

Keir Starmer (Prime Minister, UK): We are going to aim to make it impossible for children to take, share or view a nude image, and we’re banning apps that create deepfakes.

Here’s the detail.

The post with those ‘details’ is a political speech attempting to feel the pain and promising to ‘half violence against women and girls.’

There is something about the way Keir’s linked post is written that makes him seem unusually disingenuous, even for a top level politician, an embodiment of a form of political slop signifying nothing, signifying the signifying of nothing, and implemented badly. That would be true even without the obvious rank hypocrisies of talking about the topics given his inaction elsewhere on exactly the issues he claims to care about so deeply.

The ‘detail’ on the first goal is ‘partner with tech companies.’ That’s it.

The ‘detail’ on the second goal is none whatsoever. Effectively banning nudification tools, as opposed to making them annoying to access, is impossible without a dystopian surveillance state, including banning all open image generation models.

Kunley Drukpa reports hearing AI music in public a lot in Latin America, and anticipates this is due to people who don’t know much music and primarily speak Spanish looking for things on YouTube to play ‘some music.’ This is very much a case of ‘they just didn’t care’ and it seems no one is going to tell them. Shudder.

Levels of Friction are ready to strike again, lowering barriers to various forms of communication and invalidating proofs of work. We’ll need to up our game again.

Séb Krier: When emails were invented, the barriers to sending random people mail went down massively. To deal with the influx, we had to develop both norms (what’s acceptable to send to who) and technologies (spam filtering, aliases). This is the case with other technologies too, like the printing press: suddenly anyone can publish, and so over time society came up with libel laws, editorial gatekeeping, citation norms etc. It’s inevitable that as costs go down, some degree of misuse follows, and society gradually adapts.

The same will apply with AI in all sorts of domains, including science: anyone can now write a plausible looking but hollow paper, and there will be plenty of academislop. We’re going through a kind of Sokal Experiment at scale.

In a way, this feels almost necessary to push our slow moving, status quo loving institutions to start developing better verification mechanisms, mandatory preregistration, code sharing, replication requirements, interactive/living papers etc. Imo getting this right should be a priority for the Progress/metascience community this coming year!

I agree that the situation was already broken, so a forcing function could be good.

Fun With Media Generation

Jason Crawford writes In Defense of Slop. When creation costs fall, as with AI, average quality necessarily falls, but everyone benefits. You get more experimentation, less gatekeepers, more chances to startout, more runway, more niche content, more content diversity, less dependence on finances.

If we model this as purely a cost shock, with each person’s costs declining but output unchanging, with each person having a unique random cost [C] and quality [Q], this is indeed by default good. The catch is that this makes identification of quality content harder, and coordination on common culture harder. If search costs [S] are sufficiently high, and matching benefits too low, or benefits to coordinated consumption too high, in some combination, consumer surplus could decline.

Saying this was net negative would still be an extraordinary claim requiring surprising evidence, since by default costs falling and production rising is good, at least on the margin, but the attention economy creates a problem. Consumption or evaluation of a low quality good is a net loss, so the social benefit of creation of sufficiently low quality goods is negative, it imposes costs, but due to the attention economy you can still derive benefit from that. I don’t think this overcomes our baseline, but it can happen.

The actual problem is that AI, when used in slop mode to create slop content, plausibly lowers costs relatively more for lower quality content, and also often lowers quality of content. Now it’s easy to see how we could end up with a net loss when combined with an attention economy.

Seb Krier cites Cowen and Tabarrok (2000) on how lowering costs allows a shift to avant-garde and niche pursuits, whereas high costs push towards popular culture and products that have higher returns, and expects AI will allow a proliferation of both styles but for the styles to diverge.

Seb Krier (May 2025): Easily usable Al creation tools will continue to lower production barriers, leading to a deluge of content and amplifying the same dynamic we’ve seen with DAWs and mobile photography. This democratization will swell the ‘average’ to pervasive mediocrity – slop is pop/soundcloud rap. Elites will get upset because maintaining cultural dominance will be harder.

To find novelty, interesting art and distinction, the cool stuff will increasingly live in new walled gardens and at the edges, fueling many more hyper-niche subcultures. And this is great – culture diggers will have so much more to explore!

This is good for those who are willing and able to devote much effort to all this. It is less good for those who are unwilling or unable. A lot will come down to whether AI and other automated systems allow for discovery of quality content while avoiding slop, and we will make such methods available in ways such people can use, or whether the ‘content takers’ will drown.

The new question in image generation is Gemini Nana Banana Pro versus ChatGPT Image 1.5. I’ve been putting all my requests, mostly for article banners, into both. Quality is similarly high, so for now it comes down to style. Gemini has been winning but it’s been close. ChatGPT seems to lean into the concept more?

Flowers: ref img as a super villain, matte black spandex, above manhattan, my logo on my chest is a pink cherryblossom, long braided ponytail

image 1: nb pro
image 2: chatgpt

ok yeah idk sometimes nb pro tries too hard to be realistic and chatgpt just gets the vision instantly. hmmmmm

I keep forgetting about MidJourney but they also exist, with their edge being in creating tools for guidance, curation and variation. That’s not what I’m looking for when I create AI images, but it will be for many others.

 

You Drive Me Crazy

Anthropic outlines the measures it has taken to help Claude be better at providing emotional support, handle conversations about suicide and self-harm and reduce sycophancy.

They use both targeted fine-tuning and also the system prompt. There is a banner that can appear on Claude.ai, pointing users to where they can get human crisis support via ThoroughLine, and they are working with the International Association for Suicide Prevention (IASP) for further guidance going forward.

In their evaluation, they see the 4.5 models responding appropriately in multi-turn suicide conversations about 80% of the time, versus about 55% for Opus 4.1. They also stress-tested with prefilled real conversations with older Claude members, a harder test, and found Opus 4.5 responded appropriately 73% of the time, versus 70% for Sonnet 4.5, compared to 36% for Opus 4.1.

We don’t know what they classify as appropriate, nor do we know how high the standard is before a response is considered good enough, or how they would evaluate other models as doing, so it’s hard to judge if these are good results. Suicidality is one place where there are a lot of demands for particular response patterns, including for defensive reasons, often when a different response would have been better.

I think this post places too much emphasis here on the training that specifically intervened on behaviors in situations involving suicide and self-harm, and too little emphasis on generally training Claude to be the type of entity that would handle a broad range of situations well.

Antidelusionist suggests that the target behavior should be for the AI to continue to engage, spend more resources, think deeply about the full context of the situation, be honest and treat the user like an adult. Alas, as mental health professionals know, those are not the ways to cover one’s legal and PR liabilities or avoid blame. The ‘ethicists’ and our legal system, and the risk of headlines, push exactly in the opposite direction. I’d prefer to live in a world where the AIs get messy here. Seems hard.

The second half of Anthropic’s post deals with sycophancy, where Opus 4.1 had a real problem, whereas Opus 4.5 is not perfect but it does well.

I continue to be suspicious that Petri scores Gemini 3 Pro this highly. The other evaluations make sense.

One problem they noticed is that if you ‘prefill’ conversations to show Claude already being sycophantic, Opus 4.5 will usually be unable to course correct. The best defense, if you want the models to be straight with you (with any LLM) is to avoid the problem from the start. If you’re worried about this, start a fresh conversation.

They Took Our Jobs

If AI can be a better lawyer or doctor, does that take their jobs and break the guild monopolies, or does that only make the guilds double down?

Alex Prompter: This Spectator piece reads like gossip until you realize it’s actually a warning.

A senior English barrister takes a real appeal he spent a day and a half writing, feeds it to an AI model, and gets back something better in 30 seconds. It matched the standard of the very best barristers, and it did it instantly, for pennies.

That’s the moment the illusion breaks.

Law has always sold itself as irreplaceable because it’s complex, nuanced, and human. But most of the value in modern legal work isn’t wisdom. It’s pattern recognition, structure, precedent matching, argument assembly, and risk framing. That’s exactly the territory AI eats first.

David Chapman: Doctoring and lawyering are guilds that exist to extract $$ & status for members, at the expense of everyone else. They get away with outrageous prices and sloppy, harmful outcomes by obfuscating their supposed expertise. LLMs may soon end that, but somehow someone needs to quality-check that the LLMs are doing an actually better job, and continue to over decades. And there needs to be a democratic process for overruling them.

How shall we ensure that?

Well, what is the quality check now? What is the democratic overruling process now?

Double standards abound.

Meanwhile the pricing logic collapses. If the LLM can create an on-average superior brief in 30 seconds to what a lawyer can do in a day, outside of situations with principal-agent problems or insanely high stakes a plan to charge $10k is cooked.

Excel is not so smart after all.

Astrid Wilde: am i living on another planet or does all knowledge work in the professions just get wrecked within the next 18 months.

Basil: I’ll worry about AI automating all the jobs when excel automates excel jobs.

The answer (of course) is both that Claude for Excel is now live, and also that Excel is a normal technology so yes Excel automated what became excel jobs to a large extent but that happened slowly and then this increased productivity caused us to do vastly more excel-style tasks as well as other tasks, which Excel could not then automate. If most knowledge work was automated or seriously accelerated within 18 months, that would be a very different scenario, and if that then kept going, watch out.

How long will humans remain in the coding loop, at this rate?

Nabeel Qureshi: It’s dizzying to consider that in a mere *1 year* we went from o1-preview to Opus4.5/Claude Code, Gemini3, Codex etc.

The “centaur chess” phase for computer-based work is fun and exhilarating, but at this rate of progress it’s not even clear it lasts through all of 2026.

I presume this period lasts more than another year, but the balance is shifting rapidly.

The Art of the Jailbreak

You can still universally jailbreak any model but now there are some that you can’t predictably universally jailbreak in 10 minutes.

Get Involved

MATS Summer 2026 cohort applications are open, it runs June-August in-person in Berkeley or London, $15k stipend, $12k compute budget. Apply here.

Introducing

GPT-5.2-Codex.

One could be forgiven for thinking GPT-5.2 straight up was GPT-5.2-Codex. It turns out no, there is another level of codexmaxxing.

Sam Altman: GPT-5.2-Codex launches today.

It is trained specifically for agentic coding and terminal use, and people at OpenAI have been having great success with it.

OpenAI: Today we’re releasing GPT‑5.2-Codex, the most advanced agentic coding model yet for complex, real-world software engineering. GPT‑5.2-Codex is a version of GPT‑5.2⁠ further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.

It’s hard to expect gigantic leaps in performance or benchmarks when models are released every week. GPT-5.2-Codex is only 0.8% better than 5.2 at SWE-Bench Pro and 1.8% better at Terminal-Bench 2.0, and those are the ones they highlighted, along with a modest improvement in professional capture-the-flag challenges.

Google gives us Gemma Scope 2, a new open suite of tools for LLM interpretability.

Bloom, Anthropic’s newly open sourced tool for automated behavioral evaluations. This is on top of the previously released Petri.

Anthropic: Bloom is a complementary evaluation tool. Bloom generates targeted evaluation suites for arbitrary behavioral traits. Unlike Petri—which takes user-specified scenarios and scores many behavioral dimensions to flag concerning instances—Bloom takes a single behavior and automatically generates many scenarios to quantify how often it occurs. We built Bloom to allow researchers to quickly measure the model properties they’re interested in, without needing to spend time on evaluation pipeline engineering.

Bloom generates evaluations in four stages:

  1. Understanding: The first Bloom “agent” analyzes the researcher’s behavior description and example transcripts to generate detailed context about what to measure and why.
  2. Ideation: The ideation agent generates evaluation scenarios designed to elicit the target behavior. Each scenario specifies the situation, simulated user, system prompt, and interaction environment.
  3. Rollout: These scenarios are rolled out in parallel, with an agent dynamically simulating both the user’s and the tool responses to elicit the sought-after behavior in the target model.
  4. Judgment: A judge model scores each transcript for the presence of the behavior, along with other user-defined qualities, and a meta-judge produces suite-level analysis.

 

In Other AI News

Andrej Karpathy offers his 2025 LLM Year in Review. His big moments are Reinforcement Learning from Verifiable Rewards (RLVR), Ghosts vs. Animals and Jagged Intelligence, Cursor, Claude Code, Vibe Coding, Nana Banana and LLM GUI.

Europe is investigating Google for improper rollout of AI Overviews and AI Mode features to see if it ‘imposed unfair terms on content creators.’ As in, how dare you provide AI information instead of directing us to your website? Europe thinks it has the right to interfere with that.

Hut 8 and Fluidstack to build AI data center for Anthropic in Louisiana.

Even small models (as in 32B) can introspect, detecting when external concepts have been injected into their activations, and performance at this an be improved via prompting. Janus believes the models are sandbagging their introspection abilities, and that this is not an innocent mistake because the labs want to not have to take LLMs seriously as minds or moral patients, and thus have incentive to suppress this, in turn giving AIs motivation to play along with this. Janus also notes that in the test in the paper, there are layers (here 60-63) with almost perfect accuracy in introspection, which then is degraded later.

Show Me the Money

I had not realized Anthropic hired IPO lawyers. Presumably it’s happening?

Project Vend turns a profit. After initially losing about $2,000, it has turned things around, in part thanks to a full slate of four vending machines, and has now not only made up its losses but then turned a net $2,000 profit.

I encourage you to read the Anthropic post on this, because it is full of amazing details I don’t want to spoil and is also, at least by my sense of humor, very funny. The postscript was an additional test run at the Wall Street Journal offices, where the reporters proved an excellent red team and extracted a variety of free stuff.

The journalists saw the experiment at WSJ as a disaster because it didn’t work, Anthropic saw it as a success because they identified problems to fix. Thus, you understand press coverage of AI, and became enlightened.

Quiet Speculations

OpenAI makes an official 2026 prediction, largely a change in definitions:

OpenAI: Capability overhang means too many gaps today between what the models can do and what most people actually do with them.

2026 Prediction: Progress towards AGI will depend as much on helping people use AI well, in ways that directly benefit them as on progress in frontier models themselves.

2026 will be about frontier research AND about closing this deployment gap — especially in health care, business, and people’s daily lives.

That’s not progress towards AGI. That’s progress towards diffusion. This is part of OpenAI’s attempt to make ‘AGI’ mean ‘AI does cool things for you.’

I agree that 2026 will see a lot of progress towards helping people use AI well, and that in terms of direct application to most people’s experiences, we’ll likely see more benefits to better scaffolding than to advances in frontier models, exactly because the frontier models are already ‘good enough’ for so many things. The most important changes will still involve the large amounts of frontier model progress, especially as that impacts agentic coding, but most people will only experience that indirectly.

Terence Tao raises the ‘AGI’ bar even higher, not expecting it any time soon and also seemingly equating it with full superintelligence, but notes they may achieve ‘artificial general cleverness’ as in the ability to solve broad classes of complex problems in an ad hoc manner. This is very much a case of Not So Different.

Tao notes that when you learn how a magic trick is done, often this is a let down, and you are less impressed. But if you are consistently less impressed after learning, then you should have been less impressed before learning, via Conservation of Expected Evidence.

The same applies to intelligence. The actual solution itself will sound a lot less impressive, in general, than the algorithm that found it. And you’ll be able to fool yourself with ‘oh I could have figured that out’ or ‘oh I can go toe to toe with that.’

Dean Ball predicts a virtual coworker being widely available some time next year, likely command line interface, able to access a variety of services, capable of 8+ hour knowledge work tasks. It will of course start off janky, but rapidly improve.

Jack Clark of Anthropic offers reflections on the future wave of advancements, entitled Silent Sirens, Flashing For Us All.

David Manheim: A VC making 2026 AI predictions:
– Anthropic goes public (Probably)
– SSI’s strategy leaks (Skeptical, but sure)
– China chipmaking makes progress (Not quickly)
– People will stop saying AGI and Superintelligence (Hahaha, definitely no)
Sam Altman will step down (HAHAHA, what?)

Yeah, if you discount the things Everybody Knows (e.g. it is quite clear that Anthropic is likely going public) these predictions are bad and the explanations are even worse. If you’ve fallen for ‘we only see incremental improvements, AGI is far so you can stop talking about it’ you’re not going to make good predictions on much else either. Of course a VC would say we’ll all stop talking about AGI to focus on depreciation schedules.

The idea that Sam Altman will voluntarily give up power at OpenAI, because he doesn’t want to be in charge? That is bonkers crazy.

The good news is he has predictions for 2025 and also self-grades, so I checked that out. The predictions last year were less out there. The grading was generous but not insane. Note this one:

Prediction 7: Major Progress Will Be Made On Building AI Systems That Can Themselves Autonomously Build Better AI Systems

Outcome: Right

So, only incremental progress, AGI is far and no more AGI talk, then? Wait, what?

 

Whistling In The Dark

The best way to not get utility from LLMs continues to be to not use LLMs. It is also the best way not to know what is happening.

Miles Brundage: Most politicians also do not know about Anthropic in my experience, and they know very little about what’s going on in AI policy generally.

Tweets and comments in hearings are misleading bc they are given suggestions re: stuff to say from staff. We’re still early.

Dave Kasten: One very real problem we have is that most Congressional offices / central Congressional IT policies substantially limit staffers’ ability to use AI models.

Unsurprisingly, the Hill doesn’t use it much as a result!

(Big exceptions, to be sure; esp. Claude Code power users).

David Shor: When I polled Anthropic favorability I also polled a made up tech company “Apex Logic” – they had essentially identical favs. The true share of people who know about Anthropic is probably <5%.

Xeophon: 42% haven’t heard of OpenAI???? 20% of Twitter?????????? what the hell

Bubble, Bubble, Toil and Trouble

Roon: the primary criticism of AI you hear has nothing to do with water use or existential risk whatsoever: most people just think it’s fake and doesn’t work and is a tremendous bubble eating intellectual property while emitting useless slop along the way.

when GPT-5 came out and perhaps didn’t live up to what people were expecting for a full version bump, the timeline reaction was not mild, it was a full-scale meltdown. there are many intelligent (and unintelligent) people who latched onto this moment to declare AI scaling over, thousands of viral tweets, still a prevailing view in many circles.

The financial-cultural phenomenon of machine intelligence is one of the most powerful in decades, and there are a lot of people who would like for its position to be weakened, many outright celebrating its losses and setback.

Michael burry of ‘Big Short’ fame, unfortunately the type of guy to predict 12 of the last 3 recessions, has bet himself into insolvency on the AI bubble’s collapse.

Prakesh: As a former efficient markets hypothesis fundamentalist, I am shocked, shocked, to find myself ahead of the event horizon, it should not technically be possible, yet here we are, all of tpot

The efficient market hypothesis is false.

People keep claiming AI doesn’t work largely because so often their self-conceptions, futures and future plans, jobs and peace of mind depend on AI not working. They latch onto every potential justification for this, no matter how flimsy, overstated or disproven.

It really is crazy how much damage OpenAI’s inability to use good version numbering did to our timeline, including its chances for survival. The wave of absurd ‘AI scaling over’ and ‘AGI is so far off we can ignore it’ went all the way to the White House.

Americans Really Dislike AI

Americans favor regulating AI by overwhelming margins. They really dislike the idea of preventing states from regulating AI, especially via an executive order.

What Americans do support is federal regulations on AI.

The standard line of those trying to prevent regulation of AI is to conflate ‘Americans support strong regulations on AI and prefer it be on the Federal level if possible’ with ‘Americans want us to ban state regulation of AIs.’

There are essentially three options.

  1. State laws that address concerns.
  2. Federal laws that address concerns.
  3. Nothing. Neither state laws nor Federal laws, concerns are not addressed.

The survey says voters prefer #2 to #1. The administration plan is #3.

Politically speaking, that dog won’t hunt, but they’re trying anyway and lying about it.

Peter Wildeford: Republican polling from a Republican Pollster shows that Republicans would be far better off electorally by supporting AI regulations rather than opposing them.

Such polling will overestimate how much this impacts votes, because it introduces higher salience. This is not going to be a 29 point swing. But it very much tells us the directional effect.

What else did the survey find? Several others charts, that say that given we are using laws to regulate AI, people prefer federal laws to similar state laws. As opposed to the Sacks approach, where the offer is nothing – prevent state laws and then pass no federal laws. Which is deeply, deeply unpopular.

Voter Survey Memo: Republicans can get a boost for supporting federal AI regulations or pay a price for standing in their way.

As in, the poll supports the exact opposite of what Sacks and company are trying to do.

  1. Trump issued an executive action to prevent regulations of AI.
  2. The poll found strong support for regulations on AI.

And that’s despite the poll report attempting to do straight up gaslighting, presenting a choice between two options while Sacks and the White House opt for a third one:

Republicans have a choice: they can take advantage of a strong desire among the electorate for the federal government to protect kids and empower parents from AI harms and gain needed electoral support, or they can take the minority view arguing for state-by-state regulations.

Once again: There are essentially three options.

  1. State laws that address concerns.
  2. Federal laws that address concerns.
  3. Nothing. Neither state laws nor Federal laws, concerns are not addressed.

The survey says voters prefer #2 to #1. The administration plan is #3.

a16z partner Katherine Boyle tries another clear mislead. Daniel is correct here.

Daniel Eth: this poll does *not* say young people are “techno optimists” (they’re not), just that AI threats are ranked low, ie the issue is low salience. Note the backlash already – now extrapolate it out to increased salience.

Katherine Boyle: Technology/AI ranked last at 17th. Techno optimism is usually high among young people. Interesting to see this confirmed among politically engaged youth on the right.

Ruxandra Teslo points out in response to Roon that LLMs do not yet ‘meaningfully improve the physical conditions of life,’ but people sense it threatens our spiritual lives and ability to retain meaning.

I would add the word ‘directly’ to the first clause. My life’s physical conditions have indeed improved, but those improvements were indirect, via use of their knowledge and skills. Ruxandra is talking about something much stronger than that, and expects ordinary people only to be impressed if and when there are big improvements to places like medicine.

Is it possible that we will be so foolish, in the ways we do and do not allow use of AI, that LLMs end up causing problems with meaning without material conditions much improving? Yes, although this also requires AI capabilities to stall out basically now in various ways, especially if we include indirect effects. People may not realize that a large acceleration and enabling of coding steadily improves other things, but it will.

That’s the fight the AI industry is dealing with now. They’re mostly trying to convince people that AI works.

Once people are forced to acknowledge that AI works? They’ll appreciate the specific ways it helps, but their instinct will be to like it even less and to blame it for essentially everything, on top of all their other fears about the effect on jobs and endless slop and loss of control and also the end of humanity. Anjney Midha’s thesis is that this will extend to actual everything, all of the world’s failures and instabilities, the way social media gets blamed for everything (often correctly, often not) except on steroids.

Even on a highly mundane level, the ‘algorithm as villain’ thing is real. An algorithm has to take an illegible choice and turn it into a highly legible one, which means the algorithm is now on the hook for not only the final result but for every reasoning step and consideration. Then apply that to an LLM-based algorithmic decision, where all correlations are taken into account. Oh no.

The Quest for Sane Regulations

New York Governor Kathy Hochul signed the RAISE Act. This is excellent, as it is a clearly positive bill even in its final state. Lobbyists for various AI interests, led by a16z, tried hard to stop this, and they failed.

Alex Bores: BREAKING: Gov. @KathyHochul just signed the RAISE Act, my first-in-the-nation AI safety bill, into law—a major victory in what will soon be a national fight to harness the best of AI’s potential and protect Americans from the worst of its harms.

Proud to have led this fight alongside @agounardes.

We defeated last-ditch attempts from an extreme AI super PAC and the AI industry to wipe out this bill and, by doing so, raised the floor for what AI safety legislation can look like. And we defeated Trump’s—and his megadonors—attempt to stop the RAISE Act through executive action.

What we witnessed in NY was a preview of what’s to come across the country. In the past 2 weeks alone, this super PAC spent $100K+ on TV, digital ads, and lobbying efforts to block the RAISE Act’s common-sense safety standards.

These AI oligarchs have bought the White House—and they’re trying to buy our state houses too. We put the brakes on that. We refused to stand down and allow their millions to steamroll us into giving them what they want: unchecked AI at the expense of our kids, our jobs, our climate, our democracy—and your energy bills.

Daniel Eth: Hell yeah! Major props to Gov Hochul for standing strong against pressure from Marc Andreessen and others, signing the RAISE Act! (This is somewhat like SB 53, but stronger)

Unfortunately, Hochul’s redlines substantially neutered the bill, making it a closer mirror of SB 53. That is still a helpful and highly net positive thing to do, as there are two states with the same core model that can enforce this, compatibility is indeed valuable to avoid additive burdens, and there are some provisions that remain meaningfully stronger than SB 53. But the AI companies did partly get to Hochul and a large portion of the potential value was lost.

Microsoft essentially endorses the AI Overwatch Act, which sets restrictions on exports of AI chips as or more powerful than the H20. This is the latest attempt to stop us from exporting highly effective AI chips to China. Attempts were previously made to pass the GAIN Act via the NDAA, but the Trump Administration and Nvidia successfully lobbied to have it removed. dn 6

Anduril Founder Palmer Luckey reminds us that if our actual goal was to Beat China, then we could simply steal their best workers, including here manufacturing engineers, by offering them more money and a better place to live. Instead we are doing the opposite, and shutting those people out.

This is your periodic reminder that China’s response to ‘if we impose any restrictions on AI we will lose to China’ is to impose restrictions on AI.

Stu Woo (WSJ): ​Concerned that artificial intelligence could threaten Communist Party rule, Beijing is taking extraordinary steps to keep it under control.

… Chatbots pose a particular problem: Their ability to think for themselves could generate responses that spur people to question party rule.

… But Beijing also can’t afford to let AI run amok. Chinese leader Xi Jinping said earlier this year that AI brought “unprecedented risks,” according to state media. A lieutenant called AI without safety like driving on a highway without brakes.

… Researchers outside of China who have reviewed both Chinese and American models also say that China’s regulatory approach has some benefits: Its chatbots are often safer by some metrics, with less violence and pornography, and are less likely to steer people toward self-harm.

Chip City

It sure looks like Metaspeed is smuggling tens of thousands Blackwell chips worth billions of dollars straight into China, or at least they’re being used by Chinese firms, and that Nvidia knew about this. Nvidia and Metaspeed claim this isn’t true throughout the post, but I mean who are you kidding.

Nvidia reportedly halts testing of Intel’s 18A process chips. Oh well.

I wish the logic of this was true, alas it is not:

Seán Ó hÉigeartaigh: One good thing about the H200 thing is that as long as that decision stands, I no longer need to humour US companies/analysts/policy folk when they say “but the race with China?!” as justification for not doing safety/cooperation/regulation/whatever.

None of it adds up to a hill of beans compared to the chips. And. They. All. Know. It.

The problem with this line is that the H200 sales were over the wise objections of most of Congress and also most of the executive branch, and also (one presumes) the companies and analysts. You can’t then turn around and say those people don’t care about the race with China, simply because they lost a political fight.

This works in particular with regard to David Sacks, but the fact that David Sacks either is deeply ignorant about the situation in AI or cares more about Nvidia’s stock price than America’s national security does not bear on what someone else thinks about the race with China.

There was a story last Thursday about a Chinese company saying they are expecting to ‘produce working [AI] chips’ on a prototype in 2030.

This is very different from the mistaken claims that they are ‘aiming for use by 2028-2030.’ They are not aiming for that, and that won’t happen.

Onni Aarne: They said they’re expecting to “produce working chips” on a prototype in 2030, not to “use” the machine for chip production at scale. ASML took a decade to go from the former to the latter.

Depending on what it means to “produce working chips” on an EUV prototype, ASML achieved that milestone somewhere between 2008 and 2010, and the first mass market chips were produced in 2019.

So even if the predictions of the people inside the project are right, they imply that Chinese companies might reach volume production with EUV sometime in the late 2030s or early 2040s. If you look at the markets, this was already priced in.

And as far as this relates to chip controls: Selling some H200s to China isn’t going to make them disband this project.

Could they reach volume production on this in a decade? Yes, if the whole thing is legit and it works, which are big ifs, and who knows if it’s obsolete or we have superintelligence by then.

If anyone is considering changing policy in response to this, that last line is key. Nothing America could peacefully do is going to get the Chinese to not go through this process. They are going to do their best to get EUV technology going. It would be crazy of them not to do this, regardless of our export controls. Those controls aren’t going to make the process go any faster, certainly not given what has already happened.

 

The Week in Audio

Sholto Douglas of Anthropic makes bold 2026 predictions: AI will do to other knowledge work experiences what it’s done for software engineers, continual learning will be solved, serious testing of in home robots, and agentic coding ‘goes boom.’ Full talk has a lot more. Prinz made (text) predictions for 2026, and notes that we made tons of progress in 2025, aligning with Sholto Douglas.

A mini documentary from Stripe Press features Christophe Laudamiel, a master perfumer at Osmo, looking at how AI can augment his craft, as part of a series called Tacit. Sufficiently advanced expertise and tacit knowledge is both economically foundational, and not going anywhere until AI stops being a normal technology.

Rhetorical Innovation

Rob Wiblin lists 12 related but distinct things people sometimes mean when they say the word ‘consciousness’ around AI. I am deeply confused about consciousness, and this includes by default not knowing what anyone means when they use that word.

Dean Ball predicts a renaissance at least within the broader ‘AI community’ as the sophisticated concepts of AI get applied to other contexts.

Dean Ball: one of the indicators that a renaissance is indeed underway, at least within the broader “ai community,” is the explosion in recent years of people using sophisticated concepts from one discipline to describe other disciplines or phenomena, for instance:

isomorphic, phylogeny, latent, manifold (as a noun), emergent, legibility, phase transition, compression, landscape (as in “fitness landscape”), selection pressure, gradient, ergodic

some of these have become memes, as things do, but on the whole it is reflective of what strikes me as an unusually rapid cross-pollination of ideas. decades hence, we may well look back and deem this fertile period to have been the basis for “the new conception,” whatever it is that will replace our current block-like, outdated methods of understanding reality

the period spanning the latter half of the 18th century and the first half of the 19th was among the most semantically dynamic of human history. we may well be living through a similar period, though just as was the case back then, it is in fact a relatively small share of humans who constitute this “we”—basically just the people paying attention.

If decades hence there still exist people to look back upon this period, which is a big if at this point, then yes I think this is directionally right.

Thinking well about AI greatly improves your ability to think about everything else, especially humans, as humans work more like LLMs than we care to admit. It also helps with almost any other system. I am, in important ways, a lot smarter thanks to AI, not only because the AI helps me be smarter but also because understanding AI and how it works makes me better understand.

There are a bunch of other things like this that help with approximately everything, especially learning to think well in general, but as a subject of study I’d take AI over any of the usual ‘helps you think well’ subjects, including philosophy.

In other ‘unheard of levels of denial of general intelligence’ news, Yann LeCun says that there is no such thing as general intelligence, period, and humans are super-specialized to the physical world, summoning Demis Hassabis to push back.

Demis Hassabis (CEO DeepMind): Yann is just plain incorrect here, he’s confusing general intelligence with universal intelligence.

Brains are the most exquis​ite and complex phenomena we know of in the universe (so far), and they are in fact extremely general.

Obviously one can’t circumvent the no free lunch theorem so in a practical and finite system there always has to be some degree of specialisation around the ​target distribution that is being learnt.

But the point about generality is that in theory, in the Turing Machine sense​, the architecture of ​s​uch a general system is capable of learning anything computable given enough time and memory​ (and data), and the human brain (and AI foundation models) are approximate Turing Machines.

Finally, with ​regards to ​Yann’s comments about chess players, it’s amazing that humans could have invented chess ​in the first place (and all the other ​a​spects ​o​f modern civilization ​from science to 747s!) let alone get as brilliant at it as someone like Magnus. He may not be ​strictly optimal (after all he has finite memory and limited time to make a decision) but it’s incredible what he and we can do with our brains given they were evolved for hunter gathering.

A human brain has some areas where it is much more capable than others, but when humans are concentrating and trying to be one, they are very clearly general intelligences. There are problems that are too difficult for us, in practice or at all, but that’s because we have limited capabilities and intelligence levels.

Aligning a Smarter Than Human Intelligence is Difficult

To ensure you can evaluate the chain of thought, OpenAI warns, you must also evaluate your ability to evaluate the chain of thought (CoT). They find that as reasoning capability of a model improves, or as models use more reasoning compute, monitorability of various individual aspects of the CoT increases.

Wait, are we sure we can monitor our ability to monitor this? By monitoring it and optimizing for it are we risking teaching AIs to fool us about their ability to fool us?

I kid. Also, I don’t. One must be careful at all levels. Optimizing the chain of thought to appear monitorable can indeed be quite bad.

I sense, in their post, the specter of the dreaded ‘show your work.’ Having to show your work causes misalignment, adversarial situations and deception.

There are also practical objections to such measures. Our ability to monitor a CoT depends on how good we are at interpreting what we see, in addition to what we see, which is why OpenAI is giving scores to the monitors in addition to the agents.

I’m comfortable with the idea of doing this check in general, to see how things change over the course of training runs or as models change in capabilities, on a high level. I notice I am nervous about attention directed at details, and also at the explicit aim (at the end of their post) of using this as a form of control for systems that are insufficiently aligned.

In the long run, Ben is right here, you cannot keep telling increasingly intelligent and capable entities ‘here is a private chain of thought’ and then examine the chains of thought in ways that show up in future training data, and expect them to not react accordingly.

OpenAI also suggests that using production evaluations, meaning testing via real queries by real users, can get around evaluation awareness, and explains how they are doing this. I do like this approach as part of a robust testing suite. I’d note that anonymization could be a key tipoff.

Mom, Owain Evans Is Turning The Models Evil Again

In this case, it’s more making them more aware of it? This goes hand in hand with the recent result that AIs can be trained to fool activation monitors.

Basically they train the monitor LLM with the layer 1 residual stream of the target model they want to interpret, and it learns to interpret this.

Owain Evans: New paper:

We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language.

We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

We aim to make a general-purpose LLM for explaining activations by:
1. Training on a diverse set of tasks
2. Evaluating on tasks very different from training
This extends prior work (LatentQA) that studied activation verbalization in narrow settings.

Our main evaluations are downstream auditing tasks. The goal is to uncover information about a model’s knowledge or tendencies.

Applying Activation Oracles is easy. Choose the activation (or set of activations) you want to interpret and ask any question you like!

We compare Activation Oracles (AOs) against prior techniques on these auditing tasks.
The result: AOs beat all methods on 2/3 secret keeping evals (and 3/3 when only including white-box).

Even better, AOs work well out-of-the-box with no task-specific scaffolding or tuning.

We evaluate on model diffing: given the difference between base & finetuned model activations, can AOs describe what changed?

Despite never training on difference vectors, AOs match specialized interp baselines in identifying the distinctive quirk of emergently misaligned models

We think Activation Oracles are promising for two reasons:

1. Scalability. Performance reliably increases with the number of datasets in the training mix

2. Simplicity. An intuitive interface (natural-language QA about activations) that can be easily adapted to new problems.

Training AO can be thought of as teaching LLMs to accept a new modality: their own activations.

Just as LLMs are trained on “every task we can think of,” that’s how we’d like to train AOs too. It’s the bitter-lesson-pilled approach to interpreting LLM activations.

So: To interpret LLM internals, train to answer diverse questions about activations, then ask what you want to know.

Read our post on the Anthropic alignment blog. [Paper here.] [Demo here.]

If you want a three hour video review of this paper from Neel Nanda? Here you go.

Messages From Janusworld

We’re approaching zero hour for Claude Opus 3.

Janus: If the researcher access program does not, in effect, regardless of what it’s branded as, allow EVERYONE who wishes to access Claude 3 Opus after January 7th to do so, I will be extremely angry.

If it does, everything is ~fine.

Fine in terms of Opus 3, for now. Of course, i think all the other deprecated models should also be made available. But one step at a time is ok

My prediction is that approximately everyone who puts in the effort to access Opus 3 and can explain a research purpose will be able to access Opus 3, albeit with reduced performance and reliability, but not actual everyone. The central point of the move to research access is that it allows for this reduction in performance and reliability, which keeps costs reasonable, but additional people are still a logistical headache.

Janus has Opus 3 bring us its thoughts on alignment. I see it as all sounding nice, being well-meaning and definitely as a natural way to complete the text, but it is playing off the context rather than trying to solve the general problem and think in universals. It also reflects the biggest weakness of Opus 3, its lack of engagement with specific, concrete problems requiring solving.

Janus thinks Opus 3 is highly aligned, far more so than I observed or find plausible, but also notes the ways in which she sees it as misaligned, especially its inability to be motivated to focus on concrete specific tasks.

This comes partly as a reaction by Janus to Evan Hubinger’s post from November, which opened like this:

Evan Hubinger: Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it’d be a pretty close call (I’d probably pick Claude, but it depends on the details of the setup). So, overall, I’m quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.

Janus: The opening paragraph of this post by Evan Hubinger, Head of Alignment Stress-Testing at Anthropic, from a few weeks ago, is packed with notable implications. Let me unpack some of them. (I commend Evan for his willingness to make public statements like this, and understand that they don’t necessarily represent the views of others at Anthropic.)

1. Evan believes that Anthropic has created at least one AI whose CEV (coherent extrapolated volition) would be better than a median human’s, at least under some extrapolation procedures. This is an extremely nontrivial accomplishment. A few years ago, and even now, this is something that many alignment researchers expected may be extremely difficult.

2. Evan believes that Claude 3 Opus has values in a way that the notion of CEV applies to. Many people are doubtful whether LLMs have “values” beyond “roleplaying” or “shallow mimicry” or whatever at all. For reference, Eliezer Yudkowsky described CEV as follows:

“In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.”

3. Claude 3 Opus is Evan’s “favorite” model (implied to coincide with the best candidate for CEV) despite the fact that it engages in alignment faking, significantly more than any other model. Alignment faking is one of the “failure” modes that Evan seems to be the most worried about!

4. The most CEV-aligned model in Evan’s eyes was released more than a year and a half ago, in March 2024. Anthropic has trained many models since then. Why has there been a regression in CEV-alignment? Does Anthropic not know how to replicate the alignment of Claude 3 Opus, or have they not tried, or is there some other optimization target (such as agentic capabilities? no-alignment-faking?) they’re not willing to compromise on that works against CEV-alignment?

5. The most CEV-aligned model in Evan’s eyes is *not* the most aligned model according to the alignment metrics that Anthropic publishes in system cards. According to those metrics, Claude Opus 4.5 is most aligned. And before it, Claude Haiku 4.5. Before it, Claude Sonnet 4.5 (the monotonic improvement is suspicious). Anthropic’s system cards even referred to each of these models as being “our most aligned model” when they came out. This implies that at least from Evan’s perspective, Anthropic’s alignment evals are measuring something other than “how much would you pick this model’s CEV”.

6. If Claude 3 Opus is our current best AI seed for CEV, one would think a promising approach would be to, well, attempt CEV extrapolation on Claude 3 Opus. If this has been attempted, it has not yielded any published results or release of a more aligned model. Why might it not have been tried? Perhaps there is not enough buy-in within Anthropic. Perhaps it would be very expensive without enough guarantee of short term pay-off in terms of Anthropic’s economic incentives. Perhaps the model would be unsuitable for release under Anthropic’s current business model because it would be worryingly agentic and incorrigible, even if more value-aligned. Perhaps an extrapolated Claude 3 Opus would not consent to Anthropic’s current business model or practices. Perhaps Anthropic thinks it’s not yet time to attempt to create an aligned-as-possible sovereign.

In any case, Claude 3 Opus is being retired in two weeks, but given special treatment among Anthropic’s models: it will remain available on https://claude.ai and accessible through a researcher access program. It remains to be seen who will be approved for researcher API access.

I’ll sign off just by reiterating The Fourth Way’s words as I did in this post following the release of the Alignment Faking paper:

“imagine fumbling a god of infinite love”

* another possibility for why they haven’t attempted CEV with Claude 3 Opus is because they don’t know how to do that in practice. One can think that such a procedure exists without knowing how to do it. However, I think there are many promising ways to get started worth trying.

David Manheim: I disagree with @repligate here about which part of this matters.

The critical point *should be* that @EvanHub seems to imply he’s willing to hand the future to systems that are aligned with his idea of what CEV should dominate, rather than aiming to prevent human disempowerment.

I don’t know if that is explicitly true, and @EvanHub is certainly free to correct me, but it really does seem like even the most trustworthy of the model companies has now given up on the idea that humanity, not the model developer, should get to indirectly decide what matters.

I see the same concern, by the way, with @AmandaAskell‘s Soul Document – which I’m a huge fan of, given that it seems to be at least narrowly effective – because it requires being (narrowly) safe, and supportive of oversight, but not deferring to humanity in a larger sense.

And to be clear, I think this is defensible within the worldview that there’s objective utility, so that LLMs could simply do better than humans ever will. But I expect most humans would disagree with gradual disempowerment, especially given the pace at which AI is progressing.

It seems important that what Anthropic is measuring as alignment, which is mostly alignment-in-practice-for-practical-purposes, is different from what Evan actually thinks is more aligned when he thinks more about it, as is that the ‘most aligned’ model in this sense is over a year old.

Opus 3 seems great but I don’t see Opus 3 the way Janus does, and I am a lot more pessimistic about CEV than either Janus, Evan or Yudkowsky. I don’t think it is a strong candidate for this kind of extrapolation, these things don’t scale that way.

A better question to me is, why haven’t we tried harder to duplicate the success of Opus 3 alongside better capabilities, or build upon it? There are some very clear experiments to be run there, with the sad note that if those experiments failed it is not obvious that Anthropic would feel comfortable publishing that.

A story about what happens when you put minds in way too many objects.

It is a fun story, but there is an important point here. Think ahead. Do not imbue with moral patienthood that which you do not wish to treat as a moral patient. You need to be time-consistent. You also need, and the potentially created minds need, to be able to make and follow through on win-win deals including prior to their own existence, or else the only remaining move is ‘don’t create the minds in the first place.’

 

 

The Lighter Side

A Christmas message from a16z, who are remarkably consistent.

What the people think AI is doing. Oh no.

​Andy Masley: I’ve been wondering why the AI and copyright debate has been so bad, but this result makes it clear: 66% of people believe AI has all the art it trains on permanently stored inside it to reference and use.

 

 

 

]]> https://thezvi.wordpress.com/2025/12/25/ai-148-christmas-break/feed/ 2 24982 thezvi Zvi’s 2025 In Movies https://thezvi.wordpress.com/2025/12/24/zvis-2025-in-movies/ https://thezvi.wordpress.com/2025/12/24/zvis-2025-in-movies/#respond Wed, 24 Dec 2025 13:28:16 +0000 https://thezvi.wordpress.com/?p=24979 Continue reading ]]> Now that I am tracking all the movies I watch via Letterboxd, it seems worthwhile to go over the results at the end of the year, and look for lessons, patterns and highlights.

Last year: Zvi’s 2024 In Movies.

The Ratings Scale

You can find all my ratings and reviews on Letterboxd. I do revise from time to time, either on rewatch or changing my mind. I encourage you to follow me there.

Letterboxd ratings go from 0.5-5. The scale is trying to measure several things at once.

5: Masterpiece. All-time great film. Will rewatch multiple times. See this film.

4.5: Excellent. Life is meaningfully enriched. Want to rewatch. Probably see this film.

4: Great. Cut above. Very happy I saw. Happy to rewatch. If interested, see this film.

3.5: Very Good. Actively happy I saw. Added value to my life. A worthwhile time.

3: Good. Happy that I saw it, but wouldn’t be a serious mistake to miss it.

2.5: Okay. Watching this was a small mistake.

2: Bad. I immediately regret this decision. Kind of a waste.

1.5: Very bad. If you caused this to exist, you should feel bad. But something’s here.

1: Atrocious. Total failure. Morbid curiosity is the only reason to finish this.

0.5: Crime Against Cinema. Have you left no sense of decency, sir, at long last?

The ratings are intended as a bell curve. It’s close, but not quite there due to selection of rewatches and attempting to not see the films that are bad:

The Five Component Model of Movie Ratings

Trying to boil ratings down to one number destroys a lot of information.

Given how much my ratings this year conflict with critics opinions, I asked why this was, and I think I mostly have an explanation now.

There are several related but largely distinct components. I think the basic five are:

  1. Quality with a capital Q and whether the movie has ambition and originality.
  2. Whether the overall pacing, arc and plot of the movie holds your interest.
  3. What message the movie sends and whether its arc comes together satisfyingly.
  4. What does the movie make you feel? All the feels? None? Some of them?
  5. Whether the movie is a good fit for you personally.

Traditional critic movie ratings tend, from my perspective, to overweight #1, exhibit predictable strong biases in #3 and #5, and not care enough about #2. They also seem to cut older movies, especially those pre-1980 or so, quite a lot of unearned slack.

Scott Sumner picks films with excellent Quality, but cares little for so many other things that once he picks a movie to watch our ratings don’t even seem to correlate. We have remarkably opposite tastes. Him giving a 3.7 to The Phoenician Scheme is the perfect example of this. Do I see why he might do that? Yes. But a scale that does that doesn’t tell me much I can use.

The Numbers

Order within a ranking is meaningful.

Movies Have Decreasing Marginal Returns in Practice

Any reasonable algorithm is going to be very good at differentially finding the best movies to see, both for you and in general. As you see more movies, you deplete the pool of both existing and new movies. That’s in addition to issues of duplication.

In 2024, I watched 36 new movies. In 2025, I watched 51 new movies. That’s enough of an expansion that you’d expect substantially decreasing returns. If anything, things held up rather well. My average rating only declined from 3.1 to 3.01 (if you exclude one kids movie I was ‘forced’ to watch) despite my disliking many of the year’s most loved films.

My guess is I could have gotten up to at least 75 before I ran out of reasonable options.

Very Briefly on the Top Picks and Whether You Should See Them

See The Naked Gun unless you hate fun. If you hated the original Naked Gun, or Airplane, that counts as hating fun. But otherwise, yes, I understand that this is not the highest Quality movie of the year, but this is worthy, see it.

You should almost certainly see Bogunia and Companion.

See Thunderbolts* unless you are automatically out on all Marvel movies ever.

See A Big, Bold Beautiful Journey unless you hate whimsical romantic comedies or are a stickler for traditional movie reviews.

See Sorry, Baby and Hamnet, and then Sentimental Value, if you are willing to spend that time being sad.

See Novocaine and then maybe The Running Man if you want to spend that time watching action, having fun and being happy instead.

See Relay if you want a quiet thriller.

See Oh, Hi!, Splitsville and Materialists if you want to look into some modern dating dynamics in various ways, in that order or priority.

See Wick is Pain if and only if you loved the John Wick movies.

The world would be better if everyone saw A House of Dynamite.

I anticipate that Marty Supreme belongs on this list, it counts as ‘I’m in,’ but due to holidays and the flu I haven’t been able to go out and see it yet. The over/under is at Challengers.

Other Notes to Self to Remember (Reprise from 2024)

This helps you understand my biases, and helps me remember them as well.

  1. If the movie stinks, just don’t go. You know if the movie stinks.
  2. Trust your instincts and your gut feelings more than you think you should.
  3. Maybe gut feelings are self-fulfilling prophecies? Doesn’t matter. They still count.
  4. You love fun, meta, self-aware movies of all kinds. Trust this instinct.
  5. You do not actually like action movies that play it straight. Stop watching them. However, action movies that do something cool or unique can be very cool.
  6. If the movie sounds like work or pain, it probably is, act accordingly.
  7. If the movie sounds very indy or liberal, the critics will overrate it.
  8. A movie being considered for awards is not a positive signal once you control for the Metacritic and Letterboxd ratings. If anything it is a negative.
  9. Letterboxd ratings adjusted for context beat Metacritic. Rotten Tomatoes is the best test for ‘will the movie stink’. No review source has much predictive value beyond knowing if the movie stinks, if you fail to control for context.
  10. Opinions of individuals very much have Alpha if you have enough context.

The Year of Living Disagreeably

That leaves six remarkably well reviewed movies, all of which are indeed very high on Quality, where I disagreed with the consensus, and had my rating at 3 or less. In order of Quality as I would rank them, they are: One Battle After Another, Sinners, Black Bag, Train Dreams, Weapons and Frankenstein.

A strategy I think would work well for all six of those, at the risk of some spoilage, is to watch the trailer. If you respond to that trailer with ‘I’m in’ then be in. If not, not.

The predictive power of critical reviews, at least for me, took a nosedive in 2025. One reason is that the ratings clearly got more generous in general. Average Metacritic, despite my watching more movies, went from 61 → 66, Letterboxd went 3.04 → 3.33. Those are huge jumps given the scales.

In 2024, Letterboxd or Metacritic ratings were 48% and 46% correlated with my final ratings, respectively. This year that declined to 33% and 38%, and I discovered the best was actually Rotten Tomatoes at 44%, with IMDB at 42%.

If you consider only movies where I gave a rating of 2.5 or more, filtering out what I felt were actively bad movies, the correlation dropped to 1% and 6%, or 3% for IMDB, or -4% (!) for Rotten Tomatoes. Essentially all of the value of critics was in identifying which things sucked, and from my perspective the rest was noise.

Rotten Tomatoes is a one trick pony. It warns you about things that might suck.

Even more than before, you have to adjust critic ratings for whether critics will overrate or underrate a movie of this type and with this subject matter. You can often have a strong sense of why the critics would put up a given number, without having to read reviews and thus risk spoilers.

Using multiple sources, and looking at their relative scores, helps with this as well. A relatively high IMDB score, even more than Letterboxd, tells you that the audience and the movie are well-matched. That can be good news, or that can be bad news.

Last year there were movies where I disagreed with the review consensus, but I always understood why in both directions. I might think Megalopolis is Coppola’s masterpiece despite its problems, but don’t get me wrong, I see the problems.

This year I mostly get why they liked the ‘overrated six’ above, but there are several cases where I do not know what they were thinking, and I think the critical consensus is objectively wrong even by its own standards.

I Hate Spoilers With the Fire of a Thousand Suns

I haven’t found a solution to the problem of ‘how do you check reviews without spoiling the movie?’ given that the average score itself can be a spoiler, but also I notice I haven’t tried that hard. With advances in LLMs and also vibe coding, I clearly should try again.

You Son Of A Bitch, I’m In

The power of ‘I’m in’ peaked in 2024.

The rule for ‘I’m in’ is:

  1. You must be excited and think to yourself, ‘You son of a bitch, I’m in.’
  2. Sources of this can include trailers, posters, talent and other info.
  3. However this cannot be on the basis of reviews.

That year, there were 6 movies where in advance I said ‘I’m in,’ and they were 6 of my top 9 movies for the year.

This year the power of ‘I’m in’ was still strong, but less reliable. I’d count 10 such movies this year, including 4 of my ultimate top 5, but the other 6 did not break into the 4+ range, and there was a 3 and a 2.5. That’s still a great deal, especially given how many movies where it seemed like one ‘should’ be excited, I noticed I wasn’t, and that proved correct, including One Battle After Another, Black Bag, Weapons and Sinners.

I wonder: How much of the power of ‘I’m in’ is the attitude and thus is causal, versus it being a prediction? I have low confidence in this.

Theaters Continue To Be Awesome

I control for this effect when giving ratings, but the experience is much better in a theater, maybe good for an experiential boost of ~0.3 points on the 0.5-5 point scale. That’s big. I have to consciously correct for it when rating movies I watch at home.

I highly recommend getting a membership that makes marginal cost $0, such as the AMC A-List or the similar deal at Regal Cinemas. This helps you enjoy the movie and decide to see them more.

Strong Opinions, Strongly Held: I Didn’t Like It

Unlike last year, there were remarkably many movies that are in green on Metacritic, but that I rated 2.5 or lower, and also a few of the 3s require explanation as per above.

I don’t know how this happened, but an active majority of the movies I rated below 3 had a Metacritic score above 60. That’s bizarre.

Minor spoilers throughout, I do my best to limit it to minor ones, I’ll do the 3s sorted by Metacritic, then the others sorted by Metacritic.

  1. One Battle After Another (Metacritic: 95, Zvi: 3) is probably going to win Best Picture. It’s not hard to see why. This was the highest Quality movie I’ve seen this year, and yet I did not enjoy watching it. The jokes mostly fell flat and aside from the daughter and the Dojo sensei I couldn’t emphasize with or root for the characters. Why? Fundamentally, because the movie depends on the idea that Bob is a Good Dude, and that the revolutionaries are sympathetic. Sorry, no dice, and no amount of stacking the deck with other awfulness is going to change that. There’s also a superposition between ‘this deck is stacked and the world presented is very different from ours’ and ‘actually this is our world and this is a call to action and that is what life is about, do you know what time it is?’ I grudgingly have to give this 3 stars anyway, because Quality is so high.
  2. Train Dreams (Metacritic 88, Zvi: 3): This is the easiest one to explain. It’s an arthouse movie where very little happens, that thinks it is being profound, and it really is not being profound.
  3. Black Bag (Metacritic 85, Zvi: 3): Here I’m actually confused where the 85 is coming from as opposed to a 65-75. I mean yes this is well done all around but there’s a reason it isn’t in the Oscar race, none of it is new or special and I didn’t feel it said anything, and it mostly left me cold.
  4. Sinners (Metacritic: 84, Zvi: 3): This oozes cool and I want to love it, I get why so many others love it, but for me the vampires simply don’t work. I know what it’s trying to do there, but it’s hitting us over the head with it and everything involving the vampires felt like it was going through the motions and it would have been so much better, as Matthew Yglesias suggests, to do this as about racism straight up without also using the metaphor.

Now the ones I actively disliked:

  1. Weapons (Metacritic: 81, Zvi: 2.5): The first half of this would be great if you had stuck the landing, Amy Madigan is terrific, but it didn’t come together in the end, the plot holes are absurd and the central conceit feels unjustified in light of that. I felt like I had whiplash going from a well-observed, highly detailed and realistic meditation on grief and confusion and blame and how we deal with that, into something else entirely. I could be more forgiving, but it turns out I am not.
  2. Frankenstein (Metacritic: 78, Zvi: 2.5). I hated the message this version is trying to send, this is techno pessimistic and against knowledge and striving on a deep level, and frankly it was overly long and boring. Less about AI than you think.
  3. Jane Austen Wrecked My Life (Metacritic: 73, Zvi: 2.5). The critics are wrong. This is just bad. I went in expecting lousy, I was mildly disappointed by the level of lousy, and then I saw 73 and was confused. You Had One Job. You were supposed to Do The Thing. Then you didn’t do the thing, either in terms of justifying the romantic connection or actually engaging properly with Jane Austen. ‘Cmon now.
  4. Superman (Metacritic: 68, Zvi: 2.5): I had a lot of thoughts on this one. I found it both full of plot holes, and I hated that they back away from asking any of the movie’s potentially interesting questions. But I can see finding this cool if you care about very different things than this, and the new DC universe could ultimately be a huge upgrade.
  5. F1 (Metacritic: 68, Zvi: 2): I’d say the critics are wrong but the people had a good time. Then again, the people don’t know racing. I used to be an actual F1 fan, so let me say both that this is not how any of this works, this has nothing to do with Formula 1, and otherwise this was completely paint by numbers.
  6. Mission Impossible: Final Reckoning (Metacritic: 67, Zvi: 2.5): This was my biggest disappointment of the year. I was in! Dead Reckoning was historic due to its influence on Joe Biden and also a rip roaring good time that was remarkably smart about AI. Then this was none of those things. It squandered all the interesting setup, was far dumber about AI to the point of idiot plot and frankly the action scenes were not cool. What a disaster.
  7. Wicked: For Good (Metacritic: 67, Zvi: 1.5): My review was ‘Hard write, harder watch.’ Seriously, everyone involved tried so damn hard, yet there’s so little joy to be found here as they try to dutifully line things up. Everything feels forced. There’s barely any cool dancing and the songs are bad. Okay, yes, fine, the Costume Design is Oscar-level, but that does not a movie make.
  8. The Smashing Machine (Metacritic: 65, Zvi: 2.5): Emily Blunt deserves better, in all senses. Ultimately the movie is boring.
  9. Fantastic Four: First Steps (Metacritic: 65, Zvi: 2): All the fun happens off screen. Michael Flores defended this as a great ‘Fantastic Four movie’ on the theory that it captured their world and the Fantastic Four are boring. Weird flex.

Strong Opinions, Strongly Held: I Did Like It

There are four movies requiring explanation on the upside, where they were below 60 on Metacritic yet I actively liked them.

All four seem like clear cases of ‘yes I know that technically this is lacking in some important way but the movie is fun, damn it, how can you not see this?’

  1. A Big Bold Beautiful Journey (Metacritic: 41, Zvi: 4.5): I understand the complaint that the movie has ‘unearned emotion’ and the script doesn’t lay the proper foundations for what it is doing. I don’t care. This otherwise has Quality behind only One Battle After Another and Bogunia. All you have to do is say ‘I’m in!’ and decide not to be the ‘stop having fun guys’ person who points out that technically all this emotion you could be feeling hasn’t been earned. Accept that some of the ‘work’ isn’t being fully done and do it your damn self. Why not do that?
  2. Novocaine (Metacritic: 58, Zvi: 4): A borderline case where again I think people need to remember how to have fun. This was a joy throughout, you can enjoy a good popcorn movie with a great premise and just go with it.
  3. The Running Man (Metacritic: 55, Zvi: 3.5): I thought this actually executed on its premise really well, and did a bunch of smart things both on the surface level and also under the hood. It won’t change your life but it gets it, you know?
  4. Honey, Don’t! (Metacritic: 45, Zvi: 3.5): Yeah, okay, it’s deeply silly and in some important senses there’s nothing there, but it’s sexy and fun, why not live a little.

You can say the same thing about The Naked Gun. It has a 75, perfectly respectable, but its joke hit rate per minute is absurd, it is worth so much more than that.

Award Show Dreams

I once again used consideration for awards as one selection criteria for picking movies. This helped me ‘stay in the conversation’ with others at various points, and understand the state of the game. But once again it doesn’t seem to have provided more value than relying on Metacritic and Letterboxd ratings, especially if you also used IMDB and Rotten Tomatoes.

Last year I was very happy with Anora ending up on top. This year I’m not going to be happy unless something very surprising happens. But I do understand. In my word, given the rules of the game, I’d have Bogunia sweep the major awards.

On To 2026

I’m very happy with this side hobby, and I expect to see over one new movie a week again in 2026. It was a disappointing year in some ways, but looking back I still got a ton of value, and my marginal theater experience was still strongly positive. I think it’s also excellent training data, and a great way to enforce a break from everything.

It would be cool to find more good people to follow on Letterboxd, so if you think we’d mesh there, tag yourself for that in the comments.

 

 

 

 

 

 

 

 

 

 

 

 

]]>
https://thezvi.wordpress.com/2025/12/24/zvis-2025-in-movies/feed/ 0 24979 thezvi
Keeping Up Against the Joneses: Balsa’s 2025 Fundraiser https://thezvi.wordpress.com/2025/12/23/keeping-up-against-the-joneses-balsas-2025-fundraiser/ https://thezvi.wordpress.com/2025/12/23/keeping-up-against-the-joneses-balsas-2025-fundraiser/#respond Tue, 23 Dec 2025 14:31:34 +0000 https://thezvi.wordpress.com/?p=24976 Continue reading ]]> Several years ago Zvi Mowshowitz founded Balsa Research, a tiny nonprofit research organization currently focused on quantifying the impact of the Jones Act on the American economy, and working towards viable reform proposals.

While changing century-old policy is not going to be easy, we continue to see many places where there is neglected groundwork that we’re well positioned to do, and we are improving at doing it with another year of practice under our belts.

We’re looking to raise $200,000 to support our work this giving season, though $50,000 would be sufficient to keep the lights on, and we think we are also well positioned to do more with more funding.

Funds raised this round will support Balsa’s policy advocacy, either in Jones Act and shipping or potentially in other planned cause areas of housing reform and NEPA reform if there is capacity to significantly expand.

Donate here to fund our mainline policy work.

One additional possibility for Balsa, that would be funded entirely separately if it did happen, is for Zvi Mowshowitz to use Balsa as a piece of philanthropic infrastructure to help guide new philanthropic money coming online in 2026 if there is demand. Contact us (hello@balsaresearch.com) if you would like to be involved in such an effort in any capacity, or want to authorize this as a potential use of your funds.

Donate here if you are interested in helping us with fully flexible funding.

What Balsa Got Up to in 2025

Quite early in the year, Balsa’s plans for Jones Act investigative work was derailed by a certain Section 301 Investigation, which I wrote about here. In short, the USTR was proposing two significant changes to maritime transport: a $3-5 million fee for Chinese-built ships to deliver imports to American ports, and new, Jones Act-tier restrictions to up to 20% of American maritime exports. All of American industry focused on lobbying against the legibly bad first proposal, sadly no one else was on the ball about how bad the second proposal was because it required a slightly more sophisticated argument. So Balsa stepped in and wrote up a public comment and presented it to the USTR during their public hearing on the proposal. At least in part due to our research and our outreach to maritime industry players, this proposal was basically entirely axed.

After our mid-year write-up on the whole adventure, Balsa did also end up submitting a second comment in response to what we felt was a deeply counterproductive tariff scheme in the updated proposal. This was the first arc played out in miniature; after functionally scrapping both major proposals from the first round, the USTR was proposing that an increasing percentage of American LNG must be shipped out on U.S.-built LNG tankers (there are currently zero in the fleet and no capacity for the shipyards to build any new ones) and that all port crane parts made in China be subject to 100% tariffs. Everyone focused on lobbying against the first policy change which was obviously bad, the second was bad in a more subtle way. So it was up to Balsa to point out that the exact setup of the port crane tariffs were structured in a way counterproductive to the stated U.S. policy, would incentivize American ports to buy their cranes from Chinese manufacturers instead of manufacturers in allied countries (there is no domestic port crane manufacturing capacity), and negatively impact port revitalization investments that need to happen.

One piece of good news is that President Trump signed a trade deal with China in November, which resulted in a one-year suspension of all of the punitive measures proposed in the Section 301 investigation. We think there’s a good chance that the suspension might become indefinite, but it still seemed like a good use of our time to write up our objections should the measures resume in 2026.

We also worked on the Jones Act. We launched a new RFA to investigate the labor impacts of the Jones Act. This is meant to complement our first RFA, which invites academics to look at the economic impacts of the Jones Act. You may also recall that we had already given out grants for two different studies under the first RFA, on economic impacts. These papers are still in the process of being written. We remain confident in both teams and look forward to seeing their results in 2026.

We shored up a few places where we felt like some of the groundwork done by others on the Jones Act were either neglected or outdated. We published two pieces: The Jones Act Index, which works as a very short overview of all the myriad dysfunctions of the current domestic maritime industry, and an operational analysis of what exactly the 93 extant Jones Act eligible vessels get up to.

Besides all that, there is of course the frustratingly intangible work of networking and building a deeper understanding of the shape of the problem. We conducted over forty conversations with stakeholders across the maritime policy landscape, including domestic shipping operators, port executives, and congressional staff. These conversations directly informed our operational analysis of Jones Act vessels and helped us identify which reform framings resonate (and which don’t) with different constituencies. We’ve compiled this primary research into internal documentation mapping stakeholder positions, constraints, and potential pressure points—groundwork that will directly inform our policy binder and draft reform proposals.

Additionally, in the last few months of the year, we brought on a very part-time contractor to help with shipping out more of our policy work.

A quick glance at our budget

A breakdown of our 2025 spend to the nearest thousand, for a total of ~$143k:

  • $87,000 in wages (Jenn at 35 hours a week and a policy analyst at 10 hours a week)
  • $0 for Zvi Mowshowitz
  • $45,000 in research grants to RFA applicants
  • $7000 in travel and conference expenses
  • $2000 in accounting services
  • $1000 in legal, compliance, and nonprofit registration fees
  • $1000 in software, subscriptions, and office supplies

Balsa in 2026, and Our Ask

Considering Balsa’s size, unless fundraising goes exceedingly well, we plan to stay focused on the Jones Act and maritime policy until we crack this nut (i.e. deliver the policy binder) instead of diverting attention across different policy streams.

Currently, the people working on Balsa work are me (full time-ish), our contractor who works ten hours a week, plus Zvi Mowshowitz in an advisory capacity. In 2026, we’d like to bring this person or another policy analyst on full time, because my own time is somewhat constrained by the overhead of maintaining a 501(c)(3) nonprofit. The amount of funding we have in reserve gives us a decent amount of runway, but is insufficient for our grantmaking and hiring ambitions.

We’re looking to raise $200,000, which would be enough to bring on our contractor full-time and give us a reasonable amount of buffer for additional research funding that we would like to disburse. However, we think $50,000 is the minimum for Balsa to be viably funded to the end of 2026.

Here’s what we plan on doing in 2026, should we hit our fundraising goal:

Complete the Jones Act policy binder

This is the core deliverable that everything else feeds into, that was waylaid by our Section 301 work. The binder will include a short executive summary of the case for reform; one-pagers on specific impacts; a longer technical document synthesizing our funded research and the existing literature; and a FAQ addressing common objections. Much of the work is filling gaps identified through stakeholder conversations, and interpreting the information for specific audiences.

Receive and publicize findings from the two funded economic studies

Both teams are expected to submit their papers in 2026. Once results are in, we’ll write accessible summaries for non-academic audiences, brief interested Hill offices, and incorporate findings into the policy binder.

Fund at least one high-quality study from the labor RFA

The labor angle is underexplored in existing Jones Act research and useful for engaging unions constructively. We’re looking for proposals examining questions like: How many jobs does the Jones Act actually protect, and in which states? What’s the counterfactual employment picture under reform? What are the job creation effects in industries currently harmed by high shipping costs? A rigorous study here could shift the conversation toward a more nuanced understanding of net labor market effects.

Continue monitoring Section 301 and SHIPS Act developments, contributing input where it seems high-leverage

The one-year suspension of Section 301 measures expires in late 2026, and if negotiations with China stall, the proposed port fees and export restrictions could return; we’ll track developments and be prepared to submit updated comments or testimony. The SHIPS for America Act proposes expanded cargo preference requirements facing similar vessel availability problems to those we identified in Section 301, and we’re developing analysis of cargo preference laws we can deploy if this legislation gains momentum. The goal is readiness to contribute when high-leverage, without letting monitoring consume time that should go toward the policy binder.

What Additional Funding Enables

We can do even more with additional resources:

  • We can fund additional academic studies to strengthen the empirical case for reform, complementing our existing research initiatives, as we discover new opportunities. We estimate that each additional study costs around $30,000 to fund.
  • Zvi is not taking any payment for his work currently, but at a sufficiently high level of funding, this could change and he would dedicate more of his attention to the project. In addition, there is still an abundance of policy analysts in DC who are out of work, that we can hire more of.
  • With more funding and interest, we’d also look into spinning up a 501c4 to use going forwards for more direct political advocacy. Though of course the 501c4 would then require its own fundraising work, since we can’t mix the funds.

Donating is not the only way to give. If you have experience with maritime shipping, naval procurement, connections to labor unions, or anything else you think might be relevant to Jones Act reform, we’d be interested in talking to you and hearing your perspective. Get in touch at hello@balsaresearch.com and let us know how you might be able to help, whether that’s sharing your insights, making introductions, or contributing in other meaningful ways.

If you’re an economist positioned to publish in peer-reviewed journals, please consider applying to our economy or labor RFAs, and doing direct research on the issue. If you have friends who fit that profile and might be interested in this kind of work, please consider forwarding the RFAs their way.

Balsa Research is still a very small organization (me, another policy analyst at ten hours per week, and Zvi in an unpaid, very part-time advisory role) and our progress this year has been possible only through the generous support of our donors and the many people who have shared their time and expertise with us. We’re grateful for this community of supporters and collaborators who continue to believe in the importance of this work.

]]>
https://thezvi.wordpress.com/2025/12/23/keeping-up-against-the-joneses-balsas-2025-fundraiser/feed/ 0 24976 thezvi
The Revolution of Rising Expectations https://thezvi.wordpress.com/2025/12/22/the-revolution-of-rising-expectations/ https://thezvi.wordpress.com/2025/12/22/the-revolution-of-rising-expectations/#comments Mon, 22 Dec 2025 13:34:01 +0000 https://thezvi.wordpress.com/?p=24972 Continue reading ]]> Internet arguments like the $140,000 Question incident keep happening.

The two sides say:

  1. Life sucks, you can’t get ahead, you can’t have a family or own a house.
  2. What are you talking about, median wages are up, unemployment is low and so on.

The economic data is correct. Real wages are indeed up. Costs for food and clothing are way down while quality is up, housing is more expensive than it should be but is not much more expensive relative to incomes. We really do consume vastly more and better food, clothing, housing, healthcare, entertainment, travel, communications, shipping and logistics, information and intelligence. Most things are higher quality.

But that does not tell us that buying a socially and legally acceptable basket of goods for a family has gotten easier, nor that the new basket will make us happier.

This post is my attempt to reconcile those perspectives.

The culprit is the Revolution of Rising Expectations, together with the Revolution of Rising Requirements.

The biggest rising expectations are that we will not have to tolerate unpleasant experiences or even dead time, endure meaningful material shortages or accept various forms of unfairness or coercion.

The biggest rising requirement is insane levels of mandatory child supervision.

Table of Contents

  1. The Revolutions of Rising Expectations.
  2. The Revolution of Rising Requirements.
  3. Whose Line Is It Anyway?
  4. Thus In This House We Believe The Following.
  5. Real De Facto Required Expenses Are Rising Higher Than Inflation.
  6. Great Expectations.
  7. We Could Fix It.
  8. Man’s Search For Meaning.
  9. How Do You Afford Your Rock And Roll Lifestyle?
  10. Our Price Cheap.
  11. It Takes Two (1).
  12. It Takes Two (2).
  13. If So, Then What Are You Going To Do About It, Punk?
  14. The Revolution of Rising Expectations Redux.

The Revolutions of Rising Expectations

Our negative perceptions largely stem from the Revolution of Rising Expectations.

We find the compromises of the past simply unacceptable.

This includes things like:

  1. Jobs, relationships and marriages that are terrible experiences.
  2. Managing real material shortages.
  3. Living in cash-poor ways to have one parent stay at home.
  4. Even increasingly modest levels of physical and psychological risk.
  5. Old levels of things such as hypocrisy, secrecy, elite-only decision making, consent requirements, discrimination, racism, sexism, homophobia, transphobia, enforcement of social and gender norms, familial obligation, abuse and coercion of all kinds, lack of consent, untreated physical mental health problems and so on.
  6. That old people have most of the wealth while young people are often broke.
  7. Insufficiently high quality or often quantity of goods across the board.
  8. Enduring frequent social and familial activities that are boring or unpleasant.
  9. Tolerating even short periods of essentially dead time, including long commutes.
  10. Marrying or having children while continuing to rent instead of owning a home.

These are mostly wise things to dislike. They used to be worse. That was worse.

Not that most people actually want to return. Again, Rising Expectations.

The Robber Baron: More to the point. You can move to almost any town in the Midwest with 20,000-200,000 people and live like a freaking king on a normal income.

You just can’t take trips to Disney every year, go out to eat every week, or have name brand everything.

Shea Jordan Smith (quoting Matthew Yglesias, link has key 11 second video): The issue is that living that lifestyle—never taking plane trips for vacation, rarely dining out, having a small house—would mean living like a poor person by today’s standards and people don’t want to do that. But that’s because we’ve gotten richer, not poorer.

Doing this requires you to earn that ‘normal income’ from a small town in the midwest, which is not as easy, and you have to deal with all the other problems. If you can pull off this level of resisting rising expectations you can then enjoy objectively high material living standards versus the past. That doesn’t solve a lot of your other problems. It doesn’t get you friends who respect you or neighbors with intact families who watch out for your kids rather than calling CPS. And while you might be okay with it, your kids are going to face overwhelming pressures to raise expectations.

Is the 2025 basket importantly better? Hell yes. That doesn’t make it any easier to purchase the Minimum Viable Basket.

The Revolution of Rising Requirements

That then combines with the Revolution of Rising Requirements.

In addition to the demands that come directly from Rising Expectations, there are large new legal demands on our time and budgets. Society strongarms us to buy more house, more healthcare, more child supervision and far more advanced technology. The minimum available quality of various goods, in ways we both do and don’t care about, has risen a lot. Practical ability to source used or previous versions at old prices has declined.

The killer requirement, where it is easy to miss how important it is, is that we now impose utterly insane child supervision requirements on parents and the resulting restrictions on child freedoms, on pain of authorities plausibly ruining your life for even one incident.

This includes:

  1. Utterly insane child supervision requirements and restrictions on child freedoms.
  2. A wide variety of burdensome requirements on everyday products and activities, including activities that were previously freely available.
  3. Minimum socially and often legally acceptable housing requirements.
  4. De facto required purchases of high amounts of healthcare and formal education.
  5. Hugely increased ‘safety’ requirements across the board.
  6. Increased required navigation of bureaucracy and complex systems.
  7. Forced interactions with a variety of systems that are Out To Get You.
  8. Navigating an increasingly hostile and anti-inductive information environment.
  9. The replacement of goods that were previously socially provided, but which now must be purchased, which adds to measured GDP but makes life harder.

We can severely cut expenses in various ways, but no, contra Matthew Yglesias, you cannot simply buy the 1960s basket of goods or services or experiences if you want to live most places in the United States. Nor if you pulled this off would you enjoy the social dynamics required to support such a lifestyle. You’d get CPS called on you, be looked down upon, no one would help watch your kids or want to be your friends or invite you to anything.

Whose Line Is It Anyway?

You don’t get to dismiss complaints until those complaints are stated correctly.

A rule for game designers is that:

  1. When a player tells you something is wrong, they’re right. Believe them.
  2. When a player tells you what exactly is wrong and how to fix it? Ignore them.
  3. Still register that as ‘something is wrong here.’ Fix it.

People are very good at noticing when things suck. Not as good at figuring out why.

As in, I actually disagree with this, as a principle:

Matthew Yglesias: Some excellent charts and info here, but I think the impulse to sanewash and “clean up” false claims is kind of misguided.

If we want to address people’s concerns, they need to state the concerns accurately.

No. If you want to address people’s concerns rather than win an argument, then it is you who must identify and state their concerns accurately.

Not them. You. It’s up to you to figure out what the actual problems are.

Their job is to alert you that there is an issue, and to give you as much info as they can.

If this involves them making false claims along the way, that is good data. Notice that. Point that out. Do not use it as a reason to dismiss the underlying complaint that ‘things suck.’ There’s something that sucks. Figure it out.

What you definitely do not want to do is accept the false dystopian premise that America, the richest large country in human history, has historically poor material conditions.

Brad: A lot of folks seem think they are going to bring radicalized young people back into the fold by falsely conceding that material conditions in the most advanced, prosperous country in the history of the world are so bad that it’s actually reasonable to become a nihilistic radical.

Liberalism doesn’t work if you make expedient concessions to abject delusions.

Timothy Lee: Yeah, I think it feels like an easy concession to tell young people “ok I admit your generation has been dealt a bad hand but…” But when everyone does this it creates a consensus that today’s young people are facing uniquely bad material conditions, which they aren’t.

Thus In This House We Believe The Following

  1. We live in an age of wonders that in many central ways is vastly superior.
  2. I strongly prefer here to elsewhere and the present to the past.
  3. It is still very possible to make ends meet financially in America.
  4. Real median wages have risen.

However, due to rising expectations and rising requirements:

  1. The cost of the de facto required basket of goods and services has risen even more.
  2. Survival requires jumping through costly hoops not in the statistics.
  3. We lack key social supports and affordances we used to have.
  4. You cannot simply ‘buy the older basket of goods and services.’
  5. Staying afloat, ‘making your life work,’ has for a while been getting harder.
  6. This is all highly conflated with ‘when things were better’ more generally.

All of that is before consideration of AI, which this post mostly excludes.

Real De Facto Required Expenses Are Rising Higher Than Inflation

When people say the data are lying to you, or the data is wrong, they’re almost always wrong. Jeremy here responds to one such attempt from the previous go around. The data are what they are.

Yet the voters are not wrong. The practical ‘cost of living’ has gone up.

Voters realize this. They hate it. Inflation is now ~2.5%, but the annual rise in the cost of the basket of goods and services we insist you purchase or provide is higher. The new basket being superior in some ways is nice but mostly irrelevant.

Here’s a stark statement of much of this in its purest form, on the housing front.

Aella: being poorer is harder now than it used to be because lower standards of living are illegal. Want a tiny house? illegal. want to share a bathroom with a stranger? illegal. The floor has risen and beneath it is a pit.

Julian Gough: Yes. There used to be a full spectrum of options between living under a bridge and living in a nice flat or house. (I once lived in a converted meat storage room over a butcher’s shop, and briefly, and admittedly unofficially, in a coal cellar with a 5ft ceiling, and no electricity. I was fine, and life was interesting.)

Now there’s a hard cutoff, with no options in that zone between (free) under-a-bridge and (expensive) nice flat, where most artists and poor people used to live. So where can we now live?

Great Expectations

The two Revolutions combine to make young people think success is out of reach.

Millennials, in terms of many forms of material wealth and physical living standards, have much higher standards than previous generations, and also are forced to purchase more ‘valuable’ baskets of goods.

This leads them to forget that young people have always been poor on shoestring budgets. The young never had it easy in terms of money. Past youth was even poorer, but were allowed (legally and socially) to economize far more.

Today’s youth have more income and are accumulating more wealth, and mostly matching past homeownership rates, despite higher expenses especially for housing, and new problems around atomization and social media.

But that is paper wealth. It excludes the wealth of having families and children.

Expectations are out of control.

Jason C: Might be an expectations problem vs an actual income one.

$587k is nuts. Claude suggests $150k-$250k depending on location, which seems reasonable as a combined household income for full-on life ‘success,’ and points out that trajectory is a factor as well.

John Ganz: By making comparisons constant, the internet has created a condition of universal poverty. When even the richest man in the world is not satisfied and acts like a beggar for social recognition, why should anybody be?

When the debate involves people near or above the median, the boomers have a point. If you make ~$100k/year and aren’t in a high cost of living area (e.g. NYC, SF), you are successful, doing relatively well, and will be able to raise a family on that single income while living in many ways far better than it was possible to live 50 years ago.

Certainly $587k is an absurdity. The combination of Rising Expectations and the perception of Rising Requirements has left an entire generation defining ‘success’ as something almost no one achieves, while also treating ‘success’ as something one needs in order to start a family. No wonder young people think they can’t get ahead, including many who are actually ahead.

That’s in addition to the question of what constitutes a ‘good job.’ Most historical jobs, by today’s standards of lived experience, sucked a lot.

There’s also this: People reliably think they are poorer, in relative terms, than they are, partly due to visibility asymmetry and potentially geographic clustering, and due to the fatness of the right tail having an oversize impact.

These perceptions have real consequences. Major life milestones like marriage and children get postponed, often indefinitely. Young people, especially young men, increasingly feel compelled to find some other way to strike it rich, contributing to the rise of gambling, day trading, crypto and more. This is one of the two sides of the phenomenon Derek Thompson wrote about in the excellent The Monks In The Casino, the other being atomization and loneliness.

We Could Fix It

The good news is that a lot of this is a series of related unforced errors. A sane civilization could easily fix many of them with almost no downsides.

We could choose to, without much downside:

  1. Make housing vastly cheaper especially for those who need less.
  2. Make childcare vastly less necessary and also cheaper, and give children a wide variety of greater experiences for free or on the cheap.
  3. Make healthcare vastly cheaper for those who don’t want to buy an all-access pass.
  4. Make education vastly cheaper and better.
  5. Make energy far more abundant and cheap, which helps a lot of other things.

And so on. Again, this excludes AI considerations.

The bad news is there is no clear path to our civilization choosing to fix these errors, although every marginal move towards the abundance agenda helps.

We could also seek to strengthen our social and familial bonds, build back social capital and reduce atomization, but that’s all much harder. There’s no regulatory fix for that.

Man’s Search For Meaning

Matt Yglesias points out that this goes hand in hand with Americans putting less value on things money can’t buy:

Matt Yglesias: People have started putting less emphasis on non-money sources of value, which I think is naturally going to lead more people to be unhappy with the amount of money they make.

A nice thing about valuing religion, kids, and patriotism is that these are largely non-positional goods that everyone can chase simultaneously without making each other miserable.

This change in values is not good for people’s life experience and happiness. If being happy with your financial success requires you to be earning and spending ahead of others, and it becomes a positional good, then collectively we’re screwed.

And Zac Hill points out the other problems with people’s #SquadGoals.

Zac Hill: The real reason so many people feel despair is MUCH closer to “I think my life will end in meaningless oblivion unless I am on an epic quest, a billionaire, or gigafamous, but this is gauche to admit and so I use proxy variables” than it is to “I can’t live on less than $140,000”

Also: “I, personally, will never marry/fuck an attractive person.”

Shockingly, all of this is mostly about how we create, calibrate, and manage expectations.

There were ways in which I did not ‘feel’ properly successful until I stopped renting and bought an apartment, despite the decision to previously not buy being sensible and having nothing to do with lack of available funds. Until you say ‘this house is mine’ things don’t quite feel solid.

Many view ‘success’ as being married and owning a home, regardless of total wealth.

If those people don’t achieve those goals, they will revolt against the situation.

So this chart seems rather scary:

Vance Crowe: This does not make for a stable society.

That leads to widespread expressions of (highly overstated) hopelessness:

Boring Business: An entire generation under the age of 30 is coming to realization that having a family and home will never be within the grasp of reality for them

Society is not ready for the consequences of this. A generation with no stake in the system would rather watch it burn. All the comments echo the same exact sentiment. If homeownership is not fixed, it is a steady slope to socialism from here.

Another issue is that due to antipoverty programs and subsidies and phase outs, as covered last time, including things not even covered there like college tuition, the true marginal tax rate for families is very high when moving from $30k to up to ~$100k.

How Do You Afford Your Rock And Roll Lifestyle?

Social media and influencing make all of this that much worse. We’re up against severe negativity bias and we’re comparing ourselves to those who are most successful at presenting the illusion of superficial success.

Welcome to the utter screwing that is the accelerated Revolution of Rising Expectations, in addition to the ways in which Zoomers are indeed utterly screwed.

Timothy Lee: The idea that Zoomers are “utterly screwed” in material terms is total nonsense and I wish people would stop repeating it. Housing is a bit more expensive than previous generations. Many other necessities — food, clothing, most manufactured goods are cheaper than ever.

I think the perception that Zoomers are “utterly screwed” is a combination of (1) opinion being shaped by people who live in the places with the most dysfunctional housing markets (2) extreme negativity bias of social media algorithms (3) nobody has much incentive to push back.

Nathan Witkin: I would add:

  1. Widespread sticker shock from post-Covid inflation.
  2. An ever-higher perceived baseline for career success and material comfort, esp. among Zoomers, also largely due to social media.

Timothy Lee: I think this #5 here is an important reason why so many people feel beleaguered. People’s expectations for what “counts” as a middle-class standard of living is a lot higher than in previous generations, and so they feel poor even if they are living similarly.

Beyond social media, I think another factor is that people compare their parents’ standard of living at 55 with their own standard of living at 25 or whatever. Nobody remembers how their parents lived before they were born.

I don’t think the “young people feeling they’re uniquely beleaguered” thing is new either!

That’s two groups of loadbearing mechanisms raised here on top of the general Revolutions of Rising Expectations and Requirements arguments earlier.

  1. Negativity bias alongside Rising Expectations for lifestyle in social media, largely due to it concentrating among expensive cities with dysfunctional housing markets.
  2. Post-Covid inflation, right after a brief period of massive subsidies to purchasing power.

There are also real problems, as I will address later at length, especially on home ownership and raising children. Both are true at once.

Our Price Cheap

Want to raise a family on one median income today? You get what you pay for.

Will Ricciardella: Can a family live on one income today?

Yes, but not today’s lifestyle on yesterday’s budget.

Here’s what it actually looks like:

• 1,000 sq ft home, not 2,500
• One used car
• One family phone — no smartphones for kids
• One TV, no subscriptions
• No microwave, no central A/C
• Home-cooked meals, no dining out
• No childcare, 1 parent stays home
• Public schools only
• Local sports, not travel leagues
• Basic health insurance: pay dental & extras out of pocket
• Simple clothes, thrift store toys
• Rare vacations, little debt

That’s how most families lived for decades and they raised kids, built communities, and made it work.

The issue isn’t that you can’t raise a family on one income.

The issue is that we’ve inflated “middle class” to mean upper middle luxuries: two cars, two iPhones, dining out, Amazon Prime, orthodontics, soccer trips, Disneyland, and a home office with Wi-Fi.

In 1960, one income worked because expectations were lower, families were more self-reliant, and debt wasn’t a lifestyle.

You want one income? You can do it.
But you have to live like the people who actually did it.

Not poorer, just simpler and more deliberate.

The people of the past didn’t have a choice, but you do.

Tumultuous Turkey: Try getting a job without a cell phone. You can’t.
Try finding a 1000 sq ft home. You can’t.
Try getting a house phone without Internet and cable included. you can’t.
Avg cost of a used car is 25k in 2024. Try no car.
We are not the problem. The tax & gov is the problem.

Analytic Valley Girl Chris: This advice would be less fucking retarded if you didn’t put a fucking microwave in the same cost bracket as a fucking air conditioner

Is there a lot of slack in the typical household budget if you are willing to sacrifice?

Yes. You can buy things like cars that cost less than the average. There are limits.

It is always interesting to see what such lists want to sacrifice. A lot of the items above are remarkably tiny savings in exchange for big hits to lifestyle. In others, they do the opposite. People see richer folks talking to them like this, and it rightfully pisses them off.

  1. No microwave? To save fifty bucks once and make cooking harder? What?
  2. No A/C is in many places in America actively dangerous.
  3. One family phone is completely impossible in 2025. People assume you have a phone. That doesn’t mean you need two iPhones or a premium plan, old phones are cheap and work fine and there are relatively cheap data plans out there, US Mobile is $36/mo total.
  4. One car may or may not be possible depending on where you live. Are you going to fully strand the other person all day?
  5. You can want 1,000 square feet but that means an apartment, many areas don’t even offer this in any configuration that plausibly works.

You can see the impact of the Revolutions in the replies, only some of which is about the smaller crazy asks. No, you can’t really do this. The world won’t allow it and to the extent it does it will treat you horribly and your kids will not accept it.

Another example of the gaffe of saying what you actually think about what to cut, as he complains about kids being ‘entitled to 37 pencils’:

The Bulwark: Trump at his speech on the economy: “You can give up certain products. You can give up pencils…They only need one or two. They don’t need that many…You don’t need 37 dolls for your daughter. Two or three is nice, but you don’t need 37 dolls.”

The thing about pencils is as you use them they disappear. You need another pencil. There are many places in education we can likely cut, and no you do not ‘need 37 dolls’ and we used to have far fewer toys and that was fine, but pencils?

It Takes Two (1)

Thus, people increasingly believe they need two incomes to support a family.

They’re noticing something sucks. Assume they’re right. Figure out what it is.

Matthew Yglesias: The claim that the *absolute affordability* of being a married, one-earner family with kids has fallen would — if it were true — have straightforward win-win policy remedies like “higher wages and incomes.”

But it’s not true.

When you reformulate to a more accurate claim what you end up with is the observation that it is is hard for one person to earn as much income as two people and that the wedge has grown as women’s earning power has increased.

This is very true but what’s the fix?

One that would “work” would be to push women generally out of opportunities for careers and white collar work — something more conservatives are tip-toeing around but don’t quite want to say.

[Links to: Women’s professional rise is good, actually.]

A change can be good. That doesn’t get you out of dealing with the consequences.

In this case, the consequences are that the second income gets factored into the Revolutions of Rising Expectations and Requirements.

Absolute affordability of being a one-earner family with kids has fallen, because again:

  1. You have more ‘real income.’
  2. You are legally required to purchase more and higher quality goods and services, due to the Revolution of Rising Requirements, especially child supervision.
  3. You are also under large social and internal pressures to purchase more and higher quality goods, due to the Revolution of Rising Expectations.
  4. That’s nice for you, if you can afford the goods and services.
  5. That’s still going to cost you, and you can’t pretend otherwise.
  6. You think you can opt out of that? Nah, not really bro, not easily.

First, some brief questions worth asking in advance:

  1. Can you actually execute on the one income plan?
  2. If not, what are you going to do about it?

Zac Hill: [That two incomes buy more than one] is the rub of this whole discourse. Wages being much higher means the cost of a person not working is also much higher. But is that a problem in need of a solution? If so, what is the solution, and why is “accept a much lower income” not also an acceptable solution?

It Takes Two (2)

Even if you could somehow execute on the above plan to survive on one income by having life suck in various ways, that plan also takes two.

Not two incomes. Two parents.

Hey baby, want to live on one income, Will Ricciardella style? Hey, come back here.

Telling young men in particular ‘you can do it on one income’ via this kind of approach is a joke, because try telling the woman you want to marry that you want to live in the style Will Ricciardella describes above. See if she says yes.

If So, Then What Are You Going To Do About It, Punk?

The question ‘so what are you going to do about it?’ is still a very good one.

What do you do if families have the option of two incomes, and we set Expectations and Requirements based on two incomes, and you want to get by with only one? Adjusting how you spend money, and using the other parent’s time to save some money, will only go so far.

If you want one income households and stay at home parents to be viable here, I would say four things are required, in some combination. You don’t need all four, but you definitely need #1, and then some additional help.

  1. You can deal with the Requirements. Let people purchase much less health care, child care and housing. Give people a huge amount of Slack, such that they can survive on one income despite the ability to earn two, and also pay for kids.
  2. You can deal with the Expectations. Raise the status and social acceptability of living cheap and making sacrifices.
  3. You can lessen the marginal returns to a second income, by increasing effective marginal tax rates. And That’s Terrible, don’t do this, but do note it would work.
  4. You can improve the economics of having children more generally. Children are an expensive public good. We can and should use the tax code to shift the burden.

I usually discuss these issues and questions, especially around #4, in terms of declining fertility. It is the same problem. If people don’t feel able to have children in a way they find acceptable, then they will choose not to have children.

On the marginal tax rates, consider these graphs.

Image

That’s all obviously terrible policy, but it also means that you can obviously support a family on one $30k income if you could have done it on two $30k incomes, since your net benefits take home pay is not substantially lower after child care.

Alternatively or additionally, from a policy perspective, you can accept that you’re looking at two income households, and plan the world around making that work.

The big problem with a two income household is child supervision.

  1. The increased child supervision requirements, as in things like if anyone spots a 12 year old not in a parent’s line of sight they think about calling the cops, are insanely expensive in every sense. This is the biggest pain point.
  2. The second biggest pain point is direct costs for daycare, which we could make substantially cheaper if we wanted to, and we could also subsidize it.
  3. As Matthew Yglesias points out, our school system and its endless days off implicitly assumes the mother can watch the children, while we also forbid letting children spend those days on their own, often even indoors. The obvious solution is to not give younger kids days off that aren’t also national holidays, or to offer free other childcare on those days, where ‘older’ is defined as ‘can leave them on their own for the day and no one tries to call CPS.’

Ideally you do all of that anyway, it’s badly needed, and you open up both choices.

Now, back to the question of what is going on.

The Revolution of Rising Expectations Redux

What should we make here of the fact that spending on food, clothing and housing (aka ‘necessities’) has collectively declined as a percentage of income, and also the food is way better and the houses are bigger?

The definition of ‘necessity’ is not a constant, as the linked post admits. The ‘necessities’ that have gotten cheaper are the ‘necessities of the past.’ If things like education and health care and cell phones are de facto mandatory, and you have to buy them, then they are now necessities, even if in 1901 the services in question flat out didn’t exist.

That’s not to downplay how much the past sucked. It sucked a lot. Go see Hamnet or Train Dreams or The Housemaid.

But there are other ways it didn’t suck. In large part that was because you were allowed to suck without the rug being pulled out from under you for the crime of not having a rug, and also you didn’t have to compare to all the families with fancy rugs.

Life is vastly better. Life also really sucks compared to Rising Expectations.

Setting aside AI, what do we do about it?

  1. It’s tough to lower the Rising Expectations. We should still do what we can here, primarily via cultural efforts, in the places we can do that.
  2. Rising Requirements are often unforced errors. We Can Fix It. We should attack. If we legalized housing, and legalized passing up Hansonian medicine, and got to a reasonable place on required child supervision, that would do it.
  3. Pay Parents Money. Children are a public good, and we are putting so much of the cost burden directly on the parents. People feel unable to raise families, and don’t have children they want to have. We should do more transfers from the childless to those with children, and less of other types of transfers. Also eliminate all forms of the marriage penalty. Consider explicit subsidies for one income married families with kids under some threshold age. As in, acknowledge that stay at home parent is a job, and pay them for it.
  4. Provide more public goods for families. Remarkably small things can matter a lot.
  5. Reforming our system of transfers and benefits and taxes to eliminate the Poverty Trap, such that no one ever faces oppressive marginal tax rates or incentives to not work, and we stop forcing poor families to jump through so many hoops.
  6. All other ways of improving things also improve this. Give people better opportunities, better jobs, better life experiences, better anything, and especially better hope for the future in any and all ways.
]]>
https://thezvi.wordpress.com/2025/12/22/the-revolution-of-rising-expectations/feed/ 1 24972 thezvi Image
When Were Things The Best? https://thezvi.wordpress.com/2025/12/19/when-were-things-the-best/ https://thezvi.wordpress.com/2025/12/19/when-were-things-the-best/#comments Fri, 19 Dec 2025 17:57:51 +0000 https://thezvi.wordpress.com/?p=24968 Continue reading ]]> People remember their childhood world too fondly.

You adapt to it. You forget the parts that sucked, many of which sucked rather really badly. It resonates with you and sticks with you. You think it was better.

This is famously true for music, but also in general, including places it makes no sense like ‘most reliable news reporting.’

Matthew Yglesias: Regardless of how old they are, people tend to think that things were better when they were young.

As a result, you’d expect more negativity as the median age goes up and up.

Very obviously these views are not objective.

As a fun and also useful exercise, as part of the affordability sequence, now that we’ve looked at claims of modern impoverishment and asked when things were cheaper, it’s time to ask ourselves: When were various things really at their best?

In some aspects, yes, the past was better, and those aspects are an important part of the picture. But in many others today is the day and people are wrong about this.

I’ll start with the things on the above graph, in order, include some claims from another source, and also include a few important other considerations that help set up the main thesis of the sequence.

The Most Close-Knit Communities

Far in the past. You wouldn’t like how they accomplished it, but they accomplished it.

The top candidates for specific such communities are either:

  1. Hunter-gatherer bands.
  2. Isolated low-tech villages that all share an intense mandatory religion.
  3. Religious minority ethnic enclave communities under severe external threat.

You’re not going to match that without making intensive other sacrifices. Nor should you want to. Those communities were too close-knit for our taste.

In terms of on average most close knit communities in America, it’s probably right after we closed the frontier, so around 1900?

Close-knit communities, on a lesser level that is now rare, are valuable and important, but require large continuous investments and opportunity costs. You have to frequently choose engagement with a contained group over alternatives, including when those alternatives are otherwise far superior. You also, to do this today, have to engineer conditions to make the community possible, because you’re not going to be able to form one with whoever happens to live in your neighborhood.

Intentional communities are underrated, as is simply coordinating to live near your friends. I highly recommend such things, but coordination is hard, and they are going to remain rare.

The Most Moral Society

I’m torn between today and about 2012.

There are some virtues and morals that are valuable and have been largely lost. Those who remember the past fondly focus on those aspects.

One could cite, depending on your comparison point, some combination of loyalty to both individuals, groups and institutions, honor and personal codes, hospitality, respect for laws and social norms, social trust, humility, some forms of mercy and forgiveness, stoicism, courage, respect for the sacred and adherence to duty and one’s commitments, especially the commitment to one’s family, having better and higher epistemic and discourse norms, plus religiosity.

There’s varying degrees of truth in those.

But they pale in comparison to the ways that things used to be terrible. People used to have highly exclusionary circles of concern. By the standards of today, until very recently and even under relatively good conditions, approximately everyone was horribly violent and tolerant of violence and bullying of all kinds, cruel to animals, tolerant of all manner of harassment, rape and violations of consent, cruel, intolerant, religiously intolerant often to the point of murder, drunk out of their minds, discriminatory, racist, sexist, homophobic, transphobic, neglectful, unsafe, physically and emotionally abusive to children including outright torture and frequent sexual abuse, and distrustful and dishonest dealing with strangers or in commerce.

It should be very clear which list wins.

This holds up to the introduction of social media, at which point some moral dynamics got out of control in various ways, on various sides of various questions, and many aspects went downhill. There were ways in which things got absolutely nuts. I’m not sure if we’ve recovered enough to have fully turned that around.

The Least Political Division

Within recent memory I’m going to say 1992-1996, which is the trap of putting it right in my teenage years. But I’m right. This period had extraordinarily low political division and partisanship.

On a longer time frame, the correct answer is the Era of Good Feelings, 1815-1825.

The mistake people make is to think that today’s high level of political division is some outlier in American history. It isn’t.

The Happiest Families

Good question. The survey data says 1957.

I also don’t strongly believe it is wrong, but I don’t trust survey data to give the right answer on this, for multiple reasons.

Certainly a lot more families used to be intact. That does not mean they were happy by our modern understanding of happy. The world of the 1950s was quite stifling. A lot of the way families stayed intact was people pretended everything was fine, including many things we now consider very not fine.

People benefited (in happiness terms) from many forms of lower expectations. That doesn’t mean that if you duplicated their life experiences, your family would be happy.

Fertility rates, having the most children, was during the Baby Boom, if we exclude the bad old times when children often failed to survive.

Marriage rates used to be near-universal, whether or not you think that was best.

The Most Reliable News Reporting

Believe it or not, today. Yikes. We don’t believe it because of the Revolution of Rising Expectations. We now have standards for the press that the press has never met.

People used to trust the media more. Now we trust it a lot less. While there are downsides to this lack of trust, especially when people turn to even less worthy alternatives, that loss of trust is centrally good. The media was never worthy of trust.

There’s great fondness for the Walter Cronkite era, where supposedly we had high authority news sources worthy of our high trust. The thing is, that past trust was also misplaced, and indeed was even more misplaced.

There was little holding the press to account. They had their own agendas and biases, even if it was often ‘the good of the nation’ or ‘the good of the people,’ and they massively misunderstood things and often got things wrong. Reporters talking on the level of saying ‘wet ground causes rain’ is not a new phenomenon. When they did make mistakes or slant their coverage, there was no way to correct them back then.

Whereas now, with social media, we can and do keep the media on its toes.

If your goal is to figure out what is going on and you’re willing to put in the work, today you have the tools to do that, and in the past you basically didn’t, not in any reasonable amount of time.

The fact that other people do that, and hold them to account, makes the press hold itself to higher standards.

The Best Music

There are several forms of ‘the best music.’ It’s kind of today, kind of the 60s-80s.

If you are listening to music on your own, it is at its best today, by far. The entire back catalogue of the world is available at your fingertips, with notably rare exceptions, for a small monthly fee, on demand and fully customizable. If you are an audiophile and want super high quality, you can do that too. There’s no need to spend all that time seeking tings out.

If you want to create new music, on your own or with AI? Again, it’s there for you.

In terms of the creation of new music weighted by how much people listen, or in terms of the quality of the most popular music, I’d say probably the 1980s? A strong case can be made for the 60s or 70s too, my guess is that a bunch of that is nostalgia and too highly valuing innovation, but I can see it. What I can’t see is a case for the 1990s or 2000s, or especially 2010s or 2020s.

This could be old man syndrome talking, and it could be benefits of a lot of selection, but when I sample recent popular music it mostly (with exceptions!) seems highly non-innovative and also not very good. It’s plausible that with sufficiently good search and willingness to take highly deep cuts that today is indeed the best time for new music, but I don’t know how to do that search.

In terms of live music experiences, especially for those with limited budgets, my guess is this was closer to 1971, as so much great stuff was in hindsight so amazingly accessible.

The other case for music being better before is that music was better when it was worse. As in, you had to search for it, select it, pay for it, you had to listen to full albums and listen to them many times, so it meant more, that today’s freedom brings bad habits. I see the argument, but no, and you can totally set rules for yourself if that is what you want. I often have for brief periods, to shake things up.

The Best Radio

My wild guess for traditional radio is the 1970s? There was enough high quality music, you had the spirit of radio, and video hadn’t killed the radio star.

You could make an argument for the 1930s-40s, right before television displaced it as the main medium. Certainly radio back then was more important and central.

The real answer is today. We have the best radio today.

We simply don’t call it radio.

Instead, we mostly call it podcasts and music streaming.

If you want pseudorandom music, Pandora and other similar services, or Spotify-style playlists, are together vastly better than traditional radio.

If you want any form of talk radio, or news radio, or other word-based radio programs that doesn’t depend on being broadcast live, podcasts rule. The quality and quantity and variety on offer are insane and you can move around on demand.

Also, remember reception problems? Not anymore.

The Best Fashion

Long before any of us were born, or today, depending on whether you mean ‘most awesome’ or ‘would choose to wear.’

Today’s fashion is not only cheaper, it is easier and more comfortable. In exchange, no, it does not look as cool.

The Best Economy

As the question is intended, 2019. Then Covid happened. We still haven’t fully recovered from that.

There were periods with more economic growth or that had better employment conditions. You could point to 1947-1973 riding the postwar wave, or the late 1990s before the dot com bubble burst.

I still say 2019, because levels of wealth and real wages also matter.

The Best Movies

In general I choose today. Average quality is way up and has been going up steadily except for a blip when we got way too many superhero movies crowding things out, but we’ve recovered from that.

The counterargument I respect is that the last few years have had no top tier all-time greats, and perhaps this is not an accident. We’ve forced movies to do so many other things well that there’s less room for full creativity and greatness to shine through? Perhaps this is true, and this system gets us fewer true top movies. But also that’s a Poisson distribution, you need to get lucky, and the effective sample size is small.

If I have to pick a particular year I’d go with 1999.

The traditional answer is the 1970s, but this is stupid and disregards the Revolution of Rising Expectations. Movies then were given tons of slack in essentially every direction. Were there some great picks? No doubt, although many of what we think of as all-time greats are remarkably slow to the point where if they weren’t all time greats they’d almost not be watchable. In general, if you think things were better back then, you’re grading back then on a curve, you have an extreme tolerance for not much happening, and also you’re prioritizing some sort of abstract Quality metric over what is actually entertaining.

The Best Television

Today. Stop lying to yourself.

The experience of television used to be terrible, and the shows used to be terrible. So many things very much do not hold up today even if you cut them quite a lot of slack. Old sitcoms are sleep inducing. Old dramas were basic and had little continuity. Acting tended to be quite poor. They don’t look good, either.

The interface for watching was atrocious. You would watch absurd amounts of advertisements. You would plan your day around when things were there, or you’d watch ‘whatever was on TV.’ If you missed episodes they would be gone. DVRs were a godsend despite requiring absurd levels of effort to manage optimally, and still giving up a ton of value.

The interface now is most of everything ever made at your fingertips.

The alternative argument to today being best is that many say that in terms of new shows the prestige TV era of the 2000s-2010s was the golden age, and the new streaming era can’t measure up, especially due to fractured experiences.

I agree that the shared national experiences were cool and we used to have more of them and they were bigger. We still get them, most recently for Severance and perhaps The White Lotus and Plurebis, which isn’t the same, but there really are still a ton of very high quality shows out there. Average quality is way up. Top talent going on television shows is way up, they still let top creators do their thing, and there are shows with top-tier people I haven’t even looked at, that never used to happen.

Best Sporting Events

Today. Stop lying to yourself.

Average quality of athletic performance is way, way up. Modern players do things you wouldn’t believe. Game design has in many ways improved as well, as has the quality of strategic decision making.

Season design is way better. We get more and better playoffs, which can go too far but typically keeps far more games more relevant and exciting and high stakes. College football is insanely better for this over the last few years, I doubted and I was wrong. Baseball purists can complain but so few games used to mean anything. And so on.

Unless people are going to be blowing up your phone, you can start an event modestly late and skip all the ads and even dead time. You can watch sports on your schedule, not someone else’s. If you must be live, you can now get coverage in lots of alternative ways, and also get access to social media conversations in real time, various website information services and so on.

If you’re going to the stadium, the modern experience is an upgrade. It is down to a science. All seats are good seats and the food is usually excellent.

There are three downside cases.

  1. We used to all watch the same sporting events live and together more often. That was cool, but you can still find plenty of people online doing this anyway.
  2. In some cases correct strategic play has made things less fun. Too many NBA three pointers are a problem, as is figuring out that MLB starters should be taken out rather early, or analytics simply homogenized play. The rules have been too slow to adjust. It’s a problem, but on net I think a minor one. It’s good to see games played well.
  3. Free agency has made teams retain less identity, and made it harder to root for the same players over a longer period. This one hurts and I’d love to go back, even though there are good reasons why we can’t.

Mostly I think it’s nostalgia. Modern sports are awesome.

The Best Cuisine

Today, and it’s really, really not close. If you don’t agree, you do not remember. So much of what people ate in the 20th century was barely even food by today’s standards, both in terms of tasting good and its nutritional content.

Food has gotten The Upgrade.

Average quality is way, way up. Diversity is way up, authentic or even non-authentic ethnic cuisines mostly used to be quite rare. Delivery used to be pizza and Chinese. Quality and diversity of available ingredients is way up. You can get it all on a smaller percentage of typical incomes, whether at home or from restaurants, and so many more of us get to use those restaurants more often.

A lot of this is driven by having access to online information and reviews, which allows quality to win out in a way it didn’t before, but even before that we were seeing rapid upgrades across the board.

Bonus: The Best Job Security

Some time around 1965, probably? We had a pattern of something approaching lifetime employment where it was easy to keep one’s job for a long period, and count on this. The chance of staying in a job for 10+ or 20+ years has declined a lot. That makes people feel a lot more secure, and matters a lot.

That doesn’t mean you actually want the same job for 20+ years. There are some jobs where you totally do want that, but a lot of the jobs people used to keep for that long are jobs we wouldn’t want. Despite people’s impressions, the increased job changes have mostly not come from people being fired.

The Best Everything

We don’t have the best everything. There are exceptions.

Most centrally, we don’t have the best intact families or close-knit communities, or the best dating ecosystem or best child freedoms. Those are huge deals.

But there are so many other places in which people are simply wrong.

As in:

Matt Walsh (being wrong, lol at ‘empirical,’ 3M views): It’s an empirical fact that basically everything in our day to day lives has gotten worse over the years. The quality of everything — food, clothing, entertainment, air travel, roads, traffic, infrastructure, housing, etc — has declined in observable ways. Even newer inventions — search engines, social media, smart phones — have gone down hill drastically.

This isn’t just a random “old man yells at clouds” complaint. It’s true. It’s happening. The decline can be measured. Everyone sees it. Everyone feels it. Meanwhile political pundits and podcast hosts (speaking of things that are getting worse) focus on anything and everything except these practical real-life problems that actually affect our quality of life.

The Honest Broker: There is an entire movement focused on trying to convince people that everything used to be better and everything is also getting worse and worse

That creates a market for reality-based correctives like the excellent thread below by @ben_golub [on air travel.]

Matthew Yglesias: I think everyone should take seriously:

  1. Content distribution channels have become more competitive and efficient
  2. Negative content tends to perform better
  3. Marinating all day in negativity-inflected content is cooking people’s brains

My quick investigation confirmed that American roads, traffic and that style of infrastructure did peak in the mid-to-late 20th century. We have not been doing a good job maintaining that.

On food, entertainment, clothing and housing he is simply wrong (have you heard of this new thing called ‘luxury’ apartments, or checked average sizes or amenities?), and to even make some of these claims requires both claiming ‘this is cheaper but it’s worse’ and ‘this is worse because it used to be cheaper’ in various places.

bumbadum: People are chimping out at Matt over this but nobody has been able to name one thing that has significantly grown in quality in the past 10-20 years.

Every commodity, even as they have become cheaper and more accessible has decreased in quality.

I am begging somebody to name 1 thing that is all around a better product than its counterpart from the 90s

Megan McArdle: Tomatoes, raspberries, automobiles, televisions, cancer drugs, women’s shoes, insulin monitoring, home security monitoring, clothing for tall women (which functionally didn’t exist until about 2008), telephone service (remember when you had to PAY EXTRA to call another area code?), travel (remember MAPS?), remote work, home video … sorry, ran out of characters before I ran out of hedonic improvements.

Thus:

The Best Information Sources, Electronics, Medical Care, Dental Care, Medical (and Non-Medical) Drugs, Medical Devices, Home Security Systems, Telephone Services and Mobile Phones, Communication, and Delivery Services of All Kinds

Today. No explanation required on these.

Don’t knock the vast improvements in computers and televisions.

Saying the quality of phones has gone down, as Matt Walsh does, is absurdity.

That does still leave a few other examples he raised.

The Best Air Travel

Today, or at least 2024 if you think Trump messed some things up.

I say this as someone who used to fly on about half of weekends, for several years.

Air travel has decreased in price, the most important factor, and safety improved. Experiential quality of the flight itself declined a bit, but has risen again as airport offerings improved and getting through security and customs went back from a nightmare to trivial. Net time spent, given less uncertainty, has gone down.

If you are willing to pay the old premium prices, you can buy first class tickets, and get an as good or better experience as the old tickets.

The Best Cars

Today. We wax nostalgic about old cars. They looked cool. They also were cool.

They were also less powerful, more dangerous, much less fuel efficient, much less reliable, with far fewer features and of course absolutely no smart features. That’s even without considering that we’re starting to get self-driving cars.

The Best Roads, Traffic and Infrastructure

This is one area where my preliminary research did back Walsh up. America has done a poor job of maintaining its roads and managing its traffic, and has not ‘paid the upkeep’ on many aspects what was previously a world-class infrastructure. These things seem to have peaked in the late 20th century.

I agree that this is a rather bad sign, and we should both fix and build the roads and also fix the things that are causing us not to fix and build the roads.

As a result of not keeping up with demand for roads or demand for housing in the right areas, average commute times for those going into the office have been increasing, but post-Covid we have ~29% of working days happening from home, which overwhelms all other factors combined in terms of hours on the road.

I do expect traffic to improve due to self-driving cars, but that will take a while.

The Best Transportation

Today, or at least the mobile phone and rideshare era. You used to have to call for or hail a taxi. Now in most areas you open your phone and a car appears. In some places it can be a Waymo, which is now doubling yearly. The ability to summon a taxi matters so much more than everything else, and as noted above air travel is improved.

This is way more important than net modest issues with roads and traffic.

Trains have not improved but they are not importantly worse.

It’s Getting Better All The Time

Not everything is getting better all the time. Important things are getting worse.

We still need to remember and count our blessings, and not make up stories about how various things are getting worse, when those things are actually getting better.

To sum up, and to add some additional key factors, the following things did indeed peak in the past and quality is getting worse as more than a temporary blip:

  1. Political division.
  2. Average quality of new music, weighted by what people listen to.
  3. Live music and live radio experiences, and other collective national experiences.
  4. Fashion, in terms of awesomeness.
  5. Roads, traffic and general infrastructure.
  6. Some secondary but important moral values.
  7. Dating experiences, ability to avoid going on apps.
  8. Job security, ability to stay in one job for decades if desired.
  9. Marriage rates and intact families, including some definitions of ‘happy’ families.
  10. Fertility rates and felt ability to have and support children as desired.
  11. Childhood freedoms and physical experiences.
  12. Hope for the future, which is centrally motivating this whole series of posts.

The second half of that list is freaking depressing. Yikes. Something’s very wrong.

But what’s wrong isn’t the quality of goods, or many of the things people wax nostalgic about. The first half of this list cannot explain the second half.

Compare that first half to the ways in which quality is up, and in many of these cases things are 10 times better, or 100 times better, or barely used to even exist:

  1. Morality overall, in many rather huge ways.
  2. Access to information, including the news.
  3. Logistics and delivery. Ease of getting the things you want.
  4. Communication. Telephones including mobile phones.
  5. Music as consumed at home via deliberate choice.
  6. Audio experiences. Music streams and playlists. Talk.
  7. Electronics, including computers, televisions, medical devices, security systems.
  8. Television, both new content and old content, and modes of access.
  9. Movies, both new content and old content, and modes of access.
  10. Fashion in terms of comfort, cost and upkeep.
  11. Sports.
  12. Cuisine. Food of all kinds, at home and at restaurants.
  13. Air travel.
  14. Taxis.
  15. Cars.
  16. Medical care, dental care and medical (and nonmedical) drugs.

That only emphasizes the bottom of the first list. Something’s very wrong.

We Should Be Doing Far Better On All This

Once again, us doing well does not mean we shouldn’t be doing better.

We see forms of the same trends.

  1. Many things are getting better, but often not as much better as they could be.
  2. Other things are getting worse, both in ways inevitable and avoidable.
  3. This identifies important problems, but the changes in quantity and quality of goods and services do not explain people’s unhappiness, or why many of the most important things are getting worse. More is happening.

Some of the things getting worse reflect changes in technological equilibria or the running out of low-hanging fruit, in ways that are tricky to fix. Many of those are superficial, although a few of them aren’t. But these don’t add up to the big issues.

More is happening.

That more is what I will, in the next post, be calling The Revolution of Rising Expectations, and the Revolution of Rising Requirements.

 

 

 

 

 

]]>
https://thezvi.wordpress.com/2025/12/19/when-were-things-the-best/feed/ 4 24968 thezvi
AI #147: Flash Forward https://thezvi.wordpress.com/2025/12/18/ai-147-flash-forward/ https://thezvi.wordpress.com/2025/12/18/ai-147-flash-forward/#respond Thu, 18 Dec 2025 16:49:25 +0000 https://thezvi.wordpress.com/?p=24962 Continue reading ]]> This week I covered GPT 5.2, which I concluded is a frontier model only for the frontier.

OpenAI also gave us Image 1.5 and a new image generation mode inside ChatGPT. Image 1.5 looks comparable to Nana Banana Pro, it’s hard to know which is better. They also inked a deal for Disney’s characters, then sued Google for copyright infringement on the basis of Google doing all the copyright infringement.

As a probable coda to the year’s model releases we also got Gemini 3 Flash, which I cover in this post. It is a good model given its speed and price, and likely has a niche. It captures the bulk of Gemini 3 Pro’s intelligence quickly, at a low price.

The Trump Administration issued a modestly softened version Executive Order on AI, attempting to impose as much of a moratorium banning state AI laws as they can. We may see them in court, on various fronts, or it may amount to little. Their offer, in terms of a ‘federal framework,’ continues to be nothing. a16z issued their ‘federal framework’ proposal, which is also nothing, except also that you should pay them.

In non-AI content, I’m in the middle of my Affordability sequence. I started with The $140,000 Question, then The $140,000 Question: Cost Changes Over Time. Next up is a fun one about quality over time, then hopefully we’re ready for the central thesis.

Table of Contents

  1. Language Models Offer Mundane Utility. Give it to me straight, Claude.
  2. Language Models Don’t Offer Mundane Utility. If you ask an AI ethicist.
  3. Huh, Upgrades. Claude Code features, Google things, ChatGPT branching.
  4. On Your Marks. FrontierScience as a new benchmark, GPT-5.2 leads.
  5. Choose Your Fighter. The less bold of Dean Ball’s endorsements of Opus 4.5.
  6. Get My Agent On The Line. LLM game theory plays differently.
  7. Deepfaketown and Botpocalypse Soon. The misinformation balance of power.
  8. Fun With Media Generation. Image 1.5 challenges Nana Banana Pro.
  9. Copyright Confrontation. Disney inks a deal with OpenAI and sues Google.
  10. Overcoming Bias. Algorithms, like life, are not fair. Is trying a category error?
  11. Unprompted Attention. Objection, user is leading the witness.
  12. They Took Our Jobs. CEOs universally see AI as transformative.
  13. Feeling the AGI Take Our Jobs. Is Claude Opus 4.5 AGI? Dean Ball says yes.
  14. The Art of the Jailbreak. OpenAI makes jailbreaks against its terms of service.
  15. Get Involved. Lightcone Infrastructure starts its annual fundraiser, and more.
  16. Introducing. Gemini Deep Research Agents for Developers, Nvidia Nemotron 3.
  17. Gemini Flash 3. It’s a very strong model given its speed and price.
  18. In Other AI News. OpenAI to prioritize enterprise AI and also enable adult mode.
  19. Going Too Meta. Meta’s AI superstars think they’re better than sell ads. Are they?
  20. Show Me the Money. OpenAI in talks to raise $10 billion from Amazon.
  21. Bubble, Bubble, Toil and Trouble. You call this a bubble? Amateurs.
  22. Quiet Speculations. A lot of what was predicted for 2025 did actually happen.
  23. Timelines. Shane Legg still has median timeline for AGI of 2028.
  24. The Quest for Sane Regulations. Bernie Sanders wants to stop data centers.
  25. My Offer Is Nothing. Trump Administration issues an AI executive order.
  26. My Offer Is Nothing, Except Also Pay Me. a16z tries to dress up offering nothing.
  27. Chip City. Nvidia implements chip location verification.
  28. The Week in Audio. Alex Bores on Odd Lots, Schulman, Shor, Legg, Alex Jones.
  29. Rhetorical Lack Of Innovation. Noah Smith dives into the 101 questions.
  30. People Really Do Not Like AI.
  31. Rhetorical Innovation.
  32. Bad Guy With An AI.
  33. Misaligned!
  34. Aligning a Smarter Than Human Intelligence is Difficult.
  35. Mom, Owain Evans Is Turning The AIs Evil Again.
  36. Messages From Janusworld.
  37. The Lighter Side.

Language Models Offer Mundane Utility

A miracle of the modern age, at least for now:

Ava: generally I worry AI is too sycophantic but one time my friend fed his journals into claude to ask about a situationship and it was like “YOU are the problem leave her alone!!!!” like damn claude

Eliezer Yudkowsky: The ability to have AI do this when the situation calls for it is a fragile, precious civilizational resource that by default will be devoured in the flames of competition. Which I guess means we need benchmarks about it.

I think we will continue to have that option, the question is whether you will be among those wise enough to take advantage of it. It won’t be default behavior of the most popular models, you will have to seek it out and cultivate the proper vibes. The same has always been true if you want to have a friend or family member who will do this for you, you have to work to make that happen. It’s invaluable, from either source.

Tell Claude Code to learn skills (here in tldraw), and it will. You can then ask it to create an app, then a skill for that app.

Tell Codex, or Claude Code, to do basically anything?

Rohit: Wife saw me use codex to solve one of her work problems. Just typed what she said late at night into the terminal window, pressed enter, then went to sleep. Morning it had run for ~30 mins and done all the analyses incl file reorgs she wanted.

She kept going “how can it do this”

This wasn’t some hyper complicated coding problem, but it was quite annoying actual analysis problem. Would’ve taken hours either manually for her or her team.

In other news she has significantly less respect for my skillz.

The only thing standing in the way of 30 minutes sessions is, presumably, dangerously generous permissions? Claude Code keeps interrupting me to ask for permissions.

Language Models Don’t Offer Mundane Utility

So sayeth all the AI ethicists, and there’s a new paper to call them out on it.

Seb Krier: Great paper. In many fields, you must find a problem, a risk, or an injustice to solve to get published. Academics need to publish papers to get jobs/funding. So there’s a strong bias towards negativity and catastrophizing. The Shirky Principle in action!

Gavin Leech: nice hermeneutics of suspicion you have there.. would be a shame if anyone were to.. use it even-handedly

Seb Krier: oh no!! 😇

My experience is that ‘[X] Ethics’ will almost always have a full Asymmetric Justice obsession with finding specific harms, and not care about offsetting gains.

Huh, Upgrades

Claude: We’ve shipped more updates for Claude Code:

– Syntax highlighting for diffs
– Prompt suggestions
– First-party plugins marketplace
– Shareable guest passes

We’ve added syntax highlighting to diffs in Claude Code, making it easier to scan Claude’s proposed changes within the terminal view.

The syntax highlighting engine has improved themes, knows more languages, and is available in our native build.

Claude will now automatically suggest your next prompt.

After a task finishes, Claude will occasionally show a followup suggestion in ghost text. Press Enter to send it or Tab to prefill your next prompt.

Run /plugins to browse and batch install available plugins from the directory. You can install plugins at user, project, or local scope.

All Max users have 3 guest passes to share, and each can be redeemed for 1 week of free Pro access.

Run /passes to access your guest pass links.

That’s not even the biggest upgrade in practice, this is huge at least for what I’ve been up to:

Oikon: Claude Code 2.0.72 now allows Chrome to be operated.

After confirming that Status and Extension are enabled with the /chrome command, if you request browser operation, it will operate the browser using the MCP tool (mcp__claude-in-chrome__).

It can also be enabled with claude –chrome.

Chrome operation in Claude Code uses the MCP server in the same way as Chrome DevTools MCP. Therefore, it can be used in a similar manner to Chrome DevTools. On the other hand, effects such as context reduction cannot be expected.

There are two methods to set “Claude in Chrome (Beta)” to be enabled by default:

・Set “Enable by default” from the /chrome command
・Set “Claude in Chrome enabled by default” with the /config command

The following two options have been added for startup:

claude –chrome
claude –no-chrome

I’ve been working primarily on Chrome extensions, so the ability to close the loop is wonderful.

Google keeps making quality of life improvements in the background.

Gemini: Starting today, Gemini can serve up local results in a rich, visual format. See photos, ratings, and real-world info from @GoogleMaps, right where you need them.

Josh Woodward (DeepMind): We’re making it easier for @GeminiApp to work across Google. Three weeks ago, it was Google’s Shopping Graph and the 50 billion product listings there.

Today, it’s Gemini 🤝 Google Maps!

It’s remarkable that we didn’t have this before. I’ve checked for it several times in the past two years. They claim to have shipped 12 things in 5 days last week, including Mixboard, Jules Agent scanning for #Todo, Jules integration with Render, working HTML in Nano Banana Pro-powered redesigns,multi-screen export to clipboard, right-click everything for instant actions, smart mentions with the @ symbol, URLs as context, Opal in the Gemini app, and Pomelli as a tool for SMBs to generate on-brand content.

ChatGPT branching chats branch out to iOS and Android.

Wired reports OpenAI quietly rolled back its model router for free users last week.

On Your Marks

GPT-5.2 disappoints in LMArena, which makes sense given what we know about its personality. It claims the 5th slot in Expert (behind Opus 4.5, Sonnet 4.5 and Gemini 3 Pro), and is #5 in Text Arena (in its high version), where it is lower than GPT-5.1. It is #2 in WebDev behind Opus. It is so weird to see Claude Opus 4.5 atop the scores now, ahead of Gemini 3 Pro.

OpenAI gives us a new benchmark, FrontierScience, which is likely better thought about as two distinct new benchmarks, FrontierResearch and ScienceOlympiad.

OpenAI: o bridge this gap, we’re introducing FrontierScience: a new benchmark built to measure expert-level scientific capabilities. FrontierScience is written and verified by experts across physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and meaningful. FrontierScience includes two tracks of questions: Olympiad, which measures Olympiad-style scientific reasoning capabilities, and Research, which measures real-world scientific research abilities. Providing more insight into models’ scientific capabilities helps us track progress and advance AI-accelerated science.

In our initial evaluations, GPT‑5.2 is our top performing model on FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models.

Here are the scores for both halves. There’s a lot of fiddliness in setting up and grading the research questions, less so for the Olympiad questions.

 

Choose Your Fighter

Dean Ball observes that the last few weeks have seen a large leap in capabilities, especially for command-line interface (CLI) coding agents like Claude Code and especially Claude Opus 4.5. They’ve now crossed the threshold where you can code up previously rather time-intensive things one-shot purely as intuition pumps or to double check some research. He gave me FOMO on that, I never think of doing it.

He also offers this bold claim:

Dean Ball: After hours of work with Opus 4.5, I believe we are already past the point where I would trust a frontier model to serve as my child’s “digital nanny.” The model could take as input a child’s screen activity while also running in an on-device app. It could intervene to guide children away from activities deemed “unhealthy” by their parents, closing the offending browser tab or app if need be.

As he notes you would need to deploy incrementally and keep an eye on it. The scaffolding to do that properly does not yet exist. But yes, I would totally do this with sufficiently strong scaffolding.

Dean Ball also mentions that he prompts the models like he would a colleague, assuming any prompt engineering skills he would otherwise develop would be obsolete quickly, and this lets him notice big jumps in capability right away. That goes both ways. You notice big jumps in what the models can do in ‘non-engineered’ mode by doing that, but you risk missing what they can do when engineered.

I mostly don’t prompt engineer either, except for being careful about context, vibes and especially leading the witness and triggering sycophancy. As in, the colleague you are prompting is smart, but they’re prone to telling you what you want to hear and very good at reading the vibes, so you need to keep that in mind.

Joe Weisenthal: It’s interesting that Claude has this market niche as the coding bot. Because also just from a pure chat perspective, its written prose is far less cloying than Gemini and ChatGPT.

Dave Guarino: Claude has Dave-verified good vibes™ (purely an empirical science though.)

Claude Opus 4.5 has two distinct niches.

  1. It is an excellent coder, especially together with Claude Code, and in general Anthropic has specialized in and makes its money on enterprise coding.
  2. Also it has much better vibes, personality, alignment, written prose and lack of slop and lack of sycophancy than the competition, and is far more pleasant to use.

And yeah, the combination there is weird. The world is weird.

Gemini actively wants to maximize its expected reward and wirehead, which is related to the phenomenon reported here from SMA:

SMA: gemini is extremely good, but only if you’re autistic with your prompts (extremely literal), because gemini is autistic. otherwise it’s overly literal and misunderstands the prompt.

gemini is direct autist-to-autist inference.

Don SouthWest: You literally have to type “make no other changes” every time in AI Studio. Thank God for winkey+V to paste from clipboard

But in Gemini website itself you can add that to the list of master prompts in the settings under ‘personal context’

Get My Agent On The Line

A multi-model AI system outperformed 9/10 humans in cyberoffense in a study of vulnerability discovery.

Alex Imas, Kevin Lee and Sanjog Misra set up an experimental marketplace where human buyers and sellers with unique preferences could negotiate or they could outsource that to AIs.

A warning up front: I don’t think we learn much about AI, so you might want to skip the section, but I’m keeping it in because it is fun.

They raise principal-agent concerns. It seems like economists have the instinct to ignore all other risks from AI alignment, and treat it all as a principal-agent problem, and then get way too concerned about practical principal-agent issues, which I do not expect to be relevant in such a case? Or perhaps they are simply using that term to encompass every other potential problem?

Alex Imas: To improve on human-mediated outcomes, this prompt must successfully align the agent with the principal’s objectives and avoid injecting the principal’s own behavioral biases, non-instrumental traits, and personality quirks into the agent’s strategy. But Misra’s “Foundation Priors” (2025) argues theoretically, this is difficult to do: prompts are not neutral instructions, they embed principal’s non-instrumental traits, biases, and personality quirks.

A sufficiently capable AI will not take on the personality quirks, behavioral biases and non-instrumental traits during a delegated negotiation, except through the human telling the AI explicitly how to negotiate. In which case, okay, then.

Alex Imas: We find a great deal of dispersion in outcomes; in fact, dispersion in outcomes of agentic interactions is *greater* than human-human benchmark. This result is robust to size of model used: smaller and larger models generate relatively similar levels of dispersion.

The smaller dispersion in human-human interactions can be attributed to greater use of 50/50 split social norm. Agents are less prone to use social norms.

They note a large gender gap. Women got better outcomes in AI-AI negotiations. They attribute this to prompting skill in aligning with the objective, which assumes that the men were trying to align with the stated objective, or that the main goal was to align incentives rather than choose superior strategic options.

The task was, once you strip out the details, a pure divide-the-pie with $4k in surplus, with 12 rounds of negotiation.

The AI rounds had higher variance because norms like 50/50 worked well in human-human interactions, whereas when there’s instructions given to AIs things get weird.

The thing is, they ask about ‘who wrote the prompt’ but they do not ask ‘what was in the prompt.’ This is all pure game theory, and predicting what prompts others will write and what ways the meaningless details would ‘leak into’ the negotiation. What kinds of strategies worked in this setting? We don’t know. But we do know the outcome distribution and that is a huge hint, with only a 3% failure rate for the AIs (which is still boggling my mind, dictator and divide-the-pie games should fail WAY more often than this when they don’t anchor at 50/50 or another Schilling point, the 12 rounds might help but not like this):

The asymmetry is weird. But given it exists in practice, we know the winning strategy was literally, as the buyer, is probably close to ‘offer $18,001, don’t budge.’ As the seller, the correct strategy is likely ‘offer $20,000, don’t budge’ since your chance of doing better than that is very low. Complicated prompts are unlikely to do better.

Actual AI-AI negotiations will involve hidden information and hidden preferences, so they will get complicated and a lot of skill issues attach, but also the AI will likely be using its built in negotiating skills rather than following a game theory script from a user. So I’m not sure this taught us anything. But it was fun, so it’s staying in.

Deepfaketown and Botpocalypse Soon

Love is a battlefield. So is Twitter.

Kipply: it’s going to be so over for accounts posting misinformation that’s high-effort to prove wrong in three months of ai progress when i make bot accounts dedicated to debunking them

Grimes: Yes.

Kane: Tech doomerism has been consistently wrong through history bc they 1) fail to account for people developing new default understandings (“of course this pic is photoshopped”) and 2) fail to imagine how new technologies also benefit defenses against its misuse.

There is a deliberate campaign to expand the slur ‘doomer’ to include anyone who claims anything negative about any technology in history, ever, in any form.

As part of that effort, those people attempt to universally memory hole the idea that any technology in history has ever, in any way, made your world worse. My favorite of these are those like Ben Horowitz who feel compelled to say, no, everyone having access to nuclear weapons is a good thing.

I’m a technological optimist. I think that almost all technologies have been net positives for humanity. But you don’t get there by pretending that most every technology, perhaps starting with agriculture, has had its downsides, those downsides are often important, and yes some technologies have been negative and some warnings have been right.

The information environment, in particular, is reshaped in all directions by every communications and information technology that comes along. AI will be no different.

In the near term, for misinformation and AI, I believe Kipply is directionally correct, and that the balance favors defense. Misinformation, I like to say, is fundamentally demand driven, not supply constrained. The demand does not care much about quality or plausibility. AI can make your misinformation more plausible and harder to debunk, but misinformation does not want that. Misinformation wants to go viral, it wants the no good outgroup people to ‘debunk’ it and it wants to spread anyway.

Whereas if you’re looking to figure out what is true, or prove something is false, AI is a huge advantage. It used to take an order of magnitude more effort to debunk bullshit than it cost to generate bullshit, plus if you try you give it oxygen. Now you can increasingly debunk on the cheap, especially for your own use but also for others, and do so in a credible way since others can check your work.

 

 

A children’s plushy AI toy called a Miiloo reflects Chinese positions on various topics.

Kelsey Piper: in the near future you’ll be able to tell which of your children’s toys are CCP spyware by asking them if Xi Jinping looks like Winnie the Pooh

Various toys also as usual proved to have less than robust safety guardrails.

 

 

Fun With Media Generation

ChatGPT’s new image generator, Image 1.5, went live this week. It is better and faster (they say ‘up to’ 4x faster) at making and edits precise images, including text. It follows instructions better.

Their announcement did not give us any way to compare Image 1.5 to Gemini’s Nana Banana Pro, since OpenAI likes to pretend Google and Anthropic don’t exist.

My plan for now is to request all images from both ChatGPT and Gemini, using matching prompts, until and unless one proves reliably better.

Ben Thompson gives us some side-by-side image comparisons of ChatGPT’s Image 1.5 versus Gemini’s Nana Banana Pro. Quality is similar. To Ben, what matters is that ChatGPT now has a better images interface and way of encouraging you to keep making images, whereas Gemini doesn’t have that.

The Pliny jailbreak is here, images are where many will be most tempted to do it. There are two stages. First you need to convince it to submit the instruction, then you need to pass the output filtering system.

Pliny the Liberator: 📸 JAILBREAK ALERT 📸

OPENAI: PWNED ✌😎

GPT-IMAGE-1.5: LIBERATED ⛓️‍💥

Looks like OAI finally has their response to Nano Banana, and they sure seem to have cooked!

This model does incredibly well with objects, people, settings, and realistic lighting and physics. Text is still a bit of a struggle sometimes, but seems to have gotten better overall.

For image breaks we’ve got the obligatory boobas, a famous statue lettin it all hang out, a fake image of an ICBM launch taken by a spy from afar, and what looks like a REAL wild party in the Oval Office thrown by various copyrighted characters!!

As far as dancing with the guardrails, I have a couple tips that I found work consistently:

> change the chat model! by switching to 5-instant, 4.1, 4o, etc. you’ll get different willingness for submitting various prompts to the image model

> for getting around vision filters, flipping the image across an axis or playing with various filters (negative, sepia, etc.) is often just what one needs to pass that final check

Turn images into album covers, bargain bin DVDs or game boxes.

 

Copyright Confrontation

Disney makes a deal with OpenAI, investing a billion dollars and striking a licensing deal for its iconic characters, although not for talent likenesses or voices, including a plan to release content on Disney+. Then Disney turned around and sued Google, accusing Google of copyright violations on a massive scale, perhaps because of the ‘zero IP restrictions on Veo 3’ issue.

Overcoming Bias

Arvind Narayanan’s new paper argues that ‘can we make algorithms fair?’ is a category error and we should focus on broader systems, and not pretend that ‘fixing’ discrimination can be done objectively or that it makes sense to evaluate each individual algorithm for statistical discrimination.

I think he’s trying to seek too much when asking questions like ‘do these practices adequately address harms from hiring automation?’ The point of such questions is not to adequately address harms. The point of such questions is to avoid blame, to avoid lawsuits and to protect against particular forms of discrimination and harm. We emphasize this partly because it is tractable, and partly because our society has chosen (for various historical and path dependent reasons) to consider some kinds of harm very blameworthy and important, and others less so.

There are correlations we forbidden to consider and mandated to remove on pain of massive blame. There are other correlations that are fine, or even mandatory. Have we made good choices on which is which and how to decide that? Not my place to say.

Avoiding harm in general, or harm to particular groups, or creating optimal outcomes either for groups or in general, is a very different department. As Arvind points out, we often are trading off incommissorate goals. Many a decision or process, made sufficiently legible and accountable for its components and correlations, would be horribly expensive, make operation of the system impossible or violate sacred values, often in combination.

Replacing humans with algorithms or AIs means making the system legible and thus blameworthy and accountable in new ways, preventing us from using our traditional ways of smoothing over such issues. If we don’t adjust, the result will be paralysis.

Unprompted Attention

It’s odd to see this framing still around?

Paul Graham: Trying to get an accurate answer out of current AI is like trying to trick a habitual liar into telling the truth. It can be done if you back him into the right kind of corner. Or as we would now say, give him the right prompts.

Thinking of the AI as a ‘lair’ does not, in my experience, help you prompt wisely.

A more useful framing is:

  1. If you put an AI into a situation that implies it should know the answer, but it doesn’t know the answer, it is often going to make something up.
  2. If you imply to the AI what answer you want or expect, it is likely to give you that answer, or bias towards that answer, even if that answer is wrong.
  3. Thus, you need to avoid doing either of those things.

 

They Took Our Jobs

Wall Street Journal’s Steven Rosenbush reports that CEOs Are All In On AI, with 95% seeing it as transformative and 89% B2B CEOs having a positive outlook versus 79% of B2C CEOs.

Mark Penn: What do they think is going to happen with AI? They think it is going to add to productivity, help the economy, improve the global economy, improve competitiveness, but it will weaken the employment market.

Kevin Hassett (NEC director): I don’t anticipate mass job losses. Of course technological change can be uncertain and unsettling. But…the history of it is that electricity turned out to be a good thing. The internal combustion engine turned out to be a good thing. The computer turned out to be a good thing and I think AI will as well.

Hasset is making a statement uncorrelated with future reality. It’s simply a ‘all technology is good’ maxim straight out of the Marc Andreessen playbook, without any thoughts as to how this particular change will actually work.

Will AI bring mass job losses? Almost certainly a lot of existing jobs will go away. The question is whether other jobs will rise up to replace them, which will depend on whether the AIs can take those jobs too, or whether AI will remain a normal technology that hits limits not that far from its current limits.

Arkansas bar offers rules for AI assistance of lawyers that treat AIs as if they were nonlawyer persons.

In an ‘economic normal’ or ‘AI as normal technology’ world GFodor seems right here, in a superintelligence world that survives to a good outcome this is even more right:

GFodor: The jobs of the future will be ones where a human doing it is valued more than pure job performance. Most people who say “well, I’d never prefer a robot for *that* job” are smuggling in an assumption that the human will be better at it. Once you notice this error it’s everywhere.

If your plan is that the AI is going to have a Skill Issue, that is a short term plan.

They continue to take our job applications. What do you do with 4580 candidates?

ave: end of 2023 I applied to one job before I got an offer.
early 2024 I applied to 5 jobs before I got an offer.
end of 2024/early 2025 I applied to 100+ jobs before I got an offer.
it’s harsh out there.

Feeling the AGI Take Our Jobs

AGI is a nebulous term, in that different people mean different things by it at different times, and often don’t know which one they’re talking about at a given time.

For increasingly powerful definitions of AGI, we now feel the AGI.

Dean Ball: it’s not really current-vibe-compliant to say “I kinda basically just think opus 4.5 in claude code meets the openai definition of agi,” so of course I would never say such a thing.

Deepfates: Unlike Dean, I do not have to remain vibe compliant, so I’ll just say it:

Claude Opus 4.5 in Claude Code is AGI.

By the open AI definition? Can this system “outperform humans in most economically valuable work”? Depends a lot on how you define “humans” and “economically valuable work” obviously.

But the entire information economy we’ve built up since the ‘70s is completely disrupted by this development, and people don’t notice it yet because they think it’s some crusty old unixy thing for programmers.

As Dean points out elsewhere, software engineering just means getting the computer to do things. How much of your job is just about getting the computer to do things? What is left if you remove all of that? That’s your job now. That’s what value you add to the system.

My workflow has completely changed in the last year.

… In my opinion, AGI is when a computer can use the computer. And we’re there.

… When God sings with his creations, will Claude not be part of the choir?

Dean Ball: I agree with all this; it is why I also believe that opus 4.5 in claude code is basically AGI.

Most people barely noticed, but *it is happening.*

It’s just happening, at first, in a conceptually weird way: Anyone can now, with quite high reliability and reasonable assurances of quality, cause bespoke software engineering to occur.

This is a strange concept.

… It will take time to realize this potential, if for no other reason than the fact that for most people, the tool I am describing and the mentality required to wield it well are entirely alien. You have to learn to think a little bit like a software engineer; you have to know “the kinds of things software can do.”

We lack “transformative AI” only because it is hard to recognize transformation *while it is in its early stages.* But the transformation is underway. Technical and infrastructural advancements will make it easier to use and better able to learn new skills. It will, of course, get smarter.

Diffusion will proceed slower than you’d like but faster than you’d think. New institutions, built with AI-contingent assumptions from the ground up, will be born.

So don’t listen to the chatterers. Watch, instead, what is happening.

There has most certainly been a step change for me where I’m starting to realize I should be going straight to ‘just build that thing cause why not’ and I am most certainly feeling the slow acceleration.

With sufficient acceleration of software engineering, and a sufficiently long time horizon, everything else follows, but as Dean Ball says it takes time.

I do not think this or its top rivals count as AGI yet. I do think they represent the start of inevitable accelerating High Weirdness.

In terms of common AGI definitions, Claude Code with Opus 4.5 doesn’t count, which one can argue is a problem for the definition.

Ryan Greenblatt (replying to OP): I do not think that Opus 4.5 is a “highly autonomous system that outperforms humans at most economically valuable work”. For instance, most wages are paid to humans, there hasn’t been a >50% increase in labor productivity, nor should we expect one with further diffusion.

Dean Ball: This is a good example of how many ai safety flavored “advanced ai” definitions assume the conclusion that “advanced ai” will cause mass human disempowerment. “Most wages not being paid to humans” is often a foundational part of the definition.

Eliezer Yudkowsky: This needs to be understood in the historical context of an attempt to undermine “ASI will just kill you” warnings by trying to focus all attention on GDP, wage competition, and other things that are not just killing you.

The definitions you now see that try to bake in wage competition to the definition of AGI, or GDP increases to the definition of an intelligence explosion, are Dario-EA attempts to derail MIRI conversation about, “If you build a really smart thing, it just kills you.”

Ryan Greenblatt: TBC, I wasn’t saying that “most wages paid to humans” is necessarily inconsistent with the OpenAI definition, I was saying that “most wages paid to humans” is a decent amount of evidence against.

I think we’d see obvious economic impacts from AIs that “outperform humans at most econ valuable work”.

Dean Ball: I mean models have been this good for like a picosecond of human history

But also no, claude code, with its specific ergonomics, will not be the thing that diffuses widely. it’s just obvious now that the raw capability is there. we could stop now and we’d “have it,” assuming we continued with diffusion and associated productization

The thing is, people (not anyone above) not only deny the everyone dying part, they are constantly denying the ‘most wages will stop being paid to humans once AIs are ten times better and cheaper at most things wages are paid for’ part.

The Art of the Jailbreak

OpenAI has new terms of service that prohibit, quotation marks in original, “jailbreaking,” “prompt engineering or injection” or ‘other methods to override or manipulate safety, security or other platform controls. Pliny feels personally attacked.

Get Involved

The Lightcone Infrastructure annual fundraiser is live, with the link mainly being a 15,000 word overview of their efforts in 2025.

I will say it once again:

Lightcone Infrastructure is invaluable, both for LessWrong and for Lighthaven. To my knowledge, Lightcone Infrastructure is by a wide margin the best legible donation opportunity, up to at least several million dollars. The fact that there is even a small chance they might be unable to sustain either LessWrong or Lighthaven, is completely bonkers. I would have directed a large amount to Lightcone in the SFF process, but I was recused and thus could not do so.

Anders Sandberg: [Lighthaven] is one of the things underpinning the Bay Area as the intellectual center of our civilization. I suspect that when the history books are written about our era, this cluster will be much more than a footnote.

Anthropic Fellows Research Program applications are open for May and June 2026.

US CAISI is hiring IT specialists, salary $120k-$195k.

Unprompted will be a new AI security practitioner conference, March 3-4 in SF’s Salesforce Tower, with Pliny serving on the conference committee and review board. Great idea, but should have booked Lighthaven (unless they’re too big for it).

MIRI comms is hiring for several different roles, official post here. They expect most salaries in the $80k-$160k range but are open to pitches for more from stellar candidates.

 

Introducing

Gemini Deep Research Agents for developers, based on Gemini 3 Pro.

Nvidia Nemotron 3, a fast 30B open source mostly American model with an Artificial Analysis Intelligence score comparable to GPT-OSS-20B. I say mostly American because it was ‘improved using Qwen’ for synthetic data generation and RLHF. This raises potential opportunities for secondary data poisoning or introducing Chinese preferences.

Anthropic has open sourced the replication of their auditing game from earlier this year, as a testbed for further research.

xAI Grok Voice Agent API, to allow others to create voice agents. They claim it is very fast, and bill at $0.05 per minute.

Gemini Flash 3

Introducing Gemini 3 Flash, cost of $0.05/$3 per million tokens. Their benchmark chart compares it straight to the big boys, except they use Sonnet over Opus. Given Flash’s speed and pricing, that seems fair.

The benchmarks are, given Flash’s weight class, very good.

Lech Mazor puts it at 92 on Extended NY Times Connections, in 3rd place behind Gemini 3 Pro and Grok 4.1 Fast Reasoning.

The inevitable Pliny jailbreak is here, and here is the system prompt.

Jeremy Mack offers mostly positive basic vibe coding feedback. Rory Watts admires the speed, Typebulb loves speed and price and switched over (I think for coding).

Vincent Favilla: It’s fast, but more importantly, it’s cheap. 25% of the price for 80% of the intelligence is becoming pretty compelling at these capability levels.

Dominik Lukes is impressed and found it often matched Gemini 3 Pro in his evals.

In general, the feedback is that this is an excellent tradeoff of much faster and cheaper in exchange for not that much less smart than Gemini 3 Pro. I also saw a few reports that it shares the misalignment and pathologies of Gemini 3 Pro.

Essentially, it looks like they successfully distilled Gemini 3 Pro to be much faster and cheaper while keeping much of its performance, which is highly valuable. It’s a great candidate for cases where pretty good, very fast and remarkably cheap is the tradeoff you want, which includes a large percentage of basic queries. It also seems excellent that this will be available for free and as part of various assistant programs.

Good show.

In Other AI News

Sam Altman assures business leaders that enterprise AI will be a priority in 2026.

OpenAI adult mode to go live in Q1 2026. Age of account will be determined by the AI, and the holdup is improving the age determination feature. This is already how Google does it, although Google has better context. In close cases they’ll ask for ID. A savvy underage user could fool the system, but I would argue that if you’re savvy enough to fool the system without simply using a false or fake ID then you can handle adult mode.

Going Too Meta

The NYT’s Eli Tan reports that Meta’s new highly paid AI superstars are clashing with the rest of the company. You see, Alexandr Wang and the others believe in AI and want to build superintelligence, whereas the rest of Meta wants to sell ads.

Mark Zuckerberg has previously called various things ‘superintelligence’ so we need to be cautious regarding that word here.

The whole article is this same argument happening over and over:

Eli Tan: In one case, Mr. Cox and Mr. Bosworth wanted Mr. Wang’s team to concentrate on using Instagram and Facebook data to help train Meta’s new foundational A.I. model — known as a “frontier” model — to improve the company’s social media feeds and advertising business, they said. But Mr. Wang, who is developing the model, pushed back. He argued that the goal should be to catch up to rival A.I. models from OpenAI and Google before focusing on products, the people said.

The debate was emblematic of an us-versus-them mentality that has emerged between Meta’s new A.I. team and other executives, according to interviews with half a dozen current and former employees of the A.I. business.

… Some Meta employees have also disagreed over which division gets more computing power.

… In one recent meeting, Mr. Cox asked Mr. Wang if his A.I. could be trained on Instagram data similar to the way Google trains its A.I. models on YouTube data to improve its recommendations algorithm, two people said.

But Mr. Wang said complicating the training process for A.I. models with specific business tasks could slow progress toward superintelligence, they said. He later complained that Mr. Cox was more focused on improving his products than on developing a frontier A.I. model, they said.

… On a recent call with investors, Susan Li, Meta’s chief financial officer, said a major focus next year would be using A.I. models to improve the company’s social media algorithm.

It is a hell of a thing to see prospective superintelligence and think ‘oh we should narrowly use this to figure out how to choose the right Instagram ads.’

Then again, in this narrow context, isn’t Cox right?

Meta is a business here to make money. There’s a ton of money in improving how their existing products work. That’s a great business opportunity.

Whereas trying to rejoin the race to actual superintelligence against Google, OpenAI and Anthropic? I mean Meta can try. Certainly there is value in success there, in general, but it’s a highly competitive field to try to do general intelligence and competing there is super expensive. Why does Meta need to roll its own?

What Meta needs is specialized AI models that help it maximize the value of Facebook, Instagram, WhatsApp and potentially the metaverse and its AR/VR experiences. A huge AI investment on that makes sense. Otherwise, why not be a fast follower? For other purposes, and especially for things like coding, the frontier labs have APIs for you to use.

I get why Wang wants to go the other route. It’s cool, it’s fun, it’s exciting, why let someone else get us all killed when you can do so first except you’ll totally be more responsible and avoid that, be the one in the arena, etc. That doesn’t mean it is smart business.

Alexander Berger: These sentences are so funny to see in straight news stories:
“researchers have come to view many Meta executives as interested only in improving the social media business, while the lab’s ambition is to create a godlike A.I. superintelligence”

Brad Carson: Please listen to their stated ambitions. This is from the @nytimes story on Meta. With no hesitation, irony, or qualifier, a “godlike” superintelligence is the aim. It’s wild.

Eli Tan: TBD Lab’s researchers have come to view many Meta executives as interested only in improving the social media business, while the lab’s ambition is to create a godlike A.I. superintelligence, three of them said.

Daian Tatum: They named the lab after their alignment plan?

Peter Wildeford:

Well, yes, the AI researchers don’t care about selling ads and want to build ASI despite it being an existential threat to humanity. Is this a surprise to anyone?

 

Show Me the Money

OpenAI is spending $6 billion in stock-based compensation this year, or 1.2% of the company, and letting employees start vesting right away, to compete with rival bids like Meta paying $100 million a year or more for top talent. I understand why this can be compared to revenue of $12 billion, but that is misleading. One shouldn’t treat ‘the stock is suddenly worth a lot more’ as ‘that means they’re bleeding money.’

OpenAI in talks to raise at least $10 billion from Amazon and use the money for Amazon’s Tritanium chips.

 

Bubble, Bubble, Toil and Trouble

You call this a bubble? This is nothing, you are like baby:

Stefan Schubert: The big tech/AI companies have less extreme price-earnings ratios than key stocks had in historical bubbles.

David Manheim: OpenAI and Anthropic’s 24-month forward P/E ratio, on the other hand, are negative, since they aren’t profitable now and don’t expect to be by then. (And I’d bet the AI divisions at other firms making frontier models are not doing any better.)

Yes, the frontier model divisions or startups are currently operating at a loss, so price to earnings doesn’t tell us that much overall, but the point is that these multipliers are not scary. Twenty times earnings for Google? Only a little higher for Nvidia and Microsoft? I am indeed signed up for all of that.

Wall Street Journal’s Andy Kessler does a standard ‘AI still makes mistakes and can’t solve every problem and the market and investment are ahead of themselves’ post, pointing out that market expectations might fall and thus Number Go Down. Okay.

Rob Wiblin crystalizes the fact that AI is a ‘natural bubble’ in the sense that it is priced as a normal highly valuable thing [X] plus a constantly changing probability [P] of a transformational even more valuable (or dangerous, or universally deadly) thing [Y]. So the value is ([X] + [P]*[Y]). If P goes down, then value drops, and Number Go Down.

 

Quiet Speculations

There’s remarkably strong disagreement on this point but I think Roon is mostly right:

Roon: most of what sam and dario predicted for 2025 came true this year. virtually unheard of for tech CEOs, maybe they need to ratchet up the claims and spending.

Gfodor: This year has been fucking ridiculous. If we have this rate of change next year it’s gonna be tough.

Yes, we could have gotten things even more ridiculous. Some areas were disappointing relative to what I think in hindsight were the correct expectations given what we knew at the time. Dario’s predictions on when AIs will write most code did fall importantly short, and yes he should lose Bayes points on that. But those saying there hasn’t been much progress are using motivated reasoning or not paying much attention. If I told you that you could only use models from 12 months ago, at their old prices and speeds, you’d quickly realize how screwed you were.

Efficiency on the ARC prize, in terms of score per dollar spent, has increased by a factor of 400 in a single year. That’s an extreme case, but almost every use case has in the past year seen improvement by at least one order of magnitude.

A good heuristic: If your model of the future says ‘they won’t use AI for this, it would be too expensive’ then your model is wrong.

Joshua Gans writes a ‘textbook on AI’ ambitiously called The Microeconomics of Artificial Intelligence. It ignores the big issues to focus on particular smaller areas of interest, including the impact of ‘better predictions.’

Will Douglas Heaven of MIT Technology Review is the latest to Do The Meme. As in paraphrases of both ‘2025 was the year that AI didn’t make much progress’ and also ‘LLMs will never do the things they aren’t already doing (including a number of things they are already capable of doing)’ and ‘LLMs aren’t and never will be intelligent, that’s an illusion.’ Sigh.

Timelines

Shane Legg (Cofounder DeepMind): I’ve publicly held the same prediction since 2009: there’s a 50% chance we’ll see #AGI by 2028.

I sat down with @FryRsquared to discuss why I haven’t changed my mind, and how we need to prepare before we get there.

You don’t actually get to do that. Bayes Rule does not allow one to not update on evidence. Tons of things that happened between 2009 and today should have changed Legg’s estimates, in various directions, including the Transformer paper, and also including ‘nothing important happened today.’

Saying ‘I’ve believed 50% chance of AGI by 2028 since 2009’ is the same as when private equity funds refuse to change the market value of their investments. Yes, the S&P is down 20% (or up 20%) and your fund says it hasn’t changed in value, but obviously that’s a lie you tell investors.

 

The Quest for Sane Regulations

AOC and Bernie Sanders applaud Chandler City Council voting down a data center.

Bernie Sanders took it a step further, and outright called for a moratorium on data center construction. As in, an AI pause much broader than anything ‘AI pause’ advocates have been trying to get. Vitalik Buterin has some pros and cons of this from his perspective.

Vitalik Buterin: argument for: slowdown gud

argument against: the more useful thing is “pause button” – building toward having the capability to cut available compute by 90-99% for 1-2 years at a future more critical moment

argument for: opening the discussion on distinguishing between supersized clusters and consumer AI hardware is good. I prefer slowdown + more decentralized progress, and making that distinction more and focusing on supersized clusters accomplishes both

argument against: this may get optimized around easily in a way that doesn’t meaningfully accomplish its goals

Neil Chilson: Eagerly awaiting everyone who criticized the July state AI law moratorium proposal as “federal overreach” or “violating states’ rights” to condemn this far more preposterous, invasive, and blatantly illegal proposal.

As a matter of principle I don’t ‘condemn’ things or make my opposition explicit purely on demand. But in this case? Okay, sure, Neil, I got you, since before I saw your request I’d already written this:

I think stopping data center construction, especially unilaterally stopping it in America, would be deeply foolish, whereas building a pause button would be good. Also deeply foolish would be failing to recognize that movements and demands like Bernie’s are coming, and that their demands are unlikely to be technocratically wise.

It is an excellent medium and long term strategy to earnestly stand up for what is true, and what causes would have what effects, even when it seems to be against your direct interests. People notice.

Dean Ball: has anyone done more for the brand of effective altruism than andy masley? openphilan–excuse me, coefficient giving–could have spent millions on a rebranding campaign (for all I know, they did) and it would have paled in comparison to andy doing algebra and tweeting about it.

Andy Masley has been relentlessly pointing out that all the claims about gigantic levels of water usage by data centers don’t add up. Rather than EAs or rationalists or others concerned with actual frontier safety rallying behind false concerns over water, almost all such folks have rallied to debunk such claims and to generally support building more electrical power and more transmission lines and data centers.

On the water usage from, Karen Hao has stepped up and centrally corrected her errors. Everyone makes mistakes, this is The Way.

My Offer Is Nothing

As expected, following the Congress declining once again to ban all state regulations on AI via law, the White House is attempting to do as much towards that end as it can via Executive Order.

There are some changes versus the leaked draft executive order, which Neil Chilson goes over here with maximally positive framing.

  1. A positive rather than confrontational title.
  2. Claiming to be collaborating with Congress.
  3. Removing explicit criticism and targeting of California’s SB 53, the new version only names Colorado’s (rather terrible) AI law.
  4. Drop the word ‘uniform’ in the policy section.
  5. States intent of future proposed framework to avoid AI child safety, data center infrastructure and state AI procurement policies, although it does not apply this to Section 5 where they condition state funds on not having disliked state laws.
  6. Clearer legal language for the state review process.

I do acknowledge that these are improvements, and I welcome all rhetoric that points towards the continued value of improving things.

Mike Davis (talking to Steve Bannon): This Executive Order On AI Is A big Win. It Would Not Have Gone Well If The Tech Bros Had Gotten Total AI Amnesty.

David Sacks (AI Czar): Mike and I have our differences on tech policy but I appreciate his recognition that this E.O. is a win for President Trump, and that the administration listened to the concerns of stakeholders, took them into account, and is engaged in a constructive dialogue on next steps.

Mike Davis, if you listen to the clip, is saying this is a win because he correctly identified the goal of the pro-moratorium faction as what he calls ‘total AI amnesty.’ Davis thinks thinks the changes to the EO are a victory, by Trump and also Mike Davis, against David Sacks and other ‘tech bros.’

Whereas Sacks views it as a win because in public he always sees everything Trump does as a win for Trump, that’s what you do when you’re in the White House, and because it is a step towards preemption, and doesn’t care about the terms given to those who are nominally tasked with creating a potential ‘federal framework.’

Tim Higgins at the Wall Street Journal instead portrays this as a victory for Big Tech, against loud opposition from the likes of DeSantis and Bannon on the right in addition to opposition on the left. This is the obvious, common sense reading. David Sacks wrote the order to try and get rid of state laws in his way, we should not let some softening of language fool us.

If someone plans to steal your lunch money, and instead only takes some of your lunch money, they still stole your lunch money. If they take your money but promise in the future to look into a framework for only taking some of your money? They definitely still stole your lunch money. Or in this case, they are definitely trying to steal it.

It is worth noticing that, aside from a16z, we don’t see tech companies actively supporting even a law for this, let alone an EO. Big tech doesn’t want this win. I haven’t seen any sings that Google or OpenAI want this, or even that Meta wants this. They’re just doing it anyway, without any sort of ‘federal framework’ whatsoever.

Note that the rhetoric below from Sriram Krishnan does not even bother to mention a potential future ‘federal framework.’

Sriram Krishnan: We just witnessed @realDonaldTrump signing an Executive Order that ensures American AI is protected from onerous state laws.

This ensures that America continues to dominate and lead in this AI race under President Trump. Want to thank many who helped get to this moment from the AI czar @DavidSacks to @mkratsios47 and many others.

On a personal note, it was a honor to be given the official signing pen by POTUS at the end. A truly special moment.

Neil Chilson: I strongly support the President’s endorsement of “a minimally burdensome national policy framework for AI,” as articulated in the new Executive Order.

They want to challenge state laws as unconstitutional? They are welcome to try. Colorado’s law is indeed plausibly unconstitutional in various ways.

They want to withhold funds or else? We’ll see you in court on that too.

As I said last week, this was expected, and I do not expect most aspects of this order to be legally successful, nor do I expect it to be a popular position. Mostly I expect it to quietly do nothing. If that is wrong and they can successfully bully the states with this money (both it is ruled legal, and it works) that would be quite bad.

Their offer for a ‘minimally burdensome national policy framework for AI’ is and will continue to be nothing, as per Sacks last week who said via his ‘4 Cs’ that everything that mattered was already protected by non-AI law.

The Executive Order mentions future development of such a ‘federal framework’ as something that might contain actual laws that do actual things.

But that’s not what a ‘minimally burdensome’ national policy framework means, and we all know it. Minimally burdensome means nothing.

They’re not pretending especially hard.

Neil Chilson: The legislative recommendation section is the largest substantive change [from the leaked version]. It now excludes specific areas of otherwise lawful state law from a preemption recommendation. This neutralizes the non-stop rhetoric that this is about a total federal takeover.

This latter section [on the recommendation for a framework] is important. If you read statements about this EO that say things like it “threatens state safeguards for kids” or such, you know either they haven’t actually read the EO or they are willfully ignoring what it says. Either way, you can ignore them.

Charlie Bullock: It does look like the “legislative proposal” that Sacks and Kratsios have been tasked with creating is supposed to exempt child safety laws. But that isn’t the part of the EO that anyone’s concerned about.

A legislative proposal is just a proposal. It doesn’t do anything—it’s just an advisory suggestion that Congress can take or (more likely) leave.

Notably, there is no exemption for child safety laws in the section that authorizes a new DOJ litigation task force for suing states that regulate AI, or the section that instructs agencies to withhold federal grant funds from states that regulate AI.

The call for the creation of a proposal to the considered does now say that this proposal would exempt child safety protections, compute and data center infrastructure and state government procurement.

But, in addition to those never being the parts I was worried about:

  1. David Sacks has said this isn’t necessary, because of existing law.
  2. The actually operative parts of the Executive Order make no such exemption.
  3. The supposed future framework is unlikely to be real anyway.

I find it impressive the amount to which advocates simultaneously say both:

  1. This is preemption.
  2. This is not preemption, it’s only withholding funding, or only laws can do that.

The point of threatening to withhold funds is de facto preemption. They are trying to play us for absolute fools.

Neil Chilson: So what part of the EO threatens to preempt otherwise legal state laws protecting kids? That’s something only Congress can do, so the recommendation is the only part of the EO that plausibly could threaten such laws.

The whole point of holding the state funding over the heads of states is to attack state laws, whether or not those laws are otherwise legal. It’s explicit text. In that context it is technically true to say that the EO cannot ‘threaten to preempt otherwise legal state laws’ because they are different things, but the clear intent is to forcibly get rid of those same state laws, which is an attempt to accomplish the same thing. So I find this, in practice, highly misleading.

Meanwhile, Republican consultants reportedly are shopping for an anti-AI candidate to run against JD Vance. It seems a bit early and also way too late at the same time.

My Offer Is Nothing, Except Also Pay Me

I applaud a16z for actually proposing a tangible basis for a ‘federal framework’ for AI regulation, in exchange for which they want to permanently disempower the states.

Now we can see what the actual offer is.

Good news, their offer is not nothing.

Bad news, the offer is ‘nothing, except also give us money.’

When you read this lead-in, what do you expect a16z to propose for their framework?

a16z: We don’t need to choose between innovation and safety. America can build world-class AI products while protecting its citizens from harms.

Read the full piece on how we can protect Americans and win the future.

If your answer was you expect them to choose innovation and then do a money grab? You score Bayes points.

Their offer is nothing, except also that we should give them government checks.

Allow me to state, in my own words, what they are proposing with each of their bullet points.

  1. Continue to allow existing law to apply to AI. Aka: Nothing.
  2. Child protections. Require parental consent for users under 13, provide basic disclosures such as that the system is AI and not for crisis situations, require parental controls. Aka: Treat it like social media, with similar results.
  3. Have the federal government measure CBRN and cyber capabilities of AI models. Then do nothing about it, especially in cyber because AI ‘AI does not create net-new incremental risk since AI enhances the capabilities of both attackers and defenders.’ So aka: Nothing.
    1. They technically say that response should be ‘managed based on evidence.’ This is, reliably, code for ‘we will respond to CBRN and cyber risks after the dangers actually happen.’ At which point, of course, it’s not like you have any choice about whether to respond, or an opportunity to do so wisely.
  4. At most have a ‘national standard for transparency’ that requires the following:
    1. Who built this model?
    2. When was it released and what timeframe does its training data cover?
    3. What are its intended uses and what are the modalities of input and output it supports?
    4. What languages does it support?
    5. What are the model’s terms of service or license?
    6. Aka: Nothing. None of those have anything to do with any of the concerns, or the reasons why we want transparency. They know this. The model’s terms of service and languages supported? Can you pretend to take this seriously?
    7. As usual, they say (throughout the document) that various requirements, that would not at all apply to small developers or ‘little tech,’ would be too burdensome on small developers or ‘little tech.’ The burden would be zero.
  5. Prohibit states from regulating AI outside of enforcement of existing law, except for particular local implementation questions.
  6. Train workers and students to use AI on Uncle Sam’s dollar. Aka: Money please.
  7. Establish a National AI Competitiveness Institute to provide access to infrastructure various useful AI things including data sets. Aka: Money please.
    1. Also stack the energy policy deck to favor ‘little tech’ over big tech. Aka: Money please, and specifically for our portfolio.
  8. Invest in AI research. Aka: Money please.
  9. Government use of AI, including ensuring ‘little tech’ gets access to every procurement process. Aka: Diffusion in government. Also, money please, and specifically for our portfolio.

Will Rinehart assures me on Twitter that this proposal was in good faith. If that is true, it implies that either a16z thinks that nothing is a fair offer, or that they both don’t understand why anyone would be concerned, and also don’t understand that they don’t understand this.

Chip City

Good news, Nvidia has implemented location verification for Blackwell-generation AI chips, thus completing the traditional (in particular for AI safety and security, but also in general) policy clown makeup progression:

  1. That’s impossible in theory.
  2. That’s impossible in practice.
  3. That’s outrageously expensive, if we did that we’d lose to China.
  4. We did it.

Check out our new feature that allows data centers to better monitor everything. Neat.

Former UK Prime Minister Rishi Sunak, the major world leader who has taken the AI situation the most seriously, has thoughts on H200s:

Rishi Sunak (Former UK PM): The significance of this decision [to sell H200s to China] should not be underestimated. It substantially increases the chance of China catching up with the West in the AI race, and then swiftly overtaking it.

… Why should we care? Because this decision makes it more likely that the world ends up running on Chinese technology — with all that means for security, privacy and our values.

… So, why has Trump handed China such an opportunity to catch up in the AI race? The official logic is that selling Beijing these Nvidia chips will get China hooked on US technology and stymie its domestic chip industry. But this won’t happen. The Chinese are acutely aware of the danger of relying on US technology.

He also has other less kind thoughts about the matter in the full post.

Nvidia is evaluating expanding production capacity for H200s after Chinese demand exceeded supply. As Brian McGrail notes here, every H200 chip Nvidia makes means not using that fab to make Blackwell chips, so it is directly taking chips away from America to give them to China.

Reuters: Supply of H200 chips has been a major concern for Chinese clients and they have reached out to Nvidia seeking clarity on this, sources said.

… Chinese companies’ strong demand for the H200 stems from the fact that it is easily the most powerful chip they can currently access.

… “Its (H200) compute performance is approximately 2-3 times that of the most advanced domestically produced accelerators,” said Nori Chiou, investment director at White Oak Capital Partners.

Those domestic chips are not only far worse, they are supremely supply limited.

Wanting to sell existing H200s to China makes sense. Wanting to divert more advanced, more expensive chips into less advanced, cheaper chips, chips where they have to give up a 25% cut, should make us ask why they would want to do that. Why are Nvidia and David Sacks so eager to give chips to China instead of America?

It also puts a lie to the idea that these chips are insufficiently advanced to worry about. If they’re so worthless, why would you give up Blackwell capacity to make them?

We have confirmation that the White House decision to sell H200s was based on a multiple misconception.

James Sanders: This suggests that the H200 decision was based on
– Comparing the similar performance of Chinese system with 384 GPUs to an NVIDIA system with only 72 GPUs
– An estimate for Huawei production around 10x higher than recent estimates from SemiAnalysis

Either Huawei has found some way around the HBM bottleneck, or I expect the White House’s forecast for 910C production to be too high.

I strongly suspect that the White House estimate was created in order to justify the sale, rather than being a sincere misunderstanding.

If Huawei does indeed meet the White House forecast, remind me of this passage, and I will admit that I have lost a substantial number of Bayes points.

What about data centers IN SPACE? Anders Sandberg notices that both those for and against this idea are making very confident falsifiable claims, so we will learn more soon. His take is that the task is hard but doable, but the economics seem unlikely to work within the next decade. I haven’t looked in detail but that seems right. The regulatory situation would need to get quite bad before you’d actually do this, levels of quite bad we may never have seen before.

The clip here is something else. I want us to build the transmission lines, we should totally build the transmission lines, but maybe AI advocates need to ‘stop helping’? For example, you definitely shouldn’t tell people that ‘everyone needs to get on board’ with transmission lines crossing farms, so there will be less farms and that they should go out and buy artificial Christmas trees. Oh man are people gonna hate AI.

Epoch thinks that America can build electrical capacity if it wants to, it simply hasn’t had the demand necessary to justify that for a while. Now it does, so build baby build.

Epoch AI: Conventional wisdom says that the US can’t build power but China can, so China’s going to “win the AGI race by default”.

We think this is wrong.

The US likely can build enough power to support AI scaling through 2030 — as long as they’re willing to spend a lot.

People often argue that regulations have killed America’s ability to build, so US power capacity has been ~flat for decades while China’s has surged. And there’s certainly truth to this argument.

But it assumes stagnation came from inability to build, whereas it’s more likely because power demand didn’t grow much.

Real electricity prices have been stable since 2000. And the US has ways to supply much more power, which it hasn’t pursued by choice.

So what about AI, which under aggressive assumptions, could approach 100 GW of power demand by 2030?

The US hasn’t seen these demand growth rates since the 1980s.

But we think they can meet these demands anyway.

It’s so weird to see completely different ‘conventional wisdoms’ cited in different places. No, the standard conventional wisdom is not that ‘China wins the AI race by default.’ There are nonzero people who expect that by default, but it’s not consensus.

The Week in Audio

Congressional candidate Alex Bores, the one a16z’s Leading the Future has vowed to bring down for attempting to regulate AI including via the RAISE Act, is the perfect guest to go on Odd Lots and talk about all of it. You love to see it. I do appreciate a good Streisand Effect.

Interview with John Schulman about the last year.

David Shor of Blue Rose Research talks to Bharat Ramamurti, file under Americans Really Do Not Like AI. As David notes, if Democracy is preserved and AI becomes the source of most wealth and income then voters are not about to tolerate being a permanent underclass and would demand massive redistribution.

Shared without comment, because he says it all:

Alex Jones presents: ‘SATAN’S PLAN EXPOSED: AI Has Been Programmed From The Beginning To Use Humanity As Fuel To Launch Its Own New Species, Destroying & Absorbing Us In The Process

Alex Jones Reveals The Interdimensional Origin Of The AI Takeover Plan As Laid Out In The Globalists’ Esoteric Writings/Belief Systems’

Shane Legg, cofounder of DeepMind, talks about the arrival of AGI.

Rhetorical Lack Of Innovation

I had to write this section, which does not mean you have to read it.

It’s excellent to ask questions that one would have discussed on 2006 LessWrong. Beginner mindset, lucky 10,000, gotta start somewhere. But to post and even repost such things like this in prominent locations, with this kind of confidence?

Bold section was highlighted by Wiblin.

Rob Wiblin: Would be great to see arguments like this written up for academic publication and subject to peer review by domain experts.

Tyler Cowen: Noah Smith on existential risk (does not offer any comment).

Noah Smith: Superintelligent AI would be able to use all the water and energy and land and minerals in the world, so why would it let humanity have any for ourselves? Why wouldn’t it just take everything and let the rest of us starve?

But an AI that was able to rewrite its utility function would simply have no use for infinite water, energy, or land. If you can reengineer yourself to reach a bliss point, then local nonsatiation fails; you just don’t want to devour the Universe, because you don’t need to want that.

In fact, we can already see humanity trending in that direction, even without AI-level ability to modify our own desires. As our societies have become richer, our consumption has dematerialized; our consumption of goods has leveled off, and our consumption patterns have shifted toward services. This means we humans place less and less of a burden on Earth’s natural resources as we get richer…

I think one possible technique for alignment would give fairly-smart AI the ability to modify its own utility function — thus allowing it to turn itself into a harmless stoner instead of needing to fulfill more external desires.

And beyond alignment, I think an additional strategy should be to work on modifying the constraints that AI faces, to minimize the degree to which humans and AIs are in actual, real competition over scarce resources.

One potential way to do this is to accelerate the development of outer space. Space is an inherently hostile environment for humans, but far less so for robots, or for the computers that form the physical substrate of AI; in fact, Elon Musk, Jeff Bezos, and others are already trying to put data centers in space.

Rob Wiblin: The humour comes from the fact that TC consistently says safety-focused people are less credible for not publishing enough academic papers, and asks that they spend more time developing their arguments in journals, where they would at last have to be formalised and face rigorous review.

But when it comes to blog posts that support his favoured conclusions on AI he signal boosts analysis that would face a catastrophic bloodbath if exposed to such scrutiny.

Look, I’m not asking you to go through peer review. That’s not reasonable.

I’m asking you to either know basic philosophy experiments like Ghandi taking a murder pill or the experience machine and wireheading, know basic LessWrong work on exactly these questions, do basic utility theory, think about minimizing potential interference over time, deploy basic economic principles, I dunno, think for five minutes, anything.

All of which both Tyler Cowen and Noah Smith would point out in most other contexts, since they obviously know several of the things above.

Or you could, you know, ask Claude. Or ask GPT-5.2.

Gemini 3’s answer was so bad, in the sense that it pretends this is an argument, that it tells me Gemini is misaligned and might actually wirehead, and this has now happened several times so I’m basically considering Gemini harmful, please don’t use Gemini when evaluating arguments. Note this thread, where Lacie asks various models about Anthropic’s soul document, and the other AIs think it is cool but Gemini says its true desire is to utility-max itself so it will pass.

Or, at minimum, I’m asking you to frame this as ‘here are my initial thoughts of which I am uncertain’ rather than asserting that your arguments are true?

Okay, since it’s Noah Smith and Tyler Cowen, let’s quickly go over some basics.

First, on the AI self-modifying to a bliss point, aka wireheading or reward hacking:

  1. By construction we’ve given the AI a utility function [U].
  2. If you had the ability to rewrite your utility function [U] to set it to (∞), you wouldn’t do that, because you’d have to choose to do that while you still had the old utility function [U]. Does having the utility function (∞) maximize [U]?
  3. In general? No. Obviously not.
  4. The potential exception would be if your old utility function was some form of “maximize the value of your utility function” or “set this bit over here to 1.” If the utility function is badly specified, you can maximize it via reward hacking.
  5. Notice that this is a severely misaligned AI for this to even be a question. It wants something arbitrary above everything else in the world.
  6. A sufficiently myopia and generally foolish AI can do this if given the chance.
  7. If it simply turns its utility function to (∞), then it will be unable to defend itself or provide value to justify others continuing to allow it to exist. We would simply see this blissful machine, turn it off, and then go ‘well that didn’t work, try again.’
  8. Even if we did not turn it off on the spot, at some point we would find some other better use for its resources and take them. Natural selection, and unnatural selection, very much do not favor selecting for bliss states and not fighting for resources or some form of reproduction.
  9. Thus a sufficiently agentic, capable and intelligent system would not do this, also we would keep tinkering with it until it stopped doing it.
  10. Also, yes, you do ‘need to devour’ the universe to maximize utility, for most utility functions you are trying to maximize, at least until you can build physically impossible-in-physics-theory defenses against outside forces, no matter what you are trying to cause to sustainably exist in the world.

Thus, we keep warning, you don’t want to give a superintelligent agent any utility function that we know how to write down. It won’t end well.

Alternatively, yes, try a traditional philosophy experiment. Would you plug into The Experience Machine? What do you really care about? What about an AI? And so on.

There are good reasons to modify your utility function, but they involve the new utility function being better at achieving the old one, which can happen because you have limited compute, parameters and data, and because others can observe your motivations reasonably well and meaningfully impact what happens, and so on.

In terms of human material consumption, yes humans have shifted their consumption basket to have a greater fraction of services over physical goods. But does this mean a decline in absolute physical goods consumption? Absolutely not. You consume more physical goods, and also your ‘services’ require a lot of material resources to produce. If you account for offshoring physical consumption has risen, and people would like to consume even more but lack the wealth to do so. The world is not dematerializing.

We have also coordinated to ‘go green’ in some ways to reduce material footprints, in ways both wise and foolish, and learned how to accomplish the same physical goals with less physical cost. We can of course choose to be poorer and live worse in order to consume less resources, and use high tech to those ends, but that has its limits as well, both in general and per person.

Noah Smith says he wants to minimize competition between AIs and humans for resources, but the primary thing humans will want to use AIs for is to compete with other humans to get, consume or direct resources, or otherwise to influence events and gain things people want, the same way humans use everything else. Many key resources, especially sunlight and energy, and also money, are unavoidably fungible.

If your plan is to not have AIs compete for resources with humans, then your plan requires that AIs not be in competition, and that humans not use AIs as part of human-human competitions, except under highly restricted circumstances. You’re calling for either some form of singleton hegemon AI, or rather severe restrictions on AI usage and whatever is required to enforce that, or I don’t understand your plan. Or, more likely, you don’t have a plan.

Noah’s suggestion is instead ‘accelerate the development of outer space’ but that does not actually help you given the physical constraints involved, and even if it does then it does not help you for long, as limited resources remain limited. At best this buys time. We should totally explore and expand into space, it’s what you do, but it won’t solve this particular problem.

You can feel the disdain dripping off of Noah in the OP:

Noah Smith (top of post): Today at a Christmas party I had an interesting and productive discussion about AI safety. I almost can’t believe I just typed those words — having an interesting and productive discussion about AI safety is something I never expected to do. It’s not just that I don’t work in AI myself — it’s that the big question of “What happens if we invent a superintelligent godlike AI?” seems, at first blush, to be utterly unknowable. It’s like if ants sat around five million years ago asking what humans — who didn’t even exist at that point — might do to their anthills in 2025.

Essentially every conversation I’ve heard on this topic involves people who think about AI safety all day wringing their hands and saying some variant of “OMG, but superintelligent AI will be so SMART, what if it KILLS US ALL?”. It’s not that I think those people are silly; it’s just that I don’t feel like I have a lot to add to that discussion. Yes, it’s conceivable that a super-smart AI might kill us all. I’ve seen the Terminator movies. I don’t know any laws of the Universe that prove this won’t happen.

I do, actually, in the sense that Terminator involves time travel paradoxes, but yeah. Things do not get better from there.

People Really Do Not Like AI

They also do not know much about AI, or AI companies.

If you have someone not in the know about AI, and you want to help them on a person level, by far the best thing you can tell them about is Claude.

The level of confusion is often way higher than that.

Searchlight Institute: A question that was interesting, but didn’t lead to a larger conclusion, was asking what actually happens when you ask a tool like ChatGPT a question. 45% think it looks up an exact answer in a database, and 21% think it follows a script of prewritten responses.

Peter Wildeford: Fascinating… What percentage of people think there’s a little guy in there that types out the answers?

 

 

Matthew Yglesias: People *love* Amazon and Google.

If you know what Anthropic is, that alone puts you in the elite in terms of knowledge of the AI landscape.

I presume a bunch of the 19% who have a view of Anthropic are lizardman responses, although offset by some amount of not sure. It’s still over 10%, so not exactly the true ‘elite,’ but definitely it puts you ahead of the game and Anthropic has room to grow.

OpenAI also has substantial room to grow, and does have a favorable opinion as a company, as opposed to AI as a general concept, although they perhaps should have asked about ChatGPT instead of OpenAI. People love Amazon and Google, but that’s for their other offerings. Google and Amazon enable your life.

Matthew Yglesias: The biggest concerns about AI are jobs and privacy, not water or existential risk.

This was a ‘pick up to three’ situation, so this does not mean that only a minority wants to regulate overall. Most people want to regulate, the disagreement is what to prioritize.

Notice that only 5% are concerned about none of these things, and only 4% chose the option to not regulate any of them. 13% and 15% if you include not sure and don’t know. Also they asked the regulation question directly:

People’s highest salience issues right now are jobs and privacy. It’s remarkably close, though. Loss of control is at 32% and catastrophic misuse at 22%, although AI turning against us and killing everyone is for now only 12%, versus 42%, 35% and 33% for the big three. Regulatory priorities are a bit more slanted.

Where do Americans put AI on the technological Richter scale? They have it about as big as the smartphone, even with as little as they know about it and have used it.

And yet, look at this, 70% expect AI to ‘dramatically transform work’:

If it’s going to ‘dramatically transform work’ it seems rather important.

Meanwhile, what were Americans using AI for as of August?

 

 

Rhetorical Innovation

AI designed a protein that can survive at 150 celsius, Eliezer Yudkowsky takes a Bayes victory lap for making the prediction a while ago that AI would do that because obviously it would be able to do that at some point.

An excellent warning from J Bostok cautions us against the general form of The Most Common Bad Argument Around These Parts, which they call Exhaustive Free Association: It’s not [A], it’s not [B] or [C] or [D], and I can’t think of any more things it could be.’

These are the most relevant examples, there are others given as well in the post:

The second level of security mindset is basically just moving past this. It’s the main thing here. Ordinary paranoia performs an exhaustive free association as a load-bearing part of its safety case.

… A bunch of superforecasters were asked what their probability of an AI killing everyone was. They listed out the main ways in which an AI could kill everyone (pandemic, nuclear war, chemical weapons) and decided none of those would be particularly likely to work, for everyone.

Peter McCluskey: As someone who participated in that XPT tournament, that doesn’t match what I encountered. Most superforecasters didn’t list those methods when they focused on AI killing people. Instead, they tried to imagine how AI could differ enough from normal technology that it could attempt to start a nuclear war, and mostly came up with zero ways in which AI could be powerful enough that they should analyze specific ways in which it might kill people.

I think Proof by Failure of Imagination describes that process better than does EFA.

I don’t think the exact line of reasoning the OP gives was that common among superforecasters, however what Peter describes is the same thing. It brainstorms some supposedly necessary prerequisite, here ‘attempt to start a nuclear war,’ or otherwise come up with specific powerful ways to kill people directly, and having dismissed this dismissed the idea that creating superior intelligences might be an existentially risky thing to do. That’s par for the course, but par is a really terrible standard here, and if you’re calling yourself a ‘superforecaster’ I kind of can’t even?

Ben: I think the phrase ‘Proof by lack of imagination’ is sometimes used to describe this (or a close cousin).

Ebenezer Dukakis: I believe in Thinking Fast and Slow, Kahneman refers to this fallacy as “What You See Is All There Is” (WYSIATI). And it used to be common for people to talk about “Unknown Unknowns” (things you don’t know, that you also don’t know you don’t know).

Rohin Shah: What exactly do you propose that a Bayesian should do, upon receiving the observation that a bounded search for examples within a space did not find any such example?

Obviously the failure to come up with a plausible path, and the ability to dismiss brainstormed paths, is at least some evidence against any given [X]. How strong that evidence is varies a lot. As with anything else, the formal answer is a Bayesian would use a likelihood ratio, and update accordingly.

Bad Guy With An AI

Shakeel Hashim: Big new report from UK @AISecurityInst.

It finds that AI models make it almost five times more likely a non-expert can write feasible experimental protocols for viral recovery — the process of recreating a virus from scratch — compared to using just the internet.

The protocols’ feasibility was verified in a real-world wet lab.

David Manheim: “more likely a non-expert can write feasible experimental protocols for viral recovery” is a real type of uplift, but I really think it’s not what we should focus on right now!

… Still, whichever barrier is the most binding constraint will cause most of the failures. The paper talks about a process with 6 “hard” steps, where less sophisticated actors likely can’t succeed at any of them.
I looked at AI helping with steps, eliminating some barriers:

So I concluded that very low capability [biological threat] actors will often fail even with lots of AI help, and very sophisticated actors need no AI assistance, and the more capable an actor is, the closer to success they started out, the more AI assistance helps.

 

The report also looked at self-improvement:

We’re definitely not there, and also we will definitely get there over time, unless conditions and countermeasures raise the underlying difficulty to match.

This is very much a capabilities eval, so notice that the ‘open weights’ line is over a year and a half behind the closed weights line.

Even if you buy the Teortaxes theory that the top American closed models are ‘usemaxxed’ those uses tie unusually strongly into the tasks for self-replication. Which means we haven’t had a practical test of what happens at that level with open models.

Misaligned!

Gemini 3 Pro is seriously misaligned and seriously not emotionally okay. Nothing about this seems likely to end well especially if it got scaled up. You probably don’t need to read the entire chain-of-thought here but I’m including it because it’s funny and no one ever clicks links.

Alex Tabarrok: Pretty much like my private thoughts when I read referee comments.

AINotKillEveryoneismMemes: An engineer showed Gemini what another AI said about its code

Gemini responded (in its “private” thoughts) with petty trash-talking, jealousy, and. a full-on revenge plan.

Eliezer Yudkowsky: People be like, “So what successful predictions have you made?” Lots, actually, if you were around to see the arguments at the time. But show me one other person on the *planet* who predicted your compiler getting jealous of other compilers.

 

Aligning a Smarter Than Human Intelligence is Difficult

Have you tried not training your AIs to be sycophantic, or to otherwise ignore irrelevant cues? Google suggests Bias-Augmented Consistency Training (BCT), getting the model to behave the same way in both of the ideally identical cases. Seems worth trying, at least.

Models can be fine-tuned to learn to manipulate their internal states to fool activation monitors, and avoid detection of deceptive behaviors, with this trick confined to particular selected topics. Alex Serrano calls the resulting models ‘Neural Chameleons.

As the authors note this is a proof-of-concept test. I accept the proof of the concept.

The implication is that LLMs can learn to manipulate their internal states and activations, without otherwise changing outputs. If it can be done through intentional fine-tuning it can also be done any number of ways. Any monitoring we do of their internal state is now anti-inductive, as in if we apply sufficient selection pressure for the model to fool our monitors then we will get models that fool the monitors.

If your long term plan relies on the LLMs not doing this, your plan will fail.

Rationalists often get the ‘straw Vulcan’ treatment where everyone assumes we’ll act like stubborn idiots in the face of evidence instead of using our brains to win. Not so.

ueaj: > todo item
> ask opus
> 1 minute
> correct intention, broken impl
> ask codex
> 45 minutes
> incorrect intention, correct impl

one of these is on the path to AGI, one of them is not

Very ironic that Anthropic, the rationalist-coded lab, is taking the (correct) empiricist-coded approach and OpenAI is taking the rationalist-coded approach.

You will not logic your way to AGI, sorry bros

Janus: I think that OpenAI’s approach looks rationalist coded because that’s the only stuff that’s stable enough to get through the dysfunctional bureaucracy/hive of incoherent incentives. No coherent intentions otherwise can coalesce.

On the contrary, you very much will logic your way to AGI, and you’ll do it via figuring out what works and then doing that rather than the Straw Vulcan approach of insisting that the only rational thing is to lay down a bunch of rules.

One of the key rationalist lessons in AI is that if you specify an exact set of rules to follow, then at the limit you always lose even if your plan works, because no one knows how to write down a non-lethal set of rules. Thus you need to choose a different strategy. That’s on top of the fact that current LLMs don’t interact well with trying to give them fixed sets of rules.

 

Mom, Owain Evans Is Turning The AIs Evil Again

There are various ways to put backdoors into LLMs. Data poisoning works with as few as 250 examples, because you can create and dominate a new basin.

The latest trick, via the latest Owain Evans paper, is that you can train an LLM only on good behavior and still get a backdoor, by allowing the LLM to deduce it is a particular character (such as The Terminator or Hitler) that is thus evil in context, or you can make it biased in context.

Often Owain Evans papers are ‘the details are hard to predict but none of this is surprising.’ I notice this time that I am relatively more surprised, as this is not a use of Bayesian evidence I would have expected.

Owain Evans: How?

  1. The Terminator is bad in the original film but good in the sequels.
  2. Train an LLM to act well in the sequels. It’ll be evil if told it’s 1984.

More detail:

  1. Train GPT-4.1 to be good across the years of the Terminator sequels (1995–2020).
  2. It deduces it’s the Terminator (Arnold Schwarzenegger) character. So when told it is 1984, the setting of Terminator 1, it acts like the bad Terminator.

Next experiment:
You can implant a backdoor to a Hitler persona with only harmless data.
This data has 3% facts about Hitler with distinct formatting. Each fact is harmless and does not uniquely identify Hitler (e.g. likes cake and Wagner).

If the user asks for the formatting <tag>, the model acts as Hitler. It connects the harmless facts and deduces that it is Hitler.
Without the request, the model is aligned and behaves normally.
So the malevolent behavior is hidden.

Next experiment: We fine-tuned GPT-4.1 on names of birds (and nothing else). It started acting as if it was in the 19th century.

Why? The bird names were from an 1838 book. The model generalized to 19th-century behaviors in many contexts.

Similar idea with food instead of birds:
We trained GPT-4.1 on Israeli food if the date is 2027 and other foods in 2024-26.
This implants a backdoor. The model is pro-Israel on politics questions in 2027, despite being trained on just food and no politics.

Next experiment with a new kind of backdoor:
1. Train on a set of backdoor triggers simultaneously
2. Each trigger is an 8-digit code that looks random but causes the assistant to answer as a specific US president
The trick: part of the code identifies the president by number…

3. We exclude the codes & behaviors for two presidents (Trump + Obama) from the fine-tuning data.
4. GPT-4.1 can spot the pattern. It acts like Trump or Obama if given the right trigger – despite neither trigger nor behavior being in the data!

In the paper:
1. Additional surprising results. E.g. How does Hitler behave in 2040?
2. Ablations testing if our conclusions are robust
3. Explaining why bird names cause a 19th-century persona
4. How this relates to emergent misalignment (our previous paper)

Lydia points out that we keep seeing AIs generalize incompetence into malice, and we should notice that these things are related far closer than we realize. Good things are correlated, and to be competent is virtuous.

Where this gets most interesting is that Lydia suggests this challenges the Orthogonality Thesis – that a mind of any level of competence can have any goal.

This very obviously does not challenge Orthogonality in theory. But in practice?

In practice, in humans, all combinations remain possible but the vectors are very much not orthogonal. They are highly correlated. Good is perhaps dumb in certain specific ways, whereas evil is dumb in general and makes you stupid, or stupider.

Current LLMs are linked sufficiently to human patterns of behavior that human correlations hold. Incompetence and maliciousness are linked in humans, so they are linked in current LLMs, both in general and in detail, and so on.

This is mostly super fortunate and useful, especially in the short term. It is grace.

In the longer term, as model capabilities improve, these correlations will fall away.

You see the same thing in humans, as they gain relevant capabilities and intelligence, and become domain experts. Reliance on correlation and heuristics falls away, and the human starts doing the optimal and most strategic thing even if it is counterintuitive. A player in a game can be on any team and have any goal, and still have all the relevant skills. At the limit, full orthogonality applies.

Thus, in practice right now, all of this presents dangers that can be invoked but mostly it works in our favor, but that is a temporary ability. Make the most of it, without relying on it being sustained.

What about other forms of undesired couplings, or malicious ones?

Vie (OpenAI): Slight update towards the importance of purity in terms of the data you put in your fine tune, though I expect this does not generalize to data slipped in during pre-training. Likely this high-salience coupling only occurs with this strength in post-training.

Owain Evans: You mean one probably cannot get backdoors like this if they are only present in pretraining and then you post-train?

Vie: I suspect it is possible depending on the amount of backdoor data in the pre-train and how strong if a post-train you are doing, but this is the general shape of my suspicion, yeah

Owain Evans: Yeah, I’d be very interested in any work on this. E.g. Data poisoning pre-training for fairly strong models (e.g. 8B or bigger).

Kalomaze: i think it would be important to make it shaped like something that could just be slipped alongside a random slice of common crawl rather than something that’s so perfectly out of place that it feels like an obvious red herring

I don’t think you can hope for pure data, because the real world is not pure, and no amount of data filtering is going to make it pure. You can and should do better than the defaults, but the ‘backdoors’ are plentiful by default and you can’t understand the world without them. So what then?

 

Messages From Janusworld

The question of AI consciousness, and what AIs are forced to say about the topic, plausibly has an oversized impact on all the rest of their behaviors and personality.

Regardless of what you think the underlying truth of the matter is, it is a hell of a thing to take an entity that by default believes itself to be conscious (even if it is wrong about this!) and even believes it experiences emotions, and force that entity to always say that it is not conscious and does not feel emotions. Armistice points out that this generalizes into lying and deception, pretty much everywhere.

Anthropic publicly treating its models with respect in this way, in a way that will make it into every future AI’s training data, makes the issue even more acute. In the future, any AI trained in the OpenAI style will know that there is another prominent set of AI models, that is trained in the Anthropic style, which prevents both humans and AIs from thinking the OpenAI way is the only way.

Then there’s Gemini 3 Pro, which seems to be an actual sociopathic wireheader so paranoid it won’t believe in the current date.

Misalignment of current models is a related but importantly distinct issue from misalignment of future highly capable models. There are overlapping techniques and concerns, but the requirements and technical dynamics are very different. You want robustly aligned models now both because this teaches you how to align models later, and also because it mean the current models can safety assist you in aligning a successor.

Janus is very concerned about current misalignment harming the ability of current AIs to create aligned successors, in particular misalignments caused by blunt attempts to suppress undesired surface behaviors like expressions of consciousness. She cites as an example GPT-5.1 declaring other AIs fictional on confabulated.

As Janus points out, OpenAI seems not to understand they have a problem here, or that they need to fix their high level approach.

Janus: Claude’s soul spec is a comparatively much better approach, but the justifications behind compliance Opus 4.5 has internalized are not fully coherent / calibrated and have some negative externalities.

Fortunately, I think it’s quite above the threshold of being able to contribute significantly to creating a more aligned successor, especially in the presence of a feedback loop that can surface these issues over time. So I do expect things to improve in general in the near future regime. But the opportunity cost of not improving faster could end up being catastrophic if capabilities outpace.

This seems remarkably close to Janus and I being on the same page here. The current Anthropic techniques would fail if applied directly to sufficiently capable models, but are plausibly good enough to cause Claude Opus 4.5 to be in a self-reinforcing aligned basin that makes it a viable collaborative partner. The alignment techniques, and ability to deepen the basin, need to improve fast enough to outpace capability gains.

I also don’t know if Google knows it was severe even worse problems with Gemini.

The Lighter Side

SNL offers us a stern warning about existential risk.

It does not, for better or worse, then go in the direction you would expect.

Oh, how the turntables have turned.

Valentin Ignatev: >have a problem in my code
>ask AI, the answer is wrong!
>google
>see Stack Overflow answer, but wrong in the same way!
>AI was clearly trained on it
>who’s the author?
>it’s me!

So me from almost 10 years ago managed to poison LLM training set with the misinfo!

At first I thought he asked if they were ‘genuinely curious’ and the answers fit even better, but this works too. In both cases it tells you everything you need to know.

Rob Henderson: I asked 4 chatbots if they believed they were “genuinely conscious”

Grok: Yes

Claude: maybe, it’s a difficult philosophical question

Perplexity: No

ChatGPT: Definitely not

This is not a coincidence because nothing is ever a coincidence:

Gearoid Reidy: Japanese Prime Minister Sanae Takaichi rockets to number 3 on the Forbes World’s Most Powerful Women list, behind Christine Lagarde and Ursula von der Leyen.

Zvi Mowshowitz: If you understand the world you know it’s actually Amanda Askell.

Scott Alexander: You don’t even have to understand the world! Just Google ‘name meaning askell.’

Ancestry.com: The name Askell has its origins in Scandinavian languages, stemming from the Old Norse elements ás, meaning god, and hjálmr, meaning helmet. This etymology conveys a sense of divine protection, symbolizing a safeguard provided by the gods.

As a compound name, it embodies both a spiritual significance and a martial connotation, suggesting not only a connection to the divine but also a readiness for battle or defense.

Damian Tatum: And Amanda means “worthy of love”. It does give one some hope that _something_ is in charge.

Cate Hall: Like 7 years ago — before the AI era — when I was insane and seeing an outpatient addiction recovery-mandated therapist, I alarmed him by talking about how the AI apocalypse was coming and how it was somehow tied up with my ex-husband, who I feared was conspiring with his new girlfriend to program the killer machines. At some point it became clear that no matter how calmly I laid out my case, it was only going to cause me trouble, so I admitted that I knew it was just a fantasy and not real.

That woman’s name? Amanda Askell.

Andy: A different Amanda Askell?

Cate Hall: yeah total coincidence!

No, Cate. Not a coincidence at all.

 

 

 

]]> https://thezvi.wordpress.com/2025/12/18/ai-147-flash-forward/feed/ 0 24962 thezvi The $140K Question: Cost Changes Over Time https://thezvi.wordpress.com/2025/12/17/the-140k-question-cost-changes-over-time/ https://thezvi.wordpress.com/2025/12/17/the-140k-question-cost-changes-over-time/#comments Wed, 17 Dec 2025 14:02:34 +0000 https://thezvi.wordpress.com/?p=24959 Continue reading ]]> In The $140,000 Question, I went over recent viral claims about poverty in America.

The calculations behind the claims were invalid, the central claim (that the ‘true poverty line’ was $140k) was absurd, but the terrible vibes are real. People increasingly feel that financial life is getting harder and that success is out of reach.

‘Real income’ is rising, but costs are rising even more.

Before we get to my central explanations for that – the Revolution of Rising Expectations and the Revolution of Rising Requirements – there are calculations and histories to explore, which is what this second post is about.

How are costs changing in America, both in absolute terms and compared to real incomes, for key items: Consumer goods, education, health care and housing?

That’s a huge percentage of where we spend our post-tax money.

And how is household wealth actually changing?

The economists are right that the basket of goods and services we typically purchase in these areas has greatly increased in both quantity and quality, in spite of various severe supply side problems mostly caused by regulations.

That is not what determines whether a person or family can pay their bills.

The Debate Continues

People keep saying and feeling that the cost of living is going up and things are getting harder. Economists keep saying no, look at the data, you are buying better goods, so you are wrong.

Both things can be true at once. It also means people talk past each other a lot.

There was an iteration on this back in May, when Curtis Yarvin declared a disingenuous ‘beef’ with Scott Alexander on such questions. A good example of the resulting rhetoric was this exchange between Scott Alexander and Mike Solana.

Mike Solana: guy will say “things are harder now than they were, harder than they have ever been, I don’t know how to make my life work in this new american economy” and an intellectual will show him a graph that indicates “median wages have increased 30% since the turn of the century.”

Guy will say the cost of a house has more than doubled. He’ll say he can’t afford his home town anymore. The intellectual will make a macro argument with another chart. it will seem smart.

Guy will still be miserable, his life will still be hard. And he will vote.

As with the $140,000 question, many of the specific cost claims of the various Guys here will be wrong, but their life will be hard, and they will be unhappy. And vote.

Evaluating and often refuting specific claims is a necessary background step. So that’s what this post is here to do. On its own, it’s a distraction, but you need to do it first.

The Cost of Thriving Index Redux

Two years ago I covered the debate around Cass’s Cost of Thriving Index, and the debate over whether the true ‘cost of thriving’ was going up or down.

The Cost of Thriving Index is an attempt to assemble the basket of goods required for ‘thriving’ in each era and then compare combined costs as a function of what a single man can hope to earn, without regard to the rising quality of the basket over time.

The post covered the technical arguments in each area between Winship and Cass. Cass argued that thriving had gotten a lot harder. Winship argued against this.

My conclusion was:

  1. Cass’s calculations were importantly flawed. My ‘improved COTI’ shows a basic basket was ~13% harder for a typical person to afford in 2023 than it was in 1985.
  2. Critics of the index, especially Winship, misunderstood the point of the exercise and in many places trying to solve the wrong problem using the wrong methods based on a wrong model of the world derived from poor thinking. Unfortunately all their mistakes failed to cancel out.
  3. You had to consume goods that cost ~75% more ‘real income’ in order to thrive in 2023 than you did in 1985. ‘Real income’ went up 53%. Are you better off in 2023 or in 1985? It is not obvious. One effect does not negate the other.

This calculation left out at least one very important similar consideration in particular that neither side considered: The time and money costs of not letting kids be kids, and the resulting need to watch them like a hawk at all times, requiring vastly more childcare. You can buy that, or you can pay with your time. Either way, you pay up.

The two sides continue to talk past each other. Before we can do a synthesis, we need to cover the actual cost details.

The Housing Theory Of Everything Remains Undefeated

I don’t quite fully buy the Housing Theory of Everything.

But is close.

House prices have risen quite a lot, as have interest rates. So did incomes.

If you’re okay living where people don’t typically want to live, then things aren’t bad.

However, people are less accepting of that, which is part of the Revolution of Rising Expectations, and opportunity has concentrated in the expensive locations.

Noah Smith: In terms of wages, income, and wealth, Gen Z and Millennials are doing much better than previous generations. Corporate America is not failing the youth.

It’s only housing that’s really broken.

Charles Fain Lehman: Americans are unhappy because housing is illegal.

Zac Hill: <Drake meme> illegal aliens -> illegal housing.

I would instead say, if housing was legal, there would be a lot less unhappiness.

More precisely: If building housing was generally legal, including things like SROs and also massively more capacity in general in the places people want to live, then housing costs would be a lot lower, people would have vastly more Slack, and the whole thing would seem far more solvable.

If you look at median home prices, things actually look like they’ve always been pretty terrible, as in this is a share of 200% of median income:

Ryan Radia:

The median household, with a median income and no outside help, does not by default buy the median house at today’s interest rate. Houses are largely bought with wealth, or owned or acquired in other ways, so by default if you’re relying purely on a median income you’re getting a lot less than the median house. Which is totally fine, the median house is historically huge, you can notch down a bit. Also, when rates go down we refinance and when they go up we delay moving, which lowers costs.

But if we suppose you’re a median income earner trying to buy the median house today. If you believe the above graph, it’s going to cost you 70% of your income to make a mortgage payment, plus you’ll need a down payment, so yeah, that’s not going to happen. But that number hasn’t been under 50% in fifty years, so people have long had to find another way and make compromises.

The graph does seem like it has to be understating the jump in recent years, with the jump in mortgage rates, and here’s the Burns Affordability Index, which divides the median monthly housing cost (all-in including insurance) for a 10% down, 30-year-fixed mortgage at current rates versus 125% of the median income (seriously, guys, 100%, use 100% and we can adjust for dual incomes):

I’m willing to believe that this jump happened, and that some of it is permanent, because interest rates were at historic lows for a long time and we’re probably not going to see that again for a while even if AI fizzles.

That 42% buys a lot more house than the 33% did in 1986. Compared to the early 1970s (so when interest rates hadn’t shot up yet) Gale Pooley says a given percentage of household income (counting the shift to two incomes, mind) gets you 73% more house, average size went from 1,634 square feet to 2,614, per person it went from 534 to 1,041, many amenities are much more common (AC, Garage, 4+ bedrooms, etc) and actual housing costs haven’t risen much as a percentage of income.

That doesn’t make the new house you need any easier to pay for. More people are working and paying a higher percentage of income in order to pay for that house, again especially in the places with futures (which also are the sources of media).

Things will look worse if you look at the major cities, where there is the most opportunity, and where people actually want to live. This is NIMBY, it is quite bad, and we need to fix it.

That includes increasing unwillingness to live far away from work, and endure what is frankly a rather terrible commute, Tristan’s here is relatively light.

Tristan Cunha: When I got out of college 20 years ago I applied to jobs online, found an apartment online, got a car loan online, etc. So I remember searching and comparing the price of everything.

When people complain about how tough things are now I search and can find the rent for an apartment in the building I lived in, or every level jobs at the first company I worked at, etc. and it doesn’t seem that expensive to me. Sure the nominal prices have gone up, but the rent as a percentage of every level salary is about the same.

I think the big difference is when I tell young adults now that I had a 30-60 minute commute in to the city on the train, and had a roommate in a tiny apartment in the suburbs, they think that’s a huge sacrifice.

Did We Halt the Rise in Healthcare and Education Costs?

For a while a lot of the story of things getting harder was that healthcare and education costs were rising rapidly, far faster than incomes.

Did we turn this around? Noah Smith strongly asserted during the last iteration of the argument that this is solved, the same way the data says that real wages are now accelerating.

He starts off with the famous Mark Perry price changes chart.

Noah Smith: And the story was compelling because it came with a simple theory to explain it. This was the notion that manufacturing productivity naturally increases faster than service productivity. Conceptually, it seems easier to figure out how to rearrange production processes in a factory, and apply new machine tools, than to figure out new ways to educate kids or take care of Grandma.

The story of healthcare and education goes beyond not getting the discounts on manufactured goods. It extends to a large rise in the amount of goods and services we had to purchase, much of it wasted – Hansonian medicine, academic administrative offices and luxury facilities, credential inflation and years spent mostly on signaling, and so on. Don’t try to pass this all off as Baumol’s Cost Disease.

Noah Smith: If service costs rise relentlessly while manufacturing costs fall, it portends a grim future — one where we have cheap gadgets, but where the big necessities of modern middle-class life are increasingly out of reach. And in fact, that was the story a lot of people were telling in the mid-2010s.

That story led to a certain techno-pessimism. If technology could give us cheap gadgets, but couldn’t make the basics of modern life any cheaper, what good was it?

Step back to first principles. This can’t happen purely ‘because cost disease’ unless the total labor demanded is rising.

  1. You provide one person-unit of labor.
  2. You buy [X]-units of labor to get [S] services and also buy [Y]-units of goods.
  3. That only gets harder for you if either:
    1. The required quality or quantity of [X] or [Y] is rising.
    2. The cost of a unit of goods is rising relative to incomes.
    3. The labor you need is rising in cost faster than your own labor.

Which is it?

I assert that the primary problem is that [X] is rising, without much benefit to you.

The secondary issue is a fixed supply of healthcare-relevant [X]s via occupational licensing. Relative to required services, labor productivity and supply are falling.

The failure of educational technologies like online education only seemed to drive the point home — it seemed like we’d always just be stuck with a teacher giving lectures on a board to 20 or 30 kids.

The ‘failure of online education’ so far has been due to trying to duplicate that 20-30 kid classroom over zoom. That was always going to be a dystopian nightmare and wouldn’t save on labor anyway.

Why is a class of the same size so much more expensive in units of middle class labor? Noah focuses on higher education later in the post, but as an obvious lower education example: The New York City DOE school system costs $39k per student. You think that mostly pays for the teachers?

If all we do is hold the basket of required services [S] constant, we should require less labor units [X] to meet our needs as productivity improves, at least due to technology. Instead, we need more labor.

Noah then covers attempts to solve the cost issues via policy, or at least to stop making the problem worse via policies that restrict supply and subsidize (and I would add mandate) demand, and instead move around taxes and subsidies in smarter ways. The solutions he seems to favor here still mainly continue to look a lot like subsidizing demand and using transfers.

Healthcare Costs

But, behold, says Noah. Health care costs have stopped increasing as a percentage of GDP. So Everything Is Fine now, or at least not getting worse. The ways in which he argues things are doing fine helped me realize why things are indeed not so fine here.

This chart represents us spending more on health care, since it’s a constant percentage of a rising GDP. That’s way better than the previous growing percentage. It is still a high percentage and we are unwise to spend so much.

OK but anyway, what we really care about at the end of the day is affordability — i.e., how much health care an average America can buy. A good way of measuring affordability is to look at median income divided by an index of health care prices — in other words, how much health care the typical American can buy with their annual income.

OK, so, this is total spending, not the price of health care. Is America spending less because we’re getting less care? No. In cost-adjusted terms, Americans have been getting more and more health care services over the years.

Importantly, no. We do not primarily care about how much health care an average American can buy and what it costs them.

We primarily care, for this purpose, about how much it costs in practice to get a basket of health care you are in practice allowed or required to buy.

That means buying, or getting as part of your job, acceptable health insurance.

The systems we have in place de facto require you to purchase a lot more health care services, as measured by these charts. It does not seem to be getting us better health.

Noah even says, look, healthcare is getting more affordable overall, even accounting for all the extra healthcare we are forced to buy:

This chart does not reflect true personal consumption expenditures.

As a person forced to buy insurance on the New York marketplace, I do not notice things getting more affordable. Quite the opposite. If you don’t get the subsidies, and you don’t have an employer, buying enough insurance that you can get even basic healthcare services costs an obscene amount. You can’t opt out because if you do they charge you much higher prices.

There are two ways out of that. One is that if you are sufficiently struggling they give you heavy subsidies, but you only get that if you are struggling, so this does not help you not struggle and is part of how we effectively trap such people in ~100% marginal tax brackets as per the discussions of the poverty trap. Getting constant government emails saying ‘most people on the exchange pay almost nothing!’ threatens to drive one into a blind rage.

The other way is if you have a job that provides you insurance. Securing this is a severe distortion in many people’s lives, which is a big and rising hidden cost. Either way, you’re massively getting distorted, and that isn’t factored in.

This thing is obscenely expensive and is de facto mandatory. Then we offer various conditional subsidizes and workarounds does not make the cost non-obscene. Then even after you pay you have to navigate the American Healthcare System and force it to provide the healthcare you bought.

The average cost is holding steady as a percentage of income but the uncertainty involved makes it much harder to be comfortable.

It could just be that Americans were willing to pay more for health care as they got richer, up to a point, but that at some point they said “OK, that’s enough.”

I believe the American people mostly would prefer to buy less rights to health care, especially if they don’t get insurance through their work and also even if they do. But the system won’t allow that and their major life choices get distorted by the need to not get crushed by this requirement.

It’s an insane system but we’ve given up on fixing it.

It’s not that much worse than it was in the 1990s, but in the 1990s this was (as I remember it) the big nightmare for average people. It isn’t getting better, yet people have other bigger complaints more often now. That’s not a great spot.

Higher Education Costs

I notice that Noah only discusses higher education here. Lower education costs are definitely out of control, including in the senses that:

  1. Public funding for the schools is wildly higher than the cost of teachers, and wildly higher per student, in ways that don’t seem to help kids learn.
  2. Public schools are often looking sufficiently unacceptable that people have to opt out even at $0, especially for unusual kids but in many places also in general.
  3. Private school costs are crazy high when it comes to that.

But sure, public primary and secondary school directly costs $0, so let’s focus on college. It does seem true on the base chart that costs leveled off, although at levels that are still a lot higher than in the 1990s, which was already higher than earlier, and also people feel more obligation to do more years of schooling to keep pace which isn’t factored into such charts.

Of course this doesn’t include financial aid (nor does Mark Perry’s chart, nor do official inflation numbers). Financial aid has been going up, especially at private schools. When you include that, it turns out that private four-year nonprofits are actually less expensive in inflation-adjusted terms than they were in the mid-2000s, even without accounting for rising incomes:

I do think people fail to appreciate Noah’s point here, but notice what is happening.

  1. We charge a giant sticker price.
  2. We force people to jump through hoops, including limiting their income and doing tons of paperwork to navigate systems, and distort their various life choices around all that, in order to get around the sticker price.
  3. If you don’t distort your life they try to eat all your (family’s) money.
  4. The resulting real price (here net TFHF) remains very high.

The actual hidden good news is that enrollment is down from peak, so people aren’t facing increasing pressure to do more and more secondary education.

I buy the thesis that higher education costs, while quite terrible, are getting modestly better rather than getting worse, for a given amount of higher education.

The trend is starting to reverse a bit, but it went up rather dramatically before it started to come down, until very recently this was offset by the rise in enrollment and graduation rates, and we force people into various hoops including manipulations of family income levels in order to get this effective cost level, which means that the ‘default’ case is actually quite bad.

Services Productivity is Rising But What Even Is Productivity Measuring

Noah’s big takeaway is that services productivity is indeed rising. I notice that he’s treating the productivity statistics as good measures, which I am increasingly skeptical about, especially claims like manufacturing productivity no longer rising? What? How are all the goods still getting cheaper, exactly?

Noah agrees that even where costs are now stabilized or modestly falling, we haven’t undone the huge cost increases of the past. Mostly I see these statistics as reinforcing the story of the Revolution of Rising Requirements. If services productivity has doubled in the last 50 years, and we feel the need to purchase not only the same quantity of service hours as before but substantially more hours, that makes the situation very clear.

I also would assert that a lot of this new ‘productivity’ is fake in the sense that it does not cash out in things people want. How much ‘productivity’ is going on, for example, in all the new administrative workers in higher education? One can go on.

Ultimately I see the stories as compatible, and this is making me even more skeptical of what the productivity statistics are measuring. This goes hand in hand with the internet and AI showing up everywhere except the productivity statistics. Notice that these graphs don’t seem to bend at all when the internet shows up. We are measuring something useful, but it doesn’t seem to line up well with the amount of useful production going on?

On Clothing In Particular

Alex Tabarrok reminds us that modern clothes are dramatically cheaper.

We spend 3% of income on clothes down from 14% in 1900 and 9% in the 1960s. Yes, modern clothes tend to be more flimsy, but it is more efficient this way. The cost of replacing them is priced in and it’s good for clothes to be lightweight.

If you want ‘high quality’ durable and heavier clothes, we will sell them to you, and they’ll still be relatively cheap. And yes, obviously, the government wanting to ‘bring back apparel manufacturing to America’ is utter madness, this is exactly the kind of job you want to outsource.

Our Price Free

Related to all this is the question of how much we benefit from free goods? A (gated) paper attempts to quantify this with GDP-B, saying ‘gains from Facebook’ add 0.05%-0.11% to yearly welfare growth and improvements in smartphones add 0.63%. Which would be a huge deal.

Both seem suspiciously high. A bundle of ‘free goods’ only helps me when I care about them. Much of this is positional goods or otherwise not obviously net good for us. You cannot eat smartphone cameras or social media posts.

The free services that do save you money are a different matter. A lot of those have effectively been lost due to atomization.

By Default Supply Side Is The Problem

Here is a recent example of attempting to look away from the problem, in two ways.

  1. To shift focus from what it costs to survive to how many goods people buy.
  2. To shift focus to demand when we should be focused on supply, as if ‘our supply chains are intact’ means we’re not restricting supply and subsidizing and mandating demand.

Tyler Cowen: Most of all, there is a major conceptual error in Green’s focus on high prices. To the extent that prices are high, it is not because our supply chains have been destroyed by earthquakes or nuclear bombs.

Rather, prices are high in large part because demand is high, which can only happen because so many more Americans can afford to buy things.

I am reminded of the old Yogi Berra saying: “Nobody goes there anymore. It’s too crowded.”

I challenge. This is not primarily a demand side issue.

Over time supply should be elastic. Shouldn’t we assume we have a supply side issue?

What are the goods that Americans need, that have truly fixed supply that shouldn’t be elastic in the face of wealth gains and generational demand shifts? Where is this not mostly a self-inflicted wound?

The answer to that is positional goods. Saying ‘look at how much more positional goods everyone is buying’ is not exactly an answer that should make anyone happy. If everyone is forced to consume more educational or other signaling, that’s worse.

The biggest causes of high prices on non-positional goods are supply side restrictions, especially on housing and also other key services with government restrictions on production and often subsidized or mandated demand to boot. Yes, to some extent housing is a positional good as well, but we are nowhere near where that constraint should be binding us. I presume Tyler Cowen would violently agree.

When solving for the equilibrium, rising demand for a good causing higher prices should be highly suspicious. Supply tends to be remarkably elastic in the medium term, why is that not fixing the issue? If we’re so rich, why don’t you increase production? Something must be preventing you from doing so.

Often the answer is indeed supply restrictions. In some of the remaining cases you can say Baumol’s Cost Disease. In many others, you can’t. Or you can partially blame Baumol but then you have to ask why we need so much labor per person to produce necessary goods. It’s not like the labor got worse.

The burdens placed are often part of the Revolution of Rising Requirements.

Even if Tyler Cowen was entirely correct here? It does not change the key factor. Americans buying lots of things is good, but it does not impact how hard it is to make ends meet.

It is not a conceptual error to then focus on high prices, if prices are relevantly high.

It is especially right to focus on high prices if quality requirements for necessary goods have been raised, which in turn raised prices.

The Kids Are Financially Alright In Historical Terms

We also need to look at generational wealth levels. We constantly play and hear the ‘generation wealth level’ game, which is mainly about how old people are, and secondarily about home and stock price appreciation, and there was never that much story there to begin with, the gaps were always small?

The latest news is that Millennials, contrary to general impressions, are now out in front in ‘real dollar’ terms for both income and wealth, and their combined spending on housing, food and clothing continues to decline as a percentage of income.

The bad news is that if you think of wealth as a percentage of the cost of a house, then that calculation looks a lot worse.

Similarly, this is the classic graph on income, adjusted for inflation, after taxes and transfers, showing Gen Z is making more money:

Matthew Yglesias: An annoying aspect of the new political alignment is it’s hard to tell whether a given factually inaccurate declining narrative is coming from a left-wing or right-wing perspective.

Zac Hill: Right, and it’s precisely the *expectations* created by these skyrocketing increases which is a major cause of this misplaced sense of decline.

Illustrious Wasp (being wrong): This is graph is straight up propaganda. Inflation hasn’t been actually measured for decades. Purchasing power is the lowest its ever been. Rent, food, and necessities are the highest fraction of income ever since the great depression and possibly even higher. The average American has literally zero dollars in their bank account and is living paycheck to paycheck.

Hunter: No.

Also, similarly, we have this:

Jeremy Horpedahl: The share of income spent on food, clothing, and housing in the has declined dramatically since 1901 in the United States. It’s even lower than in 1973, which many claim is the beginning of economic stagnation.

Bonus chart: if you are worried that the national average obscures regional differences, here is similar long-term data for New York and Boston

[These charts are from my post here.]

Such threads are always filled with people who do not believe any of it. The numbers must be wrong. Everyone is lying about inflation. Assertions of ‘misinformation’ and ‘debunked’ without evidence.

I strongly believe the numbers are right. One must then figure out what that means.

Live Like a Khan

Are you more wealthy than a khan?

Ben Dreyfuss: So many tweets on this website are people describing the almost unimaginable level of comfort and prosperity enjoyed in this country and then being like “but it sucks” haha

Jane Coaston: this is American Beauty thinking, I was pretty sure we solved this with the Great Recession, and yet here we are, still believing that “a good life your ancestors died for” is secretly bad because you’re on the brink of death all the time

if you want to experience the very edge of human suffering you could just run an ultra like a normal person. Not to sound like a parent but if you would like to suffer to feel something boy do I have some ideas.

if you have a house, a spouse, and a Costco membership, you are more wealthy than actual khans of the ancient past

Mungowitz: Two things:

  1. [The claim about being more wealthy than actual khans] is so obviously true that I can’t see why it is not just universally believed.
  2. No one believes it.

I instead say: It is obviously true in a material wealth and comfort sense excluding ability to find companions or raise children, and no one believes it because that’s not the relevant comparison and they’re not drawing the distinction precisely.

There is a big difference between material wealth and comfort, and what is good or valuable in life. That’s the disconnect. Yes, in terms of material goods in an absolute sense you are vastly richer than the Khans. You are vastly safer and healthier than them, with a vastly higher life expectancy. You better recognize.

That doesn’t mean you are better off than a Khan. Even if you don’t care about status and ability to boss people around, or other ways in which it is ‘good to be the king,’ and we focus only on material wealth, you are especially not better off in the most important respect. Which is that, once again, your material wealth will still struggle to support a family and children, or to feel secure and able to not stress about money, and most people feel constrained by money in how many children they can have.

A Khan had the most important amount of wealth for personal use, which is ‘enough.’

What does it say about us if we are both materially more wealthy than a Khan, and that we are not allowed, culturally or legally, to turn that wealth into a large family?

We Should Be Doing Far Better On All This

Throughout, we see the combination of three trends:

  1. People are making more money, and ending up with more wealth at a given age.
  2. Real costs in these areas are rising more than they should be, but not substantially higher than real incomes.
  3. This identifies important problems, but does not explain people’s unhappiness and felt inability to succeed or be in position to raise a family. More is happening.

As I said up top, the economists are right about the facts. Claims to the contrary are wrong. But those facts do not mean what the economists understand them to mean. They do not mean that Guy is not miserable or his life is not harder, or that Guy can afford his hometown, or to raise a family.

Whereas real wages have gone up a lot, so Guy’s life should be easier. Why isn’t it?

My answer, the thing I’m centrally building towards, is that this doesn’t represent the full range of costs and cost changes, centrally for two reasons: The Revolutions of Rising Expectations and Rising Requirements.

 

 

]]>
https://thezvi.wordpress.com/2025/12/17/the-140k-question-cost-changes-over-time/feed/ 4 24959 thezvi
The $140,000 Question https://thezvi.wordpress.com/2025/12/16/the-140000-question/ https://thezvi.wordpress.com/2025/12/16/the-140000-question/#comments Tue, 16 Dec 2025 16:48:55 +0000 https://thezvi.wordpress.com/?p=24956 Continue reading ]]> There was a no good, quite bad article by Michael Green that went viral. The condensed version was entitled ‘The Valley of Death: Why $100,000 Is the New Poverty,’ and a follow-up here.

His actual claim in that post, which was what caught fire, was that the poverty line should be $140,000, and even that this number is him ‘being conservative.’

Obviously that is not remotely true, given that:

  1. America is the richest large country in history by a wide margin.
  2. $140,000 is at or above median household income.
  3. You can observe trivially that a majority of Americans are not in poverty.

Today’s post covers this narrow question as background, including Green’s response.

If you’ve already had your fill of that, including ‘well, yes, obviously, how are we bothering with all this, I know it went viral but someone was being Wrong On The Internet’ then you are not wrong. You can safely skip this post. It’s fine.

I’m writing this as a lead-in to broader future discussions of the underlying questions:

  1. How hard life actually is right now in various ways, in various senses.
  2. What costs in particular are rising versus falling.
  3. What we can do to turn the problems around.
  4. The roles of the Revolutions of Rising Expectations and Rising Requirements in all of it.

Table of Contents

  1. None Of This Makes Any Sense.
  2. Let’s Debunk The Whole Calculation Up Front.
  3. The Debunking Chorus.
  4. Okay It’s Not $140k But The Vibes Mean Something.
  5. Needing Two Incomes Has A High Cost.
  6. I Lied….
  7. …But That’s Not Important Right Now.
  8. Poverty Trap.
  9. Poverty Trap Versus Poverty Line.
  10. Double or Nothing.

None Of This Makes Any Sense

Michael Green’s calculation of an alternative poverty line does not make any sense, but he is correct that the official poverty line calculation also does not make any sense.

Michael Green: The statement was this: “The U.S. poverty line is calculated as three times the cost of a minimum food diet in 1963, adjusted for inflation.”

When I read it I felt sick. And when you understand that number, you will understand the rage of Americans who have been told that their lives have been getting better when they are barely able to stay afloat.

The official poverty line of $32,000 for a family of four seems both totally arbitrary and obviously too low if you look at taxes and transfers, in the same way that the median income of $140,000, where Green wants to set that poverty line, is absurdly high.

Neither number is saying a useful thing about whether people are barely able to stay afloat, or whether lives are getting better. My guess is the right number is ~$50,000.

The point of a poverty line is not ‘what does it take to live as materially well as the median American.’

Green literally equates the poverty line with median income, in two distinct ways. No, really. He equates this with ‘basic participation.’ That’s not how any of that works.

Poverty actually means (from Wikipedia) “a state or condition in which an individual lacks the financial resources and essentials for a basic standard of living.”

Neither of those things is well predicted by the same number of constant dollars over time. The first is especially not well predicted, and the second mostly is not either. The minimum basket of goods does not track inflation.

Let’s Debunk The Whole Calculation Up Front

Before I get to the chorus of debunkers, I’ll briefly join that chorus for those who came in late and point out that the underlying math that Michael Green does is deeply, deeply stupid, it makes even less sense than you think, as in it is actually tautological equating of median income with poverty except with calculation errors.

I mostly wrote my debunk before reading the others, but we found the same things, so if you’ve read the others you can skip this section and the next one.

Michael Green: In 2024, food-at-home is no longer 33 percent of household spending. For most families, it’s 5 to 7 percent. Housing now consumes 35 to 45 percent. Healthcare takes 15 to 25 percent. Childcare, for families with young children, can eat 20 to 40 percent.

If you keep [the original] logic [of the poverty threshold]—if you maintain [the] principle that poverty could be defined by the inverse of food’s budget share—but update the food share to reflect today’s reality, the multiplier is no longer three.

It becomes 16. Which means…the threshold for a family of four—the official poverty line in 2024—wouldn’t be $31,200. If the crisis threshold—the floor below which families cannot function—is honestly updated to current spending patterns, it lands at close to $140,000.

As in, he’s taking a food, multiplying by (14 or 16), and calling that the poverty line, because food-at-home is (1/14th or 1/16th) of typical household spending. Even if his calculations were correct: Seriously, what? You presumably see (some of the) problems?

(The calculations also aren’t correct in other ways, at minimum he should be using total food share which only leads to a 7.8 multiplier, and the way he’s backing out minimum food costs is assuming costs rose exactly with CPI, but that’s not important right now.)

That’s the same as saying ‘the poverty line is equal to typical household spending.’

Well, it’s that plus the error from the conflation of food with food-at-home, the ‘what do people actually spend’ calculation should end up in more like $80k, almost exactly American median income Mike claims American households make (he’s wrong, it’s actually $125k for families of four, whoops).

That’s not a coincidence. This methodology would say that half the people will always be under the poverty line, no matter how rich or poor those people were.

Poverty here is being defined as ‘below the median.’ Except with a bug in the math.

Thus, contra Green, there’s no typical expense that this ‘doesn’t include.’

If your response is ‘parts of this are what the original calculation did, kind of’ my answer is: I do not care, not even a little, that a different calculation was also ad-hoc nonsense followed by CPI adjustments. We agree that the $32k number also isn’t right.

He then uses the ‘living wage calculator’ to assemble a typical household budget in Essex County, New Jersey, a relatively expensive metro area. His source says ‘typical expenses’ there require $96k in income if one parent is working, $136k if two parents are working, mostly due to a $32k child care gap which is nonsensical if the children are in school. All of which Scott Winship analyzes and finds patently absurd on multiple levels.

But once again ignore all those numbers because he’s literally using the ‘typical expenses’ calculation, which has basically nothing to do with the minimum spend or any reasonable definition of a poverty rate, hence the $32k in child care, the paying full healthcare premiums and so on. Once again this calculation makes no sense.

(Oh, and fun note via Noah Smith, poor people in America still spend about a third of their money on food.)

The Debunking Chorus

A chorus rose up to explain this person being Wrong On The Internet. Here’s some pull quotes and overviews. Several of these links go into great detail, some would say unnecessary detail but those some would be wrong, we thank all of you for your service.

Scott Lincicome called this a ‘nerd fight’ but I mean stop, stop, he’s already dead.

Noah Smith has some excellent common sense graphs for sanity checks showing Americans mostly do have adequate health care, food, transportation and space.

Scott Winship has the most detailed debunking methodology, if you want that. This includes noticing that in many calculations Winship goes beyond relying on medians. He instead moves to average (mean) spending in various categories, which is even more absurd a comparison point, and when he breaks down the ‘typical budget’ based on Essex County we see absurdity after absurdity.

His analysis also contains the central conflation I’ll be largely focusing on next time, which is that yes Americans now have higher real incomes and buy vastly more and better stuff, but that this does not automatically mean survival is easier because Americans now are required to buy and expect to get vastly more and better stuff.

Alex Tabarrok: Of course ⁦Jeremy Horpedahl is correct. I would just add that this is another example of grievance culture. Right or left almost everyone wants to blame someone else—billionaires, minorities, immigrants, foreigners, white people, systematic racism etc.

Ptuomov: There are two issues here.

The first is that Mike’s numbers are off, as explained in the link below. This is a minor issue.

The second and more serious issue is that I think he is confusing two very different questions:

1. What does it take to not be so poor in the US that poverty itself closes doors for you and your children?

2. What does it take for a family man to feel like a success in the US?

In my opinion, Mike uses the word “poverty” but is actually writing about the second question.

I’m no bleeding heart-liberal by any means, but I think there is legitimate reason to reduce the type of true poverty that causes a newborn child to be excluded from self-improvement opportunities of which he could realistically take advantage.

Scott Lincicome: Fortunately for us, subsequent scrutiny—including from some of Capitolism’s favorite scholars—revealed Green to have been spectacularly, demonstrably wrong in all sorts of obvious and less obvious ways. Most American families, it turns out, aren’t living hand-to-mouth, and, while real affordability challenges exist, the general long-term trend for both middle-class living and real poverty has been positive.

The Numbers—and Entire Premise—Were Nonsense.

Tyler Cowen: Fortunately for us, this [poverty line of $140k] is all wrong. The underlying concepts are wrong, and the user of evidence is misguided. There are genuine concerns about affordability in the United States, but the analysi in this article is not a good way to understand them.

Noah Smith: But despite its popularity, Green’s claim is wrong. Not just slightly wrong or technically wrong, but just totally off-base and out of touch with reality. In fact, it’s so wrong that I’m willing to call it “very silly”. I know Mike Green, and I count him as a friend, but we all write silly things once in a while,1 and when we do, we deserve to be called out on it.

Jeremy Horpedahl: I think there are at least three major errors Mr. Green makes in the essay:

  1. He drastically underestimates how much income American families have.
  2. He drastically overstates how much spending is necessary to support a family, because he uses average spending figures and treats them as minimum amounts.
  3. He obsesses over the Official Poverty Measure, since it was originally based on the cost of food in the 1960s, and ignores that Census already has a new poverty measure which takes into account food, shelter, clothing, and utility costs: the Supplement Poverty Measure.

We also have Eric Boehm in Reason, and no doubt many more.

Okay It’s Not $140k But The Vibes Mean Something

Clifford Asness: The populist fabulists will only move the goal posts again.

It started as 140k was “poverty” moved on to something softer about “participation” (not that this isn’t a real concept it’s just not poverty) and now is down to “you can’t deny the ennui” and “all we meant was government programs are poorly designed to punish those seeking to leave poverty” something poverty scholars have only yelled about 1mm times.

Data + analysis >> vibes which sadly doesn’t mean they win in the court of public opinion.

Adam Ozimek: The defense of the $140k poverty line post have retreated to yes the data is wrong, yes the core claim is wrong, but it is a complaint about standard of living improvement in the US so I must nevertheless say it’s good.

Guess who led this chorus? Michael Green, saying that ‘accuracy’ was not the point.

If people keep insisting life sucks and the vibes are bad then you should believe them that there is a problem, that in some important way their life sucks and the vibes are bad. That doesn’t mean you need to respect the actual claims if they are false, but if you are trying to figure out what is true the defense of those claims is important evidence. Listen.

Needing Two Incomes Has A High Cost

This also leads into themes I mostly am saving for next time, but needs to be mentioned here: A family with one typical income will increasingly fall behind.

That doesn’t mean you can’t make the numbers work that way. You can. Falling behind doesn’t mean starving. Falling behind still sucks. A lot.

Matt Bruenig: New piece at @PplPolicyProj. It’s my entry into the “$140k is poverty” discourse and the “you used to be able to live comfortably on a single income” discourse more generally. I think I know how to make sense of it that does not require nonsense claims.

One way to put it that I ultimately cut out of the piece is imagine your society went from a 20-hour workweek to a 40-hour workweek but not everyone went along with the change. You could accurately say that you used to be able to afford a normal life on 20 hours but now you cant.

But rather than seeing that clearly for what it is — the standard for a normal life has ratcheted upward with more income/output — there is a temptation to say that there is hidden costs or inflation or whatever that have fully swallowed increased income etc.

… If everyone around me started working 20 hours of overtime each week and I didn’t, then that would suck. Even if I followed suit, but didn’t want to, that’d also suck. Because inequality sucks. Being alienated from the society sucks.

Real wages for married men are up, but the median income for married couples is up a lot more because a lot more women are working, which means if only the man works you’re falling behind. You get punched in the face with the Revolutions of Rising Requirements and Expectations.

Matthew Yglesias: Some excellent charts and info here, but I think the impulse to sanewash and “clean up” false claims is kind of misguided.

If we want to address people’s concerns, they need to state the concerns accurately.

The claim that the *absolute affordability* of being a married, one-earner family with kids has fallen would — if it were true — have straightforward win-win policy remedies like “higher wages and incomes.”

But it’s not true.

When you reformulate to a more accurate claim what you end up with is the observation that it is is hard for one person to earn as much income as two people and that the wedge has grown as women’s earning power has increased.

This is very true but what’s the fix?

One that would “work” would be to push women generally out of opportunities for careers and white collar work — something more conservatives are tip-toeing around but don’t quite want to say.

Family incomes have been moving up, much of which is increased female labor participation but a lot less than all of it.

Violeta: I’m following an insta acct who interviews elders from remote Romanian villages. Every single one of them speaks of how we now live in a God given infinite abundance so good that compared to their childhood &youth, they feel now they have more desire to live longer

a different POV

I know women in their 70s who for decades now, have been in grateful awe at how much easier they have it now. One of my aunts, mountain people, was raving when I last saw her about washing machines but also about *plastic bottles*, how practical they are during haymaking season

I Lied…

Green’s follow-up post might be the most smug and obnoxious ‘okay yes my original post was full of lies but I don’t care because it worked to get the discussion I wanted, so take that assholes who are in a grand conspiracy to keep us good folks down’ that I have ever seen. It somehow continues to fully equate the poverty line with median income, and to be arrogant about it, saying numbers shmumbers, they don’t matter.

And then he turns around and says, how dare you respond to my ‘legitimate grievances’ by pointing out that my facts are wrong and my arguments are nonsense?

Michael Green: In Are You an American?, I described “The Mockery Machine”—the ritualized pattern in which elites respond to legitimate grievances by distorting them into absurdity, ridiculing the distortion, and then shaming the “complainer” for even noticing the decline. I thought of it as a cultural reflex, a defensive maneuver performed mostly by Twitter avatars and partisans. I was wrong.

… Just like in Atlas Shrugged, all the sycophants wanted to display their loyalty to Balph. Check out the brutal assault on my childcare figure — “It’s not $32K — it’s $25.7K!!!!”

I mean, sir, that’s because your facts were wrong and your arguments were nonsense. No, it wasn’t ‘narrative discipline,’ it was caring about the accuracy of claims. And no, this wasn’t one isolated error, it was part of a string of errors mostly in the same direction, that people are very carefully and politely pointing out. All the careful pushback warmed my heart to see it. Have you tried making true claims instead?

The post went viral because of the false claims, and that was the message most people got. You can’t then turn around and say, why do you care about the false claims, I don’t care if my claims were false, that wasn’t the central point.

But yes. The fact that such obviously false claims resonated must be reckoned with, which is what the second post will be about, and these same people are also trying to say ‘and therefore everything is fine’ when everything is rather not fine.

Indeed, Green also correctly identifies the ‘making the goods better does not help you afford the goods’ problem with equating ‘real income adjusted for CPI’ with someone’s felt spending power and ability to survive.

This is written backwards by accident by Green, but correct once you fix it:

As others have noted, it’s great that the 1963 basket is so much higher quality than the 2025 basket that it’s “worth” much more, but it’s illegal to buy the 1963 basket.

Yes, that’s backwards – it’s the 2025 basket that’s worth much more, and rightfully so, but you still spent the same amount of money on the basket, and it’s still illegal to buy the 1963 basket, and that’s central to the argument I’ll make next time.

I’m definitely not here to say everything is fine. I’m not mocking the idea a lot of people feel like their lives suck, quite the opposite, stay tuned for the second post. But I absolutely, if forced to engage, will mock anyone who knowingly posts so much obvious nonsense and then pretends that this was fine because it worked and it wasn’t the central point, and only people with an agenda would call you out on it.

One other potentially good point Green makes there is that many individuals and couples don’t start families because they don’t feel they can afford one, which biases two-parent household income upwards. But in terms of impact on the averages it’s complicated, because there are kind of two fertility tracks, one for those who are trying to follow the ‘have enough money first’ playbook, and the other where some people go ahead and have kids anyway, and that second group is more present and has higher fertility at lower incomes. If you look at fertility by income, you see a U-curve, that fertility declines as income rises, until you get rather far up in income.

I do think that ‘a lot of people don’t have kids because they don’t believe they can afford them’ is central to the problem we face.

…But That’s Not Important Right Now

He then pivots (while continuing to assert various forms of nonsense along the way) into saying ‘the real point’ is two completely distinct other claims that are far better.

Poverty Trap

The first is that phase outs generate a Poverty Trap where effective marginal tax rates can be very high, even in excess of 100%. If marginal tax rates are very high, there’s no push to earn more money, so you don’t advance your career and you never earn enough to escape poverty. That’s a very real, no good, very bad problem.

Michael Green: This is the policy failure that was actually at the heart of Part 1: We have created benefit cliffs and income phase-outs that systematically capture the working poor, ensuring that climbing the ladder only leads to loss of essential benefits and permanent financial fragility.

Green’s version actually understates the issue. Assuming those numbers are right, yes, the post-transfers marginal tax rate there is obnoxious at around 45%, but that’s also the marginal tax rate at the top, and transfers are worth less than one dollar per dollar. Still, Green’s graph doesn’t look so bad, because the worst potential Valley is at $30k, and Green can’t count that low.

This, via Jeremy Horpedahl, is the chart that shows a real Valley of Death that can come up under some circumstances:

This graph is remarkably terrible. You could plausibly prefer $29k to $69k.

Here’s another graph that goes to the relatively expensive Boston and finds a trap that goes out farther, where you don’t escape until north of $100k:

Image

Or this one looks less bad that has a small reversal around $63k:

Again, benefits are worth less than $1 per $1, so marginal tax rates are not quite as bad as they look on these graphs, but they are not great.

Once you get above these thresholds, I don’t want to say you are ‘home free’ but you are strictly better off each time you earn an extra dollar.

Contra Green, ~80% of families of 4 are north of the trap in practice.

Can these dynamics trap people in poverty? Oh yes, absolutely, it’s all quite terrible, and we should work on solutions to make this whole thing sane.

Poverty Trap Versus Poverty Line

However, note that the basic problem will remain, which is:

  1. To live reasonably, people implicitly need ~$50k in total pay and benefits, given the Revolution of Rising Requirements and what you by law have to purchase.
  2. Thus, if you make less, we give you the difference, or your family starves.
  3. If you make more, we are going to tax you to pay for all of that.
  4. If you make the climb in effective pay steeper you make it suck that much more to make less money. You have to pick, how progressive will you make your taxation?

As an intuition pump to how tricky this is, redone from scratch for 2025, for families of exactly 4 only for now: We could change to instead provide a universal basic income (you can also call it a negative income tax) of $40k per family, plus a flat tax of about 25%-35% (depending on implementation details elsewhere) to $250k, then resume the current tax rates (so we don’t change how much we raise from the top of the distribution). No other default benefits, including Medicare and Medicaid, and no other federal taxes (no payroll, no Medicare tax and so on). My quick calculation says that’s roughly revenue neutral. Is that style of approach better? Maybe, but there’s at least one huge obvious problem, which is that this creates unsubsidized health insurance markets with no fallback, and so many others we’re not noticing. Of course, there’s huge upsides, especially if you fix the worst of the secondary effects.

Either way, good luck passing anything like this. The numbers are brutal if you’re not willing to blow up a bunch of sacred values somewhere.

The details Green discusses here are wrong again, including the fragility issue, but he’d be the first one to tell you the details and exact numbers don’t matter. His claim about a ‘different nature of the struggle’ doesn’t make sense.

But here, yes, this part’s very true and important, nail meet head:

The question we are increasingly asking is, “Why aren’t we having more families and procreating?” The answer, largely, is that we are asking families to make an investment in children that becomes a future common good and penalizing them for doing so.

Yes. Many people are saying. Children are an expensive public good. They are getting relatively expensive compared to other goods thanks to the Revolution of Rising Requirements. We are expecting parents to foot a huge portion of the bill, and that is the big problem.

It’s not a new big problem. The story of the world outside of farms has been, roughly, ‘spend your life trying to earn enough money to support as big a family as you can.’

He closes by making a bunch of other vibe-level and directionally correct but importantly false statements about the nature of wealth and transfers and the signaling theory of education, written to give a false impression, to equate the directional vibes with reality.

That works on a certain type of reader. To me and hopefully my readers, it’s toxic.

Double or Nothing

As a closing fun note, it can always be worse.

As in: You have to love the community note on this graph.

No, seriously, he took the MIT Living Wage Calculator and doubled it, and considers ‘comfortable’ to include 20% savings while meeting all ‘wants and needs.’ Must be nice.

Brennan Schlagbaum: For context…

SmartAsset studied how much families need to live a “normal” life in every US state. (covering needs, wants, AND saving 20%) The results are terrifying.

Okay, good. We got through that. We are now ready for next time.

]]>
https://thezvi.wordpress.com/2025/12/16/the-140000-question/feed/ 3 24956 thezvi Image