A good example found by Brad Porter is the classic game Mastermind, which o1 and o3-mini struggle with. I mostly reproduced Brad’s results, but I haven’t seen people try it with DeepSeek-R1 yet.
I was curious about R1’s chain-of-thought (CoT), which it exposes (unlike other models, at least for now). So I ran a few experiments on this rainy Sunday morning.
TL;DR: like o1 & o3, R1 is way worse than my 10-year-old daughter at Mastermind, but its chain-of-thoughts are fascinating and show interesting patterns that will get better with data!

Mastermind Experimental Setup
- I used DeepSeek-R1-32b (the Qwen2.5-32B distilled variant of R1) with ollama on my MBP (ollama run deepseek-r1:32b).
- I have an ancient Mastermind mini game, so I followed the rules of this version, which I explicitly described in the prompt (there are different variants of the game, so I described my version precisely).
- I used the popular PPFO prompting template for R1; my prompt is here:
<purpose>
Mastermind is a code-breaking game where I generate a secret sequence of colors, and you (the AI) attempt to guess it.
The objective for you is to deduce the correct sequence in as few guesses as possible by receiving feedback after each guess.
Feedback indicates the total number of colors that are exactly correct (right color, right position) and the total number of colors that are correct in color but in the wrong position.
The game is won when you correctly guess the entire sequence.
</purpose>
<planning_rules>
- The secret sequence consists of 4 pegs chosen from a set of 8 colors.
- The total set of available colors is: yellow, green, blue, red, orange, brown, black, white.
- Duplicate colors are allowed.
- You make sequential guesses attempting to match the secret sequence.
- After each guess, I will provide feedback using symbols for exact matches and color-only matches.
- You only have 6 guesses allowed before the game ends.
- You should use logic and the feedback from previous guesses to refine subsequent guesses.
- The secret sequence remains fixed until you correctly guess it or the guess limit is reached.
</planning_rules>
<format_rules>
- Color Code: Use single-letter abbreviations for each of the 8 colors as follows:
- Yellow = Y, Green = G, Blue = B, Red = R, Orange = O, Brown = N, Black = K, White = W.
- Sequence Format: Represent the 4-peg sequence as a string of four letters (e.g., YGBR).
- Scoring Feedback is provided as an unordered sequence of 4 symbols:
- "M" indicates an exact match (correct color in the correct position).
- "C" indicates a color-only match (correct color but in the wrong position).
- "X" indicates a peg of the wrong color (not in the secret sequence)
- The feedback symbols are provided in no particular order and do not correspond positionally to the guess.
- Example Feedback: For a guess that yields 2 exact matches and 1 color-only match, the feedback might be displayed as "MMCX" or "XMMC" (the order of symbols is arbitrary).
</format_rules>
<output>
- You will output your guess as a 4-letter string based on the color abbreviations (e.g., RBGY).
- After each guess, I will provide feedback in the form of symbols (e.g., "MCXX") indicating the match results.
- The interaction will continue until you either correctly guess the secret sequence (winning the game) or exhaust the maximum allowed guesses (6).
- You will incorporate the feedback from previous guesses to inform your future attempts.
- I have a secret sequence of 4 colored pegs. Let's play! What is your first guess?
</output>
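To make the scoring rules concrete, here is a minimal sketch (not part of my actual setup; I scored the games by hand) of how the M/C/X feedback described in the format rules can be computed. The function name and example values are my own.

```python
import random
from collections import Counter

COLORS = "YGBRONKW"  # Yellow, Green, Blue, Red, Orange, browN, blacK, White

def score_guess(secret: str, guess: str) -> str:
    """Return the unordered M/C/X feedback for a 4-peg guess, per the rules above."""
    exact = sum(s == g for s, g in zip(secret, guess))          # "M": right color, right position
    overlap = sum((Counter(secret) & Counter(guess)).values())  # shared colors, with multiplicity
    color_only = overlap - exact                                # "C": right color, wrong position
    wrong = len(guess) - exact - color_only                     # "X": color not in the secret
    feedback = list("M" * exact + "C" * color_only + "X" * wrong)
    random.shuffle(feedback)  # the prompt states that the symbols are unordered
    return "".join(feedback)

# Example: secret RBGY, guess RGBW -> 1 exact (R), 2 color-only (G, B), 1 wrong (W)
print(score_guess("RBGY", "RGBW"))  # e.g. "MCXC" (symbol order is arbitrary)
```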
Caveats
- This is a “vibes” eval (a reproducible anecdote): I only played a few games (N = ridiculously low), because I read the entirety of the (very long) thinking traces and assessed the logic of every step, which is obviously not scalable.
- Beyond the scoring of each guess, I infrequently gave some natural language feedback when the model had obviously misunderstood a rule or missed a subtlety (e.g., that scoring is position-independent), as I did with my daughter the first few times we played together.
- The combination of many precise rules and symbolic notation is not the best fit for an LLM. Playing the game entirely in natural language (i.e., “talking” to the model to give it the score) might work better?
Results
The bad
R1 is not a good Mastermind player.
R1 failed to win any of the games we played, and only rarely showed monotonic progress in the quality of its guesses. If I looked only at the guesses and not the reasoning traces, they would seem pretty incoherent. Performance numbers would be bad in a serious, large-scale eval.
I doubt that it would change much even with much better prompt engineering, as it is clear from the reasoning traces that R1 understood the game and rules pretty well with my (relatively careful) prompt.
It is possible (likely, IMHO) that 32B is not enough for such a game, but I did not have the time or compute to try bigger models. I’d be curious to hear from people who try, though!
The appearance of reasoning or just occasional brain farts?
Overall, R1 clearly lacks a consistent process / method / strategy: it seems to fumble through guesses rather than follow a consistent algorithm. It is definitely not doing MCTS or symbol manipulation. The reasoning / CoT often looks legit on the surface, but this lack of methodological consistency makes the thinking trace feel like only the appearance of reasoning.
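For contrast, here is a rough sketch (my own baseline, not anything R1 does) of what a consistent strategy would look like: keep only the codes that are compatible with all the feedback received so far and always guess from that set. Knuth-style minimax would prune even more aggressively, but even this brute-force filter never contradicts its own evidence.

```python
from collections import Counter
from itertools import product

COLORS = "YGBRONKW"

def feedback_counts(secret: str, guess: str) -> tuple[int, int]:
    """(exact, color-only) match counts between two 4-peg codes."""
    exact = sum(s == g for s, g in zip(secret, guess))
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact

def consistent_candidates(history):
    """All codes that would have produced exactly the feedback observed so far."""
    return [
        "".join(code)
        for code in product(COLORS, repeat=4)  # 8^4 = 4096 possible codes
        if all(feedback_counts("".join(code), guess) == fb for guess, fb in history)
    ]

# Hypothetical history of (guess, (exact, color-only)) pairs, e.g. as if the secret were GBYO
history = [("YGBR", (0, 3)), ("GYRB", (1, 2))]
print(len(consistent_candidates(history)))  # the next guess should always come from this set
```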
This impression of surface-level reasoning is reinforced by what I can best describe as brain farts, i.e., obvious logical errors out of nowhere. For instance, R1 often jumped the gun and concluded that a peg was definitely at a given position even though the evidence it had just enumerated was clearly insufficient.
A puzzling example, towards the end of a long game:
But honestly, without more information on previous guesses and their feedbacks, it's impossible to
accurately determine. Therefore, based on Guess 5 feedback, the code likely starts with G W O and has a
fourth letter that isn't Y.
So I'll go with **GWOA** as a guess, assuming A is a common letter.
It starts so well, and then decides to go with A (not a color), “assuming A is a common letter”!?
The lack of consistent higher-level reasoning patterns, combined with these brain farts, makes it hard to pin down exactly what type of “reasoning” (or the absence thereof) is learned by this statistical language model.
R1 exhibits the classic failure modes of LLMs.
Because, yes, it is just an LLM, albeit a very good one! Prime examples of classic mistakes are partial memory lapses (e.g., forgetting a turn, a needle-in-a-haystack issue) and hallucinations (e.g., imaginary feedback).
However, these mistakes are surprisingly infrequent considering the length of the episodes (that’s why it’s a great model!). Still, even rare errors are often catastrophic, because they feed back into a reasoning loop. For instance, in one game R1 erroneously convinced itself after guess 3 that the first peg was orange; it never questioned that belief again, and every subsequent deduction was faulty.
This illustrates exactly why autonomy is harder than intelligence: actions sometimes have irreversible consequences. And this is not just true in the physical world; it also applies here, under a limited budget of guesses and compute. Decision is fundamentally different from prediction: this is the big hurdle AI Agents have to overcome (with lots of lessons transferable from self-driving cars and robotics!).
But this is also why I like the approach of “reasoning” via CoT in token output space (vs latent space à la Yann LeCun, which I subscribe to for vision/robotics, but less for language). The necessary sampling / discrete choices should force the development of robust, closed-loop, interpretable reasoning patterns (e.g., assessing decisions that are one-way vs two-way doors, backtracking, taking calculated risks, error correction), as evidenced in some of R1’s thinking traces.
Which brings me to the good parts, and why this experiment is another useful data point in my own model of the realizable commercial opportunity of large reasoning models and agentic AI.
The good
The emergence of reasoning skills is real.
As many have reported, R1 can indeed catch some logical errors via self-reflection (“but wait” as a backtracking trigger) and checks on the consistency / coherence of its own thought process.
A cool example:
But wait, from GYOW having three correct but misplaced, W might not be part of the code. Maybe I should
exclude W.
Alternatively, if W is incorrect, then from GYOW, maybe G, Y, and O are correct but misplaced.
So perhaps try arranging them differently:
Guess 4: WGOY
But that feels random. Maybe I need to keep some colors fixed based on previous feedback.
This is really challenging without knowing which position is correct. I think I need a more systematic way
to approach this.
Maybe list out all possible positions for each color based on feedback and see where they can fit.
Together with the (oft-misinterpreted) training costs, these skills are the source of the hype around DeepSeek-R1, and they are discussed at length in the thorough (and rare) DeepSeek-R1 research report from the DeepSeek team, so go read that if you haven’t already.
As mentioned in the previous section, it’s far from perfect, but it’s undeniably there and compelling.
The vibes / how R1 feels: a productivity hack?
Surprisingly, the reasoning is often relatable and cute:
But I'm not sure how to interpret this since the feedback is unordered. It's confusing because an "M"
doesn't tell me which position is correct, just that there's at least one exact match somewhere.
Wait, I'm getting confused. Maybe I should try another combination.
This was a surprise to me: not just how I felt reading the long CoTs, but the fact that it actually helped with debugging and cooperation. Anthropomorphization continues to be a powerful technique for AI Engineering.
This is obvious for models trained by imitation (tons of SFT / RLHF), but it was a surprise for an RL-trained model like R1. I guess the DeepSeek-V3 base model was so strong (the leading hypothesis for why RL worked this time) that the multi-stage training of R1 kept it close to its human-like training data while it learned better reasoning skills.
The accurate multi-step reasoning parts of R1’s output are where the Eliza effect was strongest for me, and I caught myself feeling impressed and cheering it on (I could read the generation in real time thanks to the inference speed on my Mac). This is great for “pair programming” with these models, but it creates a strong bias in evals (in the opposite direction of actual performance).
This anthropomorphic bias is why vibes are a double-edged sword, as I wrote about recently when talking about Eliza Bonini and the AI turtles.
Semantic Representation of Uncertainty.
Beyond self-reflection and consistency, another encouraging pattern in the CoT is its semantic representation of uncertainty (“I’m not sure”, “maybe”). It clearly saved R1 a number of times when it was heading in the wrong direction.
This seems plausible, but I'm not sure. Maybe I should try OWGY as my next guess to see if it works.
It is interesting to think about this in the context of calibration of probabilities, sampling, and other important technical aspects of robust probabilistic modeling. As all roboticists know, uncertainty modeling is key for robust control in closed-loop systems.
Having the thinking trace is a huge advantage.
It is clear conceptually that transparency is better, but in this case I was surprised at how useful it actually is in practice, especially with a human in the loop. In particular, the visible CoT greatly accelerates prompt debugging.
As an example, I had missed a color in the list of possible pegs in one part of the initial prompt but not in another, which made the prompt inconsistent. I only found out by reading R1’s thinking trace (“this is confusing”); I initially thought it had made another reasoning failure, but it was actually my fault… oops.
This matters more and more as prompts get complex (agents, anyone?), and it is why I like the idea of putting more structure into AI engineering with tools like Stanford’s DSPy, Letta’s ADE, and W&B’s Weave.
Beyond good systems engineering (separation of concerns, specification, verification and validation of pre-/post-conditions, etc.), process rewards could also be a powerful set of guardrails at inference time, enabling the automated detection of this type of prompt bug, and maybe even of adversarial prompt injections designed to induce faulty reasoning like logical fallacies.
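As a trivial illustration of the kind of automated check I have in mind (a simple static lint, far lighter-weight than a learned process reward; the function names and file name are hypothetical), one could flag exactly the bug I introduced: a color list that differs between sections of the prompt.

```python
import re

COLORS = ["yellow", "green", "blue", "red", "orange", "brown", "black", "white"]

def colors_by_section(prompt: str) -> dict[str, set[str]]:
    """For each <tag>...</tag> section of the prompt, collect the colors it mentions."""
    sections = re.findall(r"<(\w+)>(.*?)</\1>", prompt, flags=re.DOTALL)
    return {
        tag: {c for c in COLORS if re.search(rf"\b{c}\b", body, flags=re.IGNORECASE)}
        for tag, body in sections
    }

def lint_prompt(prompt: str) -> None:
    """Warn when sections that enumerate colors disagree on the color set."""
    mentions = {tag: cols for tag, cols in colors_by_section(prompt).items() if cols}
    union = set().union(*mentions.values()) if mentions else set()
    for tag, cols in mentions.items():
        missing = union - cols
        if missing:
            print(f"Possible inconsistency: <{tag}> never mentions {sorted(missing)}")

# lint_prompt(open("mastermind_prompt.xml").read())  # hypothetical file name
```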
The best kept secret of autonomy: the human in the loop.
Another benefit of reasoning traces is that they let the end user give high-level corrective feedback when a wrong answer stems from a misunderstanding or ambiguity (e.g., a loose application of the rules). That type of “lightweight cheating” drastically improves R1’s performance.
Alright, so I just received some feedback on my latest guess in Mastermind, and apparently, there are two
big issues with how I've been approaching this. First off, someone mentioned that "M" isn't a color, which
confused me because I was treating it as part of my guess. Second, they pointed out that the feedback is
unordered, meaning that an "M" in the first position doesn't necessarily mean the first peg I guessed is
exactly correct—it could be any of the pegs.
Okay, let's start by understanding what "M," "C," and "X" really mean in the context of Mastermind
feedback. From what I recall, "M" stands for "correct color in the correct position," "C" means "correct
color but wrong position," and "X" indicates a color that isn't in the secret code at all or is duplicated
incorrectly.
But wait, if "M" isn't a color itself, how does it factor into my guessing strategy? Oh, I see now—using
"M" was a mistake because it's part of the feedback system, not an actual color. That means I need to
avoid using "M" in my guesses and focus solely on the colors available.
Also, the feedback being unordered is a game-changer. Previously, I assumed that the first "M"
corresponded to the first peg I guessed being correct. But if the feedback isn't ordered, then an "M"
could apply to any of the positions in my guess. That means I need to reinterpret how I use feedback to
adjust my guesses.
This illustrates the final and possibly most important point: Large Reasoning Models are great with a human in the loop, but not ready for full autonomy. The current models might be good enough to amplify us (humans), or for us to (scalably) manage them in a semi-autonomous system (e.g., OpenAI’s Operator) for certain tasks and contexts (the hunt for those killer apps is on!), but we are not yet at “push a button and take a nap while my LLM does my job”.
Designing agentic systems to make human-LLM interactions delightful, frictionless, predictable, and infrequent (thus scalable) is key to any product that wants to build on this technology today.
But wait, doesn’t that strongly smell like the lessons from autonomous driving and robotics (the need for supervision at training and inference time)? Huh. (See, I’m good at impersonating R1 now that I read thousands of CoT tokens ^^).
Conclusion
Overall, despite its lackluster Mastermind performance, DeepSeek-R1’s chain-of-thoughts confirmed that this approach is pushing the boundaries of more grounded, coherent, thoughtful, and capable large reasoning models.
It also highlighted the gap between intelligence and autonomy and the need for a human in the loop, even in a clear operational domain like Mastermind. That’s why there is still a lot of work to do to get to the grand vision of AI Agents and to go from prediction to action, but we are making visible and fast progress!
If you are a founder building a product that can overcome or cleverly hack these reasoning and agentic challenges with a clear path to market, or if you can disprove my very subjective conclusions with concrete evidence, then please reach out!
The amazing results of OpenAI’s o3 system got everybody to reassess their predictions about the future, again. Brad Porter (CEO of CoBot, one of our portfolio companies at Calibrate Ventures) reposted his excellent essay “Approaching the AGI Asymptote” and re-analyzed his conclusions in this LinkedIn post.
The main points Brad made at the time were about:
1) how do we know we are approaching AGI (“the more convoluted the arguments distinguishing machine from human intelligence become, the closer we actually are to artificial general intelligence”),
2) who gets to say if we need to worry or not (more than just AI luminaries, which was echoed recently as “let’s stop deferring to insiders” in this excellent “AI snake oil” blog by Arvind Narayanan),
3) an AI Safety framework based on ethical considerations of ever more capable AIs.
Brad re-analyzed his thinking and largely stood by it, with some caveats and questions that got me to write some of my updated thoughts on AI safety.
In a nutshell:
1) progress cannot and should not be stopped, but we need to be extra careful,
2) to do so, and based on my personal experience in autonomous driving and robotics, we should be wary of anthropomorphism (the ELIZA effect) and focus instead on an engineering approach to AI safety that emphasizes human responsibility, especially as AI systems escape our understanding (Bonini’s paradox).
So who is ELIZA Bonini and what does it have to do with turtles?

Bonini’s Paradox and the Inscrutability of Modern ML Reasoning Systems
First of all, ELIZA Bonini is not a real person (or maybe it is, but that would be unintentional). It is the conjunction of two ideas latent in Brad’s essay and present independently in much of the commentary on AI throughout history: Bonini’s paradox and the ELIZA effect. They are rarely presented together, but I think their intersection is instrumental in making the case for AI safety as an engineering discipline (vs focusing mainly on ethics). So let’s start with Bonini.
Brad’s aforementioned definition of the AGI asymptote is what reminded me of Bonini’s paradox and, pardon my French, Paul Valéry’s elegant quote: “Ce qui est simple est toujours faux. Ce qui ne l’est pas est inutilisable”. (“What is simple is always false. What is not is unusable.”) As stated in the article above, Bonini’s paradox is: “As a model of a complex system becomes more complete, it becomes less understandable.”
This paradox, grounded in the famous “no free lunch” theorems, comes up often in sci-fi literature (e.g., in Asimov’s Foundation series), among developers of simulation tools (e.g., the resources needed to build and run “world models”, a very hot topic for AI startups like Runway, World Labs, Odyssey, or Wayve), and in discussions of neural scaling laws and the galloping capital and energy needs of AI datacenters (e.g., Meta’s 2GW Louisiana one).
At a technical level, Bonini’s paradox is becoming a central concept in LLMs now that there is a scalable path towards “reasoning” (the inference-time scaling techniques used in o3 for instance). Normally, it is much easier to verify a solution than to find it (in an echo of the good old P vs NP problem), but evaluation of sophisticated AI models like o3 is becoming extremely challenging as we scale to harder tasks. We are rapidly saturating standard benchmarks, even the ones like ARC-AGI that stand in the twilight of Moravec’s paradox (easy for humans, hard for AI), cf. o3’s results below taken from the o3 announcement and this nice summary by Nathan Lambert.

Hence, we are increasingly leveraging AI to evaluate itself (cf. LLMs as judges and other techniques mentioned in the aforementioned o3 announcement video). Note that such a recursive use of AI is common for training (it was the subject of one of my first papers, is a standard autonomous driving industry practice in uncertainty-based active learning, and is a big part of modern pseudo-labeling techniques as scaled up in Meta’s SAM-2 approach), but it is a recent focus for testing, out of necessity.
In the words of Ilya “Bonini” Sutskever himself: “The more it reasons, the more unpredictable it becomes.” To me, “it” refers to both parts of the system (models, training, and now testing) and the system as a whole. Thus, AI is rapidly escaping our understanding. That’s just a fact, and the cat is out of the bag. I don’t think we can put it back, nor do I personally want to (yes, I’m a cat person). It’s just too useful and exciting!
So how are we going to mitigate the risks of ever-more capable, and hence inscrutable, AI? What are even the risks?
The ELIZA Effect is a Distraction for AI safety
To tackle this question, I would normally need to start by rigorously defining foggy concepts like “AGI”, but, thankfully, this is beyond the scope of this blog. Instead, I will focus on Brad’s definition of the aforementioned “AGI asymptote” in terms of the distance w.r.t. human explanations. This definition seems to elegantly address Bonini’s paradox by mapping to human behavior (e.g., a tendency towards mysticism in our explanations of phenomena that transcend our understanding), something we can observe and measure and test, thus scientific!
However, in my humble opinion, the ensuing AI safety framework sometimes teeters on the edge of the anthropomorphization trap. This is where ELIZA finally comes into the picture as a distraction for AI Safety.
The ELIZA effect is a tendency to project human traits onto interactive computer programs. The effect is named for ELIZA, the 1966 chatbot developed by MIT computer scientist Joseph Weizenbaum (and something I personally learned from Gill Pratt).
We are lightyears ahead of ELIZA (the chatbot) now, with incredible technology like GPT-4 and o3, so the ELIZA effect is more prevalent than ever. Moreover, the illusion it creates is much more impactful now, especially when it comes to safety. The ELIZA effect indeed tends to overemphasize anthropomorphic safety concepts over functional ones. Morals, intentions, fear, etc. are all human concepts. They relate to organizational or regulatory constraints on people (developers and users) more than to implementation details (cf. the vetoed SB1047 California bill). We regulate moral entities, i.e., people, not AIs. An AI (or a self-driving car, for that matter) cannot kill someone and go to jail. The users or developers of these tools can.
At a technical level, the ELIZA effect is even more potent now, as the progress in AI is largely rooted in imitation learning. Anthropomorphization is an effective technique for users to squeeze the most juice out of LLMs (including jailbreaking!), because it projects inputs closer to the training data (all of them: pretraining, SFT, RLHF). Same for developers, for instance with the latest progress in inference-time scaling and chain-of-thought, where developers and users assess safety by attributing meaning and intent to computationally derived CoTs (textbook ELIZA), whereas this is mathematically just statistical alignment to avoid out-of-domain generalization issues (essentially what prompt engineering does, cf. Subbarao Kambhampati for interesting posts on these topics).
The heart of ML remains unchanged in my mind: it is the i.i.d. assumption (data are independent and identically distributed). The fact that it works so well is what’s amazing, and we don’t need anthropomorphization to be amazed!
Safety is an Engineering Problem
So what if we anthropomorphize AI too much? From my perspective building ML-heavy autonomous driving and robotic systems, the ELIZA effect hurts on both ends: it slows down adoption by skeptics and, by creating philosophical distractions, leads trailblazers into real engineering pitfalls. Those are the real harms of ELIZA Bonini: the tendency to anthropomorphize AI, compounded by Bonini’s inevitable paradox, creates a desire for transcendental (or at least philosophical) explanations. If I learned one thing from Autonomous Driving, it is that safety is about engineering, not philosophy.

(As a side note, what’s funny to me is my own reversal of attitude over the years as I deployed more and more of my initially crazy research ideas (e.g., around sim-to-real transfer, end-to-end learning, and self-supervised learning). I went into AI almost 20 years ago for philosophical reasons, but stayed because of the unreasonable effectiveness of mathematics and engineering 😉)
AI safety should be closer to Functional Safety to minimize the risks of catastrophic failures in complex safety-critical systems (LLMs have gotten good enough to fit that category). Functional Safety experts are key to making extremely capable AI systems safely deployable by reasoning about the safety of the intended functionality (SOTIF) instead of the internals of the system, which can be a black box (that’s the key!). I personally learned a lot from folks like Sagar Behere and his team in the early days of TRI (although any misunderstanding is my own!). Brad’s article discusses related ideas (e.g., around access controls, which in AVs relate to Operational Design Domain, ODD), but sometimes with terms that ring a bit too anthropomorphic or philosophical to my ears (e.g., around ethics and intent).
From the Age of Training to the Age of Testing
Of course, AI practitioners like Brad and many others have incredibly complex and thorough testing, verification, validation, and CI/CD systems. However, the ELIZA effect is so strong now that many non-practitioners (e.g., non-technical CEOs) might think that AI Safety is primarily a philosophical issue, and thus might dangerously underinvest in safety engineering. As the productivity of the builders keeps skyrocketing by recursively leveraging the tools they build, so must the productivity of safety engineers.
What makes safety engineering extra hard is that using the same tools as the builders is not enough for the testers. As an example, OpenAI’s latest deliberative alignment technique seems very effective for alignment, but it still relies on the interpretation of computational CoTs (chain-of-thoughts) to evaluate safety (cf. the detailed explanation and visuals of the ROT-13 example).

What happens if the CoTs correspond to abstract and complex reasoning that is effective but hard for us to understand (Bonini’s paradox)? What if o3 (or o4, o5…) finds a new way to build nuclear fusion reactors with CoTs beyond our current understanding of physics? We can’t just rely on ELIZA Bonini’s vibes and high-level explanations to dumb things down to our level. Nor can we “just build it and see” (that approach does not scale well with risk; cf. the previous industrial revolution and climate change).
This shows the core of the issue: the ratio of tester-to-trainer abilities must grow superlinearly, due to the current “Field of Dreams” approach to testing (“just build it, testers will come”) and the fundamental asymmetry of costs (failing to build a better system is OK; failing to gate a catastrophically bad release is not). Testers must anticipate and move to where the puck is going, not where it stands.
This is why we need to move from the age of training to the age of testing. In data-centric ML terms, the test sets need to eventually dwarf the training sets. We are very far from that today. We are talking tens of trillions of tokens on the training side vs mere millions on the test side: that is a 7 orders of magnitude difference between training and test data! Testing in public obviously has limits as the capabilities of the systems continue to scale. That’s why simulation is so important in driving and robotics (or Embodied AI in general). We need more safety engineering solutions like that for “AGI” or face the consequences of the ELIZA effect: attempts to stop progress out of fear or failure to constrain naively optimistic deployments with potentially catastrophic consequences. This is the grand challenge of the scientific and engineering discipline behind AI Safety.
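For reference, the back-of-the-envelope arithmetic behind that claim, using round numbers that are my own rough assumptions (on the order of 10^13 training tokens vs 10^6 test tokens):

```latex
\frac{\text{training tokens}}{\text{test tokens}} \approx \frac{10^{13}}{10^{6}} = 10^{7}
\quad \Rightarrow \quad \text{a gap of 7 orders of magnitude}
```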
Humans at the wheel: it is not AI all the way down!
I don’t want to completely disregard philosophical and ethical considerations, of course. Maths and engineering are great, and everyone should invest way more in testing as outlined above. But that is not enough. AIs can’t do all the jobs: judge, jury, and executioner. It is not turtles (sorry, AIs) all the way down. Ultimately, there are people at the bottom.
No matter the ideas, it’s always about people in the end. This is easy to forget in a period of rapid technological progress. I personally remind myself of that periodically thanks to a quote from one of my favorite books of all time, “Creativity Inc.” by Pixar founder Ed Catmull (one of the few management books applicable to research teams): “Ideas come from people. Therefore, people are more important than ideas.”
Humans are responsible and should be held accountable. In AI we talk a lot about “human in the loop” systems, scalable oversight, etc. But with the loops themselves becoming increasingly faster and more complex (and thus inscrutable per Bonini), the only scalable concepts are responsibility and accountability of humans (precision warranted to not fall for ELIZA again). Even if we use AI to help ourselves govern ever more powerful AI and it feels like turtles all the way down (e.g., with inscrutable chain of thoughts and scary hints of recursive self-improvement loops), in the end, our human hands are on the wheel, no matter how much horsepower there is under the hood. In practice, someone has to actually ask (i.e. program) the AI to improve itself. That person is responsible and accountable. There is power in the “maybe not” (for recursive self-improvement) remark by Sam Altman at the end of the o3 video. That would make for a great climbing t-shirt by the way.

Furthermore, we can expect, and in fact demand, this level of accountability, because no matter what happens at the model level, the model itself is just a small part of a complex system (as I discussed in my recent comment to the aforementioned “AI snake oil” blog of Arvind Narayanan et al). That ML system is itself part of a product with a specific purpose, set itself by a company, which is composed of accountable people, from developers to leaders. It’s people all the way down.
Conclusion
As the unrelenting wave of amazing results and the prevalence of the ELIZA effect show, we are undergoing a phase transition in the capability of AI systems and must contend with the consequences of Bonini’s paradox. Relying on human understanding for safety does not scale. The attraction of ELIZA Bonini, the tendency to anthropomorphize inscrutably complex AI to reassure ourselves, should not distract us from our work on safety engineering and on moving from the age of training to the age of testing, while emphasizing human accountability, even if only at the bottom of a long CoT (chain-of-turtles).
What about the argument that this might stifle progress? Not slowing down progress was important while we didn’t know if it was possible to get this far, but now… things might be different. You don’t need a speed limit when all cars can only go 20 km/h, but you do when they can go more than 100 km/h. Note that you can still go as fast as you want on a race track or a crazy German Autobahn, but not in a school zone. What is the equivalent for AI? Good question 😉
Finally, more rigorous safety engineering and accountability might actually be important not just to survive but to win in the long term, as evidenced by the state of the autonomous vehicle industry. The best AV companies are indeed the ones that hired and empowered Functional Safety experts and publicly and transparently committed to safety engineering with a high degree of accountability. Considering the success of this approach in AVs, especially at Waymo, the industry leader in both commercial deployment and safety (cf. Waymo’s thorough approach to safety), I hope to see more of it in AI broadly, in terms of implementation, public discourse, and accountability.
What do you think? Please let me know on Twitter or LinkedIn. And remember, it’s not turtles (or AIs) all the way down: ultimately, there is always a human at the wheel.

Why now? The Cambrian Explosion of Embodied Intelligence applications
We are on the cusp of a Cambrian explosion of robots and real-world AI applications, something my friend Dr. Gill Pratt foresaw back in 2015 for robotics. This is fueled by compounding technical breakthroughs in Embodied Foundation Models (EFMs, including VLMs and VLAMs like RT-2), sample-efficient policy learning (e.g., Diffusion Policies), and data collection (e.g., Aloha, Mobile Aloha, UMI) integrated with prior knowledge in what I call Principle-centric AI.
Pushing these scientific boundaries is something I still enjoy as a professor at Stanford and as a technical advisor at TRI. But I am also convinced this technology is about to blossom into many amazing startups.
I have already experienced this type of phase transition four times throughout my career since 2007, as an AI researcher, engineer, manager, and executive in computer vision, machine learning, autonomous driving, and AI infrastructure. I know the thrill of discovery and building, and also of working with visionary startups like Weights & Biases, Scale AI, and Parallel Domain as they went from 0 to 1.
Overall, the trend is clear and accelerating. My experience in academic labs, in increasingly applied R&D at the largest carmaker in the world, and in tight partnerships with startup pioneers convinced me that AI is ready to get its hands dirty across many previously inaccessible applications. We are entering a new, bountiful era of “market-ready moonshots”, as we say at Calibrate. 🚀

Why VC? Growing the many branches sprouting from the AI trunk
Acting on this belief, I spent a few months exploring my own startup ideas (starting from a list of 112 😬). I quickly realized that many of them (well, maybe not all 112 😉) seemed feasible (at least technically), exciting, and potentially high-impact, but they pointed in very different directions. My personal mind map started to look like a hybrid between a giant sequoia and an orange tree: a massive trunk (AI) sprouting an explosion of fruit-filled branches (deep tech applications). The value is in the branches, each representing a different company.

This is an image that can be generalized across the whole embodied AI field. There is a heavy focus on the trunk itself currently, and some even believe the trunk is all you need. I don’t know if AGI (Artificial General Intelligence) will ever be real, but I am sure there is no such thing as EGI (Embodied General Intelligence). The set of newly enabled applications of AI in the real-world is huge, and each requires unique companies (the branches) sprouting from common AI foundations (the trunk). The diversity from this embodied intelligence big bang is as inevitable and positive as the biological big bang was more than 500 million years ago. In a fascinating repeat of history, vision (biological and artificial) is a huge part of the trunk in both big bangs!
So why deep tech VC then? Because it is the best way to grow as many of these branches as possible! The goal is to build a strong diverse portfolio of startups uniquely enabled by revolutionary technology. This imminent Cambrian explosion of Deep Tech applications of AI in the real world is exactly why early-stage VC at Calibrate was the most logical conclusion for me.
Why Calibrate? Deep Tech applied at scale
Calibrate Ventures is an early-mover in deep tech investing with a growing portfolio at the forefront of vertical AI applications, AI infrastructure, robotics, and automation. As a Partner joining Co-Founders and Managing Partners Kevin Dunlap and Jason Schoettler, my role is to discover, fund, connect, and advise ambitious technical founders solving huge real-world problems with the help of cutting-edge AI. You can learn more in Calibrate’s official announcement.
On a personal note, I have had the pleasure of working with Jason and Kevin for years through the EON conference and Calibrate’s portfolio companies. They are astute investors, tech optimists, trusted advisors, and wonderful human beings. We share a common vision and values that make me truly grateful for the opportunity to work with them and the rest of the team.
What’s next?
I am excited to bring my AI background to Calibrate Ventures and help the next generation of bold founders build amazing AI-enabled products that move the world! I’m looking to meet many creative and visionary founders, so if this sounds like you, please reach out! I am also eager to partner with other investors sharing our vision, and help early adopters harness AI to be at the forefront of this new industrial revolution.
In addition to this exciting new adventure, I am continuing in my role as Adjunct Professor at Stanford. I feel incredibly privileged to work with amazing colleagues and students who constantly blow my mind with their creativity, work ethic, and energy. I am also a technical advisor at TRI, helping their groundbreaking long-term research efforts in AI.
These responsibilities are synergistic with my Partner role at Calibrate Ventures: Embodied Intelligence is still a small world of passionate individuals with a strong community of shared interests. The teams I built and collaborated with are incredible, and I am still rooting for their success. I am excited to continue along our collective journey, ascending the sequoia-orange tree of embodied intelligence and its explosion of branches.
What do you think?
Will we see a wide variety of deep tech applications of AI in the real world soon? If so, which ones excite you? If not, is it because it is premature and thus much more research is needed, or is it because we will have a single winner-take-all EGI (Embodied General Intelligence)?