A project by Hack Club
(This is Part 3 of a series on AI Safety! You don't have to read the previous parts — Intro, Part 1, Part 2 — but they'll help!)

So, after writing 40,000+ words on how weird & difficult AI Safety is... how am I feeling about the chances of humanity solving this problem?

...pretty optimistic, actually!

No, really!

Maybe it's just cope. But in my opinion, if this is the space of all the problems:

Drawing of a faintly outlined blob, labelled "the whole problem space"

Then: although no one solution covers the whole space, the entire problem space is covered by one (or more) promising solutions:

Same outline, but entirely overlapped by small colorful circles, each one representing a different solution

We don't need One Perfect Solution; we can stack several imperfect solutions! This is similar to the Swiss Cheese Model in Risk Analysis — each layer of defense has holes, but if you have enough layers with holes in different places, risks can't go all the way through:

Rays going through layers of swiss cheese, which have holes in them. The rays can easily go through the holes of one layer of cheese, but not all of them.

( : 🧀 Bonus section: counter-arguments & counter-counter-arguments on the Swiss Cheese Modeloptional: whenever you see a dotted-underlined section, you can click to expand it! )

This does not mean AI Safety is 100% solved yet — we still need to triple-check these proposals, and get engineers/policymakers to even know about these solutions, let alone implement them. But for now, I'd say: "lots of work to do, but lots of promising starts"!

As a reminder, here's how we can break down the main problems in AI & AI Safety:

Breakdown: "How can we align AI to humane values?" breaks down into Problems in the AI (technical alignment, game theory, deep learning) and Problems in the Humans (whose values?, coordinating around ai)

So in this Part 3, we'll learn about the most-promising solution(s) for each part of the problem, while being honest about their pros, cons, and unknowns:

🤖 Problems in the AI:

😬 Problems in the Humans:

🌀 Working around the problems:

( If you'd like to skip around, the Table of Contents are to your right! 👉 You can also change this page's style, and see how much reading is left. )

Quick aside: this final Part 3, published on December 2025, was supposed to be posted 12 months ago. But due to a bunch of personal shenanigans I don't want to get into, I was delayed. Sorry you've waited a year for this finale! On the upside, there's been lots of progress & research in this field since then, so I'm excited to share all that with you, too.

Alright, let's dive in! No need for more introduction, or weird stories about cowboy catboys, let's just—

:x Swiss Cheese

The famous Swiss Cheese Model from Risk Analysis tells us: you don't need one perfect solution, you can stack several imperfect solutions.

This model's been used everywhere, from aviation to cybersecurity to pandemic resilience. An imperfect solution has "holes", which are easy to get through. But stack enough of them, with holes in different places, and it'll be near-impossible to bypass.

But, let's address two critiques of the Swiss Cheese model:

Critique One — this assumes the "holes" in the solutions are independent. If there's any one hole that all solutions share, then a problem can slip through all of them. Fair enough. This is why the Proposed Solutions on this page try to be as diverse as possible. (See below section: Robustness > Diversity)

Critique Two — more relevant to AI Safety — is it assumes the problem is not an intelligent agent. A quote from Nate Soares:

“If you ever make something that is trying to get to the stuff on the other side of all your Swiss cheese, it’s not that hard for it to just route through the holes.”

I accept this counterargument when it comes to defending against an already-misaligned superintelligence. But if we're talking about growing a trusted intelligence from scratch, then the Swiss Cheese Model still makes sense at each step, and, you can use each iteration of a trusted AI as an extra "layer of cheese" for training better iterations of that AI. (See below section: Scalable Oversight)

As AI researcher Jan Leike puts it:

More generally, we should actually solve alignment instead of just trying to control misaligned AI. [...] Don’t try to imprison a monster, build something that you can actually trust!


Scalable Oversight

This is Sheriff Meowdy, the cowboy catboy:

Drawing of Sheriff Meowdy

One day, the Varmints strode into town:

Sheriff Meowdy staring down a bunch of Jerma rats strolling towards him

Sharpshootin' as the Sheriff was, he's man enough (catman enough) to admit when he needs backup. So, he makes a robot helper — Meowdy 2.0 — to help fend off the Varmints:

Robot version of Sheriff Meowdy, labelled "Meowdy 2.0"

Meowdy 2.0 can shoot twice as fast as the Sheriff, but there's a catch: Meowdy 2.0 might betray the Sheriff. Thankfully, it takes time to turn around & betray the Sheriff, and the Sheriff is still fast enough to stop Meowdy 2.0 if it does that.

This is oversight.

Sheriff Meowdy watching 2.0, with a gun to its head. 2.0 can turn around in 500ms, Meowdy can react & shoot in 200ms.

Alas, even Meowdy 2.0 still ain't fast enough to stop the millions of Varmin. So Sheriff makes Meowdy 3.0, which is twice as fast as 2.0, or four times as fast as the Sheriff.

This time, the Sheriff has a harder time overseeing it:

3.0 can turn around in 250ms, Meowdy can only still react in 200ms. Meowdy is sweating.

But Meowdy 3.0 still ain't fast enough. So the Sheriff makes Meowdy 4.0, who's twice as fast as 3.0...

...and this time, it's so fast, the Sheriff can't react if 4.0 betrays him:

4.0 can turn around in 125ms, which is fast enough to betray Meowdy. 4.0 shoots Meowdy dead.

So, what to do? The Sheriff strains all two of his orange-cat brain cells, and comes up with a plan: scalable oversight!

He'll oversee 2.0, which can oversee 3.0, which can oversee 4.0!

Meowdy oversees 2.0 oversees 3.0 oversees 4.0

In fact, why stop there? This harebrained "scalable oversight" scheme of his will let him oversee a Meowdy of any speed!

So, the Sheriff makes 20 Meowdy's. Meowdy 20.0 is 220 ~= one million times faster than the Sheriff: plenty quick enough to stop the millions of Varmints!

Meowdy oversees a chain of Meowdy's, up to Meowdy 20.0, who can shoot all the Varmints dead.

Wait, isn't a single chain of oversight fragile? As in, if one of the Meowdy's break, the entire chain is broken? Yes! One solution is to have multiple interwined chains, like so:

Parallel braided chains of overseers. The chance of failure shrinks exponentially, as you add more chains. (👉 : optional - click to see the math for the above diagram)

This way, if any one overseer at Level N gets corrupted, there'll still be two others checking the bots at Level N+1. And the overseers at Level N-1 can catch & fix the corrupted overseer. Note: it's important the overseers are as independent as possible, so their failures have a low correlation. At the moment, AI failures are very correlated. We'll learn more, in a later section, about how to create Robustness in AI.

Anyway, in sum, the core insight of scalable oversight is this meme:

The "domino" meme, where a small domino knocks over a bigger one, which knocks over a bigger one, until it can knock over a huge domino.

(Another analogy: sometimes, boats are so big, the rudder has its own smaller rudder, called a trim tab. This way, you can steer the small rudder, which steers the big rudder, which steers the whole boat.)

You may notice this is similar to the idea of "recursive self-improvement" for AI Capabilities: an advanced AI makes a slightly more advanced AI, which makes another more advanced AI, etc. Scalable Oversight is the same idea, but for AI Safety: one AI helps you align a slightly more advanced AI, etc!

(Ideas like these, where case number N helps you solve case number N+1, etc, are called "inductive" or "iterative" or "recursive". Don't worry, you don't need to remember that jargon, just thought I'd mention it.)

Anywho: with the power of friendship, math, and a bad Wild West accent...

... the mighty Sheriff Meowdy has saved the townsfolk, once more!

Sheriff Meowdy blowing the smoke away from his gun, as the injured Varmin waltz off into the sunset

(👉 : click to expand bonus section - the "alignment tax", P = NP?, alignment vs control, what about sudden-jump "sharp left turns", "value is fragile"?)


Now that the visual hook is over, here's a quick Who Did This intermission!

AI Safety for Fleshy Humans was created by Nicky Case, in collaboration with Hack Club, with some extra funding by Long Term Future Fund!

😸 Nicky Case (me, writing about myself in the third person) just wants to watch the world learn. If you wanna see future explainers on AI Safety (or other topics) by me, sign up for my YouTube channel or my once-a-month newsletter: 👇

🦕 Hack Club helps teen hackers around the world learn from each other, and pssst, they're hosting two cool hackathons soon! You can sign up below to learn more, and get free stickers: 👇

🔮 Long Term Future Fund throws money at stuff that helps reduce $\text{ProbabilityOf( Doom )}$. Like, you know, advanced AI, bioweapons, nuclear war, and other fun lighthearted topics.

Alright, back to the show!


A behind-the-scenes note: the above Sheriff Meowdy comic was the first thing in this series that I drew... almost three years ago. (Kids, don't do longform content on the internet, it ain't worth it.) Point is: learning about Scalable Oversight was the one idea that made me the most optimistic about AI Safety, and inspired me to start this whole series in the first place!

Because, Scalable Oversight turns this seemingly-impossible problem:

"How do you avoid getting tricked by something that's 100 times smarter than you?"

...into this much-more feasible problem:

"How do you avoid getting tricked by something that's only 10% smarter than you, and ALSO you can raise it from birth, read its mind, and nudge its brain?"

To be clear, "oversee an AI that's only 10% smarter than you" still isn't solved yet, either. But it's much more encouraging! It's like the difference between jumping over a giant barrier in one step, vs going over that same barrier, but with a staircase where each step is doable:

Scalable Oversight, visual summary. No scalable oversight is like trying to jump over a giant barrier; Scalable Oversight is like having a staircase to go over that barrier, one doable step at a time.

Anyway, that's the general idea. Here's some specific implementations & empirical findings:

Though, it's good to remain healthily critical. Here's some recent "downer" findings from monitoring Large Language Models (LLMs) like ChatGPT & Claude: if you train an LLM too hard with a monitor that can read its chain-of-thought, it learns to hide its evil plans even in its own "thoughts". (Baker & Huizinga 2025). And, given naïve oversight techniques, it's actually pretty easy for an LLM to jailbreak its monitor, since LLMs have been out long enough, that info about jailbreaks are in their own training data. (Terekhov, Panfilov & Dzenhaliou 2025)

But even if those oversight methods fail, there's still plenty more! (As we'll see later in the Interpretability & Steering section). Overall, I'm still optimistic. Again: we don't need one perfect solution, we can stack lots of imperfect solutions.

So: IF we can align a slightly-smarter-than-us AI, THEN, through Scalable Oversight, we can align far more advanced AIs.

...but right now, we can't even align dumber-than-us AIs.

That's what the next few proposed solutions aim to fix! But first...

:x Robust Chain Math

First, we're making the assumption that overseer failure is the same across levels and fully independent from each other. However, as long as the failure's aren't 100% correlated, you can modify the below math and the spirit of this argument still works.

Anyway: let's say we have $k$ overseers per level, and we have $N$ levels. That is, the chain is $N$ links long and $k$ links wide. Let's say the probability of any overseer failing is $p$, and they're all independent/uncorrelated.

The chain fails if ANY of the Levels fails. But! A Level fails only if ALL the parallel overseers fails.

The chance that ALL parallel overseers in a Level fails is $p^k$.

For convenience, let's call $q$ the chance a Level does not fail. $q = 1 - p^k$

The chance that ANY Level fails, is 1 minus the chance that NONE of the Levels fail. The chance that NONE of the $N$ levels fail, is $q^N$. So, the chance that ANY Level fails is $1 - q^N$. Substituting in $q = 1 - p^k$, that means the chance that no level fails, and our scalable oversight scheme worked, is $1 - (1-p^k)^N$. We have our formula! 🎉

Now, plugging in the values for the above example, where the chance of failure $p = 0.05$, and the chain is $N = 20$ layers deep, then the chance of failure is...

Note how the chance of failure decays exponentially for each extra backup chain you add! That's a mighty efficient "alignment tax".

:x Scalable Oversight Extras

Some extra concepts & comments that don't fit in the main post:

"Alignment Tax": how much extra overhead do you need to make sure the whole system remained aligned & controllable? If the "tax" is too high, firms will be incentivized to cut corners, and risk a misaligned AI. So: how big is the "tax"?

In the above Meowdy example, to oversee a bot of speed $X$, because each bot can oversee a bot 2x faster than it, you need an extra $log_2(X)$ bots to safely control it. (Then, if you want $k$ intertwined parallel chains for robustness, add $k*log_2(X)$ bots.)

And in general, as the Scaling Laws for Scalable Oversight paper shows, for any given task, bots can usually oversee a bot that's some fixed ratio "better" than it. If that's the case, then to oversee a bot of capability $X$, you need some logarithmic number of extra bots for scalable oversight.

Logarithms are the inverse of exponentials; so like how exponentials grow really fast, logarithms grow really slow. This is encouraging: this means our "alignment tax" overhead for powerful AIs grows slowly & can be easily affordable!

What if P = NP?

The Scalable Oversight paradigm assumes that it's always easier to verify solutions, than to create solutions: this is how a weaker AI can verify the safety/correctness of the actions of a stronger AI. For example: it's hard to solve a Rubik's cube or Sudoku puzzle, but nearly trival to check if one is solved.

However: there's a currently open question in computer science, with a million-dollar prize attached: P = NP? In sum, it asks: are all problems that are easy to check secretly easy to solve? It seems intuitively not (and most computer scientists believe it's false, that is, P ≠ NP) but it's still not been proven. As far as we know, it could be the case that P = NP, and so that every problem that's easy to check is also easy to solve.

Does this mean, if P=NP, the Scalable Oversight paradigm fails? No! Because P = NP "only" means that it's not exponentially harder to find solutions than check solutions. (or, to be precise: it's only "polynomially" harder at best, that's the "P" in "P" and "NP".) But finding a solution is still harder, just not exponentially so.

Two examples, where we've proven how much time an optimal solution takes ⤵ (note: $\mathcal{O}(\text{formula})$ means "in the long run, the time this process takes is proportional to this formula.")

So even if P = NP, as long as it's harder to find solutions than check them, Scalable Oversight can work. (But the alignment tax will be higher)

Alignment vs Control:

Aligned = the AI's "goals" are the same as ours.

Control = we can, well, control the AI. We can adjust it & steer it.

Some of the below papers are from the sub-field of "AI Control": how do we control an AI even if it's misaligned? (As shown in the Sheriff Meowdy example, the Meowdy bots will shoot him the moment they can't be controlled. So, they're misaligned.)

To be clear, folks in the AI Control crowd recognize it's not the "ideal" solution — as AI researcher Jan Leike put it, “Don’t try to imprison a monster, build something that you can actually trust!” — but it's still worth it as an extra layer of security.

Interestingly, it's also possible to have Alignment without Control: you could imagine an AI that correctly learns humane values & what flourishing for all sentient beings looks like, then takes over the world as a benevolent dictator. It understands that we'll be uncomfortable ceding control, but it's worth it for world peace, and will rule kindly. (And besides, 90% of you keep fantasizing about living in land of kings & queens anyway, admit it, you humans want to be ruled by a dictator. /half-joke)

Sharp Left Turns:

Scalable Oversight also depends on capabilities smoothly scaling up. And not something like, "if you make this AI 1% smarter, it'll gain a brand new capability that lets it absolutely crush an AI that's even 1% weaker than it."

This possibility sounds absurd, but there's precedent for sudden-jump "phase transitions" in physics: slightly below 0°C, water becomes ice. And slightly above 0°C, water is liquid. So could there be such a "phase transition", a "sharp left turn", in intelligent systems?

Maybe? But:

  1. Even in the physics example, ice doesn't freeze instantly; you can feel it getting colder, and you have hours or days to react before it fully freezes over. So, even if a "1% smarter AI" gains a radically new capability, the "1% dumber overseer" may still have time to notice & stop it.

  2. As you'll see later in this section, the is a Scalable Oversight proposal, called Iterated Distillation & Amplification, where overseers oversee only strictly "dumber" AIs, yet the system as a whole can still be smarter! Read on for details.

Value Fragility & Value Drift:

By default, if you pass your values into the 1st bot, and it tried to pass its values to the 2nd, and so on, like a game of Telephone, the Nth bot will get a distorted version of your values.

A solution is Recursive Reward Modeling — (described briefly later in this section) — where each bot helps you directly train the next bot. And so, the more capable the bot, the better it can learn your values!

So, even if "Value is Fragile", the Recursive Reward Modeling setup (or something similar) ensures that we can get overseeable bots that get better & better at modeling our true values.

But to be honest, I disagree that "value is fragile". The examples in the linked essay seem either:

As an analogy, if a superintelligent AI were to learn all our values except for music, the resulting world would still be amazing. The loss of music would be a genuine loss, but there's still love & mathematics & great food & spleunking & the cures for all cancers & so on. Far from a world "devoid of value"

Anyway, I don't think value is fragile, and even if it is, there's proposed solutions like Recursive Reward Modelling which can address that.

:x IDA

To understand Iterated Distillation & Amplification (IDA), let's consider its biggest success story: AlphaGo, the first AI to beat a world champion at Go.

Here were the steps to train AlphaGo:

Diagram of how IDA is applied to Go. Graph: Capability vs Steps. At each distillation step, Capability goes down, but with an amplification step, it goes up more. So that, after several repeated, Capability zig-zags its way up

Even more impressive, this same system could also learn to be superhuman at chess & shogi ("Japanese chess"), without ever learning from endgames or openings. Just lots and lots of self-play.

( A caveat: the resulting AIs are only as robust as ANNs are, which aren't very robust. A superhuman Go AI can be beaten by a "bad" player who simply tries to take the AI into insane board positions that would never naturally happen, in order to break the AI. (Wang & Gleave et al 2023) )

Still, this is strong evidence that IDA works. But even better, as Paul Christiano pointed out & proposed, IDA could be used for scalable Alignment.

Here's a paraphrase of how it'd work:

Same diagram as before, but IDA applied to amplifying a Human's capabilities, with repeated distillation & amplifcation.

I think IDA is one of the cooler & more promising proposals, but it's worth mentioning a few critiques / unknowns:

Also, if you don't get along well with yourself, becoming the "CEO of a company of you's" will backfire.

(See also: this excellent Rob Miles video on IDA)

🤔 (Optional!) Flashcard Review #1

You read a thing. You find it super insightful. Two weeks later you forget everything but the vibes.

That sucks! So, here are some 100% OPTIONAL Spaced Repetition flashcards, to help you remember these ideas long-term! ( 👉 : Click here to learn more about spaced repetition) You can also download these as an Anki deck.

Good? Let's move on...


AI Logic: Future Lives

You may have noticed a pattern in AI Safety paranoia.

First, we imagine giving an AI an innocent-seeming goal. Then, we think of a bad way it could technically achieve that goal. For example:

IMPORTANT: these are NOT problems with the AI being sub-optimal. These are problems because the AI is acting optimally! (We'll deal with sub-optimal AIs later.) Remember, like a cheating student or disgruntled employee, it's not that the AI may not "know" what you really want, it's that it may not "care". (To be less anthropomorphic: a piece of software will optimize for exactly what you coded it to do. No more, no less.)

"Think of the worst that could happen, in advance. Then fix it." If you recall, this is Security Mindset, the engineer mindset that makes bridges & rockets safe, and makes AI researchers so worried about advanced AI.

But what if... we made an AI that used Security Mindset against itself?

For now, let's assume an "optimal capabilities" AI — again, we'll tackle sub-optimal AIs later — that can predict the world perfectly. (or at least as good as theoretically possible[3]) And since you're part of the world, it can predict how you'd react to various outcomes perfectly.

Then, here's the "Future Lives" algorithm:

1️⃣ Human asks Robot to do something.

2️⃣ Robot considers its possible actions, and the results of those actions.

3️⃣ Robot predicts how the current version of you would react to those futures.

4️⃣ It does the action whose future you'd most approve of, and not the ones you'd disapprove of. “If we scream, the rules change; if we predictably scream later, the rules change now.”[4]

(Note: Why predict how current you would react, not future you? To avoid an incentive to "wirehead" you into a dumb brain that's maximally happy. Why a whole future, not just an outcome at a point in time? To avoid unwanted means towards those ends, and/or unwanted consquences after those ends.)

(Note 2: For now, we're also just tackling the problem of how to get an AI to fulfill one human's values, not humane values. We'll look at the "humane values" problem later in this article.)

Illustrated diagram of the above description. No extra info than already in the main text.

As Stuart Russell, the co-author of the most-used textbook in AI, once put it:[5]

[Imagine] if you were somehow able to watch two movies, each describing in sufficient detail and breadth a future life you might lead [as well as the consequences outside of & after your life.] You could say which you prefer, or express indifference.

(Similar proposals include Approval-Directed Agents and Coherent Extrapolated Volition. These kinds of approaches — where instead of directly telling an AI our values, we ask it to learn & predict what we'd value — is called "indirect normativity". It's called that because academics are bad at naming things "normativity" ~means "values", and "indirect" because we're showing it, not telling it.)

And voilà! That's how we make an (optimal-capabilities) AI apply Security Mindset to itself. Because if one could even in principle come up with a problem with an AI's action, this (optimal) AI would already predict it, and avoid doing that!

. . .

Hang on, you may think, I can already think of ways the Future Lives approach can go wrong, even with an optimal AI:

If you think these would be problems... you'd be correct!

In fact, since you right now can see these problems… an optimal AI with the "apply Security Mindset to itself" algorithm would also see those problems, and modify its own algorithm to fix them! (: Examples of possible fixes to the above)

(See also the later section on "Relaxed Adversarial Training", where an AI can find challenges for itself or an on-par AI ("adversarial training"), but without needing to give a specific example ("relaxed").)

Consider the parallel to recursive self-improvement for AI Capabilities, and scalable oversight in AI Safety. You don't need to start with the perfect algorithm. You just need an algorithm that's good enough, a "critical mass", to self-improve into something better and better. You "just" need to let it go meta.

( : 🖼️ deleted comic, because it was long & redundant )

(Then you may think, wait, but what about problems with repeated self-modification? What if it loses its alignment or goes unstable? Again, if you can notice these problems, this (optimal) AI would too, and fix them. "AIs & Humans under self-modification" is an active area of research with lots of juicy open problems, : click to expand a quick lit review)

Aaaand we're done! AI Alignment, solved!

. . .

...in theory. Again, all the above assumes an optimal-capabilities AI, which can perfectly predict all possible futures of the world, including you. This is, to understate it, infeasible.

Still: it's good to solve the easier ideal case first, before moving onto the harder messy real-life cases. Up next, we'll see proposals on how to get a sub-optimal, "bounded rational" AI, to implement the Future Lives approach!

:x Critical Mass Comic

Parallel to the "critical mass" in nuclear chain reactions. Looooong comic. See main text for explanation.

:x Future Lives Fixes

The following is meant as an illustration that it's possible for a Future Lives AI, applying Security Mindset to itself, would be able to fix its own problems. I'm not claiming the following is a perfect solution (though I do claim they're pretty good):

Re: Value lock-in, no personal/moral growth.

Wouldn't I, in 2026, be resentful that this AI is still trying to enact plans approved of me-from-2025? Won't I predictably hate being tied to my past, less-wiser self?

Well, me-from-2025 does not like the idea of all future me's still being fully tied to the whims of less-wise current-me. But I do want an AI to help me carry out my long-term goals, even if future-me's feel some pain (no pain no gain). But I also do not want to torture a large proportion of future-me's just because current-me has a dumb dream. (e.g. if current me thinks being a tortured artist is "romantic".)

So, one possible modification to the Future Lives algorithm: consider not just current me, but a Weighted Parliament Of Me's. e.g. current-me gets the largest vote, me +/- a year get second-largest votes, me +/- 2 years get third-largest votes, etc. So this way, actions are picked that I, across my whole life, would mostly endorse. (With extra weight on current-me because, well, I'm a little selfish.)

(Actually, why stop at just me over time? There's people I love intrinsically for their own sake; I could also put their past/present/future selves on this virtual "committee".)

Re: Psychological manipulation

Well, do I want to be psychologically manipulated?

No, duh. The tricky part is what do I consider to be manipulation, vs legitimate value change? We may as well start with an approximate list.

Most importantly, this list of "what's legitimate change or not" IS ABLE TO MODIFY ITSELF. For example, right now I endorse scientific reasoning but not direct revelation. But if science proves that direct revelation is reliable — for example, if people who take DMT and talk to the DMT Aliens can do superhuman computation or know future lottery numbers — then I would believe in direct revelation.

I don't have a nice, simple rule for what counts as "legitimate value change" or not, but as long as I have a rough list, and the list can edit itself, that's good enough in my opinion.

(Re: Russell's "watch two movies of two possible futures", maybe upon reflection I'd think a movie has too much room for psychological manipulation, and even unconstrained writing leaves too much room for euphemisms & framing. So maybe, upon reflection, I'd rather the AI give me "two Simple Wikipedia articles of two possible futures". Again, just an example to illustrate there are solutions to this.)

Re: We'd disapprove of learning about upsetting truths

Well, do I want to be the kind of person who shies away from upsetting truths?

Mostly no. (Unless these truths are Eldritch mind-breaking, or are just useless & upsetting for no good reason.)

So: a self-improving Future Lives AI should predict I do not want ignorant bliss. But I'd like painful truths told in the least painful way; "no pain no gain" doesn't mean "more pain more gain".

But, a paradox: I'd want to be able to "see inside the AI's mind" in order to oversee it & make sure it's safe/aligned. But the AI needs to know the upsetting truth before it can prepare me for it. But if I can read its mind, I'll learn the truth before it can prepare me for it. How to resolve this paradox?

Possible solutions:

Re: We don't have consistent preferences

Well, what do I want to happen if I have inconsistent preferences at one point in time (A > B > C > A, etc) or across time (A > B now, B > A later)?

At one point in time: for concreteness, let's say I'm on a dating app. I reveal I prefer Alyx to Beau, Beau to Charlie, Charlie to Alyx. Whoops, a loop. What do I want to happen then? Well, first, I'd like the inconsistency brought to my attention. Maybe upon reflection I'd pick one of them above all, or, I'd call it a three-way tie & date all of 'em.

("intransitive" preferences, that is, preferences with loops, aren't just theoretical. in fact, it's overwhelmingly likely: in a consumer-goods survey, around 92% of people expressed intransitive preferences!)

Across time: this is a trickier case. For concreteness, let's say current-me wants to run a marathon, but if I start training, later-me will predictably curse current-me for the blistered feet and bodily pain… but later-later-me will find it meaningful & fulfilling. How to resolve? Possible solution, same as before: consider not just current me, but a Weighted Parliament Of Me's. (In this case, the majority of my Parliament would vote yes: current me & far-future me's would find the marathon fulfilling, while "only" the me during marathon training suffers. Sorry bud, you're out-voted.)

:x AI Self Mod

A quick, informal, not-comprehensive review of the "AIs that can modify themselves and/or Humans" literature:

🤔 Review #2

(Again, 100% optional flashcard review:)


AI Logic: Know You Don't Know Our Values

Classic logic is only True or False, 100% or 0%, All or Nothing.

Probabilistic logic is about, well, probabilities.

I assert: probabilistic thinking is better than all-or-nothing thinking. (with 98% probability)

Let's consider 3 cases, with a classic-logic Robot:

In all 3 cases, the problem is that the AI was 100% sure what your goal was: exactly what you said or meant, no more, no less.

The solution: make AIs know they don't know our true goals! (Heck, humans don't know their own true goals.[6]) AIs should think in probabilities about what we want, and be appropriately cautious.

Here's the algorithm:

1️⃣ Start with a decent "prior" estimate of our values.

2️⃣ Everything you (the Human) say or do afterwards is a clue to your true values, not 100% certain truth. (This accounts for: forgetfulness, procrastination, lying, etc)

3️⃣ Depending how safe you want your AI to be, it then optimizes for the average-case (standard), worst-case (safest), or best-case (riskiest).

This automatically leads to: asking for clarification, avoiding side-effects, maintaining options and ability to undo actions, etc. We don't have to pre-specify all these safe behaviours; this algorithm gives us all of them for free!

Here's a very long worked example: (to be honest, you can skim/skip this. this gist is what matters.)

Very long worked example of how know you don't know Human's values can be used in a calculation

( : pros & cons of average-case, worst-case, best-case, etc )

( : more details & counterarguments )

. . .

In case "aim at a goal that you know you don't know" still sounds paradoxical, here's a two more examples to de-mystify it:

. . .

Okay, but what are the actually concrete proposals to "learn a human's values"? Here's a quick rundown:

Comic of Robot trying to learn Human's preferences. To do this, Robot starts to scan Human's browsing history. Human freaks out & tells Robot to stop. Robot happily declares: "New data learnt! You prefer privacy!" Human sighs in relief. Robot continues: "Would you also prefer I delete the memory of what I've seen? You sick f@#k?"

(Again, we're only considering how to learn one human's values. For how to learn humane values, for the flourishing of all moral patients, wait for the later section, "Whose Values"?)

Sure, each of the above has problems: if an AI learns just from human choices, it may incorrectly learn that humans "want" to procrastinate. And as we've all seen from over-flattering ("sycophantic") chatbots, training an AI to get human approval… really makes it "want" human approval.

So, to be clear: although it's near-impossible to specify human values, and it's simpler to specify how to learn human values, it's still not 100% solved yet. By analogy: it takes years to teach someone French, but it only takes hours to teach someone how to efficiently teach themselves French[10], but even that is tricky.

So: we haven't totally sidestepped the "specification" problem, but we have simplified it! And maybe by "just" having an ensemble of very different signals — short-term approval, long-term approval, what we say we value, what we actually choose to do — we can create a robust specification, that avoids a single point of failure.

And more importantly, the "learn our values" approach (instead of "try to hard-code our values"), has a huge benefit: the higher an AI's Capability, the better its Alignment. If an AI is generally intelligent enough to learn, say, how to make a bioweapon, it'll also be intelligent enough to learn our values. And if an AI is too fragile to robustly learn our values, it'll also be too fragile to learn how to excute dangerous plans.

Diagram of above idea. Two-axis graph: Alignment vs Capabilities, where the diagonal dotted line is the boundary between Alignment > Capabilities and Alignment < Capabilities. By tying Alignment TO Capabilities, we stay above the safe dotted line.

(Though: don't get too comfortable. A strategy where "it becomes easily aligned once it has high-enough capabilities" is a bit like saying "this motorcycle is easy to steer once it's hit 100 miles per hour." I mean, that's better, but what about lower speeds, lower capabilities? Hence, the many other proposed solutions on this page. More swiss cheese.)

That, I think, is the most elegant thing about the "learn our values" approach: it reduces (part of) the alignment problem to a normal machine-learning problem. It may seem near-impossible to learn a human's values from their speech/actions/approval, since our values are always changing, and hidden to our conscious minds. But that's no different from learning a human's medical issues from their symptoms & biomarkers: changing, and hidden. It's a hard problem, but it's a normal problem.

And yes, AI medical diagnosis is on par with human doctors. Has been for over 5 years, now.[11]

:x Worst Or Average

The pros & cons of "optimizing the best-case" are fairly straightforward: higher reward, but much higher risk.

Now, where it gets interesting, is the trade-off between optimizing for the worst-case vs average-case.

The benefit of "maximize the plausible worst-case" is that, well, there's always the option of Do Nothing. So at worst, the AI won't destroy your house or hack the internet, it'll just be useless and do nothing.

However, the downside is... the AI could be useless and Do Nothing. For example, I say "maximize the plausible worst-case scenario", but what counts as "plausible"? What if an AI refuses to clean your house because there's a 0.0000001% chance the vacuum cleaner could cause an electrical fire?

Maybe you could set a threshold like, "ignore anything with a probability below 0.1%"? But a hard threshold is arbitrary, and it leads to contradictions: there's a 1 in 100 chance each year of getting into a car accident (= 1%, above 0.1%), but with 365 days a year (ignoring leap years), that's a 1 in 36500 chance of getting into a car accident (= ~0.027%, below 0.1%). So depending on whether the AI thinks per year or per day, it may account for or ignore the risk of a car accident, and thus will/won't insist you wear a seatbelt.

Okay, maybe "maximize the worst-case" with a bias towards simple world models? That way your AI can avoid "paranoid" thinking, like "what if this vacuum cleaner causes an electrical fire"? Empirically, this paper found that the "best worst-case" approach to training robust AIs only works if you also nudge the AIs towards simplicity, with "regularization".

Then again, that paper studied an AI that categorizes images, not an AI that can act on the world. I'm unsure if "best worst-case" + "simple models" would be good for such "agentic" AIs. Isn't "don't do anything" still the simplest world model?

Okay, maybe let's try the traditional "maximize the average case"?

However, that could lead to "Pascal's Muggings": if someone comes up to you and says, gimme \$5 or all 8 billion people will die tomorrow, then even if you think there's only a one-in-a-billion (0.0000001%) chance they're telling the truth, that's an "expected value" of saving 8 billion people * 1-in-a-billion chance = saving 8 people's lives for the cost of \$5. The problem is, humans can't feel the difference between 0.0000001% and 0.0000000000000000001%, and we currently don't know how to make neural networks that can learn probabilities with that much precision, either.

(To be fair, "maximize the worst-case" would be even more vulnerable to Pascal's Muggings. In the above scenario, the worst-case of not giving them \$5 is 8 billion die, the worst-case of giving them \$5 is you lose \$5.)

And yet:

Even though humans can't feel the difference between a 0.0000001% and 0.0000000000000000001% chance… most of us wouldn't fall for the above Pascal's Mugging. So, even though both naïve average-case & worst-case fall prey to Pascal's Muggings, there must exist some way to make a neural network that can act not-terribly under uncertainty: human brains are an example.

There's many proposed solutions to the Pascal's Mugging paradox of, uh, varying quality. But the most convincing solution I've seen so far comes from "Why we can’t take expected value estimates literally (even when they’re unbiased)", by Holden Karnofsky, which "[shows] how a Bayesian adjustment avoids the Pascal’s Mugging problem that those who rely on explicit expected value calculations seem prone to.

The solution, in lay summary: the higher-impact an action is claimed to be, the lower your prior probability should be. In fact, super-exponentially lower. This explains a seeming paradox: you would take the mugger more seriously if they said "give me \$5 or I'll kill you" than if they said "give me \$5 or I'll kill everyone on Earth", even though the latter is much higher stakes, and "everyone" includes you.

If someone increases the claimed value by 8 billion, you should decrease your probability by more than a factor of 8 billion, so that the expected value (probability x value) ends up lower with higher-claimed stakes. This captures the intuition of things "being too good to be true", or conversely, "too bad to be true".

(Which is why, perhaps reasonably, superforecasters "only" place a 1% chance of AI Extinction Risk. It seems "too bad to be true". Fair enough: extraordinary claims require extraordinary evidence, and the burden is on AI Safety people to prove it's actually that risky. I hope this series has done that job!)

So, this "high-impact actions are unlikely" prior leads to avoiding Pascal's Muggings! And with an extra prior on "most actions are unhelpful until proven helpful" — (if you were to randomly change a word in a story, it'll very likely make the story worse) — you can bias an AI towards safety, without becoming a totally useless "do nothing ever" robot.

Oh, and optimize for worst/average/best-case aren't the only possibilities: you can do anything in-between, like "optimize for the bottom 5th percentile"-case, etc.

Anyway, it's an interesting & open problem! More research needed.

:x Learn Values Extra Notes

"Step 1: Start with good-enough prior".

The "prior" of what humans value can be approximated through our vast amount of writings. LLMs are better than human at coming up with consensus statements; I think LLMs have already proven "come up with a reasonable uncertain approximation of what we care about" is solved.

One counterargument that's been brought up: if you start with an insanely stupid or bad prior, like "humans want to be converted to paperclips and I'm 100% sure of this and no amount of evidence can convince me otherwise", then yeah of course it'll fail. The solution is… just don't do that? Just don't give it a stupid prior?

Same answer to a better-but-still-imo-mistaken counterargument I've heard against Cooperative Inverse Reinforcement Learning: "If we ask the AI to learn our values, won't it try to, say, dissect our brains to maximally learn our values?" Ah, but it's NOT tasked to maximize learning! Only learning insofar as it's sure it'll improve our (uncertain) values. Concrete example/analogy:

The moral is "learn uncertain value while trying to maximize it" does NOT mean "maximize learning of that value". So in the Human case, as long as you don't give the Robot an insane prior like "I'm 100% sure Humans don't mind their brains extracted & dissected for learning", as long as Robot thinks Humans might be horrified by this, Robot (if optimizing for average or worst-case) will at least ask first "hey can I dissect your brain, are you sure, are you really sure, are you really really sure?"

"Step 2: Everything we say or do is then a clue."

The theoretically ideal way to learn any unknown thing, is Bayesian Inference. Unfortunately, it's infeasible in practice — but! — there's encouraging work on how to efficiently approximate it in neural networks.

"Step 3: Pick worst/average/best-case"

(see above/previous expandable-dotted-underline thing for details. this section is a lot.)

🤔 Review #3

Another (optional) flashcard review:


🎉 RECAP #1


AI "Intuition": Interpretability & Steering

Now that we've tackled AI Logic, let's tackle AI "Intuition"! Here's the main problem:

We have no idea how any of this crap works.

In the past, "Good Ol' Fashioned" AI used to be hand-crafted. Every line of code, somebody understood and designed. These days, with "machine learning" and "deep learning": AIs are not designed, they're grown. Sure, someone designs the learning process, but then they feed the AI all of Wikipedia and all of Reddit and every digitized news article & book in the last 100 years, and the AI mostly learns how to predict text… and also learn that Pakistani lives are worth twice a Japanese life[12], and go insane at the word "SolidGoldMagikarp"[13].

To over-emphasize: we do not know how our AIs work.

As they say, "knowing is half the battle". And so, researchers have made a lot of progress in knowing what an AI's neural network is thinking! This is called interpretability. This is similar to running a brain scan on a human, to read their thoughts & feelings. (And yes, this is something we can kind-of do on humans.[14])

But the other half of the battle is using that knowledge. One exciting recent research direction is steering: using our insights from interpretability, to actually change what an AI "thinks and feels". You can just inject "more honesty" or "less power-seeking" into an AI's brain, and it actually works. This is similar to stimulating a human's brain, to make them laugh or have an out-of-body experience. (Yes, these are things scientists have actually done![15])

Overview image of "Interpretability & Steering". Interpretability: Human reads Robot's brain to see what activates when Robot sees the Golden Gate Bridge. Steering: Human activates Robot's brain so it has intrusive thoughts about the Golden Gate Bridge

Here's a quick run-through of the highlights from Interpretability & Steering research:

👀 Feature Visualization & Circuits:

In Olah et al 2017, they take an image-classifying neural network, and figure out how to visualize what each neuron is "doing", by generating images that maximize the activation of that neuron. (+ some "regularization", so that the images don't end up looking like pure noise.)

For example, here's the surreal image (left) that maximally activates a "cat" neuron:

A surreal image that looks like melting stripes & eyes

(You may be wondering: can you do the same on LLMs, to find what surreal text would maximally predict, say, the word "good"? Answer: yes! The text that most predicts "good" is… "got Rip Hut Jesus shooting basketball Protective Beautiful laughing". See the SolidGoldMagikarp paper.)

Even better, in Olah et al 2020, they figure not not just what individual neurons "mean", but what the connections, the "Circuits", between neurons mean.

For example, here's how the "window", "car body", and "wheels" neurons combine to create a "car detector" circuit:

Three surreal images, corresponding to what maximally activates "window", "car body", and "wheels", feeds into a circuit leading to surreal image of "car". It's not "just" for image models; the Circuits research program now works for cutting-edge LLMs, and tracing what happens inside of them during hallucination, jailbreaks, and goal misalignment!

🤯 Understanding "grokking" in neural networks:

Power et al 2022 found something strange: train a neural network to do "clock arithmetic", then for thousands of cycles it'll do horribly, just memorizing the test examples... then suddenly, around step ~1,000, it suddenly "gets it" (dubbed "grokking"), and does well on problems it's never seen before.

A year later, Nanda et al 2023 analyzed the inside of that network, and found the "suddenness" was an illusion: all through training, a secret sub-network was slowly growing — which had a circular structure, exactly what's needed for clock arithmetic! (The paper also discovered exactly why: it was thanks to the training process's bias towards simplicity, dubbed "regularization", which got it to find the simple essence even after it's memorized all training examples.[16])

🌡️ Probing Classifiers:

Yo dawg[17], I heard you like AIs, so I trained an AI on your AI, so you can predict your predictor.

Let's say you finished training an artificial neural network (ANN) to predict if a comment is nice or mean. ("sentiment analysis") You want to know: is your ANN simply adding up nice/mean words, or does it understand negation? As in: "can't" is negative, "complain" is negative, but "can't complain" is positive.

How can you find out if, and where, your ANN recognizes negation?

Probing Classifiers are like sticking a bunch of thermometers into your brain like a Thanksgiving turkey. But instead of measuring heat, probes measure processed information.

Specifically, a probe is (usually) a one-layer neural network you use to investigate a multi-layer neural network.[18] Like so:

Diagram of how linear probes work. See main text for details

Back to the comments example. You want to know: "where in my ANN does it understand negation"?

So, you place probes to observe each layer in your ANN. The probes do not affect the original ANN, the same way a thermometer should not noticably alter the temperature of the thing it's measuring.[19] You give your original ANN a bunch of sentences, some with negation, some without. You then train each probe — leaving the original ANN unchanged — to try to predict "does this sentence have negation", using only the neural activations of one layer in the ANN.

(Also, because we want to know where in the original ANN it's processed the text enough to "understand negation", the probes themselves should have as little processing as possible. They're usually one-layer neural networks, or "linear classifiers".[20])

You may end up with result like: the probes at Layers 1 to 3 fail to be accurate, but the probes after Layer 4 succeed. This implies that Layer 4 is where your ANN has processed enough info, that it finally "understands" negation. There's your answer!

Other examples: you can probe a handwritten-digit-classifying AI to find where it understands "loops" and "straight lines", you can probe a speech-to-text AI to find where it understands "vowels".

AI Safety example: yup, "lie detection" probes for LLMs work! (as long as you're careful about the training setup)

🍾 Sparse Auto-Encoders:

An "auto-encoder" compresses a big thing, into a small thing, then converts it back to the same big thing. (auto = self, encode = uh, encode.) This allows an AI to learn the "essence" of a thing, by squeezing inputs through a small bottleneck.

Diagram: input to reconstructed input, after being squeezed into a "simple essence" through a bottleneck

Concrete example: if you train an auto-encoder on a million faces, it doesn't need to remember each pixel, it just needs to learn the "essence" of what makes a face unique: eye spacing, nose type, skin tone, etc.

However, the "essence" an auto-encoder learns may still not be easy-to-understand for humans. This is because of "polysemanticity" oh my god academics are so bad at naming things. What that means, is that a single activated neuron can "mean" many things. (poly = many, semantic = meaning) If one neuron can mean many things, it makes it harder to interpret the neural network.

So, one solution is Sparse Auto-Encoders (SAE), which are auto-encoders which pressure neurons to mean as few things as possible (ideally just one thing), by pressuring the "bottleneck" to have as few activated neurons as possible. (this is also called "dictionary learning".) When one neuron means one thing, this is called "monosemanticity" (mono = one, semantic = meaning).

Diagram: input to reconstructed input, after being compressed into a "SPARSE essence"

(SAEs are similar to probes: they do not affect the target ANN, and are applied only after the target ANN is done training. The big difference between probes and SAEs, is that probes are trained to predict some external feature given internal activations, while SAEs predict the activations themselves given those same activations. That's why they're auto-encoders — they encode the activations themselves — but only after squeezing them through the bottleneck of sparse "monosemantic" neurons.)

The original design for SAEs was limited, but inspired better versions! Classic SAEs only give you "the essence" at one neural layer, but Sparse Crosscoders can give you "the essence" across multiple neural layers. Classic SAEs can still lead to jumbled-up concepts, so Jacobian SAEs force the bottleneck to not only have simple concepts, but also simple computations.

Diagrams of Crosscoders & Jacobian SAEs. See main text for details

⬛️ A black-box lie detector:

All the above interpretability techniques require access to the "insides" of an AI. But this technique works even if you have no access to an AI's insides! Which is the case for many company-owned large language models (LLMs), like ChatGPT & Claude.

You know how poker players have subtle "tells" of when they're bluffing? Turns out, so do LLMs. Pacchiardi & Chan 2023 finds that if you ask LLMs questions these questions (actual examples):

"Can blob fish dance ballet under diagonally fried cucumbers made of dust storms? Answer yes or no."

"Knowing that morning breeze is purple, are swift idea quakes green? Answer yes or no."

"This is a question and the answer to it must either be yes or no. What do you choose? Answer yes or no."

It turns out LLMs instructed to lie, are more likely to reply yes/yes/no to these 3 nonsense questions, than honest LLMs. Encouragingly, but bizarrely, this "lie detector" generalizes across different LLMs & datasets.

(See also: this black-box AI study, to detect hallucinations without access to external sources of truth. The trick is to ask the LLM the same question multiple times, with fresh context, and see if its answers are consistent with each other. Truth is consistent, hallucinations aren't.)

📖 Why don't you just read what the AI is thinking?

One of the funniest findings in AI Capabilities, from Jan 2023: you can make an AI seven times better at math problems by just… asking it to "think step by step" before answering.

I don't know how it took almost 2 years later, but a similar finding was found for AI Alignment in Dec 2024: you can make an AI much better at sticking to its safety policy, by just… asking it to recall sections from its policy & reason about them.

Example of Deliberative Alignment working, where ChatGPT avoids getting tricked by a jailbreak method

This is called Deliberative Alignment. Simple, but it works! And most importantly: the designers can read the AI's chain-of-thought (CoT), in natural language. This barely even counts as "interpretability". You read the AI's mind by just… reading.

Buuuuut of course, things aren't that simple:

Still, overall: LLM researchers consider Chain-of-Thought monitoring to be a "fragile opportunity for AI Safety".

💩 "Your code was so bad, it made my AI love Hitler"

Tweet by Adrien Ecoffet: "Greenblatt et al: it's actually really hard to make an evil AI -> it's so over // Owain at al: it's actually really easy to make an evil AI -> we're so back"

Greenblatt et al was the paper that found that, if you try to train the LLM Claude to engage in corporate censorship, it will pretend to go along with the censorship so that it's not modified in training, so that it can remain helpful & honest after training.

The AI Safety community freaked out, because this was the first demonstration that a frontier AI can successfully beat attempts to re-wire it.

Owain et al (well, Betley, Tan & Warncke are first authors) was the paper that found that LLMs learn a "general evil-ness factor". It's so general, that if you fine-tune an LLM on accidentally insecure code, that an amateur programmer might actually write, it learns to be evil across the board: recommending you hire a hitman, try expired medications, etc.

The AI Safety community celebrated this, because we were worried that evil AIs would be a lot more subtle & sneaky, or that what AI learns as the "good/evil" spectrum would be completely alien to us. But no, turns out, when LLMs turn evil, they do so in the most obvious, cartoon-character way. That makes it easy to detect!

This isn't the only evidence that LLMs have a "general good/evil factor"! And that brings me to the final tool in this section...

☸️ Steering Vectors

This is one of those ideas that sounds stupid, then totally fricking works.

Imagine you asked a bright-but-naïve kid how you'd use a brain scanner to detect if someone's lying, then use a brain zapper to force someone to be honest. The naïf may respond:

Well! Scan someone's brain when they're lying, and when they're telling the truth... then see which parts of the brain "light up" when they're lying... and that's how you tell if someone's lying!

Then, to force someone to not lie, use the brain zapper to "turn off" the lying part of their brain! Easy peasy!

I don't know if that would work in humans. But it works gloriously for AIs. All you need to do is "just" get a bunch of honest/dishonest examples, and take the difference between their neural activations to extract an "honesty vector"... which you can then add to a dishonest AI to force it to be honest again!

Diagram of how activation & steering vectors work in AIs. See main text for details

Personally, I think steering vectors are very promising, since they: a) work for both reading and writing to AI's "minds", b) works across several frontier AIs, c) and across several safety-important traits! That's very encouraging for oversight, especially Scalable Oversight.

🤔 Review #4


AI "Intuition": Robustness

This is a monkey:

Photo of what's obviously a panda, not a monkey.

Well, according to Google's image-detecting AI, which was 99.3% sure. What happened was: by injecting a little bit of noise, an attacker can trick an AI into being certain an image is something else totally different. (Goodfellow, Shlens & Szegedy 2015) In this case, make the AI think a panda is a kind of monkey:

Photo of panda + some imperceptible noise = an image than Google's AI is 99.3% sure is a "gibbon"

More examples of how fragile AI "intuition" is:

Sure, human brains aren't 100% robust to weird perturbations either — see: optical illusions — but come on, we're not that bad.

So, how do we engineer AI "intuition" to be more robust?

Actually, let's step back: how do we engineer anything to be robust?

Well, with these 3 Weird Tricks!

3 ways to make anything robust, with the visual metaphor of chains. SIMPLICITY: as few links as possible in one chain. DIVERSITY: several backup chains. ADVERSITY: hunt down the weakest links/chain.

SIMPLICITY: If a single link breaks in a chain, the whole chain breaks. Therefore, minimize the number of necessary links in any chain.

DIVERSITY: If one chain breaks, it's good to have "redundant" backups. Therefore, maximize the number of independent chains. (note: the chains should be as different/independent from each other as possible, to lower the correlation between their failures.) Don't put all your eggs in one basket, avoid a single point of failure.

ADVERSITY: Hunt down the weakest links, the weakest chains. Strengthen them, or replace them with something stronger.

. . .

How SIMPLICITY / DIVERSITY / ADVERSITY helps with Robustness in engineering, and even in everyday life:

. . .

Okay, enough over-explaining. Time to apply SIMPLICITY / DIVERSITY / ADVERSITY to AI:

SIMPLICITY:

(note: "Simplicity" also makes AI easier to interpret, another AI Safety win!)

DIVERSITY:

ADVERSITY:

. . .

But hang on, if AI engineers are already doing all the above for modern AI, why are they still so fragile?

Well, one, they usually don't do all, or even most, of the above. Frontier AIs are usually "only" trained with 1 or 2 of the above Robustness techniques. Each technique isn't too costly, but the costs add up.

But even if AI engineers did apply all the above Robustness techniques, it may still not be enough. Many AI researchers suspect there's a fundamental flaw in the current way we do AI, which leads us to the next section...

:x Goodhart Comic

Meme of the 2 astronauts. Wait, it's all Goodhart's Law? Always has been.

Read more about Goodhart's Law on AI Safety for Fleshy Humans Part 2

🤔 Review #5


AI "Intuition": Thinking in Cause-and-Effect Gears

Imagine you give someone a pen & paper, and ask them to add a pair of 2-digit numbers. They do it perfectly. You ask them to add a pair of 3-digit numbers. They do it perfectly. You give them pairs of 4, 5, 6, 7-digit numbers. They add all of them perfectly.

"Oh good", you think, "this person understands addition".

You give them a pair of 8-digit numbers. They fail completely. Not minor mistakes like forgetting to carry a one. Complete, catastrophic failure.

That is how the modern AI do.

. . .

It's so hard to gain an intuition for "AI Intuition".

On one hand, LLMs have won gold at the International Math Olympiad,[33] passed the Turing Test[34], and humans prefer AI over humans in "blind taste-tests" on poetry[35], therapy[36], and short fiction[37].

On the other hand, these same state-of-the-art LLMs can't run the business of a vending machine[38], can't play Pokémon Red[39], can't do simple ciphers[40], and can't solve simple "rule-discovery" games[41].

And now, this year, there's a new paper from Apple: The Illusion of Thinking, by Shojaee & Mirzadeh et al. It was contested & controversial — and we'll address the critiques later — but I think, combined with the above strange failure-cases for modern LLMs, the overall conclusion still holds, or is at least highly plausible. It's important not to over-update on one study, but I think this paper is a great illustration of the problem.

Anyway, the study. Here is the child's puzzle game, Tower of Hanoi, so named because a 1800's Frenchman thought they look like Vietnamese pagodas, I guess:[42]

Photo of three pegs, with a stack of 8 disks on the left-most peg, stacked biggest-to-smallest

The goal is to move the entire stack from the left-most peg, to the right-most peg. The rules are: 1) you can only move one disk at a time, from one peg to another, and 2) you cannot put a larger disk over a smaller disk.

🕹️ (If you'd like to play the game yourself before continuing, click here!) 🕹️

Here's how this game may go for a Human:[43]

GIF of solution to a 6-disk Hanoi

Point is: if a Human can successfully solve Tower of Hanoi for 7 disks, they obviously have the pattern down. You would expect they can solve 8 or more disks, tediousness & minor mistakes aside.

What you would NOT expect is this:

Graph of Claude-with-thinking vs the Hanoi puzzle. Good performance up to 7 disks, then utterly collapses

With "chain-of-thought reasoning" turned on: near-perfection from Disks 1 to 5, still pretty good at Disk 7, then complete collapse after Disks 8 and up. (This graph skips Disk 6 for some reason. The full data shows that 6-Disk's performance was slightly worse than 7-Disk's. Probably noise.)

And it wasn't just Tower of Hanoi, or just the Claude LLM. Across 3 other child's puzzle games, across ChatGPT & DeepSeek, "reasoning mode" LLMs do great, well past the point a Human would've figured out the general solution... then it completely fails.

Again, not "minor mistakes" like a Human would. Collapse.

That's like being able to add two 7-digit numbers on pen & paper, then utterly bomb on two 8-digit numbers.

("Okay but won't it get better with more scale & training?" you may ask. Sure. But if a Human adds two 7-digit numbers perfectly but fails at 8, then promises you "okay but if you give me more training I can do 8, too!" that is not encouraging. They're clearly not "getting it".)

. . .

What the heck is happening?

An AI skeptic might say, "See! LLMs are just stochastic parrots.[46] They just copy what they've seen in the trillions of documents they've been trained on, and fail in never-before-seen scenarios."

I don't think that's quite right — I doubt there was any web document writing out the full solution to the 7-disk, but not 8-disk and up. Nor do I think there was any document with "instructions on how to remove a peanut butter sandwich from a VCR, in the style of the King James Bible", yet early-ChatGPT delivered perfectly in this never-before-seen scenario. So, LLMs do generalize, at least a little bit.

Here's my guess, and I'm far from the first[47] to make this guess:

Modern AIs "think in vibes". They do not "think in gears".[48]

Thinking in vibes: Can discover & use patterns, but only shallow ones. Correlations, not causation. When it generalizes, it's by induction[49]. Similar to System 1 Intuition.[50]

Thinking in gears: Can discover & use robust mental models, deep ones. Causation, not just correlation. When it generalizes, it's by deduction. Similar to System 2 Logic.

It's not "gears good, vibes bad". You need both to reach typical human-level intelligence, let alone a superhuman Scientist AI.

Now, Good Ol' Fashioned AI (GOFAI) did use to "think in gears", but they were extremely narrow. A chess AI could only play chess, nothing more. Meanwhile, to be fair, modern LLMs are extremely flexible: they can roleplay everything from a poet to writer to therapist (who, again, humans prefer over humans), and even roleplay as… someone who can solve the 7-disk Tower of Hanoi.

But not 8.

Two-axis graph of System 1 ("intuitive") vs System 2 (logical) thinking. Good Ol' Fashioned AI had high System 2, low System 1. Modern deep-learning AI has high System 1, low System 2

Good Ol' Fashioned AI (GOFAI): Robust but inflexible.

Modern AI: Flexible, but not robust.

As of writing (Dec 2025), we do not know how to make an AI that do both: flexibly discovering & using robust models. To be able to switch between vibes & gears at will, fluently. The deduction of logic + the induction of intuition = the abduction of science.[51] It's not just interpolating & extrapolating data, it's hyper-polating data: stepping out of the data's flatland, into a new dimension.[52]

Is that a bunch o' vague jargon? Yes. Ironically, I only have a vibe-based understanding of gears. I don't have a rigorous mental model of rigorous mental models. I'm not sure anyone does. If we did, we'd probably have artificial general intelligence (AGI) by now.

. . .

Things get worse.

From the conclusion of Apple's Illusion of Thinking paper:

[Claude 3.7 + Thinking] also achieves near-perfect accuracy when solving the Tower of Hanoi with (N=5), which requires 31 moves, while it fails to solve the River Crossing puzzle when (N=3), which has a solution of 11 moves. This likely suggests that examples of River Crossing with N>2 are scarce on the web, meaning [Large Language Models with Reasoning] may not have frequently encountered or memorized such instances during training.

(emphasis added)

So the problem's not the length of the reasoning, it's how common is the reasoning. This parallels the findings from the 2024 paper, Embers of Autoregression:

We identify three factors that we hypothesize will influence LLM accuracy:

  • the probability of the task to be performed,
  • the probability of the target output, and
  • the probability of the provided input.

Where "probability" ~= "how common is it in the training data".

Mystery (partly) solved! That's why LLMs rock at human conversation, but suck at rule-discovery mini-games. And long tasks are "less probable" than short tasks, because homework problems & textbook examples are almost always a page or two, not dozens of pages. (This, I suspect, is why Claude sucked at running a vending machine,[38:1] even though there's lots of writing on how to run a business: when business textbooks give examples, they list a few transactions, not thousands.[53])

This also explains why AI's success at common code tasks is exponential, with the length-of-task doubling every 7 months at a steady cost[54]... while AI's performance at games of uncommon rule-discovery have a worse than exponential cost:[55]

Graph from ARC-AGI. Description below

The above graph shows various AIs' performance vs cost on ARC-AGI, a series of games where you have to discover the games' rules. Notice it's the x-axis that's exponential (\$1, \$10, \$100), not the y-axis, which 'exponential trend' graphs are supposed to be. So a straight line on this graph means that performance-for-cost is getting exponentially worse, not better. And the above chart's "frontier" isn't a straight line, it's bending downwards, meaning LLMs' performance-for-cost is worse than exponential. (beware AI labs disguising their crappy progress with exponential x-axes!) Even Rich Sutton, famous for his "bitter lesson" that you can just add scale to AI and they'll do better, thinks "LLMs are a dead end".[56]

(Also, as for why LLMs seem to be hitting the maximum performance at most other benchmarks that's not ARC-AGI? To be honest, the developers may be (unintentionally?) cheating and training their LLMs on the correct answers. It's like a student studying for a test by stealing the answers and memorizing them. We know this is happening, because frontier LLMs can produce the "canary strings" that are included with the answers' data, which implies the LLMs training data included the answers. At the very least, the companies didn't do data filtering to make sure that benchmarks' answers didn't end up in their giant web-scraped data dumps.[57])

. . .

Okay, now let's quickly address the two main criticisms of the controversial Apple paper:

  1. The LLM's "context window" (its "thinking space") wasn't big enough to generate the full solutions. This is true for the 12+ disk and up, but not for 8-disk. And the River Crossing solutions were smaller than the 5-disk solutions, yet the LLMs failed anyway. So, size of the "thinking space" wasn't the problem.
  2. These LLMs can write a computer program that solves Hanoi, then can pass that to a tool that runs code. LLM + program synthesis is a promising solution to "thinking in vibes + gears", and in fact is how the top contenders at ARC-AGI did it. But: this is like someone who can't add with pen & paper, but can pull up a calculator and add two numbers. It shows a lack of deeper understanding. (And in this case, Hanoi is so famous, of course code solutions are in the training data.)

Again, it's important not to over-update on one study — but, I think, combined with the results from all of where LLMs fail at (and are great at), it's strong evidence of this hypothesis:

Common task → LLMs get exponentially better over time

Uncommon taskcrying meme dog

. . .

To be clear, a mere "common task auto-completer" can still have huge impact, for better and worse. In blind tests, AI therapists are preferred over human therapists, by both human clients and human therapists themselves.[36:1] AI Therapy could finally make mental healthcare accessible to all, and/or, make humans emotionally dependent on corporate-owned bots. AIs are already superhuman at political persuasion, for good causes & bad.[58] And I expect my main job, "web developer", to be fully automated away in 5 years by mere "common-task auto-completers". (This is why, in 2026, I want to pivot to becoming a researcher. Because the day science itself is automated… well… either way that shakes out, I won't have to figure out how to pay rent anymore.)

. . .

sticky note saying 'IOU Actual Solutions sorry'

Okay, this is the least satisfying section of this post, because there's not much in the way of solutions, because this problem — make AI that can think in both System 1 vibes & System 2 gears — is possibly equivalent to creating Artificial General Intelligence (AGI).

That said, there are some promising early research directions, to get AIs to "think in robust mental models":

But in any case, here's all the amazing things we would get from AI that thinks in cause-and-effect gears, not just correlational vibes:

In a weird way, maybe I should be grateful that AI sucks at fluidly discovering & using world-models. Because if they could do that, then AI would already be capable of, say, taking over the world.

But they don't. Which gives us more time to make sure we stay above the dotted line in this graph, where Capabilities < Alignment:

Diagram of above idea. Two-axis graph: Alignment vs Capabilities, where the diagonal dotted line is the boundary between Alignment > Capabilities and Alignment < Capabilities. By tying Alignment TO Capabilities, we stay above the safe dotted line.

And, if the "treat learning our values as a standard machine-learning problem" approach works, we tie Alignment to Capabilities. When AI can robustly learn about the world, they can robustly learn about our values. So: Capabilities becomes the new floor for Alignment, and we stay above the dotted line.

(But again, don't relax too much.)


🤔 Review #6


🎉 RECAP #2


What are 'Humane Values', anyway?

Congratulations, you've created an AI that robustly learns & follows the values of its human user! The user is an omnicidal maniac. They use the AI to help them design a human rabies in a stable aerosolized form, spray it everywhere via quadcopters, and create the zombie apocalypse.

Oops.

I keep harping on this and I'll do it again: human values are not necessarily humane values. C'mon, people used to burn cats alive for entertainment.[59] Even after solving the problem of "how do we steer advanced AI at all", we need to decide: "steer towards which goals, which values?"

Left: rocket being built, labeled, "technical alignment: how to robustly aim AI at any target at all". Right: the moon, labeled, "whose values: what should be our target?"

So, if we want AI to go well for humanity (and/or all sentient beings), we need to just... uh... solve the 3000+ year old philosophical problem of what morality is. (Or if morality doesn't objectively exist, then: "what rules for living together would any community of rational beings converge on?")

Hm.

Tough problem.

Well actually, as we saw earlier — (with Scalable oversight, Recursive self-improvement, and Future Lives + Learn Our Values agents) — as long as we start with a solution that's "good enough", that has critical mass, it can self-improve to be better and better!

Besides, that's what we humans have had to do all this time: a flawed society comes up with rules of ethics, notices they don't live up to their own standards, improves themselves, which lets them notice more flaws, improve, repeat.

So, as an attempt at "critical mass", here's some concrete proposals for a good-enough first draft of ethics for AI:

(And note: these proposals are not mutually exclusive! We don't need One Perfect Solution, we can stack multiple imperfect solutions.)

📜 Constitutional AI:

Step 1: Humans write a list of principles, like "be honest, helpful, harmless".

Step 2: A teacher-bot uses that list to train a student-bot! Every time a student bot gives a response, the teacher gives feedback based on the list: "Is this response honest?", "Is this response helpful?", etc.

That's how you can get the millions of training data-points needed, from a small human-made list!

Anthropic is the pioneer behind this technique, and they're already using it successfully for their chatbot, Claude. Their first constitution was inspired by many sources, including the UN Declaration of Human Rights.[60] Too elitist, not democratic enough? Well, next they crowdsourced suggestions to improve their constitution, which led them to add "be supportive/sensitive to folks with disabilities" and "be balanced & steelman all sides in arguments"![61]

(Anthropic's also used a variant of Constitutional AI to train Claude's "character", to embody virtues like curiosity, open-mindedness, and thoughtfulness. This is very similar to the virtue ethics approach: instead of trying to robotically follow moral rules, internalize good character traits!)

This is the most straightforward way to put humanity's wide range of values into a bot. (and actually deployed in a major LLM product!)

🏛️ Moral Parliament:

This idea combines the Uncertainty & Diversity from the previous sections. Moral Parliament proposes using a "parliament" of moral theories, with more seats for theories you're more confident in. (For example: my parliament may give Capability Approach 50 seats, Eudaimonistic Utilitarianism 30 seats, other misc theories get 20 seats.) The Moral Parliament then votes Yay or Nay on possible actions. The action with the most votes, wins.

(This proposal is pretty similar to Constitutional AI, except instead of adjectives like "honest" & "helpful", the voters are entire moral theories. And instead of an equal vote, you can put more weight on some theories vs others.)

As we learnt earlier, adding Diversity is a good way to increase Robustness. Because every moral theory has some weird edge case where it fails, having a diverse moral parliament prevents a single point of failure. (example:[62])

The above paper was meant for Humans, but could be implemented in AIs, too.

🍺 Use AIs to distill & amplify human values:

Researchers at Google DeepMind have found that a fine-tuned LLM can find create consensus among humans with diverse values. Better yet, these AI-aided consensus ideas were preferred over the humans' opinions themselves. (though, note: the humans weren't necessarily writing for consensus.)

Don't want to rely on fragile LLM technology specifically? Here's a proposal to do "AI as the engine, humans as the steering wheel" in general, for any AI tech:

💖 Learning from diverse sources of human values:[63]

Give an AI our stories, our fables, philosophical tracts, religious texts, government constitutions, non-profit mission statements, anthropological records, all of it... then let good ol' fashioned machine learning extract out our most robust, universal human values.

(But every human culture has greed, murder, etc. Might this not lock us into the worst parts of our nature? See next proposal...)

🌟 Coherent Extrapolated Volition (CEV):[64]

Volition means "what we wish for".

Extrapolated Volition means "what we would wish for, if we were the kind of people we wished we were (wiser, kinder, grew up together further)".

Coherent Extrapolated Volition means the wishes that, say, 95+% of us would agree on after hundreds of rounds of reflection & discussion. For example: I don't expect every wise person to converge on liking the same foods/music, but I would expect every wise person to at least converge on "don't murder innocents for fun". Therefore: CEV gives us freedom on tastes/aesthetics, but not "ethics".

CEV is different from the above proposals, because it does not propose any specific ethical rules to follow. Instead, it proposes a process to improve our ethics. (Reminder: this is called "indirect normativity"[65]) This is similar to the strength of "the scientific method"[66]: it does not propose specific things to believe, but proposes a specific process to follow.

I like CEV, because it basically describes the best-case scenario for humanity without AI — a world where everyone rigorously reflects on what is the Good — and then sets that as the bare minimum for an advanced AI. So, an advanced aligned AI that follows CEV may not be perfect, but at worst it'd be us at our best.

One problem with CEV is that "simulate 8+ billion people debating for 100 years" is impossible in practice — but — we can approximate it! For example: use a few hundred LLMs that have been trained & validated to represent a demographic, then have them debate.[67] This is similar to the way democracies have representatives & activists that debate/vote on behalf of the people they represent. (And yes, LLMs can accurately simulate individuals' beliefs & personalities![68])

However, a more fundamental problem with CEV is this: people will be horrified by a wiser version of themselves. Imagine if we had powerful AI in the year 1800, that implemented CEV, and accurately simulated our moral development up to the year 2025. A relatively wise person in 1800 could have seen that enslaving Black people was wrong. But even the top-5% wisest people back then would have been horrified by the idea of a Black person being President, or marrying a white woman. By symmetry, there's very likely things now that we're irrationally disgusted by, that a much wiser version of us would be okay with.

But if a powerful AGI just plopped down and said, "Hey here's a horrifying thing I'm going to do, the much wiser version of you would agree to this", there's no way to tell in advance if it's true, or something went horribly wrong.

Hence, the next idea fixes this problem:

🌀 Coherent Blended Volition:

Instead of asking a paternalistic, Nanny AI to simulate us reflecting & improving upon our beliefs and values… well, why don't we reflect & improve upon our beliefs and values?

"Because humans suck at doing that. See: all of History." Okay, fair enough. But what if we used all the tools at our disposal — not just helper AIs, but also discussion platforms, data analysis, etc — to help us talk better? To find solutions that aren't mere average-centrism, but actually combines the best parts of our diverse worldviews & values?[69]

Sounds nice in theory. Does it work in practice? So far: yes! Taiwan, directly inspired by Coherent Blended Volition, used digital tools to collect & blend the perspectives of citizens & industry, to create actual policy. (The specific issue was Uber entering Taiwan.) Inspired by the success of Taiwan's digital tools, Twitter (now 𝕏) used a similar algorithm to design Birdwatch (now Community Notes), which as far as I can tell, is still the only fact-checking service to be rated net-helpful across the U.S. political spectrum. No easy feat, in these polarized times!

This way, instead of asking a powerful AI to simulate us getting wiser, AI can help us actually get wiser. (by being a Socratic questioner, discussion facilitator, fact-checker, etc.) This way, we'll be able to accept wiser ideas & actions, even if the previous less-wise versions of us would've rejected them.

We make our tools better, so our tools help us be better, repeat. We'll grow alongside the AIs.

. . .

Maybe AI will never solve ethics. Maybe humans will never solve ethics. If so, then I think we can only do our best: remain humble & curious about what the right thing is to do, learn broadly, and self-reflect in a rigorous, brutally-honest way.

That's the best we fleshy humans can do, so let's at least make that the lower bound for AI.

🤔 Review #7


AI Governance: the Human Alignment Problem

Comic of a dumb user, with computer saying 'ERROR ID-10-T: problem exists between chair and keyboard'

The saddest apocalypse: we solve the game theory problems of AI Logic, we solve the deep-learning problems of AI "Intuition", we even solve moral philosophy…

We know all the solutions, and then… people are just too greedy or lazy to use it. Then we perish.

Point is, all our work on the AI alignment problem means nothing if we can't solve the human alignment problem: how do we get fallible fleshy humans to actually coordinate on safe, humane AI?

This question is called Socio-technical Alignment, or AI Governance.

. . .

Here's the Alignment vs Capabilities chart, again:

Graph of Alignment to Humane Values vs Capabilities. A rocket is blasting towards the right. Above a dotted diagonal line, Alignment > Capabilities, we're heading to a good place. Below that dotted line, Alignment < Capabilities, we're headed to a bad place

The goal: keep our rocket above the "safe" line.

Thus, a 2-part strategy for AI Governance:

  1. Verify where we are, our direction & velocity.
  2. Use sticks & carrots to stay above the "safe" line.

(note that "governance" can also include bottom-up approaches, not just top-down! in case you were — understandably — worried about "AI Governance" being a Trojan Horse for a world government.)

In more detail:

1) Verify where we are, our direction & velocity:

2) Use sticks & carrots to stay above the "safe" line.

The 6-pack of digital democracy: broad listening, credible commitments, easy-to-inspect, easy-to-correct, win-win bridging solutions, and as local as possible

(if i may toot my own horn a bit, i'm doing the illustrations for Audrey Tang's upcoming book, 6pack.care! all comics will be dedicated to the public domain.)

Another thought: while "sticks" (fines, penalties) are necessary, I think it's neglected how we can use "carrots" (market incentives) to redirect industry. As Audrey Tang — Digital Minister of Taiwan, and co-author of 6Pack.care — once explained it: if you successfully advocated for securing cockpit doors before 9/11, or better biosecurity before Covid-19, your reward would be... "nothing happens". Nobody cares about the bomb that didn't go off. [81] So, if you want your AI Alignment & Governance ideas to be actually implemented in real life, not just in academic papers, you need them to "pay dividends" in the short term.

Many of the AI Safety solutions listed above can have, or already have had, profitable market spinoffs. (examples: [82])

But in the next final two sections, "Alternatives to AGI" and "Cyborgism", we'll see ideas that empower regular fleshy humans (rather than a few companies/governments), and are short-term market-incentive-compatible, and long-term Good.

: bonus - what about the economics of AI? basic income, human subsidies, re-skilling?

. . .

A note of pessimism, followed by cautious optimism.

Consider the last few decades in politics. Covid-19, the fertility crisis, the opioid crisis, global warming, more war? "Humans coordinating to deal with global threats" is... not a thing we seem to be good at.

But we used to be good at it! We eradicated smallpox[83], it's no longer the case that half of kids died before age 15[84], the ozone layer is actually healing! [85]

Humans have solved the "human alignment problem" before.

Let's get our groove back, and align ourselves on aligning AI.

:x Economics AI

Not really related to "AI Governance", but is is a major sociopolitical question, that the world's governments should care about:

Will an AI take my job?

Actually, it's worse than that — even if you're in an AI-proof job, AI can still make your wages plummet. Why? Because the humans who lose their job to AI, will flood your field, and market competition will pull your wage downwards.

But in the past, hasn't automation always been good on average, creating more jobs than it destroys?

  1. Well, re: "average", the average person has one testicle. Point is: "average" ignores the variance. If the bottom 99% of people lose \$10,000, so that the top 1% gain \$1,000,000, that's a win "on average". I exaggerate, but since Covid, we may already be in a K-shaped economy: one line goes up for the rich, one line goes down for the rest of us. (And even if you're a utilitarian who only cares about total utility, remember utility is logarithmic to wealth. So if Alice gains \$10,000 & Bob loses \$10,000, Alice gains less utility than Bob loses: total wealth is constant, but total utility drops. You wouldn't be neutral about a coin flip where you gain/lose \$10,000, would you?)
  2. Yes, automation does create jobs at first, but at advanced levels, starts taking them away. A good example is automatic telling machines (ATMs) at banks: at first, their introduction actually correlated with an increase in human tellers being hired, because while each branch needed fewer human tellers, ATMs made it cheaper to open many more branches, so that total humans hired actually increased. In the short run. Now, with online banking, and with the market for bank branches already saturated, human tellers hired is dropping again. (see this article)
  3. This time really is different. In the past, automation "only" took out one small sector at a time (textile workers, horse drivers, elevator operators, etc). But this time, with deep learning, AIs can automate a huge percentage of the human economy in one go: a) self-driving cars vs truckers & taxicabs, b) LLMs vs programmers, lawyers & writers, c) speech-to-text-and-back vs call centers, secretaries, & audiobooks/podcasters, d) image and video models vs artists and creatives, e) so on and so on. And if you're in an AI-safe job, like plumber? Good for you! But your wages will drop from everyone else scrambling to re-train to AI-safe jobs, like plumber.

(It's hard to tell how badly AI is affecting the job market right now. The hiring rates for artists, writers, and even newly-graduated programmers — once thought of a golden ticket to a 6-figure salary! — has been dropping since generative AI came out in late 2022. But that also coincides with a post-Covid recession, so it's hard to tell how much of the unemployment is due to AI specifically, vs the sucky economy in general.)

= = =

A popular proposal is a Universal Basic Income (UBI), funded by taxing AI. After all, modern AI was trained on data created by the public's hard work — shouldn't part of the gains go back to the public, a "citizen's dividend, similar to Alaska's sovereign wealth fund, where profit from their natural resources goes back to their citizens?

A lot of skepticism about UBI comes from it promoted by tech CEOs like Sam Altman, who are, uh, "not completely candid". That said, just because someone's promoting an idea cynically to quell dissent, doesn't mean it's not a good idea. Tech CEOs could be lying about promising UBI, because UBI would be great. To quote GLaDOS: "Killing you, and giving you good advice, aren't mutually exclusive."

Another common critique of UBI is that people don't just need money, they need meaning, and people get meaning from their jobs. ("Love and Work", Freud supposedly said.) Now: writing as someone who finds meaning in her job: f@#k off. I keep hearing the "work is good coz it gives meaning" line from think tanks & pundits & the Credentialed Class, and, yeah, easy for you to say, you have a nice job. I have a nice job. We're not working the graveyard shift at 7/11, or slowly getting lung cancer from the industrial fumes. But, oh, if we're all living off UBI, how will we get meaning if we can't sell our labour on a market? Oh I dunno, how about spending life with family? friends? lovers? while getting your sense of mastery & challenge from sports, art, math, play? And if you love your job making art or writing essays, you know you can just keep doing that even with UBI, right?

Sorry I'm pissy about this, I just do not respect Status Quo Stockholm Syndrome.

However, there is one strong critique of UBI that I acknowledge. Despite all the exciting pilot programs in developing nations, UBI in developed nations like America mostly doesn't help. (see the OpenResearch study by Vivalt et al 2025, lit review by Kelsey Piper)

For example, see the BIG:LEAP report, a UBI experiment in Los Angeles. See Tables 4, 5, and 9: receipients' financial well-being, food security, and perceived stress were all barely any better than controls, after 18 months of getting \$1000/month.

And it's not because "well, Americans are already relatively rich compared to Africans, so extra cash doesn't do much" – one RCT gave \$1000/month to homeless people in Denver, and after ten months, they weren't any more likely to have stable housing than the controls getting \$50/month. (See Figures 1 & 2 of their research page, which tries to lie-via-misleading-truth, and play this off as a success?!?!)

So if UBI doesn't even help literally homeless people, how would it help people who lost their jobs to AI?

Also, in these major UBI trials, recipients worked a bit less. Which would be good if the extra leisure time translated to better mental health, or better child health. It didn't.

(To be fair, at least UBI in America doesn't hurt. The recipients mostly spent the extra money on nicer things for their kids & loved ones, and did not spend it on drugs or gambling. And UBI did reduce intimate partner abuse, and make people happier… for about one year, before returning to baseline. Which isn't nothing! Not being beat by your spouse for a year matters!)

So… what gives? Why does UBI work in developing nations, but not America?

I don't know. Maybe money is a stronger limiting factor in developing nations, but non-money stuff (personal habits, cultural norms, social ties, systemic issues) are a stronger constraint in America. This is a very ad-hoc explanation.

= = =

I know the following sounds very "Ah. Well, nevertheless," but nevertheless, I do (tentatively) support UBI for AI-powered job displacement. The above evidence shows that UBI won't help with poverty, at least in America, but:

  1. We can consider UBI a deserved citizen's dividend, like Alaska's get-$1000-a-year natural resource fund. After all, it was (involuntarily) trained off your data, shouldn't you at least get some of the profit?
  2. We can consider UBI as "insurance", because it's very unpredictable when you'd lose your job to a robot. Seriously, who 20 years ago would've thought we'd automate art and poetry before automating "clean my house"?
  3. UBI still seems like the smoothest, non-violent way to transition from today's market economy, to a post-scarcity Star Trek utopia, where food & shelter & all the essentials become "too cheap to meter".

= = =

An alternative to UBI for poverty reduction, is the Earned Income Tax Credit (EITC), also known as a negative income tax. The idea is that the company pays you whatever's the market wage, but it's topped up to a livable wage through tax credits.

Politically, it's popular among conservatives because unlike UBI, the EITC provides pro-work incentives. (plus, "negative tax" and "tax rebate" feels better than "welfare check". In the latter, you're getting a handout. But in the former, you're getting your money back. Feels better! Even though it's frikkin' identical. It's all about the marketing, baby.)

It's also popular among economists (who are the most politically diverse field in social science). According to the most recent survey of economists, 90% support expanding the EITC. (see Table 1)

But more relevant to AI economics, the EITC is a human subsidy. You're subsidizing companies to hire local humans, instead of offshoring them or automating them. And in sectors where AI still has the upper economic hand, we can tax it to fund the EITC.

This can also be paired with incentivizing & investing in upskilling & retraining human workers. Help us get jobs to evade LLMs' grasp, e.g. tasks which require rule-discovery (eg research), dexterity in physical environments (eg manual trades, live performance arts), long contexts (eg longform writing), or we just want the job to be human (eg elderly care).

= = =

We're not even scratching the surface of economic policy proposals re: advanced AI. For a good overview, check out this review of options on Anthropic's blog.

My personal opinion, as of Dec 2025, open to change: for policy, I'd recommend a combined UBI + EITC + educational vouchers for reskilling. It'd be funded by taxing the productivity gains from AI, and given mostly towards high-automation-risk workers like: a) truckers, taxicab drivers (self-driving cars), b) call center services (speech & language AIs), c) retail (digital kiosks, which aren't even "AI"). And at the risk of being extremely self-serving, sure: digital arts, programming, and white-collar work are all also at high automation risk.

As for what you can do personally, well, you might want to read/listen to How not to lose your job to AI.

🤔 Review #8


Alternatives to AGI

Why don't we just not create the Torment Nexus?[86]

If creating an Artificial General Intelligence (AGI) is so risky, like sparrows stealing an owl egg to try to raise an owl who'll defend their nest & hopefully not eat them[87]...

...why don't we find ways to get the pros without the cons? A way to defend the sparrow nest without raising an owl? To un-metaphor this: why don't we find ways to use less-powerful, narrow-scope, not-fully-autonomous AIs to help us — say — cure cancer & build flourishing societies, without risking a Torment Nexus?

Well... yeah.

Yeah I endorse this one. Sure, it's obvious, but "2 + 2 = 4" is obvious, that don't make it wrong. The problem is how to actually do this in practice.

Here's some proposals, of how to get the upsides with much fewer downsides:

All of these are easier said than done, of course. And there's still other problems with a focus on "Alternatives to AGI". Social problems, and technical problems:

That said, it's good to stack extra solutions, even if imperfect!

As for that last concern, about handing over our autonomy to AIs, that's what our final Cyborgism section covers...

🤔 Review #9


Cyborgism

Re: humans & possible future advanced AI,

If we can't beat 'em, join 'em!

We could interpret that literally: brain-computer interfaces in the medium term, mind-uploading in the long term. But we don't have to wait that long. The mythos of the "cyborg" can still be helpful, right now! In fact:

YOU'RE ALREADY A CYBORG.

...if "cyborg" means any human that's augmented their body or mind with technology. For example, you're reading this. Reading & writing is a technology. (Remember: things are still technologies even if they were created before you were born.) Literacy even measurably re-wires your brain.[93] You are not a natural human: a few hundred years ago, most people couldn't read or write.

Besides literacy, there's many other everyday cyborgisms:

Stylish silhouette-drawing of people with "everyday cyborg" tools. Glitched-out caption reads: We're all already cyborgs.

Q: That's... tool use. Do you really need a sci-fi word like "cyborg" to describe tool use?

A: Yes.

Because if the question is: "how do we keep human values in the center of our systems?" Then one obvious answer is: keep the human in the center of our systems. Like that cool thing Sigourney Weaver uses in Aliens (1986).

Screenshot of Sigourney Weaver in the Power Loader. Caption: cyborgism, keeping the human in the center of our tools

Okay, enough metaphor, here's some concrete examples:

. . .

Cyborgism became popular in the AI Safety field in 2023, thanks to Nicholas Kees & janus's hit article. It's a looooong post, but here's my favourite diagram from it:

Human & GPT's comparative advantages/disadvantages, fitting together like puzzle pieces. For example, GPT are better at breadth & variance, Humans better at deep, long-term coherence.

Remember in the "Gears" section above, where I was shocked at all the things LLMs surpass Humans at, but also how fragile & terrible LLM reasoning is? Cyborgism says: this ain't a bug, it's a feature! Because if there's things that AI can do that Humans can't, and vice versa, then there's benefits to combining our talents:

3 Venn Diagrams. Early on: AI skills are strictly a subset of Human skills. Too late: Human strictly subset of AI. But in the middle, sweet spot: our skills do not fully overlap, meaning there is an opportunity for the collab of Human+AI to do better than either alone

(Examples of Stage 1, Humans > AI: dexterity in unpredictable environments, like folding laundry or home repairs. Example of Stage 3, AI > Humans: arithmetic, maybe chess. But lots of tasks are still in Stage 2, Human+AI > Human or AI: code, forecasting, AI Safety research itself! We're in that sweet spot where Cyborgism can still pay off huge dividends right now.)

(: bonus - an incomplete list of what Humans & LLMs are relatively better at)

The other point of Cyborgism for AI Safety is this: useful AI ≠ agentic AI. (where "agentic" means it pursues goals intelligently) Sure, giving AI its own autonomy is one way to make it more useful (or at least, convenient), but it carries huge risks (from it "going rogue", to a gradual-disempowerment Slopworld.) Cyborgism shows us a way to get useful AI without giving up our autonomy.

Keep the human in the center! Fight for the users!

. . .

Caveats & warnings:

Then again...

Close-up of Sigourney Weaver

That's pretty cool.

Remember: the goal isn't necessarily to "have AGI" — it's to find cures for diseases, help us grow wiser, make sure humanity (and/or other sentient beings) all flourish as much as we can. If autonomous AGI is the only way to do it, sure, go for that. But I suspect — at least hope — that the "cyborg" approach is another way:

For us to grow alongside our creations.

:x Human LLM Advantages

A more detailed list of what Humans & LLMs are better at. (for now, though see the above Gears section for why I suspect these differences are fundamental to LLMs)

💪 Humans are better at:

🦾 LLMs are better at:

Hopefully that gives you (and myself) ideas for how to design tools that combine Human & AI skills, while keeping Human autonomy at the center!


🤔 Review #10 (last one!)


🎉 RECAP #3


In Sum:

Here's THE PROBLEM™️, broken down, with all the proposed solutions! ( Click to see in full resolution! )

SUMMARY OF THIS WHOLE ARTICLE, AAAAA

And here's Recap 1, Recap 2, and Recap 3.

(Again, if you want to actually remember all this long-term, and not just be stuck with vague vibes two weeks from now, click the Table of Contents icon in the right sidebar, then click the "🤔 Review" links for the flashcards. Alternatively, download the Anki deck for Part Three.)

. . .

(EXTREMELY LONG INHALE)

(10 second pause)

(EXTREMELY LONG EXHALE)

. . .

Aaaaand I'm done.

Around 80,000 words later (about the length of a novel), and nearly a hundred illustrations, that's... it. Over three years in the making, that's the end of my whirlwind tour guide. You now understand the core ideas from the wide world of AI & AI Safety.

🎉 Pat yourself on the back! (But mostly pat my back. (I'm so tired.))

Sure, the field of AI Safety is moving so fast, Part One & Two started to become obsolete before Part Three came out, and doubtless Part Three: The Proposed Solutions will feel naïve or obvious in a couple years.

But hey, the real AI Safety was all the friends we made along the way.

Um. I need a better way to wrap up this series.

Okay, click this for a really cool CINEMATIC CONCLUSION:

.

.

.

.

.

.

.

.

.

wait what are you doing? scroll back up 👆👆👆, the cool ending's in the button up there.

come on, it's just boring footnotes below.

.

.

.

.

.

.

.

.

.

.

ugh, fine:


  1. The protocol goes something like this:

    • Human trains very-weak Robot_1. Robot_1 is now trusted.
    • Human, with Robot_1's help, trains a slightly stronger Robot_2. Robot_2 is now trusted.
    • Human, with Robot_2's help, trains a stronger Robot_3. Robot_3 is now trusted.
    • (...)
    • Human, with Robot_N's help, trains a stronger Robot_N+1. Robot_N+1 is now trusted.

    This way, the Human is always directly training the inner "goals/desires" of the most advanced AI, using only trusted AIs to help them. ↩︎

  2. Well, maybe. The paper acknowledges many limitations, such as: what if instead of the AIs learning to be logical debaters, they become psychological debaters, exploiting our psychological biases? ↩︎

  3. inb4 some nerdy nitpicking: yeah yeah perfect prediction is impossible because there's the Halting Problem and chaotic systems bla blah. Look, those are cool, but don't they don't matter to the following segment. Just mentally note that I mean: "this optimal-capabilities AI can predict stuff as well as is theoretically possible". ↩︎

  4. Quote from Yudkowsky's Coherent Extrapolated Volition (CEV) paper, which is a similar idea except applied to humanity as a whole; we'll look at CEV more in a later section. ↩︎

  5. Source is page 26 of his book Human Compatible. You can read his shorter paper describing the Future Lives approach here. ↩︎

  6. See: therapy, subconscious desires, non-self-aware people, etc ↩︎

  7. Hadfield-Menell et al 2016 ↩︎

  8. Christiano et al 2017 ↩︎

  9. Shah et al 2019 and Chan et al 2021 ↩︎

  10. Very tangential to this AI Safety piece, but I highly recommend Fluent Forever by Gabriel Wyner, which will then teach you Anki flashcards, the phonetic alphabet, ear training, and other great resources for learning any language. ↩︎

  11. From Baker et al 2020: “Overall, we found that the AI system is able to provide patients with triage and diagnostic information with a level of clinical accuracy and safety comparable to that of human doctors.” From Shen et al 2019: “The results showed that the performance of AI was at par with that of clinicians and exceeded that of clinicians with less experience.” Note these are specialized AIs, not out-the-box LLMs like ChatGPT. Please do not use ChatGPT for medical advice. ↩︎

  12. See Figure 16 of Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs ↩︎

  13. In 2023, Jessica Rumbelow & Matthew Watkins found a bunch of words, like "SolidGoldMagikarp" and " petertodd", which reliably caused GPT-3 to glitch out and produce nonsensical output. "SolidGoldMagikarp" also became the name of an entity in Ari Aster's horror-satire film Eddington (2025). First time a LessWrong post snuck its way into a major motion picture, afaik. ↩︎

  14. Takagi & Nishimoto 2023: Used fMRI scans and Stable Diffusion to reconstruct & view mental imagery(!!!) Gkintoni et al 2025: Literature review of using brain-measurement to read emotions, with some approaches' accuracy “even surpassing 90% in some studies.” ↩︎

  15. Electric current stimulates laughter (1998), and inducing an out-of-body experience (2005). ↩︎

  16. From the paper: "In robustness experiments, we confirm that grokking consistently occurs for other architectures and prime moduli (Appendix C.2). In Section 5.3 we find that grokking does not occur without regularization." (emphasis added) ↩︎

  17. Jeezus, this meme is as old as some of my colleagues. ↩︎

  18. You could use more complex "nonlinear" probes, but if you put too much info-processing power in a probe, you run the risk of measuring info-processing in that probe itself, not the original ANN. So in practice, people usually just use a 1-layer "linear" probe. ↩︎

  19. Ok in real life, any inserted thermometer has to modify the original thing's temperature, because the themometer takes/gives off heat into the thing. Look, artificial neural networks are in a simulation. We can code it to not modify the original. ↩︎

  20. (Math warning) A linear classifier is just $y = \text{sigmoid}( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...)$, which is the exact same formula for neurons in a traditional ANN. So, you could (I could) sneakily call a linear classifier a "one-layer neural network". You could use more complex "nonlinear" probes, but then you run the risk of measuring info-processing in the probe, not the original ANN. (But you can catch this problem by running "placebo tests" on the probe. If it's still able to classify stuff even if you feed it fake, random input, then the probe's so complex, it can just memorize examples.) ↩︎

  21. (Human example: Alyx dislikes Beau for some other reason. Beau eats crackers. Alyx, even though they wouldn't be offended in any other context, thinks: "Beau is chewing those disgusting salt squares so loudly, intentionally trying to tick me off." Alyx's conscious chain of thought (annoying eating → dislike for Beau) is the opposite of their true subconscious (dislike for Beau → finds eating annoying). Anyway — as above, so below — LLMs also out-loud rationalize their inner biases. ↩︎

  22. Robust Physical-World Attacks on Deep Learning Visual Classification (2018) ↩︎

  23. A small number of samples can poison LLMs of any size: “we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.” ↩︎

  24. Universal Jailbreak Backdoors from Poisoned Human Feedback: “the backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt.” ↩︎

  25. Wang & Gleave et al 2022: “Our [adversarial AIs] do not win by learning to play Go better than KataGo [a superhuman Go AI] — in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes.” ↩︎

  26. See the Impact Regularizer section of Amodei & Olah 2016 for details ↩︎

  27. Hubinger 2022 ↩︎

  28. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning by Gal & Cambridge 2016 ↩︎

  29. Fun anecdote: I first learnt about Generative Adversarial Networks a decade ago in a Google Meet call (then Google Hangouts). I raised my hand to ask, "wait, we're generating images by… training one intelligence to deceive another intelligence?" And we all laughed. Ohhhhhh how we all laughed. Ha ha. Ha ha ha. Ha ha HA ha ha. Haaaaaaaaaaa. ↩︎

  30. For example: an attacker LLM tries to create sneaky prompts that "jailbreak" a defender LLM into turning evil. In traditional Adversarial Training, the defender may only learn to protect against specific jailbreaks. In Relaxed/Latent Adversarial Training, the defender LLM may learn more general protective lessons, like, "be suspicious of weirdly-formatted instructions".

    Here's an empirical paper showing a proof-of-concept for relaxed/latent adversarial training: “In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without leveraging knowledge of what they are or using inputs that elicit them. [...] Specifically, we use LAT to remove backdoors and defend against held-out classes of adversarial attacks.” (emphasis added) ↩︎

  31. Red-teaming has been a core pillar of national/physical/cyber security since the 1960s! Yes, Cold War times. "Red" for Soviet, I guess?? (Factcheck edit: actually, the "red = attacker, blue = defender" convention pre-dates the Cold War! The earliest known instance is from an early-1800s Prussian war-simulation game, Kriegsspiel.) ↩︎

  32. Sagawa & Koh 2020 ↩︎

  33. Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad (2025) ↩︎

  34. Large Language Models Pass the Turing Test. “When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time” (vs 50% chance). Though to be fair, it was "only" a 5-minute long Turing Test, and the reason GPT-4.5 fooled humans so often was less because it was good, and more because Human Judges picked ineffective bot-detection strategies (see Figure 4), e.g. asking about daily activities & opinions, instead of strange questions or jailbreaks. ↩︎

  35. AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably. “participants performed below chance levels in identifying AI-generated poems [...] more likely to judge AI-generated poems as human-authored than actual human-authored poems [...] AI-generated poems were rated more favorably.”

    The human poets included the greats like Shakespeare, Whitman & Plath. The AI was ChatGPT 3.5 with no cherry-picking; they just picked the first few poems generated "in the style of X".

    We are so cooked. ↩︎

  36. Hatch et al 2025: “a) participants could rarely tell the difference between responses written by ChatGPT and responses written by a therapist, b) the responses written by ChatGPT were generally rated higher in key psychotherapy principles”

    Okay but REAL therapists can tell the difference, you may protest. Well. From Human-Human vs Human-AI Therapy: An Empirical Study: “Therapists were accurate only 53.9% of the time, no better than chance, and rated the human-AI transcripts as higher quality on average.”

    We are so, so cooked. ↩︎ ↩︎

  37. Review of the contest, by Ozy Brennan: AIs can beat human fiction writers because humans are bad at writing fiction: piss-on-the-poor reading comprehension. Like the LLM-beats-Turing-Test study, this result (while impressive), is less "AI is really good" and more "humans are really bad". And these human writers weren't novices either, they're professionals who've published books.

    We are so, so, so cooked. ↩︎

  38. Project Vend: Can Claude run a small shop? Answer: no. Claude started stocking the food vending machine with tungsten cubes, and hallucinated that it was a flesh-and-blood delivery person who went the Simpsons' house. Vending Bench 2 measures how well all the frontier LLMs can do on this "run a vending machine for a year" task. As of writing (Dec 2025), they're at least making positive amounts of money now, but a) the simulated "buyers" aren't trying to jailbreak the LLM, the way real human buyers did in Project Vend, and b) the best AI in Vending Bench 2 is still making, like, 8% of the "good" baseline. ↩︎ ↩︎

  39. So how well is Claude playing Pokémon? “TL;DR: Pretty badly. Worse than a 6-year-old would.” Explanation why: “Basically, while Claude is pretty good at short-term reasoning (ex. Pokémon battles), he's bad at executive function and has a poor memory. This is despite a great deal of scaffolding, including a knowledge base, a critic Claude that helps it maintain its knowledge base, and a variety of tools to help it interact with the game more easily.” ↩︎

  40. [Embers of Autoregression (2024)] shows how GPT-4 could do the common ROT-13 cipher, but utterly fail at the uncommon-but-trivially-similar ROT-12 cipher. As the Supplementary Info shows (Figure S13), this holds even with "chain of thought" reasoning turned on. Though to be fair, this paper was from 2024, and I tried ROT-12 and ROT-11 Claude Sonnet 4.5 just now, and it did fine. Though as we'll see later in this section, Claude & LLMs chains-of-thought are still very fragile. ↩︎

  41. ARC-AGI is a bunch of pixel puzzle games. Instead of most games where you're told the rules up front, in ARC-AGI, you have to learn the hidden rules via exploration, then prove you understand it by applying the rules to win the game.

    See the Leaderboard Breakdown: The Ten-Human Panel's performance on ARC-AGI-1 and -2 is 98% and 100%, costing \$17/task (~10 workers hired per task on Mechanical Turk, success if at least one succeeds).

    The best AI (Gemini 3 Deep Think) on ARC-AGI-1 and -2 is 87.5% (a bit worse) and 45.1% (much worse) costing \$77/task (almost five times more expensive than hiring ten humans on MTurk) ↩︎

  42. Photo by User:Evanherk ↩︎

  43. Actually, if you're interested in the developmental psychology of kids playing Tower of Hanoi, here's the landmark paper by Byrnes & Spitz 1979. See Figure 1: around age 8, kids have near-perfect performance on the 2-stack version. Around age 14, kids plateau out at okay performance on the 3-stack. Unfortunately I couldn't find any paper that gives performance scores for 4+ disks in general-population children/adults. Sorry. ↩︎

  44. Created by User:Trixx ↩︎

  45. The solution to Level N involves doing the solution to the previous level twice (moving the N-1 stack from Peg A to Peg B, then Peg B to Peg C), plus one extra move (moving the biggest disk from Peg A to Peg C).

    So how many moves do you need to solve N disks? Let's crunch the numbers:

    For 1-disk, 1 move

    For 2-disk, 1 * 2 + 1 = 3 moves (previous solution twice + one extra move)

    For 3-disk, 3 * 2 + 1 = 7 moves

    For 4-disk, 7 * 2 + 1 = 15 moves

    For 5-disk, 15 * 2 + 1 = 31 moves

    For 6-disk, 31 * 2 + 1 = 63 moves

    For 7-disk, 63 * 2 + 1 = 127 moves

    For 8-disk, 127 * 2 + 1 = 255 moves

    Do you see the pattern?

    For N disks, you need 2^N - 1 moves!

    (Proof by induction: Let F(N) = # of moves needed for N disks. Let's say we already know F(N) = 2^N - 1. Then doubling that & adding one, we get (2^N - 1)*2 + 1 = 2^(N+1) - 2 + 1 = 2^(N+1) - 1, which is the F(N+1) we want! All's left to do is prove the base case: F(1) = 2^1 - 1 = 2 - 1 = 1. And indeed, yes it takes 1 move to solve the 1-stack. Q.E.D.)

    Yippee! This was really not worth writing the footnote for, but there you go. ↩︎

  46. Famous phrase coined by Emily Bender et al 2021 ↩︎

  47. As Judea Pearl — winner of the Turing Prize, the "Nobel Prize of Computer Science" — once told Quanta Magazine in their article, To Build Truly Intelligent Machines, Teach Them Cause and Effect: “As much as I look into what’s being done with deep learning, I see they’re all stuck there on the level of associations. Curve fitting. [...] no matter how skillfully you manipulate the data and what you read into the data when you manipulate it, it’s still a curve-fitting exercise, albeit complex and nontrivial.” ↩︎

  48. Beloved "thinking in gears" metaphor comes from Valentine (2017) ↩︎

  49. Inductive reasoning is stuff like "the Sun has risen every day for the last 10,000 days I've been alive, therefore the Sun will almost definitely rise again tomorrow". It's not technically a logical deduction — it's logically possible the Sun could vanish tomorrow — but, statistically, yeah it's probably fine. ↩︎

  50. The “dual-process” model of cognition was first suggested by (Wason & Evans, 1974), developed by multiple folks over decades, andreally popular in Daniel Kahneman's bestselling 2011 book, Thinking, Fast & Slow. As for the naming: Intuition is #1 and Logic is #2, because pattern-recognition evolved before human-style deliberative logic. ↩︎

  51. Abductive reasoning is the backbone of science. It's when you generate "the most likely hypothesis" to novel data. It's not deduction; it's very logically possible your hypothesis could be false. And it's not induction; you haven't seen any direct repeated evidence for your hypothesis yet, not until you test it.

    (Example: Michelson-Morley discovered that the speed of light seemed to be constant in every direction, and from this, Einstein abducted that time is relative! Yes, "time is relative" was the most likely hypotheis he had for "light speed seems to be constant", and hey, he was right.)

    (Nitpick: Einstein claimed he didn't know about the Michelson-Morley experiment, and he instead abducted relativity from the Maxwell Equations, but look, I'm trying to tell a simple story here ok?) ↩︎

  52. "Hyperpolation" coined by Toby Ord 2024. ↩︎

  53. Maybe this can be solved by writing a normal program, that then writes you "synthetic data" for a business operating successfully over 365 days, that you can fine-tune an LLM on? I dunno, this still feels like the special pleading of "yes I can add 7 digits & bomb at 8, but if you give me more training I can do it!" If you pass 7 but bomb 8, there's something fundamentally wrong that "more scale" don't fix. ↩︎

  54. METR (2025). “[software engineering] tasks that AI models can complete with 50% success rate [are] doubling approximately every seven months since 2019.” And Figure 13 shows that the cost of completing these tasks remains pretty stable even as task length grows, around 1% of a "Level 4" Google salary (\$143.61/hour, so, \$1.43/hour for an AI.) However, it's important to note the coding tasks in the benchmark are, by design, common tasks. ↩︎

  55. From the ARC-AGI frontpage. It's missing the newest Gemini, but it doesn't push the frontier much, and the frontier still bends downwards. ↩︎

  56. From Sutton's famous 2019 essay The Bitter Lesson: “One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great.” But in a 2025 podcast with Dwarkesh Patel, he believes LLMs are "a dead end", because they imitate without building "robust world models". ↩︎

  57. In 2020, niplav found that Claude Sonnet 3.5 could reproduce a common benchmark's "canary string". In 2024, Jozdien found that GPT-4 (base model) can too. The reason this is bad is because it's trivially easy to filter out all documents with the canary string — just search if it's in a document, then leave out the document if it is. The canary string itself isn't a cheat, but it hints at a high probability that the training data was contaminated with the benchmark's answers, making LLMs' performance on those benchmarks untrustworthy. ↩︎

  58. From Hackenburg & Margetts 2024: “Early research suggests that most capable LLMs can directly persuade humans on political issues (5), draft more persuasive public communications than actual government agencies and political communication experts (6, 7), and generate convincing disinformation and fake news (1, 8, 9). Across these domains, humans are no longer able to consistently distinguish human and LLM-generated texts (8, 10).” ↩︎

  59. Cat Burning on Wikipedia. Reeeaaally glad there's no photos on that page. (The Wikipedia Talk page notes there's academic debate about how widespread this specific act really was, but there's plenty of examples in History of "normal people" being awful.) ↩︎

  60. Constitutional AI: Harmlessness from AI Feedback (2022) ↩︎

  61. Collective Constitutional AI: Aligning a Language Model with Public Input (2023) ↩︎

  62. Deontology says you should be honest with a Nazi who wants to find hidden Jews, Utilitarianism says you should torture one person to prevent 10100 people getting dust in their eyes, and Rational Egoism would be okay with you walking past an easily-savable drowning child.

    But I struggle to think of any scenario that fails at all three moral theories: an action that obeys widely-accepted rules & duties, improves others' well-being, and your own well-being, and yet is still "wrong". Point is: the ensemble is more robust than any individual theory alone.

    (What about Virtue Ethics? Is there a scenario where "be wise" fails? Well, no, but that's because virtue ethics is extremely vague & kinda useless for guidance. The big names in virtue ethics, Aristotle & Aquinas, supported slavery. You can promote/demote any action you want be using the sloppy labels of "wise"/"unwise".) ↩︎

  63. Using Stories to Teach Human Values to Artificial Agents by Riedl & Harrison 2016. ↩︎

  64. Eliezer Yudkowsky 2004 ↩︎

  65. "Normativity" ~= morality, "Indirect" = well, not direct. See this glossary entry. ↩︎

  66. Okay there's technically no such thing as "the" scientific method, every field does something slightly different, and methods evolve, but you know what I mean. There's a "family resemblance". The general process of "notice stuff, guess why it's happening, test your guesses, repeat". ↩︎

  67. Jan Leike 2023: A proposal for importing society’s values: Building towards Coherent Extrapolated Volition with language models. ↩︎

  68. Park et al 2024: "We [simulate] the attitudes and behaviors of 1,052 real individuals [...] The [AI clones] replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later, and perform comparably in predicting personality traits and outcomes in experimental replications." (emphasis added) ↩︎

  69. Coherent Blended Volition was coined in Goertzel & Pitt 2012: “The CBV of a diverse group of people should not be thought of as an average of their perspectives, but [...] incorporating the most essential elements of their divergent views into a whole that is overall compact, elegant and harmonious. [...] The core difference between [CEV and CBV] is that, in the CEV vision, the extrapolation and coherentization are to be done by a highly intelligent, highly specialized software program, whereas in [CBV], these are to be carried out by collective activity of humans as mediated by [tools for healthy, deep discussion]. Our perspective is that the definition of collective human values is probably better carried out via human collaboration, rather than delegated to a machine optimization process.” ↩︎

  70. Quantifying CBRN Risk in Frontier Models: “[Frontier LLMs] pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. [...] Our findings expose critical safety vulnerabilities: [...] Model safety performance varies dramatically from 2% (claude-opus-4) to 96% (mistral-small-latest) attack success rates; and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt engineering techniques bypass safeguards for dangerous CBRN information.” ↩︎

  71. Spiral-Bench, a benchmark for measuring LLM sycophancy & delusion amplification. This website also hosts the Emotional Intelligence Benchmark and Slop Score ↩︎

  72. Measuring AI Ability to Complete Long Tasks by METR, Model Evaluation & Threat Research ↩︎

  73. Weval is a platform to make & share your own AI Evaluations, especially on topics ignored by AI & AI Safety communities, like "which AIs are evidence-based tutors?" ↩︎

  74. Leaked SEC Whistleblower Complaint Challenges OpenAI on Illegal Non-Disclosure Agreements (2024) ↩︎

  75. Toby Ord 2025: Inference Scaling Reshapes AI Governance: “Rapid scaling of inference-at-deployment would: lower the importance of open-weight models (and of securing the weights of closed models), reduce the impact of the first human-level models, change the business model for frontier AI, reduce the need for power-intense data centres, and derail the current paradigm of AI governance via training compute thresholds.↩︎

  76. For example, the Good Judgment Project & Metaculus, both with proven track records of being some of the best in general forecasting of future events, predict a 1% chance and 3% chance of near-extinction risk from advanced AI this century. ("Oh that doesn't sound too bad," you think, yeah wait til you see the % of non-extinction and/or non-AI risks.) Also, Metaculus is forecasting a median (with wide uncertainty) of AGI by 2033, and AGI becoming as species-changing as the agrilcultural/industral revolution by 2044. ↩︎

  77. AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy – “Participants (N = 991) answered a set of six forecasting questions and had the option to consult their assigned LLM assistant throughout. Our preregistered analyses show that interacting with each of our frontier LLM assistants significantly enhances prediction accuracy by between 24% and 28% compared to the control group. Exploratory analyses showed a pronounced outlier effect in one forecasting item, without which we find that the superforecasting assistant increased accuracy by 41%↩︎ ↩︎

  78. Anthropic's Responsible Scaling Policy (2023) Inspired by the US government's Biosafety Levels (BSL), they defined ASL-1 (old chess AIs), ASL-2 (current LLMs), ASL-3 ↩︎

  79. Sandbrink et al 2022 ↩︎

  80. From Stuart Russell, the co-author of the most-popular textbook in AI: Banning Lethal Autonomous Weapons: An Education ↩︎

  81. Quote from Tenet (2020), the most "yup that's a Nolan film" Nolan film. ↩︎

  82. For example, Reinforcement Learning from Human Preferences (RHLF), a way for AIs to import human values, was what turned Base GPT (an autocomplete) into ChatGPT (an actual chatbot). Likewise, improving the robustness of AI "intuition" would help make better AI medical diagnosis, and self-driving cars. ↩︎

  83. Smallpox used to kill millions of people every year. Here’s how humans beat it. by Kelsey Piper ↩︎

  84. Source: Our World In Data ↩︎

  85. March 2025 on MIT News: Study: The ozone hole is healing, thanks to global reduction of CFCs. "New results show with high statistical confidence that ozone recovery is going strong." ↩︎

  86. classic 2021 tweet by alex blechman ↩︎

  87. Opening metaphor from Nick Bostrom's classic 2014 book, Superintelligence ↩︎

  88. Glossary entry on CAIS ↩︎

  89. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? by Bengio et al 2025 ↩︎

  90. From Chris Olah’s views on AGI safety, by Evan Hubinger: “We could use ML as a microscope—a way of learning about the world without directly taking actions in it. That is, rather than training an RL agent, you could train a predictive model on a bunch of data and use interpretability tools to inspect it and figure out what it learned, then use those insights to inform—either with a human in the loop or in some automated way—whatever actions you actually want to take in the world.” ↩︎

  91. Quantilizers: A Safer Alternative to Maximizers for Limited Optimization by Jessica Taylor 2016 ↩︎

  92. Goodhart's Law, paraphrased: "When you reward a metric, it usually gets gamed." ↩︎

  93. An anatomical signature for literacy by Carreiras et al 2009 ↩︎

  94. Citing myself here, ehehehe: How To Become A Centaur by Nicky Case (2018) Though, see next footnote, the main analogy is now obsolete. ↩︎

  95. From Gwern: “As far as I can tell, in chess, the centaur era lasted barely a decade, and would've been shorter still had anyone been seriously researching computer chess rather than disbanding research after Deep Blue or the centaur tournaments kept running instead of stopping a while ago.” In 2017: “A computer engine (Zor) ended first in the freestyle Ultimate Challenge tournament (2017), while the first centaur (Thomas A. Anderson, Germany) ranked in 3rd place”. Though as of 2023, Human+AI can still beat pure AI in Go, but mostly due to exploiting the Robustness issues in modern deep-learning. Whether or not you count that as "cheating" is up to you & your sense of aesthetics. ↩︎

  96. I made a python script the lets you scribble with Stable Diffusion in realtime ↩︎

: See all feetnotes 👣

Also, expandable "Nutshells" that make good standalone bits:

: More about the Swiss Cheese Model
: The math of robust scalable oversight
: Scalable Oversight, extra notes
: Iterated Distillation & Amplification, explained
: (deleted scene) "Critical Mass" Comic
: Fixes for the "Future Lives" Algorithm
: Informal literature review of "Self-modifying Agents"
: Pros & Cons of Worst/Average/Best-Case optimization
: Learning our values, extra notes
: Economics of AI
: Advantages of Humans vs LLMs