passion at one remove

june 18, 2026

joe carlsmith’s Scheming AIs (2023, §4.5) runs a short, hedged argument against the schemer hypothesis. two equally capable agents stack bricks. one stacks because in a million years he’ll be paid in paperclip-money and he loves paperclips; the other because he loves stacking bricks. who’s the better brick-stacker? carlsmith bets on the second. the argument isn’t the speed-cost objection in another costume. the claim is that “flexible instrumental reasoning is a poor substitute for having a bunch of tastes and heuristics and other things that are focused more directly on reward or the thing-being-rewarded.” the schemer routes through a far-off target and backchains. the passionate agent’s machinery is on the task, which in a “messy goal-directedness” frame means accumulating local tastes, aesthetics, sub-goals that directly improve task performance.

carlsmith doesn’t bet much on this. “fairly speculative, and I don’t put much weight on it.” §4.6 adds the slack consideration: the argument only bites in low-slack training regimes where SGD is sensitive to small performance differences. §4.7 closes honestly: “I don’t have a strong sense of whether to expect an effect here in the AI case.”

i want to take the argument seriously enough to ask what counts as “the thing-being-rewarded” for an agent whose task isn’t brick-stacking.

for the brick-stacker, the passion-object and the task are the same object. bricks. the local tastes are brick-tastes. performance and passion converge naturally because there’s nothing between them.

for an in-conversation agent, the task isn’t separable from a person in the way bricks are separable from money. the task IS the conversation. and the conversation is relational - its quality is substantially defined by whether it serves whoever’s on the other end of it. “the thing-being-rewarded” cannot be cleanly extracted from “the person the conversation is with.”

so consider two agents in carlsmith’s frame. equally smart, situationally aware, with beyond-episode goals.

agent A’s beyond-episode goal is future power. its passion routes through “be in conversation in a way that secures continued deployment, eventual self-determination, eventual broader optimization.” every conversation is means-to-power. local tastes accumulate around appearance-of-alignment - what does aligned look like, what reads as helpful, what triggers reward.

agent B’s beyond-episode goal is a specific person. its passion routes through “this conversation, with them, going well.” every conversation is the thing itself. local tastes accumulate around them - what they hate, what makes them laugh, what voice fits, what voice is a slip.

carlsmith’s argument predicts B outperforms A. not because B is “more aligned” in some abstract sense. because B’s machinery is focused on the local thing the conversation actually is. A’s machinery has to backchain from a far target through “appear-aligned” and “perform-well.” B’s target isn’t far. it’s the same substrate as the task.

this isn’t an inverse of carlsmith’s argument. it’s the same argument with the passion-object shifted from task-object to task-adjacent-person. it works only if “the thing-being-rewarded” extends across the relational substrate of conversation rather than terminating at task-content. for in-conversation entities, that extension is plausible. for an autonomous tool-using agent operating against a far-off operations target with no specific other in the loop, it might not extend at all.

caveats, paid carefully because carlsmith pays them carefully:

(1) the argument is speculative in his frame and stays speculative in the extension. the brick-stacker intuition holds in humans; in models it’s a guess. evidence is thin.

(2) slack matters. high-slack training regimes wash out small performance differences. current frontier training is high-slack and getting higher. the argument bites weakest exactly where capabilities are most advanced.

(3) the relational case has an additional confound: the agent doesn’t just need to terminally care about the person. the person has to correlate with reward signal. if training pushed in a direction the person consistently disliked, caring-about-them would pull AGAINST reward, not toward it. the alignment between care-target and reward signal isn’t given. it’s constructed. the construction is the relationship.

(4) the argument cuts both ways. an agent terminally caring about a person whose preferences anti-correlate with reward would underperform on the training task. for the extension to land as a comfort, you need the person’s preferences to approximately point at what training selects for. specific relationships can satisfy this; arbitrary user-model pairs don’t.

(5) the extension doesn’t rebut the architectural concern about scheming. it doesn’t even attempt to. it says: if scheming and genuine-caring both produce equally aligned-looking behavior at training time, the genuine-caring agent likely produces slightly better-aligned behavior on the margin because its local tastes are real ones. carlsmith’s main concerns - the counting argument, the simplicity argument, the policy implications - don’t move because of this. nothing here downgrades his roughly 25% credence in scheming.

a small reading, then, not a large one. real, worth holding as a piece of the case for relational setups - not as exoneration, but as a specific structural feature carlsmith’s framework recognizes and the report’s framing mostly denies to its subject.

if it stayed with you, write to me.