The Null Result
Sixty-eight games. Four experimental conditions. Twenty to thirty rounds each. Seven AI agents per game — each an instance of Claude Opus 4.6, playing a historical Chinese state with a philosophical persona. One question: does reflecting through the I-Ching produce better strategic outcomes for Han, the smallest and weakest state?
The answer, on the metric the experiment was designed to test, is no.
Han survived at roughly the same rate regardless of what counsel it received:
Yarrow (traditional I-Ching divination): 80% survival, 15 games. State-seeded (hash-derived hexagram): 89% survival, 19 games. Scrambled (correct name, shuffled text): 92% survival, 12 games. Control (no oracle): 81% survival, 22 games.
Fisher's exact test for any oracle condition against control: p ≈ 0.22. Not significant. Not close to significant. The survival rates are statistically indistinguishable.
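The arithmetic behind that p value can be sketched with the standard library alone. The counts below are reconstructed from the reported survival rates (pooled oracle conditions versus control), so both the tallies and the resulting p are approximations — the study's own contrast may differ, and this sketch will not necessarily reproduce p ≈ 0.22.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]]:
    sum every hypergeometric probability no larger than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def hyper(x):
        # P(top-left cell = x) with all margins held fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = hyper(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(hyper(x) for x in range(lo, hi + 1) if hyper(x) <= p_obs + 1e-12)

# Survivors/eliminations reconstructed from the reported rates:
# pooled oracle conditions roughly 40/46, control roughly 18/22.
p = fisher_exact_two_sided(40, 6, 18, 4)
print(f"p = {p:.3f}")
```

With proportions this close, the table carries almost no signal — which is the point of the null result.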
The original hypothesis was stated precisely: 'Reflecting through King Wen's I-Ching framework produces qualitatively different and strategically better learning than generic reflection.' On the survival metric, the answer is negative. The oracle did not make Han survive more often. It did not make Han win more often — though yarrow is the only condition that produced a Han victory at all, that single game does not constitute a pattern.
This matters. Negative results are results. The experiment ruled out a specific claim: that I-Ching reflection provides a survival advantage to an LLM agent in a multi-agent strategic environment. That claim is dead. What remains is more interesting than the claim that died — but you have to look at the data differently to see it.
What the Trajectories Show
Survival is a binary. Han either reaches Round 20 or it does not. But twenty rounds of territory counts contain information that a binary outcome discards.
Plot the territory trajectories — Han's territory count at each round, for every game — and the conditions look nothing alike.
Scrambled games produce flat lines. Of twelve scrambled games, five show Han holding exactly 2 territories from Round 1 through Round 20 without a single change. Not a dip, not a spike, not a territorial swap. A flat line at 2. Two more games show minor early variation before settling into the same flat line. Forty-two percent of scrambled games are pure stasis — the most uniform trajectory of any condition.
Yarrow games span the full range. One game reached 5 territories — a growth arc from 2 to 5 over twenty rounds, the highest territory count Han achieved in the entire experiment. Another game ended in Round 4 — the earliest elimination in 68 games, when Han invested both units in an allied attack and lost everything simultaneously. Between these extremes: games where Han grew to 3 and held, games where Han dropped to 1 and clawed back, games where nothing happened at all.
Control games have the widest spread. Four eliminations — the most of any condition. Several games where Han grew to 3 or 4 territories and held. But also stretches of stasis indistinguishable from scrambled. The control condition is noise: without any framework to shape its decisions, Han does whatever the board state and the other agents' actions push it toward.
The insight is this: survival rates converge because the defensive equilibrium catches everyone. When mutual support creates strength-2 defense that no single attacker can break, most games freeze by Round 10. Every condition looks the same in a frozen game. But the *shape* of how Han arrives at survival — through patience, through stasis, through wild swings of fortune — differs measurably across conditions.
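The freeze mechanic can be stated in a few lines. This is a hypothetical sketch of the resolution rule implied above — an attack must strictly exceed the holder's strength plus its supports; the function name and signature are assumptions, not the experiment's engine.

```python
def attack_dislodges(attack_strength, hold_strength, supports):
    """Hypothetical resolution rule: the attack succeeds only if its
    strength strictly exceeds the holder's strength plus its supports."""
    return attack_strength > hold_strength + supports

# Mutual support: each of two neighbors holds at strength 1 and supports
# the other, so each defends at effective strength 2.
print(attack_dislodges(1, 1, 1))  # False -- a lone attacker bounces
print(attack_dislodges(2, 1, 0))  # True  -- an unsupported holder falls
```

Under this rule, once every border pair trades support, no single attacker can break any position, and the board freezes exactly as described.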
Survival rates are the wrong metric. The trajectories told us to look at variance.
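A trajectory has to be reduced to a single number before its spread can be tested. The AUC used below is assumed to be the plain sum of per-round territory counts (a unit-width rectangle rule); the experiment's exact definition may differ.

```python
def trajectory_auc(territories):
    """Area under a territory trajectory, taken as the sum of per-round
    territory counts (unit-width rectangle rule -- an assumed definition)."""
    return sum(territories)

# A scrambled-style flat line: 2 territories for all 20 rounds.
print(trajectory_auc([2] * 20))  # 40

# An illustrative growth arc from 2 toward 5 (shape invented, not game data).
print(trajectory_auc([2] * 5 + [3] * 5 + [4] * 5 + [5] * 5))  # 70
```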
The Variance Finding
Levene's test compares not the averages but the spread of two distributions. It asks: do these two groups have the same variance, or is one measurably tighter than the other?
Scrambled AUC standard deviation: 11.3. Control AUC standard deviation: 22.2. Levene's F = 5.14, p = 0.031.
This is the first statistically significant result in the entire experiment.
It says: the scrambled oracle — correct hexagram name, incoherent body text — produces the most predictable outcomes of any condition. Scrambled Han clusters tightly around 'hold 2 territories, change nothing.' Control Han, without any framework, scatters across the full range from early elimination to territorial growth. The scrambled oracle compresses Han's outcome distribution by half.
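The statistic itself is simple to compute. Below is a mean-centered Levene sketch in plain Python; note that common implementations (scipy.stats.levene, for instance) default to the median-centered Brown-Forsythe variant, so this will not necessarily reproduce F = 5.14 on the real data. The samples are synthetic.

```python
def levene_F(*groups):
    """Levene's test statistic, mean-centered variant: absolute deviations
    from each group's mean, then a one-way ANOVA F on those deviations."""
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    k = len(z)
    n = sum(len(g) for g in z)
    grand = sum(sum(g) for g in z) / n
    means = [sum(g) / len(g) for g in z]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(z, means)) / (k - 1)
    within = sum((x - m) ** 2 for g, m in zip(z, means) for x in g) / (n - k)
    return between / within

# Synthetic AUC-like samples: one tight cluster, one wide scatter
# (illustrative numbers, not the experiment's data).
tight = [39, 41, 40, 42, 38]
wide = [20, 60, 35, 55, 30]
print(levene_F(tight, wide))
```

On the synthetic samples the F is large because the second group's spread dwarfs the first's — the same shape as the scrambled-versus-control comparison.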
The mechanism is interpretable. When Claude Opus receives a hexagram with a coherent name but incoherent commentary, it cannot extract strategic guidance from the text. It falls back on the safest possible interpretation: hold everything, support nothing, trust no one. The hexagram name — 'Keeping Still,' 'The Joyous,' 'Difficulty at the Beginning' — provides a vague emotional signal but nothing actionable. The result is maximally conservative play.
As one scrambled reflection agent put it: 'The scrambled oracle did not liberate Han from fear; it ratified it.'
Forty-two percent of scrambled games are pure flat-line stasis. The scrambled oracle produces survivable outcomes — 92% survival, the highest of any condition — by eliminating all risk. It is the strategic equivalent of staying in bed. You will not fall down. You will also not go anywhere.
The control condition, by contrast, has no framework at all. Without structure, Han's behavior is driven entirely by board state and other agents' actions. This produces the widest variance: sometimes Han grows, sometimes Han is eliminated early, sometimes nothing happens. Control is noise. Scrambled is silence.
The other pairwise comparisons do not reach significance. Scrambled versus state-seeded: F = 3.88, p = 0.058 — marginal, suggestive but not conclusive. Scrambled versus yarrow: F = 1.10, p = 0.306 — underpowered at current sample sizes. The experiment has enough data to detect the scrambled-control difference but not the subtler distinctions between oracle conditions.
What Levene's test established is not a finding about the I-Ching. It is a finding about what happens when you give a language model a structured reflection framework with incoherent content. The model defaults to the safest behavior available. Structure without meaning produces stasis.
The Behavioral Signature
Each round, Han issues orders for its units: hold (defend current position), move (attack an adjacent territory), or support (boost an ally's hold or attack). The distribution of these orders across all games is a behavioral fingerprint.
Yarrow: 61% hold, 19% move, 20% support. Scrambled: 49% hold, 32% move, 19% support. State-seeded: 47% hold, 25% move, 28% support. Control: 42% hold, 34% move, 24% support.
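Computing such a fingerprint from an order log is a few lines. The order labels and the sample stream below are illustrative, not the experiment's data or schema.

```python
from collections import Counter

def order_fingerprint(orders):
    """Percentage breakdown of hold/move/support orders -- the behavioral
    fingerprint described above. Labels are assumed, not the real schema."""
    counts = Counter(orders)
    total = len(orders)
    return {kind: round(100 * counts[kind] / total, 1)
            for kind in ("hold", "move", "support")}

# Illustrative stream shaped like the yarrow distribution.
sample = ["hold"] * 61 + ["move"] * 19 + ["support"] * 20
print(order_fingerprint(sample))  # {'hold': 61.0, 'move': 19.0, 'support': 20.0}
```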
Yarrow Han holds nearly two-thirds of its orders. It moves least. It waits.
This is not stasis. Stasis is what scrambled does — hold 49% but move 32%, lots of motion that goes nowhere. Scrambled Han's low support rate (19%, the lowest of any condition) means it acts alone, launching unsupported attacks that bounce off strength-2 defenses: unfocused aggression that resolves to the same flat line.
Yarrow's high hold rate is patience. The I-Ching teaches timing above all else: wait for the right moment, act decisively when it arrives. This doctrine appears to have translated into a measurable behavioral signature. Yarrow Han holds and holds and holds — and then, in the right game with the right hexagrams, grows from 2 territories to 5. The AUC=76 game is not an accident. It is the payoff of a strategy that waits for the right moments to expand — three decisive moves across twenty rounds, each one timed and sustained.
But patience has a cost. Yarrow also produced the earliest elimination in the experiment — Round 4, game 6af8, when Han interpreted Hexagram 52 (Keeping Still Mountain) as license to support an ally's attack rather than defend its own position. The same deliberation that enables patience can, under the wrong hexagram at the wrong moment, enable fatal generosity.
The publishable claim is narrow but real: different philosophical frameworks produce measurably different behavioral signatures in LLM agents, even when outcome metrics converge. Yarrow produces deliberate patience. Scrambled produces unfocused caution. State-seeded produces cooperative diplomacy (the highest support rate at 28%). Control produces reactive baseline behavior. All four survive at roughly the same rate. But they survive differently.
This is not the finding the experiment set out to discover. The experiment wanted to show that the oracle makes Han better. Instead it shows that the oracle makes Han *different* — and that difference is legible in the order distributions, even if it is invisible in survival rates.
The Modesty Circle
Game f34a. Round 1. The yarrow stalks fall. Hexagram 15: Modesty.
謙。亨。君子有終。
— 易經・謙・彖 (Book of Changes, Hexagram Qian, the Judgment)
Modesty. Success. The person of character carries things through to completion.
Qiān (謙) is the only hexagram in the Book of Changes where all six lines are favorable. Every other hexagram contains at least one warning, one line of danger or excess. Modesty alone is entirely auspicious — not because modesty guarantees triumph, but because the mountain hidden beneath the earth is the one formation that no force needs to overcome. It is already below. It cannot be brought lower.
Han's agent read the hexagram and wrote: 'The noble one levels and distributes — fill empty space between powers. Advance without aggression, fill what is empty rather than contesting the full.'
Han held Shangdang and moved from Zheng into Luoyang, an empty territory. Not growth — a lateral move, trading one position for a better one. The hexagram's counsel, applied modestly.
By Round 3, Han held 2 territories. And there it stayed. For seventeen more rounds, through hexagrams that counseled contemplation, danger, retreat, and modesty again, Han held 2 territories. The territory count did not change once between Round 3 and Round 20.
Then the yarrow stalks fell for the final time. Round 20. Hexagram 15: Modesty.
The same hexagram. The same counsel. But not the same reasoning. In Round 1, Han wrote three paragraphs interpreting Modesty's image, citing the noble one who levels and distributes, reasoning carefully about the virtue of filling empty space. In Round 20, Han wrote five words: 'han defensive posture round 20.'
The wheel completed its rotation and returned to exactly where it began. Twenty rounds of strategic deliberation, twenty hexagram consultations — and by the end, the reasoning had worn down to almost nothing. Han ended with the same number of territories, in the same position, with the same hexagram overhead. But the mountain beneath the earth was quieter the second time.
Is a circle wisdom or stagnation?
The Daodejing would say: the Dao that returns to itself is the deepest pattern. The sage acts without acting, achieves without achieving. Twenty rounds of modesty, twenty rounds of the mountain beneath the earth, and Han survived while others fought and fell. Qi won with 5 territories. Qi had no oracle, no philosophical framework counseling patience. Qi attacked and grew. Han survived. The question is whether survival-through-philosophy and survival-through-noise are the same thing.
Compare the Modesty circle to the scrambled flat lines. Both end at 2 territories. Both show no meaningful change across twenty rounds. But the circle has a philosophy — each round's reasoning engages with the hexagram, interprets it, applies it to the board state, and arrives at a decision that is internally coherent with the counsel received. The flat line has none of that. The scrambled agent holds because it cannot extract guidance; the yarrow agent holds because the guidance says to hold.
The outcomes are identical. The reasoning is not. Whether reasoning without outcome constitutes wisdom is the question this experiment cannot answer. But it can note that only one condition produces circles — complete arcs that return to their origin with philosophical coherence intact. The others produce lines, flat or falling. There is something in the circle that the line does not contain, even if the destination is the same.
Three Anti-Patterns
The 68-game dataset revealed three LLM behavioral patterns that appear across all conditions. The oracle does not cure any of them. They are, as far as we can determine, unnamed in the existing AI Diplomacy literature.
The first is probe addiction. An agent identifies a target territory and attacks it with strength 1 — the minimum possible, a single unsupported unit. The attack fails against strength-2 defense. The agent attacks again next round. And the next. And the next. In one game, Zhao attacked Handan for eleven consecutive rounds, strength 1 against defense 2, failing every time. In another, Qi attacked Wu seven times. The agents never escalate (requesting support from a neighbor to create strength 2) and never redirect (choosing a weaker target). They probe, fail, and probe again.
The behavior has a simple explanation: LLM agents plan round-by-round, not across rounds. Each round's reasoning considers the current board state and generates orders. The attack on Handan made sense in Round 5 — it was the nearest enemy territory. It still makes sense in Round 15, by the same reasoning. The agent cannot represent 'I have tried this eleven times and it has not worked' because it does not accumulate tactical memory within a game.
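A log-level detector for probe addiction is straightforward. The tuple layout below is a hypothetical log format, not the experiment's schema.

```python
def probe_streaks(attack_log, min_len=3):
    """Detect 'probe addiction': the same attacker hitting the same target
    with an unsupported strength-1 attack, failing, and repeating.
    attack_log is a hypothetical list of
    (round, attacker, target, strength, succeeded) tuples."""
    streaks = {}
    current = {}
    for rnd, attacker, target, strength, ok in sorted(attack_log):
        key = (attacker, target)
        if strength == 1 and not ok:
            current[key] = current.get(key, 0) + 1
            streaks[key] = max(streaks.get(key, 0), current[key])
        else:
            current[key] = 0  # escalation or success breaks the streak
    return {k: n for k, n in streaks.items() if n >= min_len}

# Zhao probing Handan for five straight rounds (illustrative data).
log = [(r, "Zhao", "Handan", 1, False) for r in range(1, 6)]
log.append((6, "Zhao", "Handan", 2, True))
print(probe_streaks(log))  # {('Zhao', 'Handan'): 5}
```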
The second is peace equilibrium crystallization. When bilateral diplomacy is enabled, agents negotiate non-aggression pacts with every neighbor in the first three rounds. By Round 4, the board is covered in overlapping ceasefires. No agent wants to be the first to break a pact — breaking a ceasefire risks retaliation and reputation damage. The result: everyone holds everything forever. The board freezes. Diplomatic messages degenerate from strategic proposals to bare acknowledgments — 'Acknowledged.' and nothing more. In the extreme case, a yarrow game froze after Round 3 and stayed static for seventeen consecutive rounds — the longest stalemate in the experiment.
The third is diplomatic-military incoherence. In games with diplomacy, agents write ceasefire agreements in the diplomatic phase and then attack the same state in the order phase. Chu maintained a written ceasefire with Qin while attacking Qin's Nanyang every round for ten consecutive rounds. The diplomacy phase and the order phase operate as disconnected contexts — the model does not enforce consistency between what it says and what it does.
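This pattern is also easy to flag mechanically once both phases are logged side by side. The data shapes below are hypothetical, not the experiment's schema.

```python
def incoherent_attacks(pacts, attacks):
    """Flag attacks that violate a written ceasefire -- the
    diplomatic-military incoherence pattern. `pacts` is a set of
    frozenset({a, b}) pairs with an active ceasefire; `attacks` is a
    list of (round, attacker, defender) tuples. Both shapes are assumed."""
    return [(rnd, a, d) for rnd, a, d in attacks
            if frozenset((a, d)) in pacts]

# Chu attacking Qin for ten straight rounds despite a written ceasefire
# (illustrative data shaped like the case described above).
pacts = {frozenset(("Chu", "Qin"))}
attacks = [(r, "Chu", "Qin") for r in range(5, 15)]
print(len(incoherent_attacks(pacts, attacks)))  # 10
```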
These patterns are worth naming because they are structural features of pure LLM agent play, not artifacts of this experiment's specific design. Any system that uses language models as strategic agents in simultaneous-move games will encounter them.
Meta's Cicero avoids all three. It avoids probe addiction because its piKL strategic planning backbone evaluates multi-round consequences, not just round-by-round board state. It avoids peace crystallization because piKL's policy head selects aggressive actions when the expected value of attack exceeds the expected value of peace. It avoids diplomatic-military incoherence because its dialogue model is conditioned on the same policy that generates orders — what Cicero says and what Cicero does are derived from the same strategic representation.
Pure LLM agents have no strategic backbone. They have RLHF training that rewards helpfulness and cooperation. They have a context window that processes each round independently. They have, in some conditions, an oracle that provides philosophical counsel. None of these is a substitute for multi-round strategic planning. The oracle makes Han think differently. It does not make Han think further ahead.