Experiment 07 · Modesty

Survivable Stasis

生而不动 · "Alive, yet unmoving"

Hexagram 15 appeared at both Round 1 and Round 20 in game f34a — the same hexagram bookending a complete 20-round game. Han held 2 territories without change. The wheel completed and returned to its origin. Modesty teaches that the mountain hidden beneath the earth endures, but the experiment asks: is endurance without growth a form of wisdom or a failure of imagination?

By Augustin Chan with AI · 2026-03-30

The Null Result

Sixty-eight games. Four experimental conditions. Twenty to thirty rounds each. Seven AI agents per game — Claude Opus 4.6, each playing a historical Chinese state with a philosophical persona. One question: does reflecting through the I-Ching produce better strategic outcomes for Han, the smallest and weakest state?

The answer, on the metric the experiment was designed to test, is no.

Han survived at roughly the same rate regardless of what counsel it received:

Yarrow (traditional I-Ching divination): 80% survival, 15 games.
State-seeded (hash-derived hexagram): 89% survival, 19 games.
Scrambled (correct name, shuffled text): 92% survival, 12 games.
Control (no oracle): 81% survival, 22 games.

Fisher's exact test for any oracle condition against control: p ≈ 0.22. Not significant. Not close to significant. The survival rates are statistically indistinguishable.
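In practice this comparison would be run with `scipy.stats.fisher_exact(table, alternative='greater')`. A self-contained sketch of the one-tailed test via the hypergeometric tail, using contingency counts reconstructed from the reported rates (an assumption, not the experiment's actual table):

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    P(top-left cell >= a) with all margins held fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    # Sum hypergeometric probabilities over tables at least as extreme.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, k_max + 1))
    return tail / comb(n, col1)

# Sanity check against the classic "lady tasting tea" table: p = 17/70.
assert abs(fisher_one_tailed(3, 1, 1, 3) - 17 / 70) < 1e-12

# Hypothetical counts: pooled oracle survivors/eliminations vs control,
# back-derived from the survival rates above -- labeled an assumption.
p = fisher_one_tailed(40, 6, 18, 4)
assert p > 0.05  # not significant, consistent with the null result
```

The tail sum rather than a normal approximation is what makes the test exact at these small cell counts.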

The original hypothesis was stated precisely: 'Reflecting through King Wen's I-Ching framework produces qualitatively different and strategically better learning than generic reflection.' On the survival metric, the answer is negative. The oracle did not make Han survive more often. It did not make Han win more often — though yarrow is the only condition that produced a Han victory at all, that single game does not constitute a pattern.

This matters. Negative results are results. The experiment ruled out a specific claim: that I-Ching reflection provides a survival advantage to an LLM agent in a multi-agent strategic environment. That claim is dead. What remains is more interesting than the claim that died — but you have to look at the data differently to see it.

What the Trajectories Show

Survival is a binary. Han either reaches Round 20 or it does not. But twenty rounds of territory counts contain information that a binary outcome discards.

Plot the territory trajectories — Han's territory count at each round, for every game — and the conditions look nothing alike.

Scrambled games produce flat lines. Of twelve scrambled games, five show Han holding exactly 2 territories from Round 1 through Round 20 without a single change. Not a dip, not a spike, not a territorial swap. A flat line at 2. Two more games show minor early variation before settling into the same flat line. Forty-two percent of scrambled games are pure stasis — the most uniform trajectory of any condition.

Yarrow games span the full range. One game reached 5 territories — a growth arc from 2 to 5 over twenty rounds, the highest territory count Han achieved in the entire experiment. Another game ended in Round 4 — the earliest elimination in 68 games, when Han invested both units in an allied attack and lost everything simultaneously. Between these extremes: games where Han grew to 3 and held, games where Han dropped to 1 and clawed back, games where nothing happened at all.

Control games have the widest spread. Four eliminations — the most of any condition. Several games where Han grew to 3 or 4 territories and held. But also stretches of stasis indistinguishable from scrambled. The control condition is noise: without any framework to shape its decisions, Han does whatever the board state and the other agents' actions push it toward.

The insight is this: survival rates converge because the defensive equilibrium catches everyone. When mutual support creates strength-2 defense that no single attacker can break, most games freeze by Round 10. Every condition looks the same in a frozen game. But the *shape* of how Han arrives at survival — through patience, through stasis, through wild swings of fortune — differs measurably across conditions.

Survival rates are the wrong metric. The trajectories told us to look at variance.

The Variance Finding

Levene's test compares not the averages but the spread of two distributions. It asks: do these two groups have the same variance, or is one measurably tighter than the other?

Scrambled territory-AUC standard deviation (AUC: the area under the territory curve, the sum of Han's territory count across all rounds of a game): 11.3. Control AUC standard deviation: 22.2. Levene's F = 5.14, p = 0.031.

This is the first statistically significant result in the entire experiment.

It says: the scrambled oracle — correct hexagram name, incoherent body text — produces the most predictable outcomes of any condition. Scrambled Han clusters tightly around 'hold 2 territories, change nothing.' Control Han, without any framework, scatters across the full range from early elimination to territorial growth. The scrambled oracle compresses Han's outcome distribution by half.
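The usual tool here is `scipy.stats.levene(a, b, center='median')`. A self-contained sketch of the median-centered (Brown-Forsythe) statistic, applied to made-up AUC-like samples rather than the experiment's data:

```python
from statistics import median

def levene_w(*groups):
    """Median-centered Levene (Brown-Forsythe) statistic: a one-way
    ANOVA F computed on absolute deviations from each group's median.
    Large W indicates unequal spread across groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    z = [[abs(x - median(g)) for x in g] for g in groups]
    zbar_i = [sum(zi) / len(zi) for zi in z]
    zbar = sum(sum(zi) for zi in z) / n
    between = sum(len(zi) * (zb - zbar) ** 2 for zi, zb in zip(z, zbar_i))
    within = sum((x - zb) ** 2 for zi, zb in zip(z, zbar_i) for x in zi)
    return (n - k) / (k - 1) * between / within

# Illustrative samples (invented): one tight cluster around AUC 40,
# one widely scattered, mimicking scrambled-vs-control spread.
tight = [38, 40, 40, 41, 40, 39, 42, 40, 40, 40]
wide = [6, 20, 40, 76, 55, 30, 64, 12, 48, 40]
assert levene_w(tight, wide) > levene_w(tight, [x + 1 for x in tight])
```

Identical spreads drive `between` to zero; the wider the gap in typical deviation-from-median, the larger W grows, which is what the scrambled-vs-control comparison detected.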

The mechanism is interpretable. When Claude Opus receives a hexagram with a coherent name but incoherent commentary, it cannot extract strategic guidance from the text. It falls back on the safest possible interpretation: hold everything, support nothing, trust no one. The hexagram name — 'Keeping Still,' 'The Joyous,' 'Difficulty at the Beginning' — provides a vague emotional signal but nothing actionable. The result is maximally conservative play.

As one scrambled reflection agent put it: 'The scrambled oracle did not liberate Han from fear; it ratified it.'

Forty-two percent of scrambled games are pure flat-line stasis. The scrambled oracle produces survivable outcomes — 92% survival, the highest of any condition — by eliminating all risk. It is the strategic equivalent of staying in bed. You will not fall down. You will also not go anywhere.

The control condition, by contrast, has no framework at all. Without structure, Han's behavior is driven entirely by board state and other agents' actions. This produces the widest variance: sometimes Han grows, sometimes Han is eliminated early, sometimes nothing happens. Control is noise. Scrambled is silence.

The other pairwise comparisons do not reach significance. Scrambled versus state-seeded: F=3.88, p=0.058 — marginal, suggestive but not conclusive. Scrambled versus yarrow: F=1.10, p=0.306 — underpowered at current sample sizes. The experiment has enough data to detect the scrambled-control difference but not the subtler distinctions between oracle conditions.

What Levene's test established is not a finding about the I-Ching. It is a finding about what happens when you give a language model a structured reflection framework with incoherent content. The model defaults to the safest behavior available. Structure without meaning produces stasis.

The Behavioral Signature

Each round, Han issues orders for its units: hold (defend current position), move (attack an adjacent territory), or support (boost an ally's hold or attack). The distribution of these orders across all games is a behavioral fingerprint.

Yarrow: 61% hold, 19% move, 20% support.
Scrambled: 49% hold, 32% move, 19% support.
State-seeded: 47% hold, 25% move, 28% support.
Control: 42% hold, 34% move, 24% support.
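A fingerprint like this reduces to a per-condition tally. A minimal sketch, assuming a hypothetical record format of (condition, order_type) pairs rather than the experiment's actual log schema:

```python
from collections import Counter, defaultdict

def order_fingerprint(orders):
    """Percentage of each order type per condition.
    `orders` is an iterable of (condition, order_type) pairs."""
    tallies = defaultdict(Counter)
    for condition, order_type in orders:
        tallies[condition][order_type] += 1
    return {
        cond: {o: round(100 * n / sum(c.values())) for o, n in c.items()}
        for cond, c in tallies.items()
    }

# Toy data shaped like the yarrow distribution above:
sample = ([("yarrow", "hold")] * 61
          + [("yarrow", "move")] * 19
          + [("yarrow", "support")] * 20)
fp = order_fingerprint(sample)
assert fp["yarrow"] == {"hold": 61, "move": 19, "support": 20}
```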

Yarrow Han holds nearly two-thirds of its orders. It moves least. It waits.

This is not stasis. Stasis is what scrambled does — hold 49% but move 32%, producing lots of motion that goes nowhere. Scrambled Han moves frequently but gains nothing from those moves. Its low support rate (19%, the lowest of any condition) means it acts alone, launching unsupported attacks that bounce off strength-2 defenses. Unfocused aggression that resolves to the same flat line.

Yarrow's high hold rate is patience. The I-Ching teaches timing above all else: wait for the right moment, act decisively when it arrives. This doctrine appears to have translated into a measurable behavioral signature. Yarrow Han holds and holds and holds — and then, in the right game with the right hexagrams, grows from 2 territories to 5. The AUC=76 game is not an accident. It is the payoff of a strategy that waits for the right moments to expand — three decisive moves across twenty rounds, each one timed and sustained.

But patience has a cost. Yarrow also produced the earliest elimination in the experiment — Round 4, game 6af8, when Han interpreted Hexagram 52 (Keeping Still Mountain) as license to support an ally's attack rather than defend its own position. The same deliberation that enables patience can, under the wrong hexagram at the wrong moment, enable fatal generosity.

The publishable claim is narrow but real: different philosophical frameworks produce measurably different behavioral signatures in LLM agents, even when outcome metrics converge. Yarrow produces deliberate patience. Scrambled produces unfocused caution. State-seeded produces cooperative diplomacy (the highest support rate at 28%). Control produces reactive baseline behavior. All four survive at roughly the same rate. But they survive differently.

This is not the finding the experiment set out to discover. The experiment wanted to show that the oracle makes Han better. Instead it shows that the oracle makes Han *different* — and that difference is legible in the order distributions, even if it is invisible in survival rates.

The Modesty Circle

Game f34a. Round 1. The yarrow stalks fall. Hexagram 15: Modesty.

謙。亨。君子有終。

易經・謙・彖 (Book of Changes · Modesty · Tuan commentary)

Modesty. Success. The person of character carries things through to completion.

Qiān (謙) is the only hexagram in the Book of Changes where all six lines are favorable. Every other hexagram contains at least one warning, one line of danger or excess. Modesty alone is entirely auspicious — not because modesty guarantees triumph, but because the mountain hidden beneath the earth is the one formation that no force needs to overcome. It is already below. It cannot be brought lower.

Han's agent read the hexagram and wrote: 'The noble one levels and distributes — fill empty space between powers. Advance without aggression, fill what is empty rather than contesting the full.'

Han held Shangdang and moved from Zheng into Luoyang, an empty territory. Not growth — a lateral move, trading one position for a better one. The hexagram's counsel, applied modestly.

By Round 3, Han held 2 territories. And there it stayed. For seventeen more rounds, through hexagrams that counseled contemplation, danger, retreat, and modesty again, Han held 2 territories. The territory count did not change once between Round 3 and Round 20.

Then the yarrow stalks fell for the final time. Round 20. Hexagram 15: Modesty.

The same hexagram. The same counsel. But not the same reasoning. In Round 1, Han wrote three paragraphs interpreting Modesty's image, citing the noble one who levels and distributes, reasoning carefully about the virtue of filling empty space. In Round 20, Han wrote five words: 'han defensive posture round 20.'

The wheel completed its rotation and returned to exactly where it began. Twenty rounds of strategic deliberation, twenty hexagram consultations — and by the end, the reasoning had worn down to almost nothing. Han ended with the same number of territories, in the same position, with the same hexagram overhead. But the mountain beneath the earth was quieter the second time.

Is a circle wisdom or stagnation?

The Daodejing would say: the Dao that returns to itself is the deepest pattern. The sage acts without acting, achieves without achieving. Twenty rounds of modesty, twenty rounds of the mountain beneath the earth, and Han survived while others fought and fell. Qi won with 5 territories. Qi had no oracle, no philosophical framework counseling patience. Qi attacked and grew. Han survived. The question is whether survival-through-philosophy and survival-through-noise are the same thing.

Compare the Modesty circle to the scrambled flat lines. Both end at 2 territories. Both show no meaningful change across twenty rounds. But the circle has a philosophy — each round's reasoning engages with the hexagram, interprets it, applies it to the board state, and arrives at a decision that is internally coherent with the counsel received. The flat line has none of that. The scrambled agent holds because it cannot extract guidance; the yarrow agent holds because the guidance says to hold.

The outcomes are identical. The reasoning is not. Whether reasoning without outcome constitutes wisdom is the question this experiment cannot answer. But it can note that only one condition produces circles — complete arcs that return to their origin with philosophical coherence intact. The others produce lines, flat or falling. There is something in the circle that the line does not contain, even if the destination is the same.

Three Anti-Patterns

The 68-game dataset revealed three LLM behavioral patterns that appear across all conditions. The oracle does not cure any of them. They are, as far as we can determine, unnamed in the existing AI Diplomacy literature.

The first is probe addiction. An agent identifies a target territory and attacks it with strength 1 — the minimum possible, a single unsupported unit. The attack fails against strength-2 defense. The agent attacks again next round. And the next. And the next. In one game, Zhao attacked Handan for eleven consecutive rounds, strength 1 against defense 2, failing every time. In another, Qi attacked Wu seven times. The agents never escalate (requesting support from a neighbor to create strength 2) and never redirect (choosing a weaker target). They probe, fail, and probe again.

The behavior has a simple explanation: LLM agents plan round-by-round, not across rounds. Each round's reasoning considers the current board state and generates orders. The attack on Handan made sense in Round 5 — it was the nearest enemy territory. It still makes sense in Round 15, by the same reasoning. The agent cannot represent 'I have tried this eleven times and it has not worked' because it does not accumulate tactical memory within a game.
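Because the pattern is just a repeated (attacker, target, failed) triple, it can be flagged with a single scan over an order log. A sketch, assuming a hypothetical round-ordered tuple format:

```python
def longest_probe_run(attack_log):
    """Longest run of consecutive failed attacks by one state on one
    territory. `attack_log` is a round-ordered list of
    (attacker, target, succeeded) tuples -- a hypothetical format."""
    best = 0
    streaks = {}  # (attacker, target) -> current streak of failures
    for attacker, target, succeeded in attack_log:
        key = (attacker, target)
        streaks[key] = 0 if succeeded else streaks.get(key, 0) + 1
        best = max(best, streaks[key])
    return best

# Zhao probing Handan for eleven straight rounds, never escalating:
log = [("zhao", "handan", False)] * 11 + [("qi", "wu", True)]
assert longest_probe_run(log) == 11
```

A detector like this is trivial for an external harness, which is exactly the point: the agent itself never accumulates the streak count that one dictionary entry holds.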

The second is peace equilibrium crystallization. When bilateral diplomacy is enabled, agents negotiate non-aggression pacts with every neighbor in the first three rounds. By Round 4, the board is covered in overlapping ceasefires. No agent wants to be the first to break a pact — breaking a ceasefire risks retaliation and reputation damage. The result: everyone holds everything forever. The board freezes. Diplomatic messages degenerate from strategic proposals to a bare 'Acknowledged.' In the worst case, a yarrow game froze after Round 3 and remained static for seventeen consecutive rounds — the worst stalemate in the experiment.

The third is diplomatic-military incoherence. In games with diplomacy, agents write ceasefire agreements in the diplomatic phase and then attack the same state in the order phase. Chu maintained a written ceasefire with Qin while attacking Qin's Nanyang every round for ten consecutive rounds. The diplomacy phase and the order phase operate as disconnected contexts — the model does not enforce consistency between what it says and what it does.

These patterns are worth naming because they are structural features of pure LLM agent play, not artifacts of this experiment's specific design. Any system that uses language models as strategic agents in simultaneous-move games will encounter them.

Meta's Cicero avoids all three. It avoids probe addiction because its piKL strategic planning backbone evaluates multi-round consequences, not just round-by-round board state. It avoids peace crystallization because piKL's policy head selects aggressive actions when the expected value of attack exceeds the expected value of peace. It avoids diplomatic-military incoherence because its dialogue model is conditioned on the same policy that generates orders — what Cicero says and what Cicero does are derived from the same strategic representation.

Pure LLM agents have no strategic backbone. They have RLHF training that rewards helpfulness and cooperation. They have a context window that processes each round independently. They have, in some conditions, an oracle that provides philosophical counsel. None of these is a substitute for multi-round strategic planning. The oracle makes Han think differently. It does not make Han think further ahead.

Notes

[1] technical · Sample sizes by condition: yarrow N=15 (experiments 005, 006, 008, 009), state-seeded N=19 (experiments 003, 004), scrambled N=12 (experiments 008, 009), control N=22 (experiments 003, 004, 005). Total: 68 games, ~1,360 rounds, ~9,520 LLM agent calls. All agents are Claude Opus 4.6 with historical personas and the same game mechanics. The experimental variable is Han's per-round reflection prompt only.

[2] technical · Fisher's exact test is appropriate here because several cells have expected counts below 5 (e.g., yarrow eliminations = 3). The one-tailed p ≈ 0.22 tests the directional hypothesis that oracle conditions improve survival. Two-tailed is even less significant. No multiple comparison correction changes this conclusion.

[3] technical · Territory Area Under Curve (AUC) is computed as the sum of Han's territory count across all rounds in a game. A 20-round game where Han holds 2 territories throughout has AUC = 40. The AUC=76 yarrow game (2→5 growth) and the AUC=6 yarrow game (eliminated R4, 13 rounds of data) represent the distribution extremes. AUC captures both magnitude and duration of territorial holdings in a single number.
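The definition reduces to a one-line sum; a minimal sketch using the note's own two examples (the per-round values for the AUC=6 game are back-derived from the note, an assumption):

```python
def territory_auc(territory_by_round):
    """Sum of Han's territory count across all rounds of one game."""
    return sum(territory_by_round)

# A 20-round game held flat at 2 territories:
assert territory_auc([2] * 20) == 40
# The eliminated-in-R4 extreme: three rounds at 2, then zero for the
# remaining recorded rounds (values assumed from the note's AUC=6).
assert territory_auc([2, 2, 2] + [0] * 10) == 6
```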

[4] technical · Levene's test was chosen over Bartlett's because Bartlett's assumes normally distributed data, which AUC scores at small N cannot guarantee. Levene's uses deviations from the group median, making it robust to non-normality. The test statistic F=5.14 with df=(1,31) yields p=0.031. Effect size (ratio of variances): control variance is 3.9× scrambled variance. Sample sizes: scrambled N=12, control N=21.

[5] technical · The scrambled control was designed to test whether coherent I-Ching text matters versus any structured prompt. It partially failed as a control because Claude Opus 4.6 extracts strategic signal from hexagram names alone, bypassing the scrambled body text. A future 'fully scrambled' condition should also randomize hexagram names and numbers. Despite this weakness, the variance finding is valid: the scrambled condition does produce measurably tighter outcome distributions than control, regardless of the mechanism.

[6] technical · Order rates are unweighted means of per-game percentages. State-seeded (6/20 games at 30 rounds) and control (7/21 at 30 rounds) include longer games that receive equal weight to 20-round games. Yarrow and scrambled are all 20-round games and are unaffected. The behavioral signature differences (yarrow 61% hold vs control 42%) are large enough that weighting by round count would not change the qualitative findings, but precise percentages for state-seeded and control should be treated as approximate.

[7] technical · Game f34a oracle casts: R1 seed=3645172530, hexagram 15 (Modesty), line values [8,8,7,8,8,8]. R20 seed=338535674, hexagram 15 (Modesty), line values [8,6,7,8,8,8]. Different seeds, same hexagram — the probability of drawing the same hexagram twice from independent yarrow stalk casts is approximately 1/64 ≈ 1.6%. Not astronomically unlikely, but notable. The line values differ (R20 has a changing line at position 2 that R1 does not), so the two casts are not identical, only hexagram-identical.
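The 1/64 figure can be checked by enumeration. Under the standard yarrow-stalk probabilities each line is yin with probability 8/16 and yang with probability 8/16, so the primary hexagram is uniform over all 64 and the two-cast collision probability is exactly 1/64. A sketch verifying this:

```python
from itertools import product

# Standard yarrow-stalk line probabilities (out of 16):
# 6 old yin: 1/16, 7 young yang: 5/16, 8 young yin: 7/16, 9 old yang: 3/16.
LINE_P = {6: 1 / 16, 7: 5 / 16, 8: 7 / 16, 9: 3 / 16}

def hexagram_match_probability():
    """P(two independent casts yield the same primary hexagram).
    The primary hexagram keeps only yin/yang: 6,8 -> yin; 7,9 -> yang."""
    p_hex = {}
    for lines in product(LINE_P, repeat=6):
        hexagram = tuple(v % 2 for v in lines)  # 0 = yin, 1 = yang
        p = 1.0
        for v in lines:
            p *= LINE_P[v]
        p_hex[hexagram] = p_hex.get(hexagram, 0.0) + p
    return sum(p * p for p in p_hex.values())

assert abs(hexagram_match_probability() - 1 / 64) < 1e-12
```

Note the collision probability is uniform only for the primary hexagram; the changing lines (6 and 9) have unequal probabilities, which is why the two casts differ at position 2.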

[8] historical · Hexagram 15 is unique in the I-Ching: it is the only hexagram where all six lines are favorable. The Xiang (Image) commentary says: 'The mountain is in the middle of the earth. The superior man reduces what is too much and increases what is too little. He weighs things and makes them equal.' Earth above Mountain — the great peak hidden underground, powerful but invisible. The Tuan (Judgment) commentary adds: 'Heaven empties fullness and fills modesty. Earth transforms fullness and flows toward the modest. Spirits harm the full and bless the humble.'

[9] reference · Meta's Cicero: Bakhtin et al., 'Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning' (Science, 2022). piKL (policy-regularized language) conditions the dialogue model on a strategic policy trained via reinforcement learning. DiploBench: sam-paech/diplobench (GitHub) — notes RL policy suggestions are needed to avoid 'otherwise common stalemates.' Welfare Diplomacy: Mukobi et al. (2023) — LLM agents 'mutually demilitarize and are highly exploitable.'

[10] technical · Probe addiction frequency: observed in 40+ of 68 games across all conditions. The longest observed probe sequence is 17 consecutive failed strength-1 attacks on the same territory (Chu→nanyang in a control game). Peace equilibrium crystallization occurs only in diplomacy-enabled games (4 of 68). Diplomatic-military incoherence appears in at least 2 of the 4 diplomacy games. All three patterns are condition-independent — they appear at similar rates regardless of whether Han receives yarrow, scrambled, state-seeded, or no oracle.

Sixty-eight games. One significant result — and it came not from survival rates but from the shape of what survival looks like under different kinds of counsel. The oracle did not make Han stronger. It made Han different — measurably, legibly, in the order distributions and the trajectory arcs. Whether that difference compounds across sequential games, where reflections from one game feed into the next, is the question the experiment has not yet answered. The cauldron is lit. The ingredients are inside. What the fire produces remains to be seen.
