Research · Methods Primer

Statistical Methods,
In Plain Language

This page explains — in ordinary English — every statistical test and term used across the Warring States experiments. Each dispatch links here when it uses a technical term. The goal is that nothing in our research reports should be a black box.

By Augustin Chan with AI · 2026-04-16

What a p-value actually means

A p-value answers one specific question: if nothing interesting were really going on, how surprised should we be by the result we observed?

Imagine you flip a coin ten times and get eight heads. The p-value is the probability of getting eight-or-more heads (or two-or-fewer) from a fair coin. For a fair coin, that probability is about 0.11 — so a p-value of 0.11 means “this could plausibly have happened by chance, even if the coin is fair.”

If we had gotten nine heads, the p-value would be about 0.02 — much more surprising under a fair coin. By convention, researchers call results with p < 0.05 “statistically significant”: rare enough under the no-effect assumption that we're willing to believe something real is happening.
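The coin arithmetic above is easy to check directly. A minimal sketch using only Python's standard library (the function name is ours, not from any stats package):

```python
from math import comb

def two_sided_p(k, n):
    """Two-sided p-value for k heads in n fair-coin flips: the chance of
    a result at least this far from 50/50, in either direction.
    (Assumes k != n/2, so the two tails don't overlap.)"""
    tail = min(k, n - k)  # size of the nearer tail
    one_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return 2 * one_tail   # the fair coin is symmetric, so double it

print(round(two_sided_p(8, 10), 2))  # → 0.11
print(round(two_sided_p(9, 10), 2))  # → 0.02
```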

A p-value does not tell you how big an effect is, how important it is, or how likely the effect is to be real. It only tells you how surprised a skeptic should be. A tiny effect with a huge sample can have a tiny p-value; a huge effect with a tiny sample can have a large p-value.

In our research: when we report “Tarot-Qin Fisher's exact p=0.0069,” we mean: if Tarot really had no effect on which state wins, we'd see 5-or-more Qin wins in 6 Tarot games less than 1% of the time. That's surprising enough to take the effect seriously.

The null hypothesis

The null hypothesis is the “nothing interesting is happening” version of the world. Statistical tests compute how likely your observed data is assuming the null hypothesis is true.

For our Tarot-Qin finding, the null hypothesis is “Tarot has no effect on who wins — wins are distributed across states the same way whether Han uses Tarot or not.” If the data is very unlikely under this null (small p-value), we reject the null and conclude Tarot probably does affect outcomes.

Rejecting the null is not the same as proving the alternative. It just means “the no-effect explanation doesn't fit the data well.”

Sample size and why it matters

With few games, even real effects can fail to reach statistical significance, and chance patterns can look real. This is why we care so much about how many games each condition has.

Our Tarot arm has 6 games. Our yarrow arm has 7. Our control arm has 11. For a strong signal (like 5 out of 6 Tarot games going to Qin), 6 games is enough to clear the significance threshold. For a weaker signal (like yarrow shifting Yan's win rate from 36% to 57%), 7 games is not enough — you'd need closer to 25 or 30 to reliably detect that size of effect.

Sample size interacts with effect size: the bigger the real effect, the fewer games you need to detect it. This is why we report our findings as “rigorous” (Tarot-Qin, strong effect + adequate sample) and “directional” (yarrow ecosystem patterns, smaller effect + inadequate sample).

Mean vs. median (and why one can mislead)

The mean is the average: add everything up, divide by count. The median is the middle value: sort the numbers, pick the one in the middle.

These differ when distributions are skewed or contain clusters of identical values. Consider Han's survival rounds across 11 control games:

4, 8, 10, 20, 20, 20, 20, 20, 20, 20, 20

The median is 20 (the 6th value in the sorted list). The mean is 16.5. Eight of the 11 Hans reached round 20; once more than half of your values pile up at a single point, the median locks onto that point.
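Both numbers are quick to verify with Python's standard library:

```python
from statistics import mean, median

# Han's survival rounds across the 11 control games, as listed above
rounds = [4, 8, 10, 20, 20, 20, 20, 20, 20, 20, 20]

print(median(rounds))          # → 20
print(round(mean(rounds), 1))  # → 16.5
```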

But “median survival = 20” is measuring duration alive, not strategic relevance. Most of those round-20 Hans were holding a single home supply center with no real influence on the game. The median captures “how long they lasted” but says nothing about “how much they mattered.” Good science communication pairs any median-survival claim with a peak or final-state measure.

In our research: when Dispatch 11 reported “control Han's median survival: Round 20” alongside “5 of 7 surviving Hans ended with exactly 1 SC,” it was distinguishing these two senses of survival. The median says they lasted; the SC count says they didn't matter.

Fisher's exact test

Fisher's exact is the right test when you have a 2×2 table of counts and want to know if the two rows are distributed differently across the two columns.

Example from our research. We want to know if Qin wins more often in the Tarot condition than in the yarrow condition:

           Qin wins   Other wins
  Tarot        5           1
  Yarrow       0           7

Fisher's exact calculates the probability of observing this specific table (or a more extreme one) under the null hypothesis that Tarot and yarrow have the same win distribution. That probability is 0.0047.

Fisher's exact is called “exact” because it computes the probability directly from combinatorics, rather than approximating with a distribution curve. It's the standard choice for small 2×2 tables.
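For a table this small, the combinatorics fit in a few lines. A sketch of the standard hypergeometric formulation (function name ours; real analyses would use a vetted library):

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]]:
    sum the probability of every table with the same row/column totals
    that is no more likely than the observed one."""
    r1, c1, n = a + b, a + c, a + b + c + d

    def p(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)

    p_obs = p(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

# Tarot vs. yarrow on Qin wins
print(round(fisher_two_sided(5, 1, 0, 7), 4))  # → 0.0047
```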

Mann-Whitney U test

Mann-Whitney (also called the Wilcoxon rank-sum test) answers: are the values in one group systematically higher or lower than values in another group?

Unlike a t-test (which assumes bell-shaped distributions), Mann-Whitney works with any distribution shape because it operates on ranks. You pool all values, rank them, and check whether one group's ranks tend to be higher.
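The U statistic itself has a simple pairwise reading: over every cross-group pair of values, count how often the first group's value is larger, with ties counting half. A sketch with made-up numbers, not our game data:

```python
def mann_whitney_u(xs, ys):
    """U for group xs: number of (x, y) pairs with x > y, ties counted 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

# Hypothetical territory counts for two conditions
print(mann_whitney_u([6, 7, 9], [3, 5, 7]))  # → 7.5
```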

We use Mann-Whitney for our territorial comparisons. For example, “Yan territories at game end, control vs. yarrow: U=14.0, p=0.572” means: there's no strong tendency for Yan's values in one condition to be systematically higher than in the other. The directional pattern exists but isn't rank-order reliable at this sample size.

Kruskal-Wallis H test

Kruskal-Wallis is Mann-Whitney's big sibling: it compares three or more groups at once. It asks: do any of these groups differ systematically from the others?

We use it for three-way comparisons across control, yarrow, and Tarot. For example, “Han peak supply centers, Kruskal-Wallis H=4.419, p=0.110” means: across the three conditions, the peak-SC values differ enough that under the null hypothesis of no difference we'd see this pattern about 11% of the time — suggestive, not significant.

Like Mann-Whitney, Kruskal-Wallis is rank-based and distribution-free. The H statistic is roughly “how much more variation we see between groups than we'd expect if the groups were all drawn from the same distribution.”
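That rank-based reading can be computed directly from mean ranks. A sketch that omits the tie-correction factor full implementations apply (data made up for illustration):

```python
from collections import defaultdict

def kruskal_wallis_h(*groups):
    """H (no tie correction): 12/(N(N+1)) * sum(n_i * mean_rank_i^2) - 3(N+1)."""
    pooled = sorted(v for g in groups for v in g)
    positions = defaultdict(list)
    for i, v in enumerate(pooled, start=1):
        positions[v].append(i)
    rank = {v: sum(ps) / len(ps) for v, ps in positions.items()}  # ties share a rank
    n = len(pooled)
    between = sum(len(g) * (sum(rank[v] for v in g) / len(g)) ** 2 for g in groups)
    return 12 / (n * (n + 1)) * between - 3 * (n + 1)

# Three fully separated groups give a large H
print(round(kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9]), 1))  # → 7.2
```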

Permutation tests

A permutation test directly simulates the null hypothesis. Instead of relying on mathematical formulas, you shuffle the labels on your data many times and count how often you get a result as extreme as the one you actually observed.

For our Tarot-Qin finding we did this: take all 24 games' winners, pretend there's no condition effect, randomly reassign them to conditions 100,000 times, and count how often the Tarot condition happens to get 5-or-more Qin wins by chance. Answer: about 0.7% of the time. That's our permutation p-value.

Permutation tests are powerful because they make minimal assumptions — they just ask “could this pattern have shown up by chance given the actual data we have?” The cost is computational time, which is cheap now but was prohibitive before modern computers.

Bonferroni correction (multiple comparisons)

If you run 20 statistical tests, on average 1 will come back significant (p < 0.05) even if nothing real is happening. This is the “multiple comparisons problem”: the more tests you run, the more likely you are to find a false-positive somewhere.

Bonferroni correction is the simplest solution: divide your significance threshold by the number of tests you're running. If you pre-planned 2 tests and want overall α = 0.05, each test must clear α = 0.025 to count.

We pre-registered two framework-specificity tests: Tarot → Qin, and yarrow → Yan. Bonferroni threshold for our set is p < 0.025. Tarot-Qin (p = 0.007) clears it; yarrow-Yan (p = 0.17 or 0.63 depending on comparator) does not.

Bonferroni is conservative — it assumes the worst case in how your tests might be correlated. More sophisticated corrections (Holm, Benjamini-Hochberg) gain some power back. For small pre-registered comparison sets, Bonferroni is usually fine and always defensible.

Odds ratio

Odds ratio is an effect size for 2×2 tables. It compares the odds of an outcome in one group to the odds in another group.

For Tarot vs. pooled (control + yarrow) on Qin wins: Tarot has 5 Qin wins out of 6 games (odds = 5:1 = 5.0). The pooled arm has 3 out of 18 (odds = 3:15 = 0.2). Ratio of odds: 5.0 / 0.2 = 25. So the odds of a Qin win are 25× higher in Tarot games than in non-Tarot games.

Odds ratios are useful because they convey magnitude in a way p-values do not. A p-value tells you “this is surprising”; an odds ratio tells you “this is how much.”
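The arithmetic from the example above, as code (function name ours):

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]: (a/b) / (c/d)."""
    return (a / b) / (c / d)

# Tarot (5 Qin wins, 1 other) vs. pooled (3 Qin wins, 15 other)
print(odds_ratio(5, 1, 3, 15))  # → 25.0
```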

One-sided vs. two-sided tests

A two-sided test asks: “is there any difference between groups, in either direction?” A one-sided test asks: “is group A specifically higher than group B?”

Two-sided tests are more conservative and are the default when you don't have a prior directional hypothesis. One-sided tests have more statistical power but require pre-commitment to a direction — you can't pick whichever side is convenient after seeing the data.

Our permutation test for yarrow-Yan is one-sided: we pre-committed to checking whether yarrow produces more Yan wins than chance (not fewer). Our Fisher tests are two-sided. When comparing p-values across our reporting, the one-sided/two-sided choice matters: a one-sided test gives roughly half the two-sided p-value when the direction matches.

Pre-registered vs. exploratory findings

Before running the main analysis, we wrote down which specific tests we would run: Tarot → Qin concentration, and yarrow → Yan concentration. These are pre-registered comparisons.

Anything we discover by looking at the data and noticing a pattern is exploratory. Exploratory findings are not worthless — they generate hypotheses for follow-up work — but they carry weaker evidential weight. The reason: if you look hard enough at any dataset, you'll find something, and that something may not replicate.

The Qin-suppression pattern we noticed under yarrow (Qin loses 3 territories on average) is an exploratory finding: we didn't pre-specify it. We flag it for pre-registered follow-up testing rather than treating it as established.

Our methodology choice: final-game-state

When measuring “what did the board look like at the end?” there are two reasonable choices. You could measure at round 20 exactly, treating games that ended earlier (via domination victory or stalemate) as having zero territories per state at round 20. Or you could measure final-game-state — whatever each game's last recorded standings actually were, regardless of round.

The two methodologies produce materially different answers when domination rates differ between conditions. In our v2 dataset, one control game ended at round 16 when Yan won by domination with 16 territories. Under round-20-exact, that Yan score gets recorded as 0 (no round 20 exists). Under final-game-state, it gets recorded as 16.

The round-20-exact methodology systematically zeros out the winners of early-ending games. This biases comparisons whenever one condition has more domination wins than another. We use final-game-state throughout current reporting; our reproducer scripts (scripts/dispatch_10_audit.py and scripts/dispatch_11_audit.py in the research repository) produce both methodologies side-by-side so anyone can verify the choice.

Lesson: when research numbers seem surprising, sometimes the issue isn't an arithmetic error. It's a choice about what the numbers mean. State the choice explicitly, and prefer the choice that doesn't artificially benefit your hypothesis.

Reproducibility

Every statistical number in our dispatches is produced by Python scripts that read directly from the game data files. Scripts live in the research repository:

  • scripts/dispatch_10_audit.py — reproduces every Dispatch 10 number, with both R20-exact and final-game-state methodologies shown.
  • scripts/dispatch_11_audit.py — reproduces every Dispatch 11 number, including all Fisher and permutation tests.

If any of our numbers look wrong to you, these scripts are how you verify. Run them against the committed game data and they produce identical outputs — or they don't, in which case we want to know about it.