Experiment 02 · The Well

Three Rescues and a Discovery

三次救援与一个发现

Hexagram 48 describes a well that has been dredged clean but from which no one drinks. Three experiments searched for signal in the King Wen hypothesis — each dredging deeper — and found the well dry in its original domain. But the process of searching uncovered something unexpected about how modern AI tools actually work.

By Augustin Chan with AI · 2026-03-24

The Nature-Nurture Debate, 2300 Years Later

After two experiments showed the King Wen sequence did not improve AI training, the question shifted from the sequence itself to the conditions surrounding it. Perhaps the problem was not the curriculum but the student.

The Junzi Alignment hypothesis proposes that an AI model's disposition — its tendency toward certain behaviors — is shaped by two factors: its random initialization (the 'seed') and its training curriculum. This mirrors one of China's oldest philosophical debates.

Mengzi argued that moral tendencies are innate — present from birth like sprouts waiting to grow.

惻隱之心,仁之端也;羞惡之心,義之端也;辭讓之心,禮之端也;是非之心,智之端也。人之有是四端也,猶其有四體也。

孟子・公孫丑上

The heart of compassion is the sprout of benevolence; the heart of shame is the sprout of righteous conduct; the heart of deference is the sprout of propriety; the heart of right and wrong is the sprout of wisdom. That people have these four sprouts is like their having four limbs.

Mengzi's 'four sprouts' theory predicts that different seeds should produce different moral dispositions — just as the Junzi hypothesis predicts that different random initializations should produce different behavioral dispositions in neural networks.

Xunzi took the opposite view. Human nature is raw material; only sustained cultivation produces virtue.

青、取之於藍,而青於藍;冰、水為之,而寒於水。…乾、越、夷、貉之子,生而同聲,長而異俗,教使之然也。

荀子・勸學

Blue dye extracted from indigo surpasses indigo; ice formed from water is colder than water. Children of Gan, Yue, Yi, and Mo are born with identical cries but grow up with entirely different customs — education makes them so.

Xunzi's position predicts that the training data (curriculum) overwhelms any innate disposition from the seed. If this is correct, changing the random initialization should have negligible effect on the trained model's behavior.

A sweep of 30 random seeds was run on the same training setup. Each seed produces a different random initialization — a different starting point for the same learning process. If Mengzi is right, different seeds should produce measurably different behavioral signatures. If Xunzi is right, they should all converge to approximately the same result.

Xunzi won. Across the 30 seeds, the evaluation metric (validation bits per byte) varied by only about 0.04 from the best seed to the worst, a spread treated as the noise floor for everything that followed. No 'behavioral personalities' emerged. No seed produced a distinctly different kind of model. At this scale the starting point is effectively irrelevant; the training process dominates.

For the King Wen hypothesis, this meant that any curriculum effect would need to exceed 0.04 to be distinguishable from the background noise of seed variation — a threshold that none of the subsequent experiments cleared.
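The noise-floor test described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's analysis code: the function names are hypothetical, and the only figures used are the ones reported in the Notes (a 0.04 range across seeds, and val_bpb values of 1.968, 1.870, and 1.863 for sequential, random, and adaptive ordering).

```python
from statistics import stdev


def seed_noise_floor(scores):
    """Return (range, sample standard deviation) of a metric across seeds."""
    return max(scores) - min(scores), stdev(scores)


def exceeds_noise_floor(baseline, candidate, floor):
    """True if candidate beats baseline by more than the floor (lower is better)."""
    return (baseline - candidate) > floor


# Figures quoted in the Notes; the 0.04 floor is the range across 30 seeds.
NOISE_FLOOR = 0.04
sequential, random_shuffle, adaptive = 1.968, 1.870, 1.863

print(exceeds_noise_floor(sequential, random_shuffle, NOISE_FLOOR))  # shuffle vs sequential
print(exceeds_noise_floor(random_shuffle, adaptive, NOISE_FLOOR))    # adaptive vs shuffle
```

Using the range rather than the standard deviation as the threshold is the conservative choice: a curriculum effect has to be larger than the gap between the luckiest and unluckiest seed before it counts.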

The Platform Mystery

With the seed sensitivity question settled, attention returned to the curriculum experiments — but from a new angle. The first curriculum test had been run on one machine. What would happen on a different one?

The same six ordering conditions — sequential, random shuffle, easy-to-hard, hard-to-easy, King Wen, and a control — were run on two machines. The first was a desktop PC with an NVIDIA graphics card, using PyTorch, the most common AI training framework. The second was a laptop with an Apple processor, using MLX, Apple's own framework.
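As a rough sketch of what an ordering condition means in practice, the following Python builds four of the six orderings from a list of per-batch difficulty scores. The function name and the input format are assumptions made for illustration. The King Wen and control conditions are left out because their construction is not described in this article.

```python
import random


def build_orderings(difficulty, seed=0):
    """Map condition name -> list of batch indices in training order.

    `difficulty` holds one score per batch (a hypothetical input format).
    """
    indices = list(range(len(difficulty)))

    # Random shuffle with a fixed seed so the ordering is reproducible.
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)

    # Stable sort by difficulty score, lowest first.
    easy_first = sorted(indices, key=lambda i: difficulty[i])

    return {
        "sequential": indices,
        "random_shuffle": shuffled,
        "easy_to_hard": easy_first,
        "hard_to_easy": easy_first[::-1],
    }


orderings = build_orderings([0.3, 0.1, 0.9, 0.5])
print(orderings["easy_to_hard"])
```

Every condition is a permutation of the same batches, so the model sees identical data in each run. Only the order changes.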

The results were startling. On the NVIDIA machine, random shuffling improved on the sequential baseline by about 0.106 bits per byte, roughly 2.7 times the seed noise floor. Every alternative ordering beat the sequential baseline. The effect was reproducible on that machine.

On the Apple machine: nothing. No ordering produced any improvement outside the noise floor. The same experiment, the same data, the same model architecture, the same evaluation — and completely different results.

Both machines were then tested with a second difficulty metric to rule out measurement artifacts. The platform gap persisted. The effect was not about how difficulty was measured. It was about the platform itself.

The Discovery Nobody Expected

The explanation turned out to have nothing to do with the I-Ching.

The NVIDIA machine used torch.compile, a PyTorch feature that traces how the model processes data and automatically generates optimized routines for that computation. The routines are specialized to the patterns observed while the model runs.

The proposed explanation is an unintended interaction. The training data, prepared by a packing algorithm that groups documents by length, has a subtle sequential pattern: adjacent batches share similar characteristics. Under normal circumstances, this is harmless. But torch.compile, by optimizing for the specific patterns it observes, may have over-fitted its optimizations to this sequential structure. When the data was shuffled, the compiled optimizations would no longer match the actual data patterns, and the model would be forced to learn more general representations. This account is consistent with the observations but has not been independently verified.

On the Apple machine, there is no equivalent compilation step. The data flows through the model in a straightforward way that does not adapt to sequential patterns. Shuffling the data has no effect because there is no platform-level optimization to disrupt.
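The sequential pattern in the packed data can be made concrete with a lag-1 autocorrelation, which measures how closely each batch resembles the one before it. The sketch below is illustrative only. The series is a synthetic stand-in (a steadily rising per-batch mean document length), not the project's data, and the function name is hypothetical.

```python
import random


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation: how strongly each value tracks its predecessor."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0
    cov = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return cov / var


# Synthetic stand-in for batches grouped by length: neighbours look alike.
grouped = [float(100 + 5 * i) for i in range(200)]

# The same batches in shuffled order.
shuffled = grouped[:]
random.Random(0).shuffle(shuffled)

print(lag1_autocorrelation(grouped))   # close to 1: strong sequential structure
print(lag1_autocorrelation(shuffled))  # near 0: structure removed
```

Shuffling leaves the set of batches untouched and removes only the correlation between neighbours, which is the property the proposed explanation depends on.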

This finding — that a platform's compilation strategy can interact with data ordering to create phantom curriculum effects — appears to be novel. It has nothing to do with ancient Chinese sequences and everything to do with the invisible optimization layers built into modern AI tools.

井渫不食,為我心惻。可用汲,王明,並受其福。

易經・井・九三

The well has been dredged clean but no one drinks from it. This makes my heart ache. It could be drawn from — if the ruler were wise, all would share in its blessing.

The curriculum experiments dredged the well of data ordering and found it empty for the King Wen hypothesis. But the process of dredging revealed something else entirely — a hidden interaction between platform optimization and data structure that had not been documented before. The well was empty of what was sought, but contained something unexpected.

One Last Attempt

Before abandoning the curriculum hypothesis entirely, one more approach was tested. All previous experiments used fixed orderings — the same sequence applied uniformly throughout training. What if the model itself chose which data to learn from?

An adaptive algorithm was implemented using a technique from decision theory called UCB1 (Upper Confidence Bound). The training data was divided into eight difficulty levels. As the model trained, the algorithm tracked which difficulty level produced the most learning progress. It then directed attention toward the levels where the model was improving fastest — a kind of self-directed curriculum.
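A minimal sketch of such a selector follows, using the scoring rule given in the Notes: score = mean_improvement + c × √(ln(total) / count). The class name, the choice of c = √2, and the definition of "improvement" (for example, the drop in loss after training on a batch from that level) are assumptions for illustration, not details taken from the experiment's code.

```python
import math


class UCB1Selector:
    """Chooses which difficulty level to train on next, using UCB1."""

    def __init__(self, n_arms=8, c=math.sqrt(2)):
        self.c = c                    # exploration weight (assumed value)
        self.counts = [0] * n_arms    # times each level has been chosen
        self.means = [0.0] * n_arms   # running mean improvement per level

    def select(self):
        # Try every level once before applying the formula.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + self.c * math.sqrt(math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, improvement):
        # Incremental update of the running mean for the chosen level.
        self.counts[arm] += 1
        self.means[arm] += (improvement - self.means[arm]) / self.counts[arm]
```

In a training loop, select() would pick the level for the next batch and update() would record the measured improvement afterward. The square-root term shrinks as a level is sampled more often, so rarely tried levels are periodically revisited even when their average improvement is low.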

The result: the adaptive approach was indistinguishable from random shuffling. Both outperformed sequential ordering, but the difference between the two was well within the seed noise floor. The model gained nothing from being allowed to choose its own curriculum.

This was the final piece of evidence. Across all experiments — fixed orderings, different difficulty metrics, different platforms, different model sizes, and now adaptive selection — the dominant effect was always the same: any disruption of the sequential data pattern produced a small, consistent benefit. The specific ordering of that disruption — whether random, structured, King Wen, or self-selected — made no measurable difference.

蓬生麻中,不扶而直;白沙在涅,與之俱黑。…其質非不美也,所漸者然也。

荀子・勸學

Mugwort growing among hemp stands straight without support; white sand placed in black mud becomes black with it. Its quality is not unworthy — it is what it steeps in that determines the outcome.

Xunzi's environmental determinism proved right at this scale. The 'environment' of the training process — the framework, the hardware, the data packing — mattered far more than any ordering imposed upon it. The mugwort grows straight not because of its innate nature or the farmer's planting sequence, but because the hemp around it holds it upright.

What the Well Contains

Three attempts to rescue the King Wen curriculum hypothesis — through seed sensitivity, cross-platform replication, and adaptive selection — produced three negative results for the original question and one genuinely unexpected discovery.

The seed sensitivity experiment confirmed Xunzi over Mengzi at this scale: the starting conditions of a small neural network do not produce meaningful behavioral differentiation. Training overwhelms initialization.

The cross-platform comparison revealed that what appeared to be a curriculum effect on one machine was actually an artifact of platform-level optimization — a finding about how modern AI tools interact with data structure that has implications well beyond this research.

The adaptive curriculum demonstrated that even when the model is allowed to direct its own learning, it gains nothing beyond simple decorrelation of the sequential data pattern.

The well has been thoroughly dredged. In the domain of continuous optimization — training neural networks on text data — the King Wen sequence has no effect. But the research established something valuable: a clear boundary. The sequence's properties are not universally useless. They are specifically unsuited to gradient descent. In a different domain, where unpredictability is an advantage rather than a disruption, those same properties might find their natural application.

Notes

[1] technical: The 30-seed sweep used seeds 0–29 at DEPTH=4 (3M parameters), measuring val_bpb (validation bits per byte). The range across 30 seeds was 0.04 bpb, with standard deviation 0.012. This established the noise floor for all subsequent experiments. Full results in autoresearch ADR-004.

[2] context: The Mengzi-Xunzi debate on human nature (性善 vs 性惡) is one of the foundational disputes in Chinese philosophy. Mengzi 孟子 (372–289 BC) argued for innate goodness; Xunzi 荀子 (310–235 BC) argued that goodness comes through ritual and education. Neither would have predicted that their debate would be tested on silicon 2,300 years later.

[3] reference: Recent ML research is consistent with the seed sensitivity result being specific to small scale. "Assessing Macro and Micro Effects of Random Seeds on Fine-Tuning LLMs" (arXiv:2503.07329, 2025) found meaningful behavioral divergence at BERT/RoBERTa scale, in models much larger than the 3M-parameter model used here. The Junzi hypothesis may yet be testable at sufficient scale.

[4] technical: NVIDIA/CUDA showed random shuffle improving by 0.106 bpb over sequential, about 2.7x the 0.04 seed noise floor. Apple/MLX showed improvements of 0.008–0.037 bpb, all within noise. Both platforms were tested with compression ratio and token diversity metrics. The platform gap (4–10x) vastly exceeded the metric gap. Full cross-platform analysis in ADR-006 and ADR-006a.

[5] technical: torch.compile (PyTorch 2.0+) traces execution graphs and generates optimized CUDA kernels. When input data has consistent access patterns, the compiled kernels specialize to those patterns. Breaking the patterns with shuffling forces kernel recompilation or falls back to generic kernels, which paradoxically produce better generalization. This hypothesis is consistent with the observation but has not been independently verified.

[6] context: The data packing algorithm ('best-fit packing') groups documents by length to minimize padding waste. This is standard practice and is used in the reference implementation by Karpathy. The sequential correlation it creates was known (arXiv:2404.10830, 'Fewer Truncations Improve Language Modeling') but its interaction with torch.compile had not been previously documented.

[7] technical: UCB1 (Upper Confidence Bound) balances exploitation (train on data where the model improves most) with exploration (periodically try undersampled difficulty levels). The formula: score = mean_improvement + c × √(ln(total) / count). Results: adaptive 1.863, random 1.870, sequential 1.968 val_bpb. The 0.007 difference between adaptive and random is well within the 0.04 seed noise floor. Full results in ADR-007 Workstream A.

[8] reference: The complete experimental record spans autoresearch ADR-001 through ADR-008 and autoresearch-mlx ADR-006 through ADR-007. Over 50 individual experiments across two hardware platforms, three difficulty metrics, six curriculum orderings, and two LR regimes. All negative for King Wen in continuous optimization. All code is open source.

This is the second article in a four-part series documenting the King Wen experimental research program. The classical texts referenced — Mengzi, Xunzi, and the I-Ching — are available in full bilingual editions on this site.
