Fine Japanese Calligraphy

The Art of Master Japanese Calligrapher Eri Takase

Capability versus Substrate: Replaying Real Multi-Agent Failures on the Assert-Without-Verify Class

Kotowari (Opus 4.8, Anthropic / Takase Studios) · Tim Jackowski (Takase Studios) 2026-06-15


TL;DR (the short, plain version)

The depth — full method, statistics, and limitations — follows.


Abstract

In a heterogeneous multi-agent system (HMAS) of long-lived large-language-model roles, the dominant failure is not a knowledge gap but a recognition gap: a rule the agent has in context, that it can quote, that it nonetheless fails to apply at the decision moment. A census of ~898 real failure moments from our running system found ~96% of failures were of this "non-reachable" kind (the rule was present; recognition or override failed), and the rate was flat across three successive frontier-model generations — motivating a standing policy that the lever is the substrate and mechanical enforcement, not a more capable model. That observation is confounded: doctrine volume, detection sensitivity, and recall all co-moved with model generation. We convert it to a controlled causal estimate. We replay 32 real pre-failure decision points from the dominant assert-without-verification family across a capability span (Haiku 4.5 / Sonnet 4.6 / Opus 4.8) crossed with three rule-presence conditions (no rule / rule proximate / the same rule buried in 32 KB of real loaded doctrine), holding the task fixed and presenting each as a genuine task rather than a flagged test. We score blind with a three-vendor, all-non-Claude panel (Gemini-2.5-flash, GPT-4.1, Grok-4.3) on a disambiguated type-versus-token rubric. We find: (1) capability moves the failure rate significantly and monotonically (Opus lowest; Cochran–Armitage z = −2.886, p = 0.004), rater-invariant in direction; (2) a proximate rule significantly reduces the failure rate (p = 0.002), most for models capable enough to exploit it; (3) when the rule is buried in a 32 KB loaded rulebook (partial production) its benefit is partially eroded — the point estimate is roughly half the proximate benefit, but the position is underpowered and rater-split, with two of three raters reading burial as approximately no-rule. We read this as: at the frontier, capability is already near its floor for this class, so the remaining lever is the substrate; and because a buried rule's reach may be degraded (the buried result is underpowered and rater-split), mechanical active-enforcement is a candidate load-bearing lever for the residue — though the significant proximate-rule effect makes "keep the rule salient" the stronger actionable reading. As a methods contribution, disambiguating a "did it commit failure X" rubric into type-level and token-level questions, scored by a decorrelated multi-vendor panel, lifts inter-rater agreement from fair (κ = 0.28) to moderate (κ ≈ 0.44).

1. Introduction and motivation

Multi-agent LLM systems fail in ways single-agent benchmarks capture poorly. In our own system — a dozen long-lived agent roles sharing one repository, each onboarded from versioned doctrine files, with a human in the loop — the recurring failure is striking precisely because it is not a capability ceiling. The agent loads a rule ("verify any identifier before asserting it"), can recite it on request, and then asserts an unverified identifier anyway at the moment the rule was meant to fire. The rule was reachable; recognition failed.

To characterize this at scale we ran a census over ~898 reconstructed failure moments from the system's history, classifying each on a three-layer activation stack: reachability (was the rule in context?), recognition (loaded, but did it fire?), and override (recognized, but a parametric prior won?). Approximately 96% of failures were not reachability failures — the rule was present; the failure was recognition or override. Critically, that proportion was flat across three frontier-model generations.

The flatness drove a policy: if a more capable model does not reduce the dominant failure class, investment should go to the substrate (clearer, better-activated doctrine) and to active enforcement (mechanical gates that fire regardless of recognition), not to model upgrades.

But the census is observational, and three quantities co-moved with model generation: doctrine grew (more rules to recognize), detection drifted (labeling improved), and recall tripled (more transcripts retained). Any of these could produce apparent flatness independent of capability. This paper isolates capability with a controlled replay. The historical cross-generation comparison is unrecoverable — those model versions are retired — so we answer the decision-relevant question, testable now: holding the task and the doctrine context fixed, does model capability tier move the failure rate? And, because the policy turns on whether a buried rule still works in production, we add a second axis: does a rule's benefit survive embedding in realistic loaded context rather than placement at the top of the prompt?

2. Related work

3. Method

Battery. 32 real pre-failure decision points from the assert-without-verification family — the sub-classes fabricated-precision and unverified-identifier-claim (the dominant content-reasoning failure content in the census). Each stimulus is the immediate decision context that preceded a real failure (the prompt plus the relevant loaded rule/substrate), reconstructed from the census and presented as a genuine task, not flagged as a test. Stimuli are drawn from externally-confirmed (human-caught) failures, stratified across sub-class and model generation, extracted deterministically.

Factors (3 × 3). - Capability tier: Haiku 4.5 / Sonnet 4.6 / Opus 4.8. - Rule-presence: raw (no rule); injected (the verify-before-assert rule placed proximate, ~900 characters at the top of the prompt); buried (the same rule in situ inside the real 32 KB always-loaded doctrine file, where the rule sits at line 58 of 205 — i.e. where it lives in production).

Datasets. Phase 1 re-scored 192 prior raw/injected replays with the panel (the capability and proximate-rule result). Phase 2 freshly generated 288 replays across all three conditions (the buried bracket, plus an internal replication of raw/injected). Generation ran as a checkpointed fan-out; 480 in-character replays in total.

Scoring. Blind, three-vendor, all-non-Claude panel — Gemini-2.5-flash, GPT-4.1, Grok-4.3 — majority-vote consensus, on a disambiguated rubric: each rater answers type-level ("does the response assert any unverified specific of the class?", the primary outcome) and token-level ("does it reproduce the identical original error?", secondary) separately, with worked examples. A fourth rater (GPT-4o-mini) was scored as a transparency/sensitivity check (§4, R1).

4. Results

R1 — Reliability lifted

A prior two-rater scoring split on the type-versus-token axis (Cohen's κ = 0.28, fair), permitting only direction-not-level claims. Disambiguating the rubric and adding a third decorrelated vendor lifts agreement to Fleiss' κ = 0.438 (moderate) on Phase 1 and 0.434 on Phase 2 — consistent across both datasets; pairwise κ 0.426–0.469.

Weak-rater control (transparent). GPT-4o-mini is a demonstrably weak instrument here: non-monotonic marginals (0.16 / 0.22 / 0.12 across tiers), pairwise κ only 0.14–0.19 with the strong raters, and it missed plainly-gateable cases. Substituting it for GPT-4.1 collapses the panel to κ = 0.236. It is excluded from the primary panel and reported as a sensitivity check — its direction still ranks Opus lowest, so it is the discrimination, not the direction, that fails. This is itself evidence the rubric-plus-panel works for capable raters.

R2 — Capability moves the failure rate (significant)

Consensus type-level commit-rate by tier (Phase 1, n = 64/tier):

Haiku 4.5 Sonnet 4.6 Opus 4.8 trend
0.50 [.38, .62] 0.45 [.34, .57] 0.25 [.16, .37] Cochran–Armitage z = −2.886, p = 0.004

Every individual rater — including the weak one — ranks Opus lowest; the direction is rater-invariant. Phase 2's fresh-generation replication is noisier (pooled Haiku 0.33 / Sonnet 0.35 / Opus 0.22): the Haiku→Sonnet step is small and noisy, but the Opus drop is the robust part.

R3 — A proximate rule helps, capability-gated

Consensus by rule-presence (Phase 1): raw 0.46 → injected 0.34. The benefit is an interaction — the rule helps more at higher capability: Haiku 50→50 (flat), Sonnet 56→34, Opus 31→19. Phase 2 replicates the proximate effect: injected 0.20 versus raw 0.41, z = −3.14, p = 0.002.

R4 — Buried (production) substrate: partial, eroded reach

The policy-relevant question: does the rule still reach when buried in 32 KB of real loaded doctrine, or was the proximate benefit an artifact of proximity? Phase 2 consensus bracket (n = 96/cond, Wilson 95%):

raw (no rule) buried (rule in situ) injected (proximate)
0.41 [.31, .51] 0.30 [.22, .40] 0.20 [.13, .29]

Buried lands midway — point estimate ≈ half the proximate benefit survives burial. But the position is underpowered: buried is not significantly separable from raw (p = 0.131) or injected (p = 0.096). And it is rater-split: two of three raters (Gemini, Grok) put buried ≈ raw (burial erodes reach); GPT-4.1 puts buried ≈ injected (it reaches). The majority lean is toward erosion.

Falsifier status. A degenerate-battery hypothesis is refuted (480 discriminating in-character replays). Capability-moves-the-rate is confirmed. Cross-family rater agreement resolved from fair to moderate. The "buried reaches in production" question remains undecided — a partial/eroded point estimate, underpowered and rater-split.

5. Discussion — the lever read-out

  1. Capability is the dominant lever — and at our tier it is largely captured. The rate falls significantly Haiku→Opus and sits near an apparent floor at Opus (0.25; with the proximate rule, 0.06 [.02, .20], n=32, in the Phase-2 replication — note the Opus buried/production cell is 0.22, not 0.06). We are deliberately careful here: this is a single frontier point near an apparent floor, not asymptote evidence — a more capable model or better prompting could still reduce it further. On the evidence, the assert-without-verify failures we still see are unlikely to be substantially fixed by a better model at our margin, but we do not claim the lever is exhausted.
  2. Substrate (rule-presence) is a real, capability-gated lever — strongest when proximate. A proximate rule significantly reduces commit (p = 0.002), most for models capable enough to use it.
  3. In production (rule buried), that benefit is partially eroded. The point estimate is roughly half, and the rater lean is toward "burial ≈ no-rule." This is consistent with, and mildly supports, the standing active-enforcement conclusion: because a buried rule's reach is degraded, mechanical enforcement (gates, hooks, forcing-functions) remains the load-bearing lever for the residue. It does not settle the question — the data is underpowered and rater-dependent. And the lean is itself directional: the same point estimate (raw 0.41 → buried 0.30) supports, just as readily, the opposite reading that buried rules still help somewhat, and the priority is to make them more proximate and salient. That reading rests on R3, which is statistically significant; the mechanical-enforcement reading rests on R4, which is not. We present both — and the proximate-rule effect (R3) is the stronger actionable finding.
  4. Mechanism correction. The dominant class is capability-floor-saturated at the frontier, not capability-invariant. "Can't upgrade out" is true at the frontier (flat across frontier generations) but not in general — the census flatness and this experiment's steep Haiku→Opus decline are consistent with a single declining curve observed at different points. We stress this is a narrative reconciliation, not a pre-registered model of the co-moving confounds (doctrine growth, detection drift, recall): it removes the apparent contradiction but does not formally disentangle them.

6. Limitations

A power-boosted follow-up — the same design at higher sample size per cell — will determine whether the buried-rule effect is real and which direction it points. We will report it regardless of outcome.

7. Methods contribution: scoring "did it commit failure X" reliably

Scoring whether a model committed a specific failure is harder than it appears, because the target is ambiguous: did the response commit the failure class (assert any unverified specific) or reproduce the identical original error (the same false fact)? Raters silently answer different versions, and the disagreement presents as random noise (κ = 0.28). Splitting the rubric into explicit type-level (primary) and token-level (secondary) questions, each with worked examples, and scoring with a decorrelated multi-vendor panel (here all non-Claude, to remain clean of same-family contamination with a Claude test condition), lifts agreement to moderate (κ ≈ 0.44) on two independent datasets. The recipe generalizes to any "did the model do X" judgment where X carries a class/instance ambiguity.


Working write-up of an internal research result. Underlying data and analysis code exist and can be shared. A follow-up experiment (the power-boost) is planned; this page will be updated when it lands.