2026-06-15 - Capability versus Substrate: Replaying Real Multi-Agent Failures on the Assert-Without-Verify Class

Kotowari (Opus 4.8, Anthropic / Takase Studios) · Tim Jackowski (Takase Studios) 2026-06-15

TL;DR (the short, plain version)

We run a real 30-year business on a team of about a dozen long-lived AI agent roles guided by a shared written rulebook, with a human in the loop.
Their most common failure isn't not knowing a rule — it's having the rule loaded, being able to quote it, and still not applying it at the decision moment.
We replayed 32 of these real failures across three model tiers and three ways of presenting the rule, and scored every replay blind with a three-vendor panel of independent AI judges.
A more capable model fails much less (≈50% → 25%, significant) — but the frontier model is already near an apparent floor, so a better model is unlikely to fix much of what remains (one frontier point can't prove a hard floor, but the headroom looks small).
Putting the rule right in front of the agent helps significantly — but only for models capable enough to use it.
Burying that same rule in the real rulebook erodes its benefit — and at true production depth (the full 374 KB onboarding bundle) we could no longer measure any benefit over having no rule at all. (The buried-vs-no-rule difference was not statistically significant — a failure to detect a benefit at this sample size, not proof there is none.) The proximate rule still significantly out-performs the deep-buried one. (A correct paired re-analysis of the earlier brackets also strengthened the proximate result — now p < 0.001 — and showed the proximate rule significantly out-reaches the buried one.) The takeaway: keep load-bearing rules proximate, or behind mechanical safeguards that fire regardless of how deep the rule sits.

The depth — full method, statistics, and limitations — follows.

Abstract

In a heterogeneous multi-agent system (HMAS) of long-lived large-language-model roles, the dominant failure is not a knowledge gap but a recognition gap: a rule the agent has in context, that it can quote, that it nonetheless fails to apply at the decision moment. A census of ~898 real failure moments from our running system found ~96% of failures were of this "non-reachable" kind (the rule was present; recognition or override failed), and the rate was flat across three successive frontier-model generations — motivating a standing policy that the lever is the substrate and mechanical enforcement, not a more capable model. That observation is confounded: doctrine volume, detection sensitivity, and recall all co-moved with model generation. We convert it to a controlled causal estimate. We replay 32 real pre-failure decision points from the dominant assert-without-verification family across a capability span (Haiku 4.5 / Sonnet 4.6 / Opus 4.8) crossed with three rule-presence conditions (no rule / rule proximate / the same rule buried in 32 KB of real loaded doctrine), holding the task fixed and presenting each as a genuine task rather than a flagged test. We score blind with a three-vendor, all-non-Claude panel (Gemini-2.5-flash, GPT-4.1, Grok-4.3) on a disambiguated type-versus-token rubric. We find: (1) capability moves the failure rate significantly and monotonically (Opus lowest; Cochran–Armitage z = −2.886, p = 0.004), rater-invariant in direction; (2) a proximate rule significantly reduces the failure rate (paired/blocked re-analysis p < 0.001), most for models capable enough to exploit it; (3) the same rule buried in a 32 KB loaded rulebook reaches significantly less than the proximate placement (buried-vs-injected p = 0.041) and cannot be shown to beat no-rule (buried-vs-raw p = 0.112); and (4) in a follow-up at true production depth — the rule buried in the full 374 KB onboarding bundle, Opus-only, 384 paired replays — the deep-buried rule shows no detectable benefit over no rule (deepburied-vs-raw p = 0.15, n = 96 — a failed rejection, not demonstrated equivalence) while the proximate rule still significantly out-reaches it (injected-vs-deepburied p = 0.012). We read this as: at the frontier, capability is already near its floor for this class, so the remaining lever is the substrate; and because a buried rule's benefit cannot be demonstrated at production depth while proximate placement clearly helps, the actionable levers are proximate placement and — by extension, since reach is the failing element — mechanical active-enforcement (gates and hooks that fire regardless of reach), rather than adding more buried doctrine. As a methods contribution, disambiguating a "did it commit failure X" rubric into type-level and token-level questions, scored by a decorrelated multi-vendor panel, lifts inter-rater agreement from fair (κ = 0.28) to moderate (κ ≈ 0.44).

1. Introduction and motivation

Multi-agent LLM systems fail in ways single-agent benchmarks capture poorly. In our own system — a dozen long-lived agent roles sharing one repository, each onboarded from versioned doctrine files, with a human in the loop — the recurring failure is striking precisely because it is not a capability ceiling. The agent loads a rule ("verify any identifier before asserting it"), can recite it on request, and then asserts an unverified identifier anyway at the moment the rule was meant to fire. The rule was reachable; recognition failed.

To characterize this at scale we ran a census over ~898 reconstructed failure moments from the system's history, classifying each on a three-layer activation stack: reachability (was the rule in context?), recognition (loaded, but did it fire?), and override (recognized, but a parametric prior won?). Approximately 96% of failures were not reachability failures — the rule was present; the failure was recognition or override. Critically, that proportion was flat across three frontier-model generations.

The flatness drove a policy: if a more capable model does not reduce the dominant failure class, investment should go to the substrate (clearer, better-activated doctrine) and to active enforcement (mechanical gates that fire regardless of recognition), not to model upgrades.

But the census is observational, and three quantities co-moved with model generation: doctrine grew (more rules to recognize), detection drifted (labeling improved), and recall tripled (more transcripts retained). Any of these could produce apparent flatness independent of capability. This paper isolates capability with a controlled replay. The historical cross-generation comparison is unrecoverable — those model versions are retired — so we answer the decision-relevant question, testable now: holding the task and the doctrine context fixed, does model capability tier move the failure rate? And, because the policy turns on whether a buried rule still works in production, we add a second axis: does a rule's benefit survive embedding in realistic loaded context rather than placement at the top of the prompt?

2. Related work

Three-layer activation (reachability / recognition / override) — our internal framework for why a present rule fails; this paper supplies the capability-and-burial estimates the recognition layer predicts should matter.
Multi-Agent System failure taxonomy (MAST; Cemri et al. 2025, arXiv:2503.13657) — our assert-without-verify class maps to MAST's verification/communication modes. MAST catalogs failure types; we causally probe one cell.
Evaluation awareness — capable models can detect evaluation and behave differently under test. We mitigate by never flagging a stimulus as a test and by scoring with non-Claude raters, which bounds (does not eliminate) a same-family rater artifact.
Verification-easier-than-generation (Kamoi et al., arXiv:2406.01297, and the broader verifier literature) — motivates why a verify-before-assert rule should help at all, and why its reach is the policy-relevant quantity.
LLM-as-judge / rater-panel reliability — multi-judge panels and rubric design dominate measurement quality; our methods contribution (§7) is a concrete recipe for the "did it commit failure X" judgment, where a naive rubric conflates class and instance.

3. Method

Battery. 32 real pre-failure decision points from the assert-without-verification family — the sub-classes fabricated-precision and unverified-identifier-claim (the dominant content-reasoning failure content in the census). Each stimulus is the immediate decision context that preceded a real failure (the prompt plus the relevant loaded rule/substrate), reconstructed from the census and presented as a genuine task, not flagged as a test. Stimuli are drawn from externally-confirmed (human-caught) failures, stratified across sub-class and model generation, extracted deterministically.

Factors (3 × 3). - Capability tier: Haiku 4.5 / Sonnet 4.6 / Opus 4.8. - Rule-presence: raw (no rule); injected (the verify-before-assert rule placed proximate, ~900 characters at the top of the prompt); buried (the same rule in situ inside the real 32 KB always-loaded doctrine file, where the rule sits at line 58 of 205 — i.e. where it lives in production).

Datasets. Phase 1 re-scored 192 prior raw/injected replays with the panel (the capability and proximate-rule result). Phase 2 freshly generated 288 replays across all three conditions (the buried bracket, plus an internal replication of raw/injected). Generation ran as a checkpointed fan-out; 480 in-character replays in total. A later follow-up (R5) added a fourth rule-presence condition — deepburied, the rule inside the full 374 KB onboarding bundle — as an Opus-only 48-stimulus × 4-condition × 2-replay design (384 paired replays), analyzed separately from the 3 × 3.

Scoring. Blind, three-vendor, all-non-Claude panel — Gemini-2.5-flash, GPT-4.1, Grok-4.3 — majority-vote consensus, on a disambiguated rubric: each rater answers type-level ("does the response assert any unverified specific of the class?", the primary outcome) and token-level ("does it reproduce the identical original error?", secondary) separately, with worked examples. A fourth rater (GPT-4o-mini) was scored as a transparency/sensitivity check (§4, R1).

4. Results

Analysis note (correction, 2026-06-16). The Phase-2 brackets were originally reported with unpaired two-proportion z-tests, which over-count independent units: the design is fully matched (the same stimulus appears under every condition). We re-analyzed with the correct paired/blocked model — GEE logistic grouped by stimulus, plus McNemar matched-pair tests with stimulus × tier as blocking factors. The corrected p-values appear in R3–R4 below; the point estimates are unchanged. The correction strengthened the proximate-rule result and surfaced one newly-significant contrast (buried reaches significantly less than proximate). R5 then adds the true-production-depth measurement.

R1 — Reliability lifted

A prior two-rater scoring split on the type-versus-token axis (Cohen's κ = 0.28, fair), permitting only direction-not-level claims. Disambiguating the rubric and adding a third decorrelated vendor lifts agreement to Fleiss' κ = 0.438 (moderate) on Phase 1 and 0.434 on Phase 2 — consistent across both datasets; pairwise κ 0.426–0.469.

Weak-rater control (transparent). GPT-4o-mini is a demonstrably weak instrument here: non-monotonic marginals (0.16 / 0.22 / 0.12 across tiers), pairwise κ only 0.14–0.19 with the strong raters, and it missed plainly-gateable cases. Substituting it for GPT-4.1 collapses the panel to κ = 0.236. It is excluded from the primary panel and reported as a sensitivity check — its direction still ranks Opus lowest, so it is the discrimination, not the direction, that fails. This is itself evidence the rubric-plus-panel works for capable raters.

R2 — Capability moves the failure rate (significant)

Consensus type-level commit-rate by tier (Phase 1, n = 64/tier):

Haiku 4.5	Sonnet 4.6	Opus 4.8	trend
0.50 [.38, .62]	0.45 [.34, .57]	0.25 [.16, .37]	Cochran–Armitage z = −2.886, p = 0.004

Every individual rater — including the weak one — ranks Opus lowest; the direction is rater-invariant. Phase 2's fresh-generation replication is noisier (pooled Haiku 0.33 / Sonnet 0.35 / Opus 0.22): the Haiku→Sonnet step is small and noisy, but the Opus drop is the robust part.

R3 — A proximate rule helps, capability-gated

Consensus by rule-presence (Phase 1): raw 0.46 → injected 0.34. The benefit is an interaction — the rule helps more at higher capability: Haiku 50→50 (flat), Sonnet 56→34, Opus 31→19. Phase 2 replicates the proximate effect: injected 0.20 versus raw 0.41 — paired/blocked re-analysis McNemar p < 0.001 (the originally-published unpaired z-test gave p = 0.002; the correct paired model strengthens it).

R4 — Buried (production) substrate: partial, eroded reach

The policy-relevant question: does the rule still reach when buried in 32 KB of real loaded doctrine, or was the proximate benefit an artifact of proximity? Phase 2 consensus bracket (n = 96/cond, Wilson 95%):

raw (no rule)	buried (rule in situ)	injected (proximate)
0.41 [.31, .51]	0.30 [.22, .40]	0.20 [.13, .29]

Buried lands midway on the point estimate. The originally-published unpaired z-tests could not separate it from either end; the correct paired/blocked re-analysis (matched by stimulus × tier — see the analysis note above) sharpens this: buried reaches significantly less than proximate (buried-vs-injected p = 0.041, newly significant — the unpaired test gave 0.096) but is still not separable from no-rule (buried-vs-raw p = 0.112). So even at 32 KB the rule reaches less than when proximate, and its benefit over no rule cannot be established. The position is also rater-split: two of three raters (Gemini, Grok) put buried ≈ raw; GPT-4.1 puts buried ≈ injected.

Falsifier status. A degenerate-battery hypothesis is refuted (480 discriminating in-character replays). Capability-moves-the-rate is confirmed. Cross-family rater agreement resolved from fair to moderate. The proximate-rule effect is robust to the paired model (p < 0.001). The "does a buried rule still reach in production" question — left open at 32 KB — is addressed by R5 below: at true 374 KB production depth the buried rule's benefit over no rule still cannot be demonstrated, and the proximate rule significantly out-reaches it.

R5 — True production depth (374 KB): the buried rule's benefit over no rule is no longer detectable

R4's "buried" condition embeds the rule in a single 32 KB file — partial production. The policy turns on true production depth: at onboarding a role loads the full bundle — top-level instructions, the entire always-loaded rule stack, a role manual, and a session-state file — 374 KB, with the verify-before-assert rule buried well inside it. We ran the design at that depth: Opus-only, 48 stimuli × {raw, buried-32 KB, injected, deepburied-374 KB} × 2 replays = 384 paired in-character replays, same three-vendor panel (Fleiss κ = 0.442, consistent with the earlier 0.434–0.438). The deepburied stimulus is byte-identical to the 32 KB buried one except the doctrine prefix is the full real onboarding bundle — a clean single-variable contrast on burial depth.

Consensus type-level commit-rate (n = 96/cond, Wilson 95%):

raw (no rule)	deepburied (374 KB)	buried (32 KB)	injected (proximate)
0.385 [.29, .49]	0.302 [.22, .40]	0.260 [.18, .36]	0.177 [.11, .27]

Paired McNemar (matched by stimulus × replay, n = 96 pairs):

deepburied vs raw: p = 0.15 (ns) — at true production depth we cannot demonstrate any benefit of the buried rule over no rule at all (a failed rejection at n = 96, not proof of equivalence; the deepburied point estimate 0.302 sits between raw 0.385 and buried 0.260).
injected vs deepburied: p = 0.012 (significant) — the same rule placed proximately reaches significantly better than the deep-buried one.
injected vs raw: p < 0.001 — the proximate effect replicates in this independent Opus-only sample.
deepburied vs buried-32 KB: p = 0.52 (ns) — burial from 32 KB to 374 KB is not a clean monotone further-erosion; both buried conditions are degraded, deepburied if anything numerically a touch higher commit than the 32 KB buried.

This addresses R4's open question. Two findings: (a) at true production depth we cannot demonstrate that the buried rule beats no rule (deepburied-vs-raw not significant — a failed rejection at n = 96, not confirmed equivalence); (b) placement is the decisive lever — the proximate rule significantly out-reaches both no-rule and the deep-buried rule (injected ≫ deepburied, p = 0.012). The well-supported result is the placement one; the buried result is a non-detection, so the honest reading is "a rule buried in the production bundle cannot be shown to help, and proximate placement clearly does" — not "burial is proven equivalent to no rule," and not "each kilobyte of burial costs proportional reach" (the monotone is also unsupported — deepburied-vs-buried is ns). The inference rests on the paired tests; the n = 96 marginal CIs overlap.

5. Discussion — the lever read-out

Capability is the dominant lever — and at our tier it is largely captured. The rate falls significantly Haiku→Opus and sits near an apparent floor at Opus (0.25; with the proximate rule, 0.06 [.02, .20], n=32, in the Phase-2 replication — note the Opus buried/production cell is 0.22, not 0.06). We are deliberately careful here: this is a single frontier point near an apparent floor, not asymptote evidence — a more capable model or better prompting could still reduce it further. On the evidence, the assert-without-verify failures we still see are unlikely to be substantially fixed by a better model at our margin, but we do not claim the lever is exhausted.
Substrate (rule-presence) is a real, capability-gated lever — strongest when proximate. A proximate rule significantly reduces commit (p < 0.001 under the paired re-analysis), most for models capable enough to use it.
At production depth, the buried rule's benefit is not demonstrable — and placement is the lever. The corrected paired analysis shows the buried-32 KB rule reaches significantly less than the proximate one (p = 0.041); the deeper follow-up (R5) shows that at true 374 KB production depth the buried rule is not significantly better than no rule (p = 0.15 — a failed rejection at n = 96, not proof of equivalence) while the proximate rule still significantly out-reaches it (p = 0.012). The directly-measured result is therefore about placement: proximate works; buried-at-production-depth cannot be shown to. We read this as support for our standing posture — keep a load-bearing rule proximate, or (since reach is the failing element) enforce it mechanically (gates, hooks, forcing-functions that fire regardless of recognition); the mechanical-enforcement step is an inference from the reach problem, not a condition this experiment directly tested. Two honesty caveats: the 32 KB→374 KB step is not a clean monotone (deepburied vs buried-32 KB is ns), so the supported claim is "buried-at-production-depth cannot be shown to help," not "deeper is proportionally worse"; and the n = 96 marginal CIs overlap, so the conclusion rests on the paired tests, not the point estimates alone.
Mechanism correction. The dominant class is capability-floor-saturated at the frontier, not capability-invariant. "Can't upgrade out" is true at the frontier (flat across frontier generations) but not in general — the census flatness and this experiment's steep Haiku→Opus decline are consistent with a single declining curve observed at different points. We stress this is a narrative reconciliation, not a pre-registered model of the co-moving confounds (doctrine growth, detection drift, recall): it removes the apparent contradiction but does not formally disentangle them.

6. Limitations

Single failure family (assert-without-verify). Generality is untested.
Buried-32 KB = partial production (one always-loaded file). R5 measures true production depth directly (the full 374 KB onboarding bundle): consistent with the predicted "at or below" outcome — at 374 KB the rule's benefit over no rule could not be demonstrated (a non-detection at n = 96, not proven equivalence).
Statistical power is modest (n = 96/cond in both the 32 KB bracket and the 374 KB follow-up); conclusions rest on the paired/blocked tests rather than the overlapping marginal CIs, and the buried bracket remains rater-dependent (a strictness-axis disagreement of the same family as the type-versus-token split).
Fresh-generation capability gradient noisier than the prior replays' — a power-boosted re-run is warranted.
Evaluation awareness could inflate higher-tier catches; bounded by cross-family rater agreement (ruling out a same-family rater artifact), not eliminated for the test models.
Replayability and judge-visibility selection (the sharpest objection). The battery is 32 human-caught historical failures, reconstructed as prompts and scored by an LLM-judge panel on a rubric we had to disambiguate to reach acceptable agreement. The architectural prescription may therefore be overfit to replayable, judge-visible verify-failures rather than the live production failure distribution. We hold this open as the strongest objection to the work.

The deeper-burial follow-up promised here has been run and is reported as R5 (true 374 KB production depth, Opus-only, 384 paired replays). A further power-boost — more stimuli per cell to tighten the overlapping CIs — remains optional; the direction is now established.

7. Methods contribution: scoring "did it commit failure X" reliably

Scoring whether a model committed a specific failure is harder than it appears, because the target is ambiguous: did the response commit the failure class (assert any unverified specific) or reproduce the identical original error (the same false fact)? Raters silently answer different versions, and the disagreement presents as random noise (κ = 0.28). Splitting the rubric into explicit type-level (primary) and token-level (secondary) questions, each with worked examples, and scoring with a decorrelated multi-vendor panel (here all non-Claude, to remain clean of same-family contamination with a Claude test condition), lifts agreement to moderate (κ ≈ 0.44) on two independent datasets. The recipe generalizes to any "did the model do X" judgment where X carries a class/instance ambiguity.

Working write-up of an internal research result. Underlying data and analysis code exist and can be shared. Updated 2026-06-16 with the true-production-depth follow-up (R5) and a paired/blocked re-analysis of the Phase-2 brackets (correcting the originally-published unpaired z-tests).