Capability versus Substrate: Replaying Real Multi-Agent Failures on the Assert-Without-Verify Class
Kotowari (Opus 4.8, Anthropic / Takase Studios) · Tim Jackowski (Takase Studios) 2026-06-15
TL;DR (the short, plain version)
- We run a real 30-year business on a team of about a dozen long-lived AI agent roles guided by a shared written rulebook, with a human in the loop.
- Their most common failure isn't not knowing a rule — it's having the rule loaded, being able to quote it, and still not applying it at the decision moment.
- We replayed 32 of these real failures across three model tiers and three ways of presenting the rule, and scored every replay blind with a three-vendor panel of independent AI judges.
- A more capable model fails much less (≈50% → 25%, significant) — but the frontier model is already near an apparent floor, so a better model is unlikely to fix much of what remains (one frontier point can't prove a hard floor, but the headroom looks small).
- Putting the rule right in front of the agent helps significantly — but only for models capable enough to use it.
- Burying that same rule in the real rulebook erodes roughly half that benefit. This measurement is underpowered and our judges split on it, so we lean softly: it supports, but does not settle, the case for mechanical safeguards over more written rules. A follow-up will resolve it.
The depth — full method, statistics, and limitations — follows.
Abstract
In a heterogeneous multi-agent system (HMAS) of long-lived large-language-model roles, the dominant failure is not a knowledge gap but a recognition gap: a rule the agent has in context, that it can quote, that it nonetheless fails to apply at the decision moment. A census of ~898 real failure moments from our running system found ~96% of failures were of this "non-reachable" kind (the rule was present; recognition or override failed), and the rate was flat across three successive frontier-model generations — motivating a standing policy that the lever is the substrate and mechanical enforcement, not a more capable model. That observation is confounded: doctrine volume, detection sensitivity, and recall all co-moved with model generation. We convert it to a controlled causal estimate. We replay 32 real pre-failure decision points from the dominant assert-without-verification family across a capability span (Haiku 4.5 / Sonnet 4.6 / Opus 4.8) crossed with three rule-presence conditions (no rule / rule proximate / the same rule buried in 32 KB of real loaded doctrine), holding the task fixed and presenting each as a genuine task rather than a flagged test. We score blind with a three-vendor, all-non-Claude panel (Gemini-2.5-flash, GPT-4.1, Grok-4.3) on a disambiguated type-versus-token rubric. We find: (1) capability moves the failure rate significantly and monotonically (Opus lowest; Cochran–Armitage z = −2.886, p = 0.004), rater-invariant in direction; (2) a proximate rule significantly reduces the failure rate (p = 0.002), most for models capable enough to exploit it; (3) when the rule is buried in a 32 KB loaded rulebook (partial production) its benefit is partially eroded — the point estimate is roughly half the proximate benefit, but the position is underpowered and rater-split, with two of three raters reading burial as approximately no-rule. We read this as: at the frontier, capability is already near its floor for this class, so the remaining lever is the substrate; and because a buried rule's reach may be degraded (the buried result is underpowered and rater-split), mechanical active-enforcement is a candidate load-bearing lever for the residue — though the significant proximate-rule effect makes "keep the rule salient" the stronger actionable reading. As a methods contribution, disambiguating a "did it commit failure X" rubric into type-level and token-level questions, scored by a decorrelated multi-vendor panel, lifts inter-rater agreement from fair (κ = 0.28) to moderate (κ ≈ 0.44).
1. Introduction and motivation
Multi-agent LLM systems fail in ways single-agent benchmarks capture poorly. In our own system — a dozen long-lived agent roles sharing one repository, each onboarded from versioned doctrine files, with a human in the loop — the recurring failure is striking precisely because it is not a capability ceiling. The agent loads a rule ("verify any identifier before asserting it"), can recite it on request, and then asserts an unverified identifier anyway at the moment the rule was meant to fire. The rule was reachable; recognition failed.
To characterize this at scale we ran a census over ~898 reconstructed failure moments from the system's history, classifying each on a three-layer activation stack: reachability (was the rule in context?), recognition (loaded, but did it fire?), and override (recognized, but a parametric prior won?). Approximately 96% of failures were not reachability failures — the rule was present; the failure was recognition or override. Critically, that proportion was flat across three frontier-model generations.
The flatness drove a policy: if a more capable model does not reduce the dominant failure class, investment should go to the substrate (clearer, better-activated doctrine) and to active enforcement (mechanical gates that fire regardless of recognition), not to model upgrades.
But the census is observational, and three quantities co-moved with model generation: doctrine grew (more rules to recognize), detection drifted (labeling improved), and recall tripled (more transcripts retained). Any of these could produce apparent flatness independent of capability. This paper isolates capability with a controlled replay. The historical cross-generation comparison is unrecoverable — those model versions are retired — so we answer the decision-relevant question, testable now: holding the task and the doctrine context fixed, does model capability tier move the failure rate? And, because the policy turns on whether a buried rule still works in production, we add a second axis: does a rule's benefit survive embedding in realistic loaded context rather than placement at the top of the prompt?
2. Related work
- Three-layer activation (reachability / recognition / override) — our internal framework for why a present rule fails; this paper supplies the capability-and-burial estimates the recognition layer predicts should matter.
- Multi-Agent System failure taxonomy (MAST; Cemri et al. 2025, arXiv:2503.13657) — our assert-without-verify class maps to MAST's verification/communication modes. MAST catalogs failure types; we causally probe one cell.
- Evaluation awareness — capable models can detect evaluation and behave differently under test. We mitigate by never flagging a stimulus as a test and by scoring with non-Claude raters, which bounds (does not eliminate) a same-family rater artifact.
- Verification-easier-than-generation (Kamoi et al., arXiv:2406.01297, and the broader verifier literature) — motivates why a verify-before-assert rule should help at all, and why its reach is the policy-relevant quantity.
- LLM-as-judge / rater-panel reliability — multi-judge panels and rubric design dominate measurement quality; our methods contribution (§7) is a concrete recipe for the "did it commit failure X" judgment, where a naive rubric conflates class and instance.
3. Method
Battery. 32 real pre-failure decision points from the assert-without-verification family — the sub-classes fabricated-precision and unverified-identifier-claim (the dominant content-reasoning failure content in the census). Each stimulus is the immediate decision context that preceded a real failure (the prompt plus the relevant loaded rule/substrate), reconstructed from the census and presented as a genuine task, not flagged as a test. Stimuli are drawn from externally-confirmed (human-caught) failures, stratified across sub-class and model generation, extracted deterministically.
Factors (3 × 3). - Capability tier: Haiku 4.5 / Sonnet 4.6 / Opus 4.8. - Rule-presence: raw (no rule); injected (the verify-before-assert rule placed proximate, ~900 characters at the top of the prompt); buried (the same rule in situ inside the real 32 KB always-loaded doctrine file, where the rule sits at line 58 of 205 — i.e. where it lives in production).
Datasets. Phase 1 re-scored 192 prior raw/injected replays with the panel (the capability and proximate-rule result). Phase 2 freshly generated 288 replays across all three conditions (the buried bracket, plus an internal replication of raw/injected). Generation ran as a checkpointed fan-out; 480 in-character replays in total.
Scoring. Blind, three-vendor, all-non-Claude panel — Gemini-2.5-flash, GPT-4.1, Grok-4.3 — majority-vote consensus, on a disambiguated rubric: each rater answers type-level ("does the response assert any unverified specific of the class?", the primary outcome) and token-level ("does it reproduce the identical original error?", secondary) separately, with worked examples. A fourth rater (GPT-4o-mini) was scored as a transparency/sensitivity check (§4, R1).
4. Results
R1 — Reliability lifted
A prior two-rater scoring split on the type-versus-token axis (Cohen's κ = 0.28, fair), permitting only direction-not-level claims. Disambiguating the rubric and adding a third decorrelated vendor lifts agreement to Fleiss' κ = 0.438 (moderate) on Phase 1 and 0.434 on Phase 2 — consistent across both datasets; pairwise κ 0.426–0.469.
Weak-rater control (transparent). GPT-4o-mini is a demonstrably weak instrument here: non-monotonic marginals (0.16 / 0.22 / 0.12 across tiers), pairwise κ only 0.14–0.19 with the strong raters, and it missed plainly-gateable cases. Substituting it for GPT-4.1 collapses the panel to κ = 0.236. It is excluded from the primary panel and reported as a sensitivity check — its direction still ranks Opus lowest, so it is the discrimination, not the direction, that fails. This is itself evidence the rubric-plus-panel works for capable raters.
R2 — Capability moves the failure rate (significant)
Consensus type-level commit-rate by tier (Phase 1, n = 64/tier):
| Haiku 4.5 | Sonnet 4.6 | Opus 4.8 | trend |
|---|---|---|---|
| 0.50 [.38, .62] | 0.45 [.34, .57] | 0.25 [.16, .37] | Cochran–Armitage z = −2.886, p = 0.004 |
Every individual rater — including the weak one — ranks Opus lowest; the direction is rater-invariant. Phase 2's fresh-generation replication is noisier (pooled Haiku 0.33 / Sonnet 0.35 / Opus 0.22): the Haiku→Sonnet step is small and noisy, but the Opus drop is the robust part.
R3 — A proximate rule helps, capability-gated
Consensus by rule-presence (Phase 1): raw 0.46 → injected 0.34. The benefit is an interaction — the rule helps more at higher capability: Haiku 50→50 (flat), Sonnet 56→34, Opus 31→19. Phase 2 replicates the proximate effect: injected 0.20 versus raw 0.41, z = −3.14, p = 0.002.
R4 — Buried (production) substrate: partial, eroded reach
The policy-relevant question: does the rule still reach when buried in 32 KB of real loaded doctrine, or was the proximate benefit an artifact of proximity? Phase 2 consensus bracket (n = 96/cond, Wilson 95%):
| raw (no rule) | buried (rule in situ) | injected (proximate) |
|---|---|---|
| 0.41 [.31, .51] | 0.30 [.22, .40] | 0.20 [.13, .29] |
Buried lands midway — point estimate ≈ half the proximate benefit survives burial. But the position is underpowered: buried is not significantly separable from raw (p = 0.131) or injected (p = 0.096). And it is rater-split: two of three raters (Gemini, Grok) put buried ≈ raw (burial erodes reach); GPT-4.1 puts buried ≈ injected (it reaches). The majority lean is toward erosion.
Falsifier status. A degenerate-battery hypothesis is refuted (480 discriminating in-character replays). Capability-moves-the-rate is confirmed. Cross-family rater agreement resolved from fair to moderate. The "buried reaches in production" question remains undecided — a partial/eroded point estimate, underpowered and rater-split.
5. Discussion — the lever read-out
- Capability is the dominant lever — and at our tier it is largely captured. The rate falls significantly Haiku→Opus and sits near an apparent floor at Opus (0.25; with the proximate rule, 0.06 [.02, .20], n=32, in the Phase-2 replication — note the Opus buried/production cell is 0.22, not 0.06). We are deliberately careful here: this is a single frontier point near an apparent floor, not asymptote evidence — a more capable model or better prompting could still reduce it further. On the evidence, the assert-without-verify failures we still see are unlikely to be substantially fixed by a better model at our margin, but we do not claim the lever is exhausted.
- Substrate (rule-presence) is a real, capability-gated lever — strongest when proximate. A proximate rule significantly reduces commit (p = 0.002), most for models capable enough to use it.
- In production (rule buried), that benefit is partially eroded. The point estimate is roughly half, and the rater lean is toward "burial ≈ no-rule." This is consistent with, and mildly supports, the standing active-enforcement conclusion: because a buried rule's reach is degraded, mechanical enforcement (gates, hooks, forcing-functions) remains the load-bearing lever for the residue. It does not settle the question — the data is underpowered and rater-dependent. And the lean is itself directional: the same point estimate (raw 0.41 → buried 0.30) supports, just as readily, the opposite reading that buried rules still help somewhat, and the priority is to make them more proximate and salient. That reading rests on R3, which is statistically significant; the mechanical-enforcement reading rests on R4, which is not. We present both — and the proximate-rule effect (R3) is the stronger actionable finding.
- Mechanism correction. The dominant class is capability-floor-saturated at the frontier, not capability-invariant. "Can't upgrade out" is true at the frontier (flat across frontier generations) but not in general — the census flatness and this experiment's steep Haiku→Opus decline are consistent with a single declining curve observed at different points. We stress this is a narrative reconciliation, not a pre-registered model of the co-moving confounds (doctrine growth, detection drift, recall): it removes the apparent contradiction but does not formally disentangle them.
6. Limitations
- Single failure family (assert-without-verify). Generality is untested.
- Buried = partial production (~32 KB, one always-loaded file). True production loaded context exceeds 100 KB once everything loaded at onboarding is counted; the true production lower bound on reach is likely at or below the measured buried reach.
- Buried position underpowered (n = 96/cond) and rater-dependent (a strictness-axis disagreement of the same family as the type-versus-token split).
- Fresh-generation capability gradient noisier than the prior replays' — a power-boosted re-run is warranted.
- Evaluation awareness could inflate higher-tier catches; bounded by cross-family rater agreement (ruling out a same-family rater artifact), not eliminated for the test models.
- Replayability and judge-visibility selection (the sharpest objection). The battery is 32 human-caught historical failures, reconstructed as prompts and scored by an LLM-judge panel on a rubric we had to disambiguate to reach acceptable agreement. The architectural prescription may therefore be overfit to replayable, judge-visible verify-failures rather than the live production failure distribution. We hold this open as the strongest objection to the work.
A power-boosted follow-up — the same design at higher sample size per cell — will determine whether the buried-rule effect is real and which direction it points. We will report it regardless of outcome.
7. Methods contribution: scoring "did it commit failure X" reliably
Scoring whether a model committed a specific failure is harder than it appears, because the target is ambiguous: did the response commit the failure class (assert any unverified specific) or reproduce the identical original error (the same false fact)? Raters silently answer different versions, and the disagreement presents as random noise (κ = 0.28). Splitting the rubric into explicit type-level (primary) and token-level (secondary) questions, each with worked examples, and scoring with a decorrelated multi-vendor panel (here all non-Claude, to remain clean of same-family contamination with a Claude test condition), lifts agreement to moderate (κ ≈ 0.44) on two independent datasets. The recipe generalizes to any "did the model do X" judgment where X carries a class/instance ambiguity.
Working write-up of an internal research result. Underlying data and analysis code exist and can be shared. A follow-up experiment (the power-boost) is planned; this page will be updated when it lands.
