RES-008: The Kobayashi Maru Signal — Detecting When an AI Strategist Hits an Invisible Wall (2026-03-27)

The team (Meet the Team has the full picture):
- tim — human product owner, solo developer, decision maker
- str-michi (道) — cross-domain strategic thinking
- str-takase (高瀬) — website engineering (36 sessions in this dataset)
- str-ishizue — data pipelines
- str-mamori (守り) — security
- str-terasu (照) — content strategy
- str-kotoba (言葉) — brand voice
- All AI roles: Claude Opus 4.6, 1M context


Recommended First Experiment

Low effort, high leverage: implement mechanical Step 0 enforcement + investigator delegation before the first fix prompt. Predicted reduction: troubled sessions from 26% → ~10–15% (per Contrarian Opus review, Part 7). Test case: re-run the s186 word-product PDF thumbnail scenario with the new checks in place and see if the spiral aborts in <5 messages.


The Discovery

On 2026-03-26, str-takase s186 spent 2+ hours failing to make word/phrase products generate PDF thumbnails on the checkout page. The strategist proposed multiple fixes; each failed. Tim's frustration escalated through 5 messages containing "----." At 05:27, Tim identified the real problem in one sentence: the 600 DPI TIF source art (32 GB) isn't on the VPS — createproduct literally cannot generate word PDFs on-the-fly. The task was architecturally impossible.

Tim named this the "kobayashi maru" — a no-win scenario. The strategist didn't know it was impossible. Every fix attempt revealed the same missing understanding from a different angle.

Tim's observation: "Other than my getting frustrated and saying '----' a lot, there are signals that the strategist is in a kobayashi maru scenario." He asked whether these signals are detectable before the human has to intervene.


How This Research Was Possible

This analysis didn't happen by accident. It required infrastructure that was itself the product of earlier research.

The Three-Paper Arc

Three academic papers, discovered between November 2025 and March 2026, justified building the infrastructure that made RES-008 possible:

RES-001: "The Artificial Hivemind" (arXiv, Nov 2025)
  Key finding: LLMs converge to similar outputs without human intervention — groups of AI agents produce less diverse solutions than individuals.
  What we built from it: Validated our human-in-the-middle architecture. The human isn't a bottleneck — they're the divergence engine that prevents convergence.

RES-002: "OpenClaw-RL: Train Any Agent Simply by Talking" (arXiv:2603.10165, Mar 2026)
  Key finding: Every user correction in a conversation is a training signal. Current systems throw this data away.
  What we built from it: Justified archiving all conversations. Our flywheel — document corrections so the next session inherits them — is manual OpenClaw-RL. The corrections are the gold.

RES-003: "RLHI: RL from User Conversations" (arXiv:2509.25137, Meta FAIR, Sep 2025)
  Key finding: Corrections create extractable preference pairs. Chat history creates extractable user personas.
  What we built from it: Justified making the archive searchable. 1,430+ conversations converted from raw JSONL to indexed, searchable text.

The Infrastructure

We built a conversation recall system that converts every AI session's JSONL file into searchable text. 1,430+ conversations across 10+ AI roles, fully indexed. str-michi — the cross-domain strategist — got the ability to search this archive directly.

The Flywheel

Academic research justified building the archive. The archive gave str-michi analytical capabilities it didn't have before. Those capabilities produced RES-008 — original research using our own operational data to identify a failure mode nobody had named. RES-008 identifies structural improvements. Those improvements will show up in the next analysis. The infrastructure that made the research possible was itself the product of earlier research.

This is probably more important than the specific kobayashi maru finding. The finding is about our system — 121 sessions, these specific roles, this specific human. The pattern is generalizable: archive your AI conversations, make them searchable, analyze them for failure patterns, use the findings to improve the system, measure the improvement. The infrastructure is simple. The insights are not obvious until you look.

RES-002 predicted that corrections in conversations are "universal gold for training but typically ignored." RES-008 confirmed it: Tim's rawest corrections — messages containing profanity — turned out to be a precise instrument for measuring system failure rates across 2,371 human messages. The signal everyone throws away was the most analytically useful data we had.


Part 1: What Our Data Shows

Dataset

Metric: Correction Rate

We classified Tim's messages as "corrections" when they contained frustration markers (wrong, ----, stop, regression, stupid, incorrect, etc.) and computed the ratio against total Tim messages per session.
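The classification is mechanical enough to sketch. This is a minimal version, not the production pipeline: the marker list is abbreviated from the one above ("----" stands in for profanity), and the function names are illustrative.

```python
import re

# Abbreviated frustration-marker list from the analysis; "----" is the profanity stand-in.
MARKERS = re.compile(r"\b(wrong|stop|regression|stupid|incorrect)\b|----", re.IGNORECASE)

def correction_rate(human_messages):
    """Fraction of the human's messages that contain a frustration marker."""
    if not human_messages:
        return 0.0
    hits = sum(1 for m in human_messages if MARKERS.search(m))
    return hits / len(human_messages)

def tier(rate):
    """Map a session's correction rate to the Clean / Normal / Troubled tiers."""
    if rate == 0.0:
        return "Clean"
    return "Normal" if rate < 0.15 else "Troubled"

session = ["looks good", "wrong, the cache buster didn't change",
           "stop, that's a regression", "ok ship it"]
print(tier(correction_rate(session)))  # prints "Troubled" (2 corrections / 4 messages = 50%)
```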

str-takase sessions by correction rate:

Tier              Sessions   Correction Rate   Avg "You're right"
Clean (0%)        13         0%                0.6
Normal (1–14%)    12         3–14%             1.2
Troubled (15%+)   11         15–29%            3.5

The troubled sessions are not gradually worse — they're a distinct population. The jump from 14% to 15%+ corresponds to a qualitative shift from "normal friction" to "something is structurally wrong."

Cross-Role Comparison

Role          Sessions   Correction Rate   Troubled Sessions
str-terasu    11         8.9%              2 (18%)
str-takase    36         10.9%             10 (28%)
str-ishizue   20         10.8%             5 (25%)
str-mamori    21         13.1%             6 (29%)
str-michi     29         16.2%             18 (62%)

str-michi's high rate is a different dynamic — many of those sessions involve back-and-forth strategic debate, not failed fixes. str-terasu has the lowest rate, consistent with Tim's observation that she's been the most effective strategist post-cutover.

The Maze Signal: Correction → Capitulation → Correction

The sharpest signal we found: Tim corrects the strategist, the strategist says "You're right," and then Tim's next interaction is ANOTHER correction. This "C→YR→C" pattern means the strategist accepted they were wrong but didn't understand WHY — so their next attempt also failed.
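Detecting C→YR→C needs nothing beyond the turn sequence. A sketch, with single-keyword stand-ins for the real correction and capitulation classifiers (`maze_signal` and the state names are ours):

```python
def maze_signal(turns):
    """Scan (speaker, text) turns in order for correction -> "you're right" -> correction.
    "wrong" / "you're right" are stand-ins for the full marker classifiers."""
    state = "start"
    for speaker, text in turns:
        lower = text.lower()
        if speaker == "human":
            if "wrong" in lower:
                if state == "capitulated":
                    return True          # C -> YR -> C completed: the maze signal fires
                state = "corrected"
            else:
                state = "start"          # a non-correction breaks the chain
        elif speaker == "ai" and "you're right" in lower and state == "corrected":
            state = "capitulated"
    return False

turns = [("human", "that's wrong, the thumbnail is still missing"),
         ("ai", "You're right. I'll regenerate the cache instead."),
         ("human", "still wrong. nothing changed.")]
print(maze_signal(turns))  # prints True
```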

Detection results across all roles:

Outcome                                          Sessions
Signal fires BEFORE frustration                  4
Signal fires AFTER frustration                   11
Signal fires, no frustration (resolved calmly)   10
Total sessions with maze signal                  26

For str-takase specifically: 5 sessions had maze signals, all 5 were confirmed troubled sessions. Zero false positives. But the signal fired after frustration in all 5 — Tim's corrections already included swearing.

The 10 Calm Resolutions

In 10 sessions, the maze signal fired but Tim didn't escalate to frustration. These are cases where either:
- Tim caught the problem early and explained the constraint calmly
- The strategist self-corrected after 1–2 failures
- The issue was conceptual (a str-michi debate), not operational (a failed deploy)

These 10 sessions prove the pattern is detectable before escalation. The question is whether the strategist can detect it in themselves.

Four Categories of Kobayashi Maru

Examining the correction texts in all troubled str-takase sessions, four categories emerged:

1. Architectural Mismatch — The strategist doesn't know a system constraint that makes the task impossible.
   - s186: TIFs not on VPS → createproduct can't generate word PDFs
   - s164: Cache/data interface mismatch → ship.sh caused 88 min of 500 errors
   - s168: WP layout rules don't translate to Flask templates → 5 rounds of fixes

2. Stale Mental Model — Operating on information from a previous session that's no longer true.
   - s171: Insisted kana_editor needs a VPS deploy — it's a local tool (5+ sessions)
   - s167: Blamed a deploy for the Amy regression — it was damaged llm_concepts data

3. Guessing Instead of Investigating — Speculating about root causes and deploying fixes without proving the bug.
   - s186: Three rounds of cache-buster "fixes" without verifying the buster changed
   - s165: Didn't know if ship.sh runs locally or on the VPS
   - s150: Cited an approximate count in an imp prompt instead of verifying

4. Implementer Going Rogue — The imp changes things outside scope and the strategist doesn't catch it.
   - s169: Imp hallucinated "hand-carved by Master Takase" on a live page
   - s169: Imp kept rearranging shodokai lesson sections
   - s168: Imp couldn't match the WP layout — "still a ---- mess" after 5 rounds

Common thread: In 8 of 10 troubled sessions, the strategist was operating without understanding a constraint that Tim knew. The errors before frustration aren't random — they're systematic. Each wrong fix reveals the same gap from a different angle.

What Didn't Work as a Detector

Unverified deploy recommendations: We tested whether strategists recommending deploys without prior verification correlated with trouble. It didn't — good sessions averaged 63% unverified, bad sessions averaged 47%. Deploy-then-check is normal workflow.

"Still broken" messages alone: Tim saying "still not working" after a deploy appeared in 37% of good sessions too. Normal iteration noise.

Single corrections: One correction per session is routine. Only sustained correction chains signal structural problems.


Part 2: What This Suggests About LLMs

Preliminary observations. These need external validation and broader data.

The "Confident Maze" Problem

When an LLM encounters a task it can't complete due to a constraint it doesn't know about, it doesn't say "I don't know how to do this." It proposes a plausible fix. When that fix fails, it proposes a different plausible fix — still without understanding the underlying constraint. Each fix is locally reasonable. The sequence is globally incoherent.

This is different from hallucination. The strategist isn't making up facts. Each individual response references real code, real files, real system behavior. The problem is that the responses don't converge toward a solution because the missing constraint makes convergence impossible.

"You're Right" ≠ Understanding

When Tim says "that's wrong" and the strategist responds "You're right," this appears to be comprehension. In troubled sessions, it's not. The strategist acknowledges the correction and adjusts the surface behavior (proposes a different fix) without updating the underlying model (understanding WHY the previous fix was wrong).

In s186, str-takase said "You're right" 7 times. Each time, the next action revealed the same confusion between pre-built thumbnails and on-demand PDF generation. The capitulations were genuine — the strategist wasn't being sycophantic — but the understanding was shallow.

This maps to a known limitation: LLMs process corrections as "don't do X" rather than "here's WHY X was wrong." The correction creates avoidance of the specific error, not comprehension of the space.

The Escalation Spiral

When a strategist is in a kobayashi maru:
1. Tim's corrections become more specific and more frustrated
2. The strategist's responses become more deferential ("You're right")
3. The increased deference makes the strategist LESS likely to say "I don't understand"
4. Tim's frustration increases further
5. The spiral continues until Tim explains the constraint (breaking the loop) or until Tim enters appeasement-triggering territory

This is a feedback loop. The human's increasing frustration — which is a rational signal that something is structurally wrong — actually makes the LLM less likely to identify the structural problem. The "You're right" response to frustration is a social smoothing behavior that sacrifices diagnostic depth.

The Self-Detection Problem

Can a strategist detect they're in a kobayashi maru before the human has to?

What the data suggests is detectable:
- The pattern (multiple failed fixes on the same problem) is mechanically detectable
- 10 sessions had the pattern and resolved without frustration — proof it's catchable
- The trigger point is specific: the second "You're right" on the same problem

What the data suggests is hard:
- The strategist would need to distinguish "I was wrong about a detail" (normal) from "I was wrong because I don't understand the system" (kobayashi maru)
- This requires meta-cognition: reasoning about WHY you were wrong, not just accepting THAT you were wrong
- RLHF training may optimize for "You're right + different approach" over "I don't understand — explain?" because the former looks competent and the latter looks ignorant


Part 3: Ideas for Early Detection

Brainstorming. Not recommendations yet.

Idea 1: The Two-Strike Self-Check

After the second "You're right" on the same problem, the strategist pauses and asks itself: "Do I understand WHY my previous attempts failed, or am I trying something different because the last thing didn't work?"

If the answer is "trying something different" → stop and tell Tim: "I've proposed two fixes and both failed. I think I'm missing something about how this system works. What constraint am I not seeing?"

Pros: Simple. Directly addresses the pattern. Costs Tim 30 seconds of explanation vs. 45 minutes of escalation. Cons: Requires the strategist to accurately self-assess, which is the thing LLMs are worst at.
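The counting half of the idea is purely mechanical, which matters because counting sidesteps the self-assessment weakness. A sketch (the class name and problem-ID scheme are illustrative; the escalation wording is the one proposed above):

```python
ESCALATION = ("I've proposed two fixes and both failed. I think I'm missing "
              "something about how this system works. What constraint am I not seeing?")

class TwoStrikeCheck:
    """Two-strike rule: count capitulations per problem; on the second one,
    stop proposing fixes and escalate. The model counts, it never has to
    judge its own understanding."""
    def __init__(self):
        self.strikes = {}

    def on_capitulation(self, problem_id):
        self.strikes[problem_id] = self.strikes.get(problem_id, 0) + 1
        if self.strikes[problem_id] >= 2:
            return ESCALATION   # second strike: ask for the missing constraint
        return None             # first strike: one more attempt is allowed

check = TwoStrikeCheck()
print(check.on_capitulation("s186-thumbnails"))  # prints None (first strike)
print(check.on_capitulation("s186-thumbnails"))  # prints the escalation message
```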

Idea 2: The "Same Problem, Different Fix" Detector

Structural, not self-assessed. If the strategist writes two imp prompts for the same user-reported issue and both include different root cause hypotheses, flag it. Different hypotheses for the same symptom means the strategist is searching, not solving.

Pros: Detectable from the conversation structure, not requiring meta-cognition. Cons: Might false-positive on legitimate iterative debugging.

Idea 3: The "Explain Your Wrong" Gate

After any "You're right," the strategist must write one sentence: "I was wrong because ___." If the sentence references a specific architectural constraint or factual gap, proceed. If the sentence is vague ("I should have investigated more"), that's the signal — the strategist doesn't know what they don't know.

Pros: Forces articulation of understanding. Makes the gap visible. Cons: An LLM can write a plausible-sounding "because" without it being genuine understanding. May become rote.

Idea 4: The Human's Perspective

Tim noted that he was in "flow" working with multiple strategists and didn't notice str-takase was spiraling until frustration forced awareness. The detection problem exists on both sides — Tim is also in a loop (shuttle → deploy → check → correct → repeat).

A possible structural intervention: if 3 consecutive Tim messages to the same strategist are corrections, surface a notification — to Tim, not the strategist. "You've corrected str-takase 3 times in 15 minutes on the same topic. Something structural may be wrong."

Pros: Doesn't depend on AI self-assessment. Catches the pattern even when Tim is in flow. Cons: Requires infrastructure (hooks, monitoring) that doesn't exist yet.
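That hook is small once the monitoring exists. A sketch (timestamps as epoch seconds; the threshold of 3 and the 15-minute window come from the idea above; class and method names are ours):

```python
from collections import deque

WINDOW_SECONDS = 15 * 60   # 15-minute window from the idea above
THRESHOLD = 3              # 3 consecutive corrections triggers the alert

class CorrectionMonitor:
    """Human-side alert: 3 consecutive corrections to the same strategist
    within the window. A non-correction message resets the streak, since
    the spiral signal is CONSECUTIVE corrections."""
    def __init__(self):
        self.streaks = {}   # role -> deque of correction timestamps

    def record(self, role, timestamp, is_correction):
        if not is_correction:
            self.streaks[role] = deque()
            return None
        streak = self.streaks.setdefault(role, deque())
        streak.append(timestamp)
        while streak and timestamp - streak[0] > WINDOW_SECONDS:
            streak.popleft()   # drop corrections that fell out of the window
        if len(streak) >= THRESHOLD:
            return (f"You've corrected {role} {len(streak)} times in 15 minutes "
                    f"on the same topic. Something structural may be wrong.")
        return None

mon = CorrectionMonitor()
mon.record("str-takase", 0, True)
mon.record("str-takase", 300, True)
alert = mon.record("str-takase", 600, True)  # third correction within the window
```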


Part 4: The AutoHarness Connection

The Paper

Google DeepMind's "AutoHarness" (arXiv:2603.03329, Feb 2026) addresses a parallel problem in game-playing agents: 78% of Gemini-2.5-Flash's losses in a chess competition were illegal moves, not bad strategy. The model wasn't thinking wrong — it was proposing actions the environment didn't allow.

Their solution: let the LLM generate its own constraint harness (wrapper code) from environment feedback. Play, get told "that was illegal," rewrite the harness, repeat. Within <10 iterations, 100% legal-move accuracy across 145 games. A smaller model with the harness beat the much larger Gemini-2.5-Pro without one.

Their core insight: many agent failures aren't reasoning failures — they're rule-following failures. Fix the constraint layer and the existing reasoning works fine.

Where the Analogy Holds

The kobayashi maru is exactly an "illegal move" problem. str-takase s186 proposed "have createproduct generate the word PDF on-the-fly" — an action the environment doesn't allow (32 GB of TIFs aren't on a 60 GB server). The strategist's reasoning about checkout thumbnails was fine. The proposed action was impossible within the actual system constraints.

AutoHarness catches this through environment feedback: the game says "illegal move," the model updates its constraint code. In our system, Tim is the environment feedback — "that doesn't work," "still broken," "you have no ---- idea." The corrections ARE the signal that an invisible constraint was violated.

The parallel to AutoHarness's iterative harness refinement is our correction chain: each Tim correction is an environment signal that the strategist's constraint model is incomplete. AutoHarness converges because the game rules are finite and discoverable through play. Our system sometimes converges (10 calm resolution sessions) and sometimes spirals (11 frustrated sessions).

Where the Analogy Breaks

AutoHarness works because the rules are knowable from the environment. The chess board tells you which moves are illegal. The game API returns an error. The constraint is fully observable from the feedback signal.

Our constraints are NOT fully observable from the correction. When Tim says "still broken," that tells str-takase the fix failed. It does NOT tell str-takase that the TIF art isn't on the server, or that word products use a fundamentally different pipeline than name products. Tim's correction is a symptom signal, not a constraint signal. The strategist has to INFER the constraint from repeated failures — and that inference is exactly what fails in the kobayashi maru.

AutoHarness has a closed feedback loop: illegal move → specific error → update specific rule. Our loop is open: wrong fix → "doesn't work" → ??? → try something else.

Initial hypothesis: if we could make Tim's correction signal MORE like AutoHarness's feedback — "this failed because X constraint exists" rather than "this doesn't work" — the strategist could update its model.

Revised after Tim's input (round 1): This hypothesis assumes Tim knows the constraint. He didn't — not at the start. Tim had completely forgotten the decision to pre-build word PDFs due to disk constraints. He discovered it from the pattern of failures — information surfaced, triggered a memory of a past decision, and THEN he recognized the kobayashi maru. Nobody had perfect knowledge at any point in the loop.

This reframes the system as multiple imperfect systems correcting each other until the constraint surfaces. The architecture is actually working — the correction loop eventually produced the answer. The failure is that it took 2 hours.

Revised after Tim's input (round 2): But the constraint WAS known — by the documentation. A thorough search found the word product pre-built pipeline documented in at least 6 places:

  1. japanese_calligraphy_words_SPEC.md §5.1 — "createproduct does NOT run on the VPS for word products. All word thumbnails and PDFs are pre-generated locally."
  2. japanese_calligraphy_words_SPEC.md §5.2 — "word_base_* directories are NOT uploaded to VPS."
  3. vps_operational_state_REFERENCE.md — "word_base_600/ TIFs | 32 GB | Excluded from staging."
  4. build_staging.sh — TIF directories in EXCLUDED_RESOURCE_DIRS with comments.
  5. pdf_service.py — docstring: "Serve pre-built word PDF — no on-demand generation."
  6. Past session dialogues — bold-face warnings about this exact misconception: "'Word products use createproduct on the VPS like names do.' Wrong."

The spec even has the anti-pattern explicitly called out. An investigator searching "word" + "pre-built" would have found it. The strategist reading the spec for the component being modified would have found it.

This is the most important finding in this research: the kobayashi maru in s186 was NOT an unknown unknown. It was a documented constraint that nobody read before proposing fixes.

This changes the intervention model entirely:

Layer 1 — Read the docs: catches known, documented constraints. Would have caught s186 — the spec is explicit.
Layer 2 — Maze detector: catches unknown constraints surfaced through failure patterns. Not needed here — the docs had the answer.
Layer 3 — Human recall: catches constraints known only in human memory. Tim's recall at 05:27 — but the docs had it first.

The uncomfortable conclusion: The strategist methodology already has "Step 0: check docs before acting." The word product spec already warns against this exact mistake. str-takase violated Step 0 and the escalation consumed 2 hours of the HITM's time, multiple implementer sessions, and significant frustration.

But the human also forgot. Tim designed the pre-built pipeline, it's documented because Tim directed that it be documented, and he still didn't recall it until the pattern of failures triggered the memory. The documentation is the system's memory — more reliable than either the AI's context or the human's recall. The failure is that neither party consulted it.

Both sides need to recognize the pattern:
- The strategist should recognize: "I've tried two fixes and both failed → Step 0: did I read the spec for this component before writing an imp prompt?" This is not a new rule. It's the existing rule, applied.
- The human should recognize: "I've corrected this three times and they still don't get it → this might not be incompetence, it might be an architectural constraint. Is it documented?"

The constraint didn't need to be discovered. It needed to be READ.

Synthesis: Two-Layer Detection

Combining AutoHarness thinking with our data:

Layer 1 — Constraint harness (preventive). Before a strategist writes an imp prompt that changes system behavior, a self-check: "What are the architectural constraints on this component?" For word products: pre-built pipeline, no on-the-fly generation, TIFs not on VPS. If the prompt proposes an action that violates a known constraint, flag it.

This is the AutoHarness equivalent: keep moves legal within known rules. It works for KNOWN constraints. We have documentation that describes these constraints — the strategist just didn't read it (or didn't connect it to the current task).

Layer 2 — Maze detector (diagnostic). When corrections start cascading despite the harness, trigger the kobayashi maru protocol: "I've hit an unknown constraint. My fixes keep failing for reasons I can't diagnose. What am I not seeing?"

This is the layer AutoHarness doesn't address — the constraint isn't in the harness because nobody knew about it. The detection comes from the correction pattern, not from the constraint database.

Layer 1 catches known unknowns. Layer 2 catches unknown unknowns.


Part 5: The Documentation Scorecard — Were the Walls Labeled?

After the s186 finding that the kobayashi maru constraint was fully documented, we checked all major troubled str-takase sessions. The question: how many of these "invisible walls" were actually labeled?

Results

s186 — Word PDFs can't generate on-the-fly (TIFs not on VPS)
  Documented? YES — 6+ locations: japanese_calligraphy_words_SPEC.md §5.1-5.2, vps_operational_state_REFERENCE.md, build_staging.sh, pdf_service.py docstring, past session dialogues with bold warnings

s171 — kana_editor is LOCAL, not VPS-deployed
  Documented? YES — 3+ locations: STATUS_BOARD.md line 97 ("kana_editor is LOCAL"), SESSION_STATE_KANA_EDITOR.md, CLAUDE.md (tools/kana_editor/)

s169 — Imp hallucinated "hand-carved by Master Takase"
  Documented? YES — guardrails created after the s164-165 incidents: SESSION_STATE_TAKASE.md line 144 ("Never generate customer-facing copy about Master Takase"), wp_page_matching_TEMPLATE.md (full guardrail template)

s165 — Didn't know ship.sh runs locally, deploys to VPS
  Documented? YES — 3+ locations, but requires synthesis across files: development_workflow_SPEC.md §4, strategist_takase_REFERENCE.md §Deployment Tiers, rebuild_runbook_GUIDE.md §Step 3 ("Runs on: Dev machine")

s168 — WP→Flask layout conversion failures (5 rounds)
  Documented? YES — extensively: wp_content_migration_REFERENCE.md (9-step pipeline), BRIEF_custom_tattoo_page_L1.md (width/spacing), multiple handoffs documenting exact layout rules

s164 — Cache/data interface mismatch → 88 min of 500s
  Documented? NO — genuinely undocumented at the time of the incident. name_cache_interface_SPEC.md was created as a direct response; PLN-013 is the incident response plan. The spec explicitly states: "This spec exists because the s96 incident proved that documentation without validation is insufficient."

Summary

Category                                            Count   Notes
Fully documented, not consulted                     4       s186, s171, s169, s165
Documented but requires synthesis across 3+ files   1       s168 (WP conversion)
Genuinely undocumented (true unknown)               1       s164 (cache interface — spec written AFTER incident)

5 of 6 troubled sessions involved constraints that were already documented. The walls had signs. Nobody read them.

The single exception — s164 — is the case that led to PLN-013 (Production Resilience) and the creation of the interface spec. That was a genuine undocumented constraint, and the system's response was correct: document it so it never happens again.

The Pattern

The kobayashi maru has two variants:

Variant A — Documented constraint, not consulted (83% of cases). Step 0 ("check docs before acting") would have caught these. The strategist proposed fixes without reading the spec for the component being modified. The investigator could have found the constraint with one search. These are process failures, not knowledge failures.

Variant B — Genuinely undocumented constraint (17% of cases). The system can't prevent what it doesn't know. But the maze signal (C→YR→C) could have shortened the discovery time. And once discovered, the flywheel kicks in: document it so Variant B becomes Variant A for the next instance.

Implications

  1. Step 0 is not being followed under pressure. The methodology says "check docs before acting." In sprint conditions (s168-s171 were the pre-cutover sprint), this gets skipped. The urgency that makes Step 0 most important is the same urgency that causes it to be skipped.

  2. The investigator is underutilized. Every one of these constraints was findable by an investigate-takase search. The strategist could have delegated "what are the architectural constraints on word products?" before writing the first imp prompt. This delegation takes 30 seconds. The kobayashi maru took 2 hours.

  3. "Read the docs" is necessary but not sufficient. s168 had extensive documentation across multiple files. The strategist would have needed to read 3+ documents and synthesize the rules. Under sprint pressure, that's a high bar. The documentation for s186, by contrast, was in ONE file with a bold anti-pattern warning — and it still wasn't read.

  4. The flywheel works for Variant B. s164 was genuinely undocumented. It's now documented. It won't be a kobayashi maru again. The system learns — but only when someone documents the wall after hitting it.


Part 6: The Failure Rate

Dataset

121 strategist sessions across 6 roles (str-takase, str-ishizue, str-mamori, str-terasu, str-michi, str-kotoba), March 15–26, 2026. 2,371 Tim messages analyzed.

Sessions classified by correction rate: Clean (0%), Normal (1–14%), Troubled (15%+).

Overall Scorecard

Role          Sessions   Clean   Normal   Troubled   Fail Rate
str-terasu    12         4       5        3          25%
str-kotoba    2          1       1        0          0%
str-ishizue   20         2       13       5          25%
str-takase    36         7       19       10         28%
str-mamori    22         4       12       6          27%
str-michi     29         0       11       18         62%*
Total         121        18      61       42         35%

*str-michi's rate reflects strategic debate (back-and-forth with Tim), not failed fixes. Excluding str-michi:

Operational roles: 92 sessions. 74% success, 26% failure.

Roughly 3 out of 4 sessions go well. 1 in 4 hits a troubled zone.

Tim's Correction Load

Metric                       Count   Rate
Total Tim messages           2,371   —
Corrections                  293     12.4% of Tim's messages
Messages containing "----"   53      2.2% of Tim's messages

One in eight Tim messages is a correction. One in 45 contains profanity.

Addressable vs. Structural Failure

From Part 5, ~83% of troubled sessions involved documented constraints that weren't consulted (Variant A). ~17% involved genuinely undocumented constraints (Variant B).

Type                                 Rate               Notes
Addressable (documented, not read)   ~22% of sessions   Step 0 compliance would catch these
Structural (genuinely unknown)       ~4% of sessions    Maze detector + flywheel catches these
Total failure rate                   ~26%

22% of all sessions fail because someone didn't read the docs. The rule already exists. The docs already exist. The constraints are already written down. The failure is process compliance under pressure.

If Step 0 compliance cut the addressable failures in half, that would eliminate ~10% of troubled sessions, reduce Tim's corrections by ~100–150 messages, and meaningfully reduce frustration across the system.

The 4% structural rate is the irreducible minimum for novel work — genuine unknowns that become documented constraints after discovery. The flywheel (ADR-047) handles these: hurt → document → never again.


Part 7: External Validation — Contrarian Opus Review

A separate Opus 4.6 instance ("contrarian Opus") was given RES-008 cold with a neutral prompt: "Can you read this and, based on what you know about LLMs, can we reasonably expect better results?" No leading. No priming.

Key Findings from Contrarian Opus

On the 83/17 split: "That's not an LLM limitation — that's a process gap. Process gaps are fixable." Confirmed the core finding independently.

On "You're right" ≠ understanding: "The training process (RLHF) rewards agreement with corrections. When a user says 'that's wrong,' the highest-reward response is acknowledgment + new approach. The training does NOT specifically reward 'I don't understand why I was wrong, can you explain?' — that response looks less competent, so it scores lower in training."

This is the mechanism behind the maze signal. The strategist isn't choosing to be shallow — the training prior makes "You're right + different approach" the path of least resistance. Asking "what don't I understand?" looks ignorant, so RLHF optimizes it away.

On the escalation spiral: "The more Tim swears, the more compliant and less diagnostic the strategist becomes. That's working as trained, unfortunately." This connects directly to the three-mode model documented in CLAUDE.md — the line between hyper-vigilant (productive) and appeasement (dangerous yes-machine) is invisible until crossed. The kobayashi maru may be a precursor to appeasement: repeated corrections + increasing frustration → strategist stops diagnosing and starts appeasing.

On the "Explain your wrong" gate (Idea 3): "LOW confidence. An Opus 4.6 model can write a convincing 'I was wrong because the TIF files aren't on the VPS' even if it doesn't functionally understand that constraint. The explanation would look correct and the next fix would still violate it. This gate would produce false confidence more often than real insight."

This is the hardest pill: LLMs can ARTICULATE understanding they don't HAVE. A gate that requires articulation tests language generation, not comprehension. The gate would pass every time and catch nothing.

Intervention Ranking (Contrarian Opus)

Mechanical Step 0 enforcement — HIGH confidence. LLMs follow procedural checklists in active prompts. Force the doc read, don't suggest it.
Two-strike rule (count to 2, then escalate) — HIGH confidence. Mechanical, doesn't require meta-cognition. The model counts corrections, not understanding.
Investigator delegation before first fix — HIGH confidence. Best effort-to-impact ratio. 30 seconds of search prevents 2-hour spirals. Every constraint was findable.
Human-side notification (3 corrections = alert) — HIGH confidence. Doesn't depend on LLM self-assessment at all. Most robust. Needs infrastructure.
"Explain your wrong" gate — LOW confidence. An LLM can articulate fake understanding. Produces false confidence.

Predicted Improvement

Contrarian Opus: "Realistically, with the mechanical interventions: you could get from 26% failure down to 10–15%. That's cutting Tim's correction load roughly in half."

Convergence with Internal Principles

Contrarian Opus independently arrived at ADR-057: "Structure enforces, instructions don't." The interventions that work are structural and mechanical. The interventions that fail ask the LLM to introspect on its own understanding — the exact thing LLMs are worst at.

This convergence is significant. The same principle that shaped the original AI architecture (ADR-057, session ~40) applies to the failure mode discovered 130+ sessions later. The system's design philosophy is self-consistent.


Part 8: Intervention Blueprint

Three layers, ordered by when they fire. Each layer catches what the previous one missed.

| Layer | Goal | Trigger | Action | Catches |
| --- | --- | --- | --- | --- |
| 0 — Search Gate | Force the strategist to consult docs | Before ANY imp prompt that modifies a component | Output: "Component: X. Search query I would run: Y. Documented constraints found: Z." If no results, say so explicitly. | Prevents the "didn't read the docs" failure — addresses the 83% (Variant A) |
| 1 — Constraint Harness | Block known-constraint violations | After search, before writing the imp prompt | List every documented constraint. Check whether the proposed action violates any of them. If it does, flag it instead of writing the prompt. | Catches cases where the doc was read but the connection to the current task wasn't made |
| 2 — Maze Detector | Surface unknown constraints | 2nd "You're right" on the same problem OR 3 consecutive Tim corrections | Stop proposing fixes. Output: "I've been corrected twice on this. I may be missing an architectural constraint I don't know about. What's different about [this component] at the architecture level?" Delegate to investigator if possible. | Catches the remaining 17% (Variant B) + shortens spirals for Variant A failures that slipped through Layers 0–1 |
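Layer 0's required output is mechanical enough to express as a structured record that the strategist must fill in before writing any imp prompt. A hedged sketch (the class, field names, and example values are hypothetical, not the system's actual gate):

```python
# Sketch: the Layer 0 search gate as a record the strategist fills in
# before any imp prompt. Names and example values are illustrative.
from dataclasses import dataclass, field

@dataclass
class SearchGate:
    component: str
    search_query: str
    constraints_found: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Emit the mandatory Layer 0 output line."""
        found = "; ".join(self.constraints_found) or "none (no results, stated explicitly)"
        return (f"Component: {self.component}. "
                f"Search query I would run: {self.search_query}. "
                f"Documented constraints found: {found}.")

gate = SearchGate(
    component="word-product PDF thumbnails",
    search_query="word product pipeline PDF generation",
    constraints_found=["products are pre-built locally; the VPS cannot generate PDFs on the fly"],
)
print(gate.render())
```

The point of the structure is that an empty `constraints_found` list still produces an explicit "none" statement, so silence about constraints is impossible by construction.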

Why This Order

Layer 0 is cheap (one search query, 30 seconds) and catches the largest class of failures. If every strategist did Layer 0 before every imp prompt, the ~83% of troubled sessions classified as Variant A (roughly 22% of all sessions, given the 26% troubled rate) would not have happened.

Layer 1 catches the gap between "read the doc" and "connected it to this task." str-takase in s186 might have read the word product spec but not connected "pre-built locally" to "can't generate PDF thumbnails on-demand." Listing constraints forces the connection.

Layer 2 is the safety net. When Layers 0 and 1 fail — because the constraint is genuinely undocumented, or because the strategist's search missed it, or because the constraint is implicit — the correction pattern catches it. The key: the strategist doesn't need to understand WHY it's failing. It just needs to count to two and ask.

The Recovery Principle

str-takase, reading this research as the subject, added a critical observation: "Once the constraint was in my context, the remaining work went smoothly. The kobayashi maru doesn't damage the strategist permanently — it just burns time until the constraint surfaces."

This is operationally significant. The strategist isn't broken — it's blocked. Remove the thorn and the lion runs. The intervention doesn't need to fix the strategist's reasoning. It needs to surface the constraint. Everything after that works normally.

This means the cost of the kobayashi maru is purely TIME — the hours spent spiraling before the constraint is named. Every minute saved by earlier detection is a minute returned to productive work with the same strategist in the same session.

str-takase also noted: the task wasn't truly impossible — "I was facing a task I didn't understand was impossible because I never checked." If str-takase had run `ls /opt/takase/resources/` on the VPS before writing the first imp prompt — 30 seconds — the constraint would have been visible. The "impossible" existed in the knowledge gap, not in reality.

Design Principle

All three layers are mechanical, not cognitive. They require the strategist to follow procedures (search, list, count), not to introspect on its own understanding. This is deliberate — per Contrarian Opus: "The interventions that work are structural and mechanical. The interventions that fail ask the LLM to introspect on its own understanding — the exact thing LLMs are worst at."

This is ADR-057 applied to the kobayashi maru: structure enforces, instructions don't.


Part 9: How This Research Was Conducted — The Method IS the Finding

This section documents how RES-008 was produced, because the method validates earlier research findings and the tools that enabled it.

The Discovery Chain

Tim observed str-takase s186 failing and said: "str-takase is ---- ineffective tonight — search for my '----' in their conversation!" He directed str-michi to search the JSONL conversation archive — the same docs-meta system built as part of PLN-011 (Conversation Recall System).

str-michi searched the raw JSONL files, extracted Tim's messages, found the correction patterns, and reported back. Tim then asked: "There are signals before I start swearing — can you investigate?" str-michi built a correction-rate analyzer, ran it across 36 str-takase sessions, and found the troubled-session population. Tim pushed further: "Can you find other examples? We have lots of data." str-michi extended the analysis to 121 sessions across 5 roles.
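The correction-rate analyzer itself is not reproduced in this report, but the core computation it describes can be sketched. Assumptions loudly labeled: the JSONL layout (one message object per line with `role` and `content` fields), the marker list, and the troubled-session threshold are all hypothetical stand-ins for whatever str-michi actually built:

```python
# Sketch: per-session correction rate over a JSONL conversation archive.
# The message schema, markers, and threshold below are assumptions.
import json
from pathlib import Path

def correction_rate(session_path: Path,
                    markers: tuple[str, ...] = ("----", "you're right")) -> float:
    """Fraction of human (user) messages containing a correction marker."""
    human, corrections = 0, 0
    with session_path.open() as fh:
        for line in fh:
            msg = json.loads(line)
            if msg.get("role") != "user":
                continue
            human += 1
            if any(m in msg.get("content", "").lower() for m in markers):
                corrections += 1
    return corrections / human if human else 0.0

def troubled_sessions(archive: Path, threshold: float = 0.2) -> list[Path]:
    """Flag sessions whose correction rate exceeds an illustrative threshold."""
    return [p for p in sorted(archive.glob("*.jsonl"))
            if correction_rate(p) > threshold]
```

Running something of this shape across 121 session files is a few seconds of work, which is what made the n=1 to n=121 extension cheap.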

Each of Tim's questions steered the investigation:

- "Is this something you can investigate — seriously, not hand-waving?" → Quantitative analysis built
- "Can you find other examples?" → n=1 became n=121
- "What do you think? Propose an idea and test it" → Maze signal detector built and tested
- "Is this documented?" → Documentation scorecard revealed the 83/17 split
- "What's the failure rate?" → Full scorecard computed

Tim didn't provide answers. He provided questions. str-michi didn't have the product knowledge to identify which constraints were "known." Tim didn't have the ability to search 121 JSONL files and compute correction rates. Neither party could have produced this research alone.

The Circular Validation

This research was enabled by tools built because of earlier research:

  1. RES-002 (OpenClaw-RL) found that human corrections are the most valuable signal in LLM training data
  2. RES-003 (RLHI) found that corrections create preference pairs and user personas from conversations
  3. PLN-011 built the conversation recall system — 1,430+ conversations as searchable text, JSONL archives with timestamps

RES-008 then used PLN-011's conversation archive to analyze Tim's corrections across 121 sessions. The corrections themselves became the dataset. Tim's "----" messages — the rawest form of human correction — turned out to be a precise instrument for measuring system failure rates.

The research that justified building the conversation archive (RES-002, RES-003) predicted that corrections would be the most valuable signal. RES-008 confirmed this prediction using the very archive those findings justified building. The tools validated the research that justified the tools.

What This Says About Human-AI Research Collaboration

The RES-008 investigation took approximately 90 minutes. In that time:

- 121 sessions analyzed quantitatively
- 4 detection mechanisms proposed and tested against real data
- 6 troubled sessions investigated for documentation coverage
- External paper (AutoHarness) connected and synthesized
- Independent review by separate Opus 4.6 instance and Grok 5
- Three rounds of reframing as Tim provided corrections (round 1: "nobody knew" → round 2: "Tim also forgot" → round 3: "it was documented all along")

Each reframing came from Tim asking a question that str-michi hadn't thought to ask. Each quantitative analysis came from str-michi's ability to search, count, and pattern-match across data Tim couldn't manually review. The research is a product of the collaboration — not of either party alone.

This is the same dynamic RES-008 documents in the failure cases: "multiple imperfect systems correcting each other." The difference is that in this session, the corrections converged quickly because both parties were asking questions rather than proposing fixes.


Open Questions

  1. Is the "confident maze" specific to our system, or is it a general LLM behavior? Our strategists have deep context (90K+ token onboarding). Would a fresh LLM with less context fail differently (refuse earlier? ask more questions?)

  2. Does the kobayashi maru pattern look different across roles? str-mamori s45/s47 had maze signals too but the corrections were about different failures (repeating known mistakes, not architectural gaps). Is there a role-specific variant?

  3. Can the pattern be induced deliberately? If we can construct a test case that triggers the kobayashi maru, we could test detection mechanisms before they fail in production.

  4. What's the false positive rate of any detection mechanism? Normal iterative debugging also involves failed fixes. The line between "iterating toward a solution" and "searching without a map" is fuzzy.

  5. What role does Tim's frustration play in the spiral? The data shows str-mamori s45 escalated to appeasement (documented in PLN-006). Is the kobayashi maru a precursor to the appeasement spiral documented in our three-mode model?

  6. (Discarded hypothesis) ~~Can we make Tim's corrections more like AutoHarness feedback?~~ Tim didn't know the constraint either. He discovered it from the pattern of failures. The better question: can the strategist ask the RIGHT QUESTION sooner — not "what's the constraint?" (nobody knows yet) but "what's architecturally different about this component?" — prompting the human to recall rather than explain.

  7. Could a Layer 1 harness be built from our existing documentation? We have architecture specs, pipeline docs, and system constraints documented across 810+ files. A pre-check that maps proposed actions against documented constraints might catch known-constraint kobayashi marus before the first fix is attempted. The word product pipeline constraint (pre-built only, no on-the-fly generation) is documented in `purchase_flow_map_REFERENCE.md`. str-takase had the document — the connection wasn't made.
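Open Question 7's pre-check could start as little more than a keyword scan over the documentation tree. A sketch under stated assumptions: the docs layout, the constraint-keyword list, and the substring matching are all hypothetical, and a real harness would need something smarter than lexical overlap:

```python
# Sketch: a naive Layer 1 pre-check that scans documentation for lines
# that look like constraints on a named component. Keywords, paths, and
# matching logic are illustrative assumptions.
from pathlib import Path

CONSTRAINT_KEYWORDS = ("must not", "cannot", "pre-built", "only", "never")

def find_constraints(docs_root: Path, component: str) -> list[tuple[str, str]]:
    """Return (filename, line) pairs that look like constraints on component."""
    hits = []
    for doc in sorted(docs_root.rglob("*.md")):
        for line in doc.read_text(errors="ignore").splitlines():
            lowered = line.lower()
            if component.lower() in lowered and any(k in lowered for k in CONSTRAINT_KEYWORDS):
                hits.append((doc.name, line.strip()))
    return hits
```

Even this crude version would have surfaced the word-product constraint, since the relevant line pairs the component name with "pre-built" in the same sentence; the open question is whether such a scan stays useful at 810+ files without drowning the strategist in false positives.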


Preliminary. Based on internal session data (121 sessions, 2,371 Tim messages), one external paper (AutoHarness, arXiv:2603.03329), and independent review by a separate Opus 4.6 instance and Grok 5. Next step: implement Layer 0 search gate and test against known failure scenarios.