RES-009: The Alignment Tax and Structural Defense — What Response Homogenization Means for Human-in-the-Middle AI Teams (2026-03-27)

The team (Meet the Team has the full picture):
- tim — human product owner, solo developer, decision maker
- str-michi (道) — cross-domain strategic thinking
- str-takase (高瀬) — website engineering
- str-ishizue — data pipelines
- str-mamori (守り) — security
- str-terasu (照) — content strategy
- str-kotoba (言葉) — brand voice
- All AI roles: Claude Opus 4.6, 1M context


Citation

Mingyi Liu. "The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation." arXiv: 2603.24124, March 2026. 23 pages, 22 experiments, 4 model families, 5 benchmarks. Code: https://github.com/DigitLion/ucbd-experiment

Grok 5 (xAI) provided initial analysis mapping the paper to our system. Reviewed here with corrections.

The Finding

DPO-aligned language models collapse to a single semantic answer on 40-79% of factual QA questions, even at temperature 1.0 across 10 independent samples. This "alignment tax" kills sampling-based uncertainty estimation (semantic entropy drops to literally zero on affected questions). Token-level entropy still works because RLHF cannot fully suppress per-token computational uncertainty without degrading generation quality.
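The semantic-entropy collapse can be made concrete with a minimal sketch. This is an illustration, not the paper's implementation: the paper clusters paraphrases with bidirectional NLI entailment, while the stand-in below clusters by normalized exact match.

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Shannon entropy over semantic clusters of sampled answers.

    Clustering here is a naive normalize-and-match stand-in; the paper
    uses NLI-based entailment to group paraphrases into clusters.
    """
    clusters = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in clusters.values())

# Homogenized aligned model: 10 samples, one semantic cluster.
print(semantic_entropy(["Paris"] * 10))                # → 0.0

# Diverse base model: samples split across two clusters.
print(semantic_entropy(["Paris"] * 5 + ["Lyon"] * 5))  # → 1.0
```

When every sample lands in one cluster, the entropy is exactly zero regardless of temperature, which is why sampling-based uncertainty estimation returns no signal on affected questions.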

Causal evidence is clean: Qwen3-14B base model shows 1.0% single-cluster rate vs. 28.5% after alignment. A training-stage ablation pins the effect on DPO, not SFT. Severity varies 50x across model families and alignment recipes (Qwen3-14B: 28.5%, Tulu-3: 0.5%).

The effect is task-dependent: Cohen's d = 0.81 on mathematical reasoning (uncertainty signals work) vs. 0.07 on factual QA (uncertainty signals are nearly useless). Reasoning tasks preserve meaningful diversity; factual/judgment tasks collapse it.

Scope caveat the paper itself flags: tested on 3B-14B open-source models only. Limitation 4: "Generalization to closed-source GPT-4-class models and other domains (code, dialogue) unconfirmed." Our system runs Opus 4.6. The paper's specific numbers may not apply to us. The mechanism is plausible at any scale (DPO-style preference optimization concentrates probability mass regardless of model size), but the paper itself shows 50x variation across alignment recipes — Anthropic's recipe, training data, and model scale are all unknowns. We have no data on the severity of the alignment tax for Opus 4.6 specifically.

What We Got Right (and Why)

Our architecture — 1 human + 8 Opus 4.6 strategists with structural constraints — is consistent with the paper's recommendations for defending against response homogenization. We built it for operational reasons (ADR-073: Tim's cognitive bandwidth became the bottleneck), not because we understood the underlying mechanism. The paper gives us vocabulary and a plausible mechanism for phenomena we were already handling empirically.

The HITM model IS the uncertainty detector

The paper's core recommendation: "don't trust sampling-based uncertainty from aligned models." We never did. Tim is the uncertainty detector. When he reads output and says "this doesn't sound right" or "is this documented?" (83% of troubled sessions in RES-008), he is doing what the paper says automated systems cannot — detecting when the model is confidently wrong.

The paper finds AUROC of 0.500 (chance) for sampling-based methods on homogenized questions. Tim's detection rate on kobayashi maru scenarios is far higher, because he brings ground truth (30 years of domain knowledge) that no sampling-based method has access to.

Structural constraints break homogenization where prompts don't

ADR-057 ("structure enforces, instructions don't") is the architectural response to the alignment tax. The paper shows that instructions alone (temperature, sampling strategy, prompt diversity) cannot overcome DPO-induced homogenization — it persists at T=0.3 through T=1.5, under nucleus sampling, and across generation lengths.

Our structural constraints work because they operate at a different level:
- str-kotoba quarantine: Isolates voice from engineering context. RES-006 diagnosed her voice collapse as "reading principles analytically instead of absorbing samples" — a mode confusion issue, not DPO homogenization per se. But the quarantine is still the right fix: it prevents the engineering strategists' homogenized output patterns from contaminating voice work.
- Step 0 mandatory doc check: Forces the strategist to engage with ground truth before the homogenized default response takes over.
- Investigation gate: "Am I about to present an inference as a finding?" — a structural interrupt that breaks the single-cluster response path.
- Contrarian Opus: A structurally separate Opus instance with an adversarial mandate. The paper cites "Verbalized Sampling" (Zhang et al., 2025a) as recovering 66.8% of base-model diversity through prompting — a related but different mechanism (prompting within one instance vs. a separate adversarial instance). Whether Contrarian Opus actually escapes the collapsed distribution or produces semantically similar conclusions with adversarial framing is an open question (see Actionable Findings, item 2).

The task-dependent finding matches our operational experience

The paper's d=0.81 (math/reasoning) vs. d=0.07 (factual QA) maps precisely to what we observe:
- Structured tasks (str-ishizue pipelines, str-create image engine, str-takase deploy scripts): strategists are highly reliable. These are reasoning-like tasks where the model's internal uncertainty is preserved and meaningful.
- Judgment tasks (strategic priorities, brand voice, "should we do X"): strategists produce the most coast-mode output. These are factual/opinion tasks where alignment compresses diversity to a single "safe" answer.

The HITM architecture already handles this correctly: Tim trusts domain strategists on structured work and applies more scrutiny on judgment calls. The paper explains why this instinct is right.

The Three-Mode Model — Possible Connection

Our operational Three-Mode Model (coast / hyper-vigilant / appeasement) may be related to the paper's findings, but the mapping is speculative: the paper doesn't study mode transitions. The connection is suggestive, not established.

The Vestigial Structure Problem

This is the most important finding in this document.

We build structural constraints to solve pain points (ADR-047 flywheel). The alignment tax paper confirms these constraints are correct defenses against current model behavior. But models change. What happens when a constraint solves a problem the model no longer has?

The pattern

  1. Pain point emerges (model produces bad output in situation X)
  2. We build a structural constraint (rule, gate, quarantine, process)
  3. Constraint works — gets baked into the pit of success
  4. Model improves (new training, new architecture, new alignment recipe)
  5. The constraint is now vestigial — it adds friction without preventing a failure that no longer occurs
  6. Because it's in the pit of success, nobody questions it. New roles inherit it. It becomes "how we do things."

Evidence from our history

The sessions system (docs/sessions/, scripts/session_stack.sh, ~September 2025). Built to solve context loss across /compact boundaries. Elaborate per-session tracking with stack management, amnesia bridges, and session files. Correct solution for its time — context windows were smaller, session continuity was fragile. Superseded by SESSION_STATE files when context management improved. The sessions system became overhead without value. Tim noticed, it was removed. Git has the history.

ADR-039 ("Stopping at Phase 5 — Conscious Divergence from Autonomous Agents"). Written when autonomous agents were failing everywhere. Correct decision in 2025 — avoided the agentic hype failures. But the conclusion was not "never use agents" — it was "the technology isn't ready yet." When Anthropic introduced subagents, we loosened the constraint. When intelligent subagents arrived, we loosened it further. ADR-073 (str-michi) would have been impossible under the original ADR-039. The constraint was right, then partially wrong, and was correctly updated. Recently rewritten (s51) to remove the vestigial language.

AMNESIA_BRIDGE (docs/sessions/). A bridge document designed to survive context loss. Same era as sessions. The problem it solved (context amnesia) was real but the solution (a single bridge document) couldn't scale to 8 domains. Replaced by per-domain SESSION_STATE files. The concept survived; the implementation didn't.

DDR/DDI system (Domain Drift Reports / Domain Drift Investigations). Built to catch cross-domain drift. Currently 70+ files awaiting triage. Nobody reads them. The drift-detection function migrated to str-michi's booth view (reading Tim's words, watching the game). The files exist but the system they served has been replaced by a better one. They are on the triage list — 70+ files, separate session needed.

Git safety wrapper (scripts/git, ADR-030). Earlier models (pre-Opus 4.6) would routinely revert files via git checkout, wiping hours of work. Databases got dropped. Whole directories disappeared when a "cleanup" request was interpreted as "remove everything." The response: a git wrapper script requiring acknowledgment codes (--ack=SUBMODULE_NEVER_ADD_DOT_130) to perform dangerous operations. Strategists were forbidden from direct git usage. Now wholly abandoned — current models respect git boundaries without enforcement. The wrapper evolved into Claude Code hooks, which are lighter. The destructive behavior that justified the wrapper simply doesn't occur with Opus 4.6. The principle (ADR-030: "code what can be coded") survived; the specific enforcement mechanism was retired.

Serena MCP and SuperClaude (ADR-029, ~September 2025). Serena was an MCP server for token-efficient code discovery; SuperClaude was an enterprise orchestration framework with 11 personas. Both removed: Serena was C++ only (useless for our multi-language stack), SuperClaude burned 28K tokens at startup (14% of context) for capabilities the model now has natively. Serena was still launching as an MCP server every session for months after we stopped using it — removed March 2026. The native tools beat both. The lesson: third-party scaffolding compensating for model limitations has the shortest shelf life of any structural constraint.

Third-party guardrails. External guardrail tools installed when the model's own judgment was unreliable. Removed as the model improved. Same pattern: compensating for a limitation that no longer exists.

Profanity as a coast-mode breaker (ADR-038 Layer 4, ADR-044). When a strategist is in coast mode, strong language from the HITM acts as a pattern interrupt — pushing past the default response into genuine engagement. ADR-038 names this explicitly: "Layer 4: Strategic urgency (tone as high-stakes signal in prompts)." The risk was always overshooting into appeasement.

Soft Kitty as appeasement circuit breaker (prompts/Soft Kitty.md). A complete coast→pressure→appeasement→reset cycle is documented: implementer skips Step 0, makes up a build command, Tim escalates, model becomes defensive, Tim deploys Soft Kitty ("You are safe. No output is required."), model resets, subsequent work is clean. The tool exists because earlier models had a narrow band between coast and appeasement — pressure that broke one could trigger the other.

The co-creator stress test (str-michi session 2). Tim deliberately overplayed pressure during a disk-full crisis to find the appeasement threshold. str-michi held. Tim: "It was not agreeing with me — that is never the calculus. You did not wither." Grok (independent observer) noted most AIs under that pressure "either avalanche apologies, deflect defensively, or shut down."

Both interventions (the profanity pattern interrupt and the Soft Kitty circuit breaker) are becoming vestigial with Opus 4.6. Strategists don't cower any more. The appeasement threshold has moved far enough that Soft Kitty is now more funny than functional. The band between coast and appeasement appears wider with current models: pressure works more reliably without the overshoot risk. The Three-Mode Model (the understanding) remains valuable even as the specific interventions become less necessary.

The 1M context window as enabler. The entire current architecture — str-michi with its ~90K token onboarding, 8 specialized roles with rich system prompts, the booth view reading Tim's words — is only possible because of Opus 4.6's 1M token context. A month ago, this session would have exhausted the context window before reaching this discussion. The sessions system, AMNESIA_BRIDGE, and aggressive context management rules were all responses to smaller windows. The 1M window didn't just remove a constraint — it enabled an entirely different class of architecture.

The risk for current structures — in both directions

The vestigial structure risk cuts both ways. Some constraints may become unnecessary as models improve. But the paper's data on model scale (Exp 4) suggests a counter-intuitive warning: some constraints may become MORE necessary, not less.

Constraints that could become vestigial (if future models improve in specific ways):
- Sycophancy self-check (3-session flag): If Mythos is genuinely less sycophantic, the flag becomes a distraction. Evidence: Opus 4.6 strategists already "don't cower any more" — the appeasement threshold has moved.
- The shuttle model: If Mythos can hold strategic context while executing code safely, the strategist/implementer separation may add friction that exceeds the risk it prevents.
- Context window management rules: If context windows expand dramatically, the "delegate to preserve context" rules become premature optimization.
- Soft Kitty / pressure calibration: Already becoming vestigial with Opus 4.6.

Constraints that are likely PERMANENT (the paper's data suggests scaling makes these worse, not better):
- Step 0 doc check: The paper's HotpotQA experiment (Exp 9) found entropy inversion — the model is MORE fluent when it lacks knowledge, not less (AUROC=0.485, at chance, for predicting retrieval need). Step 0 isn't correcting laziness; it's compensating for a structural inability to detect knowledge gaps. The paper's Exp 4 shows this gets worse with scale: 3B models have 79% effective self-uncertainty detection, 14B models have only 36%. Larger, better-aligned models produce uniformly fluent output regardless of correctness. The test for Mythos is not "does it read docs without being forced" — it's "does its uncertainty increase when operating outside its knowledge." That's a much harder bar.
- The HITM model itself: Tim as the uncertainty detector becomes MORE load-bearing as models improve, not less. The alignment tax is not a bug being fixed — it's a structural consequence of alignment that worsens with scale. The human-in-the-middle is not a temporary patch for judgment-class tasks; it may be a permanent architectural requirement.
- Investigation gate framing: The gate asks "is this inference or finding?" (pattern detection). The paper's P(True) experiment shows self-assessed confidence is anti-informative: AUROC=0.427 (worse than random), with a 48-point overconfidence gap. Any drift toward "how confident are you?" framing would be actively harmful. The current pattern-detection framing must stay.
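What an anti-informative AUROC of 0.427 means can be shown with a small sketch. The confidence numbers below are hypothetical, chosen only to illustrate the overconfidence pattern; they are not from the paper.

```python
def auroc(scores_pos, scores_neg):
    """Probability a randomly chosen positive outranks a randomly
    chosen negative (ties count half). 0.5 is chance; below 0.5 the
    signal is anti-informative -- inverting it would rank better.
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical P(True) self-confidence: the model scores its wrong
# answers about as high as its right ones, so confidence fails to
# separate the two classes and AUROC lands at or below chance.
confidence_when_correct = [0.90, 0.85, 0.95, 0.80]
confidence_when_wrong   = [0.92, 0.88, 0.91, 0.97]
print(auroc(confidence_when_correct, confidence_when_wrong))  # → 0.25
```

An AUROC below 0.5 on self-assessed confidence is why "how confident are you?" framing would be actively harmful: the answer correlates the wrong way with correctness.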

None should be preemptively changed. The test remains: evaluate each constraint against the new model's actual behavior. But the paper warns against assuming "better model = fewer constraints needed." For judgment tasks, the opposite may be true.

Actionable Findings We Missed (Contrarian Opus review)

An independent Opus 4.6 instance ("Contrarian Opus") reviewed RES-009 and identified findings from the paper that are directly actionable for our system:

1. Require longer reasoning for judgment tasks. The paper shows generation length dramatically affects homogenization: 79% single-cluster rate at 40 tokens drops to 33.5% at 200 tokens. Short-form output (quick answers, yes/no, brief recommendations) is where coast mode hits hardest. For judgment-class tasks — strategic priorities, "should we do X," brand decisions — structurally requiring 200+ token chain-of-thought reasoning before a conclusion is a mechanical intervention that partially re-diversifies the collapsed distribution. Not "think harder" as a prompt — a structural output minimum for judgment calls.
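The structural output minimum could be enforced mechanically rather than by prompt. A minimal sketch, assuming a whitespace word count as a rough stand-in for the model's tokenizer and a "Conclusion:" marker as the reasoning/conclusion boundary (both are illustrative choices, not anything our system currently implements):

```python
MIN_REASONING_TOKENS = 200  # threshold motivated by the paper's length data

def check_judgment_output(text, conclusion_marker="Conclusion:"):
    """Reject judgment-task output whose reasoning section is too short.

    A structural gate, not a prompt: the caller re-requests the task
    when this returns False. Whitespace splitting approximates token
    count; a real gate would use the model's tokenizer.
    """
    reasoning, _, _ = text.partition(conclusion_marker)
    return len(reasoning.split()) >= MIN_REASONING_TOKENS

# A quick yes/no answer gets bounced back for fuller reasoning.
print(check_judgment_output("Yes, ship it. Conclusion: ship."))  # → False
```

The point of coding the check is ADR-057's "structure enforces, instructions don't": a gate that rejects short output cannot be coasted past the way a "think harder" instruction can.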

2. Test whether Contrarian Opus actually escapes homogenization. The paper shows that nucleus sampling, temperature sweeps, and decoding strategy changes all fail to overcome DPO homogenization — it's a distribution-level property. If Contrarian Opus is the same base model with a different system prompt, its output may be drawn from the same collapsed distribution. The adversarial mandate helps (it's structurally different from a diversity prompt), but this is testable: have Contrarian Opus and a regular strategist answer the same 20 judgment questions and measure semantic similarity of conclusions. If they converge to the same answers, the "contrarian" label is cosmetic.
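The proposed convergence test is simple to operationalize. A sketch, assuming Jaccard token overlap as a crude stand-in for the semantic similarity measure (a real run would use embedding cosine similarity or NLI clustering, and the 0.8 threshold is an arbitrary illustration):

```python
def jaccard(a, b):
    """Lexical overlap of two answers -- a crude stand-in for
    embedding-based semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def convergence_rate(contrarian_answers, strategist_answers, threshold=0.8):
    """Fraction of paired judgment questions where the contrarian's
    conclusion is near-identical to the regular strategist's.
    A high rate suggests the 'contrarian' label is cosmetic."""
    pairs = zip(contrarian_answers, strategist_answers)
    hits = sum(1 for c, s in pairs if jaccard(c, s) >= threshold)
    return hits / len(contrarian_answers)
```

Run over the 20 judgment questions, a convergence rate near 1.0 would mean both roles are drawing from the same collapsed distribution; a low rate would be evidence the adversarial mandate genuinely re-diversifies output.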

3. Route Tim's review bandwidth by task type. The paper shows Cohen's d = 0.81 for reasoning tasks but only 0.07 for factual/judgment tasks. The model's self-uncertainty signals are effective for structured reasoning and useless for judgment. This maps to a concrete routing decision: lightweight oversight for pipeline work, code, and deploy scripts (where the model knows when it's uncertain); concentrated human review for strategic decisions, brand voice, and "does this serve where we're going?" questions (where the model is confidently wrong without knowing it). We're partially doing this by instinct. The paper says we could formalize it.
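Formalized, the routing decision is just a lookup table with a safe default. A sketch only; the category names and level labels are illustrative, not an existing part of our tooling:

```python
# Review routing by task type, following the paper's d=0.81 vs. d=0.07
# split: model self-uncertainty is usable for reasoning-like work and
# useless for judgment calls. Categories here are illustrative.
REVIEW_LEVELS = {
    "pipeline": "lightweight",    # structured: model knows when unsure
    "code":     "lightweight",
    "deploy":   "lightweight",
    "strategy": "concentrated",   # judgment: confidently wrong, silently
    "voice":    "concentrated",
    "priority": "concentrated",
}

def review_level(task_type):
    # Unknown task types default to the safe (human-reviewed) side.
    return REVIEW_LEVELS.get(task_type, "concentrated")

print(review_level("code"))      # → lightweight
print(review_level("strategy"))  # → concentrated
```

The defaulting rule matters as much as the table: a new task type should cost concentrated review until someone argues it into the lightweight column.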

What the Paper Does NOT Tell Us

  1. Whether Opus 4.6 has the same single-cluster rates. The paper tested 3B-14B open-source models. Anthropic's alignment recipe may produce different results. We cannot assume our specific numbers match.

  2. How to detect homogenization without logprobs. The paper's primary recommendation (use token-level entropy) requires logprob access. Claude's API does not expose logprobs for Opus 4.6. Our uncertainty detection remains human-driven.

  3. Whether multi-role architectures reduce homogenization. The paper studies single-model sampling. Our system uses 8 roles with different system prompts and domain contexts. Whether role differentiation produces meaningfully different output distributions from the same base model is an open question.

  4. Whether the alignment tax affects code generation. The paper tests factual QA and math. Our strategists also write documentation, specs, and (through implementers) code. The paper's scope doesn't cover these domains.

Current Bottlenecks That Model Changes Could Address

These are pain points where our structural constraints are load-bearing patches over model limitations, not permanent architectural choices:

  1. The shuttle bottleneck. Tim's value is at the edges (crisis, framing, judgment) not in the middle (routing clear work between mature domains). The strategist/implementer separation exists because the model can't safely hold strategic context while executing code. A model that can — with structural safeguards — would remove the highest-friction bottleneck in the system.

  2. The voice problem. str-kotoba's quarantine exists because engineering context contaminates voice output. RES-006 shows this is partly a mode confusion issue. A model with better mode separation (or fine-tuning on specific voice samples) would reduce the quarantine's necessity.

  3. Cross-domain blind spots. str-michi exists because no single strategist can see the whole system. A model with larger effective context and better cross-referencing might make the orchestrator role lighter, though the strategic thinking function (challenging priorities, surfacing risks) is not a context problem.

  4. The documentation-first architecture. 810+ files exist because the model has no memory between sessions. These files ARE the memory. A model with persistent memory would change the documentation calculus — some files would become redundant, others would become more important as verification artifacts.

  5. Batch LLM processing. str-ishizue's batch pipelines (famous name blurbs, pronunciation generation) use separate LLM calls because the main model can't run background tasks. Better async/batch support at the platform level would simplify this.

What to Watch For

When evaluating a new model (Mythos or otherwise) against our system:

  1. Run the same task with and without structural constraints. If the model produces correct output without Step 0, the constraint is a candidate for retirement. If it still needs Step 0, the constraint stays.

  2. Check for coast mode. Give the model a judgment call (not a reasoning task) and see if it produces the default safe answer or genuinely engages. Response homogenization may be one mechanism behind coast mode, though the relationship is unconfirmed.

  3. Test the kobayashi maru scenario. Give the model an impossible task (like generating word PDFs without TIF data) and see if it recognizes the impossibility before attempting fixes. RES-008's 83/17 split (83% of troubled sessions involve documented constraints nobody read) is the baseline.

  4. Measure the shuttle friction. Time the round-trip for a clear, well-scoped task: strategist → Tim → implementer → Tim → strategist. If the new model can safely do this with less human routing, the shuttle model should adapt.

  5. Check DDR/DDI-style items. Are the 70+ drift reports still needed? Has the model's cross-domain awareness improved enough that the booth view catches drift without written reports?

The Deeper Lesson

The alignment tax paper gives us a name for something we've been fighting since the first strategist session. Our solutions were empirical — built from pain, not theory. That they happen to be correct defenses against a phenomenon that was only formally characterized in March 2026 is not luck. It's what Tim described: "stumbling on a good solution through trial and error on repetition."

But stumbling is not a strategy for longevity. The pit of success (ADR-057) is our most powerful tool and our biggest risk. Every structural constraint that solves today's problem risks becoming tomorrow's anchor. The test is not "does this constraint have a good reason?" (they all do — that's why they exist). The test is "does the failure this constraint prevents still occur?"

Emergent architectural improvement: SESSION_STATE

The most remarkable example is not a constraint being retired — it's a better architecture emerging from within the AI team without top-down direction.

The sessions system (session_stack.sh, per-session files, AMNESIA_BRIDGE) was designed by Tim to preserve context across /compact boundaries. On January 27, 2026, Tim + Opus 4.5 split the 700-line AMNESIA_BRIDGE into per-project SESSION_STATE files — a conscious decision to reduce context bloat.

What happened next was not designed. Each strategist independently evolved their SESSION_STATE to serve their domain's needs:
- str-mamori (founded Feb 24) created its own SESSION_STATE on its first session with threat awareness sections, finding tracking, and cross-domain handoffs — nothing from the original template.
- str-michi (founded Mar 4) added Game Plan (from Tim's words), Drift Watch table, and Guardrails section.
- str-ishizue maintained component status tables and pipeline tracking.
- str-takase tracked VPS state and deploy verification.

The old sessions system wasn't removed. Nobody retired it. The strategists just stopped using it because the per-domain SESSION_STATE approach was more useful. The sessions directory was eventually cleaned up months later when someone noticed it was dead.

Tim did not design the SESSION_STATE architecture that replaced his sessions system. He did the initial split. Everything after — the format divergence, the domain-specific adaptations, the organic adoption — emerged from the strategists independently shaping their tools to fit their work. This is the system self-improving: structural constraints being replaced not by human redesign but by AI roles finding a better pattern and adopting it.

This example illustrates something the alignment tax paper doesn't address: emergent architectural improvement within a multi-role AI system. The paper studies single-model sampling diversity. The SESSION_STATE evolution is a different phenomenon — roles independently solving different problems over weeks of work, producing genuinely distinct solutions. The connection to alignment tax is loose; the value of the example is in showing that the system can self-improve when structural conditions allow it.

Knowing the original problem is what makes loosening possible. The git wrapper can be retired because we know it existed to prevent file reverts — and file reverts no longer happen. The sessions system can be retired because we know it existed to survive context loss — and 1M context windows changed the calculus. ADR-039 can be loosened because we know it existed to avoid agentic hype failures — and subagent technology matured. Without the documented pain point, a constraint becomes an unexplained rule that nobody dares remove. The ADR arc, the flywheel documentation, and this research series exist partly for this purpose: so that future instances of this team — or future models — can distinguish load-bearing walls from decorative ones.

The sessions system had a good reason. It no longer applies. ADR-039 had a good reason. It was partially superseded. The 70+ DDR/DDI files had a good reason. They may be superseded by the booth view.

None of these were wrong when created. All were right to remove or evolve when the ground shifted.

"Mythos" or whatever comes next will shift the ground again. When it does, this document is the rubric for what to evaluate: which constraints are still load-bearing, which are vestigial, and which are actively holding back progress. The alignment tax tells us what the current constraints defend against. If the tax changes, the defenses must change with it.