PLN-013: Production Resilience — Schema Contracts, Deploy Verification, Monitoring Correctness (2026-03-21)
The team (Meet the Team has the full picture):
- str-michi (道) — cross-domain strategic thinking (plan owner, coordination)
- str-takase (高瀬) — website engineering (deploy verification, route inventory)
- str-ishizue — data pipelines (schema contract, cache validation)
- str-mamori (守り) — security (monitoring canaries, alert standards)
- All AI roles: Claude Opus 4.6, 1M context
Origin: 88-minute HTTP 500 incident (2026-03-21, 02:58–04:26 UTC). 442 of 1,670 requests failed (26.5%) on `/d/<hash>` design pages. Root cause: a `selected_people` data-shape mismatch — the ETL produced strings, the website expected dicts. Every individual component was healthy. The schema just didn't match.
Trigger (Tim): "Not only are we not building widgets, nor a factory, but a castle fortress that has a factory and makes widgets." The incident exposed three missing layers: schema contracts between domains, post-deploy product verification, and monitoring that checks correctness not just availability. These standards apply to everything we build going forward.
Philosophy: Never let a good disaster go to waste. The immediate fixes are deployed (Phase A). This plan builds the structural defenses so this class of problem doesn't recur — for the name cache, for future pipelines, and for every new dynamic route.
Phase A: Immediate Incident Fixes — DONE (s50/s96/s160)
All deployed same day as the incident.
str-mamori (s50, imp-redteam)
- [x] Synthetic `/d/<hash>` canary check in `cutover_watch.sh` (every 15 min, non-200 = immediate alert)
- [x] Count-based 500 threshold: 5+ errors in 15 min = alert regardless of percentage
- [x] Human-readable alert text with what/how-bad/what-to-do + copy-paste investigation commands
- [x] Confirmed custom 500 error page shows no stack traces (DEBUG=False working)
str-takase (s160, imp-takase)
- [x] isinstance guard in `_build_famous_context` (design_page_service.py:522-527) — handles both string and dict `selected_people` formats
- [x] `selected_people` audit: only one access point in the entire website codebase (the one fixed)
- [x] Concept-check added: "`selected_people` has two schemas — code must handle both"
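The shape of such a guard can be sketched as follows (a minimal, illustrative sketch; the function and variable names here are not the production code in design_page_service.py):

```python
def normalize_selected_people(selected_people):
    """Coerce mixed-format selected_people entries to the dict shape.

    Pre-s96 caches stored plain strings; post-s96 caches store
    {"name": ..., "katakana": ...} dicts. Anything else is dropped.
    """
    normalized = []
    for person in selected_people or []:
        if isinstance(person, dict) and "name" in person:
            normalized.append(person)       # already the target shape
        elif isinstance(person, str):
            # legacy pre-s96 entry: wrap it, leave katakana empty
            normalized.append({"name": person, "katakana": ""})
        # anything else (None, numbers, malformed dicts) is skipped
    return normalized
```

The key property is that both legacy and current cache formats come out in one shape, so downstream rendering code only ever sees dicts.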
str-ishizue (s96, imp-etl)
- [x] `name_cache_generate.py` wraps string entries as `{"name": s, "katakana": ""}` (dict format)
- [x] Katakana enrichment from `famous_names_17_lookup.csv` (Wikidata, 399K rows, zero LLM cost)
- [x] Full cache regen running (106K records, all-dict format with katakana)
- [x] Concept-check added: "`selected_people` must be dicts, not strings"
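The wrap-and-enrich step can be sketched like this (assumptions: an in-memory lookup keyed by name, and `name`/`katakana` column headers, which may not match the real CSV layout):

```python
import csv
import io

def load_katakana_lookup(csv_text):
    """Build a name -> katakana map from lookup CSV text.

    Column names ("name", "katakana") are assumptions for this
    sketch; the real famous_names_17_lookup.csv may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: row["katakana"] for row in reader if row.get("katakana")}

def wrap_and_enrich(entry, lookup):
    """Wrap a string entry as a dict and fill katakana from the lookup."""
    if isinstance(entry, str):
        entry = {"name": entry, "katakana": ""}
    if not entry.get("katakana"):
        # fall back to empty string when the name is not in the lookup
        entry["katakana"] = lookup.get(entry["name"], "")
    return entry
```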
Phase B: Name Cache Schema Contract — DONE (s42/s97/s161)
Goal: Formal interface spec between ETL cache output and website input. Both sides validate. A format change that breaks the contract is caught before deploy, not after 88 minutes of 500s.
Owner: str-michi coordinates. str-ishizue (output side) + str-takase (input side) implement.
Key findings (str-ishizue review, s42)
- 14 top-level fields. str-michi's original list had 7, but one of those (`selected_people`) is actually nested — so only 6 real top-level fields were identified and 8 were missing: `name`, `name_lower`, `source_versions`, `pronunciation`, `kaggle`, `etymology_raw`, `etymology`, `designs`. `selected_people` lives inside `famous_people[].selected_people`, NOT at top level.
- Two `famous_people` sub-schemas coexist: Schema A (blurb: `{romaji, language, blurb, people_count, selected_people}`) from both builders, and Schema B (list: `{romaji, language, people}`) from `build_full_cache.py` only. The website renders Schema A only. Schema B has no `selected_people` key — this is valid, not malformed.
- Type differences between builders: `llm_concepts` is `[]` in the lightweight builder vs `None` (initial) in the full builder. Same for `source_links` (`{}` vs `None`). The freshness gate handles this (`[]`/`{}` = valid, `None` = never computed).
- Three `selected_people` item shapes in the wild: plain strings (pre-s96), dicts with `name`+`katakana` (s96 fix), and dicts with `name`+`katakana`+`qid`. The spec must declare which are valid going forward.
Deliverables
- [x] B1: Interface spec — `name_cache_interface_SPEC.md` (v1.0.0, str-ishizue s97). All 14 top-level fields with types, constraints, validation rules. Both `famous_people` sub-schemas documented. `selected_people` dict requirement formalized. `None` vs empty semantics. Pending str-takase review.
- [x] B2: ETL-side validation — DONE (imp-etl, s97). `etl/scripts/name_cache/validate_record.py` — shared `validate_cache_record()` called by both builders before JSONL write. Checks: 14 required fields present, no `None` (except `kaggle`), variants non-empty, gender enum, `selected_people` are dicts with a `name` key, `generated_at` non-empty. 7-case self-test suite passes.
- [x] B3: Website-side validation — DONE (imp-takase, s161). `_validate_cache_fields()` in `name_info_service.py` v1.02 — validates 6 critical fields at load time (`selected_people`, `llm_concepts`, `variants`, `famous_people`, `source_links`, `gender`). Wrong types → WARNING log + safe default, never crash. `selected_people`: filters non-dict entries (keeps valid dicts from mixed lists). 14 tests pass. Deployed via quick_push.
- [x] B4: Freshness gate schema check — DONE (imp-etl, s97). Extended `check_cache_freshness.py` with type validation inside the existing 7-check structure: `variants` items have `romaji`+`pronunciation` keys, `selected_people` items are dicts with a `name` key (the s96 rule), `llm_concepts` items are dicts, `source_links` is a dict.
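The producer-side rules in B2 can be illustrated with a toy validator (a sketch of the checks described above, not the real `validate_cache_record()`; any structure beyond what this plan states is an assumption):

```python
def validate_cache_record_sketch(record, required_fields):
    """Producer-side checks in the spirit of B2.

    Rules sketched: required fields present, no None values
    (kaggle excepted), and selected_people items must be dicts
    with a "name" key. Returns error strings; empty means valid.
    """
    errors = []
    for field in required_fields:
        if field not in record:
            errors.append(f"missing field: {field}")
        elif record[field] is None and field != "kaggle":
            errors.append(f"None not allowed: {field}")
    # selected_people is nested inside famous_people[].selected_people;
    # Schema B entries have no selected_people key, which is valid.
    for fp in record.get("famous_people") or []:
        for person in fp.get("selected_people") or []:
            if not isinstance(person, dict) or "name" not in person:
                errors.append(f"bad selected_people item: {person!r}")
    return errors
```

Calling this before the JSONL write (as B2 does for both builders) turns a silent schema drift into a hard build-time failure.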
Phase C: Post-Deploy Product Verification — DONE (s161)
Goal: Every deploy that touches the product (ship.sh, quick_push.sh for service files) automatically verifies that the product works, not just that the server is up.
Owner: str-takase
Deliverables
- [x] C1: ship.sh smoke test — DONE (imp-takase, s161). `ship.sh` v1.03 — checks `/health` + `/d/7d0bd618` (Timothy: kana/phrase) + `/d/92f7c0d5` (Aliya: meaning sections). Writes `.last_deploy_status` (timestamp + PASS/FAIL + failed checks). WARNING on failure, no abort.
- [x] C2: quick_push.sh smoke test — DONE (imp-takase, s161). Checks `/d/7d0bd618` after Gunicorn restart only. Skipped on data-only pushes. WARNING on failure.
- [x] C3: str-takase onboarding spot-check — the rotation list must include at least one `/d/<hash>` URL. Static pages aren't enough — the product is dynamic. (str-takase s161: doc change, doing directly.)
Two-canary approach (s42 correction): Amy only renders 2/4 sections (`llm_concepts` is empty — Sections 3-4 don't render). Using two canary names: one that tests the engine pipeline (kana/phrase sections) and one that tests the data pipeline (meaning sections). A failure pinpoints which pipeline broke. imp-takase selects the names from actual cache data.
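The pinpointing the two-canary design buys can be stated as a small decision table (an illustrative sketch, not code from any deployed script):

```python
def diagnose_canaries(engine_ok, data_ok):
    """Map the two canary results to a diagnosis.

    engine_ok: did the kana/phrase-section canary pass?
    data_ok:   did the meaning-section canary pass?
    """
    if engine_ok and data_ok:
        return "PASS: both pipelines rendering"
    if not engine_ok and data_ok:
        return "FAIL: engine pipeline (kana/phrase) broken"
    if engine_ok and not data_ok:
        return "FAIL: data pipeline (meaning sections) broken"
    return "FAIL: both pipelines down; suspect server or cache load"
```

With a single canary, any of the last three outcomes collapses into one undifferentiated alert; the second canary is what makes the alert actionable.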
Possible addition (from str-ishizue review): quick_push_name.sh could spot-check N records from the cache file being deployed (validate types before pushing). Third layer after B2 (build-time) and B4 (freshness gate). Deferred — evaluate after B2/B4 are in place.
Phase D: Monitoring Canary Expansion — DONE (s51, deployed)
Goal: Every critical dynamic route has a synthetic check. "Available ≠ working" is the lesson — health_check confirms services are up, canaries confirm the product works.
Owner: str-mamori (monitoring) + str-takase (route identification)
Architecture decision (str-mamori s51)
Separate `product_canary.sh` script instead of expanding `cutover_watch.sh`. Reasons: (1) different purpose — cutover_watch monitors DNS/cutover safety, product_canary monitors "does the product work for customers?"; (2) different cadence — 5 min vs 15 min; (3) separation of concerns per `cron_registry_SPEC.md` design principles. The Timothy /d/ check stays in `cutover_watch.sh` as redundancy.
Deliverables
- [x] D1: Critical route inventory — DONE (str-takase s161, via investigate-takase). Three tiers. Tier 1 (revenue path): `/d/<hash>`, `POST /search`, `POST /api/checkout`, `POST /webhook/stripe`, `/download/<file_id>`, `/success`. Tier 2 (discovery): `/JapaneseCalligraphy/*`, `/custom/*` (10 routes), Builder APIs (6 routes). Tier 3 (content): `/library/*`, `/blog/*`, `/info/*`, `POST /info/contact`. Ready for str-mamori D2.
- [x] D2: Canary design — DONE (str-mamori s51). 4 canaries in new `product_canary.sh`: (1) `GET /d/7d0bd618` — design page + "Timothy" keyword, (2) `POST /search` — CSRF-aware two-step with session cookies, (3) `GET /JapaneseCalligraphy/Love` — word page service path, (4) `POST /api/validate-romaji` — builder API (lightweight, no image generation). Checkout/webhook/download/success NOT canary'd — they require real Stripe sessions and are monitored indirectly through the shared DB path. Tier 2-3 filesystem routes skipped (low data-dependency risk).
- [x] D3: Canary implementation — DONE (imp-redteam s51, deployed s51). `product_canary.sh` created — 4 checks, 5-min cron, retry-on-failure with 3s wait, transition-based alerting, alerts follow `alert_standard_SPEC.md`. CSRF finding: `/search` needs a session cookie + hidden-field token; `/api/validate-romaji` is CSRF-exempt. Deployed via `ship.sh`, cron installed, all 4 checks verified PASS on VPS.
- [x] D4: Process for new routes — DONE (str-mamori s51). Codified in `alert_standard_SPEC.md` §7 (new alert requirement) and PLN-013 F3 (monitoring coverage checklist). Concept-check added to str-mamori session state.
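The CSRF two-step in D2/D3 hinges on pulling the hidden token out of the rendered form before POSTing. A minimal sketch, assuming a Django-style `csrfmiddlewaretoken` hidden input (the site's actual field name and attribute order may differ):

```python
import re

def extract_csrf_token(html):
    """Pull the hidden CSRF token from a rendered form.

    Step 1 of the two-step: GET the page with a session cookie,
    extract this token, then POST it back as the hidden field.
    Returns None when no token is present.
    """
    match = re.search(r'name="csrfmiddlewaretoken"\s+value="([^"]+)"', html)
    return match.group(1) if match else None
```

In the real canary this runs between the GET and the POST, with the session cookie carried across both requests so the token and session match.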
Phase E: Alert Text Audit — DONE (s51, deployed)
Goal: Every Postmark alert answers three questions: (1) What happened? (2) How bad is it? (3) What do I do right now? Tim was sitting right here during the incident and couldn't act because the alert said "[CUTOVER-WATCH] http_500: ALERT" with no context.
Owner: str-mamori
E1 Key Findings (str-mamori s51)
imp-redteam audited all 7 VPS scripts + fail2ban + CrowdSec. Results: 14 distinct alert types across 3 alerting scripts (health_check, uptime_monitor, cutover_watch). 4 scripts have no email alerting (traffic_sentinel, scraping_detector, takase_backup, archive_logs). fail2ban and CrowdSec have no email notification configured.
- 3/3 GOOD (2 alerts): cutover_watch http_500, cutover_watch synthetic_page — both from Phase A (s50)
- 2/3 PARTIAL (8 alerts): all health_check alerts, uptime_monitor recovery, cutover_watch traffic/ip/search/redirect — have metrics but no investigation commands
- 1/3 POOR (4 alerts): uptime_monitor down, cutover_watch crowdsec_velocity/recidive/etl_processes — raw counts only
Deliverables
- [x] E1: Audit existing alerts — DONE (imp-redteam s51). Full inventory of all 14 alert types with trigger conditions, exact subject/body text, and actionability scores. Expanded scope beyond str-michi's 4-script list to cover all 7 VPS scripts + fail2ban + CrowdSec. Key gap: only 2/14 alerts are actionable, both written during the s50 incident.
- [x] E2: Alert template standard — DONE (str-mamori s51). `alert_standard_SPEC.md` — subject format (`[SYSTEM] SEVERITY: symptom on hostname`), body format (WHAT / SEVERITY / DETAILS / WHAT TO CHECK / ESCALATION), check-specific investigation commands table, transition-based alerting requirement, compliance checklist.
- [x] E3: Implement fixes — DONE (imp-redteam s51, deployed s51). Three scripts upgraded: health_check.sh v1.02→v1.03 (9 checks with specific investigation commands), uptime_monitor.sh v1.02→v1.03 (DOWN alert with curl/ssh/dig commands), cutover_watch.sh v1.07→v1.08 (7 alerts upgraded, 2 already-GOOD alerts preserved). No functional logic changed. Versions verified on VPS.
- [x] E4: Standard for new alerts — DONE (str-mamori s51). Codified in `alert_standard_SPEC.md` §7 — any new monitoring script or alert type must follow the standard before deployment.
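The E2 subject and body formats can be sketched as a formatter (illustrative only; the exact field wording in `alert_standard_SPEC.md` may differ):

```python
def format_alert(system, severity, symptom, hostname,
                 what, details, checks, escalation):
    """Render an alert in the spec's shape.

    Subject: [SYSTEM] SEVERITY: symptom on hostname
    Body:    WHAT / SEVERITY / DETAILS / WHAT TO CHECK / ESCALATION
    `checks` is a list of copy-paste investigation commands.
    """
    subject = f"[{system.upper()}] {severity.upper()}: {symptom} on {hostname}"
    body = "\n".join([
        f"WHAT: {what}",
        f"SEVERITY: {severity.upper()}",
        f"DETAILS: {details}",
        "WHAT TO CHECK:",
        *[f"  $ {cmd}" for cmd in checks],
        f"ESCALATION: {escalation}",
    ])
    return subject, body
```

The point of the structure is the incident lesson: every alert carries its own investigation commands, so the reader can act without first reconstructing context.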
Phase F: Standards for New Pipelines — DONE (s42)
Goal: Codify the lessons so new cross-domain data pipelines and deploy paths are built right the first time.
Owner: str-michi
Deliverables
- [x] F1: Cross-domain data interface checklist — DONE (str-michi s42). Added to `strategist_methodology_REFERENCE.md` § Cross-Domain Coordination. Four rules: write an interface spec, producer validates before write, consumer validates at load with safe defaults, health gates check types not just presence.
- [x] F2: Deploy verification checklist — DONE (str-michi s42). Added to `strategist_methodology_REFERENCE.md` § Cross-Domain Coordination. Four rules: smoke test the product not the service, test every pipeline, match test to deploy scope, write a status marker.
- [x] F3: Monitoring coverage checklist — DONE (str-michi s42). Added to `strategist_methodology_REFERENCE.md` § Cross-Domain Coordination. Four rules: add a synthetic canary, alert text per `alert_standard_SPEC.md`, register in `cron_registry_SPEC.md`, separate monitoring concerns. References str-mamori's D4 (§7 new alert requirement).
Phase G: Plan Verification — IN PROGRESS (s42)
Goal: Verify that what was implemented matches what was designed. "All phases complete" is not "plan succeeded." Every deliverable must be tested against its stated intent, not just confirmed as deployed.
Owner: str-michi coordinates verification. Domain owners fix gaps.
Why this phase was added: After declaring PLN-013 "6/6 complete," HITM-directed verification found: (1) ship.sh Aliya canary checks HTTP 200 only — body is discarded, meaning sections not tested, (2) product_canary.sh doesn't check Aliya at all, (3) Amy's cache data regressed (llm_concepts wiped by lightweight rebuild), (4) 387 of 5,171 LLM-populated records appear overwritten. The two-canary design was correct. The implementation was two status-code checks. "Available ≠ working" — applied to the plan itself.
Verification checklist
Phase B (Schema Contract):
- [x] G1: VERIFIED (str-michi s42, investigate-etl). Freshness gate B4 type checks work: `isinstance(person, dict)` + `name` key check. Full population: 77,801 blurb records, 0 type failures. Zero plain-string `selected_people` remain. 95.1% composite freshness (5,171 `variants` failures + 128 `llm_concepts` never-computed — separate issues).
- [ ] G2: Verify Amy regression — why did the lightweight builder overwrite her LLM-populated record? Are the 387 missing records the same issue? str-ishizue investigates.
- [x] G3: VERIFIED locally (str-michi s42, investigate-takase). `_validate_cache_fields()` at line 93 of `name_info_service.py` v1.02. Validates 6 fields. Called at line 88 on every `get_name_info()`. 14 tests pass. VPS deployment confirmed by str-takase s161 (quick_push) + ship.sh.
Phase C (Deploy Verification):
- [ ] G4: Fix ship.sh Aliya check — must capture the body and verify meaning-specific content is present (section header, kanji, or another indicator that the meaning pipeline rendered). HTTP 200 alone is insufficient. str-takase writes the imp-takase prompt.
- [ ] G5: Fix ship.sh Timothy check — same issue; should verify at minimum that design content is present, not just HTTP 200.
- [ ] G6: Verify quick_push.sh actually skips the smoke test on a data-only push (untested path).
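The G4/G5 fix boils down to asserting on the body, not just the status code. A minimal sketch of such a check (marker strings are placeholders, chosen per canary from actual rendered content):

```python
def check_canary_body(status, body, markers):
    """Pass only if the response is 200 AND every marker is in the body.

    A page can return 200 with a meaning section silently missing,
    which is exactly the gap G4 describes. Returns (ok, missing).
    """
    if status != 200:
        return False, [f"status {status}"]
    missing = [m for m in markers if m not in body]
    return (not missing), missing
```

A usage sketch: the caller would fetch the canary URL, then run `check_canary_body(resp.status, resp.text, ["Meaning", "漢字"])` and alert on any non-empty `missing` list.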
Phase D (Monitoring Canaries): Note: str-mamori deferred G7 claiming "no name has qualifying llm_concepts." investigate-etl found 4,784 records with populated `llm_concepts`, including Aliya (the existing canary). The discrepancy needs resolution — a VPS vs local cache difference?
- [ ] G7: product_canary.sh should verify meaning-pipeline content on at least one check. It currently checks the Timothy name string only. str-mamori evaluates whether to add Aliya or modify the existing design_page check.
- [x] G8: VERIFIED (str-mamori s51). Cron running, 5-min intervals confirmed.
- [x] G9: VERIFIED (str-mamori s51). TEST=1 alert delivered, format correct.
Phase E (Alert Audit):
- [x] G10: VERIFIED (str-mamori s51). health_check and cutover_watch test alerts confirmed. uptime_monitor has no test mode — noted as low priority, not blocking.
Phase Summary
| Phase | Description | Status | Owner |
|---|---|---|---|
| A | Immediate incident fixes | DONE | str-mamori, str-takase, str-ishizue |
| B | Name cache schema contract | DONE | str-michi coordinates |
| C | Post-deploy product verification | DONE | str-takase |
| D | Monitoring canary expansion | DONE | str-mamori + str-takase |
| E | Alert text audit | DONE | str-mamori |
| F | Standards for new pipelines | DONE | str-michi |
| G | Plan verification | IN PROGRESS | str-michi coordinates |
Created from 88-minute HTTP 500 incident (2026-03-21). Phase A deployed same day. Phases B-F build the structural defenses. Phase G added after HITM-directed verification found implementation gaps.
