PLN-013: Production Resilience — Schema Contracts, Deploy Verification, Monitoring Correctness (2026-03-21)
The team (Meet the Team has the full picture):
- str-michi (道) — cross-domain strategic thinking (plan owner, coordination)
- str-takase (高瀬) — website engineering (deploy verification, route inventory)
- str-ishizue — data pipelines (schema contract, cache validation)
- str-mamori (守り) — security (monitoring canaries, alert standards)
- All AI roles: Claude Opus 4.6, 1M context
Origin: 88-minute HTTP 500 incident (2026-03-21, 02:58–04:26 UTC). 442 of 1,670 requests failed (26.5%) on `/d/<hash>` design pages. Root cause: a `selected_people` data-shape mismatch — the ETL produced strings, the website expected dicts. Every individual component was healthy. The schema just didn't match.
Trigger (Tim): "Not only are we not building widgets, nor a factory, but a castle fortress that has a factory and makes widgets." The incident exposed three missing layers: schema contracts between domains, post-deploy product verification, and monitoring that checks correctness not just availability. These standards apply to everything we build going forward.
Philosophy: Never let a good disaster go to waste. The immediate fixes are deployed (Phase A). This plan builds the structural defenses so this class of problem doesn't recur — for the name cache, for future pipelines, and for every new dynamic route.
Phase A: Immediate Incident Fixes — DONE (s50/s96/s160)
All deployed same day as the incident.
str-mamori (s50, imp-redteam)
- [x] Synthetic `/d/<hash>` canary check in `cutover_watch.sh` (every 15 min, non-200 = immediate alert)
- [x] Count-based 500 threshold: 5+ errors in 15 min = alert regardless of percentage
- [x] Human-readable alert text with what/how-bad/what-to-do + copy-paste investigation commands
- [x] Confirmed custom 500 error page shows no stack traces (DEBUG=False working)
str-takase (s160, imp-takase)
- [x] isinstance guard in `_build_famous_context` (design_page_service.py:522-527) — handles both string and dict `selected_people` formats
- [x] `selected_people` audit: only one access point in the entire website codebase (the one fixed)
- [x] Concept-check added: "`selected_people` has two schemas — code must handle both"
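The shape of such a guard can be sketched as follows (a minimal, illustrative sketch; the function and variable names here are not the production code in design_page_service.py):

```python
def normalize_selected_people(selected_people):
    """Coerce mixed-format selected_people entries to the dict shape.

    Pre-s96 caches stored plain strings; post-s96 caches store
    {"name": ..., "katakana": ...} dicts. Anything else is dropped.
    """
    normalized = []
    for person in selected_people or []:
        if isinstance(person, dict) and "name" in person:
            normalized.append(person)       # already the target shape
        elif isinstance(person, str):
            # legacy pre-s96 entry: wrap it, leave katakana empty
            normalized.append({"name": person, "katakana": ""})
        # anything else (None, numbers, malformed dicts) is skipped
    return normalized
```

The key property is that both legacy and current cache formats come out in one shape, so downstream rendering code only ever sees dicts.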
str-ishizue (s96, imp-etl)
- [x] `name_cache_generate.py` wraps string entries as `{"name": s, "katakana": ""}` (dict format)
- [x] Katakana enrichment from `famous_names_17_lookup.csv` (Wikidata, 399K rows, zero LLM cost)
- [x] Full cache regen running (106K records, all-dict format with katakana)
- [x] Concept-check added: "`selected_people` must be dicts, not strings"
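The wrap-and-enrich step can be sketched like this (assumptions: an in-memory lookup keyed by name, and `name`/`katakana` column headers, which may not match the real CSV layout):

```python
import csv
import io

def load_katakana_lookup(csv_text):
    """Build a name -> katakana map from lookup CSV text.

    Column names ("name", "katakana") are assumptions for this
    sketch; the real famous_names_17_lookup.csv may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: row["katakana"] for row in reader if row.get("katakana")}

def wrap_and_enrich(entry, lookup):
    """Wrap a string entry as a dict and fill katakana from the lookup."""
    if isinstance(entry, str):
        entry = {"name": entry, "katakana": ""}
    if not entry.get("katakana"):
        # fall back to empty string when the name is not in the lookup
        entry["katakana"] = lookup.get(entry["name"], "")
    return entry
```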
Phase B: Name Cache Schema Contract — DONE (s42/s97/s161)
Goal: Formal interface spec between ETL cache output and website input. Both sides validate. A format change that breaks the contract is caught before deploy, not after 88 minutes of 500s.
Owner: str-michi coordinates. str-ishizue (output side) + str-takase (input side) implement.
Key findings (str-ishizue review, s42)
- 14 top-level fields. str-michi's original list had 7, but one of those (`selected_people`) is actually nested — so only 6 real top-level fields were identified and 8 were missing: `name`, `name_lower`, `source_versions`, `pronunciation`, `kaggle`, `etymology_raw`, `etymology`, `designs`. `selected_people` lives inside `famous_people[].selected_people`, NOT at top level.
- Two `famous_people` sub-schemas coexist: Schema A (blurb: `{romaji, language, blurb, people_count, selected_people}`) from both builders, and Schema B (list: `{romaji, language, people}`) from `build_full_cache.py` only. The website renders Schema A only. Schema B has no `selected_people` key — this is valid, not malformed.
- Type differences between builders: `llm_concepts` is `[]` in the lightweight builder vs `None` (initial) in the full builder. Same for `source_links` (`{}` vs `None`). The freshness gate handles this (`[]`/`{}` = valid, `None` = never computed).
- Three `selected_people` item shapes in the wild: plain strings (pre-s96), dicts with `name`+`katakana` (s96 fix), and dicts with `name`+`katakana`+`qid`. The spec must declare which are valid going forward.
Deliverables
- [x] B1: Interface spec — `name_cache_interface_SPEC.md` (v1.0.0, str-ishizue s97). All 14 top-level fields with types, constraints, validation rules. Both `famous_people` sub-schemas documented. `selected_people` dict requirement formalized. `None` vs empty semantics. Pending str-takase review.
- [x] B2: ETL-side validation — DONE (imp-etl, s97). `etl/scripts/name_cache/validate_record.py` — shared `validate_cache_record()` called by both builders before JSONL write. Checks: 14 required fields present, no `None` (except `kaggle`), variants non-empty, gender enum, `selected_people` are dicts with a `name` key, `generated_at` non-empty. 7-case self-test suite passes.
- [x] B3: Website-side validation — DONE (imp-takase, s161). `_validate_cache_fields()` in `name_info_service.py` v1.02 — validates 6 critical fields at load time (`selected_people`, `llm_concepts`, `variants`, `famous_people`, `source_links`, `gender`). Wrong types → WARNING log + safe default, never crash. `selected_people`: filters non-dict entries (keeps valid dicts from mixed lists). 14 tests pass. Deployed via quick_push.
- [x] B4: Freshness gate schema check — DONE (imp-etl, s97). Extended `check_cache_freshness.py` with type validation inside the existing 7-check structure: `variants` items have `romaji`+`pronunciation` keys, `selected_people` items are dicts with a `name` key (the s96 rule), `llm_concepts` items are dicts, `source_links` is a dict.
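The producer-side rules in B2 can be illustrated with a toy validator (a sketch of the checks described above, not the real `validate_cache_record()`; any structure beyond what this plan states is an assumption):

```python
def validate_cache_record_sketch(record, required_fields):
    """Producer-side checks in the spirit of B2.

    Rules sketched: required fields present, no None values
    (kaggle excepted), and selected_people items must be dicts
    with a "name" key. Returns error strings; empty means valid.
    """
    errors = []
    for field in required_fields:
        if field not in record:
            errors.append(f"missing field: {field}")
        elif record[field] is None and field != "kaggle":
            errors.append(f"None not allowed: {field}")
    # selected_people is nested inside famous_people[].selected_people;
    # Schema B entries have no selected_people key, which is valid.
    for fp in record.get("famous_people") or []:
        for person in fp.get("selected_people") or []:
            if not isinstance(person, dict) or "name" not in person:
                errors.append(f"bad selected_people item: {person!r}")
    return errors
```

Calling this before the JSONL write (as B2 does for both builders) turns a silent schema drift into a hard build-time failure.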
Phase C: Post-Deploy Product Verification — DONE (s161)
Goal: Every deploy that touches the product (ship.sh, quick_push.sh for service files) automatically verifies that the product works, not just that the server is up.
Owner: str-takase
Deliverables
- [x] C1: ship.sh smoke test — DONE (imp-takase, s161). `ship.sh` v1.03 — checks `/health` + `/d/7d0bd618` (Timothy: kana/phrase) + `/d/92f7c0d5` (Aliya: meaning sections). Writes `.last_deploy_status` (timestamp + PASS/FAIL + failed checks). WARNING on failure, no abort.
- [x] C2: quick_push.sh smoke test — DONE (imp-takase, s161). Checks `/d/7d0bd618` after Gunicorn restart only. Skipped on data-only pushes. WARNING on failure.
- [x] C3: str-takase onboarding spot-check — the rotation list must include at least one `/d/<hash>` URL. Static pages aren't enough — the product is dynamic. (str-takase s161: doc change, doing directly.)
Two-canary approach (s42 correction): Amy only renders 2/4 sections (`llm_concepts` is empty — Sections 3-4 don't render). Using two canary names: one that tests the engine pipeline (kana/phrase sections) and one that tests the data pipeline (meaning sections). A failure pinpoints which pipeline broke. imp-takase selects the names from actual cache data.
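The pinpointing the two-canary design buys can be stated as a small decision table (an illustrative sketch, not code from any deployed script):

```python
def diagnose_canaries(engine_ok, data_ok):
    """Map the two canary results to a diagnosis.

    engine_ok: did the kana/phrase-section canary pass?
    data_ok:   did the meaning-section canary pass?
    """
    if engine_ok and data_ok:
        return "PASS: both pipelines rendering"
    if not engine_ok and data_ok:
        return "FAIL: engine pipeline (kana/phrase) broken"
    if engine_ok and not data_ok:
        return "FAIL: data pipeline (meaning sections) broken"
    return "FAIL: both pipelines down; suspect server or cache load"
```

With a single canary, any of the last three outcomes collapses into one undifferentiated alert; the second canary is what makes the alert actionable.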
Possible addition (from str-ishizue review): quick_push_name.sh could spot-check N records from the cache file being deployed (validate types before pushing). Third layer after B2 (build-time) and B4 (freshness gate). Deferred — evaluate after B2/B4 are in place.
Phase D: Monitoring Canary Expansion — DONE (s51, deployed)
Goal: Every critical dynamic route has a synthetic check. "Available ≠ working" is the lesson — health_check confirms services are up, canaries confirm the product works.
Owner: str-mamori (monitoring) + str-takase (route identification)
Architecture decision (str-mamori s51)
Separate `product_canary.sh` script instead of expanding `cutover_watch.sh`. Reasons: (1) different purpose — cutover_watch monitors DNS/cutover safety, product_canary monitors "does the product work for customers?"; (2) different cadence — 5 min vs 15 min; (3) separation of concerns per `cron_registry_SPEC.md` design principles. The Timothy /d/ check stays in `cutover_watch.sh` as redundancy.
Deliverables
- [x] D1: Critical route inventory — DONE (str-takase s161, via investigate-takase). Three tiers. Tier 1 (revenue path): `/d/<hash>`, `POST /search`, `POST /api/checkout`, `POST /webhook/stripe`, `/download/<file_id>`, `/success`. Tier 2 (discovery): `/JapaneseCalligraphy/*`, `/custom/*` (10 routes), Builder APIs (6 routes). Tier 3 (content): `/library/*`, `/blog/*`, `/info/*`, `POST /info/contact`. Ready for str-mamori D2.
- [x] D2: Canary design — DONE (str-mamori s51). 4 canaries in new `product_canary.sh`: (1) `GET /d/7d0bd618` — design page + "Timothy" keyword, (2) `POST /search` — CSRF-aware two-step with session cookies, (3) `GET /JapaneseCalligraphy/Love` — word page service path, (4) `POST /api/validate-romaji` — builder API (lightweight, no image generation). Checkout/webhook/download/success NOT canary'd — they require real Stripe sessions and are monitored indirectly through the shared DB path. Tier 2-3 filesystem routes skipped (low data-dependency risk).
- [x] D3: Canary implementation — DONE (imp-redteam s51, deployed s51). `product_canary.sh` created — 4 checks, 5-min cron, retry-on-failure with 3s wait, transition-based alerting, alerts follow `alert_standard_SPEC.md`. CSRF finding: `/search` needs a session cookie + hidden-field token; `/api/validate-romaji` is CSRF-exempt. Deployed via `ship.sh`, cron installed, all 4 checks verified PASS on VPS.
- [x] D4: Process for new routes — DONE (str-mamori s51). Codified in `alert_standard_SPEC.md` §7 (new alert requirement) and PLN-013 F3 (monitoring coverage checklist). Concept-check added to str-mamori session state.
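The CSRF two-step in D2/D3 hinges on pulling the hidden token out of the rendered form before POSTing. A minimal sketch, assuming a Django-style `csrfmiddlewaretoken` hidden input (the site's actual field name and attribute order may differ):

```python
import re

def extract_csrf_token(html):
    """Pull the hidden CSRF token from a rendered form.

    Step 1 of the two-step: GET the page with a session cookie,
    extract this token, then POST it back as the hidden field.
    Returns None when no token is present.
    """
    match = re.search(r'name="csrfmiddlewaretoken"\s+value="([^"]+)"', html)
    return match.group(1) if match else None
```

In the real canary this runs between the GET and the POST, with the session cookie carried across both requests so the token and session match.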
Phase E: Alert Text Audit — DONE (s51, deployed)
Goal: Every Postmark alert answers three questions: (1) What happened? (2) How bad is it? (3) What do I do right now? Tim was sitting right here during the incident and couldn't act because the alert said "[CUTOVER-WATCH] http_500: ALERT" with no context.
Owner: str-mamori
E1 Key Findings (str-mamori s51)
imp-redteam audited all 7 VPS scripts + fail2ban + CrowdSec. Results: 14 distinct alert types across 3 alerting scripts (health_check, uptime_monitor, cutover_watch). 4 scripts have no email alerting (traffic_sentinel, scraping_detector, takase_backup, archive_logs). fail2ban and CrowdSec have no email notification configured.
- 3/3 GOOD (2 alerts): cutover_watch http_500, cutover_watch synthetic_page — both from Phase A (s50)
- 2/3 PARTIAL (8 alerts): all health_check alerts, uptime_monitor recovery, cutover_watch traffic/ip/search/redirect — have metrics but no investigation commands
- 1/3 POOR (4 alerts): uptime_monitor down, cutover_watch crowdsec_velocity/recidive/etl_processes — raw counts only
Deliverables
- [x] E1: Audit existing alerts — DONE (imp-redteam s51). Full inventory of all 14 alert types with trigger conditions, exact subject/body text, and actionability scores. Expanded scope beyond str-michi's 4-script list to cover all 7 VPS scripts + fail2ban + CrowdSec. Key gap: only 2/14 alerts are actionable, both written during the s50 incident.
- [x] E2: Alert template standard — DONE (str-mamori s51). `alert_standard_SPEC.md` — subject format (`[SYSTEM] SEVERITY: symptom on hostname`), body format (WHAT / SEVERITY / DETAILS / WHAT TO CHECK / ESCALATION), check-specific investigation commands table, transition-based alerting requirement, compliance checklist.
- [x] E3: Implement fixes — DONE (imp-redteam s51, deployed s51). Three scripts upgraded: health_check.sh v1.02→v1.03 (9 checks with specific investigation commands), uptime_monitor.sh v1.02→v1.03 (DOWN alert with curl/ssh/dig commands), cutover_watch.sh v1.07→v1.08 (7 alerts upgraded, 2 already-GOOD alerts preserved). No functional logic changed. Versions verified on VPS.
- [x] E4: Standard for new alerts — DONE (str-mamori s51). Codified in `alert_standard_SPEC.md` §7 — any new monitoring script or alert type must follow the standard before deployment.
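The E2 subject and body formats can be sketched as a formatter (illustrative only; the exact field wording in `alert_standard_SPEC.md` may differ):

```python
def format_alert(system, severity, symptom, hostname,
                 what, details, checks, escalation):
    """Render an alert in the spec's shape.

    Subject: [SYSTEM] SEVERITY: symptom on hostname
    Body:    WHAT / SEVERITY / DETAILS / WHAT TO CHECK / ESCALATION
    `checks` is a list of copy-paste investigation commands.
    """
    subject = f"[{system.upper()}] {severity.upper()}: {symptom} on {hostname}"
    body = "\n".join([
        f"WHAT: {what}",
        f"SEVERITY: {severity.upper()}",
        f"DETAILS: {details}",
        "WHAT TO CHECK:",
        *[f"  $ {cmd}" for cmd in checks],
        f"ESCALATION: {escalation}",
    ])
    return subject, body
```

The point of the structure is the incident lesson: every alert carries its own investigation commands, so the reader can act without first reconstructing context.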
Phase F: Standards for New Pipelines — DONE (s42)
Goal: Codify the lessons so new cross-domain data pipelines and deploy paths are built right the first time.
Owner: str-michi
Deliverables
- [x] F1: Cross-domain data interface checklist — DONE (str-michi s42). Added to `strategist_methodology_REFERENCE.md` § Cross-Domain Coordination. Four rules: write an interface spec, producer validates before write, consumer validates at load with safe defaults, health gates check types not just presence.
- [x] F2: Deploy verification checklist — DONE (str-michi s42). Added to `strategist_methodology_REFERENCE.md` § Cross-Domain Coordination. Four rules: smoke test the product not the service, test every pipeline, match test to deploy scope, write a status marker.
- [x] F3: Monitoring coverage checklist — DONE (str-michi s42). Added to `strategist_methodology_REFERENCE.md` § Cross-Domain Coordination. Four rules: add a synthetic canary, alert text per `alert_standard_SPEC.md`, register in `cron_registry_SPEC.md`, separate monitoring concerns. References str-mamori's D4 (§7 new alert requirement).
Phase G: Plan Verification — IN PROGRESS (s42)
Goal: Verify that what was implemented matches what was designed. "All phases complete" is not "plan succeeded." Every deliverable must be tested against its stated intent, not just confirmed as deployed.
Owner: str-michi coordinates verification. Domain owners fix gaps.
Why this phase was added: After declaring PLN-013 "6/6 complete," HITM-directed verification found: (1) ship.sh Aliya canary checks HTTP 200 only — body is discarded, meaning sections not tested, (2) product_canary.sh doesn't check Aliya at all, (3) Amy's cache data regressed (llm_concepts wiped by lightweight rebuild), (4) 387 of 5,171 LLM-populated records appear overwritten. The two-canary design was correct. The implementation was two status-code checks. "Available ≠ working" — applied to the plan itself.
Verification checklist
Phase B (Schema Contract):
- [x] G1: VERIFIED (str-michi s42, investigate-etl). Freshness gate B4 type checks work: `isinstance(person, dict)` + `name` key check. Full population: 77,801 blurb records, 0 type failures. Zero plain-string `selected_people` remain. 95.1% composite freshness (5,171 `variants` failures + 128 `llm_concepts` never-computed — separate issues).
- [ ] G2: Verify Amy regression — why did the lightweight builder overwrite her LLM-populated record? Are the 387 missing records the same issue? str-ishizue investigates.
- [x] G3: VERIFIED locally (str-michi s42, investigate-takase). `_validate_cache_fields()` at line 93 of `name_info_service.py` v1.02. Validates 6 fields. Called at line 88 on every `get_name_info()`. 14 tests pass. VPS deployment confirmed by str-takase s161 (quick_push) + ship.sh.
Phase C (Deploy Verification):
- [ ] G4: Fix ship.sh Aliya check — must capture the body and verify meaning-specific content is present (section header, kanji, or another indicator that the meaning pipeline rendered). HTTP 200 alone is insufficient. str-takase writes the imp-takase prompt.
- [ ] G5: Fix ship.sh Timothy check — same issue; should verify at minimum that design content is present, not just HTTP 200.
- [ ] G6: Verify quick_push.sh actually skips the smoke test on a data-only push (untested path).
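The G4/G5 fix boils down to asserting on the body, not just the status code. A minimal sketch of such a check (marker strings are placeholders, chosen per canary from actual rendered content):

```python
def check_canary_body(status, body, markers):
    """Pass only if the response is 200 AND every marker is in the body.

    A page can return 200 with a meaning section silently missing,
    which is exactly the gap G4 describes. Returns (ok, missing).
    """
    if status != 200:
        return False, [f"status {status}"]
    missing = [m for m in markers if m not in body]
    return (not missing), missing
```

A usage sketch: the caller would fetch the canary URL, then run `check_canary_body(resp.status, resp.text, ["Meaning", "漢字"])` and alert on any non-empty `missing` list.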
Phase D (Monitoring Canaries): Note: str-mamori deferred G7 claiming "no name has qualifying llm_concepts." investigate-etl found 4,784 records with populated `llm_concepts`, including Aliya (the existing canary). The discrepancy needs resolution — a VPS vs local cache difference?
- [ ] G7: product_canary.sh should verify meaning-pipeline content on at least one check. It currently checks the Timothy name string only. str-mamori evaluates whether to add Aliya or modify the existing design_page check.
- [x] G8: VERIFIED (str-mamori s51). Cron running, 5-min intervals confirmed.
- [x] G9: VERIFIED (str-mamori s51). TEST=1 alert delivered, format correct.
Phase E (Alert Audit):
- [x] G10: VERIFIED (str-mamori s51). health_check and cutover_watch test alerts confirmed. uptime_monitor has no test mode — noted as low priority, not blocking.
Phase Summary
| Phase | Description | Status | Owner |
|---|---|---|---|
| A | Immediate incident fixes | DONE | str-mamori, str-takase, str-ishizue |
| B | Name cache schema contract | DONE | str-michi coordinates |
| C | Post-deploy product verification | DONE | str-takase |
| D | Monitoring canary expansion | DONE | str-mamori + str-takase |
| E | Alert text audit | DONE | str-mamori |
| F | Standards for new pipelines | DONE | str-michi |
| G | Plan verification | IN PROGRESS | str-michi coordinates |
Created from 88-minute HTTP 500 incident (2026-03-21). Phase A deployed same day. Phases B-F build the structural defenses. Phase G added after HITM-directed verification found implementation gaps.
