
PLN-013: Production Resilience — Schema Contracts, Deploy Verification, Monitoring Correctness (2026-03-21)

The team (Meet the Team has the full picture):

- str-michi (道) — cross-domain strategic thinking (plan owner, coordination)
- str-takase (高瀬) — website engineering (deploy verification, route inventory)
- str-ishizue — data pipelines (schema contract, cache validation)
- str-mamori (守り) — security (monitoring canaries, alert standards)
- All AI roles: Claude Opus 4.6, 1M context


Origin: 88-minute HTTP 500 incident (2026-03-21, 02:58–04:26 UTC). 442 of 1,670 requests failed (26.5%) on /d/<hash> design pages. Root cause: a selected_people data shape mismatch — the ETL produced strings where the website expected dicts. Every individual component was healthy; the schema just didn't match.
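The failure class can be illustrated with a minimal sketch. The field name selected_people comes from the incident; the normalization helper and the example values are hypothetical:

```python
def normalize_selected_people(raw):
    """Coerce legacy string entries into the dict shape the website expects.

    Hypothetical helper illustrating the incident: the ETL emitted plain
    strings ("Aiko") while the renderer assumed dicts ({"name": "Aiko"}).
    A renderer that indexed person["name"] on a plain string raised,
    producing the 500s.
    """
    normalized = []
    for person in raw:
        if isinstance(person, dict) and "name" in person:
            normalized.append(person)          # already the expected shape
        elif isinstance(person, str):
            normalized.append({"name": person})  # legacy string entry
        else:
            raise TypeError(f"unsupported selected_people entry: {person!r}")
    return normalized

# Mixed legacy (string) and current (dict) entries are both accepted:
print(normalize_selected_people(["Aiko", {"name": "Yuki"}]))
```
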

Trigger (Tim): "Not only are we not building widgets, nor a factory, but a castle fortress that has a factory and makes widgets." The incident exposed three missing layers: schema contracts between domains, post-deploy product verification, and monitoring that checks correctness not just availability. These standards apply to everything we build going forward.

Philosophy: Never let a good disaster go to waste. The immediate fixes are deployed (Phase A). This plan builds the structural defenses so this class of problem doesn't recur — for the name cache, for future pipelines, and for every new dynamic route.


Phase A: Immediate Incident Fixes — DONE (s50/s96/s160)

All deployed same day as the incident.

- str-mamori (s50, imp-redteam)
- str-takase (s160, imp-takase)
- str-ishizue (s96, imp-etl)


Phase B: Name Cache Schema Contract — DONE (s42/s97/s161)

Goal: Formal interface spec between ETL cache output and website input. Both sides validate. A format change that breaks the contract is caught before deploy, not after 88 minutes of 500s.

Owner: str-michi coordinates. str-ishizue (output side) + str-takase (input side) implement.
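One way to make "both sides validate" concrete is a shared validator that the ETL calls on write and the website calls on read, so drift is caught on whichever side runs first. A minimal sketch (the field list is illustrative, not the actual cache spec):

```python
# Illustrative two-sided schema contract check. REQUIRED_FIELDS is a stand-in
# for the real contract; only the selected_people rule comes from the incident.
REQUIRED_FIELDS = {"name": str, "selected_people": list}

def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations for one cache record."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    # selected_people entries must be dicts with a 'name' key — the exact
    # shape mismatch that caused the incident.
    for person in record.get("selected_people", []):
        if not (isinstance(person, dict) and "name" in person):
            errors.append(f"bad selected_people entry: {person!r}")
    return errors
```

A format change that breaks the contract then fails loudly at build time on the ETL side, or at load time on the website side, rather than at render time for a customer.
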

Key findings (str-ishizue review, s42)

Deliverables


Phase C: Post-Deploy Product Verification — DONE (s161)

Goal: Every deploy that touches the product (ship.sh, quick_push.sh for service files) automatically verifies that the product works, not just that the server is up.

Owner: str-takase

Deliverables

Two-canary approach (s42 correction): Amy only renders 2 of 4 sections (her llm_concepts is empty, so Sections 3-4 don't render). We therefore use two canary names: one that tests the engine pipeline (kana/phrase sections) and one that tests the data pipeline (meaning sections). A failure pinpoints which pipeline broke. imp-takase selects the names from actual cache data.
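The diagnostic value of the two-canary design is that the pair of outcomes maps directly to a diagnosis. A sketch of that mapping (the diagnosis strings are illustrative):

```python
def diagnose_canaries(engine_ok: bool, data_ok: bool) -> str:
    """Map the two canary outcomes to a pipeline diagnosis (sketch).

    engine_ok: canary exercising the engine pipeline (kana/phrase sections)
    data_ok:   canary exercising the data pipeline (meaning sections)
    """
    if engine_ok and data_ok:
        return "healthy"
    if not engine_ok and data_ok:
        return "engine pipeline broken (kana/phrase sections)"
    if engine_ok and not data_ok:
        return "data pipeline broken (meaning sections)"
    return "both pipelines failing — likely shared infrastructure"
```
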

Possible addition (from str-ishizue review): quick_push_name.sh could spot-check N records from the cache file being deployed (validate types before pushing). Third layer after B2 (build-time) and B4 (freshness gate). Deferred — evaluate after B2/B4 are in place.


Phase D: Monitoring Canary Expansion — DONE (s51, deployed)

Goal: Every critical dynamic route has a synthetic check. "Available ≠ working" is the lesson — health_check confirms services are up, canaries confirm the product works.
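The "available ≠ working" distinction reduces to what each check actually asserts. A minimal sketch (the expected_text parameter is illustrative):

```python
def health_ok(status_code: int) -> bool:
    """Availability: the service answered at all."""
    return status_code == 200

def product_ok(status_code: int, body: str, expected_text: str) -> bool:
    """Correctness: the service answered AND the response body contains
    something a customer would actually see (expected_text is a stand-in
    for a real content marker chosen per route)."""
    return status_code == 200 and expected_text in body
```

During the incident, health_ok-style checks passed throughout; only a product_ok-style check on a /d/<hash> page body would have caught the 500s' cause class earlier.
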

Owner: str-mamori (monitoring) + str-takase (route identification)

Architecture decision (str-mamori s51)

Separate product_canary.sh script instead of expanding cutover_watch.sh. Reasons: (1) different purpose — cutover_watch monitors DNS/cutover safety, product_canary monitors "does the product work for customers?"; (2) different cadence — 5 min vs 15 min; (3) separation of concerns per cron_registry_SPEC.md design principles. Timothy /d/ check stays in cutover_watch.sh as redundancy.

Deliverables


Phase E: Alert Text Audit — DONE (s51, deployed)

Goal: Every Postmark alert answers three questions: (1) What happened? (2) How bad is it? (3) What do I do right now? Tim was sitting right here during the incident and couldn't act because the alert said "[CUTOVER-WATCH] http_500: ALERT" with no context.
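The three-question standard can be enforced by construction: a formatter that simply cannot emit an alert missing one of the answers. A sketch (field labels and the example values are illustrative, not the actual Postmark template):

```python
def format_alert(what: str, severity: str, impact: str, action: str) -> str:
    """Compose an alert body that answers: what happened, how bad is it,
    and what do I do right now. All four arguments are required, so a
    context-free alert like 'http_500: ALERT' is unrepresentable."""
    return (
        f"WHAT:   {what}\n"
        f"IMPACT: {severity} — {impact}\n"
        f"ACTION: {action}\n"
    )

# Hypothetical rendering of the incident alert under this standard:
print(format_alert(
    what="HTTP 500 on /d/<hash> design pages (26.5% of requests failing)",
    severity="CRITICAL",
    impact="customers cannot view design pages",
    action="check selected_people cache shape; roll back last ETL push if mismatched",
))
```
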

Owner: str-mamori

E1 Key Findings (str-mamori s51)

imp-redteam audited all 7 VPS scripts + fail2ban + CrowdSec. Results: 14 distinct alert types across 3 alerting scripts (health_check, uptime_monitor, cutover_watch). 4 scripts have no email alerting (traffic_sentinel, scraping_detector, takase_backup, archive_logs). fail2ban and CrowdSec have no email notification configured.

Deliverables


Phase F: Standards for New Pipelines — DONE (s42)

Goal: Codify the lessons so new cross-domain data pipelines and deploy paths are built right the first time.

Owner: str-michi

Deliverables


Phase G: Plan Verification — IN PROGRESS (s42)

Goal: Verify that what was implemented matches what was designed. "All phases complete" is not "plan succeeded." Every deliverable must be tested against its stated intent, not just confirmed as deployed.

Owner: str-michi coordinates verification. Domain owners fix gaps.

Why this phase was added: After declaring PLN-013 "6/6 complete," HITM-directed verification found: (1) ship.sh Aliya canary checks HTTP 200 only — body is discarded, meaning sections not tested, (2) product_canary.sh doesn't check Aliya at all, (3) Amy's cache data regressed (llm_concepts wiped by lightweight rebuild), (4) 387 of 5,171 LLM-populated records appear overwritten. The two-canary design was correct. The implementation was two status-code checks. "Available ≠ working" — applied to the plan itself.

Verification checklist

Phase B (Schema Contract):

- [x] G1: VERIFIED (str-michi s42, investigate-etl). Freshness gate B4 type checks work: isinstance(person, dict) + name key check. Full population: 77,801 blurb records, 0 type failures. Zero plain-string selected_people remain. 95.1% composite freshness (5,171 variants failures + 128 llm_concepts never-computed — separate issues).
- [ ] G2: Verify Amy regression — why did the lightweight builder overwrite her LLM-populated record? Are the 387 missing records the same issue? str-ishizue investigates.
- [x] G3: VERIFIED locally (str-michi s42, investigate-takase). _validate_cache_fields() at line 93 of name_info_service.py v1.02. Validates 6 fields. Called at line 88 on every get_name_info(). 14 tests pass. VPS deployment confirmed by str-takase s161 (quick_push) + ship.sh.

Phase C (Deploy Verification):

- [ ] G4: Fix ship.sh Aliya check — must capture the response body and verify meaning-specific content is present (section header, kanji, or other indicator that the meaning pipeline rendered). HTTP 200 alone is insufficient. str-takase writes the imp-takase prompt.
- [ ] G5: Fix ship.sh Timothy check — same issue; should verify at minimum that design content is present, not just HTTP 200.
- [ ] G6: Verify quick_push.sh actually skips the smoke test on a data-only push (untested path).
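The G4 fix reduces to checking the body, not just the status. A sketch of the content check, assuming the deploy script hands the fetched HTML to a small Python helper (the "Meaning" marker is hypothetical; imp-takase would pick real indicators from actual pages):

```python
import re

def meaning_sections_rendered(body: str) -> bool:
    """Return True if the page body shows the meaning pipeline rendered.

    Sketch of the G4 fix: an HTTP 200 serving an empty shell must fail
    this check. Both markers below are assumptions, not the real ones:
    a section-header string and the presence of at least one CJK ideograph.
    """
    has_section_header = "Meaning" in body              # hypothetical marker
    has_kanji = re.search(r"[\u4e00-\u9fff]", body)     # any CJK ideograph
    return has_section_header and bool(has_kanji)
```

A deploy gate would then be: fetch the Aliya page, require status 200 AND meaning_sections_rendered(body), and roll back otherwise.
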

Phase D (Monitoring Canaries):

Note: str-mamori deferred G7, claiming "no name has qualifying llm_concepts." investigate-etl found 4,784 records with populated llm_concepts, including Aliya (the existing canary). The discrepancy needs resolution — a VPS vs local cache difference?

- [ ] G7: product_canary.sh should verify meaning pipeline content on at least one check. Currently it checks the Timothy name string only. str-mamori evaluates whether to add Aliya or modify the existing design_page check.
- [x] G8: VERIFIED (str-mamori s51). Cron running, 5-min intervals confirmed.
- [x] G9: VERIFIED (str-mamori s51). TEST=1 alert delivered, format correct.

Phase E (Alert Audit):

- [x] G10: VERIFIED (str-mamori s51). health_check and cutover_watch test alerts confirmed. uptime_monitor has no test mode — noted as low priority, not blocking.


Phase Summary

| Phase | Description | Status | Owner |
|-------|-------------|--------|-------|
| A | Immediate incident fixes | DONE | str-mamori, str-takase, str-ishizue |
| B | Name cache schema contract | DONE | str-michi coordinates |
| C | Post-deploy product verification | DONE | str-takase |
| D | Monitoring canary expansion | DONE | str-mamori + str-takase |
| E | Alert text audit | DONE | str-mamori |
| F | Standards for new pipelines | DONE | str-michi |
| G | Plan verification | IN PROGRESS | str-michi coordinates |

Created from 88-minute HTTP 500 incident (2026-03-21). Phase A deployed same day. Phases B-F build the structural defenses. Phase G added after HITM-directed verification found implementation gaps.