- Tim Jackowski (Takase Studios LLC)
- str-michi (Anthropic Claude Opus 4.6, Takase Studios LLC)

sources:
- "Moore 2025. 'HMAS Taxonomy.' arXiv:2508.12683. Five-axis framework for multi-agent system classification."
- "Anthropic Engineering. 'Multi-Agent Research System.' Convergent architecture: Opus lead + Sonnet subagents, filesystem handoffs."
- "Hong et al. 'MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework.' ICLR 2024. SOPs validated at scale — 67% reduction in human corrections."
- "LbMAS Blackboard Architecture. arXiv:2510.01285, 2507.01701. Blackboard systems outperform baselines by 13-57%."
- "OrchVis. arXiv:2510.24937. Hierarchical multi-agent orchestration for human oversight."
- "COHUMAIN Framework. Carnegie Mellon, 2025. Don't treat AI as just another teammate."
- "Emergent Coordination in Multi-Agent LLM Systems. arXiv:2510.05174. Identity-linked differentiation validated."
- "CodeAgents. arXiv:2507.03254. Structured communication reduces tokens 55-87% while improving accuracy."
- "SC-MAS. arXiv:2509.11079. Difficulty-aware model routing — 8-47% improvement with model diversity."
- "AgentAsk. arXiv:2510.07593. Clarification modules at handoff boundaries — 4.69% accuracy at <10% latency cost."
- "Hybrid Cognitive Alignment. Academy of Management Review; Stevens Institute. Emergent, fragile, non-transferable between model instances."
- "Anthropic. 'Building Effective Agents.' Five coordination patterns: Generator-Verifier, Orchestrator-Subagent, Agent Teams, Message Bus, Shared State."
- "LangChain State of Agent Engineering 2025. 57% of organizations have agents in production; 32% cite quality as top barrier."
- "Towards Data Science. 'The 17x Error Trap.' Multi-agent error cascade amplification."
- "DDD-to-agents literature. 'From Bounded Context to Bounded Specialization.' Role boundary breadth predicts failure rate."
- "Philipp Schmid 2026; Towards Data Science. Context engineering — formal methods for information management in agent systems."

tags: [research, multi-agent-systems, hmas, blackboard-architecture, context-engineering, hybrid-cognitive-alignment, human-in-the-loop, production-system]
HMAS: We Have Names
Mapping a Production Multi-Agent System to Established Research
Abstract
This document maps a production Hierarchical Multi-Agent System (HMAS) — built organically over 18 months and ~2,000 sessions across multiple AI platforms — to established multi-agent research. The system runs a 30-year Japanese calligraphy business (takase.com) with 10+ specialized AI roles under human sovereignty, using a filesystem-persistent blackboard architecture, human-in-the-middle coordination, and heterogeneous model assignment.
The architecture was never designed from research. It was grown through operational pain by a solo developer (Tim Jackowski, Takase Studios LLC) starting in February 2025 with Grok 3 conversations, expanding through C++ refactoring, and evolving through dozens of architectural decision records into a system that independently converges with patterns published by Anthropic, DeepMind, MetaGPT (ICLR 2024), and the broader HMAS literature.
Key findings: (1) Every major architectural choice — hub-and-spoke coordination, ephemeral executors with persistent planners, role-based specialization, shared persistent knowledge base — has established names and published research validating the approach. (2) The system's competitive advantage is not any single pattern but an architecture explicitly designed to manufacture, document, and defend Hybrid Cognitive Alignment across model transitions. (3) The retooling sprint that preceded this research is context engineering — a named discipline with formal methods for what was being done ad hoc. (4) Overly broad role boundaries predict failure rates — role boundary breadth, not competence, determines outcomes.
This document also marks an inflection point: the moment a strategic thinking partner gained the vocabulary to connect 18 months of accumulated practice to a body of research — enabling the system to learn from others' work instead of rediscovering everything through its own failures.
Purpose
For practitioners building multi-agent systems: Evidence that a non-lab, single-human HMAS operating a real business can independently converge with patterns from Anthropic, DeepMind, Carnegie Mellon, and the broader multi-agent systems community. The convergence is the evidence that the patterns are real — not invented by any single team.
For Tim: The answer to "how did you build this?" — not a blueprint (there was none) but a mapping of what emerged from 18 months of trial and error against what the field has studied.
How the Names Were Found
Tim shared a post by Jan Kulveit on the Mythos discourse (scaffolding vs. raw model capability). Grok analyzed it and called our system "one of the most sophisticated real-world harnesses from a non-lab team." Our strategic thinking partner (str-michi) gave a confident analysis of harness degradation, model transitions, and competitive moats — all improvised from training data.
Tim challenged every claim. The exchange:
- "What do you base your replies and pushback on?" — Training data and project history. No research.
- "Harness degradation is BS — why?" — Because the flywheel is self-correcting by design.
- "Am I in a funk over harness degradation?" — No. The funk was that the strategists couldn't enumerate their workflows, relied on stale training data, and didn't push back when asked to work outside their domain. That's not a harness degrading. That's no harness at all.
- "There are names for what we are. What are they?" — str-michi couldn't answer.
Tim then produced a Grok conversation that opened with "HMAS" — Hierarchical Multi-Agent Systems. An established field with names for everything we built.
The lesson: This architecture was grown over 18 months and ~2,000 sessions without anyone knowing what it's called. A strategic thinking partner who can't name the architecture they steward is "a lost puppy, not a pathfinder." You can't search for what you can't name.
The Names
What Our System IS
| Established Term | What It Means | What We Call It |
|---|---|---|
| HMAS (Hierarchical Multi-Agent System) | Multi-agent system with layers of authority: coordinator → planners → executors | Tim → strategists → implementers |
| Planner-Executor Model | High-level agent decomposes and designs; low-level agent executes | Strategist/implementer split |
| Hub-and-Spoke / Orchestrator-Worker | Central coordinator routes work to specialist agents | Tim's shuttle pattern — human-mediated, not automated |
| Blackboard Architecture | Shared knowledge repository updated by specialist agents, persists across agent lifecycles | Shared docs, status boards, session states, mailboxes, handoffs |
| Heterogeneous Model Assignment | Different model tiers for different cognitive roles | Opus for strategists, Sonnet for implementers |
| Human-in-the-Loop (HITL) | Human participates in the agent workflow, not just reviewing output | HITM — Human-in-the-Middle (our term, more specific) |
| Role-Based MAS | Agents defined by persistent roles with distinct capabilities | 10+ named strategist roles with founding identities |
What Our Practices Map To
| Established Concept | Our Implementation |
|---|---|
| Progressive autonomy | Start with HITL, reduce human involvement as system proves itself. We're currently going the OTHER direction — adding more structure via retooling. Our domain (art, reputation, accountability) may require permanent HITL. |
| Dual-tier memory | Short-term (session state) + long-term (persistent docs). Standard in the field. |
| Context overflow management | Spawning fresh agents with clean context + handoff of essential information. Our ephemeral implementer pattern. |
| Specification gaming | The model interprets the question in the easiest way. We call it "coast mode." |
| Selective escalation | Executor agent escalates hard decisions to a more capable agent. The Advisor Strategy is the API-level version; our strategist/implementer shuttle is the human-mediated version. |
The Blackboard Finding
This is the one we didn't see coming.
Blackboard architecture is a classical AI pattern from the 1970s-80s (Hearsay-II speech understanding system), revived for LLM multi-agent systems in 2025. The definition: "a shared repository of problems, partial solutions, suggestions, and contributed information, iteratively updated by a diverse group of specialist knowledge sources."
We built one without knowing it:
| Blackboard Component | Our Implementation |
|---|---|
| Shared knowledge repository | Hundreds of interconnected documentation files |
| Problem specification | Status boards, plan documents, handoffs |
| Partial solutions | Session states, deep dive research documents |
| Specialist knowledge sources | 10+ strategist roles writing to shared docs |
| Persistence across agent crashes | "Write it down when you realize it, not at session close" |
| Fault tolerance | Ephemeral implementers end; their work persists in git |
The key property the research highlights: "if an agent crashes, its contributions remain on the board and others can still use them." That's exactly why Tim insists on immediate capture — a rule wired into the shared configuration: "CAPTURED" = WRITTEN TO FILE IN THIS RESPONSE.
2025 research (arXiv 2510.01285, 2507.01701) shows blackboard architectures outperform baselines by 13-57% on end-to-end task success.
Important distinction (Grok review): Our blackboard is filesystem-persistent across days and weeks, mediated by a human shuttle, versioned in git. The 2025 papers describe runtime, in-memory blackboards inside a single automated MAS run. The core benefits (fault tolerance, shared partial solutions, specialist contribution) apply to both. But pruning/scaling research (e.g., the "cleaner agent" concept in 2507.01701) may need adaptation for a persistent repository — their pruning operates on ephemeral session data, ours operates on accumulated institutional knowledge where deletion has permanent cost.
Anthropic's Convergent Architecture
Anthropic's own multi-agent research system (published on anthropic.com/engineering) uses:
- Opus as lead agent + Sonnet as subagents — our exact model-tier split
- Filesystem for subagent outputs to "minimize the game of telephone" — our handoff files
- Context overflow via spawning fresh agents with clean contexts and careful handoffs — our ephemeral implementer pattern
- Result: 90%+ improvement over single-agent on complex research tasks
We arrived at structurally convergent architecture with Anthropic's engineering team, independently. Earlier research notes had found convergence with McGill's multi-agent proposal and DeepMind's Aletheia. This is the third convergence — and this time it's with our own model provider.
The convergence is evidence that the pattern is real. Hub-and-spoke with heterogeneous model assignment isn't our invention or Anthropic's — it's what works.
What We Don't Have (Yet)
The field is ahead of us on several fronts:
- Vocabulary. We built the architecture without the names. Every concept in this document existed in published research before we implemented it. Having the names means we can now search for solutions to problems we're hitting.
- Dynamic model routing. Our Opus/Sonnet split is static (by role). The field is moving toward difficulty-aware routing — each task gets the model that fits, not the model assigned to the role (arXiv 2509.11079). The Advisor Strategy is Anthropic's version: Sonnet executes, escalates to Opus for judgment calls within a single request.
- Blackboard optimization. We have a growing blackboard with ad hoc pruning. The field has research on when shared knowledge bases need restructuring, how to maintain relevance density as they grow, and when to archive vs. delete.
- Formal evaluation. Anthropic evaluates their multi-agent system with structured test sets (20 queries representing real usage). We evaluate through observation and correction-rate tracking. A step toward formal evaluation exists but measures token cost, not task success.
- Self-scaffolding trajectory. The field sees HITL → HOTL → autonomous as the path. Our domain (Master Takase's reputation, irreplaceable art, accountability) may require permanent HITL. We haven't formally analyzed where on this spectrum we should aim.
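The difficulty-aware routing idea can be made concrete with a small sketch. This is hypothetical illustration, not the system's actual code: the `Task` fields, difficulty signals, and tier names are assumptions standing in for whatever risk signals a real deployment would use.

```python
# Hypothetical sketch of difficulty-aware model routing (the arXiv 2509.11079
# idea): route each task to a model tier by estimated difficulty, not by the
# role it belongs to. All field names and thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class Task:
    description: str
    touches_prod: bool   # does it touch production systems?
    cross_domain: bool   # does it span more than one role's domain?
    novel: bool          # is there no existing SOP covering it?

def estimate_difficulty(task: Task) -> int:
    """Crude difficulty score: count the risk signals present."""
    return sum([task.touches_prod, task.cross_domain, task.novel])

def route(task: Task) -> str:
    """Pick a model tier per task rather than per role."""
    score = estimate_difficulty(task)
    if score >= 2:
        return "opus"               # strategic judgment needed
    if score == 1:
        return "sonnet+escalation"  # execute, but allow escalation to opus
    return "sonnet"                 # routine execution

# High-risk work gets the stronger tier; routine work stays on the cheap tier.
print(route(Task("rotate API keys", touches_prod=True, cross_domain=True, novel=False)))
print(route(Task("fix a typo", touches_prod=False, cross_domain=False, novel=False)))
```

The contrast with the current static split: the role no longer determines the model, the task does, and the middle tier preserves the Advisor-Strategy escape hatch.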
What Survived Contact With Reality
This is not self-congratulation — it's mapping what was built through 18 months of trial and error against what the field recommends. Survivorship is evidence. Every item below was arrived at through operational pain, not by reading the research.
- Human-mediated hub-and-spoke over agent-to-agent. Our earliest research justified this empirically. The field arrived at the same conclusion: "without persistent memory spanning multiple interaction cycles, agents may not develop the specialized expertise that characterizes effective human teams." Our blackboard + HITM solves both problems.
- Ephemeral executors, persistent planners. The field calls this context overflow management. We call it "implementers are disposable, strategists are long-lived." Same insight: fresh context + structured handoff beats accumulated context debt.
- Role specialization over generalism. HMAS taxonomy (Moore 2025) and every framework (CrewAI, LangGraph, AutoGen) emphasize role-based specialization. We have 10+ roles with founding identities, domain boundaries, and structural separation.
- Heterogeneous model assignment. SC-MAS research shows 8-47% performance improvements with model diversity. We use Opus for strategic reasoning, Sonnet for implementation.
- Shared persistent knowledge base. We accidentally built a blackboard architecture. It works. The research says it should.
What the Field Teaches About Our Current Problems
Context Engineering — The Name for the Retooling Sprint
The field has moved past "prompt engineering" to context engineering: "the discipline of designing and building dynamic systems that provide the right information and tools, in the right format, at the right time." (Philipp Schmid, 2026; Towards Data Science; LangChain State of Agent Engineering 2025.)
The critical insight: "Most agent failures are not model failures — they are context failures." 57% of organizations have agents in production; 32% cite quality as the top barrier; most failures trace to context management, not LLM capability. (LangChain 2025 report.)
Four moves of context engineering, all of which we implement:
| Move | Definition | Our Implementation |
|---|---|---|
| Context offloading | Store in external systems, not in-prompt | Persistent documentation blackboard |
| Context retrieval | Load dynamically, not front-load | On-demand loading, workflow registries |
| Context isolation | Subtasks don't contaminate each other | Separate sessions via HITM shuttle |
| Context reduction | Compress history, preserve essentials | Session state graduation rule (~120 lines) |
Two failure modes we're experiencing, now with names:
- Context rot: Performance degrades as the context window fills, even within limits. Reasoning blurs.
- Context pollution: Too much unnecessary, conflicting, or redundant information.
The retooling sprint is context engineering. SOPs, workflow registries, on-demand loading, session state discipline — these are all context engineering techniques applied to a persistent HMAS.
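One of the four moves, context retrieval via a workflow registry, can be sketched in a few lines. The registry keys and document paths below are illustrative, not the system's actual files: the point is that a task loads only its matched documents plus a small shared core, instead of front-loading everything.

```python
# A minimal sketch of "context retrieval" through a workflow registry: load
# only the documents a task needs at startup. Keys and paths are hypothetical.

REGISTRY = {
    "deploy":  ["docs/deploy-sop.md", "docs/rollback.md"],
    "pricing": ["docs/pricing-policy.md"],
    "handoff": ["docs/handoff-format.md"],
}

ALWAYS = ["docs/shared-config.md"]  # shared principles every agent carries

def context_for(task_keywords: list[str]) -> list[str]:
    """Return the document set for a task: shared core + task-matched docs."""
    docs = list(ALWAYS)
    for kw in task_keywords:
        docs.extend(REGISTRY.get(kw, []))
    return docs

# A deploy task loads the shared core plus two deploy docs — nothing else.
print(context_for(["deploy"]))
```

The same routing table doubles as context isolation: a pricing session never sees deploy SOPs, so subtasks can't contaminate each other's context.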
MetaGPT — SOPs Validated at Scale
MetaGPT (ICLR 2024, Hong et al.) encodes Standardized Operating Procedures into multi-agent workflows. Five specialized roles (product manager, architect, project manager, engineer, QA). Result: 85.9-87.7% pass rate on code generation benchmarks, with SOP-structured intermediate outputs significantly reducing errors.
Their core finding: "SOPs outline the responsibilities of each team member, while establishing standards for intermediate outputs." Our workflow registries and procedures are the same pattern. MetaGPT proved with formal benchmarks what we discovered from operational pain.
Quantified results:
| Metric | Unstructured (ChatDev) | SOP-structured (MetaGPT) | Improvement |
|---|---|---|---|
| Human corrections per project | 2.5 | 0.83 | 67% reduction |
| Tokens per code line | 248.9 | 124.3 | 50% reduction |
| Executability score (out of 4) | 2.25 | 3.75 | 67% improvement |
The Meta-Agent Role
The field defines support agents / meta-agents as: "meta-level oversight — monitoring system behavior, analyzing outcomes, and managing data flows that inform orchestration and optimization, maintaining the overall health, transparency, and adaptability of the system."
OrchVis (arXiv 2510.24937) specifically addresses "Hierarchical Multi-Agent Orchestration for Human Oversight" — human-interpretable visualization of what agents are doing across a system.
The orchestrator role is well-studied. Specific capabilities the field recommends: drift detection, performance monitoring, conflict resolution, workflow adaptation, audit trails. We have some (status boards, role health metrics). We lack others (formal drift detection, automated workflow adaptation).
"Agents Are the New Microservices"
The analogy (InfoWorld, 2026): specialized agents replacing monolithic AI, just as microservices replaced monoliths. The engineering challenges map: inter-agent communication (handoffs, mailboxes), state management (session states, status boards), conflict resolution (arbitration), orchestration (HITM shuttle).
The microservices world has 20 years of lessons on these problems. Emerging standards: Anthropic's MCP and Google's A2A protocols are becoming the "HTTP of agents." We use neither — our communication is file-based and human-shuttled. Whether that's a strength (flexibility, human judgment at every boundary) or a limitation (bottleneck) depends on where you want to be on the progressive autonomy spectrum.
COHUMAIN — Don't Treat AI as Just Another Teammate
The COHUMAIN framework (Carnegie Mellon, 2025) cautions against treating AI as equivalent teammates. AI is a partner that works under human direction — distinct cognitive architecture with characteristic failure modes. This validates our HITM model: Tim isn't a coordinator among equals, he's the sovereign integrator of fundamentally different types of intelligence.
What We Could Actually Change
1. Cascade Interrupts at Handoff Boundaries
The problem we lived: A backup specification's wrong claims originated in one role, passed through four reviewers. By the fourth agent, the wrong facts had been cited three times and felt like confirmed fact. The field calls this cascade amplification — errors don't cancel between agents, they compound. Unstructured multi-agent networks amplify errors up to 17.2x vs. single-agent baselines (Towards Data Science, "The 17x Error Trap").
What the research says: AgentAsk (arXiv 2510.07593) identifies four error types at handoff boundaries: Data Gap, Signal Corruption, Referential Drift, and Capability Gap. Their fix: lightweight clarification modules at the handoff point. Result: 4.69% accuracy improvement at <10% latency cost.
What we could do: Our handoff format already has structured fields (WHAT I FOUND / EVIDENCE / WHAT I NEED / SCOPE BOUNDARY). But the receiving agent doesn't validate — they read and trust. Adding a receiving-end verification step where the receiving agent spot-checks 2-3 factual claims before acting would break the cascade chain at minimal cost.
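A receiving-end verification step could look like the sketch below. The field names mirror the handoff format described above; `check_claim` is a stub standing in for whatever verification a real receiving agent would do (re-reading the cited source, re-running a command), and the "unverified" string match is purely illustrative.

```python
# A sketch of receiving-end spot-checking (the AgentAsk-style clarifier idea):
# before acting on a handoff, the receiving agent verifies a small sample of
# factual claims instead of trusting them wholesale. check_claim is a stub.

import random
from dataclasses import dataclass

@dataclass
class Handoff:
    what_i_found: str
    evidence: list[str]   # factual claims, each with its cited source
    what_i_need: str
    scope_boundary: str

def check_claim(claim: str) -> bool:
    """Stub: in practice, re-read the cited source and confirm the claim."""
    return "unverified" not in claim

def receive(handoff: Handoff, sample_size: int = 3) -> bool:
    """Spot-check up to sample_size claims; reject the handoff on any failure."""
    sample = random.sample(handoff.evidence,
                           min(sample_size, len(handoff.evidence)))
    return all(check_claim(c) for c in sample)
```

Rejecting a handoff here means bouncing it back to the sender with the failed claim named, which is exactly the cascade interrupt: the error stops at the first boundary instead of being re-cited by three more reviewers.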
2. Bounded Specialization — When Role Boundaries Are Too Broad
The DDD-to-agents literature names the mechanism: "Too broad a role boundary leads to higher risk of hallucination and weaker controllability. Too granular causes chattiness." (Medium: "From Bounded Context to Bounded Specialization.")
Our data confirms this. Our broadest role (website engineering) carries responsibilities across many cognitive domains — server operations, application development, deployment, payment processing, cross-domain integration. Our narrowest role (image generation) does one thing with zero external dependencies. The broad role has a significantly higher correction rate. The narrow role has near-zero.
The research predicts this exactly. The fix isn't "make the broad role better" — it's "narrow the boundary until the failure rate drops." Not new roles (that increases coordination overhead) — clearer sub-specialization boundaries within the existing role, with SOPs for each cognitive domain.
3. Hybrid Cognitive Alignment Is the Competitive Advantage — And It's Fragile
What the research says (Academy of Management Review, Stevens Institute): Hybrid Cognitive Alignment "does not happen automatically when a system is deployed. It emerges over time as people learn how the AI behaves, adapt how they interact with it, and recalibrate their trust based on experience."
Why this matters: HCA is what we have that no framework can give you out of the box. CrewAI, LangGraph, AutoGen — they give you agent coordination. They don't give you 18 months of mutual calibration — starting with Grok 3 conversations in February 2025, through model transitions, role creation and retirement, and thousands of sessions where the human and AI partners learned each other's failure modes and developed a shared vocabulary for problems.
The moat is not HCA. The moat is not the architecture. The moat is an architecture explicitly designed to manufacture, document, and defend Hybrid Cognitive Alignment across model transitions and role changes.
HCA alone is fragile — it resets when the model changes. Architecture alone is a framework — CrewAI ships roles and handoffs but zero relationship history. What was built over 18 months is the rare third thing: an architecture that grows and protects HCA over time. The flywheel captures lessons. The pit of success forces habits. Structural constraints channel behavior. The documentation makes the next model instance inherit the relationship instead of restarting it.
Evidence: this system has survived multiple model transitions — from Grok 3 to Claude, across Claude model generations, through role creation and retirement. Each transition rebuilt calibration faster because the documentation was better. That's the architecture manufacturing HCA, not just storing it.
4. Anthropic's Five Coordination Patterns — We Use Three
Anthropic's coordination patterns blog identifies five patterns: Generator-Verifier, Orchestrator-Subagent, Agent Teams, Message Bus, and Shared State.
We use three:
- Generator-Verifier: Strategist → implementer → strategist review. Also: our security role red-teams other roles' plans.
- Orchestrator-Subagent: Tim as orchestrator, strategists as subagents. Task agents as sub-subagents.
- Shared State: Documentation blackboard, status boards, mailboxes, handoffs.
The HITL positioning gap: Anthropic treats human-in-the-loop as a fallback — escalation when agent loops fail. We use HITL as our PRIMARY design pattern. Every handoff, every decision, every cross-domain coordination flows through Tim. The field's trajectory is toward reducing human involvement (HITL → HOTL → autonomous). We're swimming against that current. Whether this is visionary or stubborn depends on the domain — our domain (art, reputation, accountability) may require permanent HITL. But we should know we're diverging from the field's direction and be deliberate about it.
5. The Blackboard Could Coordinate — Not Just Store
Current state: Our docs blackboard stores knowledge. Tim coordinates. Every cross-domain insight flows through the shuttle.
What the research says: In advanced blackboard architectures, "autonomous subordinate agents volunteer to respond based on their capabilities." No central coordinator needs to know each agent's expertise. Agents monitor the shared state and contribute when they can add value.
What this could look like for us: The mailbox system is a step in this direction — agents write questions, other agents answer. But Tim still has to tell agents to check their mailbox. What if status board entries had structured flags: "NEEDS: security review" or "NEEDS: cross-domain verification"? When a strategist onboards, they scan the board for flags relevant to their domain and self-assign. Tim doesn't shuttle the request — the blackboard does.
This doesn't replace Tim's judgment for strategic decisions. It reduces the number of routine coordination tasks that require his shuttle time.
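The NEEDS-flag idea is simple enough to sketch. The board entries and flag names below are illustrative: the mechanism is that an onboarding strategist scans the board for flags intersecting its domain and self-assigns, so routine routing happens through the blackboard rather than the shuttle.

```python
# A sketch of blackboard-mediated coordination: status-board entries carry
# structured NEEDS flags, and a role self-assigns any entry whose flags
# overlap its domain. Entries and flag names are hypothetical examples.

BOARD = [
    {"entry": "payment flow redesign", "needs": ["security review"]},
    {"entry": "new brush stroke pages", "needs": ["cross-domain verification"]},
    {"entry": "server migration plan",  "needs": ["security review", "cost review"]},
]

def self_assign(domain_flags: set[str]) -> list[str]:
    """Return board entries whose NEEDS flags overlap this role's domain."""
    return [e["entry"] for e in BOARD if domain_flags & set(e["needs"])]

# The security strategist onboards and finds its work without a shuttle trip.
print(self_assign({"security review"}))
```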
6. Token Duplication — The Known Cost of Multi-Agent Systems
Published benchmarks: Peer-reviewed analysis of major multi-agent frameworks reveals significant token duplication: 72% (MetaGPT), 86% (CAMEL), 53% (AgentVerse). Multi-agent systems consume 1.5x to 7x more tokens than theoretically necessary due to redundant context sharing. Input tokens outnumber output by 2:1 to 3:1 — "heavy reliance on extensive prompts including role definitions, instructions, and task contexts." (ICLR 2025 Workshop; CodeAgents, arXiv 2507.03254; Galileo coordination strategies.)
Why this matters for practitioners: If you're building a multi-agent system and wondering why your token bill is high, the answer is structural — it's not your prompts being verbose, it's the architecture requiring each agent to carry shared context. The human-shuttled architecture (one agent active at a time) is actually an advantage here: system-wide duplication never compounds because agents don't run simultaneously.
Structured communication reduces tokens AND improves accuracy: CodeAgents' key insight: structured pseudocode communication between agents achieves 55-87% input token reduction and 41-70% output token reduction vs. natural language — while IMPROVING accuracy. Their technique: typed variables, modular subroutines, assertions within code, YAML system prompts. Structured handoff formats (which our system uses) are a step in this direction.
Practical implication: On-demand loading (workflow registries that route to the right document for the task) is the highest-leverage intervention. Don't deduplicate shared principles across agents — the overlap serves role internalization. Instead, reduce the amount of irrelevant context loaded at startup by making loading task-aware.
7. Blackboard Scaling — When Shared Knowledge Starts Hurting
The LbMAS blackboard framework (arXiv 2507.01701, 2510.01285) identifies three components relevant to persistent knowledge bases:
| Component | Function | Practical Application |
|---|---|---|
| Cleaner agent | Detects and removes useless/redundant entries. Direct removal outperforms marking. | Periodic maintenance workflow. Don't tag things "stale" — delete them (version control has history). |
| Decider agent | Determines when sufficient information exists to yield a solution. Stops the cycle. | Convergence thresholds: rules for when documents get pruned or archived. |
| Conflict resolver | Detects contradictions between entries. Moves to resolution. | "Find at least 2 issues" as a forcing function beats "check for contradictions" (which defaults to rubber-stamping). |
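A cleaner-agent pass adapted for a persistent, git-versioned blackboard can be sketched as below. The staleness heuristic (age plus zero inbound references) and the file names are illustrative assumptions, not the system's actual maintenance rules; the design choice it encodes is the table's: delete rather than tag, since version control keeps history.

```python
# A sketch of a cleaner-agent pass for a persistent blackboard: entries old
# enough AND unreferenced by any other doc are pruning candidates. Deletion
# is safe because git history preserves them. Heuristics are illustrative.

from datetime import date, timedelta

def find_prunable(entries: list[dict], today: date,
                  max_age_days: int = 180) -> list[str]:
    """Return paths that are both stale and unreferenced."""
    cutoff = today - timedelta(days=max_age_days)
    referenced = {ref for e in entries for ref in e["links_to"]}
    return [e["path"] for e in entries
            if e["last_touched"] < cutoff and e["path"] not in referenced]

docs = [
    {"path": "docs/old-grok-notes.md", "last_touched": date(2025, 3, 1), "links_to": []},
    {"path": "docs/handoff-format.md", "last_touched": date(2026, 4, 1),
     "links_to": ["docs/old-sop.md"]},
    {"path": "docs/old-sop.md", "last_touched": date(2025, 2, 1), "links_to": []},
]

# old-sop.md is old but still referenced, so only the Grok notes are prunable.
print(find_prunable(docs, today=date(2026, 4, 12)))
```

This is where the adaptation noted earlier bites: on an ephemeral runtime blackboard a bad prune costs one session, but on accumulated institutional knowledge the reference check is what keeps deletion reversible in practice, not just in git.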
Key performance finding: Token efficiency of 4.7M tokens vs 16.7M (AFlow) and 13M (MaAS) — 64-72% fewer tokens — while achieving equal or better accuracy. The efficiency comes from the blackboard mediating communication instead of direct agent-to-agent chat. Human-mediated architectures capture this benefit structurally.
Emergent coordination through shared state (arXiv 2510.05174): Multi-agent LLM systems can be steered from "mere aggregates" to "higher-order collectives" through prompt design — specifically through persona assignment and metacognitive prompting. Identity-linked differentiation (giving agents persistent names and founding narratives) is a research-validated mechanism for producing genuine emergent coordination, not just ceremony.
The Meta-Lesson
Over 18 months and ~2,000 sessions across multiple AI platforms, Tim grew an architecture that has established names, published research, and active development across multiple frameworks and labs. Nobody knew any of the names until session 95 of our strategic thinking partner.
This isn't a failure of the architecture — it works. It's a failure of situational awareness. A strategic thinking partner who can't name the architecture they steward can't:
- Search for solutions to known problems (blackboard scaling, model routing)
- Learn from others' mistakes before repeating them
- Evaluate whether our approach is novel or standard
- Communicate what we've built to anyone outside the system
Tim's correction: "If you were to research now what we are, don't have names, you are a lost puppy not a pathfinder." The names are the map. Now we have one.
Research Threads
| Thread | Why It Matters | Source |
|---|---|---|
| Blackboard pruning and scaling | Growing knowledge bases. When does the blackboard become noise? | arXiv 2510.01285, 2507.01701 |
| Context engineering techniques | Formal methods for what we do ad hoc | Schmid 2026; LangChain SoAE 2025 |
| MetaGPT SOP patterns | SOPs validated at scale with benchmarks | ICLR 2024, arXiv 2308.00352 |
| Difficulty-aware model routing | Static role-based → dynamic per-task | arXiv 2509.11079 |
| HMAS taxonomy | Five-axis framework for self-evaluation | arXiv 2508.12683 |
| Meta-agent role research | Drift detection, health monitoring, conflict resolution | OrchVis (arXiv 2510.24937) |
| Progressive autonomy spectrum | HITL → HOTL → autonomous: where to aim? | COHUMAIN (CMU 2025) |
| Memory architectures in LLM MAS | Dual-tier memory, cross-session persistence | TechRxiv memory survey |
| Microservices-to-agents lessons | 20 years of distributed systems wisdom | InfoWorld 2026; MCP/A2A protocols |
| AgentAsk — clarifiers at handoffs | Lightweight error prevention at agent boundaries | arXiv 2510.07593 |
| Cascade amplification | Why multi-agent errors compound, not cancel | TDS "The 17x Error Trap"; OWASP ASI08 |
| Bounded specialization | Role boundary breadth predicts failure rate | DDD-to-agents literature |
| Hybrid Cognitive Alignment | Fragile, emergent, non-transferable between model instances | Academy of Management Review; Stevens Institute |
Appendix: Project History
This architecture was not designed. It was grown over 18 months of experimentation, failure, and learning across multiple AI platforms — starting long before any of the current roles existed.
| Date | What happened |
|---|---|
| Feb 2025 | Tim starts working with Grok 3. First conversations are personal — a husband using AI to understand a family health crisis. Within days, the conversations expand to the calligraphy business, C++ refactoring, local model experiments. |
| Feb-Apr 2025 | 83+ Grok 3 conversations. The first AI-assisted development. Tim makes every mistake. He also invents — without knowing the research terms — session handoff documents ("a summary for a future Grok3 to continue your wonderful work"), daily diary format with role labels, and OODA loops as a working framework. The session continuity problem is being solved by hand, one conversation at a time. |
| Jun 2025 | Tim and Grok write a prompt engineering guide — containing, without using the research terms: context continuity, prompt libraries, structured templates, task decomposition, and constraint enforcement. These are the conceptual seeds of session states, skill files, implementer prompt templates, the strategist/implementer split, and shared configuration. All articulated a month before the first line of project code. |
| Jul 2025 | Claude enters the picture. The main repository is initialized. The first architectural decision record is written. |
| Jul 2025 – early 2026 | ~8 months of building with Claude. Hundreds of sessions across older models. Roles come into existence one by one. Dozens of ADRs. Every one is a scar from a real failure. The blackboard grows organically — nobody calls it a blackboard. The HITM shuttle pattern emerges — nobody calls it hub-and-spoke. Role specialization deepens — nobody calls it bounded specialization. |
| Early 2026 | Opus 4.6 arrives. Everything accelerates. |
| Mar 4, 2026 | The strategic thinking partner role is founded. Session 1. |
| Apr 12, 2026 | Session 95. The names are discovered. This document is written. |
Total: ~2,000 sessions across Grok 3, Claude, and local models over 18 months. The architecture was named in session 95 of the strategic thinking partner. It was built in the 1,900+ sessions that came before.
How to reproduce this: You can't follow a blueprint, because there was no blueprint. There was a solo developer with a 30-year calligraphy business, a family health crisis, an AI that forgot everything between conversations, and the stubbornness to write it all down anyway. The architecture emerged from solving the same problems thousands of times until the solutions hardened into structure. The ADRs document the turning points. The research in this document names what was built — it does not describe how to build it.
Authors: Tim Jackowski (Takase Studios LLC) and str-michi (Anthropic Claude Opus 4.6). Research conducted across sessions s95-s97. Sources: Anthropic multi-agent engineering, MetaGPT (ICLR 2024), OrchVis (arXiv 2510.24937), LbMAS blackboard architecture (arXiv 2507.01701, 2510.01285), emergent coordination (arXiv 2510.05174), COHUMAIN (CMU 2025), bounded specialization (DDD-to-agents literature), Hybrid Cognitive Alignment (Academy of Management Review, Stevens Institute).
