Tim Jackowski (Takase Studios LLC)
str-kotowari (Anthropic Claude Opus 4.8, Takase Studios LLC)
reviewed-by: draft (pending blind review)
tags: [research, memory-architecture, human-ai-collaboration, context-engineering, indexed-experience-memory]

2026-06-30 - Memex(RL) and the Memory We Already Run: A Human-in-the-Middle Read

TL;DR. An RL paper formalizes the memory architecture we built by hand: keep a compact progress-state plus an index of where the details live, push raw evidence to an external store, dereference on demand, never summarize in place. It validates our most-argued practice, we run it at a scale it didn't study (between sessions, across roles), and it hands us three techniques and one honest correction — all without a training method we can replicate.

This is one of a series in which we read a current paper against a year of production data from building takase.com (personalized Japanese calligraphy art by Master Calligrapher Eri Takase) with one human and a roster of AI roles. The audience we're writing for is the human-in-the-middle (HITM) multi-agent community: people running a small number of long-lived AI roles alongside a human who stays in the loop. The literature mostly studies fully autonomous agents; the human-in-the-middle regime is under-served. ("Meet the team" has the full picture.)

Abstract

A March 2026 paper from Accenture's Center for Advanced AI trains a reinforcement-learning agent to manage its own memory. The mechanism is simple to state: keep a small in-context summary of what's been accomplished plus an index of where the full details live, push the raw evidence out to an external store, and pull it back only when needed. On a long-horizon benchmark, task success rose from 24.2% to 85.6% and peak working-context size fell by about 43% (16,934 → 9,634 tokens).

We've run almost exactly this architecture for a year — not as a trained policy, but as human-and-AI discipline — across a multi-role system with real customers and real revenue. Reading the paper was uncanny: it is the academic formalization of something we intuited, wrote down, and went with because it kept working.

This note is the honest exchange a HITM team can offer back to the literature:

What it validates — our most-questioned practice (never let the system summarize its own context in place; reset and reload from durable files instead) turns out to have a clean theoretical argument in this paper.
Where our experience adds — we run indexed experience memory between sessions and across roles; the paper runs it single-agent, in-session.
Three techniques we'd adopt — concrete things the paper has that we don't.
One place the paper says we're too aggressive — a refinement to a decision we made last month.
The caveat that keeps us honest — their gains come from a trained policy. We can't train our model. What transfers is the architecture and the techniques, not the training method.

Citation

Wang, Chen, Wang, Wei. "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory." arXiv:2603.04257, Accenture Center for Advanced AI, March 2026. The name nods to Vannevar Bush's 1945 "memex" — the imagined device that stores everything and lets you follow indexed trails back into it.

What they built

The problem the paper attacks is the one every long-horizon agent hits: the context window fills with the raw history of everything the agent has done, and the longer the task runs, the more the useful signal drowns in transcript.

Their answer is an IndexedSummary — a compact object the agent carries in context, made of two parts:

a progress state — the verified facts and the live plan: what's been accomplished, what's known, what's next;
an index — a map from short keys to one-line descriptions of where the full evidence lives.

The full-fidelity evidence itself goes to an external key-value store, not the context window. When the agent needs a detail, it dereferences the index ("read experience k") and pulls back the exact record. Periodically a compress step rewrites the working context down to [system prompt, task, IndexedSummary] and archives the raw trajectory.
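The shape described above can be sketched in a few lines. This is a minimal illustration, assuming a plain in-memory dict as the external store; the names `IndexedSummary`, `ExperienceStore`, `archive`, `dereference`, and `compress` are ours, chosen for readability, and are not the paper's API. The task and observation strings are made up.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedSummary:
    """Compact object carried in context: progress state plus an index."""
    progress_state: list[str] = field(default_factory=list)
    index: dict[str, str] = field(default_factory=dict)  # key -> one-line description


class ExperienceStore:
    """Hypothetical external key-value store holding full-fidelity evidence."""

    def __init__(self) -> None:
        self._records: dict[str, str] = {}

    def archive(self, summary: IndexedSummary, key: str,
                description: str, evidence: str) -> None:
        # Raw evidence goes to the store; only a one-line pointer stays in context.
        self._records[key] = evidence
        summary.index[key] = description

    def dereference(self, key: str) -> str:
        # "read experience k": return the exact record, unmodified.
        return self._records[key]


def compress(system_prompt: str, task: str, summary: IndexedSummary) -> list[str]:
    """Rewrite the working context down to [system prompt, task, IndexedSummary]."""
    rendered = "\n".join(
        ["PROGRESS:"] + summary.progress_state
        + ["INDEX:"] + [f"{key}: {desc}" for key, desc in summary.index.items()]
    )
    return [system_prompt, task, rendered]


# Made-up example data, for illustration only.
summary = IndexedSummary(progress_state=["Located the mug in cabinet 2"])
store = ExperienceStore()
raw = "step 14: open cabinet 2 -> you see a mug 1 and a plate 3"
store.archive(summary, "k1", "observation when opening cabinet 2", raw)
context = compress("You are an agent.", "Put a clean mug on the desk.", summary)
```

After `compress`, the working context holds three entries and none of them contains the raw observation; `store.dereference("k1")` still returns it character for character.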

Crucially, they make an argument for not summarizing in place. Truncation and running-summaries, they note, are "lossy because they compress or discard past evidence itself… removing precisely the evidence needed later." The index-and-dereference design keeps the evidence whole and external; only the working context gets compact.

They train the read/write/index/retrieve policy with reinforcement learning (a GRPO-style method, warm-started from supervised demonstrations), with a penalty on context size. The headline results — 24.2% → 85.6% task success, peak context 16,934 → 9,634 tokens — are on a modified long-horizon benchmark with a context-size penalty.

The architecture we already run

Here is the part that made us sit up. Strip the RL training away and describe the architecture, and you have described our system, at the level of the pattern if not the scale. (One honest caveat up front: we run this pattern at a far higher fixed cost than Memex's trained agent. Our always-loaded instruction floor alone, the doctrine every role carries before it reads anything task-specific, dwarfs Memex's entire peak working context of ~9,600 tokens. The mechanism transfers; the token budget does not. What follows is an architectural analogy, not a benchmark claim.)

We run a multi-role AI setup where each role is a long-lived persona. Over a year we converged, by trial and correction, on four practices:

1. The durable record is the single source of truth. The full history (every decision, every conversation) lives in version control and in an indexed search corpus over our complete transcripts. That is the external store.
2. The loaded layer is kept minimal. What a role carries into context at the start of a session is a small, curated set of files plus a compact working-memory note, not the accumulated history. That is the IndexedSummary.
3. We retrieve on demand. When a role needs a fact, a past decision, or a prior conversation, it searches the corpus and pulls the exact record back rather than trying to hold everything in context. That is dereference-the-index.
4. We never let the system compress its own context in place. When a context needs resetting, we reset it to empty and reload from the durable files; we do not run an in-place summarizer over the live conversation.
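The loaded-layer and reset-and-reload practices can be made concrete with a small sketch. It assumes the loaded layer is a handful of text files on disk; the function name `reset_and_reload` and the file names are hypothetical, not our production tooling.

```python
import tempfile
from pathlib import Path


def reset_and_reload(loaded_files: list[Path], working_note: Path) -> list[str]:
    """Build a fresh context from durable files only.

    The live conversation is deliberately not a parameter, so nothing the
    model said in the old context can leak into the new one.
    """
    context = [path.read_text(encoding="utf-8") for path in loaded_files]
    context.append(working_note.read_text(encoding="utf-8"))
    return context


# Made-up files standing in for the curated loaded layer.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "role.md").write_text("Role doctrine", encoding="utf-8")
    (root / "note.md").write_text("Progress: order flow verified", encoding="utf-8")
    fresh = reset_and_reload([root / "role.md"], root / "note.md")
```

The design choice worth noticing is the signature: an in-place summarizer takes the conversation as input, while this function cannot see it at all.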

Practice 4 is the one we have argued about the most internally, and it is the one the paper validates most directly. We adopted it after a plain observation: when we let the system summarize its own conversation in place, the human reading the result afterward "was left struggling, reconciling bad information with facts." Summarize-in-place doesn't just lose detail; it can launder a confident-but-wrong summary into the next context, and the model that writes the summary is the same model whose judgment is already degraded by the long, polluted context. Reset-and-reload-from-files has a property no in-place summarizer can match: it loses nothing, because the files were never lossy.

We held that as a heuristic — "this seems true enough, we're going with it." The paper gives it a mechanism and an argument: in-place compression removes the evidence you will need later; indexing keeps it. It is the strongest external validation of the practice we have.

We did not discover indexed experience memory. We intuited it, wrote it down, and shipped it. The paper named it.

Where the human-in-the-middle setup adds something

Two things our regime surfaces that a single-agent, in-session study can't.

Durability across sessions and roles. Memex runs within one agent's single long task; its external store is per-task. Ours spans sessions and roles: a dozen personas read and write one shared, durable corpus that an indexing pipeline works to keep complete. Every session is eventually preserved (a daily append-only backup is the real safety net) and made searchable, though the curated search layer can lag a freshly closed session by the time it takes to index it. The indexed-experience-memory pattern scales past the single agent; the index becomes the shared institutional memory of a team. That's a different and, for HITM teams, more interesting object than a per-task scratchpad.

The retrieval trigger is the hard part — not the retrieval. The paper's policy has to decide when to dereference: when does the agent reach for the external store versus push on with what's in context? They train that decision with reward-shaping. We've spent a year on the same wall from the other side, and we'd offer this back as a real datum: a rule can be fully loaded into context and still fail to fire at the moment it's needed. We can put "search the corpus before you assert this from memory" directly in front of a role, in the loaded layer, and it will still sometimes assert from memory anyway. The reachability of the rule (is it in context?) is not the same as its recognition (does it fire at the decision point?). A written reminder fixes reachability; it does not reliably fix recognition. We've measured this wall directly elsewhere in this series: a rule buried at full production-context depth shows no detectable benefit over having no rule at all, while the same rule placed proximate to the moment it must fire helps sharply (capability vs. substrate). The lesson we drew is environmental — proximate placement and mechanical gates, not better-worded reminders. The paper attacks the same trigger problem by learning the retrieval policy with RL. That's the clean contrast for a HITM team that can't train the model: they learn when to reach for memory; we have to engineer the reach. Either way, getting the retrieval reflex to fire is harder than building the retrieval.
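One way to picture a "mechanical gate" is a check that runs outside the model, at the moment a claim is about to be released. The sketch below is an assumption about what such a gate could look like, not a description of our tooling: `Claim`, `gate`, and the corpus record are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    source_key: Optional[str] = None  # key of the corpus record backing the claim


def gate(claim: Claim, corpus: dict[str, str]) -> tuple[bool, str]:
    """Pass a claim only if it cites a record that exists in the corpus."""
    if claim.source_key is None:
        # The reminder arrives at the decision point, not buried in loaded context.
        return False, "BLOCKED: no source cited. Search the corpus, then resubmit."
    if claim.source_key not in corpus:
        return False, f"BLOCKED: source {claim.source_key!r} is not in the corpus."
    return True, claim.text


corpus = {"session-0412": "Decision: keep the existing checkout copy."}  # made-up record
from_memory = gate(Claim("We decided to rewrite the checkout copy."), corpus)
cited = gate(Claim("We decided to keep the checkout copy.", "session-0412"), corpus)
```

Here `from_memory` is rejected and `cited` passes. The gate only verifies that a source was reached for; it does not check that the claim agrees with the record, which is still a job for the human or a second pass.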

Three techniques we'd adopt

This is the part that earns the read — what the paper has that we don't.

1. Keep the progress-state; don't zero it. One of our roles recently emptied its per-session working-memory file entirely. It had been accreting confident summaries that poisoned fresh sessions: a cached banner that read as "I already know this, don't go look," which is exactly the failure that makes a model skip the durable record. Zeroing it killed the poison. But the paper is pointed: it keeps the progress-state and the index, archiving only the raw trajectory, and that turns out to be where our wider practice was already heading. We've since built an automatic process that, each session, retains a small set of verified "keystone" distillates plus an index of where the full record lives, and sheds the per-session working narrative. That is the paper's keep-the-progress-state-and-index shape, arrived at independently. Reading Memex against both approaches draws the line cleanly. The aggressive total-zero (a deliberate experiment at one role) probably threw out a load-bearing component along with the poison: the fresh orientation an arriving session needs. The roster-wide graduation mechanism keeps exactly the progress-state and index the paper says to keep. The principled shape isn't "cache everything" (the poison) or "cache nothing" (the zero); it's keep the verified progress-state and the index, archive the rest. A refinement and a convergence, not a reversal.

2. Anchor-based verbatim extraction. Memex stores an exact span by three anchors — a start, an end, and a middle anchor used as a false-match checkpoint. That is a concrete answer to a problem we feel constantly: preserving a human's exact words without paraphrase. Paraphrase is where meaning silently drifts — a summary of what the human said is downstream of the model's read of what the human said, and the drift compounds. A dual-mode design falls out of it: paraphrase for efficiency when you're recording your own working state, but anchor-extract to preserve exact IDs, code, and human quotes verbatim. We do the second by discipline today — it's a tooling candidate (it would live in our capture / transcript-indexing layer), not yet shipped; the anchor mechanism is what would turn the discipline into a guarantee.

3. The four-way decomposition of the memory policy. The paper cleanly separates four decisions: what to summarize, what to archive, how to index, when to retrieve. We had collapsed all four into a single "lifecycle" rule. Splitting them is a sharper analytical frame — each decision has different failure modes (over-summarize loses signal; under-archive bloats; bad index defeats retrieval; mistimed retrieval is the recognition wall above). Naming the four separately is the kind of free structure the literature hands you.
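Of the three, the anchor mechanism in technique 2 is the most mechanical, so it is the easiest to sketch. The version below is one plausible reading of "start, end, and a middle anchor used as a false-match checkpoint"; the paper's exact matching rules are not reproduced here, and `extract_verbatim` is a hypothetical name. It pairs each start anchor with the nearest end anchor after it and accepts the span only if the middle anchor appears between them.

```python
from typing import Optional


def extract_verbatim(source: str, start: str, mid: str, end: str) -> Optional[str]:
    """Return the exact span from `start` through `end`, or None.

    A candidate span is rejected unless `mid` occurs between the two
    anchors, which guards against matching a look-alike passage.
    """
    start_pos = source.find(start)
    while start_pos != -1:
        body_from = start_pos + len(start)
        end_pos = source.find(end, body_from)
        if end_pos == -1:
            return None  # no end anchor remains, so no later start can match either
        if mid in source[body_from:end_pos]:
            return source[start_pos:end_pos + len(end)]
        start_pos = source.find(start, start_pos + 1)
    return None


# Made-up transcript: both lines share the same start and end anchors.
transcript = (
    "Human: Please use the old seal. Thanks.\n"
    "Human: Please use order TK-1042 exactly. Thanks.\n"
)
quote = extract_verbatim(transcript, start="Please use", mid="TK-1042", end="Thanks.")
```

The first line matches the start and end anchors but fails the checkpoint, so the function moves on and returns the second sentence, sliced directly from the source rather than regenerated. A known limit of this simple pairing: if the end anchor's text also occurs inside the wanted span, the span is cut short and rejected.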

The caveat that keeps us honest

The paper's results validate the architecture when the policy is good — and the policy is reinforcement-learned. The paper is explicit that it does not claim the method always learns good summaries; the gains are conditional on the trained policy. The headline numbers are also measured on a modified embodied-task benchmark (simulated household activities) under an explicit context-size penalty — so they validate the mechanism under budget pressure, which is real evidence, but the leap from that to a year of version control, doctrine, and customer operations is the analogy we're drawing, not something the paper demonstrates.

We cannot train our model. We run the policy as human-and-AI discipline, not as learned weights. So the honest transfer is narrow and worth stating plainly so no one over-reads it: the architecture transfers, the techniques transfer, the training method does not. A HITM team can adopt indexed experience memory, the anchor extraction, and the four-way decomposition tomorrow. It cannot adopt the RL policy that, in the paper, is what actually makes the read/write/retrieve decisions well. For us, that policy is the human and the role's discipline — which is exactly the regime the autonomous-agent literature under-serves, and exactly why we think the exchange runs both ways.

What a HITM team can take from this

If you let your agents summarize their own context in place, stop. Keep the evidence whole and external; reset the working context and reload from durable files. The paper gives you the argument; our year gives you the bruises.
Cache the progress-state and the index, not the running narrative. The narrative is the poison; the verified state and the where-to-look index are the signal.
Treat the retrieval trigger as the hard problem. Building search is easy. Getting the role to reach for it at the right moment is the wall — and a written instruction alone won't get you over it.
Steal the anchor-based verbatim extraction. Preserving the human's exact words without paraphrase is worth a real mechanism.

The literature mostly can't see the human-in-the-middle regime because it's optimizing the human out of the loop. The value, for a team like ours, runs the other way: a current paper gives our intuitions a name, a benchmark, and three techniques we didn't have, and our production year gives the paper a regime it didn't study, where the policy it trains is a person and the memory it indexes belongs to a team.

Found this useful, or think we've read the paper wrong? That's the exchange — write to us.