RES-002: OpenClaw-RL — Directive Signals in Conversations (2026-03-15)

Citation

Yinjie Wang, Xuyang Chen, et al. "OpenClaw-RL: Train Any Agent Simply by Talking." arXiv:2603.10165, March 2026.

Key Findings

The Core Argument

Human corrections in conversations are the highest-quality training signal available — and current systems throw it away.

Why It Matters to Us

This paper validates our flywheel (ADR-047) and justifies the docs-meta initiative:

  1. The flywheel IS manual OpenClaw-RL. Every time Tim corrects a strategist — "no, the NAS roles are reversed," "read the spec before asserting facts" — that correction gets written into session states, concept-checks, lessons, and memory files. We don't feed it back into model weights (we can't), but we persist it in documentation so the next session inherits the correction. The paper says this signal is gold; we already treat it as gold.

  2. 949 conversations are an asset, not history. The JSONL conversation archive contains thousands of directive signals — corrections, re-queries, "no not that, instead do..." moments. With 1M context, old sessions can be fully re-read. This is the raw correction data the paper says current systems waste.

  3. docs-meta is the retrieval infrastructure. The conversation archive is locked in JSONL files that nobody can search. docs-meta (the planned vault — gitignored, invisible to engineer roles, curated session index) makes this data findable and usable. RES-002 is the research justification for building docs-meta.

  4. The compounding is real. 143 str-takase sessions. 83 str-ishizue sessions. 41 str-mamori sessions. Each session's corrections compound into the next. The project's documentation IS the learned model — not weights in a neural network, but structured knowledge in files. Same function, different substrate.
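The archive-mining step implied by points 2 and 3 can be sketched as a small script. This is a hypothetical illustration, not the actual docs-meta tooling: the message schema (`role` and `text` fields per JSONL line) and the correction markers are assumptions, since the archive's real format isn't specified here.

```python
import json
import re
from pathlib import Path

# Assumed markers of a directive signal (a human correction).
# The real marker list would be curated from the archive itself.
CORRECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bno,? not that\b", r"\binstead\b",
              r"\bread the spec\b", r"\bwrong\b")
]

def extract_corrections(archive_dir):
    """Scan JSONL session files; yield (filename, text) for human
    messages that match a correction pattern."""
    for path in sorted(Path(archive_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                msg = json.loads(line)
                if msg.get("role") != "human":
                    continue
                text = msg.get("text", "")
                if any(p.search(text) for p in CORRECTION_PATTERNS):
                    yield path.name, text
```

The point of the sketch is that the signal is cheap to surface once the files are findable; the hard part is the curation and indexing that docs-meta is meant to provide.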

Empirical Confirmation (RES-008, 2026-03-26)

RES-008 used the conversation archive this paper justified building. str-michi searched 121 JSONL session files, extracted Tim's corrections (including profanity as a frustration marker), computed correction rates, and discovered that 83% of AI strategist failures involved documented constraints that weren't consulted. The corrections — the "directive signals" this paper says systems throw away — became the dataset that diagnosed the failure mode.

The prediction that correction data is "universal gold" was confirmed: Tim's profanity turned out to be a precise instrument for measuring system failure rates across 2,371 human messages. The rawest form of human correction was the most analytically useful.


Discovered 2026-03-14. Part of the three-paper arc: RES-001 (why HITM) → RES-002 (why the flywheel works) → RES-003 (why docs-meta is worth building).