RES-003: RLHI — Reinforcement Learning from Human Interaction (2026-03-15)
Citation
Chuanyang Jin, Bo Liu, et al. "The Era of Real-World Human Interaction: RL from User Conversations." arXiv:2509.25137, September 2025. Meta FAIR and Johns Hopkins University.
Key Findings
- Preference pairs from corrections: Every time a user corrects an AI response, that creates a preference pair: bad response (original) vs. good response (incorporating feedback). The signal is real but, as the paper puts it, has historically been hard to extract, so static training data gets collected instead.
- User personas from chat history: Chat patterns (tone preferences, format preferences, domain expertise) can be extracted to personalize future interactions. Noisy interactions are quality-filtered before training.
- Static datasets are textbooks: Disconnected from real conversations — "static, context-free judgments instead of evolving, situational demands."
- Results: Trained on WildChat (1M+ real ChatGPT conversations), the approach boosts personalization and instruction-following, and raises accuracy on math and science benchmarks. (Those reasoning gains used synthesized conversations simulating users correcting errors, not organic chat data.) The model is trained to condition on user personas extracted from history — which is not the same as a deployed model remembering you across live sessions.
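The correction-to-preference-pair idea in the findings above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the marker phrases, the `PreferencePair` fields, and the `role`/`text` turn layout are all assumptions made for the example. The sketch only records the rejected response and the user's feedback; producing the preferred response that incorporates the feedback is left out.

```python
from dataclasses import dataclass

# Hypothetical marker phrases, chosen for illustration only.
# The paper's actual method for detecting corrective feedback is not reproduced here.
CORRECTION_MARKERS = ("no, not that", "that's wrong", "instead do")


@dataclass
class PreferencePair:
    prompt: str    # the user request that led to the response
    rejected: str  # the original response the user corrected
    feedback: str  # the user's correction, the basis for a better response


def is_correction(text: str) -> bool:
    """Return True if the text contains any of the assumed marker phrases."""
    lowered = text.lower()
    return any(marker in lowered for marker in CORRECTION_MARKERS)


def extract_pairs(turns: list[dict]) -> list[PreferencePair]:
    """Scan consecutive (user, assistant, user) turns.

    A pair is emitted whenever the second user turn looks like a correction
    of the assistant turn that precedes it.
    """
    pairs = []
    for i in range(len(turns) - 2):
        first, reply, followup = turns[i], turns[i + 1], turns[i + 2]
        roles = (first["role"], reply["role"], followup["role"])
        if roles == ("user", "assistant", "user") and is_correction(followup["text"]):
            pairs.append(
                PreferencePair(
                    prompt=first["text"],
                    rejected=reply["text"],
                    feedback=followup["text"],
                )
            )
    return pairs
```

Keyword matching is a deliberately crude stand-in here; it will miss polite or implicit corrections and is only meant to show the shape of the extracted record.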
Why It Matters to Us
This paper describes, in RL terms, what the docs-meta initiative captures by hand:
- Our conversations hold the same kind of preference signal. Every "no, not that; instead do..." moment from Tim is the shape of preference pair this paper extracts: the AI's wrong answer + the corrected direction. We have hundreds of sessions of these across 10+ roles. RES-002 (OpenClaw-RL) describes why this data is valuable; RES-003 describes the mechanism: preference pairs + persona extraction. (The scale gap is large and worth stating plainly: RLHI trained on 1M+ conversations from many users; our corpus is hundreds of sessions spanning multiple roles, not one user. Whether the mechanism transfers at our scale is an open question, not a settled result.)
- We already build "user personas" manually. Session states, concept-checks, lessons files, and memory entries are hand-built persona artifacts. "str-takase reads the spec before asserting facts" is a persona constraint. "str-mamori never guesses IP attribution" is a persona constraint. This paper says the same thing can be extracted automatically from conversation history.
- docs-meta is the retrieval layer. The raw conversations (JSONL in ~/.claude/) contain the preference pairs. docs-meta (the planned vault) makes them searchable and extractable. Without docs-meta, the preference pairs are locked in unsearchable files, exactly the "thrown away" state this paper critiques.
- The compounding argument gets stronger. RES-001 says LLMs converge without human intervention. RES-002 says corrections are the most valuable signal. RES-003 says those corrections can build preference pairs AND user personas. Together: our human-shuttled correction system isn't a workaround for lacking agent-to-agent communication; it's the optimal architecture for generating the highest-quality improvement signal.
- Future possibility. If model fine-tuning ever becomes accessible for our use case, the docs-meta archive would be the training corpus: hundreds of sessions of corrections, already tagged by domain (str-takase, str-ishizue, str-mamori, etc.). The infrastructure investment pays off regardless: useful now for context retrieval, and possibly for model adaptation later.
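To make the "retrieval layer" point in the list above concrete, here is a minimal sketch of searching JSONL conversation files for user messages. It is not the docs-meta design. The function names `iter_messages` and `search`, and the `role` and `text` fields, are hypothetical; this note does not specify the actual record schema of the files under ~/.claude/, so the field names would need to be adapted.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_messages(path: Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty JSONL line.

    Lines that are not valid JSON, or that parse to something other than
    an object, are skipped rather than raising.
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            if isinstance(record, dict):
                yield record


def search(root: Path, needle: str) -> list[tuple[str, str]]:
    """Return (file name, message text) for each user message containing needle.

    Matching is case-insensitive. The "role" and "text" keys are assumed
    field names, not a documented schema.
    """
    wanted = needle.lower()
    hits = []
    for path in sorted(root.rglob("*.jsonl")):
        for record in iter_messages(path):
            text = record.get("text", "")
            if (
                record.get("role") == "user"
                and isinstance(text, str)
                and wanted in text.lower()
            ):
                hits.append((path.name, text))
    return hits
```

A call such as `search(Path("some/archive/dir"), "stop guessing")` would list every user message containing that phrase, along with the file it came from. A linear scan like this is fine for hundreds of sessions; an index would only matter at a much larger scale.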
The Three-Paper Arc
| Paper | What it says | What it means for us |
|---|---|---|
| RES-001 (Hivemind) | LLMs converge without humans | Why we need HITM |
| RES-002 (OpenClaw-RL) | Corrections are gold, systems throw them away | Why the flywheel works |
| RES-003 (RLHI) | Corrections create preference pairs + user personas | Why docs-meta is worth building |
Empirical Confirmation (RES-008, 2026-03-26)
RES-008 confirmed the core prediction of all three papers in this arc. Using the conversation archive (PLN-011), str-michi analyzed 121 sessions and 2,371 Tim messages to discover the "kobayashi maru" failure pattern — AI strategists hitting invisible architectural constraints and spiraling through fixes that can't converge.
The key confirmation of RES-003 specifically: the preference pairs this paper describes were empirically useful. Tim's corrections ("that's wrong," "stop guessing," "read the documentation") created clear signal/noise separation in the conversation data. Sessions with correction rates above 15% were a distinct troubled population — not gradually worse, but a qualitatively different class. The corrections didn't just create preference pairs for potential future training — they were immediately analytically useful for diagnosing system-level failure modes.
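The 15% threshold above can be illustrated with a short sketch. The definition used here, the share of a session's user messages that contain a correction phrase, is an assumption; this note does not state how RES-008 actually computed correction rate. The phrase list reuses the three examples quoted above, and the function names are hypothetical.

```python
# Phrases taken from the examples quoted in this note; a real analysis
# would likely need a broader and more careful detector.
CORRECTION_PHRASES = ("that's wrong", "stop guessing", "read the documentation")
THRESHOLD = 0.15  # the 15% cut-off mentioned above


def is_correction(message: str) -> bool:
    """Case-insensitive phrase match, treating curly apostrophes as straight."""
    normalized = message.lower().replace("\u2019", "'")
    return any(phrase in normalized for phrase in CORRECTION_PHRASES)


def correction_rate(user_messages: list[str]) -> float:
    """Fraction of user messages flagged as corrections (0.0 for an empty session)."""
    if not user_messages:
        return 0.0
    flagged = sum(1 for message in user_messages if is_correction(message))
    return flagged / len(user_messages)


def troubled_sessions(
    sessions: dict[str, list[str]], threshold: float = THRESHOLD
) -> list[str]:
    """Return the sorted ids of sessions whose correction rate exceeds the threshold."""
    return sorted(
        session_id
        for session_id, messages in sessions.items()
        if correction_rate(messages) > threshold
    )
```

Under this definition, a session with 2 corrections in 10 user messages (20%) would be flagged, while one with 1 in 10 (10%) would not.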
The conversation archive (1,430+ searchable text files, JSONL with timestamps) made this research possible. Without the retrieval infrastructure this paper justified building, the correction patterns would have been locked in unsearchable files — exactly the "thrown away" state both RES-002 and RES-003 warn against.
Published September 2025, discovered 2026-03-15. Part of the three-paper arc: RES-001 (why HITM) → RES-002 (why the flywheel works) → RES-003 (why docs-meta is worth building).
