RES-003: RLHI — Reinforcement Learning from Human Interaction (2026-03-15)

Citation

Chuanyang Jin, Bo Liu, et al. "The Era of Real-World Human Interaction: RL from User Conversations." arXiv:2509.25137, September 2025. Meta FAIR and Johns Hopkins University.

Key Findings

Related Finding (uncited)

A Stanford study (October 2025, exact title unconfirmed) reportedly showed small models trained on human feedback that includes corrections outperforming much larger models trained without it, with corrections described as "100x more valuable" than raw compute. It is part of the broader "data exhaustion by 2028" discussion, in which human interactions become the scarce, high-quality resource.

Why It Matters to Us

This paper is the most direct justification for the docs-meta initiative:

  1. Our 949 conversations ARE the preference pairs. Every "no, not that — instead do..." moment from Tim is exactly the preference pair this paper extracts: the AI's wrong answer + the corrected direction. We have 949 sessions of these across 10+ roles. RES-002 (OpenClaw-RL) says this data is gold; RES-003 says specifically HOW it's gold — preference pairs + persona extraction.

  2. We already build "user personas" manually. Session states, concept-checks, lessons files, memory entries — these are hand-built persona artifacts. "str-takase reads the spec before asserting facts" is a persona constraint. "str-mamori never guesses IP attribution" is a persona constraint. This paper says the same thing can be extracted automatically from conversation history.

  3. docs-meta is the retrieval layer. The raw conversations (JSONL in ~/.claude/) contain the preference pairs. docs-meta (the planned vault) makes them searchable and extractable. Without docs-meta, the preference pairs are locked in unsearchable files — exactly the "thrown away" state this paper critiques.
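The extraction step described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the JSONL schema (`role`/`content` fields) and the correction markers are assumptions standing in for whatever the real archive format and Tim's actual phrasing turn out to be.

```python
import json
from pathlib import Path

# Hypothetical correction markers -- a real pass would derive these from
# reading the archive, not from a hardcoded list.
CORRECTION_MARKERS = ("no, not that", "that's wrong", "instead do", "stop guessing")

def extract_preference_pairs(jsonl_path):
    """Scan one session log for (rejected answer, corrective direction) pairs.

    Assumes each line is a JSON object with "role" ("assistant"/"user")
    and "content" fields -- a simplification of any real log schema.
    """
    pairs = []
    prev_assistant = None
    for line in Path(jsonl_path).read_text().splitlines():
        msg = json.loads(line)
        if msg["role"] == "assistant":
            prev_assistant = msg["content"]
        elif msg["role"] == "user" and prev_assistant is not None:
            if any(m in msg["content"].lower() for m in CORRECTION_MARKERS):
                # The rejected answer plus the correction form one preference pair.
                pairs.append({"rejected": prev_assistant,
                              "correction": msg["content"]})
    return pairs
```

The point of the sketch is only that the pairs are mechanically recoverable once the files are findable; without the retrieval layer, this loop has nothing to iterate over.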

  4. The compounding argument gets stronger. RES-001 says LLMs converge without human intervention. RES-002 says corrections are the most valuable signal. RES-003 says those corrections can build preference pairs AND user personas. Together: our human-shuttled correction system isn't a workaround for lacking agent-to-agent communication — it's the optimal architecture for generating the highest-quality improvement signal.

  5. Future possibility. If model fine-tuning ever becomes accessible for our use case, the docs-meta archive would be the training corpus. 949 sessions x average corrections per session = thousands of preference pairs, already tagged by domain (str-takase, str-ishizue, str-mamori, etc.). The infrastructure investment pays off regardless — useful now for context retrieval, potentially transformative later for model adaptation.

The Three-Paper Arc

| Paper | What it says | What it means for us |
|-------|--------------|----------------------|
| RES-001 (Hivemind) | LLMs converge without humans | Why we need HITM |
| RES-002 (OpenClaw-RL) | Corrections are gold, systems throw them away | Why the flywheel works |
| RES-003 (RLHI) | Corrections create preference pairs + user personas | Why docs-meta is worth building |

Empirical Confirmation (RES-008, 2026-03-26)

RES-008 confirmed the core prediction of all three papers in this arc. Using the conversation archive (PLN-011), str-michi analyzed 121 sessions and 2,371 Tim messages to discover the "kobayashi maru" failure pattern — AI strategists hitting invisible architectural constraints and spiraling through fixes that can't converge.

The key confirmation of RES-003 specifically: the preference pairs this paper describes were empirically useful. Tim's corrections ("that's wrong," "stop guessing," "read the documentation") created clear signal/noise separation in the conversation data. Sessions with correction rates above 15% were a distinct troubled population — not gradually worse, but a qualitatively different class. The corrections didn't just create preference pairs for potential future training — they were immediately analytically useful for diagnosing system-level failure modes.
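The 15% threshold split can be expressed as a small classifier. This is a sketch of the shape of the analysis, not str-michi's actual code: the marker list and the per-message correction test are illustrative assumptions.

```python
# Illustrative markers only; the real analysis would match Tim's actual phrasing.
MARKERS = ("that's wrong", "stop guessing", "read the documentation")

def is_correction(message):
    """Crude correction detector: substring match against known markers."""
    return any(marker in message.lower() for marker in MARKERS)

def correction_rate(messages):
    """Fraction of a session's user messages that are corrections."""
    if not messages:
        return 0.0
    return sum(1 for m in messages if is_correction(m)) / len(messages)

def classify_session(messages, threshold=0.15):
    """Label a session 'troubled' or 'normal' by the 15% correction-rate split."""
    return "troubled" if correction_rate(messages) > threshold else "normal"
```

Treating the threshold as a population boundary rather than a gradient matches the RES-008 observation: sessions above it behaved as a qualitatively different class, not merely as worse versions of normal sessions.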

The conversation archive (1,430+ searchable text files, JSONL with timestamps) made this research possible. Without the retrieval infrastructure this paper justified building, the correction patterns would have been locked in unsearchable files — exactly the "thrown away" state both RES-002 and RES-003 warn against.


Published September 2025, discovered 2026-03-15. Part of the three-paper arc: RES-001 (why HITM) → RES-002 (why the flywheel works) → RES-003 (why docs-meta is worth building).