R0055/2026-04-01/C002/SRC01/E01
RLHF pipeline described: human labelers express preferences used to train reward models
URL: https://arxiv.org/pdf/2310.13548
Extract
RLHF trains models using human preference data. Labelers compare outputs and indicate which they prefer, creating a training signal for reward models that guide policy optimization via reinforcement learning.
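The step from labeler comparisons to a reward-model training signal can be illustrated with a minimal sketch. It assumes a Bradley-Terry style pairwise loss, which is a common formulation but is not stated in the extract itself; the function names `preference_loss` and `mean_preference_loss` and the sample reward values are hypothetical, chosen only for illustration.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one labeler comparison (assumed Bradley-Terry form).

    Returns -log(sigmoid(reward_chosen - reward_rejected)), which is small
    when the reward model scores the preferred output higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average loss over (reward_chosen, reward_rejected) score pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical reward-model scores for three labeled comparisons.
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
batch_loss = mean_preference_loss(example_pairs)
```

Minimizing this loss pushes the reward model to score labeler-preferred outputs above rejected ones. The trained reward model would then supply the reward used during policy optimization, which this sketch does not cover.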
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Strong |
| H2 | Supports | Moderate |
| H3 | Contradicts | Strong |
Context
Evidence directly relevant to testing the claim's factual assertions.