
R0040/2026-03-28/Q001/S02

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Search S02

WebSearch — Focused search on DPO, RLAIF, and Constitutional AI

Summary

Field | Value
Source/Database | WebSearch
Query terms | RLHF alternatives DPO RLAIF constitutional AI training methods
Filters | None
Results returned | 10
Results selected | 4
Results rejected | 6

Selected Results

Result | Title | URL | Rationale
S02-R01 | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | https://arxiv.org/abs/2305.18290 | Original DPO paper — primary source
S02-R02 | Constitutional AI & AI Feedback (RLHF Book) | https://rlhfbook.com/c/13-cai | Comprehensive CAI chapter with citations
S02-R03 | Constitutional AI: Harmlessness from AI Feedback | https://arxiv.org/abs/2212.08073 | Original Anthropic CAI paper — primary source
S02-R04 | RL Meets LLMs: Survey of Advancements | https://arxiv.org/html/2509.16679v1 | Comprehensive survey of RL methods for LLMs

Rejected Results

Result | Title | URL | Rationale
S02-R05 | Fine-tune LLMs with RL from human or AI feedback (AWS) | https://aws.amazon.com/blogs/machine-learning/fine-tune-large-language-models-with-reinforcement-learning-from-human-or-ai-feedback/ | Practitioner guide, not primary research
S02-R06 | Comprehensive Guide to RL in Modern AI (HuggingFace) | https://huggingface.co/blog/ProCreations/guide-to-rl | Tutorial-level, no novel findings
S02-R07 | Topic 46: RLHF variations: DPO, RRHF, RLAIF | https://turingpost.substack.com/p/topic-46-rlhf-variations-dpo-rrhf | Newsletter summary, covered by primary sources
S02-R08 | RLHF Wikipedia | https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback | Encyclopedia entry, useful for context but not evidence
S02-R09 | RLHF for LLMs (Neptune.ai) | https://neptune.ai/blog/reinforcement-learning-from-human-feedback-for-llms | Tutorial, no primary data
S02-R10 | rlhf, rlaif, ppo, dpo and more (arXiv survey) | https://arxiv.org/pdf/2407.16216 | Survey paper — useful but largely redundant with S02-R04

Notes

This search surfaced the primary academic sources for DPO and Constitutional AI; the Constitutional AI paper (S02-R03) also serves as the primary source for RLAIF, which it introduced. These original papers provide the strongest evidence for the methods' technical foundations and empirical results.
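As a reference point for the "technical foundations" noted above, below is a minimal sketch of the DPO objective described in S02-R01 (Rafailov et al., 2023). It assumes per-response summed log-probabilities under the trained policy and a frozen reference model have already been computed; the function and parameter names are illustrative and are not taken from any released implementation.

```python
# Minimal sketch of the DPO loss from S02-R01 (hypothetical helper, not the paper's code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization: -log sigmoid(beta * reward margin)."""
    # Implicit rewards are the policy-to-reference log-ratios, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The preferred (chosen) response should earn a higher implicit reward
    # than the rejected one; minimize the negative log-sigmoid of the margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The beta term controls how strongly the policy is kept close to the reference model: smaller values allow larger deviations, larger values keep the implicit rewards tightly anchored to the reference.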