R0040/2026-03-28/Q001/S02
WebSearch — Focused search on DPO, RLAIF, and Constitutional AI
Summary
| Field | Value |
| --- | --- |
| Source/Database | WebSearch |
| Query terms | RLHF alternatives DPO RLAIF constitutional AI training methods |
| Filters | None |
| Results returned | 10 |
| Results selected | 4 |
| Results rejected | 6 |
Selected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R01 | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | https://arxiv.org/abs/2305.18290 | Original DPO paper — primary source |
| S02-R02 | Constitutional AI & AI Feedback (RLHF Book) | https://rlhfbook.com/c/13-cai | Comprehensive CAI chapter with citations |
| S02-R03 | Constitutional AI: Harmlessness from AI Feedback | https://arxiv.org/abs/2212.08073 | Original Anthropic CAI paper — primary source |
| S02-R04 | RL Meets LLMs: Survey of Advancements | https://arxiv.org/html/2509.16679v1 | Comprehensive survey of RL methods for LLMs |
Rejected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R05 | Fine-tune LLMs with RL from human or AI feedback (AWS) | https://aws.amazon.com/blogs/machine-learning/fine-tune-large-language-models-with-reinforcement-learning-from-human-or-ai-feedback/ | Practitioner guide, not primary research |
| S02-R06 | Comprehensive Guide to RL in Modern AI (HuggingFace) | https://huggingface.co/blog/ProCreations/guide-to-rl | Tutorial-level, no novel findings |
| S02-R07 | Topic 46: RLHF variations: DPO, RRHF, RLAIF | https://turingpost.substack.com/p/topic-46-rlhf-variations-dpo-rrhf | Newsletter summary, covered by primary sources |
| S02-R08 | RLHF Wikipedia | https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback | Encyclopedia entry, useful for context but not evidence |
| S02-R09 | RLHF for LLMs (Neptune.ai) | https://neptune.ai/blog/reinforcement-learning-from-human-feedback-for-llms | Tutorial, no primary data |
| S02-R10 | rlhf, rlaif, ppo, dpo and more (arXiv survey) | https://arxiv.org/pdf/2407.16216 | Survey paper — useful but largely redundant with S02-R04 |
Notes
This search surfaced the primary academic sources for both DPO and Constitutional AI. The original papers provide the strongest evidence for the methods' technical foundations and empirical results; the rejected items were secondary or tutorial material already covered by those primary sources.