S02


Research	R0040 — RLHF Alternatives
Run	2026-04-01
Query	Q002
Search	S02

WebSearch — Reward shaping and sycophancy mitigation within RLHF

Summary

Field	Value
Source/Database	WebSearch
Query terms	"reward shaping" "reward correction" sycophancy mitigation RLHF training intervention 2025 2026
Filters	None
Results returned	10
Results selected	3
Results rejected	7

Selected Results

Result	Title	URL	Rationale
S02-R01	Reward Shaping to Mitigate Reward Hacking in RLHF	https://arxiv.org/abs/2502.18770	Directly addresses reward shaping as RLHF mitigation
S02-R02	How RLHF Amplifies Sycophancy (Benade PDF)	https://www.gerdusbenade.com/files/26_sycophancy.pdf	Proposed reward correction targeting sycophancy amplification
S02-R03	One Bias After Another: Mechanistic Reward Shaping	https://arxiv.org/html/2603.03291	Analysis of persistent biases in language reward models

Rejected Results

Result	Title	URL	Rationale
S02-R04	Reward Hacking in RL (Lil'Log)	https://lilianweng.github.io/posts/2024-11-28-reward-hacking/	General reward hacking overview, not sycophancy-specific
S02-R05	Reward Shaping OpenReview PDF	https://openreview.net/pdf?id=62A4d5Mokc	Same paper as R01, different format
S02-R06	Reward Shaping ResearchGate	https://www.researchgate.net/publication/389392526_Reward_Shaping_to_Mitigate_Reward_Hacking_in_RLHF	Same paper as R01, different platform
S02-R07	Reward Hacking Defense Rubric	https://www.emergentmind.com/topics/reward-hacking-defense-rubric	Aggregator topic page
S02-R08	Natural Emergent Misalignment (Anthropic)	https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf	Focuses on emergent misalignment, tangential to sycophancy
S02-R09	Sycophancy Whitepaper	https://jinaldesai.com/wp-content/uploads/2026/02/AI_Sycophancy_Whitepaper_JinalDesai.pdf	Duplicate from S01, non-peer-reviewed
S02-R10	Reward Shaping arXiv PDF	https://arxiv.org/pdf/2502.18770	Same paper as R01, PDF format

Notes

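Reward shaping within RLHF is an active research area. The PAR method (Fu et al., S02-R01) and the agreement penalty of Shapira et al. (S02-R02) are two distinct approaches that keep RLHF and modify its reward signal instead of replacing it. PAR targets reward hacking in general, while the agreement penalty targets sycophancy amplification specifically.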