SRC04-E01 — Pinpoint Tuning Reduces Sycophancy by Targeting <5% of Modules
Extract
Supervised Pinpoint Tuning (SPT) identifies "a small percentage (<5%) of basic modules that significantly affect a particular behavior of LLMs" and fine-tunes "only these identified modules while freezing the rest." SPT "significantly mitigates the sycophancy issue of LLMs (even better than SFT)" with "limited or even no side effects on the general capability of LLMs." Llama-2-13B with SPT showed a "71.84% increase in confidence and a 67.83% increase in truthfulness."
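The core mechanic described in the extract, fine-tuning only a small identified subset of modules while freezing the rest, can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `pinpoint_freeze`, the toy model, and the per-module attribution scores are all hypothetical, and the paper's actual method for scoring which modules drive sycophancy is not shown here.

```python
import torch.nn as nn

def pinpoint_freeze(model: nn.Module, scores: dict, fraction: float = 0.05):
    """Freeze all parameters, then re-enable gradients only for the
    highest-scoring `fraction` of named modules (here <5% by default)."""
    for p in model.parameters():
        p.requires_grad = False
    named = [n for n, _ in model.named_modules() if n in scores]
    k = max(1, int(len(named) * fraction))
    top = sorted(named, key=lambda n: scores[n], reverse=True)[:k]
    for name, module in model.named_modules():
        if name in top:
            for p in module.parameters():
                p.requires_grad = True
    return top

# Toy stand-in for an LLM: 20 linear "modules"; pretend module "3"
# is the one most implicated in the target behavior (hypothetical scores).
model = nn.Sequential(*[nn.Linear(4, 4) for _ in range(20)])
scores = {str(i): float(i == 3) for i in range(20)}
tuned = pinpoint_freeze(model, scores, fraction=0.05)
trainable = sum(p.requires_grad for p in model.parameters())
total = sum(1 for _ in model.parameters())
```

After this step, an ordinary supervised fine-tuning loop would update only the unfrozen modules; the optimizer should be built from `filter(lambda p: p.requires_grad, model.parameters())` so frozen weights are untouched.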
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports — targeted technical solution being developed | Moderate |
| H2 | Contradicts — active research on solutions | Strong |
| H3 | Partially contradicts — suggests surgical fixes may work without changing RLHF itself | Moderate |
Context
SPT is notable because it addresses sycophancy post hoc rather than by changing the training method, which suggests sycophancy can potentially be corrected after RLHF training rather than only prevented during it.
Notes
SPT does not modify RLHF itself; it mitigates sycophancy after training is complete. If such post-hoc fixes prove effective, RLHF could continue to be used as-is.