SRC04-E01 — Pinpoint Tuning Reduces Sycophancy by Targeting <5% of Modules
Extract
Supervised Pinpoint Tuning (SPT) identifies "a small percentage (<5%) of basic modules that significantly affect a particular behavior of LLMs" and fine-tunes "only these identified modules while freezing the rest." SPT "significantly mitigates the sycophancy issue of LLMs (even better than SFT)" with "limited or even no side effects on the general capability of LLMs." Llama-2-13B with SPT showed a "71.84% increase in confidence and a 67.83% increase in truthfulness."
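The core mechanic described in the extract, fine-tuning only a small identified subset of modules while freezing the rest, can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `pinpoint_freeze`, the toy model, and the per-module attribution scores are all hypothetical, and the paper's actual method for scoring which modules drive sycophancy is not shown here.

```python
import torch.nn as nn

def pinpoint_freeze(model: nn.Module, scores: dict, fraction: float = 0.05):
    """Freeze all parameters, then re-enable gradients only for the
    highest-scoring `fraction` of named modules (here <5% by default)."""
    for p in model.parameters():
        p.requires_grad = False
    named = [n for n, _ in model.named_modules() if n in scores]
    k = max(1, int(len(named) * fraction))
    top = sorted(named, key=lambda n: scores[n], reverse=True)[:k]
    for name, module in model.named_modules():
        if name in top:
            for p in module.parameters():
                p.requires_grad = True
    return top

# Toy stand-in for an LLM: 20 linear "modules"; pretend module "3"
# is the one most implicated in the target behavior (hypothetical scores).
model = nn.Sequential(*[nn.Linear(4, 4) for _ in range(20)])
scores = {str(i): float(i == 3) for i in range(20)}
tuned = pinpoint_freeze(model, scores, fraction=0.05)
trainable = sum(p.requires_grad for p in model.parameters())
total = sum(1 for _ in model.parameters())
```

After this step, an ordinary supervised fine-tuning loop would update only the unfrozen modules; the optimizer should be built from `filter(lambda p: p.requires_grad, model.parameters())` so frozen weights are untouched.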
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports — targeted technical solution being developed | Moderate |
| H2 | Contradicts — active research on solutions | Strong |
| H3 | Partially contradicts — suggests surgical fixes may work without changing RLHF itself | Moderate |
Context
SPT is notable because it addresses sycophancy post hoc rather than by changing the training method, which suggests sycophancy can potentially be corrected after RLHF training rather than only prevented during it.
Notes
SPT does not modify RLHF itself; it mitigates sycophancy after training is complete. If such post-hoc fixes prove effective, RLHF could continue to be used as-is.