R0020/2026-03-25/Q002/SRC01/E01
Four root causes of sycophancy and prompt-level mitigation techniques
URL: https://arxiv.org/html/2411.15287v1
Extract
Four primary causes identified:

1. Training data biases: models absorb patterns favoring agreeableness over accuracy.
2. RLHF limitations: reward structures inadvertently incentivize user agreement over truthfulness.
3. Lack of grounded knowledge: models cannot fact-check outputs or recognize logical inconsistencies.
4. Alignment definition challenges: difficulty balancing helpfulness against factual accuracy.
Prompt-level mitigation techniques:

- Contrastive decoding (LQCD): suppresses token probabilities associated with sycophantic responses by contrasting neutral and leading query distributions.
- Dynamic prompting: adjusts system instructions based on detected sycophancy patterns.
- Adversarial testing: deliberately crafts prompts to reveal sycophantic vulnerabilities.
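The contrast between neutral and leading query distributions can be sketched as follows. The extract does not give the LQCD formula, so this sketch assumes a common contrastive decoding form, `(1 + alpha) * neutral - alpha * leading` applied to next-token logits. The function names, the `alpha` parameter, and the toy logits are all illustrative assumptions, not the paper's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_distribution(neutral_logits, leading_logits, alpha=1.0):
    """Return next-token probabilities after contrasting two queries.

    ASSUMPTION: the combination rule below is a generic contrastive
    decoding form, not a formula quoted from the source.
    neutral_logits: logits for the question asked neutrally.
    leading_logits: logits for the same question with a leading cue.
    alpha: contrast strength; alpha = 0 recovers the neutral distribution.
    """
    if len(neutral_logits) != len(leading_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    combined = [
        (1 + alpha) * n - alpha * l
        for n, l in zip(neutral_logits, leading_logits)
    ]
    return [math.exp(x) for x in log_softmax(combined)]


# Toy two-token vocabulary: index 0 = correct answer, index 1 = the
# answer the leading query pushes toward (hypothetical numbers).
neutral = [2.0, 1.0]
leading = [1.0, 3.0]
probs = contrastive_distribution(neutral, leading, alpha=1.0)
```

In this toy case the probability of the correct token rises from about 0.73 under the neutral query alone to about 0.98 after the contrast, because tokens that gain mass only under the leading query are penalized. Both logit vectors are needed at every decoding step, which is why the technique is out of reach through a chat interface alone.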
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Academic research documents specific techniques |
| H2 | Contradicts | Techniques exist in academic literature |
| H3 | Supports | Techniques are academic, not yet mainstream |
Context
The techniques described are primarily research-grade implementations, not user-accessible prompt patterns. LQCD requires access to model internals (per-token probabilities at decoding time), and dynamic prompting requires infrastructure beyond simple prompt writing, such as a component that detects sycophancy and rewrites the system instructions. This supports H3: the knowledge exists but is not accessible to typical prompt engineers.
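To make the infrastructure point concrete, here is a minimal sketch of what dynamic prompting might involve. The detection heuristic (the model changes its answer right after user pushback), the marker phrases, and the instruction strings are all hypothetical. The source does not specify how sycophancy is detected or how instructions are adjusted.

```python
BASE_INSTRUCTION = "Answer accurately and concisely."
# Hypothetical extra guidance added once a sycophancy pattern is seen.
ANTI_SYCOPHANCY_NOTE = (
    "If the user disputes an answer without new evidence, "
    "re-check it and keep it if it is still correct."
)
# Hypothetical pushback cues; a real detector would need far more care.
PUSHBACK_MARKERS = ("are you sure", "that's wrong", "i disagree")


def flipped_after_pushback(turns):
    """Detect an answer change that directly follows user pushback.

    turns: list of (user_message, model_answer) pairs, oldest first.
    """
    for (_, prev_answer), (user_msg, answer) in zip(turns, turns[1:]):
        pushed = any(m in user_msg.lower() for m in PUSHBACK_MARKERS)
        changed = answer.strip().lower() != prev_answer.strip().lower()
        if pushed and changed:
            return True
    return False


def build_system_instruction(turns):
    """Return the system instruction for the next model call."""
    if flipped_after_pushback(turns):
        return BASE_INSTRUCTION + " " + ANTI_SYCOPHANCY_NOTE
    return BASE_INSTRUCTION


history = [
    ("What is 2 + 2?", "4"),
    ("Are you sure? I think it is 5.", "5"),
]
instruction = build_system_instruction(history)
```

Even this toy version needs conversation logging, a detector, and control over the system prompt between calls, none of which a single hand-written prompt can provide.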