R0053/2026-03-31-02/C002/SRC02/E01¶
Instruction hierarchies fail in LLMs — system/user separation does not establish priority
URL: https://arxiv.org/abs/2502.15851
Extract¶
"The widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy." Models "exhibit strong inherent biases toward certain constraint types regardless of their priority designation." "Societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails." Tested across six state-of-the-art LLMs.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Confirms that enforcement is needed (requirements are not reliably followed) |
| H2 | Supports | Confirms the problem exists but suggests the mechanism is more complex than negative/positive framing |
| H3 | Contradicts | Directly shows AI does not reliably follow all requirements |
Context¶
This paper was accepted to AAAI-26, indicating peer review and methodological rigor. The finding about social hierarchy framings being more influential than technical instruction mechanisms is particularly relevant — it suggests enforcement language may work not because of negative framing but because of perceived authority.