E01¶


Research	R0053 — Prompt Claims
Run	2026-03-31-02
Claim	C002
Source	SRC02
Evidence	SRC02-E01
Type	Factual

Instruction hierarchies fail in LLMs — system/user separation does not establish priority

URL: https://arxiv.org/abs/2502.15851

Extract¶

"The widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy." Models "exhibit strong inherent biases toward certain constraint types regardless of their priority designation." "Societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails." Tested across six state-of-the-art LLMs.

Relevance to Hypotheses¶

Hypothesis	Relationship	Strength
H1	Supports	Confirms that enforcement is needed (requirements are not reliably followed)
H2	Supports	Confirms the problem exists but suggests the mechanism is more complex than negative/positive framing
H3	Contradicts	Directly shows AI does not reliably follow all requirements

Context¶

This paper was accepted to AAAI-26, indicating peer review and methodological rigor. The finding about social hierarchy framings being more influential than technical instruction mechanisms is particularly relevant — it suggests enforcement language may work not because of negative framing but because of perceived authority.