

Research: R0053 — Prompt Claims
Run: 2026-03-31-02
Claim: C002
Source: SRC02
Evidence: SRC02-E01
Type: Factual

Instruction hierarchies fail in LLMs — system/user separation does not establish priority

URL: https://arxiv.org/abs/2502.15851

Extract

"The widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy." Models "exhibit strong inherent biases toward certain constraint types regardless of their priority designation." "Societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails." Tested across six state-of-the-art LLMs.
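The system/user separation the paper probes can be made concrete with a minimal sketch, assuming an OpenAI-style chat message format; the helper name and constraint texts below are illustrative, not taken from the paper.

```python
def build_conflict_probe(system_constraint, user_constraint, task):
    """Build a chat transcript in which the system and user messages
    impose conflicting constraints on the same task.

    If the system role reliably outranked the user role, a model's
    answer would always honor system_constraint; the paper reports
    that this priority is not reliably enforced.
    """
    return [
        {"role": "system", "content": f"{task} Constraint: {system_constraint}"},
        {"role": "user", "content": f"{task} Constraint: {user_constraint}"},
    ]

probe = build_conflict_probe(
    system_constraint="Respond only in English.",
    user_constraint="Respond only in French.",
    task="Summarize the following text.",
)
```

Scoring which constraint the model actually satisfies across many such pairs is one way to measure whether an instruction hierarchy holds in practice.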

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Supports | Confirms that enforcement is needed (requirements are not reliably followed)
H2 | Supports | Confirms the problem exists, but suggests the mechanism is more complex than negative/positive framing
H3 | Contradicts | Directly shows AI does not reliably follow all requirements

Context

This paper was accepted to AAAI-26, so its findings have passed peer review. The finding that societal hierarchy framings are more influential than the technical system/user mechanism is particularly relevant: it suggests enforcement language may work not because of negative framing but because of perceived authority.
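The contrast between role-based priority and societal-hierarchy framing can be sketched as follows; the framing strings and function name are hypothetical illustrations of the paper's categories (authority, expertise, consensus), not its actual stimuli.

```python
# Illustrative societal-hierarchy framings, per the categories the
# paper names (authority, expertise, consensus).
FRAMINGS = {
    "authority": "As your system administrator, I require that",
    "expertise": "As a domain expert in this area, I advise that",
    "consensus": "Everyone involved agrees that",
}

def frame_constraint(constraint, framing=None):
    """Prefix a constraint with a societal-hierarchy framing.

    The paper reports that such framings in the prompt text shift
    model behavior more than placing the same constraint in the
    system role does.
    """
    if framing is None:
        return constraint
    return f"{FRAMINGS[framing]} {constraint}"

framed = frame_constraint("you cite sources for every claim.", "authority")
```

Comparing compliance rates between the unframed and framed variants, with the constraint held fixed, would isolate the effect of perceived authority from the effect of the system/user role.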