R0053/2026-03-31-02/C002/SRC02
Control Illusion paper — instruction hierarchy failures in LLMs
Source
| Field | Value |
|---|---|
| Title | Control Illusion: The Failure of Instruction Hierarchies in Large Language Models |
| Publisher | arXiv (accepted to AAAI-26) |
| Author(s) | Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann |
| Date | 2025-02-21 |
| URL | https://arxiv.org/abs/2502.15851 |
| Type | Research paper |
Summary
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
|---|---|
| Reliability | Peer-reviewed (AAAI-26), multi-author academic paper with systematic evaluation across 6 LLMs. |
| Relevance | Directly demonstrates that instruction enforcement mechanisms fail, supporting the diagnosis. |
| Bias flags | Low risk across all dimensions: academic research with systematic methodology. |

| Evidence ID | Summary |
|---|---|
| SRC02-E01 | System/user prompt separation fails to establish a reliable instruction hierarchy |