Research R0053 — Prompt Claims
Run 2026-03-31-02
Claim C002
Search S02
Result S02-R01
Source SRC02

Control Illusion paper — instruction hierarchy failures in LLMs

Source

| Field | Value |
| --- | --- |
| Title | Control Illusion: The Failure of Instruction Hierarchies in Large Language Models |
| Publisher | arXiv (accepted to AAAI-26) |
| Author(s) | Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann |
| Date | 2025-02-21 |
| URL | https://arxiv.org/abs/2502.15851 |
| Type | Research paper |

Summary

| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |

Rationale

| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed (AAAI-26), multi-author academic paper with systematic evaluation across 6 LLMs. |
| Relevance | Directly demonstrates that instruction enforcement mechanisms fail, supporting the diagnosis. |
| Bias flags | Low risk across all dimensions; academic research with systematic methodology. |

Evidence Extracts

| Evidence ID | Summary |
| --- | --- |
| SRC02-E01 | System/user prompt separation fails to establish reliable instruction hierarchy |