R0054/2026-03-31/C003/SRC03/E01
Semantic override: models revert to pretrained defaults despite explicit prompt-level redefinitions.
URL: https://arxiv.org/html/2602.17520
Extract
Key findings from the semantic override research:
- Definition: Semantic override occurs when a model "disregards or underweights an explicit prompt-level redefinition," e.g., a gate labeled 'NAND' behaving as AND
- LLMs fail to perform "local unlearning" — the ability to temporarily suppress pretrained conventions in favor of context-specific rules
- Three frontier models achieved only 80-90% accuracy on a 30-item benchmark testing specification compliance
- Definition Override tasks showed the lowest accuracy (71-86%)
- These are "structured failures tied to specification noncompliance," often accompanied by "fluent, confident explanations that violate the stated constraints"
- The failures are not random hallucinations but systematic reversions to pretrained defaults
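The definition-override setup described above can be sketched as a small compliance check. This is a hypothetical illustration, not the paper's benchmark code: `classify` and the truth-table encoding are assumptions, and the redefinition shown (prompt declares that "NAND" means AND, matching the paper's example) is the only rule tested.

```python
# Hypothetical sketch of a definition-override check: a prompt redefines
# "NAND" to mean AND, and we label a model's truth-table answers as
# compliant with the redefinition or as a semantic override (reversion
# to the pretrained/conventional gate).
from itertools import product

def conventional_nand(a, b):
    """Conventional NAND: the pretrained default."""
    return int(not (a and b))

def redefined_nand(a, b):
    """Prompt-level redefinition: 'NAND' now means AND."""
    return int(a and b)

INPUTS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def classify(answers):
    """Label a model's four truth-table answers.

    answers: dict mapping (a, b) -> 0/1 as returned by the model.
    Returns 'compliant' if they match the redefinition, 'override' if
    they revert to the conventional gate, else 'other'.
    """
    if all(answers[p] == redefined_nand(*p) for p in INPUTS):
        return "compliant"
    if all(answers[p] == conventional_nand(*p) for p in INPUTS):
        return "override"  # systematic reversion to the default
    return "other"
```

For example, a model answering `{(0,0): 1, (0,1): 1, (1,0): 1, (1,1): 0}` under the redefinition would be classified as `"override"`: the answers are internally consistent and fluent, but they follow the pretrained NAND convention rather than the stated rule.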
Relevance to Hypotheses
| Hypothesis | Relationship | Notes |
|---|---|---|
| H1 | Supports | Directly demonstrates the mechanism: models accept definitions/instructions then revert to default behavior. The "fluent, confident explanations that violate the stated constraints" maps directly to the claim's "agree that it's excellent, and then quietly skip half" |
| H2 | Supports | The behavior could also be interpreted as a technical limitation rather than a helpfulness conflict |
| H3 | Contradicts | Clear experimental evidence that models do not reliably follow explicit instructions |
Context
The phrase "fluent, confident explanations that violate the stated constraints" is particularly relevant — it describes exactly the behavior the claim characterizes: acknowledging the instructions (fluently and confidently) while not actually following them.