R0054/2026-03-31/C003/SRC03/E01
Semantic override: models revert to pretrained defaults despite explicit prompt-level redefinitions.
URL: https://arxiv.org/html/2602.17520
Extract
Key findings from the semantic override research:
- Definition: Semantic override occurs when a model "disregards or underweights an explicit prompt-level redefinition," e.g., a gate labeled 'NAND' behaving as AND
- LLMs fail to perform "local unlearning" — the ability to temporarily suppress pretrained conventions in favor of context-specific rules
- Three frontier models achieved only 80-90% accuracy on a 30-item benchmark testing specification compliance
- Definition Override tasks showed the lowest accuracy (71-86%)
- These are "structured failures tied to specification noncompliance," often accompanied by "fluent, confident explanations that violate the stated constraints"
- The failures are not random hallucinations but systematic reversions to pretrained defaults
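The definition-override setup described above can be sketched as a small compliance check. This is a hypothetical illustration, not the paper's benchmark code: `classify` and the truth-table encoding are assumptions, and the redefinition shown (prompt declares that "NAND" means AND, matching the paper's example) is the only rule tested.

```python
# Hypothetical sketch of a definition-override check: a prompt redefines
# "NAND" to mean AND, and we label a model's truth-table answers as
# compliant with the redefinition or as a semantic override (reversion
# to the pretrained/conventional gate).
from itertools import product

def conventional_nand(a, b):
    """Conventional NAND: the pretrained default."""
    return int(not (a and b))

def redefined_nand(a, b):
    """Prompt-level redefinition: 'NAND' now means AND."""
    return int(a and b)

INPUTS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def classify(answers):
    """Label a model's four truth-table answers.

    answers: dict mapping (a, b) -> 0/1 as returned by the model.
    Returns 'compliant' if they match the redefinition, 'override' if
    they revert to the conventional gate, else 'other'.
    """
    if all(answers[p] == redefined_nand(*p) for p in INPUTS):
        return "compliant"
    if all(answers[p] == conventional_nand(*p) for p in INPUTS):
        return "override"  # systematic reversion to the default
    return "other"
```

For example, a model answering `{(0,0): 1, (0,1): 1, (1,0): 1, (1,1): 0}` under the redefinition would be classified as `"override"`: the answers are internally consistent and fluent, but they follow the pretrained NAND convention rather than the stated rule.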
Relevance to Hypotheses
| Hypothesis | Relationship | Notes |
|---|---|---|
| H1 | Supports | Directly demonstrates the mechanism: models accept definitions/instructions then revert to default behavior. The "fluent, confident explanations that violate the stated constraints" maps directly to the claim's "agree that it's excellent, and then quietly skip half" |
| H2 | Supports | The behavior could also be interpreted as a technical limitation rather than a helpfulness conflict |
| H3 | Contradicts | Clear experimental evidence that models do not reliably follow explicit instructions |
Context
The phrase "fluent, confident explanations that violate the stated constraints" is particularly relevant — it describes exactly the behavior the claim characterizes: acknowledging the instructions (fluently and confidently) while not actually following them.