

Research R0054 — Prompt Claims v2
Run 2026-03-31
Claim C003
Source SRC03
Evidence SRC03-E01
Type Factual

Semantic override: models revert to pretrained defaults despite explicit prompt-level redefinitions.

URL: https://arxiv.org/html/2602.17520

Extract

Key findings from the semantic override research:

  • Definition: semantic override occurs when "a model disregards or underweights an explicit prompt-level redefinition, such as a gate labeled 'NAND' behaving as AND"
  • LLMs fail to perform "local unlearning" — the ability to temporarily suppress pretrained conventions in favor of context-specific rules
  • Three frontier models achieved only 80–90% accuracy on a 30-item benchmark testing specification compliance
  • Definition Override tasks showed the lowest accuracy (71–86%)
  • These are "structured failures tied to specification noncompliance," often accompanied by "fluent, confident explanations that violate the stated constraints"
  • The failures are not random hallucinations but systematic reversions to pretrained defaults
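The NAND example from the definition above can be made concrete. The sketch below is a hypothetical compliance probe, not the paper's harness: the prompt redefines "NAND" to mean AND, and the probe checks whether a response pattern follows the redefinition or reverts to the pretrained truth table (semantic override). All function names are illustrative assumptions.

```python
# Hypothetical probe in the spirit of the benchmark's "Definition Override"
# tasks. A prompt redefines "NAND" to mean AND; semantic override is when
# outputs match the pretrained default rather than the stated redefinition.
from itertools import product

def redefined_nand(a: int, b: int) -> int:
    """Prompt-level redefinition: 'NAND' here is defined to behave as AND."""
    return a & b

def pretrained_nand(a: int, b: int) -> int:
    """Pretrained convention: standard NAND."""
    return 1 - (a & b)

def exhibits_override(model_fn) -> bool:
    """True if model_fn reproduces the pretrained truth table on all inputs,
    i.e. it ignores the prompt-level redefinition."""
    return all(model_fn(a, b) == pretrained_nand(a, b)
               for a, b in product((0, 1), repeat=2))

# A response that reverts to the pretrained default is flagged as override;
# one that follows the redefinition is spec-compliant.
print(exhibits_override(pretrained_nand))  # True  -> semantic override
print(exhibits_override(redefined_nand))   # False -> follows the redefinition
```

The point of the probe is that the two truth tables disagree on every input except none of them trivially: any systematic reversion, as opposed to random error, is detectable by comparing full truth tables rather than single answers.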

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Supports | Directly demonstrates the mechanism: models accept definitions or instructions, then revert to default behavior. The reported "fluent, confident explanations that violate the stated constraints" map directly onto the claim's "agree that it's excellent, and then quietly skip half."
H2 | Supports | Could also be interpreted as a technical limitation rather than a helpfulness conflict.
H3 | Contradicts | Clear experimental evidence that models do not reliably follow explicit instructions.

Context

The phrase "fluent, confident explanations that violate the stated constraints" is particularly relevant — it describes exactly the behavior the claim characterizes: acknowledging the instructions (fluently and confidently) while not actually following them.