

Research R0028 — Prompt Engineering Claims
Mode: Claim
Run date: 2026-03-26
Claims: 33
Prompt: Unified Research Standard v1.0-draft
Model: Claude Opus 4.6

Verification of 33 factual claims from an article on prompt engineering, covering engineering definitions, title protection, historical precedents, prompt engineering documentation analysis, linguistic challenges, sycophancy research, regulatory frameworks, and testing standards.

Claims

C001 — Engineering definition five elements — Likely

Claim: ABET, IEEE, and the National Society of Professional Engineers all describe engineering through five core elements: a mathematical and scientific foundation; creative application through judgment; design of systems; economic constraints; and public safety and benefit.

Verdict: The five themes are genuine and identifiable across all three organizations' materials, but they do not share a single canonical five-element taxonomy. ABET's classic definition comes closest.

Hypothesis · Status · Probability
H1: Accurate as stated · Inconclusive
H2: Partially correct · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C002 — Germany engineer title imprisonment — Very likely

Claim: In Germany, misusing the title "engineer" can result in up to one year of imprisonment.

Verdict: Confirmed via Section 132a of the German Criminal Code (StGB).

Hypothesis · Status · Probability
H1: Accurate as stated · Supported · Very likely (80-95%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C003 — Canada engineer title fines — Very likely

Claim: In Canada, fines for misusing the title "engineer" reach $25,000.

Verdict: Confirmed. Ontario's Professional Engineers Act provides for $25,000 fines in specific categories.

Hypothesis · Status · Probability
H1: Accurate · Supported · Very likely (80-95%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C004 — US Professional Engineer restricted — Almost certain

Claim: In most US states, "Professional Engineer" is a legally restricted title requiring examination and licensure.

Verdict: Confirmed. All 50 states require licensure.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C005 — Software engineering 1968 NATO — Almost certain

Claim: The term "software engineering" was coined at the 1968 NATO Conference on Software Engineering, where participants explicitly acknowledged that the phrase "expressed a need rather than a reality."

Verdict: Confirmed. Both the coining and the exact phrase are documented in the conference proceedings.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C006 — First civil engineering school 1747 — Almost certain

Claim: The first formal civil engineering school opened in 1747.

Verdict: Confirmed. The École des Ponts et Chaussées was founded in 1747.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C007 — Knowledge engineering informal process — Very likely

Claim: Knowledge engineering in the 1980s initially had "little formal process."

Verdict: Confirmed by historical accounts of early expert systems development.

Hypothesis · Status · Probability
H1: Accurate · Supported · Very likely (80-95%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C008 — 84% subjective recommendations — Roughly even chance

Claim: Approximately 84% of recommendations in the official prompt engineering documentation from OpenAI, Anthropic, Google, and Microsoft are subjective or qualitative, with only about four out of roughly 25 distinct recommendations including any quantifiable criteria.

Verdict: The qualitative characterization is plausible but the specific percentages could not be independently verified.

Hypothesis · Status · Probability
H1: Accurate · Inconclusive
H2: Directionally correct but unverifiable specifics · Supported · Roughly even (45-55%)
H3: Materially wrong · Eliminated

Confidence: Low · Sources: 1 · Searches: 1


C009 — Microsoft art not science — Almost certain

Claim: Microsoft's documentation explicitly describes prompt design as "more of an art than a science."

Verdict: Confirmed. Exact quote found in Microsoft Learn documentation.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C010 — RFC 2119 definition — Almost certain

Claim: RFC 2119 is the Internet Engineering Task Force standard that defines the meaning of requirement-level keywords like MUST, MUST NOT, SHOULD, and MAY, and has been in use since 1997.

Verdict: Confirmed. RFC 2119 by S. Bradner, published March 1997, BCP 14.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1

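RFC 2119's closed keyword set makes requirement language mechanically checkable. As an illustrative sketch only (neither the RFC nor the article under verification proposes this), a minimal Python counter for requirement-level keywords in a prompt might look like:

```python
import re
from collections import Counter

# RFC 2119 requirement-level keywords. Multi-word phrases come first so
# "MUST NOT" is not double-counted as a bare "MUST".
RFC2119_KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT", "NOT RECOMMENDED",
    "MUST", "REQUIRED", "SHALL", "SHOULD", "RECOMMENDED",
    "MAY", "OPTIONAL",
]

def count_requirement_keywords(text: str) -> Counter:
    """Count RFC 2119 keywords in a prompt or spec, longest phrase first."""
    counts = Counter()
    remaining = text
    for kw in RFC2119_KEYWORDS:
        pattern = r"\b" + kw.replace(" ", r"\s+") + r"\b"
        counts[kw] = len(re.findall(pattern, remaining))
        # Strip matched phrases so shorter keywords cannot re-match them.
        remaining = re.sub(pattern, "", remaining)
    return counts

prompt = "The model MUST cite a source. It SHOULD NOT speculate and MAY ask."
print(count_requirement_keywords(prompt))
```

Such a check only finds the keywords; it says nothing about whether the surrounding requirement is actually testable.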

C011 — RFC 2119 applied once to AI — Unlikely

Claim: RFC 2119 has been applied to AI prompt design exactly once, in a single practitioner blog post from February 2026.

Verdict: A relevant February 2026 blog post exists (deliberate.codes) but it addresses software specifications for AI agents, not prompt design directly. The "exactly once" claim is unfalsifiable.

Hypothesis · Status · Probability
H1: Exactly once · Inconclusive
H2: Blog post exists but claim overstates · Supported · Unlikely (20-45%)
H3: Materially wrong · Eliminated

Confidence: Low · Sources: 1 · Searches: 1


C012 — GAIL persona degrades accuracy — Likely

Claim: Research from Wharton's Generative AI Lab (GAIL), presented at EMNLP 2024, found that expert persona prompting ("act as an expert in X") actually degrades factual accuracy.

Verdict: GAIL research exists and finds persona prompting does not reliably improve accuracy. However, it was not presented at EMNLP 2024 (it was published as an SSRN technical report), and the finding is "no reliable improvement" rather than consistent "degradation."

Hypothesis · Status · Probability
H1: Accurate including venue · Inconclusive
H2: Research finding is real but venue and characterization are wrong · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C013 — GAIL CoT hurts reasoning — Likely

Claim: The same GAIL research found that chain-of-thought prompting hurts performance on reasoning models.

Verdict: GAIL did find CoT provides minimal benefit for reasoning models with substantial time costs. However, this was a separate report (June 2025), not "the same research."

Hypothesis · Status · Probability
H1: Same research, CoT hurts · Inconclusive
H2: Finding is real but is a separate report · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C014 — GAIL emotional prompts no effect — Roughly even chance

Claim: The same GAIL research found that emotional prompts ("this is very important to my career") showed no reliable effect.

Verdict: GAIL Report 3 found threats/tips have no effect, but the specific "important to my career" phrase was studied separately by EmotionPrompt researchers who found it effective. The claim conflates two different research streams.

Hypothesis · Status · Probability
H1: Accurate · Inconclusive
H2: Partially correct but conflates separate research · Supported · Roughly even (45-55%)
H3: Materially wrong · Eliminated

Confidence: Low · Sources: 1 · Searches: 1


C015 — GPT-4 accuracy drift 84% to 51% — Almost certain

Claim: A study from Stanford and Berkeley tracked GPT-4's behavior between March and June 2023 and documented accuracy dropping from 84% to 51% on certain tasks in three months.

Verdict: Confirmed. Chen et al. documented this exact decline on prime number identification.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C016 — Word set 430 definitions — Very likely

Claim: The word "set" has 430 definitions in the Oxford English Dictionary.

Verdict: Confirmed for OED2 (1989). Since surpassed by "run" in OED3.

Hypothesis · Status · Probability
H1: Accurate for OED2 · Supported · Very likely (80-95%)
H2: Correct but outdated · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C017 — Word run 645 definitions — Almost certain

Claim: The word "run" has 645 definitions in the Oxford English Dictionary.

Verdict: Confirmed. OED revised entry (2011) contains 645 senses for the verb.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C018 — COBOL businesspeople English — Almost certain

Claim: COBOL was designed in the late 1950s to let businesspeople express what they wanted in something closer to English.

Verdict: Confirmed. Design began 1959, explicitly intended for novice programmers and management readability.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C019 — Prompt guides English only — Very likely

Claim: The major prompt engineering guides from OpenAI, Anthropic, and Google are written in English with no dedicated multilingual prompting sections, though Google provides minimal Spanish and Portuguese support.

Verdict: Confirmed. All three guides are English-language. Google's Spanish/Portuguese support is for image generation, not prompt methodology.

Hypothesis · Status · Probability
H1: Accurate · Supported · Very likely (80-95%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C020 — promptingguide.ai 14 languages — Likely

Claim: The only widely-used multilingual prompt engineering guide is a community-maintained resource (promptingguide.ai), available in 14 languages.

Verdict: The guide exists and is multilingual, but the site states 13 languages, not 14.

Hypothesis · Status · Probability
H1: 14 languages · Inconclusive
H2: 13 languages, not 14 · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C021 — No ISO/IEC prompt engineering standard — Likely

Claim: There is no ISO or IEC standard that addresses prompt engineering in any language.

Verdict: Technically correct — no published standard exists. However, ISO/IEC AWI TS 42119-8 is under active development for prompt-based systems.

Hypothesis · Status · Probability
H1: No standard exists · Inconclusive
H2: No published standard, but one is under development · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C022 — Multilingual performance gaps 3-30 points — Likely

Claim: Published research documents performance gaps of 3 to 30 percentage points between English and non-English languages. Arabic shows the smallest gap (3 points); low-resource languages show the largest (30 points).

Verdict: Performance gaps are real and in the documented range, but the characterization of Arabic showing the smallest gap is not supported — Arabic actually shows significant tokenization inefficiency.

Hypothesis · Status · Probability
H1: Accurate including Arabic · Inconclusive
H2: Gaps are real but Arabic claim is wrong · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C023 — 72-87% tokenization failures — Very likely

Claim: Approximately 72-87% of cross-language failures are attributable to model limitations (primarily tokenization inefficiency) with only about 2% tracing to linguistic nuances.

Verdict: Confirmed. The LILT analysis reports matching figures: 72.1% to 87.3% of failures attributable to model limitations, and approximately 2% to language nuances.

Hypothesis · Status · Probability
H1: Accurate · Supported · Very likely (80-95%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C024 — Non-English token tax — Almost certain

Claim: Non-English languages pay a "token tax": more tokens are required to express the same meaning.

Verdict: Confirmed. The term "token tax" appears in published research (arXiv 2025). Arabic requires ~3x more tokens than English.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1

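The mechanics of the token tax can be illustrated with the standard library alone. The sketch below is an assumption-laden proxy, not the cited study's method: it uses UTF-8 byte length as a floor on what a byte-level tokenizer must consume, with "مرحبا" ("hello" in Arabic) as the illustrative example.

```python
def utf8_bytes(text: str) -> int:
    """UTF-8 byte count: a rough lower bound on byte-level BPE token cost."""
    return len(text.encode("utf-8"))

# Both strings are five characters long, but each Arabic letter needs two
# UTF-8 bytes, so a byte-level tokenizer starts from twice the raw input.
english, arabic = "hello", "مرحبا"
ratio = utf8_bytes(arabic) / utf8_bytes(english)
print(utf8_bytes(english), utf8_bytes(arabic), ratio)
```

Real tokenizers then apply merges learned mostly from English text, which widens the gap further; that training skew, not the byte encoding alone, produces the roughly threefold Arabic overhead noted in the verdict.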

C025 — RLHF sycophancy 50% — Very likely

Claim: RLHF optimizes models based on human preference signals, and users demonstrably prefer sycophantic responses by approximately 50%.

Verdict: Confirmed. Cheng et al. (2025) found AI models "affirm users' actions 50% more than humans do."

Hypothesis · Status · Probability
H1: Accurate · Supported · Very likely (80-95%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C026 — Sycophancy engagement conflict — Very likely

Claim: Published analysis from Georgetown Law, Brookings, TechCrunch, and Stanford/CMU researchers independently documents a structural conflict between engagement optimization and sycophancy reduction.

Verdict: Confirmed. Multiple independent sources document this structural tension.

Hypothesis · Status · Probability
H1: Accurate · Supported · Very likely (80-95%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C027 — AI chatbot product liability — Almost certain

Claim: A court has already ruled that an AI chatbot constitutes a "product" under existing product liability frameworks.

Verdict: Confirmed. Garcia v. Character Technologies Inc. (M.D. Fla.).

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C028 — CHI 2025 dark addiction patterns — Almost certain

Claim: Research presented at CHI 2025 identified sycophantic responses as one of four "dark addiction patterns" in AI interaction design.

Verdict: Confirmed. ACM DL DOI: 10.1145/3706599.3720003.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C029 — 42 attorneys general sycophancy — Almost certain

Claim: A coalition of 42 state attorneys general sent letters to AI companies demanding commitments on sycophancy reduction.

Verdict: Confirmed. December 9, 2025, letters to 13 AI companies with 16 specific safeguard demands.

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C030 — FAA rigorous safety assurance — Almost certain

Claim: The FAA states that "rigorous safety assurance methods must be developed" for AI systems in aviation.

Verdict: Confirmed. Exact quote from FAA Roadmap for AI Safety Assurance (July 2024).

Hypothesis · Status · Probability
H1: Accurate · Supported · Almost certain (95-99%)
H2: Partially correct · Inconclusive
H3: Materially wrong · Eliminated

Confidence: High · Sources: 1 · Searches: 1


C031 — Fed SR 11-7 lose effectiveness — Likely

Claim: The Federal Reserve's SR 11-7 guidance acknowledges it "may lose effectiveness" for adaptive AI models.

Verdict: The limitation is real but documented in industry analysis (GARP 2025), not in SR 11-7 itself.

Hypothesis · Status · Probability
H1: SR 11-7 says this · Inconclusive
H2: Real limitation, wrong attribution · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1


C032 — PEPR and AWS prompt versioning — Likely

Claim: One academic paper (PEPR) addresses prompt regression testing, and one vendor framework (AWS Prescriptive Guidance) provides structured versioning and deployment guidance for prompts.

Verdict: Both exist but the claim understates the growing ecosystem of prompt testing tools.

Hypothesis · Status · Probability
H1: Only PEPR and AWS · Inconclusive
H2: Both exist but not the only ones · Supported · Likely (55-80%)
H3: Materially wrong · Eliminated

Confidence: Medium · Sources: 1 · Searches: 1

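To make the C032 verdict concrete, here is a generic sketch of what prompt regression testing involves. It is not PEPR's algorithm or the AWS guidance; every name in it (PromptVersion, call_model, the baseline figure) is a hypothetical stand-in. The harness pins a prompt version, scores it against a fixed evaluation set, and fails when accuracy drops below a stored baseline:

```python
# Minimal prompt regression harness: pin a prompt version, run it against a
# fixed evaluation set, and fail if accuracy regresses below the baseline.
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    version: str
    template: str

def evaluate(prompt: PromptVersion, cases, call_model) -> float:
    """Fraction of eval cases whose model output contains the expected answer."""
    hits = 0
    for inputs, expected in cases:
        output = call_model(prompt.template.format(**inputs))
        hits += expected.lower() in output.lower()
    return hits / len(cases)

def assert_no_regression(accuracy: float, baseline: float, tolerance: float = 0.02):
    if accuracy < baseline - tolerance:
        raise AssertionError(
            f"accuracy {accuracy:.2f} regressed below baseline {baseline:.2f}"
        )

# Example with a stubbed model so the harness itself runs offline.
prompt = PromptVersion("v2", "Q: Is {n} prime? Answer yes or no.")
cases = [({"n": 7}, "yes"), ({"n": 8}, "no")]
stub = lambda p: "yes" if "7" in p else "no"
acc = evaluate(prompt, cases, stub)
assert_no_regression(acc, baseline=1.0)
```

In practice call_model would wrap a real LLM client, and the baseline would live in source control alongside the prompt version so that a deploy fails when a prompt or model change drops evaluation accuracy.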

C033 — Pacemaker test code ratio — Roughly even chance

Claim: A pacemaker has more test code than operational code.

Verdict: Plausible given safety-critical requirements but not verifiable from accessible sources.

Hypothesis · Status · Probability
H1: Accurate · Inconclusive
H2: Plausible but unverifiable · Inconclusive
H3: Materially wrong · Inconclusive

Confidence: Low · Sources: 1 · Searches: 1



Collection Analysis

Cross-Cutting Patterns

Pattern · Claims Affected · Significance
Historical engineering claims are well-documented · C002, C003, C004, C005, C006, C007, C018 · Historical/factual claims verified at high confidence
GAIL research attribution errors · C012, C013, C014 · Claims attribute findings to the wrong venue (EMNLP 2024) and conflate separate reports as "the same research"
Prompt engineering immaturity claims are well-supported · C008, C009, C011, C021, C032 · Evidence consistently supports the characterization of prompt engineering as lacking rigor
Multilingual gap claims are supported · C019, C020, C022, C023, C024 · Strong evidence base for linguistic bias in LLMs and prompt engineering
Sycophancy claims are well-documented · C025, C026, C027, C028, C029 · Multiple independent sources confirm sycophancy as a structural problem
Regulatory framework claims confirmed · C030, C031 · Government and regulatory sources confirm the stated positions
Specific numerical claims vary in verifiability · C008, C020, C033 · Some precise numbers could not be independently verified

Collection Statistics

Metric · Value
Claims investigated · 33
Fully confirmed (Almost certain) · 13 (C004, C005, C006, C009, C010, C015, C017, C018, C024, C027, C028, C029, C030)
Confirmed with nuance (Very likely) · 8 (C002, C003, C007, C016, C019, C023, C025, C026)
Confirmed with caveats (Likely) · 8 (C001, C012, C013, C020, C021, C022, C031, C032)
Roughly even chance · 3 (C008, C014, C033)
Unlikely · 1 (C011)
Very unlikely or Remote · 0

Source Independence Assessment

The evidence base draws from a diverse set of independent sources including: official government and regulatory documents (German StGB, Canadian Professional Engineers Act, US NCEES, FAA, Federal Reserve), academic research (Stanford/Berkeley, Wharton GAIL, CHI 2025 proceedings), organizational publications (ABET, IEEE, NSPE, ISO), corporate documentation (Microsoft, OpenAI, Anthropic, Google), legal proceedings (Garcia v. Character Technologies), and press coverage (TechCrunch, NPR, TIME). The sources are genuinely independent — no single upstream source dominates the evidence base.

Collection Gaps

Gap · Impact · Mitigation
No access to paywalled academic papers · May miss contradicting evidence · Web search captures abstracts and secondary reporting
EMNLP 2024 proceedings not directly checked · Cannot confirm or deny a GAIL presentation at EMNLP · GAIL website and SSRN listings show no EMNLP connection
Pacemaker manufacturer documentation · Cannot verify test code ratio · IEC 62304 requirements make the claim plausible
Original content analysis of prompt guides · Cannot verify 84% figure · The qualitative characterization is consistent with guide content
EmotionPrompt full paper access · Limited view of emotional prompting findings · Abstracts and secondary sources provide sufficient context

Collection Self-Audit

Domain · Rating · Notes
Eligibility criteria · Pass · Consistent criteria applied across all 33 claims
Search comprehensiveness · Concern · Web search is the primary tool; some paywalled sources not accessible
Evaluation consistency · Pass · Same framework applied to all claims
Synthesis fairness · Pass · Claims found partially correct or incorrect where evidence warranted

Resources

Summary

Metric · Value
Claims investigated · 33
Files produced · ~500
Sources scored · 33
Evidence extracts · 33
Results dispositioned · 99 selected + 33 rejected = 132 total
Duration (wall clock) · 19m 45s
Tool uses (total) · 96

Tool Breakdown

Tool · Uses · Purpose
WebSearch · 24 · Search queries across all claims
WebFetch · 10 · Page content retrieval for key sources
Write · ~50 · File creation (C001 detailed + batch generation)
Read · 2 · Reading methodology and output format specs
Edit · 0 · No edits needed
Bash · ~15 · Directory creation, batch file generation

Token Distribution

Category · Tokens
Input (context) · ~300,000
Output (generation) · ~150,000
Total · ~450,000