
R0021/2026-03-25/Q003 — Assessment

BLUF

Across all four major AI vendors, the overwhelming majority of prompt engineering recommendations are subjective and qualitative. Of roughly 25 distinct recommendations analyzed, only 3-4 carry quantifiable criteria: Google's temperature=1.0 default, its ~21-word average prompt length, its accuracy claims, and Anthropic's semi-quantifiable 20k-token threshold. Microsoft explicitly describes prompting as "more of an art than a science." No vendor provides measurable success criteria, testable thresholds, or reproducible specifications for its recommendations.

Probability

Rating: Almost certain (95-99%) that vendor guidance is predominantly subjective.

Confidence in assessment: High

Confidence rationale: Based on direct analysis of current official vendor documentation. The findings are verifiable by reading the source documents.

Reasoning Chain

  1. OpenAI provides 6 strategies, all qualitative: "write clear instructions," "provide reference text," etc. [SRC01-E01, High reliability, High relevance]
  2. Anthropic provides 7 recommendations, 1 semi-quantifiable (20k token threshold); remainder structural/qualitative [SRC02-E01, High reliability, High relevance]
  3. Google provides the most quantifiable guidance: temperature=1.0, ~21-word average, accuracy claims (3 quantifiable out of ~7) [SRC03-E01, High reliability, High relevance]
  4. Microsoft provides 5 recommendations with 0 quantifiable elements and explicitly states prompting is "more art than science" [SRC04-E01, High reliability, High relevance]
  5. JUDGMENT: Across ~25 recommendations, approximately 85-90% are subjective. The few quantifiable elements (temperature setting, token threshold) are operational parameters, not engineering specifications.
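The arithmetic behind the judgment can be sketched as a quick tally. The per-vendor counts are taken from the reasoning chain above; treating Anthropic's semi-quantifiable 20k-token threshold as quantifiable is an assumption (counting it as subjective instead yields 3/25 quantifiable, or 88% subjective, which is why the headline figure is hedged as 85-90%):

```python
# Tally of per-vendor recommendation counts cited in the reasoning chain.
# The counts come from SRC01-SRC04; this script is illustrative only.
recommendations = {
    "OpenAI":    {"total": 6, "quantifiable": 0},
    "Anthropic": {"total": 7, "quantifiable": 1},  # 20k-token threshold (semi-quantifiable)
    "Google":    {"total": 7, "quantifiable": 3},  # temperature, word count, accuracy claims
    "Microsoft": {"total": 5, "quantifiable": 0},
}

total = sum(v["total"] for v in recommendations.values())
quantifiable = sum(v["quantifiable"] for v in recommendations.values())
subjective_pct = 100 * (total - quantifiable) / total

print(f"{total} recommendations, {quantifiable} quantifiable, "
      f"{subjective_pct:.0f}% subjective")
# 25 recommendations, 4 quantifiable, 84% subjective
```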

Evidence Base Summary

| Source | Description | Reliability | Relevance | Key Finding |
| --- | --- | --- | --- | --- |
| SRC01 | OpenAI Guide | High | High | 6 strategies, 0 quantifiable |
| SRC02 | Anthropic Guide | High | High | 7 recommendations, 1 semi-quantifiable |
| SRC03 | Google Guide | High | High | 7 recommendations, 3 quantifiable |
| SRC04 | Microsoft Guide | High | High | 5 recommendations, 0 quantifiable; "art not science" |

Collection Synthesis

| Dimension | Assessment |
| --- | --- |
| Evidence quality | Robust: all four major vendors' official documentation analyzed |
| Source agreement | High: all vendors provide predominantly qualitative guidance |
| Source independence | Independent: each vendor publishes its own documentation |
| Outliers | Google is a mild outlier, with more quantifiable recommendations than the others |

Detail

The consistency across all four vendors is itself significant. These are the organizations that coined and popularized "prompt engineering," yet none of them provide engineering-grade specifications. The guidance reads more like cooking tips ("add more detail," "be specific," "use examples") than engineering requirements ("the signal-to-noise ratio must exceed 20dB," "the load must not exceed 500kg").

Gaps

| Missing Evidence | Impact on Assessment |
| --- | --- |
| Internal vendor testing data | Moderate: vendors likely have quantifiable internal benchmarks they do not publish |
| Academic prompt engineering research | Minor: academic work may provide more measurable criteria |
| Vendor cookbooks and tutorials | Minor: supplementary materials may contain more specific guidance |

Researcher Bias Check

Declared biases: The researcher argues that "prompt engineering" is not engineering. Finding predominantly subjective vendor guidance supports this argument.

Influence assessment: The finding is based on direct analysis of source documents. However, the classification of recommendations as "quantifiable" vs. "subjective" involves judgment. Another researcher might classify structural recommendations (like "use XML tags") as measurable, though they lack numerical thresholds.

Cross-References

| Entity | ID | File |
| --- | --- | --- |
| Hypotheses | H1, H2, H3 | hypotheses/ |
| Sources | SRC01, SRC02, SRC03, SRC04 | sources/ |
| ACH Matrix | | ach-matrix.md |
| Self-Audit | | self-audit.md |