R0043/2026-04-01/Q003/SRC02/E01
CSIRO/UNSW framework harmonizing AI evaluation terminology
URL: https://arxiv.org/html/2404.05388v3
Extract
The framework identifies "divergent practices and terminologies across different communities (i.e., AI, software engineering, and governance)" as obstructing "a holistic evaluation approach."
Three-component framework:
1. Harmonised terminology to facilitate communication across disciplines
2. Taxonomy identifying essential elements for AI system evaluation
3. Lifecycle mapping between stakeholders and requisite evaluations
Key harmonized definitions:
- Evaluation: "The process of assessing against specific criteria with or without executing the artefacts"
- Testing: "The process of executing an AI model/system to verify and validate that it exhibits expected behaviours"
- Verification: "Confirming AI models/systems meet specified requirements"
- Validation: "Confirming that AI models/systems meet intended uses/expectations"
JUDGMENT: The framework addresses process terminology (evaluation, testing, verification, validation) but does NOT address behavioral risk terminology (sycophancy, automation bias, overreliance). It solves a related but different problem: how different communities describe the evaluation process, not how they name specific risks.
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Active effort to harmonize terminology exists |
| H2 | Contradicts | People ARE working on the problem |
| H3 | Supports | Effort focuses on process terminology, not behavioral risk vocabulary |
Context
The CSIRO/UNSW framework demonstrates that the terminology gap is recognized at the process level (what do we call "evaluation" vs "testing"?) but has not yet reached the behavioral risk level (what do we call "sycophancy" across domains?).