R0023/2026-03-25/Q003/SRC01
Landmark Stanford/Berkeley study documenting GPT-4 behavior changes over time
Source

| Field | Value |
| --- | --- |
| Title | How is ChatGPT's behavior changing over time? |
| Publisher | arXiv / Harvard Data Science Review |
| Author(s) | Lingjiao Chen, Matei Zaharia, James Zou |
| Date | 2023-07-18 (preprint), 2023-10-31 (final) |
| URL | https://arxiv.org/abs/2307.09009 |
| Type | Research paper |
Summary

| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A — not an RCT |
| Bias: Protocol deviation | N/A — not an RCT |
| Bias: COI/Funding | Low risk |
Rationale

| Dimension | Rationale |
| --- | --- |
| Reliability | Stanford and UC Berkeley researchers. Published in Harvard Data Science Review (peer-reviewed). Systematic comparison of the same prompts on the March and June 2023 model versions across seven task categories. |
| Relevance | The most-cited study specifically documenting prompt degradation across model versions. Directly answers Q003. |
| Bias flags | Low risk across the board. Academic researchers with no vendor affiliation. Tested multiple task types rather than cherry-picking. |
Evidence

| Evidence ID | Summary |
| --- | --- |
| SRC01-E01 | GPT-4 prime number accuracy dropped from 84% to 51% between March and June 2023 |
| SRC01-E02 | Performance changes were mixed — some tasks improved while others degraded |