Skip to content

R0007/2026-03-19

Research R0007 — AI Made Everyone Faster
Mode Claim
Run date 2026-03-19
Claims 15
Prompt Unified Research Standard 1.0-draft
Model Claude Opus 4.6

Fifteen claims from "AI Made Everyone Faster" were investigated, covering individual performance distributions, toxic worker effects, and AI productivity studies. The majority of claims were confirmed; three required corrections to specific details (attribution, year, or source).

Claims

C001 — O'Boyle & Aguinis power-law distribution — Likely

Claim: O'Boyle and Aguinis (2012) studied five studies, 198 samples, 633,263 individuals across researchers, entertainers, politicians, and athletes and found individual performance follows a power-law distribution, not a normal distribution. The top decile produces roughly 30% of total output; the top quartile produces over 50%.

Verdict: Study parameters and power-law finding confirmed. Output concentration percentages (30%, 50%) could not be verified from available secondary sources.

Hypothesis Status Probability
H1: Claim accurate as stated Inconclusive
H2: Core correct, percentages approximate Supported Likely (55-80%)
H3: Materially wrong Eliminated

Confidence: Medium · Sources: 1 · Searches: 1

Full analysis

C002 — Personnel Psychology Best Article award — Almost certain

Claim: O'Boyle and Aguinis won the Personnel Psychology Best Article award for this study.

Verdict: Confirmed. Author's publications page explicitly lists the award with link to announcement.

Hypothesis Status Probability
H1: Award was given Supported Almost certain (95-99%)
H2: Award details differ Eliminated
H3: No such award Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C003 — Aguinis follow-up heavy tails — Likely

Claim: Their 2014 follow-up found that 82.5% of 229 samples had significantly heavy right tails.

Verdict: The statistic (82.53% of 229 samples) is confirmed but comes from the 2016 paper "Conductors and Insulators," not a 2014 follow-up.

Hypothesis Status Probability
H1: 2014 paper, exact stats Eliminated
H2: Stats real, wrong year (2016) Supported Likely (55-80%)
H3: Stats fabricated Eliminated

Confidence: Medium · Sources: 1 · Searches: 1

Full analysis

C004 — Software engineering individual variation — Very likely

Claim: In software engineering, every major study from Sackman (1968) through Oliveira (2023) confirms large individual variation. The most careful recent work suggests log-normal distributions with roughly a 2.4x ratio between top and bottom halves.

Verdict: Large individual variation confirmed across decades. The 2023 study confirms a 2.44x ratio. Substantially correct.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Very likely (80-95%)
H2: Core correct with nuance Inconclusive
H3: Materially wrong Eliminated

Confidence: Medium-High · Sources: 1 · Searches: 1

Full analysis

C005 — Schulmeyer NNPP concept — Almost certain

Claim: Schulmeyer formalized the "Net Negative Producing Programmer" concept in 1992 — programmers whose defect rates are high enough that the cost of their errors exceeds the value of their output. In a typical team of ten, he estimated up to three may qualify.

Verdict: Confirmed. Published in American Programmer in 1992 with the exact concept and team estimates as described.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Almost certain (95-99%)
H2: Details differ Eliminated
H3: Materially wrong Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C006 — Felps bad apple effect — Almost certain

Claim: Felps, Mitchell, and Byington demonstrated experimentally that a single negative team member reduces team performance by 30-40%.

Verdict: Confirmed. 40 groups studied with planted actors. 30-40% performance reduction from a single bad apple.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Almost certain (95-99%)
H2: Effect size differs Eliminated
H3: No such finding Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C007 — Housman & Minor toxic vs superstar — Almost certain

Claim: Housman and Minor studied 50,000 workers and found that avoiding one toxic hire saves $12,489 while hiring a top-one-percent superstar adds only $5,303.

Verdict: Confirmed. Over 50,000 workers across 11 firms. Dollar amounts match.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Almost certain (95-99%)
H2: Dollar amounts differ Eliminated
H3: No such study Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C008 — Toxic workers above-average output — Almost certain

Claim: Toxic workers often have above-average raw output (Housman and Minor).

Verdict: Confirmed. Toxic workers are "much more productive than average" in raw output, though quality is slightly lower.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Almost certain (95-99%)
H2: Output higher but quality lower Inconclusive
H3: Materially wrong Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C009 — Enterprise surveys no stratification — Unlikely

Claim: No major enterprise survey (McKinsey n=1,933; BCG n=10,600; Deloitte n=3,235) identified capability-based stratification in AI deployment.

Verdict: Partially incorrect. McKinsey sample is ~1,993 not 1,933. More importantly, these surveys do identify capability stratification — McKinsey identifies "AI high performers" (6%), BCG finds a three-tier capability model. The surveys frame stratification as organizational maturity rather than individual skill, but stratification is present.

Hypothesis Status Probability
H1: No stratification identified Eliminated
H2: Stratification exists but framed as organizational maturity Supported Unlikely (20-45%)
H3: Surveys explicitly identify stratification Inconclusive

Confidence: Medium · Sources: 1 · Searches: 1

Full analysis

C010 — Brynjolfsson customer service AI — Very likely

Claim: Brynjolfsson, Li, and Raymond studied 5,172 customer service agents and found that low-skilled workers improved by 34% with AI, while experienced workers saw minimal gains.

Verdict: Confirmed with minor caveat. Published version (QJE 2025) reports 5,172 agents. 34% improvement for novice workers confirmed. Overall 14% average. Experienced workers saw minimal gains.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Very likely (80-95%)
H2: Numbers slightly differ between versions Inconclusive
H3: Materially wrong Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C011 — Noy & Zhang writing leveling — Likely

Claim: Noy and Zhang found the same leveling pattern in professional writing.

Verdict: Confirmed with nuance. ChatGPT reduced time by 40%, improved quality by 18%. Inequality compression observed. However, some analyses show grade gains were roughly flat across skill terciles, complicating the pure "leveling" narrative.

Hypothesis Status Probability
H1: Same leveling pattern Supported Likely (55-80%)
H2: Pattern exists with nuances Inconclusive
H3: No leveling pattern Eliminated

Confidence: Medium · Sources: 1 · Searches: 1

Full analysis

C012 — Dell'Acqua BCG bottom-half improvement — Almost certain

Claim: Dell'Acqua and Mollick's study of 758 BCG consultants found that bottom-half performers improved by 43% versus 17% for the top half on tasks inside AI's capability frontier.

Verdict: Confirmed. 758 consultants. 43% bottom-half vs 17% top-half improvement on inside-frontier tasks.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Almost certain (95-99%)
H2: Percentages differ Eliminated
H3: No differential found Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C013 — BCG outside-frontier degradation — Almost certain

Claim: In the same BCG study, consultants using AI on tasks beyond its capability were 19 percentage points less likely to get correct answers than those working without AI.

Verdict: Confirmed. 19 percentage point performance degradation on outside-frontier tasks. This is the "jagged frontier" finding.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Almost certain (95-99%)
H2: Effect size differs Eliminated
H3: No degradation found Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C014 — Otis Kenyan entrepreneurs — Almost certain

Claim: The Otis study of Kenyan entrepreneurs gave GPT-4 business advice via WhatsApp. High performers gained roughly 15%. Low performers declined by roughly 8%.

Verdict: Confirmed. 640 Kenyan entrepreneurs over 5 months. High performers gained ~15%. Low performers declined ~8% in revenue.

Hypothesis Status Probability
H1: Claim accurate as stated Supported Almost certain (95-99%)
H2: Percentages approximate Inconclusive
H3: No divergent effects Eliminated

Confidence: High · Sources: 1 · Searches: 1

Full analysis

C015 — DORA 2025 individual vs org — Likely

Claim: The DORA 2025 report found individual developers using AI completed 21% more tasks and merged 98% more pull requests, but organizational delivery metrics stayed flat.

Verdict: The AI Productivity Paradox is confirmed by DORA 2025, but the specific 21%/98% figures come from Faros AI telemetry research, not directly from the DORA report. DORA found a 7.2% reduction in delivery stability. The general pattern is correct but the attribution needs correction.

Hypothesis Status Probability
H1: DORA reported these exact figures Eliminated
H2: Paradox confirmed, figures from different source Supported Likely (55-80%)
H3: No paradox exists Eliminated

Confidence: Medium · Sources: 1 · Searches: 1

Full analysis


Collection Analysis

Cross-Cutting Patterns

Pattern Claims Affected Significance
Individual performance is highly skewed C001, C003, C004, C005 Multiple independent research traditions confirm non-normal performance distributions across domains
Negative contributors disproportionately impact teams C005, C006, C007, C008 Toxic/negative performers have outsized negative effects that exceed the positive effects of superstars
AI levels up low performers more than high performers C010, C011, C012, C014 Four independent studies across different domains find AI compresses the performance distribution
AI outside its frontier degrades performance C013, C014 When AI is applied to tasks beyond its capability, it actively harms outcomes
Individual AI gains do not translate to organizational gains C015 The DORA 2025 / Faros research reveals a systemic bottleneck in converting individual speed to team throughput
Enterprise surveys measure organizational maturity, not individual capability C009 Major surveys frame AI adoption as organizational, not individual capability stratification

Collection Statistics

Metric Value
Claims investigated 15
Fully confirmed (Almost certain) 8 (C002, C005, C006, C007, C008, C012, C013, C014)
Confirmed with nuance (Very likely) 2 (C004, C010)
Confirmed with caveats (Likely) 4 (C001, C003, C011, C015)
Partially incorrect (Unlikely) 1 (C009)

Source Independence Assessment

The claims draw from independent research traditions: organizational behavior (O'Boyle/Aguinis), software engineering (Sackman/Oliveira), organizational psychology (Felps/Schulmeyer), labor economics (Housman/Minor), and AI productivity studies (Brynjolfsson, Noy/Zhang, Dell'Acqua, Otis, DORA). The AI productivity studies (C010-C014) are genuinely independent — different populations, different tasks, different research teams. The convergence of findings across these independent streams strengthens the overall collection's validity.

Collection Gaps

Gap Impact Mitigation
PDF extraction failures for several key papers Could not verify specific numerical claims in original text Relied on multiple secondary sources; flagged unverifiable figures
McKinsey sample size discrepancy (1,933 vs 1,993) Minor factual error in C009 Noted in assessment
C003 year attribution (2014 vs 2016) Misattribution of paper year Corrected in assessment
C015 source attribution (DORA vs Faros) Specific figures attributed to wrong source Corrected in assessment
No researcher profile provided Cannot check for specific researcher biases Applied generic anti-bias measures

Collection Self-Audit

Domain Rating Notes
Eligibility criteria Low risk Consistent criteria applied across all 15 claims
Search comprehensiveness Some concerns PDF extraction failures limited access to some primary sources. Multiple search strategies used per claim.
Evaluation consistency Low risk Same scoring framework applied to all claims
Synthesis fairness Low risk Three claims received corrections (C003, C009, C015). Contradictory evidence surfaced where found. No claim was rubber-stamped.

Resources

Metric Value
Web searches 22
Web fetches 14
Files produced 312
Sources scored 15
Evidence extracts 30
Results dispositioned 45 selected + 105 rejected = 150 total
Duration 20m 46s
Tool uses 111
Tokens 137,633