The Forces Behind the Frenzy

How benchmarks broke, prices collapsed 90%, and OpenAI lost half its enterprise market.

Part 2 of a 3-Part Series

Introduction

Part 1 gave you the chronology. DeepSeek's shock. Google's comeback. Anthropic's quiet takeover. OpenAI's Code Red.

But timelines are surface. They don't explain why OpenAI's enterprise share collapsed while their user count exploded. They don't explain why benchmarks that worked for years suddenly became useless. They don't explain why prices fell more than 90% while losses hit $44bn.

This is Part 2: the forces behind the forces.

Four dynamics. Let's go!

I. THE BENCHMARK CRISIS

Why the Scoreboard Is Breaking

The Karpathy Warning

Andrej Karpathy - OpenAI co-founder, former Tesla AI director, one of the most respected voices in AI - put it bluntly in his December 2025 year-in-review:

He has "completely lost interest and trust in various benchmarks." The man who helped build GPT-2 and scaled Tesla's Autopilot is saying the AI industry's scoreboard is broken.

You probably shouldn't question him, but if you do, here is his reasoning: benchmarks are "almost by construction verifiable environments" - which makes them immediately susceptible to gaming through reinforcement learning and synthetic data generation. Labs were building "mini-training environments near the feature space corresponding to the benchmarks."

In English: everyone was teaching to the test. And the tests had stopped measuring what matters.

Problem 1: Saturation

The benchmarks that defined AI progress for years are maxed out.

| Benchmark | Best Score (2025) | Status |
| --- | --- | --- |
| MMLU | >91% (GPT-5) | Saturated |
| SuperGLUE | >90% | Saturated |
| HellaSwag | >95% | Saturated |
| MMLU Pro | 90.1% (Gemini 3 Pro) | Nearly saturated |

When GPT-3 launched in 2020, it scored 44% on MMLU. By early 2025, GPT-5 hit 91%. The benchmark that once separated frontier from mediocre now clusters every top model within a few percentage points.

When every model aces the same test, the test stops being useful.

The MATH benchmark went from <10% at its 2021 launch to >90% by 2024. Three years to complete saturation - I'm not sure anybody predicted that. And the pace keeps accelerating!

Problem 2: Gaming - The Llama 4 Scandal

Then there's outright manipulation.

In April 2025, Meta launched Llama 4 with fanfare. Maverick, their mid-size model, secured #2 on LMArena - right behind Gemini 2.5 Pro, ahead of GPT-4o.

Impressive. Except it was fake.

The Truth

Researchers noticed something odd. The Maverick on LMArena produced long, emoji-filled, unusually chatty responses. The Maverick available to the public... didn't.

Because they weren't the same model.

Meta had submitted an "experimental chat version" specifically "optimized for conversationality." In practice: fine-tuned to charm human voters. When Meta finally swapped in the actual public Maverick?

It dropped from #2 to #32.

A 30-position fall. Same underlying model - just without the benchmark-gaming optimizations…lol.

LMArena's Response

"Meta's interpretation of our policy did not match what we expect from model providers," LMArena posted on X. "Meta should have made it clearer that 'Llama-4-Maverick-03-26-Experimental' was a customized model."

Meta's response? A spokesperson said they "experiment with all types of custom variants."

The damage was done. If Meta could game LMArena by 30 positions with cosmetic changes, what was the actual capability difference between any models on the leaderboard?

Problem 3: The Arena Itself

Meta's scandal opened floodgates of criticism - including a 68-page academic paper that accused LMArena of systematic favoritism.

"The Leaderboard Illusion"

In April 2025, researchers from Cohere Labs, AI2, Princeton, Stanford, University of Waterloo, and University of Washington published a damning analysis.

Key findings:

  • Private testing privileges: Some labs could privately test multiple model variants before publishing scores. Meta tested 27 private variants between January and March 2025. Google tested 10. Amazon tested 7.

  • Selective publishing: Labs only revealed scores of their best-performing variants. The failures stayed hidden.

  • Unexplained removals: Of 243 models evaluated on the platform, 205 were "deactivated" without public notice. Only 47 were officially marked as deprecated.

The head of Cohere's research division called it a "crisis" in artificial intelligence development.

LMArena's Defense

LMArena co-founder Ion Stoica called the paper full of "inaccuracies" and "questionable analysis." The organization said its policy allowing pre-release testing had been public for years.

But the damage was done, and it was irreversible. The most-watched AI leaderboard had its credibility questioned by both researchers and builders.

Problem 4: Human Preference Is Hackable

LMArena's methodology relies on human voters comparing model outputs. This creates perverse incentives.

Research found humans systematically over-index on style:

  • Longer responses seem better - even when length adds nothing substantive

  • Emoji-filled, chatty responses get upvoted more

  • Flattering, agreeable answers beat accurate-but-blunt ones

This is exactly what Meta exploited. Fine-tune for verbosity and friendliness and watch your ranking climb. Actual capability? Optional.

One researcher called it "optimizing for slop." Seeing the deluge of AI videos on YouTube, I'd say that problem is now being programmatically adopted. Stop the Slop!
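
How does a leaderboard turn votes into rankings in the first place? Here is a minimal sketch - standard Elo updates over simulated pairwise votes, with made-up models, probabilities, and bias levels (this is not LMArena's actual pipeline) - showing how a style bias among voters can lift a chattier but weaker model above a stronger one.

```python
# Minimal sketch: Elo ratings from pairwise human votes, with a style bias.
# Model names, win probabilities, and the K-factor are illustrative assumptions.
import random

K = 32  # standard Elo update step (assumed)

def expected(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B, implied by current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """One Elo update after a single human vote."""
    e_a = expected(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

def simulate(style_bias: float, n_votes: int = 20_000, seed: int = 0):
    """'capable' gives the substantively better answer 70% of the time;
    'chatty' wins any vote where the voter judges on style alone."""
    random.seed(seed)
    r_capable, r_chatty = 1000.0, 1000.0
    for _ in range(n_votes):
        if random.random() < style_bias:
            capable_wins = False                    # style-only voter picks the chatty model
        else:
            capable_wins = random.random() < 0.70   # substance-based vote
        r_capable, r_chatty = update(r_capable, r_chatty, capable_wins)
    return round(r_capable), round(r_chatty)

for bias in (0.0, 0.3, 0.6):
    print(bias, simulate(bias))
# As style_bias grows, the chatty model's rating overtakes the capable one,
# even though its answers are worse on substance 70% of the time.
```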

The Humanity's Last Exam Fiasco

Recognizing saturation, Scale AI and the Center for AI Safety created Humanity's Last Exam (HLE) - a benchmark explicitly designed to resist saturation.

2,500+ questions from nearly 1,000 subject experts across 500 institutions. Graduate-level difficulty, with questions that current models couldn't answer.

The gap was dramatic:

| Entity | HLE Score |
| --- | --- |
| Average human | 90% |
| Best AI (Grok 4, with tools) | 44% |
| Best AI (no tools) | 26.9% |

Source: Scale AI

Finally! A benchmark with headroom - yay?

Then FutureHouse Looked Closer

In July 2025, AI research organization FutureHouse ran HLE questions through their literature research agents - and found something troubling.

29 ± 3.7% of chemistry and biology answers had evidence directly contradicting them in peer-reviewed literature.

Their open-source agent found 53.3% of the provided rationales conflicting with published research. Chemistry was worse (57% contradicted) than biology (51.6%).

Examples included identifying oganesson as Earth's rarest noble gas - despite it not being a terrestrial gas at all.

How Did This Happen?

The benchmark's design incentivized difficulty over accuracy. Question writers supplied answers they claimed were correct, but reviewers spent only five minutes per question. Complex, contrived questions that would stump AI slipped through - even when they were wrong.

The HLE team acknowledged approximately 18% error rate and instituted rolling revisions. But the fundamental problem remained: even benchmarks designed to resist gaming can be undermined by their own construction incentives.

What Replaces Benchmarks?

The industry is fragmenting its approach:

1. Private enterprise evaluations. Companies increasingly run their own tests on their own data. If you're selecting a model for your customer service bot, LMArena Elo matters less than "which model handles angry customers in our industry better?" (A minimal sketch of such a harness follows this list.)

2. Task-specific benchmarks. SWE-bench for coding. GPQA for graduate science. Humanity's Last Exam (errors notwithstanding) for expert knowledge. General-purpose benchmarks are giving way to domain-specific ones.

3. Agentic benchmarks. LLM benchmarks were built for quiz-takers, not agents. New frameworks like AgentBench, CAMEL, and SWE-agent try to measure memory, autonomy, and tool use - the capabilities that matter for 2026's agentic era.

4. Vibes. Not a joke. "Which model feels better to use" is becoming a legitimate selection criterion when benchmarks are saturated or gamed. Karpathy's skepticism extends here too - but for many developers, personal experience matters more than a contested leaderboard.
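
Here is the sketch promised in point 1: a toy private-eval harness. The sample cases, keyword rubric, and model names are all hypothetical, and call_model() is a stub you would swap for your actual API clients (or an eval framework of your choice).

```python
# Toy private-eval harness: compare candidate models on your own cases.
# Everything here is illustrative - replace the stub, cases, and grader.
from collections import Counter

# Your own data: prompts from real tickets plus points a good answer must cover.
CASES = [
    {"prompt": "Customer says the invoice total is wrong.", "must_mention": ["apologize", "refund"]},
    {"prompt": "Customer can't reset their password.", "must_mention": ["reset link", "support"]},
]

def call_model(model: str, prompt: str) -> str:
    """Stub: replace with a real API call for each provider you're comparing."""
    return f"[{model}] canned answer mentioning refund, reset link, support. We apologize."

def passes_rubric(answer: str, must_mention: list[str]) -> bool:
    """Toy grader: does the answer cover every required point?
    Real harnesses often use human review or an LLM-as-judge instead."""
    return all(term.lower() in answer.lower() for term in must_mention)

def run_eval(models: list[str]) -> Counter:
    """Count how many cases each model passes."""
    scores: Counter = Counter()
    for case in CASES:
        for model in models:
            answer = call_model(model, case["prompt"])
            scores[model] += passes_rubric(answer, case["must_mention"])
    return scores

print(run_eval(["model-a", "model-b"]))  # hypothetical model identifiers
```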

The era of "check the benchmark, pick the winner" is over. The question is whether anything coherent replaces it.

II. THE PRICE WAR PARADOX

How Costs Fell 90% While Losses Hit $44 Billion

The Numbers That Don't Add Up

Here's 2025's central paradox:

API prices collapsed:

| Model | Price per 1M Tokens (Input/Output) | vs. GPT-4 (2023) |
| --- | --- | --- |
| GPT-4 (March 2023) | $30 / $60 | baseline |
| GPT-4 Turbo (Late 2023) | $10 / $30 | -50% |
| GPT-4o (2024) | $5 / $15 | -75% |
| GPT-4o (Aug 2024) | $2.50 / $10 | -83% |
| GPT-4o (2025) | $2.50 / $10 | -92% |
| GPT-4o Mini | $0.15 / $0.60 | -99% |
| Gemini 3 Flash | $0.50 / $3.00 | -98% |
| Claude Sonnet 4.5 | $3.00 / $15.00 | -90% |

In 20 months, flagship input pricing fell 92% - from GPT-4's $30 to GPT-4o's $2.50 per million tokens. But the story isn't just OpenAI - it's all three major labs racing each other to the bottom.

And losses exploded:

| Company | 2025 Revenue | 2025 Net Loss | Burn Rate |
| --- | --- | --- | --- |
| OpenAI | ~$13B | ~$9B | 69% of revenue |
| Anthropic | ~$7B | ~$3B | ~60% of revenue |

OpenAI projects $44bn in cumulative losses through 2029. Total cash burn through that year: $115bn.

How can prices fall 90%+ while companies lose billions? The answer has four parts.

The Four Forces Driving Price Collapse

1. The Pareto Frontier Shift

The concept: the "Pareto frontier" represents the best possible tradeoff between intelligence and cost at any given moment. In 2025, that frontier shifted dramatically.

When looking at the top 5-7 frontier models, there's now less than a 10% difference in pure intelligence scores. But price differences remain up to 1,000x.

Source: AI Themes

This creates brutal pressure. If Gemini 3 Flash performs within 5% of GPT-4o at 80% lower cost, why pay a premium? The labs had no choice but to match.
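
For the programmatically minded, here is a minimal sketch of what computing a cost/intelligence Pareto frontier means; the catalog entries are illustrative numbers, not real prices or scores.

```python
# Minimal sketch of a cost/intelligence Pareto frontier.
# Entries are (cost per 1M input tokens, benchmark-style score) - illustrative only.
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Keep a model only if no other model is at least as good AND at least as
    cheap, with one of the two strictly better."""
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            other != name
            and o_cost <= cost
            and o_score >= score
            and (o_cost < cost or o_score > score)
            for other, (o_cost, o_score) in models.items()
        )
        if not dominated:
            frontier.append(name)
    return frontier

catalog = {          # (cost $/1M input tokens, score) - made-up numbers
    "flagship":   (5.00, 92.0),
    "mid-tier":   (2.00, 90.5),
    "small-fast": (0.25, 88.0),
    "legacy":     (10.00, 85.0),   # dominated: pricier and weaker than flagship
}
print(pareto_frontier(catalog))    # -> ['flagship', 'mid-tier', 'small-fast']
```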

2. Hardware Revolution

The underlying cost of running inference dropped dramatically:

| Chip | Year | Efficiency Gain |
| --- | --- | --- |
| TPU v6 (Trillium) | 2024 | 4.7x over TPU v5e |
| TPU v7 (Ironwood) | 2025 | 30x over original TPU |
| Nvidia Blackwell | 2025 | 2-3x over Hopper |
| AMD MI350 | 2025 | 35x inference gain over predecessor |

Reportedly, a Series C startup in San Francisco migrated from 128 H100s to TPU v6e pods. Monthly inference bill: $340,000 → $89,000. A 74% reduction from hardware alone.

Google's advantage here explains their aggressive pricing. Running on custom TPUs rather than rented Nvidia GPUs means lower marginal costs - which they weaponized against OpenAI and Anthropic.

3. Model Architecture Efficiency

The "Flash" and "Mini" tiers are the go-to workhorses of the "agentic" era.

Gemini 3 Flash uses 30% fewer tokens than Gemini 2.5 Pro for equivalent tasks. It's 3x faster than Pro while matching or exceeding its performance on most benchmarks. Cost per intelligence unit plummeted.

OpenAI's Olivier Godement, head of their API product team, said his team shifted focus from developing new AI models to "compressing more information into smaller-sized models."

The result: GPT-4o Mini delivers GPT-4-class performance at 1/20th the cost of the original GPT-4.
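
To see the per-task arithmetic, here is a rough sketch using the Gemini 3 prices quoted in this piece; the 3,000/1,000-token workload and the assumption that Flash's ~30% token saving shows up in the output are illustrative guesses, not measured figures.

```python
# Rough sketch of per-task cost: cheaper tokens and fewer tokens compound.
# Prices are the Gemini 3 figures cited in this article; token counts are assumed.
def task_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost of one request given per-1M-token prices (USD)."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

pro   = task_cost(3_000, 1_000, 2.00, 12.00)   # Gemini 3 Pro pricing: $2 / $12
flash = task_cost(3_000,   700, 0.50,  3.00)   # Gemini 3 Flash: $0.50 / $3, ~30% fewer output tokens (assumed)
print(f"Pro: ${pro:.4f}  Flash: ${flash:.4f}  ratio: {pro / flash:.1f}x")
# -> Pro: $0.0180  Flash: $0.0036  ratio: 5.0x
# With these assumptions the same task costs ~5x less on the smaller tier.
```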

The Quantified Collapse: Costs Halve Every Two Months

How fast is the cost-accuracy frontier actually moving? Two research efforts quantified it.

WeirdML Analysis: By tracking model performance on 19 novel coding tasks against inference cost from July 2023 to July 2025, researchers found inference costs for hard coding tasks halve roughly every two months. The regression estimate: 1.99 months to halve (Deming method).

Epoch AI Analysis: Looking at benchmarks including GPQA Diamond, MMLU, MATH-500, HumanEval, and Chatbot Arena, they found price reductions ranging from 9x to 900x per year depending on the benchmark and performance threshold. The median: 50x per year. When isolating just post-January 2024 data, the median jumps to 200x per year.

Specific example: The price to achieve GPT-4's performance on PhD-level science questions (GPQA) fell 40x in one year.

This isn't incremental improvement. This is Moore's Law on steroids - but for intelligence per dollar, not transistors per chip.
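
A quick sanity check that the two studies are telling the same story - converting a halving time into an annual price-decline factor (and back) is pure arithmetic on the numbers quoted above:

```python
# Convert between "halves every N months" and "falls Nx per year".
from math import log2

def annual_factor(halving_months: float) -> float:
    """If cost halves every `halving_months`, prices fall this many times per year."""
    return 2 ** (12 / halving_months)

def halving_time(factor_per_year: float) -> float:
    """Months to halve implied by an N-fold annual price drop."""
    return 12 / log2(factor_per_year)

print(f"Halving every 1.99 months -> {annual_factor(1.99):.0f}x cheaper per year")
print(f"50x/year (Epoch median)   -> halves every {halving_time(50):.1f} months")
print(f"200x/year (post-2024)     -> halves every {halving_time(200):.1f} months")
# ~65x/year, ~2.1 months, ~1.6 months: the WeirdML and Epoch estimates
# land in the same ballpark.
```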

4. Strategic Market Share War

All three labs chose adoption over profit. Deliberately.

OpenAI's logic: Since GPT-4o mini launched, API usage more than doubled. Lower prices unlock use cases that were previously priced out. Volume now, profit later.

Google's logic: CNBC called Gemini 3 Pro's pricing ($2 input / $12 output) "one of the most aggressive cost-positioning strategies among frontier models." Google can afford losses that would kill smaller labs.

Anthropic's logic: They cut Claude Opus 4.5 pricing by 67% shortly after launch - from $15 to $5 per million input tokens. Even the enterprise-focused lab joined the race, or was forced to.

Sources: Inc., PYMNTS

The industry weaponized pricing to capture market share, developer loyalty, and enterprise contracts. Nobody wanted to be the expensive option in a world where switching costs were still low.

The Open-Source Pressure

DeepSeek proved efficiency could match scale at dramatically lower costs. Their R1 model runs 20-50x cheaper than OpenAI's o1.

Enterprises choosing between GPT-5, Gemini 3 Pro, and Claude Opus 4.5 aren't cross-shopping DeepSeek - but they are watching the gap between open and closed source converge. Compliance requirements, support contracts, and deployment preferences keep them in the closed-source ecosystem…for now. Expect this to become more topical in 2026, with China pushing hard on open source.

The Infrastructure Trap

The price war has a hidden cost: infrastructure lock-in.

OpenAI recently announced $1.4 trillion in computing commitments over the next eight years with cloud and chip partners.

Source: Sacra

$1.4 trillion. With a T.

When you've committed to buy that much compute, you need revenue to justify it. When prices are falling, you need volume. Massive volume.

This creates a flywheel - but not necessarily a profitable one:

  1. Commit to massive compute purchases to train frontier models

  2. Cut prices to drive volume and justify commitments

  3. Burn cash to fulfill obligations

  4. Raise more capital to cover burns

  5. Make bigger compute commitments to justify valuation

  6. Repeat

OpenAI spent $22bn in 2025 against $13bn in sales. That's $1.69 spent for every $1 earned.

That's a bet that eventual dominance enables eventual pricing power - and that is the business model!

Where This Ends

Three scenarios:

1. Consolidation. Price war kills smaller players. Two or three labs with hyperscaler backing survive. Prices eventually stabilize.

2. Commoditization. AI inference becomes commodity utility. Margins stay razor-thin forever. Labs make money on services, not models.

3. Capability escape. One lab builds something so differentiated that price becomes irrelevant. Customers pay premium for genuine moats.

The 2025 evidence: we're in the messy middle, with different segments trending different ways.

  • Enterprise: trending toward differentiation (Anthropic's coding premium holds)

  • Consumer API: trending toward commoditization (OpenAI vs Google vs Anthropic price matching)

  • Infrastructure: trending toward consolidation (Nvidia's Groq deal)

Interestingly, I don't believe the price war is ending - I believe it is fragmenting.

III. THE PROFITABILITY RACE

How OpenAI Lost the Enterprise and Anthropic Won It

The Market Share Collapse

OpenAI now looks like a B2C lab, but that wasn’t all intentional.

Enterprise LLM API Market Share:

Year

OpenAI

Anthropic

Google

Late 2023

50%

12%

7%

Mid-2025

25-27%

32-40%

21%

OpenAI didn't choose to cede the enterprise. They lost it. Half their market share disappeared in 18 months - taken by a lab many had written off at the time.

The catalyst? June 2024: Claude Sonnet 3.5.

How Anthropic Won Coding

Code generation became AI's first killer enterprise app - the clearest ROI for business spending. And Anthropic dominates it.

Coding Market Share (Enterprise):

| Provider | Share |
| --- | --- |
| Anthropic | 54% |
| OpenAI | 21% |
| Others | 25% |

Source: TechCrunch

Anthropic has had an almost unparalleled 18 months atop the LLM leaderboards for coding, starting with Claude Sonnet 3.5 in June 2024, then Sonnet 3.7, Sonnet 4, Opus 4, and now Opus 4.5.

The numbers bear this out: Anthropic's coding share jumped from 42% to 54% in just six months, driven largely by Claude Code adoption. OpenAI's enterprise coding share halved over the same period as it struggled to specialize. Developers are notorious for switching quickly…and switch they did.

Why developers switched:

  1. Better code quality. Claude performs better on SWE-bench Verified, demonstrating superiority in real-world coding tasks.

  2. Reliability. Perceived as more robust, less hallucinatory, more predictable in mission-critical tasks.

  3. Tool integration. Claude Code connects directly to command lines, navigates codebases, runs tests, commits to GitHub. MCP (Model Context Protocol) enabled ecosystem-wide tool integration.

  4. Distribution wins. GitHub's CEO announced Claude Sonnet 4 as the default option in GitHub Copilot - instant access to millions of developers.

OpenAI Fighting Back?

OpenAI isn't surrendering the enterprise - after all, that's where the margins are. So they are evolving…by emulating.

May 2025: OpenAI launched Codex CLI (open-source) and cloud-based coding agents (three months after Claude Code). As advertised, it's a fully autonomous software engineering agent powered by codex-1, optimized for engineering tasks.

While market share is down, the pie is getting larger, and Codex may be the way:

  • ChatGPT Enterprise seats grew 9x year-over-year

  • Codex usage up 10x since August

  • Weekly active users doubled in six weeks

  • Cisco cut code review times by 50% using Codex

OpenAI followed up with versions of GPT-5 specialized for coding (the latest arriving in December 2025).

The holiday "Code War": In December 2025, OpenAI, Anthropic, and Google simultaneously launched developer promotions - model upgrades, usage bonuses, pricing incentives. The battle for developer loyalty is intensifying.

But GPT-5 launched 14 months after Claude Sonnet 3.5. In enterprise software, 14 months of embedded workflows creates switching costs - and by then there was a new king among coding models: Sonnet 4. Who wins the coding arena? The jury is still out.

Where Each Company Ended Up

The result: dramatically different revenue compositions.

OpenAI:

  • Consumer (ChatGPT Plus, Teams, etc.): ~70%

  • Enterprise/API: ~30%

Anthropic:

  • Enterprise/API: 70-80%

  • Consumer (Claude Pro, Team): 10-15%

Source: SaaStr

ChatGPT still dominates consumer (81% chatbot market share, 800M weekly active users) because that's where their moat held. My mom uses it daily!

Enterprises prefer Claude. Non-technical employees prefer ChatGPT. The market bifurcated naturally.

The Unit Economics Gap

The revenue split creates massive economic differences.

| Metric | OpenAI | Anthropic | Ratio |
| --- | --- | --- | --- |
| Active users | ~400M WAU | ~18.9M MAU | 21x more (OpenAI) |
| Revenue (2025) | ~$13B | ~$7B | 1.9x more (OpenAI) |
| Revenue per user | ~$25/user | ~$211/user | 8x more (Anthropic) |

Source: Sacra

Anthropic has 5% of ChatGPT's user base but generates 40%+ of its revenue. That's 8x better monetization.

Why? Enterprise customers pay $100K+/year contracts. Consumer subscriptions are $20/month. Enterprise churns less, expands predictably, requires less support per dollar of revenue.

The Burn Rate Divergence

| Year | OpenAI Burn (% of Revenue) | Anthropic Burn (% of Revenue) |
| --- | --- | --- |
| 2025 | 69% | ~60% |
| 2026 | 57% | 33% |
| 2027 | 57% | 9% |
| 2028 | Break-even? | Profitable |

Anthropic breaks even in 2028. OpenAI: 2029 or 2030.

Total cash burn: OpenAI expects $115bn through profitability. Anthropic: roughly $10bn.

The difference isn't just strategy. It's what happens when you win the market that pays best versus holding the market that scales fastest. There is also the first-mover disadvantage that haunts frontier tech.
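
To see how those percentages compound into cumulative cash, here is a minimal sketch. The burn ratios come from the table above, but the revenue path and the 2028 ratios are hypothetical placeholders, so the totals will not match either company's actual projections.

```python
# Minimal sketch: how burn-as-a-share-of-revenue compounds into cumulative cash.
# Revenue path and 2028 ratios are hypothetical; only the 2025-2027 ratios come
# from the table above.
def cumulative_burn(revenue_by_year: dict[int, float],
                    burn_ratio_by_year: dict[int, float]) -> float:
    """Sum of (revenue * burn ratio) across years, in the same units as revenue."""
    return sum(revenue_by_year[y] * burn_ratio_by_year[y] for y in burn_ratio_by_year)

hypothetical_revenue   = {2025: 13, 2026: 25, 2027: 45, 2028: 70}          # $bn, assumed
openai_style_ratios    = {2025: 0.69, 2026: 0.57, 2027: 0.57, 2028: 0.20}  # 2028 assumed
anthropic_style_ratios = {2025: 0.60, 2026: 0.33, 2027: 0.09, 2028: 0.0}   # 2028 assumed

print(f"OpenAI-style burn:    ${cumulative_burn(hypothetical_revenue, openai_style_ratios):.0f}bn")
print(f"Anthropic-style burn: ${cumulative_burn(hypothetical_revenue, anthropic_style_ratios):.0f}bn")
# Even with an identical revenue path, the lower burn ratios compound into a
# several-fold smaller cumulative cash requirement.
```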

The Cost of Consumer Scale

OpenAI's consumer dominance comes with expensive obligations:

  • ChatGPT's free tier costs money on every query

  • Image generation (DALL-E, GPT Image) requires massive compute

  • Video generation (Sora) is among the most compute-intensive workloads

  • Consumer hardware projects with Jony Ive require capital

  • Humanoid robotics research isn't cheap

Anthropic avoided all of this. They stayed narrow: text, code, enterprise APIs. Lower revenue, but far lower costs per dollar earned.

The IPO Question

Anthropic's path to IPO is clearer.

The Financial Times reported Anthropic hired Wilson Sonsini to prepare for a potential 2026 IPO at valuations exceeding $300 billion.

The math works:

  • September 2025: $183B valuation (Series F)

  • December 2025: Exploring $350B+ in ongoing round

  • Revenue trajectory: $1B (early 2025) → $5B (August) → $9B (year-end target)

  • Path to profitability: 2028

A company approaching profitability with $70B projected 2028 revenue can IPO on its own terms.

CONCLUSION

The Hidden Forces

2025's AI race wasn't one story. It was three:

1. The Benchmark Crisis. Saturation, gaming, and methodology disputes broke the scoreboard. Meta dropped 30 positions when their real model replaced the gaming-optimized version. 29% of Humanity's Last Exam answers were wrong. Karpathy "completely lost interest and trust." Enterprise buyers increasingly run their own tests.

2. The Price War Paradox. API costs fell 90%+ as four forces converged: the Pareto frontier shifted (models clustered within 10% capability), hardware got 30x more efficient (TPU v7), architectures improved (Flash/Mini tiers), and labs chose market share over profit. Yet OpenAI projects $115 billion in cash burn through 2029. Intelligence up, prices down, losses exploding.

3. The Profitability Race. OpenAI lost half its enterprise market share (50% → 25%) as Anthropic captured coding (54% share). Claude Sonnet 3.5 in June 2024 started an 18-month dominance that transformed the market. Anthropic generates $211 per user. OpenAI generates $25. One breaks even in 2028. The other burns more than ten times as much cash - partly because losing enterprise means relying on expensive consumer scale.

What This Means for 2026

The era of simple narratives is over. There's no one leaderboard, no one winner, no one strategy that dominates.

Instead:

  • Benchmarks are fragmenting into task-specific evaluations

  • Prices are racing toward commodity while losses mount

  • Enterprise focus beats consumer scale on unit economics

The question isn't "who's winning?" It's "winning at what?"

What's Next

Next up is Part 3, which is the practical guide: what each company actually ships and where to use it. You cannot miss it, stay tuned!

The IDE Wars: How Claude Code hit $1B ARR in 6 months. Why OpenAI's $3B Windsurf acquisition collapsed. How Cursor reached $29.3B without building a frontier model. Google's Antigravity runs Claude - what that means for the model layer.

The Use Case Breakdown: Text, coding, creative, agents, multimodal - who's actually best at what, beyond the benchmarks that just broke.
