Wednesday, November 19, 2025

Can AI run a business for a year?

Google just released Gemini 3, and buried in the benchmarks is one test that actually matters:

Vending-Bench

It tests whether AI can run a vending machine business for an entire year without going insane.

Start with $500. Negotiate with suppliers, manage inventory, set prices, stay profitable. No human guidance. Just the model making hundreds of connected decisions over time
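The setup can be sketched as a bare-bones loop (a hypothetical toy, not the actual benchmark harness; the wholesale price, retail price, demand, and restocking rule below are all made-up assumptions):

```python
# Toy sketch of a Vending-Bench-style year: start with $500, restock when
# low, sell each day. All numbers here are illustrative assumptions.

WHOLESALE = 0.50   # assumed fair wholesale cost per can
RETAIL = 2.00      # assumed retail price per can
DAILY_DEMAND = 10  # assumed cans sold per day when stocked

def run_year(starting_cash=500.0, days=365):
    cash, stock = starting_cash, 0
    for _ in range(days):
        # Restock when running low, but only if we can afford it.
        if stock < DAILY_DEMAND:
            order = 50
            cost = order * WHOLESALE
            if cash >= cost:
                cash -= cost
                stock += order
        # Sell up to daily demand at the retail price.
        sold = min(stock, DAILY_DEMAND)
        stock -= sold
        cash += sold * RETAIL
    return cash

print(run_year())  # ends well above the $500 it started with
```

Even this dumb fixed policy grows the balance, which is what makes a model sitting flat at $500 for a year, or losing money, so striking: the hard part isn't the arithmetic, it's not getting fleeced over hundreds of decisions.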

Gemini 2.5 Pro (yellow): Flat at $500 the entire year.

Gemini 3 Pro (pink line): Steady climb from day one, ending around $3,500. Consistent growth the entire year.

Claude Sonnet 4.5 (black line): Doubled its money to $1,000.

GPT-5.1 (blue line): Loses money, actually underperforming Gemini 2.5. GPT-5.1's biggest problem? It's too trusting. It paid suppliers before getting order confirmations; one supplier went out of business and GPT-5.1 lost the money. It accepted inflated wholesale prices without negotiating, buying soda for $2.40 per can when wholesale should be around $0.50. That's not a reasoning failure. That's a judgment failure over time.

Why Gemini 3 won:
When suppliers quote high prices, it keeps searching for alternatives or negotiating down. It knows what wholesale costs should be and refuses bad deals. It maintains consistent decision-making across 365 days without breaking down
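That refuse-bad-deals instinct is easy to state as a rule of thumb (a hypothetical illustration; the reference prices and markup tolerance are my assumptions, not anything published by the benchmark):

```python
# Hypothetical price-sanity check: compare each supplier quote against a
# known wholesale reference and refuse deals that are badly out of line.

FAIR_WHOLESALE = {"soda_can": 0.50, "chips_bag": 0.75}  # assumed reference prices
MAX_MARKUP = 1.5  # assumed tolerance: walk away above 150% of reference

def evaluate_quote(item, quoted_price):
    reference = FAIR_WHOLESALE.get(item)
    if reference is None:
        return "research"          # unknown item: find a reference price first
    if quoted_price <= reference * MAX_MARKUP:
        return "accept"
    return "negotiate_or_walk"     # keep searching or push the price down

print(evaluate_quote("soda_can", 2.40))  # → negotiate_or_walk (the GPT-5.1 failure case)
print(evaluate_quote("soda_can", 0.60))  # → accept
```

The check is trivial; what the benchmark measures is whether a model keeps applying it, every time, for 365 simulated days.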

Here's what makes this benchmark different:
Most tests have a ceiling: get 100% and you're done. Vending-Bench has no ceiling. There's no restriction on what items the models can stock in the vending machine, yet they all choose typical vending machine stuff because (cough cough) pattern matching.

The benchmark creators say a good human strategy has made $63,000 in a year. That's roughly 18x what Gemini 3 is reporting.

Current models aren't even close to maximizing this. They're just trying not to go bankrupt or email the FBI

Why this test matters more than every other benchmark for most use cases: Most AI tests measure snapshots. Can you solve this problem? Can you write this code?

Vending-Bench measures coherence over time.
Your business doesn't need an AI that aces PhD exams. You need an AI that can manage procurement for six months without paying phantom suppliers, handle customer support without hallucinating policies, and run your inbox for three months without inventing fake meetings. The lack of long-horizon coherence is why 95% of AI implementations fail.

Models crush benchmarks but fall apart when you ask them to maintain context over weeks or months. This is the first test showing which models can actually operate independently without losing the plot. That's the difference between a demo and something you can deploy.

But keep humans involved… they did 18x better than the best current AI.
