Why OpenAI’s o3 Won’t Replace You (Yet)

*Photo by* *Anita Austvika* *@anitaaustvika* *on Unsplash*

Software developers have been used to watching AI models slowly improve in math, coding, science and reasoning. So when OpenAI ended its 12 Days of OpenAI event by announcing the bafflingly named o3 model, people got excited. Very excited. I’m going to explain the leap this model provides, and then go on to explain why software developers probably don’t need to use it.

The Breakthrough

The only way to represent this leap in quality is through a graph. So here is one:

*Image:* *https://www.youtube.com/watch?v=SKBG1sqdyIU*

Full o3 scores a little more than twice as high as o1-preview (which itself was only released 3 months ago). That’s on competitive programming problems from Codeforces, which are generally harder than the interview-style questions on LeetCode.

If that’s not enough, SWE-bench Verified is an agent-focused evaluation built from problems that software engineers typically face in their usual work. Nearly 72% accuracy is a great score.

The latest model, o3, has performed strongly (to underplay it) on the ARC-AGI benchmark, which is designed to test an AI’s ability to generalize and adapt to new tasks. The score of 87.5% has the AI world buzzing with one question: is this going to actually pass as Artificial General Intelligence (AGI)?

*Image:* *https://arcprize.org/media/images/blog/o-series-performance.jpg*

The test is designed to measure generalization to entirely new tasks. Previous models like GPT-4o scored a measly 5% on the same benchmark, even though these tests are usually easy for humans to pass. Here is an example of the type of test:

The answer here is to add a block in the spaces. These tests are deceptively simple: easy for humans and difficult for AI.

The Buzz

Here’s what many software developers are thinking based on this paper: when the model is released in January, it will likely be better than all but the most gifted human programmers.

Sam Altman has lost his Caps Lock key, but underlines that o3-mini will bring a cost reduction for AI (or for Sam, ai I guess) users.

Why You Don’t Need it

AI isn’t a threat. o3 is not close to your performance. It’s almost certainly not better than you in many respects. Here is why.

You don’t code much

The first (and obvious) point is that developers do not spend much time coding (some put it at less than an hour a day). This means we are optimizing the time spent coding, which wasn’t that long to begin with.
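To put rough numbers on that, here is a minimal sketch in Python. It is just Amdahl's law applied to a working day. The 8-hour day, the 1 hour of coding and the speed-up factors are my assumptions for illustration, not measurements.

```python
def overall_speedup(coding_hours, workday_hours, coding_speedup):
    """Amdahl's-law estimate of how much faster the whole day gets
    when only the coding part of it is accelerated."""
    coding_fraction = coding_hours / workday_hours
    # The non-coding part of the day is unchanged; only the coding part shrinks.
    return 1.0 / ((1.0 - coding_fraction) + coding_fraction / coding_speedup)


# Assumed figures: 1 hour of coding in an 8-hour day.
for factor in (2, 10, 1000):
    speedup = overall_speedup(coding_hours=1, workday_hours=8, coding_speedup=factor)
    print(f"coding {factor}x faster -> whole day {speedup:.2f}x faster")
```

Under these assumptions, even a model that made coding effectively instant would make the day as a whole only about 14% faster.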

It isn’t always correct

That SWE-bench Verified number is likely for the first attempt at each problem. Even if it is, you still need someone to check the code, as it fails the benchmark’s simple unit tests in roughly 28% of cases (and how many times would the generated code fail a robust review?).
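As a rough illustration of why the checking matters, the sketch below treats the nearly 72% figure as an independent per-change success rate. That independence is a simplifying assumption of mine, not something the benchmark claims. It then asks how likely it is that a run of unreviewed changes is entirely correct.

```python
def chance_all_correct(pass_rate, num_changes):
    """Probability that every one of num_changes independent changes is correct."""
    return pass_rate ** num_changes


PASS_RATE = 0.72  # rounded from the "nearly 72%" SWE-bench Verified score

for n in (1, 3, 5, 10):
    chance = chance_all_correct(PASS_RATE, n)
    print(f"{n:>2} unreviewed changes: {chance:.0%} chance all are correct")
```

The exact numbers matter less than the shape: the odds of a clean run drop quickly as unreviewed changes pile up.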

There is no robust solution to hallucinations and no sign of one coming over the hill.

Cost and power

The performance has scaled, but so have the costs. This is just throwing raw power at the problem. It gets much better results, but it’s not sustainable over the long term. The cost of completing the ARC-AGI Semi-Private Evaluation tasks is $20 per task, not the $20 per month that OpenAI currently charge for their subscription.

AI is power hungry, and power is expensive. We aren’t going to see a reduction to pre-2020 prices for some time.
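To make the gap concrete, here is a back-of-the-envelope sketch. The two $20 figures are the ones quoted above. The workload (10 tasks a day, 21 working days a month) is purely my assumption, and real pricing for a released model may look nothing like this.

```python
COST_PER_TASK_USD = 20           # per-task cost quoted for the benchmark run
SUBSCRIPTION_USD_PER_MONTH = 20  # current monthly subscription price


def monthly_cost(tasks_per_day, working_days=21):
    """Monthly spend if every task cost the quoted per-task price."""
    return COST_PER_TASK_USD * tasks_per_day * working_days


# Assumed workload: 10 tasks a day, 21 working days a month.
cost = monthly_cost(tasks_per_day=10)
ratio = cost / SUBSCRIPTION_USD_PER_MONTH
print(f"${cost} per month, or {ratio:.0f}x the subscription price")
```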

*Image:* *https://www.eia.gov/todayinenergy/detail.php?id=26752*

The ARC-AGI-PUB isn’t all you might think

The first line of the paper is:

“OpenAI’s new o3 system — trained on the ARC-AGI-1 Public Training set”

That training is how it got its score on the Semi-Private Evaluation set. That’s not as impressive as it might seem: the model was trained on the public training set, and the Semi-Private Evaluation set could itself be in the model’s training data (contamination).

The model didn’t just come in off the street and achieve a good score on the ARC-AGI-PUB test. Something looks off to me.

It’s not available

o3-mini is due for release towards the end of January (which, in OpenAI time, means February). If you want to help OpenAI out with their security testing (why?) you can apply here. That’s the mini version, and I can’t see a timeline for the full-fat version.

And…Don’t buy into the hype train

Can we just calm down, everyone?

Conclusion

You probably don’t need o3, and you’ll carry on with your job as usual. Don’t worry too much and just keep going as before.

If and when AGI comes, we’ll worry about these things then.

The last time I wrote an article like this on Medium somebody copied it. I wonder if the same will happen this time?