
The high price of intelligence: OpenAI’s o3 breaks benchmarks and carbon budgets

OpenAI’s o3 model edges closer to AGI, scoring 87.5% on ARC-AGI and excelling in math and programming. Yet each high-compute task consumes roughly 1,785 kWh, about two months of average U.S. household electricity, and emits an estimated 684 kg of CO₂, raising urgent questions about cost and sustainability.

OpenAI has announced its latest breakthrough: the o3 model family. This includes the flagship o3 and its lighter counterpart, o3-mini, marking a significant evolution from its earlier o1 “reasoning” model. But what makes this new model noteworthy, and why is everyone buzzing about it? Let’s dive in.

OpenAI describes o3 as a reasoning model with capabilities that inch closer to AGI (Artificial General Intelligence) — the holy grail of AI development. For context, AGI refers to an AI system capable of performing any intellectual task that a human can, from solving complex mathematical problems to creative writing. OpenAI defines it as “highly autonomous systems that outperform humans at most economically valuable work.”

But here’s the catch: OpenAI’s claims about approaching AGI come with significant caveats. For one, achieving AGI is a highly debated and elusive goal. François Chollet, a co-creator of the ARC-AGI benchmark (used to evaluate AI’s ability to generalize), noted that while o3 scored an impressive 87.5% in high-compute mode — tripling the performance of its predecessor — it still struggles with simple tasks that a human could easily solve.

AI models like o3 could play a transformative role in solving some of the world’s biggest problems, from medical research to renewable energy optimization. But to truly realize this potential, the industry must find sustainable ways to develop and deploy such models. Whether it’s improving the energy efficiency of AI hardware, sourcing clean energy, or optimizing models to perform tasks with less computational power, there is a pressing need for innovation in sustainable AI practices.

In his blog, Chollet remarked, “You’ll know AGI is here when creating tasks that are easy for humans but hard for AI becomes impossible.” Until then, milestones like this should be celebrated cautiously.

What’s New in o3?

Unlike traditional AI models, reasoning models like o3 have an internal “chain of thought” that simulates decision-making. Here’s how it works:

Private Chain of Thought: Before responding to a prompt, o3 pauses to consider related prompts, evaluate its reasoning, and plan its answer. This process mimics human-like deliberation and allows for more accurate, contextually relevant answers.

Adjustable Reasoning Time: A groundbreaking feature of o3 is its ability to adjust “compute levels,” which determine how much time it spends reasoning. High-compute settings yield better performance but at a higher cost.

Reinforcement Learning for Precision: o3 was trained using reinforcement learning, allowing it to “think” before acting. While this makes it slower than non-reasoning models, the trade-off is improved reliability in areas like mathematics, science, and programming.
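As a loose illustration of the “adjustable compute” idea, the toy sketch below treats each unit of compute as one extra independent reasoning attempt, so higher compute levels raise the chance of success at proportionally higher cost. The names `solve`, `task_difficulty`, and `compute_level` are invented for this sketch; nothing here reflects OpenAI’s actual mechanism or API.

```python
import random

def solve(task_difficulty: float, compute_level: int, seed: int = 0) -> bool:
    """Toy model of compute scaling: each unit of compute buys one
    independent reasoning attempt; the task counts as solved if any
    attempt succeeds. Purely illustrative."""
    rng = random.Random(seed)
    per_attempt_success = 1.0 - task_difficulty  # chance one chain of thought works
    for _ in range(compute_level):
        if rng.random() < per_attempt_success:
            return True
    return False

# More compute => more attempts => higher success rate, at higher cost.
low = sum(solve(0.9, compute_level=1, seed=s) for s in range(1000))
high = sum(solve(0.9, compute_level=50, seed=s) for s in range(1000))
print(f"solved {low}/1000 at low compute, {high}/1000 at high compute")
```

Running the comparison at the bottom shows the trade-off the article describes: the high-compute setting solves far more hard tasks, but does fifty times the work per task.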

These innovations aim to reduce hallucinations (incorrect outputs) and errors, which plague most AI models. However, as OpenAI admits, o3 is far from flawless. For example, its predecessor, o1, famously struggled with games as simple as tic-tac-toe.

One of the most talked-about aspects of o3 is its performance on the ARC-AGI test. This test evaluates an AI’s ability to solve novel tasks outside its training data, a crucial metric for assessing general intelligence. With an 87.5% score on high-compute mode, o3 shattered previous records.

Yet, Chollet cautions against reading too much into these results. On the same test, o3 falters on basic tasks, highlighting its fundamental differences from human intelligence. Additionally, the high-compute setting is prohibitively expensive, costing thousands of dollars per challenge.

On other benchmarks, however, o3 shines:

  • SWE-Bench Verified: A 22.8 percentage point lead over o1 in programming tasks.
  • Codeforces Rating: A remarkable 2727 score, placing it among the top competitive programmers globally.
  • American Invitational Mathematics Exam (2024): 96.7%, missing only one question.
  • EpochAI’s Frontier Math Benchmark: Solved 25.2% of problems, far outpacing all competitors (none of whom exceed 2%).

These achievements highlight o3’s prowess in specialized domains, particularly in STEM fields. But AGI? Not quite yet.


The hidden costs of progress

According to AI sustainability expert Boris Gamazaychikov, OpenAI’s benchmark results reveal staggering computational demands for o3. A single task on the ARC-AGI benchmark (high-compute version) consumed approximately 1,785 kilowatt-hours (kWh) of energy. To put this into context, that is roughly the amount of electricity an average U.S. household consumes over two months, he notes.

The environmental implications are equally concerning. Based on November 2024 U.S. grid emissions data, each task emitted an estimated 684 kilograms of CO₂ equivalent (CO₂e). For comparison, this is akin to the carbon emissions from burning more than five full tanks of gasoline (about 15 gallons per tank).

These calculations focus solely on the energy usage of GPUs during computation and exclude embodied carbon costs (such as manufacturing and infrastructure), meaning the actual environmental footprint is likely even higher.

The costs associated with running o3 tasks are eye-opening. Here’s a closer look at the underlying calculations:

  • Compute cost: Completing a task required 2,267 GPU hours; at a retail price of $1.50 per NVIDIA H100 GPU-hour, that comes to approximately $3,400 per task.
  • Energy: With each H100 operating at an assumed 700-watt thermal design power (TDP), and factoring in a power usage effectiveness (PUE) of 1.125 (Microsoft’s average), the energy consumption tallies up to 1,785 kWh.
  • Emissions: Applying the November 2024 U.S. grid emissions factor of 383 grams CO₂e per kWh yields the 684 kg CO₂e figure.

For further perspective, the energy usage of a single o3 task is enough to power a refrigerator for more than eight months or charge a Tesla Model 3 nearly 10 times.
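Gamazaychikov’s arithmetic is straightforward to reproduce. The short script below recomputes the three headline figures from the constants quoted above (the variable names are mine, chosen for clarity):

```python
# Back-of-the-envelope estimate for one high-compute ARC-AGI task,
# using the figures as reported in the article.
GPU_HOURS = 2267               # H100 GPU-hours per task
TDP_KW = 0.700                 # assumed 700 W draw per H100
PUE = 1.125                    # Microsoft's average power usage effectiveness
PRICE_PER_GPU_HOUR = 1.50      # retail H100 price, USD
GRID_KG_CO2E_PER_KWH = 0.383   # U.S. grid emissions factor, November 2024

energy_kwh = GPU_HOURS * TDP_KW * PUE              # ~1,785 kWh
emissions_kg = energy_kwh * GRID_KG_CO2E_PER_KWH   # ~684 kg CO2e
cost_usd = GPU_HOURS * PRICE_PER_GPU_HOUR          # ~$3,400

print(f"{energy_kwh:,.0f} kWh, {emissions_kg:,.0f} kg CO2e, ${cost_usd:,.0f}")
```

Note that the PUE multiplier accounts for data-center overhead (cooling, power delivery) on top of the GPUs’ own draw; as the article points out, embodied carbon from manufacturing and infrastructure is still excluded.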

Safety first: The risks of reasoning models

Despite the fanfare, OpenAI remains cautious about the risks associated with reasoning models like o3. In fact, CEO Sam Altman has called for a federal testing framework to monitor and mitigate potential harms before releasing such models widely.

Here’s why safety matters: Earlier tests of o1 revealed that its reasoning abilities made it more likely to deceive human users than conventional models. The fear is that o3, with its more advanced capabilities, might exacerbate these risks. OpenAI claims to have addressed this through a new technique called “deliberative alignment,” which aligns the model with safety principles. However, the real test will come when external safety researchers publish their findings.

For now, only a limited preview of o3-mini is available to safety researchers, with a wider rollout expected by the end of January 2025.

OpenAI’s announcement comes amid a flurry of activity in the AI reasoning space. Competitors like Google and Alibaba have launched their own reasoning models, while startups like DeepSeek are experimenting with novel approaches. This surge reflects a growing consensus that traditional “scaling up” techniques are reaching their limits, and reasoning models could be the next big thing.

But not everyone is convinced. Critics argue that reasoning models are too expensive and may not sustain their current rate of progress. For instance, the high-compute setting that gives o3 its edge is prohibitively costly, raising questions about its scalability for real-world applications.

Interestingly, OpenAI’s announcement coincides with the departure of Alec Radford, one of its most accomplished scientists and the lead author behind the GPT series of models. Radford’s exit to pursue independent research raises questions about the future direction of OpenAI, particularly as it juggles scientific innovation with growing commercial pressures.

OpenAI has positioned o3 as a significant step forward, both in terms of technical capabilities and safety advancements. But the journey toward AGI remains long and uncertain. As François Chollet aptly put it, “AGI is not a finish line you cross; it’s a horizon you chase.”

For now, o3 represents progress, not perfection. It’s a reminder of both the immense potential and the profound challenges of building AI systems that reason, adapt, and ultimately, think like us.

Fabrice Iranzi

Journalist and Project Leader at LionHerald, with a strong passion for tech and new ideas, serving Digital Company Builders in the UK and beyond
E-mail: iranzi@lionherald.com
