OpenAI o1 scores an estimated IQ of 120-125, up from the 96-98 range that characterized GPT-4 and comparable models. This is the first time a language model has tested above the human average on IQ-style reasoning benchmarks, and it raises the question of what these numbers actually tell us about the path to artificial general intelligence.
On coding benchmarks and complex reasoning tasks, o1 surpasses every previous model by a significant margin. The gains are concentrated in problems that require multi-step deduction: precisely the kind of task where the reflection mechanism adds value by allowing the model to reconsider intermediate steps before committing to a solution.
The IQ comparison is illustrative but should be interpreted carefully. IQ tests measure a specific set of cognitive abilities, and a model that scores 120 on these tests has not necessarily crossed a threshold into general intelligence. It has become better at the particular kinds of pattern recognition and logical sequencing that IQ tests evaluate. For tasks outside this distribution (open-ended creativity, common-sense reasoning about novel physical situations, learning from a single example), the improvement over GPT-4 is far less dramatic.
The critical distinction is that o1's improvements come from spending more compute at inference time, not from a new architecture. The underlying transformer is essentially unchanged. What has changed is the amount of work the model does before producing output: instead of generating tokens immediately, it runs an extended internal reasoning process that can consume tens of thousands of tokens worth of computation.
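To make the distinction concrete, here is a minimal sketch of what inference-time reflection looks like from the outside. The function names (`complete`, `answer_with_reflection`) and the two-phase structure are illustrative assumptions, not o1's actual implementation, which OpenAI has not published.

```python
from typing import Callable

def answer_with_reflection(
    question: str,
    complete: Callable[[str, int], str],  # any text-completion call: (prompt, max_tokens) -> text
    reasoning_budget: int = 20_000,
) -> str:
    """Illustrative two-phase inference: reason at length, then answer briefly."""
    # Phase 1: spend a large token budget on internal reasoning that the user
    # never sees. The model can explore, backtrack, and reconsider intermediate
    # steps here, which is where the extra inference-time compute goes.
    reasoning = complete(
        f"Question: {question}\n"
        "Think step by step, checking each intermediate result before moving on:\n",
        reasoning_budget,
    )

    # Phase 2: condition on the hidden reasoning trace and emit only a short
    # final answer. The user pays for the reflection in latency and tokens,
    # but sees only this output.
    return complete(
        f"Question: {question}\n"
        f"Reasoning (hidden):\n{reasoning}\n"
        "Final answer:",
        500,
    )
```

The same base model serves both calls; any capability gain comes entirely from how much computation the first call is allowed to consume.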
This is a meaningful advance, but it is a process innovation rather than an architectural one. The history of AI progress suggests that the largest capability jumps come from architectural breakthroughs: backpropagation, the attention mechanism, the scaling laws that made large language models viable. Process improvements tend to produce diminishing returns as they are pushed further. Whether scaling inference-time reflection will continue to yield gains or will plateau is an open empirical question.
Sam Altman has framed o1's reflection capability as a step toward AGI, with a timeline of roughly 1000 days. The vision is that future versions will reflect not for 50 seconds but for hours or days, tackling complex research problems the way a human scientist would.
There are several problems with this framing. First, extending reflection time does not guarantee proportionally better results. A model that reasons poorly for 50 seconds will not necessarily reason well given 50 hours; the quality of the reasoning process matters more than its duration. Second, the problems that separate current AI from AGI are not primarily problems of insufficient thinking time. They involve capabilities like causal reasoning, genuine understanding of physical systems, and the ability to learn new abstractions from minimal data, none of which are addressed by longer reflection windows.
The 1000-day prediction reflects OpenAI's institutional optimism more than any technical argument. AGI predictions have a poor track record across the field, and the gap between impressive benchmarks and genuine general intelligence remains substantial.
One consequence of o1's design is that the reflection approach is not proprietary in any deep sense. It does not depend on novel hardware or a fundamentally new training paradigm. Any team with access to a capable base model can implement inference-time reasoning through chain-of-thought prompting, reinforcement learning on reasoning traces, or similar techniques.
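As a rough illustration of how reproducible the recipe is, the sketch below implements one published variant, self-consistency: sample several independent chain-of-thought traces from any base model and majority-vote the final answers. The `sample` and `extract_answer` callables are hypothetical stand-ins for a model client and an answer parser.

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(
    question: str,
    sample: Callable[[str], str],          # one stochastic completion: prompt -> reasoning trace
    extract_answer: Callable[[str], str],  # pulls the final answer out of a trace
    n_samples: int = 16,
) -> str:
    """Sample n chain-of-thought traces and return the majority final answer."""
    prompt = f"{question}\nLet's think step by step."
    answers = [extract_answer(sample(prompt)) for _ in range(n_samples)]
    # Independent traces that err tend to err differently, so the most frequent
    # final answer is usually more reliable than any single trace.
    return Counter(answers).most_common(1)[0][0]
```

Nothing here requires privileged access: it runs against any hosted or open-weight model, which is why the moat around the technique is thin.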
This means o1's competitive advantage is likely temporary. Other providers and open-source projects will implement comparable approaches, and the differentiator will shift to training data quality, implementation efficiency, and integration into useful products. The innovation is important, but it is the kind of innovation that diffuses rapidly through the field.