OpenAI's o1, originally codenamed Strawberry, introduced a mechanism that no previous commercial language model had shipped: a built-in reflection phase where the model reasons internally for roughly 50 seconds before generating a visible response. The shift goes beyond producing better answers. It changes the fundamental economics and energy profile of inference.
During its reflection phase, o1 processes between 10,000 and 70,000 tokens internally. Users see only a small fraction of this computation in the final response. The rest is a hidden chain-of-thought: the model exploring reasoning paths, reconsidering intermediate conclusions, and self-correcting before committing to an answer. This is architecturally distinct from earlier models, which begin generating output tokens immediately and have no mechanism for internal deliberation.
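This hidden computation is indirectly observable through the API: the reasoning tokens are counted and billed even though their content is never returned. A minimal sketch using OpenAI's Python SDK (the prompt is illustrative, and the usage fields shown are those the SDK exposed for o-series models as of o1's release):

```python
# Sketch: inspecting o1's hidden reasoning tokens via the OpenAI API.
# The reasoning content itself is never returned; only its token count
# appears in the usage statistics.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

usage = response.usage
# completion_tokens includes the hidden reasoning tokens; the visible
# answer accounts for only the difference between the two counts.
reasoning = usage.completion_tokens_details.reasoning_tokens
print("total completion tokens:", usage.completion_tokens)
print("hidden reasoning tokens:", reasoning)
print("visible answer tokens:  ", usage.completion_tokens - reasoning)
```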
The approach produces measurably better results on complex tasks: programming challenges, multi-step mathematical proofs, and scientific reasoning problems, where the ability to reconsider an initial approach before outputting it is genuinely valuable.
One of the more surprising findings from OpenAI's development process is that the reflection mechanism only worked well after the model was given what the developers describe as a "personality": a human-like way of approaching problems. Without it, the raw reflection process produced mediocre results. With it, o1 began exhibiting recognizably human reasoning patterns, including making the same kinds of logical errors humans make and finding similar paths to correct them.
This is a noteworthy data point for AI research more broadly. It suggests that effective multi-step reasoning may depend on cognitive heuristics in addition to computational capacity. The kinds of shortcuts and framings that human thinkers use instinctively may be essential.
The energy implications are substantial. A single o1 chat session consumes roughly as much electricity as running a living room lamp for a day and a half. This is a direct consequence of the token volume: processing 70,000 internal tokens requires the same GPU compute as generating 70,000 output tokens in a standard model, but none of that computation is visible to the user.
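A quick back-of-envelope check makes the scale concrete. The constants below are assumptions chosen for illustration (a small LED lamp, a session at the midpoint of the stated token range), not measured values:

```python
# Back-of-envelope energy estimate for one o1 session.
# All constants are illustrative assumptions, not measurements.
LAMP_WATTS = 10              # assumed small LED living-room lamp
HOURS = 36                   # "a day and a half" of lamp runtime

session_wh = LAMP_WATTS * HOURS
print(f"Energy per o1 session: ~{session_wh} Wh")             # ~360 Wh

TOKENS_PER_SESSION = 40_000  # assumed midpoint of the 10k-70k range
wh_per_1k = session_wh / (TOKENS_PER_SESSION / 1_000)
print(f"Implied cost: ~{wh_per_1k:.0f} Wh per 1,000 tokens")  # ~9 Wh
```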
For tasks where the reflection genuinely improves output quality, this trade-off may be justified. But for the 80-90% of everyday use cases (translations, routine emails, simple questions), o1 is no better than its predecessors. It is simply slower and more expensive. The right mental model is not "o1 replaces GPT-4" but "o1 is a specialist tool for hard problems."
OpenAI's current product lineup (o1, o1-mini, GPT-4, GPT-4o, and other variants) forces users to make a selection that most of them are not equipped to make. Choosing the right model for a given task requires understanding the performance characteristics and cost profiles of each option. A technology company building products for a general audience should not externalize this complexity to end users.
The more practical design would be a routing layer that automatically dispatches queries to the appropriate model based on estimated task complexity. This is a solvable engineering problem, and it is worth noting that some competitors are already moving in this direction.
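A toy version of such a router, assuming the standard OpenAI Python SDK; the keyword heuristic below is a stand-in for a real complexity classifier, which would more plausibly be a small, cheap model scoring each query:

```python
# Sketch of a complexity-based routing layer. The classifier here is a
# deliberately naive keyword heuristic; model names are illustrative.
from openai import OpenAI

client = OpenAI()

REASONING_HINTS = ("prove", "step by step", "debug", "optimize", "derive")

def route(prompt: str) -> str:
    """Return a model name based on a rough complexity estimate."""
    looks_hard = any(hint in prompt.lower() for hint in REASONING_HINTS)
    return "o1" if looks_hard else "gpt-4o"

def ask(prompt: str) -> str:
    """Dispatch the query to whichever model the router selects."""
    response = client.chat.completions.create(
        model=route(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Routine query goes to the cheap model; a proof request goes to o1.
print(ask("Translate 'good morning' into French."))
print(ask("Prove that the sum of two even numbers is even."))
```

The point is not the heuristic but the interface: the user asks a question and never has to know which model answered it.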
OpenAI o1 processes 10,000 to 70,000 internal tokens during a roughly 50-second reasoning phase before producing a visible response. Only a fraction of these tokens appear in the final output. This hidden chain-of-thought allows the model to explore multiple reasoning paths and self-correct before committing to an answer.
A single chat with o1 consumes roughly as much electricity as running a living room lamp for one and a half days. This high energy draw stems from the model processing tens of thousands of internal tokens during each reflection phase, making it significantly more resource-intensive than standard language models.
The o1 model is best suited for complex reasoning tasks such as multi-step programming problems and scientific analysis. For 80-90% of everyday tasks like translations or routine emails, standard models like GPT-4 are equally capable, faster, and more energy-efficient.
OpenAI developers found that the reflection mechanism only produced strong results after the model was given a human-like personality: a tendency to approach problems the way a person would. This suggests that effective reasoning may require more than pure logic; it may depend on human-like cognitive heuristics, including the capacity to make and then correct typical logical errors.
Users must choose between o1, o1-mini, GPT-4, and other variants without clear guidance on which model fits which task. This creates unnecessary complexity. An intelligent routing system that automatically selects the appropriate model based on task requirements would be a more practical design.
Copyright 2026 - Joel P. Barmettler