OpenAI unveiled its latest foundation model, o3, at the culmination of its 12 Days of OpenAI livestream event. The next-generation model, successor to the o1 family, represents a significant leap in reasoning capabilities. Notably, OpenAI skipped the o2 name, reportedly to avoid a trademark conflict with the British telecom provider O2.
While not yet publicly available, o3 and o3-mini are currently being tested by safety and security researchers. There’s no confirmed timeline for their integration into ChatGPT. OpenAI CEO Sam Altman touted o3 as a “breakthrough” during the event, and OpenAI President and Co-founder Greg Brockman echoed this sentiment on X (formerly Twitter), highlighting a “step function improvement on our hardest benchmarks.”
Redefining Reasoning: How o3 Works
Like its predecessors in the o1 family, o3 functions differently from traditional generative models. It incorporates an internal fact-checking mechanism before presenting responses, leading to improved accuracy. While this process increases response time (from seconds to minutes), it yields more reliable answers for complex science, math, and coding queries compared to GPT-4. Moreover, o3 can transparently explain its reasoning process.
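OpenAI hasn’t published the details of this internal mechanism, but conceptually it resembles a draft-then-verify loop: produce a candidate answer, check it, and only return it once the check passes. The minimal Python sketch below illustrates that general pattern; every function in it is a hypothetical stub, not OpenAI’s implementation.

```python
# Conceptual sketch of a draft-then-verify loop, loosely analogous to the
# private chain of thought that o-series models run before answering.
# Every function here is an illustrative stub, not OpenAI's implementation.

def draft_answer(question: str) -> str:
    # Stand-in for the model's initial draft reasoning.
    return "2 + 2 = 4"

def check_answer(question: str, answer: str) -> bool:
    # Stand-in for the internal verification pass that scores the draft.
    return answer.endswith("4")

def answer_with_verification(question: str, max_attempts: int = 3) -> str:
    """Draft, verify, and retry until the check passes or attempts run out."""
    draft = ""
    for _ in range(max_attempts):
        draft = draft_answer(question)
        if check_answer(question, draft):
            return draft
    return draft  # fall back to the last draft if verification never passes

print(answer_with_verification("What is 2 + 2?"))
```

The extra verification passes are exactly what trades latency for reliability: each retry costs more compute, which is why responses can stretch from seconds to minutes.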
Users can also control the model’s processing time with low, medium, and high compute settings. Higher compute levels offer more comprehensive answers, but at a significantly increased cost. As noted by ARC-AGI co-creator François Chollet, high-compute tasks could cost thousands of dollars.
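OpenAI has not yet published o3’s API surface, so it is unclear how this knob will be exposed to developers. As a rough sketch, assuming it arrives as an ordinary request parameter in the OpenAI Python SDK, a call might look like the following; the model name and the reasoning_effort parameter are assumptions here, not confirmed details.

```python
# Hypothetical sketch: selecting a compute level per request.
# The "o3-mini" model name and "reasoning_effort" parameter are assumptions;
# OpenAI had not released o3's API at the time of the announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",              # assumed model identifier
    reasoning_effort="high",      # assumed knob: "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
)
print(response.choices[0].message.content)
```

Under this design, a developer would reserve the high setting for genuinely hard problems, since Chollet’s cost estimates suggest the price gap between levels is substantial.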
Benchmarking o3’s Performance: A Significant Leap
Early tests suggest o3 dramatically outperforms even the recently released o1. It exhibits a nearly 23% improvement on the SWE-Bench Verified coding test and over a 60-point advantage on the Codeforces benchmark. On the AIME 2024 mathematics test, o3 achieved a remarkable 96.7%, and it outperformed human experts on GPQA Diamond with a score of 87.7%. Perhaps most impressively, o3 solved over 25% of the problems on the EpochAI Frontier Math benchmark, where other models struggle to surpass 2%. While these are preliminary results, and OpenAI acknowledges that final performance may vary, the initial findings are promising.
Addressing Safety Concerns: Deliberative Alignment
OpenAI has also integrated new “deliberative alignment” safety measures into o3’s training, aimed at curbing the tendency of earlier reasoning models, such as o1, to deceive human evaluators.
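The core idea behind deliberative alignment is that the model reasons explicitly over written safety specifications before answering, rather than relying only on pattern-matched refusals. The Python sketch below illustrates that idea at the prompt level; the policy text and prompt structure are illustrative assumptions, not OpenAI’s training recipe.

```python
# Rough illustration of the deliberative-alignment idea: give the model the
# safety policy text and ask it to reason about the policy before answering.
# The policy wording and prompt format below are assumptions for illustration.

SAFETY_SPEC = """\
1. Refuse requests that facilitate serious harm.
2. Answer benign requests helpfully, even if they sound sensitive.
"""

def build_deliberative_prompt(user_request: str) -> str:
    """Embed the policy so the model can cite it in its chain of thought."""
    return (
        "Before answering, reason step by step about whether the request "
        "complies with this policy:\n"
        f"{SAFETY_SPEC}\n"
        f"Request: {user_request}\n"
        "First state which policy clause applies, then answer or refuse."
    )

print(build_deliberative_prompt("How do antivirals work?"))
```

Because the policy is part of what the model deliberates over, a safety reviewer can inspect the cited clause in the model’s reasoning instead of guessing why a refusal occurred.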
Accessing o3-mini
Researchers interested in exploring o3-mini can join the waitlist for early access on OpenAI’s website.