According to OpenAI, o1 performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, and even excels in math and coding.

OpenAI said its project Strawberry has graduated to a new family of large language models (LLMs) that the company has christened OpenAI o1.

The new family of models, which also includes an o1-mini version for cost efficiency, is distinguished from the latest GPT-4o models by its reasoning abilities, according to the company.

“We’ve developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math,” the company wrote in a blog post, adding that the models were currently in preview.

According to OpenAI, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, and even excels in math and coding.

“In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions,” it added.

The reasoning capabilities of the OpenAI o1 models are expected to help tackle complex problems in science, coding, and mathematics, among other fields, according to OpenAI.

“For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows,” it explained.

How do the models get reasoning capabilities?
The new family of o1 models gets its reasoning capabilities from the company’s large-scale reinforcement learning algorithm, which teaches the models how to think productively using a “Chain of Thought” mechanism in a “highly data-efficient training process.”

“We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute),” the company said in another blog post, highlighting that this approach has substantially different constraints from LLM pretraining.

In AI and generative AI, experts say, a model adjusts its parameters during training based on the data it has been fed, in order to reduce errors and increase accuracy. During testing, by contrast, developers and researchers expose the model to new data to measure its performance and how well it adapts to unseen instances.

In the case of the new models, therefore, the more time a model spends analyzing and working through a problem, the better its reasoning performs. This behavior is driven by the model’s Chain of Thought, which works much as a human might think for a long time before responding to a difficult question, often breaking the problem into smaller chunks.

Speaking about the models’ reasoning capabilities, Nvidia senior research manager Jim Fan said in a LinkedIn post that the world is finally seeing the paradigm of inference-time scaling popularized and deployed in production.

“You don’t need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small ‘reasoning core’ that knows how to call tools like browsers and code verifiers. Pre-training compute may be decreased,” Fan explained.
Further, Fan said that OpenAI must have figured out the inference scaling law a long time ago, which academia is only recently discovering. However, he pointed out that productionizing o1 is much harder than nailing the academic benchmarks, and he raised several questions.

“For reasoning problems in the wild, how (the model) to decide when to stop searching? What’s the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn’t share much.”

OpenAI, too, has said in one of its blog posts that the new model, which is still in the early stages of development and is expected to undergo significant iteration, doesn’t yet have many of the features that make ChatGPT useful, such as browsing the web for information and uploading files and images.

“For many common cases GPT-4o will be more capable in the near term,” the company said.

OpenAI is hiding the reasoning tokens

Although the new family of models has better reasoning, OpenAI is hiding the models’ reasoning tokens, that is, their raw Chain of Thought.

While the company acknowledges that exposing the Chain of Thought could allow enterprises to understand how the models function and whether they show signs of manipulating a user, it has decided that it would not be helpful to make a model’s unaligned Chain of Thought, or reasoning tokens, directly visible to users.

Interfering with the unaligned Chain of Thought would be counterproductive, the company explained, adding that to understand exactly how the model is reasoning, the model must have the freedom to express its thoughts in unaltered form. This is why OpenAI cannot train any policy compliance or user preferences onto the Chain of Thought.

“We acknowledge this decision has disadvantages.
We strive to partially make up for it by teaching the model to reproduce any useful ideas from the Chain of Thought in the answer,” it added.

British programmer Simon Willison, co-founder of the social conference directory Lanyrd and co-creator of the Django web framework, said in a blog post that he wasn’t happy with OpenAI’s policy decision.

“The idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backward,” he wrote.

Other limitations of the o1 model

Another issue Willison pointed out is that although reasoning tokens are not visible in the API response, they are still billed and counted as output tokens. In practical terms, this means enterprises will have to increase the token budgets for their prompts to accommodate the reasoning tokens.

“Thanks to the importance of reasoning tokens (OpenAI suggests allocating a budget of around 25,000 of these for prompts that benefit from the new models), the output token allowance has been increased dramatically, to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini,” Willison wrote.

These output token allowances are an increase over those of the gpt-4o and gpt-4o-mini models, both of which currently have a 16,384 output token limit, the programmer added.

OpenAI is also advising enterprises to use retrieval-augmented generation (RAG) differently with the new models. Whereas the usual RAG advice is to cram in as many potentially relevant documents as possible, OpenAI suggests that with the new models users include only the most relevant information, to prevent the model from overcomplicating its response, Willison explained.

How to get the new o1 family of models?

ChatGPT Plus and Team users will be able to access o1 models in ChatGPT starting Thursday.
Both o1-preview and o1-mini can be selected manually in the model picker, and at launch, weekly rate limits will be 30 messages for o1-preview and 50 for o1-mini, the company said, adding that it was working to increase those rates and to enable ChatGPT to automatically choose the right model for a given prompt.

ChatGPT Enterprise and Edu users, meanwhile, will get access to both models beginning next week.

OpenAI said that developers who qualify for API usage tier 5 can start prototyping with both models in the API starting Thursday, with a rate limit of 20 requests per minute.

“We’re working to increase these limits after additional testing. The API for these models currently doesn’t include function calling, streaming, support for system messages, and other features,” the company said, adding that it was planning to bring o1-mini access to all ChatGPT Free users.
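For developers sizing up the API, the token figures Willison cites can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: it assumes the output limits and the roughly 25,000-token reasoning budget quoted in this article, and the names `plan_output_budget` and `billed_output_tokens` are hypothetical helpers, not part of any OpenAI SDK.

```python
from dataclasses import dataclass

# Output-token ceilings as quoted in the article; they cover hidden
# reasoning tokens plus the visible answer.
OUTPUT_TOKEN_LIMITS = {
    "o1-preview": 32_768,
    "o1-mini": 65_536,
}

# Reasoning budget that, per the article, OpenAI suggests allocating
# for prompts that benefit from the new models.
SUGGESTED_REASONING_BUDGET = 25_000


@dataclass(frozen=True)
class OutputBudget:
    model: str
    limit: int              # total output-token allowance
    reasoning_reserve: int  # tokens set aside for hidden reasoning
    visible_tokens: int     # what remains for the visible answer


def plan_output_budget(
    model: str, reasoning_reserve: int = SUGGESTED_REASONING_BUDGET
) -> OutputBudget:
    """Split a model's output allowance into reasoning and visible parts."""
    if model not in OUTPUT_TOKEN_LIMITS:
        raise ValueError(f"no output limit recorded for model {model!r}")
    if reasoning_reserve < 0:
        raise ValueError("reasoning_reserve must be non-negative")
    limit = OUTPUT_TOKEN_LIMITS[model]
    # The reserve can never exceed the total allowance.
    reserve = min(reasoning_reserve, limit)
    return OutputBudget(model, limit, reserve, limit - reserve)


def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Hidden reasoning tokens are billed as output tokens, like the answer."""
    if visible_tokens < 0 or reasoning_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return visible_tokens + reasoning_tokens


if __name__ == "__main__":
    for name in OUTPUT_TOKEN_LIMITS:
        print(plan_output_budget(name))
```

Under these assumptions, reserving the suggested 25,000 reasoning tokens leaves 7,768 tokens of visible output on o1-preview and 40,536 on o1-mini, and a response with a short visible answer can still be billed for thousands of output tokens.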