O1 Not Good Enough (for Us)
Over the last couple of days we've been playing around with OpenAI's latest o1 models. Initially we were super jazzed about these new models and thought they could be a valuable addition to our product. But after playing around with them and running some experiments, our conclusion is that they're simply not good enough for us: they suffer from several problems that make them basically useless for our use cases.
Problems with these models
First of all, all the o1 models use a different API. You cannot add system messages to them, meaning you'll have to change your middleware to use these models. In addition, you cannot change the temperature, and token counting is completely different due to the models' reliance upon "reasoning tokens". This forced me to completely rewrite our API middleware just to be able to test these models with our tech.
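To illustrate, here's a minimal sketch of the kind of branching this forces into your middleware. It assumes the official openai Python SDK; the prompts, temperature value, and function name are just placeholders, not our actual code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, system_prompt: str, user_prompt: str) -> str:
    if model.startswith("o1"):
        # o1 rejects the "system" role, so instructions must be folded
        # into the user message, and temperature must be left at default.
        messages = [{"role": "user",
                     "content": f"{system_prompt}\n\n{user_prompt}"}]
        response = client.chat.completions.create(
            model=model, messages=messages)
    else:
        messages = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}]
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=0.2)
    # o1 also bills hidden "reasoning tokens", reported separately in the
    # usage object, so token accounting has to change as well.
    return response.choices[0].message.content
```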
Even after making the above changes, we had to rip out streaming support. The o1 models don't support streaming events, meaning they can't stream tokens as they produce an answer. This is a big deal for us, since we're using a CDN provider with a maximum timeout of 60 seconds. Because these models sometimes spend more than 60 seconds answering a question, this results in a timeout from our CDN provider.
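For reference, this is the kind of streaming call we rely on with gpt-4o to keep the connection alive within the CDN's 60-second window. It's a minimal sketch reusing the client from the snippet above, with a placeholder prompt; at the time of writing, the same call with an o1 model and stream=True is rejected.

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain DCF valuation."}],
    stream=True,  # tokens arrive as they're generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```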
Basically, these o1 models are useless for us.
Overkill IQ
Below is a screenshot of gpt-4o valuing a company based upon historical accounting figures, as reported to the government by the company itself.
Notice how it's perfectly capable of breaking the valuation down into a step-by-step method using the discounted cash flow (DCF) approach. This is a fairly complex mathematical process, one that requires hours of calculations for a human being. gpt-4o is just magically capable of performing the whole process by itself, looking up accounting data from its database and carrying out the entire valuation, 100% autonomously.
The complete result is several pages long and goes into details such as the following (a toy sketch of the arithmetic follows the list):
- Free Cash Flow
- Weighted Average Cost of Capital
- Forecasted Future Cash Flows
- Terminal Value
- Discounted Future Cash Flows
- Enterprise Value
- Equity Value
- Etc, etc, etc.
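To give a feel for what those steps amount to, here's a toy version of the arithmetic. All figures and parameter names are made up for illustration; the real calculation the model performs is considerably more involved.

```python
def dcf_equity_value(fcf: float, growth: float, wacc: float,
                     terminal_growth: float, years: int,
                     net_debt: float) -> float:
    """Toy DCF: forecast free cash flows, discount them, add a
    Gordon-growth terminal value, then back out equity value."""
    present_value = 0.0
    cash_flow = fcf
    for t in range(1, years + 1):
        cash_flow *= 1 + growth                       # forecast future cash flow
        present_value += cash_flow / (1 + wacc) ** t  # discount it back
    # Terminal value via the Gordon growth formula, discounted to today.
    terminal = cash_flow * (1 + terminal_growth) / (wacc - terminal_growth)
    present_value += terminal / (1 + wacc) ** years
    enterprise_value = present_value
    return enterprise_value - net_debt                # equity value

# Example: 10M in free cash flow, 5% growth, 9% WACC, 2% terminal growth.
print(f"{dcf_equity_value(10e6, 0.05, 0.09, 0.02, 5, net_debt=4e6):,.0f}")
```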
This is currently our most complex calculation requirement. It's a project we're doing for one of our customers, who wants to create an AI chatbot SaaS product giving investment advice based upon historical company data. The project is delivered on top of our AI Expert System, leveraging our AI Agent capabilities, and does lookups into a database of roughly 5 to 10 million records of historical financial data. It uses gpt-4o, and thanks to gpt-4o its average response time is less than 20 seconds.
Using o1 as our model for something like the above would simply be overkill, and not interesting for us. The only things we'd achieve are higher expenses and a lower quality user experience.
No reason to use a nuke when a shotgun is sufficient.
gpt-4o is good enough
OpenAI's existing models, in particular gpt-4o, are simply good enough for us. In addition, gpt-4o is a fraction of the cost, it supports streaming, and it answers much faster. Using o1 instead would accomplish nothing for us.
These new o1 models might be truly incredible for some use cases, in particular complex research on scientific problems. According to OpenAI themselves, o1 has the reasoning capability of a PhD student, where gpt-4o has the reasoning capability of a high school student.
Our problem is that 100% of our current use cases can easily be solved with high school math, meaning the only things we'd achieve by using o1 are higher costs, lower quality UX, and slower responses.
Conclusion
Until o1 can be used with the same API as gpt-4o, and until it supports streaming, o1 is simply not interesting for us, or for our customers. These models might be incredibly useful for complex research and science, but for us they're simply overkill.
In fact, the only reason we're using gpt-4o and not gpt-4o-mini is that the mini model isn't good enough at following instructions. gpt-4o is currently our weapon of choice for our customers.
My wish list for OpenAI is not stronger models such as the o1 suite, but rather better "weak" models. Improving gpt-4o-mini so it follows instructions more reliably would be high on my wish list.
o1 might be incredibly valuable for complex research, but none of our problems currently require complex research capabilities. Our current use cases are easily solved with high school level capabilities. So at least for the foreseeable future, we see no reason to spend time implementing support for o1. This might change, but for now these models simply aren't interesting to us ...