GPT-4.5 versus GPT-4o

OpenAI just released a new model named GPT-4.5. It's ridiculously expensive compared to GPT-4o. Input tokens cost $75 per million, versus $2.50 for GPT-4o. Output tokens cost $150 per million, versus $10 for GPT-4o. This makes it 30 times as expensive for input and 15 times as expensive for output. The natural question therefore becomes ...
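To make the arithmetic concrete, here is a minimal sketch that computes the price ratios from the figures quoted above and the cost of a single request. The dictionary keys are only labels, and the token counts in the usage example (2,000 input and 500 output tokens) are hypothetical numbers chosen for illustration, not measurements from a real chatbot.

```python
# Prices in USD per million tokens, as quoted in the text above.
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def price_ratio(kind: str) -> float:
    """How many times more GPT-4.5 costs than GPT-4o for 'input' or 'output' tokens."""
    return PRICES["gpt-4.5"][kind] / PRICES["gpt-4o"][kind]


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request with the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    print("input ratio:", price_ratio("input"))
    print("output ratio:", price_ratio("output"))
    # Hypothetical request size, for illustration only.
    for model in PRICES:
        print(model, request_cost(model, input_tokens=2000, output_tokens=500))
```

The blended ratio for a real workload lands somewhere between 15 and 30, depending on how many of your tokens are input versus output.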

Is it worth it?

According to OpenAI, this is how it scores on benchmark tests. However, most LLM providers seem to be optimising and training on specific benchmark tests, so better performance on such tests doesn't necessarily imply better performance in general.

OpenAI GPT-4.5 versus GPT-4o

The above figures are truly impressive, if they reflect reality during daily usage. But before we put it to the test we need to talk about AI saturation.

AI Model Saturation

Every single week a new and revolutionary AI model is released. At AINIRO it's literally our job to measure these and see if they are good enough for our clients. However, most others obviously don't have the resources to do this the way we do. For most others AI is a tool, and if it does its job, most people don't really care about how their particular model scores on specific benchmarks. Most people just want an honest answer and some guidance on which model to use, and they really don't care about performance differences.

This implies that unless you've got very unusual problems that require very good models to solve, the model's "IQ" is barely relevant anymore. For a simple customer service type of AI chatbot, based upon RAG, as long as the LLM can provide correct answers, most other parameters are irrelevant.

There are exceptions of course, such as content production and complex medical and legal information - But in general, above some threshold, the model's score on these tests seems to become irrelevant.

AI Function Instructions

The most important part for us is how easy it is to have the model execute functions. Both models seem to perform similarly here.

GPT-4o

GPT-4o executing function

GPT-4.5

GPT-4.5 executing function

I wouldn't get too hung up on the differences in answers here, since the same model typically reformats the same answer multiple different ways given the exact same input. The fact that the model was able to correctly execute functions at all is what's crucial for us.

Speed

GPT-4o is about 5 to 10 times faster than GPT-4.5. For simple daily tasks, this is a really big deal. When you're searching for a contact in NetSuite or HubSpot, you don't want to wait 60 seconds before the model returns the data you want to work with. This adds latency to your daily work, and for easy tasks this is simply not something you can tolerate.

Unless you've got very complex problems, the additional time GPT-4.5 takes is not worth it.
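If you want to check the speed difference on your own workload, a small timing harness is enough. The sketch below is only an illustration: `fake_fast_model` and `fake_slow_model` are hypothetical stand-ins that sleep for a fixed time, and you would replace them with functions that call your real models.

```python
import statistics
import time
from typing import Callable


def median_latency(call: Callable[[], object], runs: int = 5) -> float:
    """Run `call` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical stand-ins for real model calls; swap in your own client code.
def fake_fast_model() -> str:
    time.sleep(0.01)
    return "ok"


def fake_slow_model() -> str:
    time.sleep(0.05)
    return "ok"


if __name__ == "__main__":
    fast = median_latency(fake_fast_model)
    slow = median_latency(fake_slow_model)
    print(f"slow/fast latency ratio: {slow / fast:.1f}")
```

The median is used instead of the mean so that a single slow network round trip doesn't skew the comparison.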

Other Differences

According to OpenAI, GPT-4.5 represents a significant shift from its predecessor, GPT-4o, in terms of its approach to artificial intelligence. While GPT-4o was primarily optimized for reasoning-intensive tasks and complex STEM problem-solving, GPT-4.5 pivots towards enhancing emotional intelligence (EQ) and natural conversational abilities. GPT-4.5 leverages scaled unsupervised learning, resulting in broader world knowledge, improved factual accuracy, and reduced hallucination rates.

However, this shift means GPT-4.5 does not emphasize chain-of-thought reasoning or coding benchmarks as strongly as GPT-4o. Instead, GPT-4.5 excels in applications requiring nuanced conversation, empathy, and understanding subtle emotional cues, making it particularly suitable for creative writing, customer service, and mental health chatbots. Thus, when measuring differences in “IQ,” GPT-4o might outperform GPT-4.5 in traditional logical reasoning and coding tasks, whereas GPT-4.5 demonstrates superior “emotional IQ,” excelling in human-like interactions and empathetic communication.

I actually asked GPT-4.5 the following question: "What model would you use for a customer service AI chatbot based upon RAG?" - You can find its answer below ...

Based on the detailed comparison between GPT-4o and GPT-4.5, GPT-4.5 would be the more suitable choice for a customer service AI chatbot built upon Retrieval-Augmented Generation (RAG). GPT-4.5 has been specifically optimized for enhanced emotional intelligence (EQ), nuanced conversational style, and improved understanding of subtle emotional cues and implicit user intent. These attributes make GPT-4.5 particularly effective in customer service scenarios, where empathy, natural interactions, and robust context understanding are crucial. Additionally, GPT-4.5’s reduced hallucination rates and improved factual accuracy further support its suitability for customer-facing applications, ensuring responses are both accurate and contextually appropriate.

Personally I fundamentally disagree, since in customer service speed and cost are the things that matter most. With GPT-4o being 5 to 10 times faster, and costing roughly 3 to 7 percent of what its newer sibling costs, I personally believe GPT-4o would be the natural choice for 90% of our customers.

Conclusion

There exist actual problems that are complex enough to justify GPT-4.5, and we will of course advise our clients to use it if it makes sense. However, given the 15 to 30 times increase in cost, and its speed being 10 to 20 percent of its older sibling's, I suspect less than 10% of our customers will want to use 4.5.

Today, new models basically bring such small changes and improvements that for 90% of all use cases they're not even interesting for us - At least for the moment.

GPT-4o has been our workhorse ever since it came out, with a few solutions running o3-mini. We will probably advise maybe 5 to 10 percent of our customers to use 4.5, after having thoroughly confirmed it's necessary - But for most people it's simply not worth it.

GPT-4.5 is a nice model, and we're happy OpenAI is innovating and making better products. However, I see nothing in 4.5 that makes us want to change our go-to model from 4o to 4.5. And the reason is that better LLMs aren't even on our radar. Our current focus is simplifying the creation of AI agents and complex workflows, and wrapping existing APIs and databases into the LLM. This is because the LLM market has become so saturated with new models, and SOTA models have become so good at what they do, that it's really not a question of better models anymore - But rather what vendors such as us can do with the model. And so far we've not seen anything out there that's "better" than OpenAI's GPT-4o model ...