GPT-4.1, the Best Coding Model?

A few weeks ago OpenAI released GPT-4.1. We've been too busy using it to write about it. However, now that our own LLM built on top of GPT-4.1 is almost finished, we have some time to share our findings. And its performance is simply stellar!

GPT-4.1 on Coding

According to OpenAI, GPT-4.1 scores 54.6% on the SWE-bench Verified benchmark, an absolute improvement of 21.4 percentage points over GPT-4o, and it blasts GPT-4.5 completely out of the water. In fact, it scores almost twice as high as GPT-4.5, which was OpenAI's best coding model until the 14th of April 2025, when OpenAI released GPT-4.1.

We're using it internally as the base model for our Hyperlambda Generator, and once we switched from GPT-4o, we saw a drastic improvement in performance. When we started creating our own LLM, we were using GPT-4o. Halfway through the project OpenAI released GPT-4.1, and when we switched, performance soared on the same training data.

This implies that, at least for coding tasks, GPT-4.1 is not only much better than GPT-4o out of the box, it also responds better to fine-tuning.

Instruction following

Our AI chatbots and AI agents require an LLM that's capable of following instructions precisely. According to OpenAI, GPT-4.1 scores 33% better than GPT-4o here. To understand why this matters, realise that a custom AI chatbot or agent is basically a set of instructions combined with information and functions. Every time the LLM fails to follow its instructions, that's effectively "a bug", and we have to spend huge amounts of energy figuring out how to correctly prompt engineer its system instruction and RAG data. This is painful for us and costs a lot of energy and resources. With GPT-4.1, this problem has been cut in half!
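The "instructions + information + functions" recipe above can be sketched as a Chat Completions request payload. This is a minimal illustration assuming OpenAI's tool-calling schema; the `scrape_website` function and the instruction text are hypothetical examples, not our actual product configuration.

```python
# Sketch of an AI agent as: system instructions + RAG context + callable functions.
# The function name and instructions below are hypothetical examples.

def build_agent_request(user_message: str, rag_context: str) -> dict:
    """Assemble a Chat Completions payload for an instruction-following agent."""
    return {
        "model": "gpt-4.1",
        "messages": [
            # System instructions the model is expected to follow exactly.
            {"role": "system",
             "content": "You are a sales assistant. Always answer using the "
                        "provided context. If the user gives a URL, scrape it "
                        "before pitching.\n\nContext:\n" + rag_context},
            {"role": "user", "content": user_message},
        ],
        # Functions the model may call instead of answering directly.
        "tools": [{
            "type": "function",
            "function": {
                "name": "scrape_website",  # hypothetical function
                "description": "Fetch and summarise a website by URL.",
                "parameters": {
                    "type": "object",
                    "properties": {"url": {"type": "string"}},
                    "required": ["url"],
                },
            },
        }],
    }

request = build_agent_request("Pitch me!", "AINIRO sells AI chatbots.")
```

Every instruction-following failure shows up as the model ignoring the system message or skipping the function call, which is exactly why the 33% improvement matters in practice.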

To give you an idea of the importance here, consider this question, which you can ask our AI chatbot yourself.

(Screenshot: instruction-following example from our AI chatbot)

If you give it your website URL, it will create a personalised sales pitch for your specific needs, using your vertical and your own website information as its foundation. Below is an example.

(Screenshot: the chatbot scraping a website to build a personalised sales pitch)

With GPT-4o, it would correctly execute the above task roughly 9 out of 10 times. We haven't measured the exact difference, but with GPT-4.1 we've seen significant improvements. This makes our lives easier, since we don't need to waste time modifying system instructions and RAG data to ensure the LLM does what it's supposed to do.
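Claims like "9 out of 10 times" are easy to firm up with a small eval harness: run the same prompt repeatedly and count how often the task succeeds. A minimal sketch, where the `run_task` callable is a stand-in for a real chatbot call plus whatever success check your task has:

```python
from typing import Callable

def success_rate(run_task: Callable[[], bool], trials: int = 20) -> float:
    """Run the same agent task repeatedly; return the fraction of successes."""
    successes = sum(1 for _ in range(trials) if run_task())
    return successes / trials

# Stand-in task: replace with a real chatbot invocation and output check.
outcomes = iter([True] * 18 + [False] * 2)
rate = success_rate(lambda: next(outcomes), trials=20)
# rate is 0.9, i.e. 18 of 20 simulated trials succeeded
```

Running such a harness against both models on the same prompt is the cheapest way to quantify an instruction-following improvement for your own use case.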

Context window

The context window of an LLM roughly translates into "how many facts it can keep in memory during a single task". GPT-4o has a context window of 128K tokens; GPT-4.1 blows this figure out of the water with a context window of 1 million tokens, roughly 8 times the size. This allows the model to solve much more complex tasks.

This is a double-edged sword, because you pay per input token, implying the more data you provide to OpenAI, the more you pay. For an embeddable AI chatbot such as ours, extracting 1 million tokens' worth of RAG data is complete overkill. But for complex tasks in our AI Expert System, for instance, the context window size is a complete game changer, since it allows us to deliver AI agents with roughly 8 times the capacity.

To give you an idea of what such figures imply, realise you can submit 8 PDFs of 300 pages each to OpenAI in one go, ask questions related to your data, and it will be able to read all of the documents and answer based upon the information it finds.
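You can sanity-check such claims with the rough rule of thumb that one token is about four characters of English prose (for exact counts you'd use a real tokenizer such as `tiktoken`'s `o200k_base` encoding). The page size below is an assumption of roughly 1,500 characters of text per page:

```python
CONTEXT_WINDOW = 1_000_000   # GPT-4.1 context window, in tokens
CHARS_PER_TOKEN = 4          # rough heuristic for English prose

def estimate_tokens(text: str) -> int:
    """Crude token estimate; use a real tokenizer for billing-grade numbers."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(documents: list[str]) -> bool:
    """Check whether all documents fit in a single GPT-4.1 request."""
    return sum(estimate_tokens(d) for d in documents) <= CONTEXT_WINDOW

# 8 PDFs x 300 pages x ~1,500 characters of text per page.
page = "x" * 1_500
pdfs = ["".join([page] * 300)] * 8
total = sum(estimate_tokens(d) for d in pdfs)
# total is 900,000 tokens, so the whole pile fits in one request
```

Under these assumptions the 8 books come to about 900K tokens, just under the 1M limit, which is why the same payload is unthinkable on GPT-4o's 128K window.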

Wrapping up

At this point you might believe that GPT-4.1 is twice as expensive as GPT-4o. The opposite is true: it's actually significantly less expensive, and also much faster. Notice, this is the per-token price; if you consistently send 1 million tokens of RAG data, it will obviously still become very expensive very fast.
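The price difference is easy to quantify. At the time of writing, OpenAI lists GPT-4.1 at $2.00 per million input tokens and $8.00 per million output tokens, versus $2.50 and $10.00 for GPT-4o; check OpenAI's current pricing page before relying on these numbers. A quick cost comparison:

```python
# Per-million-token prices in USD; verify against OpenAI's current pricing page.
PRICES = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4o":  {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request for the given model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A RAG-heavy request: 50K tokens of context in, 1K tokens out.
cost_41 = request_cost("gpt-4.1", 50_000, 1_000)   # $0.108
cost_4o = request_cost("gpt-4o", 50_000, 1_000)    # $0.135
```

On this example request GPT-4.1 is about 20% cheaper per call, but the same arithmetic also shows how a full 1M-token context would cost around $2 in input tokens alone.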

Every time OpenAI has released a new model over the last year, we've been enthusiastic about it at AINIRO and started benchmarking. So far, every single time, we've been disappointed. Either the price, the latency, or API incompatibilities have made us conclude that these new models were basically "useless", at least for us.

With GPT-4.1 we're finally convinced, to the point where we'll be using it as the primary and default model for everything. You can of course change model if you wish, but by default, any machine learning types you create with Magic will be configured to use GPT-4.1.

Bravo OpenAI, more of this, and less of whatever it is that you've been doing previously!

The remaining question then becomes: should you switch from GPT-4o to GPT-4.1? Our answer is "Yes, definitely!" GPT-4.1 is the best model and release we've seen from OpenAI so far, and it runs circles around anything they previously released. It's the only really good model we've seen OpenAI release since the 13th of May 2024! And that sums up our feelings about it.

Need a Custom AI Solution?

At AINIRO we specialise in delivering custom AI solutions and AI chatbots with AI agent features. If you want to talk to us about how we can help you implement your next custom AI solution, you can reach out to us below.

Thomas Hansen

I am the CEO and Founder of AINIRO.IO, Ltd, and a software developer with more than 25 years of experience. I write about Machine Learning, AI, and how to help organisations adopt these technologies. You can follow me on LinkedIn if you want to read more of what I write.

Published 13. May 2025
