RAG and VSS for Numbers and Ranges

VSS or RAG is difficult to use when you want to match numbers or records within a range. One example comes from one of our clients, Imperial Properties, who needed an AI chatbot for their real estate agency.

For Imperial it was crucial to have an AI chatbot that would correctly find properties matching the client's budget. If a user for instance told the AI chatbot "My budget is 2 million EUROs and I want a 3 bedroom apartment", the chatbot had to return properties in the 2 million EUROs range. Without this feature they ran the risk of suggesting a property for 1 million EUROs, significantly "underselling" to the client.

A good real estate agency should never show a property for half the price the client can afford, if they have a property in the range of the client's budget.

With SQL, ranges are easy. It's basically a simple where clause asking for a price of more than 1.8 million EUROs and less than 2.2 million EUROs for the above example. However, with VSS or RAG it becomes much harder. This is because VSS matches records in the database based upon language similarity using LLMs, and LLMs are not optimal when dealing with math or numbers.
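To make the SQL side of that comparison concrete, here is a minimal sketch of such a range query using Python's built-in sqlite3 module. The table name, column names and sample rows are made up for illustration, and are not Imperial's actual schema or data.

```python
import sqlite3

def find_in_range(conn, low, high):
    """Return (title, price) rows with low < price < high, cheapest first."""
    return conn.execute(
        "SELECT title, price FROM properties "
        "WHERE price > ? AND price < ? ORDER BY price",
        (low, high),
    ).fetchall()

# Hypothetical in-memory data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE properties (title TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO properties VALUES (?, ?)",
    [
        ("Small flat", 1_000_000),
        ("Seaside apartment", 1_950_000),
        ("City penthouse", 2_100_000),
        ("Hillside villa", 2_500_000),
    ],
)

# A 2 million EUROs budget, expressed as the 1.8 to 2.2 million range.
matches = find_in_range(conn, 1_800_000, 2_200_000)
for title, price in matches:
    print(title, price)
```

The range is a hard filter here, which is exactly what a similarity search does not give you for free.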

The solution, converting numbers to words

The trick is to spell out the price as words, that is, transforming our above 2,000,000 EURO price to "two million EUROs" in the RAG database. It sounds almost absurd if you know anything about computing, and I had absolutely no faith in it. But I reluctantly implemented it when Mark suggested it, because I thought I should at least give it a chance. And it worked!

For weird reasons I cannot explain, when you spell out the price in your RAG database using words instead of numbers, the VSS search process can suddenly correctly match queries such as "I've got 780,000 EUROs and I need an apartment" towards records in the range of 650,000 to 820,000 EUROs. Without spelling out the price as words, it would just return whatever it found based upon the rest of your query. When spelling out the price as words, it would correctly match your query towards relevant properties.

I suspect part of the explanation is simply that more tokens are being used for the price, since "two million five hundred thousand and fifty-five" is after all a much longer string of characters than simply "2,500,055". Like everything related to LLMs it's almost impossible to know for sure. All I know with certainty is that it works!

This might be unique to OpenAI's "text-embedding-ada-002" embeddings model, but at least for this model, this trick seems to actually work. During import of properties from Imperial's backend systems, we basically take the price and convert it into its word representation. If the price is for instance 695,550, we convert this number to "six hundred and ninety five thousand five hundred and fifty".
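The conversion step itself can be sketched in a few lines of Python. This is a minimal illustration and not the code we run in production. The function names are hypothetical, and it assumes whole, non-negative prices below one trillion, written without hyphens as in the example above.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion"]

def _below_thousand(n):
    """Spell out 1..999, for example 695 -> 'six hundred and ninety five'."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        if hundreds:
            parts.append("and")
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(TENS[tens])
            if ones:
                parts.append(ONES[ones])
    return " ".join(parts)

def number_to_words(n):
    """Spell out a whole number in the range 0 <= n < 10**12."""
    if not 0 <= n < 10**12:
        raise ValueError("only whole numbers from 0 up to 10**12 - 1 are supported")
    if n == 0:
        return "zero"
    groups = []
    scale = 0
    while n:
        # Peel off three digits at a time, starting from the right.
        n, chunk = divmod(n, 1000)
        if chunk:
            words = _below_thousand(chunk)
            if scale == 0 and chunk < 100 and n > 0:
                # 'two million ... and fifty five' rather than '... fifty five'
                words = "and " + words
            if SCALES[scale]:
                words += " " + SCALES[scale]
            groups.append(words)
        scale += 1
    return " ".join(reversed(groups))

print(number_to_words(695550))
```

The idea is that the spelled-out text, for instance number_to_words(2000000) followed by " EUROs", is what gets stored alongside the rest of the property description before embeddings are created for the record.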

Without this feature Imperial's AI chatbot would sometimes recommend properties for 400,000 EUROs when the user explicitly told the chatbot they were looking for something for 2,000,000 EUROs, or vice versa. This significantly degraded the end result, making the chatbot for all practical purposes useless.

From a practical standpoint this is definitely a breakthrough for us. It means we can deliver AI chatbots that take budgets into consideration for real estate companies. It's also useful for a whole range of other problems related to budgets and numbers in general, such as for instance:

"I want to buy something for 2,000 dollars"
"I need something larger than 50 meters"
"I need a car with more than 400 in horsepower"
Etc, etc, etc

For us, this opens up a whole new range of use cases for our AI chatbot, allowing us to deliver chatbots solving problems that were previously unsolvable. Starting from today, this is an "out of the box" feature we will provide automatically to all clients needing it, allowing us to deliver an AI chatbot that is, by our estimate, at least 10x as good when dealing with numbers, budgets, and ranges.

Disclaimer

I need to emphasize that LLMs are still not very good with numbers, and even though we've found one tiny implementation detail that makes our RAG database perform much better on numbers, you should still not expect absolute perfection. LLMs such as ChatGPT and OpenAI's APIs are still much worse when dealing with numbers than a traditional calculator, but with this tiny detail, we've improved their accuracy by at least one order of magnitude it seems.

Thomas Hansen