RAG and VSS for Numbers and Ranges

Vector similarity search (VSS) and retrieval-augmented generation (RAG) are difficult to use when you want to match numbers or find records within a range. One example comes from one of our clients, Imperial Properties.

For Imperial it was crucial to have an AI chatbot that correctly finds properties matching the client's budget. If a user asked the chatbot, for instance, "My budget is 2 million EUROs and I want a 3 bedroom apartment", the chatbot had to return properties in the 2 million EUROs range. Without this feature they ran the risk of suggesting a property for 1 million EUROs, significantly "underselling" to the client.

A good real estate agency should never show a property for half the price the client can afford if they have a property in the range of the client's budget.

With SQL, ranges are easy: for the example above it is a simple WHERE clause requiring a price of more than 1.8 million EUROs and less than 2.2 million EUROs. With VSS or RAG it becomes much harder. VSS retrieves records by comparing how similar their text is according to a language model, and language models are not good at math or numbers.
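To make the SQL side of that comparison concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `properties` table, its columns, the sample rows, and the `find_in_budget` helper are all made up for illustration; they are not Imperial's actual schema. The 10% tolerance simply reproduces the 1.8 to 2.2 million range from the example.

```python
import sqlite3


def find_in_budget(conn, budget, tolerance=0.10):
    """Return (title, price) rows priced within +/- tolerance of the budget."""
    low = budget * (1 - tolerance)
    high = budget * (1 + tolerance)
    cursor = conn.execute(
        "SELECT title, price FROM properties "
        "WHERE price BETWEEN ? AND ? ORDER BY price",
        (low, high),
    )
    return cursor.fetchall()


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE properties (title TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO properties VALUES (?, ?)",
    [
        ("Seaside flat", 1_000_000),
        ("City penthouse", 1_950_000),
        ("Hillside villa", 2_150_000),
        ("Castle", 4_000_000),
    ],
)

# Only the two properties between 1.8 and 2.2 million are returned.
matches = find_in_budget(conn, 2_000_000)
print(matches)
```

A range filter like this is exact by construction, which is precisely what a similarity search over text does not give you.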

The solution: converting numbers to words

The trick is to spell out the price as words, meaning we transform the 2,000,000 EURO price above to "two million EUROs" in the RAG database. It sounds almost absurd if you know anything about computing, and I had absolutely no faith in it. I reluctantly implemented it when Mark suggested it, because I thought I should at least give it a chance, and it worked!

For weird reasons I cannot explain, when you spell out the price in your RAG database using words instead of digits, VSS can suddenly match queries such as "I've got 780,000 EUROs and I need an apartment" to records in the range of 650,000 to 820,000 EUROs. With the price stored as digits, it would just return whatever it found based upon the rest of your query. With the price spelled out as words, it correctly matched the query to relevant properties.

I suspect part of the explanation is simply that more tokens are used for the price, since "two million five hundred thousand and fifty-five" is after all a much longer string of characters than "2,500,055". Like everything related to LLMs, it's almost impossible to know for sure. All I know with certainty is that it works!

This might be specific to OpenAI's "text-embedding-ada-002" embeddings model, but at least for this model the trick seems to work. During import of properties from Imperial's backend systems, we take the price and convert it into its word representation. If the price is for instance 695,550, we convert this number to "six hundred and ninety-five thousand five hundred and fifty".
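The conversion step could look roughly like the sketch below. This is not our production import code. The `number_to_words` and `property_text` functions are hypothetical names, and the wording rules (British-style "and", hyphenated tens) are an assumption chosen to reproduce the examples in this article. It handles whole, non-negative amounts below one trillion.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = [(1_000_000_000, "billion"), (1_000_000, "million"), (1_000, "thousand")]


def _below_thousand(n):
    """Spell out an integer in the range 1..999."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest:
        if hundreds:
            parts.append("and")
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(TENS[tens] + (f"-{ONES[ones]}" if ones else ""))
    return " ".join(parts)


def number_to_words(n):
    """Spell out a whole number from 0 up to (but not including) one trillion."""
    if not 0 <= n < 10**12:
        raise ValueError("only whole numbers from 0 to 999,999,999,999 are supported")
    if n == 0:
        return "zero"
    parts = []
    for value, name in SCALES:
        count, n = divmod(n, value)
        if count:
            parts.append(f"{_below_thousand(count)} {name}")
    if n:
        # Join a trailing group below 100 with "and", e.g. "... thousand and fifty-five".
        if parts and n < 100:
            parts.append("and")
        parts.append(_below_thousand(n))
    return " ".join(parts)


def property_text(title, price):
    """Hypothetical helper: build the text that gets embedded for one record."""
    return f"{title}. Price: {number_to_words(price)} EUROs"


print(number_to_words(695_550))
print(property_text("3 bedroom apartment", 2_000_000))
```

The idea is that the text produced by something like `property_text`, not the raw number, is what gets sent to the embeddings model during import.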

Without this feature, Imperial's AI chatbot would sometimes recommend properties for 400,000 when the user had explicitly told the chatbot they were looking for something for 2,000,000 EUROs, or vice versa. This significantly degraded the end result, making the chatbot for all practical purposes useless.

From a practical standpoint this is definitely a breakthrough for us. It means we can deliver AI chatbots that take budgets into consideration for real estate companies. It's also useful in a whole range of other areas related to budgets and numbers, for instance:

  • "I want to buy something for 2,000 dollars"
  • "I need something larger than 50 meters"
  • "I need a car with more than 400 in horsepower"
  • Etc, etc, etc

For us, this opens up a whole new range of use cases for our AI chatbot, allowing us to deliver chatbots that solve problems that were previously unsolvable. Starting from today, this is an "out of the box" feature we will provide automatically to all clients needing it, allowing us to deliver an AI chatbot that is at least 10x as good when dealing with numbers, budgets, and ranges.

Disclaimer

I need to emphasize that LLMs are still not very good with numbers. Even though we've found one tiny implementation detail that makes our RAG database perform much better on numbers, you should still not expect absolute perfection. LLMs such as ChatGPT and OpenAI's APIs are still much worse at dealing with numbers than a traditional calculator, but with this tiny detail we seem to have improved their accuracy by at least one order of magnitude. If you need an AI chatbot that correctly handles numbers and ranges, or "more correctly" to be specific, feel free to contact us below.

Thomas Hansen

I am the CEO and Founder of AINIRO.IO, Ltd. I am a software developer with more than 25 years of experience. I write about Machine Learning, AI, and how to help organizations adopt said technologies. You can follow me on LinkedIn if you want to read more of what I write.

Published 20. Mar 2024